Besides calculating the RMS velocity from a seismogram, ``Vspray'' performs
a smearing along the velocity axis. Its four combined do-loops
should make parallelization very effective, but the access to
different array elements through four different indices in the innermost loop
can be expected to cause extensive interprocessor communication.
In my opinion, these features
make ``Vspray'' an interesting choice for a test. ``Vspray'' is relatively
costly on serial computers because its computing time is roughly proportional
to $N_t^2 N_x / 2$.
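To make that scaling concrete, the following schematic loop nest is not the
actual ``Vspray'' code; the name cost_sketch, the array names, and the trivial
inner-loop body are placeholders. It performs about $N_t \cdot N_x \cdot N_t/2$
innermost iterations, because the inner time loop starts at the current
output-time sample and therefore runs over $N_t/2$ samples on average.

c     Schematic cost model only; the real Vspray uses four combined
c     do-loops and a different loop body.  Names are placeholders.
      subroutine cost_sketch(nt, nx, tx, out)
      integer nt, nx, it0, ix, it
      real tx(nt, nx), out(nt, nx)

      out = 0.0
      do it0 = 1, nt              ! output time sample
         do ix = 1, nx            ! offset
            do it = it0, nt       ! about nt/2 input samples on average
               out(it0, ix) = out(it0, ix) + tx(it, ix)
            end do
         end do
      end do
c     Innermost iterations: roughly nt * nx * nt/2 = nt**2 * nx / 2.
      end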
My first trial, ``Vspray1'', leaves the algorithm unchanged; it is simply
translated to CM Fortran, with arrays used for the variables needed in
the innermost loops. Compiler directives allocate all arrays on the
CM, and all of them are completely parallel, using the so-called NEWS geometry,
which places consecutive array elements in nearest-neighbor processors.
Only the offset dimension is serial.
This exception means that each time slice
is present only in one virtual processor. Making the offsets parallel
increases the amount of interprocessor communication and slows down the
program. Some table variables are aligned with the input array tx()
to save communication time. Aligned arrays have the same ``home'';
i.e., they are assigned to the same virtual processor.
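The declarations below sketch what this looks like; the array names and sizes
are my assumptions (sizes taken from the first row of Table 1), and the exact
spelling of the CMF$ LAYOUT and CMF$ ALIGN directives should be checked against
the CM Fortran manual rather than taken from this sketch.

c     Sketch of Vspray1-style declarations (names and sizes assumed).
      integer nt, nx, ns
      parameter (nt=200, nx=20, ns=20)
      real tx(nt, nx)                  ! input seismogram: time x offset
      real vout(nt, ns)                ! output panel: time x slowness
      real ttab(nt, nx)                ! table aligned with the input
CMF$  LAYOUT tx(:NEWS, :SERIAL)        ! time parallel (NEWS), offset serial
CMF$  LAYOUT vout(:NEWS, :NEWS)        ! both output axes parallel
CMF$  ALIGN ttab(i, j) WITH tx(i, j)   ! same ``home'' as tx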
My second version, ``Vspray2'', reduces the communication
considerably but introduces some serial execution. The input and output arrays
are aligned with each other, and some table arrays are aligned with them
in the same way as in the previous example. The offset and slowness
dimensions are serial; i.e., they are stored within a single virtual processor.
Every time slice is copied into every processor's memory so that, for a given
output time, each slowness value can be computed independently within
each processor.
The number of virtual processors is therefore equal to the number of time
samples of the input data.
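A corresponding sketch for this layout follows; as before, the names, shapes,
and directive spellings are assumptions, not the actual ``Vspray2'' source.

c     Sketch of a Vspray2-style layout (names and shapes assumed).
      integer nt, nx, ns
      parameter (nt=200, nx=20, ns=20)
      real tx(nt, nx)                  ! input: time x offset
      real vout(nt, ns)                ! output: time x slowness
CMF$  LAYOUT tx(:NEWS, :SERIAL)        ! only the time axis is parallel
CMF$  LAYOUT vout(:NEWS, :SERIAL)      ! slowness stays serial as well
CMF$  ALIGN vout(i, j) WITH tx(i, j)   ! input and output share a home
c     One virtual processor per time sample: the offset and slowness
c     loops run serially inside each processor, and the time slices a
c     processor needs are copied into its local memory beforehand.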
My final version, ``Vspray3'', is probably not optimal but is faster
than the second program. Even the offset dimension is now parallel.
The amount of interprocessor communication is reduced not by making
the program more serial but by the following trick: after the slowness has
been computed and the smearing along the slowness axis has been done for one
output-time slice, which happens in parallel over the time and offset domains,
the whole seismogram is shifted along its time axis. The result is a
``systolic'' data flow at the end of each pass of the loop over output time.
This avoids transmitting the next time slices needed to compute the next
output-time slice and explains the improvement in performance compared to the
previous program.
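Expressed with the circular-shift intrinsic, the trick looks roughly like the
skeleton below. The names it0, nt, and tx are assumptions, the loop body is
only indicated by comments, and the argument order of CSHIFT differs between
CM Fortran and the later Fortran 90 standard, so it should be checked against
the compiler manual.

c     Sketch of the systolic data flow in Vspray3 (loop skeleton only).
      do it0 = 1, nt                      ! loop over output time
c        ...compute the slowness and smear along the slowness axis for
c        this output-time slice, in parallel over time and offset...
c        Then shift the whole seismogram one sample along the time axis,
c        so each virtual processor already holds the sample it needs for
c        the next output-time slice; no general routing is required.
         tx = cshift(tx, shift=1, dim=1)  ! Fortran 90 argument order
      end do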
The performance of the three program versions differs considerably. The
computing time is not a linear function of the size of the I/O arrays,
as it approximately is on serial computers when swapping can be avoided.
This is not surprising because of the nonlinear influence of the
communication pattern. To provide a more objective comparison, a vectorized
version, ``Vspray4'', was run on our Convex C1-XP. Computations made on that
machine and on 8k processors of a 32k Connection Machine in Cambridge,
Massachusetts, are compared in Table 1.
The absolute CPU times shouldn't be taken too seriously because
the actual run time depends partly on how many people are working on the
front end.
Table 1. Performance comparison: run times, run times relative to the first
row of the same program, and run times relative to the Convex version
(Vspray4).

Parameters            | Vspray1         | Vspray2  | Vspray3  | Vspray4   |
nt=200, nx=20, ns=20  | 361 sec         | 86 sec   | 48 sec   | 9 sec     | run time
                      | 1               | 1        | 1        | 1         | x first-row run time
                      | 40              | 10       | 5        | 1         | x Convex run time
nt=1000, nx=64, ns=64 | 11541 sec       | 2199 sec | 1479 sec | 564 sec   | run time
                      | 32              | 26       | 31       | 63        | x first-row run time
                      | 20              | 4        | 2.6      | 1         | x Convex run time
nt=2000, nx=64, ns=64 | several hours   | 4118 sec | 2930 sec | 2237 sec  | run time
                      | unknown         | 48       | 61       | 250       | x first-row run time
                      | unknown         | 2        | 1.3      | 1         | x Convex run time
nt=8000, nx=32, ns=32 | several days(?) | 8815 sec | 7633 sec | 18943 sec | run time
                      | unknown         | 102      | 159      | 2104      | x first-row run time
                      | unknown         | 0.5      | 0.4      | 1         | x Convex run time
It should be mentioned that when many shotgathers are processed
(not only one, as I did here) it is better to compute the slowness
as a function of output time, time, and offset in advance, store it in a
three-dimensional array, and reuse it for shotgather after shotgather.
In that case the performance advantage of the CM increases rapidly.
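A sketch of that idea, with assumed names, sizes, and an assumed
(output time, time, offset) ordering of the precomputed table:

c     Sketch: precompute the slowness table once and reuse it for every
c     shotgather (names, sizes, and the table ordering are assumed).
      integer nt, nx, nshots, ishot
      parameter (nt=200, nx=20, nshots=10)
      real ptab(nt, nt, nx)            ! slowness(output time, time, offset)
      real tx(nt, nx)                  ! current shotgather
CMF$  LAYOUT ptab(:NEWS, :SERIAL, :SERIAL)
c     ...fill ptab once, in parallel over its first dimension...
      do ishot = 1, nshots
c        read the next shotgather into tx, then run the Vspray kernel
c        with lookups into ptab instead of recomputing the moveout
      end do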
The huge discrepancies in computing time suggest that the programmer should
think very carefully about the optimal solution before he or she implements
an algorithm. This is much more important in parallel than in serial
computing.