Besides calculating the RMS velocity from a seismogram, ``Vspray'' performs
a smearing along the velocity axis. Its four combined do-loops
should make parallelization very effective, but the access to
different array elements through four different indices in the innermost loop
can be expected to cause extensive interprocessor communication.
In my opinion, these features
make ``Vspray'' an interesting choice for a test. ``Vspray'' is relatively
costly on serial computers because its computing time is roughly proportional
to $N_t^2 N_x / 2$.
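To make that scaling concrete, the following schematic loop nest is not the
actual ``Vspray'' code; the name cost_sketch, the array names, and the trivial
inner-loop body are placeholders. It performs about $N_t \cdot N_x \cdot N_t/2$
innermost iterations, because the inner time loop starts at the current
output-time sample and therefore runs over $N_t/2$ samples on average.

c     Schematic cost model only; the real Vspray uses four combined
c     do-loops and a different loop body.  Names are placeholders.
      subroutine cost_sketch(nt, nx, tx, out)
      integer nt, nx, it0, ix, it
      real tx(nt, nx), out(nt, nx)

      out = 0.0
      do it0 = 1, nt              ! output time sample
         do ix = 1, nx            ! offset
            do it = it0, nt       ! about nt/2 input samples on average
               out(it0, ix) = out(it0, ix) + tx(it, ix)
            end do
         end do
      end do
c     Innermost iterations: roughly nt * nx * nt/2 = nt**2 * nx / 2.
      end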
My first trial, ``Vspray1'', leaves the algorithm unchanged; it is simply
translated to CM Fortran, with arrays used for the variables needed in
the innermost loops. Compiler directives allocate all arrays on the
CM, and all of them are completely parallel, using the so-called NEWS geometry,
which places consecutive array elements in nearest-neighbor processors.
Only the offset dimension is serial.
This exception means that each time slice
is present only in one virtual processor. Making the offsets parallel
increases the amount of interprocessor communication and slows down the
program. Some table variables are aligned with the input array tx()
to save communication time. Aligned arrays have the same ``home'';
i.e., they are assigned to the same virtual processor.
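The declarations below sketch what this looks like; the array names and sizes
are my assumptions (sizes taken from the first row of Table 1), and the exact
spelling of the CMF$ LAYOUT and CMF$ ALIGN directives should be checked against
the CM Fortran manual rather than taken from this sketch.

c     Sketch of Vspray1-style declarations (names and sizes assumed).
      integer nt, nx, ns
      parameter (nt=200, nx=20, ns=20)
      real tx(nt, nx)                  ! input seismogram: time x offset
      real vout(nt, ns)                ! output panel: time x slowness
      real ttab(nt, nx)                ! table aligned with the input
CMF$  LAYOUT tx(:NEWS, :SERIAL)        ! time parallel (NEWS), offset serial
CMF$  LAYOUT vout(:NEWS, :NEWS)        ! both output axes parallel
CMF$  ALIGN ttab(i, j) WITH tx(i, j)   ! same ``home'' as tx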
My second version, ``Vspray2'', reduces the communication
considerably but introduces some serial execution. The input and output arrays
are aligned with each other, and some table arrays are aligned with them
in the same way as in the previous example. The offset and slowness
dimensions are serial; i.e., they are stored within a single virtual processor.
Every time slice is copied into every processor's memory so that, for a given
output time, each slowness value can be computed independently within
each processor.
The number of virtual processors is therefore equal to the number of time
samples of the input data.
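A corresponding sketch for this layout follows; as before, the names, shapes,
and directive spellings are assumptions, not the actual ``Vspray2'' source.

c     Sketch of a Vspray2-style layout (names and shapes assumed).
      integer nt, nx, ns
      parameter (nt=200, nx=20, ns=20)
      real tx(nt, nx)                  ! input: time x offset
      real vout(nt, ns)                ! output: time x slowness
CMF$  LAYOUT tx(:NEWS, :SERIAL)        ! only the time axis is parallel
CMF$  LAYOUT vout(:NEWS, :SERIAL)      ! slowness stays serial as well
CMF$  ALIGN vout(i, j) WITH tx(i, j)   ! input and output share a home
c     One virtual processor per time sample: the offset and slowness
c     loops run serially inside each processor, and the time slices a
c     processor needs are copied into its local memory beforehand.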
My final version, ``Vspray3'', is probably not optimal but is faster
than the second program. Even the offset dimension is now parallel.
The amount of interprocessor communication is reduced not by making
the program more serial but by the following trick: after the slowness has
been computed and the smearing along the slowness axis has been done for one
output-time slice, which happens in parallel over the time and offset domains,
the whole seismogram is shifted along its time axis. The result is a
``systolic'' data flow at the end of each pass of the loop over output time.
This avoids transmitting the next time slices needed to compute the next
output-time slice and explains the improvement in performance compared to the
previous program.
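Expressed with the circular-shift intrinsic, the trick looks roughly like the
skeleton below. The names it0, nt, and tx are assumptions, the loop body is
only indicated by comments, and the argument order of CSHIFT differs between
CM Fortran and the later Fortran 90 standard, so it should be checked against
the compiler manual.

c     Sketch of the systolic data flow in Vspray3 (loop skeleton only).
      do it0 = 1, nt                      ! loop over output time
c        ...compute the slowness and smear along the slowness axis for
c        this output-time slice, in parallel over time and offset...
c        Then shift the whole seismogram one sample along the time axis,
c        so each virtual processor already holds the sample it needs for
c        the next output-time slice; no general routing is required.
         tx = cshift(tx, shift=1, dim=1)  ! Fortran 90 argument order
      end do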
The performance of the three program versions differs considerably. The
computing time is not a linear function of the size of the I/O arrays,
as it approximately is on serial computers when swapping can be avoided.
This is not surprising because of the nonlinear influence of the
communication pattern. To provide a more objective comparison, a vectorized
version, ``Vspray4'', was run on our Convex C1-XP. Computations made on that
machine and on 8k processors of a 32k Connection Machine in Cambridge,
Massachusetts, are compared in Table 1.
The absolute CPU times shouldn't be taken too seriously because
the actual run time depends partly on how many people are working on the
front end.
Table 1. Performance comparison: run times, run times relative to the first
row of the same program, and run times relative to the Convex version
(Vspray4).

Parameters            | Vspray1         | Vspray2  | Vspray3  | Vspray4   |
nt=200, nx=20, ns=20  | 361 sec         | 86 sec   | 48 sec   | 9 sec     | run time
                      | 1               | 1        | 1        | 1         | x first-row run time
                      | 40              | 10       | 5        | 1         | x Convex run time
nt=1000, nx=64, ns=64 | 11541 sec       | 2199 sec | 1479 sec | 564 sec   | run time
                      | 32              | 26       | 31       | 63        | x first-row run time
                      | 20              | 4        | 2.6      | 1         | x Convex run time
nt=2000, nx=64, ns=64 | several hours   | 4118 sec | 2930 sec | 2237 sec  | run time
                      | unknown         | 48       | 61       | 250       | x first-row run time
                      | unknown         | 2        | 1.3      | 1         | x Convex run time
nt=8000, nx=32, ns=32 | several days(?) | 8815 sec | 7633 sec | 18943 sec | run time
                      | unknown         | 102      | 159      | 2104      | x first-row run time
                      | unknown         | 0.5      | 0.4      | 1         | x Convex run time
It should be mentioned that when many shotgathers are processed
(not only one, as I did here) it is better to compute the slowness
as a function of output time, time, and offset in advance, store it in a
three-dimensional array, and reuse it for shotgather after shotgather.
In that case the performance advantage of the CM increases rapidly.
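A sketch of that idea, with assumed names, sizes, and an assumed
(output time, time, offset) ordering of the precomputed table:

c     Sketch: precompute the slowness table once and reuse it for every
c     shotgather (names, sizes, and the table ordering are assumed).
      integer nt, nx, nshots, ishot
      parameter (nt=200, nx=20, nshots=10)
      real ptab(nt, nt, nx)            ! slowness(output time, time, offset)
      real tx(nt, nx)                  ! current shotgather
CMF$  LAYOUT ptab(:NEWS, :SERIAL, :SERIAL)
c     ...fill ptab once, in parallel over its first dimension...
      do ishot = 1, nshots
c        read the next shotgather into tx, then run the Vspray kernel
c        with lookups into ptab instead of recomputing the moveout
      end do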
The huge discrepancies in computing time suggest that the programmer should
think very carefully about the optimal solution before he or she implements
an algorithm. This is much more important in parallel than in serial
computing.