The NCF/CRG T3E project 1997/1998

Financial support for this research was provided by Cray Research, Inc., under Grant CRG 97.02. Computer time was provided by the Stichting Nationale Computerfaciliteiten (National Computing Facilities foundation, NCF) for the use of supercomputer facilities, with financial support from the Nederlandse Organisatie voor Wetenschappelijk Onderzoek (Netherlands Organization for Scientific Research, NWO).

Project Description:

This project is part of research on the question of how to implement an off-line air quality model (AQM) such that it is efficient on the most advanced computer systems, like vector/parallel supercomputers and massively parallel distributed-memory systems, as well as on (a cluster of) workstations.
To have a handy tool for measuring the performance of various computer platforms, we developed a benchmark code for an off-line AQM (see the description in html or in postscript), which uses as its basic parallelization strategy the SPMD (Single Program Multiple Data) approach through domain decomposition (more).
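
As an illustration of the SPMD strategy (a sketch only, not the benchmark source: the 2-D block decomposition and all names are assumptions), every PE runs the same program and derives from its own processor number which part of the horizontal grid it owns, for instance as follows:

c     minimal sketch: which block of the horizontal grid belongs to
c     PE number mype in a logical npex x npey processor grid
      subroutine myblock(mype,npex,npey,nx,ny,ixlo,ixhi,iylo,iyhi)
      integer mype, npex, npey, nx, ny
      integer ixlo, ixhi, iylo, iyhi
      integer ipx, ipy
c     position of this PE in the logical processor grid
      ipx = mod(mype,npex)
      ipy = mype/npex
c     local index ranges (assuming nx and ny divide evenly)
      ixlo = ipx*(nx/npex) + 1
      ixhi = (ipx+1)*(nx/npex)
      iylo = ipy*(ny/npey) + 1
      iyhi = (ipy+1)*(ny/npey)
      end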

Since a real AQM involves a considerable amount of I/O, we added to our benchmark a `realistic' amount of reading data from and writing output to disk. With this code we studied the I/O performance of the Cray T3E for our AQM. We implemented the I/O both in the master/slave approach, i.e., one PE performs all necessary I/O and takes care of the distribution and gathering of the data, and using parallel I/O (more).

People involved:

Papers:

[29] from the full list of publications in the MAS1.1: Numerical Algorithms for Air Quality Modeling project overview.

AQM benchmark performance

We compared the performance of our benchmark on three different architectural paradigms, viz., a vector/parallel shared-memory architecture (a Cray C90 with 12 processors), a massively parallel distributed-memory system (a Cray T3E), and a cluster of SGI O2 workstations coupled in a star-shaped ATM network. To put the differences in performance into perspective, we give below some technical information on these platforms.

The peak performance per node of these computers is 1 Gflop/s for one CPU of the C90, 0.6 Gflop/s for one PE of the T3E, and 0.1 Gflop/s for one O2 (all 64-bit flops).

We ran our benchmark on a 32x32 horizontal grid with 32 layers in the vertical (so with a concentration array of dimension [32,32,32,66]). For the explicit message passing we make use of PVM routines.
As normalizing computational unit we take the performance on one CPU of the C90 (compiler option -Zv). There our benchmark runs at a speed of 500 Mflop/s, which is half the peak performance.
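
(As a rough indication of the problem size: a single [32,32,32,66] concentration array in 64-bit precision already amounts to 32x32x32x66x8 bytes, i.e., about 17 Mbyte; the meteo fields and work arrays of the full model come on top of that.)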

The table first shows the performance on multiple processors of the C90, once for the complete model using autotasking to divide the workload over the processors, i.e., parallelizing at loop level, and once using the PVM program. On the Cray T3E the scalability is even superlinear in the number of processors (due to cache effects), but here the performance on 1 PE is only 4% of the peak performance. One can calculate from the figures in the table that 16 PEs of the T3E are needed to outperform one processor of the C90. Experiments we performed on a 64x64 horizontal grid show that the scalability with the model size is also perfect, i.e., a run on N PEs for the 64x64 grid is as expensive as a run on N/4 PEs for the 32x32 grid.
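
(To see where this break-even point comes from: 4% of the 0.6 Gflop/s peak is roughly 24 Mflop/s per PE, so without the superlinear speed-up one would expect to need about 500/24, i.e., some 20 PEs, to match the 500 Mflop/s of one C90 CPU; the cache effects bring this down to 16.)
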
Finally, the table gives the results for the cluster of workstations, the SGI O2s coupled with an ATM network. Here one can clearly see the effect of the use of `virtual' memory: the memory of these workstations is not large enough to contain the complete model, so the machine spends more time swapping than computing, resulting in a wall clock time that is four times as large as the CPU time (with a clear day/night rhythm: during daytime it is about six times as large). The speed-up in wall clock time when using two O2s instead of one is therefore huge: a factor of 5.6. The performance is then approximately the same as for 1 PE of the T3E. However, the scalability of a cluster of workstations is much poorer. The CPU time sometimes also decreases superlinearly due to cache effects, but this does not show up in the wall clock time.


I/O experiments on the Cray T3E

Since LOTOS is an off-line model, we need (analyzed) meteo data as input. At given points in time we have to read at least four different three-dimensional fields: one for the temperature and three for the components of the wind velocity. We also have to read several two-dimensional fields, such as the surface pressure. In our I/O test we use a 64x64 computational grid. Each time step we read four three-dimensional fields and five two-dimensional fields, and as output we write all 66 computed concentrations (three-dimensional data). All I/O operations are REAL*4 and unformatted; the computations are done in REAL*8.
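
To make the precision handling concrete, the sketch below shows what such an output step could look like in Fortran (illustration only; the unit number, the array names, and the choice of one record per species are assumptions, not the actual LOTOS code):

      subroutine wrconc(iunit, nx, ny, nz, nspec, conc)
c     write all nspec concentration fields as unformatted REAL*4
c     records, although the computations are done in REAL*8
      implicit none
      integer iunit, nx, ny, nz, nspec
      real*8 conc(nx,ny,nz,nspec)
      real*4 buf(nx,ny,nz)
      integer is
      do is = 1, nspec
c        implicit conversion from REAL*8 to REAL*4
         buf = conc(:,:,:,is)
         write(iunit) buf
      end do
      end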

There are many possible strategies to perform I/O on a parallel computer. For the input of the meteo data we chose the simplest approach: do the input from a single PE. Clearly, this is the most portable approach: it is possible on every parallel architecture and it is not necessary to split the data into multiple files before reading. The disadvantage is that it does not scale. Apart from reading the data, we have to reorder it according to the domain decomposition and scatter it across the PEs (a sketch is given below).
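
The following sketch illustrates this read-reorder-scatter step (illustration only, not the project code: the 1-D row decomposition, the message tag, the unit number, and the tids array of PVM task identifiers are assumptions):

      subroutine scatmet(mype, npes, tids, nx, ny, nz, floc)
c     PE 0 reads one global meteo field (unformatted REAL*4) from
c     unit 11 (assumed to be opened elsewhere), cuts out each PE's
c     block of rows and sends it with PVM; the other PEs receive.
      implicit none
      include 'fpvm3.h'
      integer mype, npes, nx, ny, nz
      integer tids(0:npes-1)
      real*4 floc(nx,ny/npes,nz)
      real*4, allocatable :: glob(:,:,:)
      integer ipe, j0, nyl, nloc, info, bufid

      nyl  = ny/npes
      nloc = nx*nyl*nz
      if (mype .eq. 0) then
c       only the master needs the global field in memory
        allocate(glob(nx,ny,nz))
        read(11) glob
        floc = glob(:,1:nyl,:)
        do ipe = 1, npes-1
c         reorder: copy the rows of PE ipe and ship them off
          j0 = ipe*nyl
          call pvmfinitsend(PVMDEFAULT, info)
          call pvmfpack(REAL4, glob(:,j0+1:j0+nyl,:), nloc, 1, info)
          call pvmfsend(tids(ipe), 10, info)
        end do
        deallocate(glob)
      else
        call pvmfrecv(tids(0), 10, bufid)
        call pvmfunpack(REAL4, floc, nloc, 1, info)
      end if
      end

Only PE 0 touches the disk here, which is what makes the approach portable but also non-scalable.
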
For the output we tried three different approaches:

  1. The same portable approach as for the input: gather all concentrations on one PE, reorder them into one three-dimensional field, and do the output from a single PE.
  2. The parallel output approach: every PE writes its own subdomain concentrations to a private file using standard Fortran output.
  3. The parallel asynchronous approach: every PE writes its own subdomain concentrations to a private file using asynchronous BUFFER OUT requests, where a PE starts an output request and can immediately continue with its computations (a sketch is given after this list).
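
As an illustration of the third approach (a sketch only: BUFFER OUT is the Cray Fortran extension for asynchronous I/O, but the routine name, the unit handling, and the place of the wait are assumptions; iunit is assumed to be connected to this PE's private output file):

      subroutine outasyn(iunit, n, conc, buf)
c     start an asynchronous write of this PE's subdomain
c     concentrations to its private file; control returns at once,
c     so the PE can continue computing while the data is written
      implicit none
      integer iunit, n
      real*8 conc(n)
      real*4 buf(n)
c     REAL*8 -> REAL*4 copy into a buffer that must stay alive
c     until the transfer has completed
      buf = conc
c     Cray Fortran asynchronous output (non-standard extension)
      buffer out (iunit, 0) (buf(1), buf(n))
c     just before buf is reused (at the next output step) the caller
c     waits for completion with the Cray intrinsic UNIT, e.g.
c        stat = unit(iunit)
      end

Postponing the wait until the buffer is needed again is what should allow the transfer to overlap with the computations and communications; waiting directly after the BUFFER OUT would make the request effectively synchronous.
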
In the table below we show the CPU time of our model without doing any I/O. For the I/O we show the minimum, average, and maximum wall clock times. Column I shows the input-only wall clock time, and columns I/O, I/PO, and I/PAO show the combined input/output wall clock times for the standard output, parallel output, and parallel asynchronous output approaches, respectively. The wall clock times for the I/O are determined by subtracting the CPU time from the measured wall clock times including the I/O. Because of the memory requirements it is not possible to run our simulation on fewer than 4 PEs.

The difference between minimum and maximum times arises because we are dealing with shared physical devices. Disks always have seek times, but some I/O requests also have to wait until requests from other jobs on other PEs have finished. Isolated timing of a single I/O request that normally takes only 0.3 s showed times of up to 5 s.

For the input we see anomalous behavior on 4 processors, where reading the data takes more time than the combined input and parallel asynchronous output. We have no explanation for this yet, but the time needed for the input is clearly not significant.

For the portable output we see a combined input/output time of roughly 70 s. Since this combined I/O time is almost three times as expensive as the CPU time on 64 PEs, we tried the following two other approaches.

For the parallel output every processor writes to its own private file, so we might hope for a parallel speed-up limited by the minimum of the number of PEs and the number of disks, in our case eight. However, we only see a performance gain of 30-50%, probably due to the synchronous I/O and the resulting I/O contention. On 64 PEs the combined I/O time is still four times as expensive as the CPU time.

It can be seen that the output and the computations plus the necessary communications do not overlap completely. The reason for this is unclear: judging from the amount of I/O and the sustained average I/O rates, a complete overlap should have been possible. In a separate experiment, where we only wrote the output files without doing any computation or communication at all (just executing a `sleep' statement), we were able to overlap the output and the `computations' completely.

Again we might have hoped for a parallel speed-up limited by the minimum of the number of PEs and the number of disks, but we do not see a performance gain when going from 4 to 8 PEs. We also see that the combined I/O time increases slightly as the number of PEs increases; this must be due to I/O contention. On 64 PEs the combined I/O time is about as large as the CPU time. This seems acceptable, especially since 64 PEs give rise to small subdomains and hence a maximal amount of I/O relative to the computation. Moreover, it is the most efficient way to handle the output.

The final conclusion is probably the most disappointing one: although the Cray T3E has a scalable I/O architecture, the I/O does not scale. Of course this is due to the limited number of disks, but on the other hand we should at least have seen a speed-up when going from 4 to 8 PEs when doing output.


Go to the CWI MAS1.1: Numerical Algorithms for Air Quality Modeling project overview or to the CWI home page.
Joke Blom (gollum@cwi.nl)
Last update 98/02/18.