FLASH I/O Benchmark README

3-20-01

This program simulates the I/O employed by FLASH for the purposes of benchmarking the code. Two I/O methods are present in this distribution: parallel HDF 5 and serial f77 binary. Both methods produce a single file.

In the HDF 5 version, the underlying MPI-IO routines are used to create a single file, and all processors write directly to that file, although the MPI-IO layer may perform the writes collectively. In the f77 version, processor 0 creates the file, and the data is sent to this processor before writing.

** This version of the benchmark no longer relies on the HDF 5 hyperslab selection routines to pick the interior of the blocks from the memory space, as these routines seem very inefficient. Instead, the block interiors are extracted in the FORTRAN checkpoint/plotfile routines before being passed to the HDF 5 calls.

Makefiles are provided that will build the code on ASCI Blue Pacific at LLNL (Makefile.blue), ASCI Red at SNL (Makefile.red), and a local SGI, sphere (Makefile.sphere), using HDF 5 v. 1.4 (most likely some beta version of it). To build the code, type

    gmake -f Makefile.blue flash_benchmark_io

replacing Makefile.blue with the Makefile for your platform.

FLASH is a block-structured adaptive mesh hydrodynamics code. The computational domain is divided into blocks, which are distributed across the processors. Typically a block contains 8 zones in each coordinate direction (x,y,z) and a perimeter of guardcells (presently 4 zones deep) to hold information from the neighbors. We typically carry 24 variables per zone, and fit about 100 blocks on each processor of Blue Pacific. These variables are stored in the array unk, whose layout is

    unk(nvar,2*nguard+nxb,2*nguard+nyb,2*nguard+nzb,blocks)

where

    nvar                is the number of variables
    nguard              is the number of guardcells
    nxb, nyb, and nzb   are the number of zones per block in x, y, and z
    blocks              is the maximum number of blocks to store

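
As a rough illustration of the layout above, the following C sketch maps a (variable, zone, block) index onto the linear memory of unk, assuming FORTRAN column-major ordering and 0-based indices on the C side. The struct UnkDims and all function names are hypothetical and are not part of the benchmark source. The sketch also reports what fraction of each stored block is interior: with nguard=4 and 8 zones per direction, a block holds 16*16*16 = 4096 zones, of which 512 (12.5%) are interior.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for the dimensions named in the README. */
typedef struct {
    int nvar, nguard, nxb, nyb, nzb, blocks;
} UnkDims;

/* Padded extent in one direction: interior zones plus guardcells on both sides. */
static size_t padded(int n, int nguard)
{
    return (size_t)(2 * nguard + n);
}

/* Linear offset of unk(v+1, i+1, j+1, k+1, b+1) for 0-based v, i, j, k, b,
 * assuming column-major (FORTRAN) storage: the first index varies fastest. */
size_t unk_index(const UnkDims *d, int v, int i, int j, int k, int b)
{
    size_t nx = padded(d->nxb, d->nguard);
    size_t ny = padded(d->nyb, d->nguard);
    size_t nz = padded(d->nzb, d->nguard);

    assert(v >= 0 && v < d->nvar);
    assert(i >= 0 && (size_t)i < nx);
    assert(j >= 0 && (size_t)j < ny);
    assert(k >= 0 && (size_t)k < nz);
    assert(b >= 0 && b < d->blocks);

    return (size_t)v + (size_t)d->nvar *
           ((size_t)i + nx * ((size_t)j + ny * ((size_t)k + nz * (size_t)b)));
}

/* Total size of unk in bytes, assuming double-precision data. */
size_t unk_bytes(const UnkDims *d)
{
    return sizeof(double) * (size_t)d->nvar *
           padded(d->nxb, d->nguard) * padded(d->nyb, d->nguard) *
           padded(d->nzb, d->nguard) * (size_t)d->blocks;
}

/* Fraction of the zones stored in a block that are interior (not guardcells). */
double interior_fraction(const UnkDims *d)
{
    double interior = (double)d->nxb * d->nyb * d->nzb;
    double total = (double)(padded(d->nxb, d->nguard) *
                            padded(d->nyb, d->nguard) *
                            padded(d->nzb, d->nguard));
    return interior / total;
}
```

With nvar=24, nguard=4, nxb=nyb=nzb=8, and blocks=100, unk_bytes gives 78,643,200 bytes and interior_fraction gives 0.125.
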
When writing the data for checkpointing or analysis, the guardcells are not stored; only the block interiors are written. This extraction is currently performed via a memory copy into a buffer array in the FORTRAN routines before the data is passed on to the HDF 5 routines. This method proves to be faster than using the HDF 5 memory space functionality to create a hyperslab containing only the interior zones.

Additionally, each variable is stored in a separate record. Thus we write

    unk(i,nguard+1:nguard+nxb,nguard+1:nguard+nyb,nguard+1:nguard+nzb,lblocks)

where i is the variable number and lblocks is the number of actual blocks on the processor. Typically, nxb=nyb=nzb=8, and the data is double precision, so the size of each record from a single processor is

    8 bytes/number * 8 zones in x * 8 zones in y * 8 zones in z * 100 blocks

or 409,600 bytes (400 kB). This small record size is a major factor in the poor I/O performance currently achieved: the guardcells take up a large fraction of the total memory on a processor, which limits the number of blocks a processor can hold and therefore the size of each record written to disk.

To run the benchmark code, build it as described above, and submit it with the desired number of processors. The code will put 0.8*maxblocks blocks on each processor (give or take 2, so that the number is not constant across processors). The code will produce a checkpoint file (containing all variables in 8-byte precision) and two plotfiles (4 variables, 4-byte precision, one containing corner data, the other containing cell-centered data). The plotfiles are considerably smaller than the checkpoint files.

Additionally, since plotfiles need to support data interpolated to the corners before writing, the data is copied into a buffer in FORTRAN code, where it is reduced in precision and interpolated if necessary. This buffer contains all the data for a single variable on a processor in a contiguous portion of memory (i.e. the guardcells have been removed).
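
The copy into a contiguous buffer described above can be sketched in C as follows. This is only an illustration: the benchmark performs this step in FORTRAN, and the function names, argument order, and 0-based indexing here are assumptions. The sketch treats unk as column-major, copies the interior zones of one variable for the first lblocks blocks, and optionally reduces the result to 4-byte precision as is done for plotfiles. Corner interpolation is not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Copy the interior zones of variable v (0-based) for the first lblocks
 * blocks of unk into buf, skipping the guardcells. unk is assumed to be
 * stored column-major with extents (nvar, 2*nguard+nxb, 2*nguard+nyb,
 * 2*nguard+nzb, blocks). Returns the number of values written to buf. */
size_t extract_interior(const double *unk, int nvar, int nguard,
                        int nxb, int nyb, int nzb, int lblocks,
                        int v, double *buf)
{
    size_t ng = (size_t)nguard;
    size_t nx = 2 * ng + (size_t)nxb;
    size_t ny = 2 * ng + (size_t)nyb;
    size_t nz = 2 * ng + (size_t)nzb;
    size_t n = 0;

    assert(unk != NULL && buf != NULL);
    assert(v >= 0 && v < nvar);

    for (size_t b = 0; b < (size_t)lblocks; b++)
        for (size_t k = ng; k < ng + (size_t)nzb; k++)
            for (size_t j = ng; j < ng + (size_t)nyb; j++)
                for (size_t i = ng; i < ng + (size_t)nxb; i++) {
                    size_t src = (size_t)v + (size_t)nvar *
                                 (i + nx * (j + ny * (k + nz * b)));
                    buf[n++] = unk[src];
                }
    return n;
}

/* Reduce a buffer from 8-byte to 4-byte precision, as for plotfiles. */
void reduce_precision(const double *in, size_t n, float *out)
{
    for (size_t m = 0; m < n; m++)
        out[m] = (float)in[m];
}

/* Size in bytes of one double-precision record from a single processor. */
size_t record_bytes(int nxb, int nyb, int nzb, int lblocks)
{
    return sizeof(double) * (size_t)nxb * (size_t)nyb *
           (size_t)nzb * (size_t)lblocks;
}
```

For the typical case, record_bytes(8, 8, 8, 100) returns 409600, which matches the 400 kB record size quoted above. After the copy, buf holds one variable for all local blocks with the x index varying fastest.
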
The buffer assembled in FORTRAN is then passed to the C routines that make the actual HDF 5 calls.

The checkpoint and plotfile routines are identical to those used in the FLASH code. Any performance improvements or changes made here will carry over directly to FLASH.

There is some platform-dependent code in the C routines that talk to the HDF 5 library. It is separated by preprocessor directives, which control the I/O mode (collective vs. independent), cache sizes, etc. Tweaking these parameters could improve performance.
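
The general pattern of selecting tuning parameters with preprocessor directives might look like the C sketch below. The macro names USE_COLLECTIVE_IO and IO_CACHE_BYTES, the IoTuning struct, and the default values are all hypothetical; the actual directives and values in the benchmark's C routines may differ. In HDF 5 itself, the collective or independent transfer mode is set on a dataset transfer property list with H5Pset_dxpl_mpio, using H5FD_MPIO_COLLECTIVE or H5FD_MPIO_INDEPENDENT. The sketch omits the HDF 5 calls so that it compiles without the library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical switches, normally set on the compiler command line,
 * e.g. -DUSE_COLLECTIVE_IO -DIO_CACHE_BYTES=4194304 */
#ifdef USE_COLLECTIVE_IO
#define IO_COLLECTIVE 1
#else
#define IO_COLLECTIVE 0
#endif

#ifndef IO_CACHE_BYTES
#define IO_CACHE_BYTES (1024 * 1024) /* arbitrary default for this sketch */
#endif

/* Tuning parameters chosen at compile time. */
typedef struct {
    int    collective;  /* nonzero: collective I/O, zero: independent I/O */
    size_t cache_bytes; /* cache size to request, in bytes */
} IoTuning;

/* Return the parameters selected by the preprocessor directives above. */
IoTuning io_tuning(void)
{
    IoTuning t;
    t.collective  = IO_COLLECTIVE;
    t.cache_bytes = (size_t)IO_CACHE_BYTES;
    assert(t.cache_bytes > 0);
    return t;
}

/* Human-readable name of the selected I/O mode, for logging. */
const char *io_mode_name(const IoTuning *t)
{
    return t->collective ? "collective" : "independent";
}
```

Built without any -D flags, this sketch selects independent I/O and a 1 MB cache. Adding -DUSE_COLLECTIVE_IO to a platform's Makefile switches the mode without touching the source.
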