FLASH I/O Benchmark Routine -- Parallel HDF 5

Introduction

The FLASH I/O benchmark routine measures the performance of the FLASH parallel HDF 5 output. It recreates the primary data structures in FLASH and produces a checkpoint file, a plotfile with centered data, and a plotfile with corner data. The plotfiles have single precision data.

The purpose of this routine is to tune the I/O performance in a controlled environment. The I/O routines are identical to the routines used by FLASH, so any performance improvements made to the benchmark program will be shared by FLASH.
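For orientation, the basic pattern the routine exercises -- open one file across all processors through the MPI-IO driver, have each processor select the hyperslab covering its own blocks, and write -- looks roughly like the C sketch below. The block counts, dataset layout, and names are illustrative only (they are not the benchmark's actual ones), and the calls follow the HDF 5 1.4-era signatures (H5Dcreate in particular has grown extra arguments in later releases).

#include <mpi.h>
#include <hdf5.h>

#define NXB 8                          /* zones per block in x, y, z (illustrative) */
#define NYB 8
#define NZB 8
#define LOCAL_BLOCKS 80                /* roughly 80 blocks per processor, as in the runs below */

int main(int argc, char *argv[])
{
    int rank, nprocs;
    static double unk[LOCAL_BLOCKS][NZB][NYB][NXB];    /* one FLASH-like variable */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* access the file in parallel through MPI-IO */
    hid_t acc = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(acc, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("flash_io_sketch.h5", H5F_ACC_TRUNC, H5P_DEFAULT, acc);
    H5Pclose(acc);

    /* one global dataset holding every block from every processor */
    hsize_t dims[4] = {(hsize_t)nprocs * LOCAL_BLOCKS, NZB, NYB, NXB};
    hid_t filespace = H5Screate_simple(4, dims, NULL);
    hid_t dset = H5Dcreate(file, "unknowns", H5T_NATIVE_DOUBLE, filespace, H5P_DEFAULT);

    /* each processor selects the hyperslab covering its own blocks ... */
    hsize_t start[4] = {(hsize_t)rank * LOCAL_BLOCKS, 0, 0, 0};
    hsize_t count[4] = {LOCAL_BLOCKS, NZB, NYB, NXB};
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t memspace = H5Screate_simple(4, count, NULL);

    /* ... and writes it (default, i.e. independent, transfers) */
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, H5P_DEFAULT, unk);

    H5Sclose(memspace);
    H5Sclose(filespace);
    H5Dclose(dset);
    H5Fclose(file);
    MPI_Finalize();
    return 0;
}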

Makefiles for the ASCI Red (TFLOPS) machine at SNL, ASCI Blue Pacific / Frost at LLNL, and SGI platforms are included. Information on the performance and difficulties encountered with parallel HDF 5 will be posted on this page.

Current Issues

Below are some issues that came up in meetings with the NCSA HDF developers and with the parallel I/O project at LLNL; we hope to address them with the FLASH I/O benchmark program.

 

Summary of a meeting with Richard Hedges and company of the parallel I/O project at LLNL this past week to discuss I/O performance of FLASH on ASCI Blue Pacific:

Richard is going to look into fixes for the memory bug that is preventing us from writing from a large number of processors. I will look into packing the first few records into a single record. Finally, we need to figure out why the alignment is not working. Apparently, Kim Yates had it working in an earlier version of the library and saw good results, but at some point (~ HDF 5 1.2.2?) it stopped working.
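The alignment in question is presumably the HDF 5 file-level alignment hint (H5Pset_alignment), which is set on the file access property list used in the sketch above. A rough fragment is shown below; the 512 KB threshold and 1 MB alignment are made-up values, and the right choice depends on the parallel file system.

/* request aligned allocation before creating the file (illustrative values) */
hid_t acc = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(acc, MPI_COMM_WORLD, MPI_INFO_NULL);
/* align every object of 512 KB or more on a 1 MB boundary in the file */
H5Pset_alignment(acc, 524288, 1048576);
hid_t file = H5Fcreate("flash_io_sketch.h5", H5F_ACC_TRUNC, H5P_DEFAULT, acc);
H5Pclose(acc);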

View the README file included with the I/O benchmark.

 

 

Download

Download the benchmark routine

Currently, this benchmark routine uses the following libraries on the ASCI platforms:

Platform            HDF library        compiler version             MPI version
ASCI Red            1.4.0 (parallel)   FORTRAN: if90 Rel 3.1-4i     MPICH 1.2.1
                                       C: icc Rel 3.1-4i
ASCI Blue Pacific   1.4.1              FORTRAN: newmpf90            IBM MPI
                                       C: newmpcc
Frost               1.4.1              FORTRAN: newmpf90            IBM MPI (Mohonk)
                                       C: newmpcc

 

 

History

 

 

Performance

Note: the timings reported below are for the entire I/O routine, not just the writing to disk. For example, they include the hyperslab selection, the interpolation to corners (where necessary), and the reduction to single precision (for the plotfiles). Thus they represent lower bounds on the actual bandwidth to disk. Since all of these operations are performed every time a file is written, including them in the timing is reasonable: these numbers measure the effective I/O rate that FLASH actually sees.
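As an illustration of the precision reduction step, a plotfile write in the spirit of the sketch in the Introduction might copy the double precision solution into a single precision buffer and write it into a float dataset. Whether the benchmark copies explicitly or lets HDF 5 convert during the write is not spelled out on this page, and the variable name "density" is made up.

/* single precision staging buffer for the plotfile (illustrative) */
static float plot_buf[LOCAL_BLOCKS][NZB][NYB][NXB];
size_t i, n = (size_t)LOCAL_BLOCKS * NZB * NYB * NXB;
for (i = 0; i < n; i++)
    ((float *)plot_buf)[i] = (float)((double *)unk)[i];

/* the plotfile dataset is created with a single precision file type */
hid_t plot_dset = H5Dcreate(file, "density", H5T_NATIVE_FLOAT, filespace, H5P_DEFAULT);
H5Dwrite(plot_dset, H5T_NATIVE_FLOAT, memspace, filespace, H5P_DEFAULT, plot_buf);
H5Dclose(plot_dset);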

Timings on ASCI Red:

(3-3-01) These calculations were run on janus writing to /pfs_grande/tmp_2/zingale/. -proc 2 mode was used, with ~ 80 computational blocks per processor. The submission script looked like:

#!/bin/csh
setenv MPI_HEAP_SIZE 204800
setenv MPI_MATCH_LIST_SIZE 20000
setenv MPI_SHORT_MSG_SIZE 1024
setenv MPI_COLL_WORK_SIZE 10240
setenv MPI_GETPUT_ML_SIZE 20
cd $QSUB_WORKDIR
echo "about to run the FLASH I/O benchmark"
yod -masync -proc 2 ./flash_benchmark_io

and the job was submitted with:
qsub -q edu.day -lT 3600 -lP 64 submit

The table below gives the timings and file sizes for the three files generated by the benchmark program.

              checkpoint file                       plotfile                            plotfile w/ corners
# of procs    size (bytes)   time (s)    MB/s       size (bytes)   time (s)   MB/s      size (bytes)   time (s)   MB/s
        64       509837188   112.4395    4.324          42697532    44.5674   0.9137        60692908    47.3178   1.223
       256      2039594116   399.0858    4.874         170783804   164.8542   0.9880       242775724   155.0007   1.494
       512      does not complete
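The MB/s column appears to be simply size / time, counting 1 MB as 2^20 bytes; the 64 processor checkpoint row, for example, works out as below (a throwaway check, not part of the benchmark).

#include <stdio.h>

int main(void)
{
    double size_bytes = 509837188.0;    /* 64 processor checkpoint size from the table */
    double time_s     = 112.4395;       /* 64 processor checkpoint time from the table */
    /* 509837188 / 2^20 / 112.4395 = 4.324 MB/s, matching the table */
    printf("%.3f MB/s\n", size_bytes / (1024.0 * 1024.0) / time_s);
    return 0;
}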

As the table shows, the performance is not very good, although it is better than with the old version of the HDF 5 library. These tests were run using collective I/O (see the benchmark program, in particular the code fragments inside the TFLOPS directives).
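For reference, collective transfers are requested through the dataset transfer property list, roughly as in the fragment below; the sketch in the Introduction uses the default (independent) transfers, and whether this fragment matches the exact property handling inside the benchmark's TFLOPS code paths should be checked against the source.

/* ask for collective MPI-IO transfers on this write */
hid_t xfer = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(xfer, H5FD_MPIO_COLLECTIVE);    /* or H5FD_MPIO_INDEPENDENT */
H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, xfer, unk);
H5Pclose(xfer);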

On 512 processors, the checkpoint file size would grow beyond 2 GB. Right around this point, the code issues a large number of errors:
 
File locking failed in ADIOI_Set_lock. If the file system is NFS, you need to use NFS version 3 and mount the directory with the 'noac' option (no attribute caching).
 
It turns out that Red does not have support for files > 2 GB (despite the conflicting reports in the documentation). There is no plan to upgrade pfs on Red to support large files.
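The 2 GB figure is consistent with a signed 32-bit file offset limit (an assumption about pfs, not something stated in its documentation). The 256 processor checkpoint from the table above just fits under it, while a 512 processor checkpoint, roughly twice as large, cannot:

#include <stdio.h>

int main(void)
{
    long long limit    = 2147483647LL;   /* 2^31 - 1 bytes, i.e. the 2 GB limit */
    long long ckpt_256 = 2039594116LL;   /* 256 processor checkpoint size from the table */
    long long ckpt_512 = 2LL * ckpt_256; /* a 512 processor checkpoint would be about twice as large */
    printf("256 procs under the limit: %d\n", ckpt_256 < limit);   /* prints 1 */
    printf("512 procs under the limit: %d\n", ckpt_512 < limit);   /* prints 0 */
    return 0;
}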

 

 

Timings on ASCI Blue Pacific:

(3-3-01) These calculations were run on blue writing to /p/gb1/zingale/. All 4 processors on a node were used as MPI tasks. IBM MPI was used with HDF 5 v.1.4. The submission script looked like:

#! /bin/csh -x
#PSUB -s /bin/csh
#PSUB -c "pbatch"
#PSUB -ln 32   # Number of nodes you want to use
#PSUB -g 128   # Number of processors you want (ln * 4)
#PSUB -eo
#PSUB -tM 1:00
#
cd /p/gb1/zingale/io_bench/
set exec=/p/gb1/zingale/io_bench/flash_benchmark_io
#
setenv FLASHLOG flashlog.$$
echo "running: " $exec > $FLASHLOG
#
poe $exec >> $FLASHLOG

and the job was submitted with
psub < runflash

The table below gives the timings and file sizes for the three files generated by the benchmark program.

              checkpoint file                       plotfile                            plotfile w/ corners
# of procs    size (bytes)   time (s)    MB/s       size (bytes)   time (s)   MB/s      size (bytes)   time (s)   MB/s
        64       509837188    22.7511   21.371          42697532     4.8059   8.473         60692908     3.3724   17.163

Currently, 256 processor jobs do not run on Blue due to a problem in the IBM MPI / HDF 5 interaction; it is not yet clear how to resolve this.

 

 

Timings on Frost (LLNL):

(3-3-01) These calculations were run on frost writing to /p/gf1/zingale/. All 16 processors on a node were used as MPI tasks. IBM MPI was used with HDF 5 v.1.4.1. The submission script looked like:

#! /bin/csh -x
#PSUB -s /bin/csh
#PSUB -c "pbatch"
#PSUB -ln 4   # Number of nodes you want to use
#PSUB -g 64   # Number of processors you want (ln * 16)
#PSUB -eo
#PSUB -tM 1:00
#
cd /p/gf1/zingale/io_bench/
set exec=/p/gf1/zingale/io_bench/flash_benchmark_io
#
setenv FLASHLOG flashlog.$$
echo "running: " $exec > $FLASHLOG
#
poe $exec >> $FLASHLOG

and the job was submitted with
psub < runflash

The table below gives the timings and file sizes for the three files generated by the benchmark program.

              checkpoint file                       plotfile                            plotfile w/ corners
# of procs    size (bytes)   time (s)    MB/s       size (bytes)   time (s)   MB/s      size (bytes)   time (s)   MB/s
        64       510374172      11.92     40.8                 -          -      -          61064036       2.09    27.9
       256      2041748508      15.08    129.1                 -          -      -         244266596      11.12    20.9
       512      4083580956      28.32    137.5         344549364      10.87   30.2         488536676       8.84    52.7
       768      6125511872      27.56    212.0         488536676      12.07   40.8         732818536      13.43    52.0

Currently, there are no known problems on Frost.

 

 

Problems

 

 


Contact zingale@flash.uchicago.edu with comments on the benchmark routine.