          ROMIO: A High-Performance, Portable MPI-IO Implementation

                      Version 2008-03-09

Major Changes in this version:
------------------------------
* Fixed performance problems with the darray and subarray datatypes
  when using MPICH.

* Better support for building against an existing MPICH installation.

  When building against an existing MPICH installation, use the
  "--with-mpi=mpich" option to ROMIO configure. This allows ROMIO to
  take advantage of internal features of the implementation.

* Deprecation of SFS, HFS, and PIOFS implementations.

  These are no longer actively supported, although the code will continue
  to be distributed for now.

* Initial support for the Panasas PanFS filesystem.

  PanFS allows users to specify the layout of a file at file-creation time.
  Layout information includes the number of StorageBlades (SB)
  across which the data is stored, the number of SBs across which a
  parity stripe is written, and the number of consecutive stripes that
  are placed on the same set of SBs. The panfs_layout_* hints are only
  used if supplied at file-creation time.

  panfs_layout_type - Specifies the layout of a file:
                      2 = RAID0
                      3 = RAID5 Parity Stripes
  panfs_layout_stripe_unit - The size of the stripe unit in bytes
  panfs_layout_total_num_comps - The total number of StorageBlades a file
                                 is striped across.
  panfs_layout_parity_stripe_width - If the layout type is RAID5 Parity
                                     Stripes, this hint specifies the
                                     number of StorageBlades in a parity
                                     stripe.
  panfs_layout_parity_stripe_depth - If the layout type is RAID5 Parity
                                     Stripes, this hint specifies the
                                     number of contiguous parity stripes
                                     written across the same set of SBs.
  panfs_layout_visit_policy - If the layout type is RAID5 Parity Stripes,
                              the policy used to determine the parity
                              stripe a given file offset is written to:
                              1 = Round Robin

  PanFS supports the "concurrent write" (CW) mode, where groups of cooperating
  clients can disable the PanFS consistency mechanisms and use their own
  consistency protocol. Clients participating in concurrent write mode use
  application specific information to improve performance while maintaining
  file consistency. All clients accessing the file(s) must enable concurrent
  write mode. If any client does not enable concurrent write mode, then the
  PanFS consistency protocol will be invoked. Once a file is opened in CW mode
  on a machine, attempts to open a file in non-CW mode will fail with EACCES.
  If a file is already opened in non-CW mode, attempts to open the file in CW
  mode will fail with EACCES. The following hint is used to enable concurrent
  write mode.

  panfs_concurrent_write - If set to 1 at file open time, the file
                           is opened using the PanFS concurrent write
                           mode flag. Concurrent write mode is not a
                           persistent attribute of the file.

  Below is an example PanFS layout using the following parameters:

  - panfs_layout_type                = 3
  - panfs_layout_total_num_comps     = 100
  - panfs_layout_parity_stripe_width = 10
  - panfs_layout_parity_stripe_depth = 8
  - panfs_layout_visit_policy        = 1

   Parity Stripe Group 1     Parity Stripe Group 2  . . .  Parity Stripe Group 10
  ----------------------    ----------------------         --------------------
   SB1    SB2  ...  SB10     SB11   SB12 ...  SB20   ...   SB91   SB92 ...  SB100
  -----------------------   -----------------------        ---------------------
   D1     D2   ...  D10      D91    D92  ...  D100         D181   D182 ...  D190
   D11    D12       D20      D101   D102      D110         D191   D192      D193
   D21    D22       D30      . . .                         . . .
   D31    D32       D40
   D41    D42       D50
   D51    D52       D60
   D61    D62       D70
   D71    D72       D80
   D81    D82       D90      D171   D172      D180         D261   D262      D270
   D271   D272      D273     . . .                         . . .
   ...

* Initial support for the Globus GridFTP filesystem. Work contributed by
  Troy Baer (troy@osc.edu).
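
The grouping of StorageBlades in the example PanFS layout above can be
reproduced with a little arithmetic. The following is only a sketch inferred
from that example (100 StorageBlades and a parity stripe width of 10 give 10
groups of consecutive blades); it is not taken from PanFS documentation. It
assumes that the width divides the blade count evenly, and the struct and
function names are hypothetical.

```c
/* Hypothetical container for two of the panfs_layout_* hint values. */
struct panfs_layout {
    int total_num_comps;      /* panfs_layout_total_num_comps */
    int parity_stripe_width;  /* panfs_layout_parity_stripe_width */
};

/* Number of parity stripe groups, or -1 if the values are not usable.
   Assumption: the width must divide the blade count evenly. */
int panfs_group_count(const struct panfs_layout *l)
{
    if (l->total_num_comps <= 0 || l->parity_stripe_width <= 0)
        return -1;
    if (l->total_num_comps % l->parity_stripe_width != 0)
        return -1;
    return l->total_num_comps / l->parity_stripe_width;
}

/* 1-based parity stripe group holding StorageBlade sb (also 1-based),
   following the numbering drawn in the diagram. Returns 0 if sb is
   out of range. */
int panfs_group_of_blade(const struct panfs_layout *l, int sb)
{
    if (l->parity_stripe_width <= 0 || sb < 1 || sb > l->total_num_comps)
        return 0;
    return (sb - 1) / l->parity_stripe_width + 1;
}
```

With total_num_comps = 100 and parity_stripe_width = 10, the sketch gives 10
groups, places SB11 in group 2, and places SB100 in group 10, matching the
column headings of the diagram.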

Major Changes in Version 1.2.5:
------------------------------

* Initial support for MPICH-2
* Fixed a bug in which ROMIO would get confused for some permutations
  of the aggregator list
* Direct I/O on IRIX's XFS should work now
* Fixed an issue with the Fortran bindings that caused them to fail
  when some compilers tried to build them
* Initial support for deferred opens

Major Changes in Version 1.2.4:
------------------------------
* Added section describing ROMIO MPI_FILE_SYNC and MPI_FILE_CLOSE behavior to
  User's Guide
* Bug removed from PVFS ADIO implementation regarding resize operations
* Added support for PVFS listio operations, including hints to control use

Major Changes in Version 1.2.3:
-------------------------------
* Enhanced aggregation control via cb_config_list, romio_cb_read,
  and romio_cb_write hints
* Asynchronous IO can be enabled under Linux with the --enable-aio argument
  to configure
* Additional PVFS support
* Additional control over data sieving with romio_ds_read hint
* NTFS ADIO implementation integrated into source tree
* testfs ADIO implementation added for debugging purposes

Major Changes in Version 1.0.3:
-------------------------------

* When used with MPICH 1.2.1, the MPI-IO functions return proper error codes
  and classes, and the status object is filled in.

* On SGI's XFS file system, ROMIO can use direct I/O even if the
  user's request does not meet the various restrictions needed to use
  direct I/O. ROMIO does this by doing part of the request with
  buffered I/O (until all the restrictions are met) and doing the rest
  with direct I/O. (This feature hasn't been tested rigorously. Please
  check for errors.)

  By default, ROMIO will use only buffered I/O. Direct I/O can be
  enabled either by setting the environment variables MPIO_DIRECT_READ
  and/or MPIO_DIRECT_WRITE to TRUE, or on a per-file basis by using
  the info keys "direct_read" and "direct_write".

  Direct I/O will result in higher performance only if you are
  accessing a high-bandwidth disk system. Otherwise, buffered I/O is
  better and is therefore used as the default.

* Miscellaneous bug fixes.

Major Changes in Version 1.0.2:
-------------------------------

* Implemented the shared file pointer functions and
  split collective I/O functions. Therefore, the main
  components of the MPI I/O chapter not yet implemented are
  file interoperability and error handling.

* Added support for using "direct I/O" on SGI's XFS file system.
  Direct I/O is an optional feature of XFS in which data is moved
  directly between the user's buffer and the storage devices, bypassing
  the file-system cache. This can improve performance significantly on
  systems with high disk bandwidth. Without high disk bandwidth,
  regular I/O (that uses the file-system cache) performs better.
  ROMIO, therefore, does not use direct I/O by default. The user can
  turn on direct I/O (separately for reading and writing) either by
  using environment variables or by using MPI's hints mechanism (info).
  To use the environment-variables method, do
       setenv MPIO_DIRECT_READ TRUE
       setenv MPIO_DIRECT_WRITE TRUE
  To use the hints method, the two keys are "direct_read" and "direct_write".
  By default their values are "false". To turn on direct I/O, set the values
  to "true". The environment variables have priority over the info keys.
  In other words, if the environment variables are set to TRUE, direct I/O
  will be used even if the info keys say "false", and vice versa.
  Note that direct I/O must be turned on separately for reading
  and writing.
  The environment-variables method assumes that the environment
  variables can be read by each process in the MPI job. This is
  not guaranteed by the MPI Standard, but it works with SGI's MPI
  and the ch_shmem device of MPICH.

* Added support (new ADIO device, ad_pvfs) for the PVFS parallel
  file system for Linux clusters, developed at Clemson University
  (see http://www.parl.clemson.edu/pvfs ).
  To use it, you must first install
  PVFS and then when configuring ROMIO, specify "-file_system=pvfs" in
  addition to any other options to "configure". (As usual, you can configure
  for multiple file systems by using "+"; for example,
  "-file_system=pvfs+ufs+nfs".) You will need to specify the path
  to the PVFS include files via the "-cflags" option to configure,
  for example, "configure -cflags=-I/usr/pvfs/include". You
  will also need to specify the full path name of the PVFS library.
  The best way to do this is via the "-lib" option to MPICH's
  configure script (assuming you are using ROMIO from within MPICH).

* Uses weak symbols (where available) for building the profiling version,
  i.e., the PMPI routines. As a result, the size of the library is reduced
  considerably.

* The Makefiles use "virtual paths" if supported by the make utility. GNU make
  supports it, for example. This feature allows you to untar the
  distribution in some directory, say a slow NFS directory,
  and compile the library (the .o files) in another
  directory, say on a faster local disk. For example, if the tar file
  has been untarred in an NFS directory called /home/thakur/romio,
  one can compile it in a different directory, say /tmp/thakur, as follows:
       cd /tmp/thakur
       /home/thakur/romio/configure
       make
  The .o files will be created in /tmp/thakur; the library will be created in
  /home/thakur/romio/lib/$ARCH/libmpio.a .
  This method works only if the make utility supports virtual paths.
  If the default make does not, you can install GNU make which does,
  and specify it to configure as
       /home/thakur/romio/configure -make=/usr/gnu/bin/gmake (or whatever)

* Lots of miscellaneous bug fixes and other enhancements.

* This version is included in MPICH 1.2.0. If you are using MPICH, you
  need not download ROMIO separately; it gets built as part of MPICH.
  The previous version of ROMIO is included in LAM, HP MPI, SGI MPI, and
  NEC MPI.
  NEC has also implemented the MPI-IO functions missing
  in ROMIO, and therefore NEC MPI has a complete implementation
  of MPI-IO.

Major Changes in Version 1.0.1:
------------------------------

* This version is included in MPICH 1.1.1 and HP MPI 1.4.

* Added support for NEC SX-4 and created a new device ad_sfs for
  NEC SFS file system.

* New devices ad_hfs for HP/Convex HFS file system and ad_xfs for
  SGI XFS file system.

* Users no longer need to prefix the filename with the type of
  file system; ROMIO determines the file-system type on its own.

* Added support for 64-bit file sizes on IBM PIOFS, SGI XFS,
  HP/Convex HFS, and NEC SFS file systems.

* MPI_Offset is an 8-byte integer on machines that support 8-byte integers.
  It is of type "long long" in C and "integer*8" in Fortran.
  With a Fortran 90 compiler, you can use either integer*8 or
  integer(kind=MPI_OFFSET_KIND).
  If you printf an MPI_Offset in C, remember to use %lld
  or %ld as required by your compiler. (See what is used in the test
  program romio/test/misc.c.)

* On some machines, ROMIO detects at configure time that "long long" is
  either not supported by the C compiler or it doesn't work properly.
  In such cases, configure sets MPI_Offset to long in C and integer in
  Fortran. This happens on Intel Paragon, Sun4, and FreeBSD.

* Added support for passing hints to the implementation via the MPI_Info
  parameter. ROMIO understands the following hints (keys in MPI_Info object):

  /* on all file systems */
     cb_buffer_size - buffer size for collective I/O
     cb_nodes - no. of processes that actually perform I/O in collective I/O
     ind_rd_buffer_size - buffer size for data sieving in independent reads

  /* on all file systems except IBM PIOFS */
     ind_wr_buffer_size - buffer size for data sieving in independent writes

  /* ind_wr_buffer_size is ignored on PIOFS because data sieving
     cannot be done for writes since PIOFS doesn't support file locking */

  /* on Intel PFS and IBM PIOFS only.
     These hints are understood only if
     supplied at file-creation time. */
     striping_factor - no. of I/O devices to stripe the file across
     striping_unit - the striping unit in bytes
     start_iodevice - the number of the I/O device from which to start
                      striping (between 0 and (striping_factor-1))

  /* on Intel PFS only. */
     pfs_svr_buf - turn on or off PFS server buffering by setting the value
                   to "true" or "false", case-sensitive.

  If ROMIO doesn't understand a hint, or if the value is invalid, the hint
  will be ignored. The values of hints being used by ROMIO at any time
  can be obtained via MPI_File_get_info.



General Information
-------------------

ROMIO is a high-performance, portable implementation of MPI-IO (the
I/O chapter in MPI). ROMIO's home page is at
http://www.mcs.anl.gov/romio . The MPI standard is available at
http://www.mpi-forum.org/docs/docs.html .

This version of ROMIO includes everything defined in the MPI I/O
chapter except support for file interoperability and
user-defined error handlers for files. The subarray and
distributed array datatype constructor functions from Chapter 4
(Sec. 4.14.4 & 4.14.5) have been implemented. They are useful for
accessing arrays stored in files. The functions MPI_File_f2c and
MPI_File_c2f (Sec. 4.12.4) are also implemented. C,
Fortran, and profiling interfaces are provided for all functions that
have been implemented.

Please read the limitations of this version of ROMIO that are listed
below (e.g., MPIO_Request object, restriction to homogeneous
environments).

This version of ROMIO runs on at least the following machines: IBM SP;
Intel Paragon; HP Exemplar; SGI Origin2000; Cray T3E; NEC SX-4; other
symmetric multiprocessors from HP, SGI, DEC, Sun, and IBM; and networks of
workstations (Sun, SGI, HP, IBM, DEC, Linux, and FreeBSD). Supported
file systems are IBM PIOFS, Intel PFS, HP/Convex HFS, SGI XFS, NEC
SFS, PVFS, NFS, and any Unix file system (UFS).

This version of ROMIO is included in MPICH 1.2.3; an earlier version
is included in at least the following MPI implementations: LAM, HP
MPI, SGI MPI, and NEC MPI.

Note that proper I/O error codes and classes are returned and the
status variable is filled only when used with MPICH 1.2.1 or later.

You can open files on multiple file systems in the same program. The
only restriction is that the directory where the file is to be opened
must be accessible from the process opening the file. For example, a
process running on one workstation may not be able to access a
directory on the local disk of another workstation, and therefore
ROMIO will not be able to open a file in such a directory. NFS-mounted
files can be accessed.

An MPI-IO file created by ROMIO is no different than any other file
created by the underlying file system. Therefore, you may use any of
the commands provided by the file system to access the file, e.g., ls,
mv, cp, rm, ftp.



Using ROMIO on NFS
------------------

To use ROMIO on NFS, file locking with fcntl must work correctly on
the NFS installation. On some installations, fcntl locks don't work.
To get them to work, you need to use Version 3 of NFS, ensure that the
lockd daemon is running on all the machines, and have the system
administrator mount the NFS file system with the "noac" option (no
attribute caching). Turning off attribute caching may reduce
performance, but it is necessary for correct behavior.

The following are some instructions we received from Ian Wells of HP
for setting the noac option on NFS. We have not tried them
ourselves. We are including them here because you may find
them useful. Note that some of the steps may be specific to HP
systems, and you may need root permission to execute some of the
commands.

>1. first confirm you are running nfs version 3
>
>rpcinfo -p `hostname` | grep nfs
>
>ie
>    goedel >rpcinfo -p goedel | grep nfs
>    100003    2   udp   2049  nfs
>    100003    3   udp   2049  nfs
>
>
>2. then edit /etc/fstab for each nfs directory read/written by MPIO
>   on each machine used for multihost MPIO.
>
>    Here is an example of a correct fstab entry for /epm1:
>
>   ie grep epm1 /etc/fstab
>
>      ROOOOT 11>grep epm1 /etc/fstab
>      gershwin:/epm1 /rmt/gershwin/epm1 nfs bg,intr,noac 0 0
>
>   if the noac option is not present, add it
>   and then remount this directory
>   on each of the machines that will be used to share MPIO files
>
>ie
>
>ROOOOT >umount /rmt/gershwin/epm1
>ROOOOT >mount  /rmt/gershwin/epm1
>
>3. Confirm that the directory is mounted noac:
>
>ROOOOT >grep gershwin /etc/mnttab
>gershwin:/epm1 /rmt/gershwin/epm1 nfs
>noac,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0 0 0 899911504



ROMIO Installation Instructions
-------------------------------

Since ROMIO is included in MPICH, LAM, HP MPI, SGI MPI, and NEC MPI,
you don't need to install it separately if you are using any of these
MPI implementations. If you are using some other MPI, you can
configure and build ROMIO as follows:

Untar the tar file as

    gunzip -c romio.tar.gz | tar xvf -

OR

    zcat romio.tar.Z | tar xvf -

THEN

    cd romio
    ./configure
    make

Some example programs and a Makefile are provided in the romio/test directory.
Run the examples the way you would run any MPI program. Each program takes
the filename as a command-line argument "-fname filename".

The configure script by default configures ROMIO for the file systems
most likely to be used on the given machine. If you wish, you can
explicitly specify the file systems by using the "-file_system" option
to configure. Multiple file systems can be specified by using "+" as a
separator.
For example,

    ./configure -file_system=xfs+nfs

For the entire list of options to configure do

    ./configure -h | more

After building a specific version as above, you can install it in a
particular directory with

    make install PREFIX=/usr/local/romio    (or whatever directory you like)

or just

    make install           (if you used -prefix at configure time)

If you intend to leave ROMIO where you built it, you should NOT install it
(install is used only to move the necessary parts of a built ROMIO to
another location). The installed copy will have the include files,
libraries, man pages, and a few other odds and ends, but not the whole
source tree. It will have a test directory for testing the
installation and a location-independent Makefile built during
installation, which users can copy and modify to compile and link
against the installed copy.

To rebuild ROMIO with a different set of configure options, do

    make distclean

to clean everything including the Makefiles created by configure.
Then run configure again with the new options, followed by make.



Testing ROMIO
-------------

To test if the installation works, do

     make testing

in the romio/test directory. This calls a script that runs the test
programs and compares the results with what they should be. By
default, "make testing" causes the test programs to create files in
the current directory and use whatever file system that corresponds
to. To test with other file systems, you need to specify a filename in
a directory corresponding to that file system as follows:

     make testing TESTARGS="-fname=/foo/piofs/test"



Compiling and Running MPI-IO Programs
-------------------------------------

If ROMIO is not already included in the MPI implementation, you need
to include the file mpio.h for C or mpiof.h for Fortran in your MPI-IO
program.

Note that on HP machines running HPUX and on NEC SX-4, you need to
compile Fortran programs with mpifort, because the f77 compilers on
these machines don't support 8-byte integers.
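
The 8-byte integers matter because MPI_Offset (see the Version 1.0.1
notes above) has to hold byte offsets larger than a 4-byte integer can
represent. The sketch below illustrates this in plain C without calling
MPI. The type file_offset is a hypothetical stand-in for MPI_Offset on
machines where it is "long long", and the function names are made up for
illustration.

```c
#include <stdio.h>

/* Hypothetical stand-in for MPI_Offset where it is "long long". */
typedef long long file_offset;

/* Byte offset of the block owned by `rank` when every rank stores
   `count` elements of `elem_size` bytes, one block after another. */
file_offset block_offset(int rank, int count, int elem_size)
{
    /* Widen each operand first so the product is computed in 8 bytes. */
    return (file_offset)rank * (file_offset)count * (file_offset)elem_size;
}

/* Writes the decimal form of an offset into buf using %lld, the
   conversion suggested above for a "long long" MPI_Offset. */
int format_offset(char *buf, size_t len, file_offset off)
{
    return snprintf(buf, len, "%lld", off);
}
```

For rank 3 with 1073741824 elements of 4 bytes per rank, block_offset
returns 12884901888, which does not fit in a 4-byte integer.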

With MPICH, HP MPI, or NEC MPI, you can compile MPI-IO programs as
   mpicc foo.c
or
   mpif77 foo.f
or
   mpifort foo.f

As mentioned above, mpifort is preferred over mpif77 on HPUX and NEC
because the f77 compilers on those machines do not support 8-byte integers.

With SGI MPI, you can compile MPI-IO programs as
   cc foo.c -lmpi
or
   f77 foo.f -lmpi
or
   f90 foo.f -lmpi

With LAM, you can compile MPI-IO programs as
   hcc foo.c -lmpi
or
   hf77 foo.f -lmpi

If you have built ROMIO with some other MPI implementation, you can
compile MPI-IO programs by explicitly giving the path to the include
file mpio.h or mpiof.h and explicitly specifying the path to the
library libmpio.a, which is located in $(ROMIO_HOME)/lib/$(ARCH)/libmpio.a .

Run the program as you would run any MPI program on the machine. If
you use mpirun, make sure you use the correct mpirun for the MPI
implementation you are using. For example, if you are using MPICH on
an SGI machine, make sure that you use MPICH's mpirun and not SGI's
mpirun.

The Makefile in the romio/test directory illustrates how to compile and
link MPI-IO programs.



Limitations of this version of ROMIO
------------------------------------

* When used with any MPI implementation other than MPICH 1.2.1 (or later),
  the "status" argument is not filled in any MPI-IO function. Consequently,
  MPI_Get_count and MPI_Get_elements will not work when passed the status
  object from an MPI-IO operation.

* All nonblocking I/O functions use a ROMIO-defined "MPIO_Request"
  object instead of the usual "MPI_Request" object. Accordingly, two
  functions, MPIO_Test and MPIO_Wait, are provided to test and wait on
  these MPIO_Request objects. They have the same semantics as MPI_Test
  and MPI_Wait.

  int MPIO_Test(MPIO_Request *request, int *flag, MPI_Status *status);
  int MPIO_Wait(MPIO_Request *request, MPI_Status *status);

  The usual functions MPI_Test, MPI_Wait, MPI_Testany, etc., will not
  work for nonblocking I/O.

* This version works only on a homogeneous cluster of machines,
  and only the "native" file data representation is supported.

* When used with any MPI implementation other than MPICH 1.2.1 (or later),
  all MPI-IO functions return only two possible error codes---MPI_SUCCESS
  on success and MPI_ERR_UNKNOWN on failure.

* Shared file pointers are not supported on PVFS and IBM PIOFS file
  systems because they don't support fcntl file locks, and ROMIO uses
  that feature to implement shared file pointers.

* On HP machines running HPUX and on NEC SX-4, you need to compile
  Fortran programs with mpifort instead of mpif77, because the f77
  compilers on these machines don't support 8-byte integers.

* The file-open mode MPI_MODE_EXCL does not work on Intel PFS file system,
  due to a bug in PFS.



Usage Tips
----------

* When using ROMIO with SGI MPI, you may sometimes get an error
  message from SGI MPI: ``MPI has run out of internal datatype
  entries. Please set the environment variable MPI_TYPE_MAX for
  additional space.'' If you get this error message, add this line to
  your .cshrc file:
      setenv MPI_TYPE_MAX 65536
  Use a larger number if you still get the error message.

* If a Fortran program uses a file handle created using ROMIO's C
  interface, or vice-versa, you must use the functions MPI_File_c2f
  or MPI_File_f2c. Such a situation occurs,
  for example, if a Fortran program uses an I/O library written in C
  with MPI-IO calls. Similar functions MPIO_Request_f2c and
  MPIO_Request_c2f are also provided.

* For Fortran programs on the Intel Paragon, you may need
  to provide the complete path to mpif.h in the include statement, e.g.,
        include '/usr/local/mpich/include/mpif.h'
  instead of
        include 'mpif.h'
  This is because the -I option to the Paragon Fortran compiler if77
  doesn't work correctly. It always looks in the default directories first
  and, therefore, picks up Intel's mpif.h, which is actually the
  mpif.h of an older version of MPICH.
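
Both the NFS instructions and the shared-file-pointer limitation above
depend on fcntl file locks working on the target file system. One way to
check is a small probe such as the sketch below. It uses only the POSIX
open, fcntl, and close calls and is not part of ROMIO; the function name
probe_fcntl_lock is made up for illustration. A successful lock from one
client does not by itself show that locking is coherent across several
NFS clients.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 if an exclusive fcntl lock on the whole
   file at `path` can be acquired and released, -1 otherwise. The file is
   created if it does not already exist. */
int probe_fcntl_lock(const char *path)
{
    struct flock lock = {0};
    int fd, rc;

    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    lock.l_type = F_WRLCK;     /* exclusive (write) lock */
    lock.l_whence = SEEK_SET;
    lock.l_start = 0;
    lock.l_len = 0;            /* 0 means through the end of the file */

    rc = fcntl(fd, F_SETLK, &lock);      /* non-blocking attempt */
    if (rc == 0) {
        lock.l_type = F_UNLCK;
        rc = fcntl(fd, F_SETLK, &lock);  /* release the lock again */
    } else {
        perror("fcntl(F_SETLK)");
    }

    close(fd);
    return rc == 0 ? 0 : -1;
}
```

Call the function with a path inside the directory you intend to use for
MPI-IO files, and remove the probe file afterwards.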



ROMIO Users Mailing List
------------------------

Please register your copy of ROMIO with us by sending email
to majordomo@mcs.anl.gov with the message

    subscribe romio-users

This will enable us to notify you of new releases of ROMIO as well as
bug fixes.



Reporting Bugs
--------------

If you have trouble, first check the users guide (in
romio/doc/users-guide.ps.gz). Then check the on-line list of known
bugs and patches at http://www.mcs.anl.gov/romio .
Finally, if you still have problems, send a detailed message containing:

   The type of system (often, uname -a)
   The output of configure
   The output of make
   Any programs or tests

to romio-maint@mcs.anl.gov .



ROMIO Internals
---------------

A key component of ROMIO that enables such a portable MPI-IO
implementation is an internal abstract I/O device layer called
ADIO. Most users of ROMIO will not need to deal with the ADIO layer at
all. However, ADIO is useful to those who want to port ROMIO to some
other file system. The ROMIO source code and the ADIO paper
(see doc/README) will help you get started.

MPI-IO implementation issues are discussed in our IOPADS '99 paper,
"On Implementing MPI-IO Portably and with High Performance."
All ROMIO-related papers are available online from
http://www.mcs.anl.gov/romio.



Learning MPI-IO
---------------

The book "Using MPI-2: Advanced Features of the Message-Passing
Interface," published by MIT Press, provides a tutorial introduction to
all aspects of MPI-2, including parallel I/O. It has lots of example
programs. See http://www.mcs.anl.gov/mpi/usingmpi2 for further
information about the book.