README
          ROMIO: A High-Performance, Portable MPI-IO Implementation

                      Version 2008-03-09

Major Changes in this version:
------------------------------
* Fixed performance problems with the darray and subarray datatypes
  when using MPICH.

* Better support for building against existing MPICH and MPICH versions.

  When building against an existing MPICH installation, use the
  "--with-mpi=mpich" option to ROMIO configure.  For MPICH, use the
  "--with-mpi=mpich" option.  These will allow ROMIO to take advantage
  of internal features of these implementations.

* Deprecation of SFS, HFS, and PIOFS implementations.

  These are no longer actively supported, although the code will continue
  to be distributed for now.

* Initial support for the Panasas PanFS filesystem.

  PanFS allows users to specify the layout of a file at file-creation time.
  Layout information includes the number of StorageBlades (SB)
  across which the data is stored, the number of SBs across which a 
  parity stripe is written, and the number of consecutive stripes that 
  are placed on the same set of SBs.   The panfs_layout_* hints are only 
  used if supplied at file-creation time.
 
  panfs_layout_type - Specifies the layout of a file:
                      2 = RAID0
                      3 = RAID5 Parity Stripes
  panfs_layout_stripe_unit - The size of the stripe unit in bytes
  panfs_layout_total_num_comps - The total number of StorageBlades a file
                                 is striped across.
  panfs_layout_parity_stripe_width - If the layout type is RAID5 Parity
                                     Stripes, this hint specifies the 
                                     number of StorageBlades in a parity
                                     stripe.
  panfs_layout_parity_stripe_depth - If the layout type is RAID5 Parity
                                     Stripes, this hint specifies the
                                     number of contiguous parity stripes written 
                                     across the same set of SBs.
  panfs_layout_visit_policy - If the layout type is RAID5 Parity Stripes, 
                              the policy used to determine the parity 
                              stripe a given file offset is written to:
                              1 = Round Robin

  PanFS supports the "concurrent write" (CW) mode, where groups of cooperating 
  clients can disable the PanFS consistency mechanisms and use their own 
  consistency protocol.  Clients participating in concurrent write mode use 
  application specific information to improve performance while maintaining 
  file consistency.  All clients accessing the file(s) must enable concurrent 
  write mode.  If any client does not enable concurrent write mode, then the 
  PanFS consistency protocol will be invoked.  Once a file is opened in CW mode 
  on a machine, attempts to open a file in non-CW mode will fail with 
  EACCES.  If a file is already opened in non-CW mode, attempts to open 
  the file in CW mode will fail with EACCES.  The following hint is 
  used to enable concurrent write mode.

  panfs_concurrent_write - If set to 1 at file open time, the file 
                           is opened using the PanFS concurrent write 
                           mode flag.  Concurrent write mode is not a
                           persistent attribute of the file.

  Below is an example PanFS layout using the following parameters:
 
  - panfs_layout_type                = 3
  - panfs_layout_total_num_comps     = 100
  - panfs_layout_parity_stripe_width = 10
  - panfs_layout_parity_stripe_depth = 8
  - panfs_layout_visit_policy        = 1

   Parity Stripe Group 1     Parity Stripe Group 2  . . . Parity Stripe Group 10
  ----------------------    ----------------------        --------------------
   SB1    SB2  ... SB10     SB11    SB12 ...  SB20  ...   SB91   SB92 ... SB100
  -----------------------   -----------------------       ---------------------
   D1     D2   ...  D10      D91    D92  ...  D100        D181   D182  ... D190
   D11    D12       D20      D101   D102      D110        D191   D192      D193
   D21    D22       D30      . . .                        . . .
   D31    D32       D40
   D41    D42       D50
   D51    D52       D60
   D61    D62       D70
   D71    D72       D80
   D81    D82       D90      D171   D172      D180        D261   D262   D270
   D271   D272      D273     . . .                        . . .
   ...

* Initial support for the Globus GridFTP filesystem.  Work contributed by Troy
  Baer (troy@osc.edu).  

Major Changes in Version 1.2.5:
------------------------------

* Initial support for MPICH-2

* fix for a bug in which ROMIO would get confused for some permutations
  of the aggregator list

* direct io on IRIX's XFS should work now

* fixed an issue with the Fortran bindings that would cause them to fail
   when some compilers tried to build them.

* Initial support for deferred opens

Major Changes in Version 1.2.4:
------------------------------
* Added section describing ROMIO MPI_FILE_SYNC and MPI_FILE_CLOSE behavior to 
  User's Guide

* Bug removed from PVFS ADIO implementation regarding resize operations

* Added support for PVFS listio operations, including hints to control use


Major Changes in Version 1.2.3:
-------------------------------
* Enhanced aggregation control via cb_config_list, romio_cb_read,
  and romio_cb_write hints

* Asynchronous IO can be enabled under Linux with the --enable-aio argument
  to configure

* Additional PVFS support

* Additional control over data sieving with romio_ds_read hint

* NTFS ADIO implementation integrated into source tree

* testfs ADIO implementation added for debugging purposes


Major Changes in Version 1.0.3:
-------------------------------

* When used with MPICH 1.2.1, the MPI-IO functions return proper error codes 
  and classes, and the status object is filled in.

* On SGI's XFS file system, ROMIO can use direct I/O even if the
  user's request does not meet the various restrictions needed to use
  direct I/O. ROMIO does this by doing part of the request with
  buffered I/O (until all the restrictions are met) and doing the rest
  with direct I/O. (This feature hasn't been tested rigorously. Please
  check for errors.) 

  By default, ROMIO will use only buffered I/O. Direct I/O can be
  enabled either by setting the environment variables MPIO_DIRECT_READ
  and/or MPIO_DIRECT_WRITE to TRUE, or on a per-file basis by using
  the info keys "direct_read" and "direct_write". 

  Direct I/O will result in higher performance only if you are
  accessing a high-bandwidth disk system. Otherwise, buffered I/O is
  better and is therefore used as the default.

* Miscellaneous bug fixes.


Major Changes Version 1.0.2:
---------------------------

* Implemented the shared file pointer functions and
  split collective I/O functions. Therefore, the main
   components of the MPI I/O chapter not yet implemented are
  file interoperability and error handling.

* Added support for using "direct I/O" on SGI's XFS file system. 
  Direct I/O is an optional feature of XFS in which data is moved
  directly between the user's buffer and the storage devices, bypassing 
  the file-system cache. This can improve performance significantly on 
  systems with high disk bandwidth. Without high disk bandwidth,
  regular I/O (that uses the file-system cache) perfoms better.
  ROMIO, therefore, does not use direct I/O by default. The user can
  turn on direct I/O (separately for reading and writing) either by
  using environment variables or by using MPI's hints mechanism (info). 
  To use the environment-variables method, do
       setenv MPIO_DIRECT_READ TRUE
       setenv MPIO_DIRECT_WRITE TRUE
  To use the hints method, the two keys are "direct_read" and "direct_write". 
  By default their values are "false". To turn on direct I/O, set the values 
  to "true". The environment variables have priority over the info keys.
  In other words, if the environment variables are set to TRUE, direct I/O
  will be used even if the info keys say "false", and vice versa. 
  Note that direct I/O must be turned on separately for reading 
  and writing.
  The environment-variables method assumes that the environment
  variables can be read by each process in the MPI job. This is
  not guaranteed by the MPI Standard, but it works with SGI's MPI
  and the ch_shmem device of MPICH.

* Added support (new ADIO device, ad_pvfs) for the PVFS parallel 
  file system for Linux clusters, developed at Clemson University
  (see http://www.parl.clemson.edu/pvfs ). To use it, you must first install
  PVFS and then when configuring ROMIO, specify "-file_system=pvfs" in 
  addition to any other options to "configure". (As usual, you can configure
  for multiple file systems by using "+"; for example, 
  "-file_system=pvfs+ufs+nfs".)  You will need to specify the path 
  to the PVFS include files via the "-cflags" option to configure, 
  for example, "configure -cflags=-I/usr/pvfs/include". You
  will also need to specify the full path name of the PVFS library.
  The best way to do this is via the "-lib" option to MPICH's 
  configure script (assuming you are using ROMIO from within MPICH). 

* Uses weak symbols (where available) for building the profiling version,
  i.e., the PMPI routines. As a result, the size of the library is reduced
  considerably. 

* The Makefiles use "virtual paths" if supported by the make utility. GNU make
  supports it, for example. This feature allows you to untar the
  distribution in some directory, say a slow NFS directory,
  and compile the library (the .o files) in another 
  directory, say on a faster local disk. For example, if the tar file
  has been untarred in an NFS directory called /home/thakur/romio,
  one can compile it in a different directory, say /tmp/thakur, as follows:
        cd /tmp/thakur
        /home/thakur/romio/configure
        make
  The .o files will be created in /tmp/thakur; the library will be created in
  /home/thakur/romio/lib/$ARCH/libmpio.a .
  This method works only if the make utility supports virtual paths.
  If the default make does not, you can install GNU make which does,
  and specify it to configure as
       /home/thakur/romio/configure -make=/usr/gnu/bin/gmake (or whatever)

* Lots of miscellaneous bug fixes and other enhancements.

* This version is included in MPICH 1.2.0. If you are using MPICH, you
  need not download ROMIO separately; it gets built as part of MPICH.
  The previous version of ROMIO is included in LAM, HP MPI, SGI MPI, and 
  NEC MPI. NEC has also implemented the MPI-IO functions missing 
  in ROMIO, and therefore NEC MPI has a complete implementation
  of MPI-IO.


Major Changes in Version 1.0.1:
------------------------------

* This version is included in MPICH 1.1.1 and HP MPI 1.4.

* Added support for NEC SX-4 and created a new device ad_sfs for
  NEC SFS file system.

* New devices ad_hfs for HP/Convex HFS file system and ad_xfs for 
  SGI XFS file system.

* Users no longer need to prefix the filename with the type of 
  file system; ROMIO determines the file-system type on its own.

* Added support for 64-bit file sizes on IBM PIOFS, SGI XFS,
  HP/Convex HFS, and NEC SFS file systems.

* MPI_Offset is an 8-byte integer on machines that support 8-byte integers.
  It is of type "long long" in C and "integer*8" in Fortran.
  With a Fortran 90 compiler, you can use either integer*8 or
  integer(kind=MPI_OFFSET_KIND). 
  If you printf an MPI_Offset in C, remember to use %lld 
  or %ld as required by your compiler. (See what is used in the test 
  program romio/test/misc.c.)

* On some machines, ROMIO detects at configure time that "long long" is 
  either not supported by the C compiler or it doesn't work properly.
  In such cases, configure sets MPI_Offset to long in C and integer in
  Fortran. This happens on Intel Paragon, Sun4, and FreeBSD.

* Added support for passing hints to the implementation via the MPI_Info 
  parameter. ROMIO understands the following hints (keys in MPI_Info object):

  /* on all file systems */ 
     cb_buffer_size - buffer size for collective I/O
     cb_nodes - no. of processes that actually perform I/O in collective I/O
     ind_rd_buffer_size - buffer size for data sieving in independent reads

  /* on all file systems except IBM PIOFS */
     ind_wr_buffer_size - buffer size for data sieving in independent writes
      /* ind_wr_buffer_size is ignored on PIOFS because data sieving 
         cannot be done for writes since PIOFS doesn't support file locking */

  /* on Intel PFS and IBM PIOFS only. These hints are understood only if
     supplied at file-creation time. */
     striping_factor - no. of I/O devices to stripe the file across
     striping_unit - the striping unit in bytes
     start_iodevice - the number of the I/O device from which to start
                      striping (between 0 and (striping_factor-1))

  /* on Intel PFS only. */
     pfs_svr_buf - turn on or off PFS server buffering by setting the value 
               to "true" or "false", case-sensitive.
      
  If ROMIO doesn't understand a hint, or if the value is invalid, the hint
  will be ignored. The values of hints being used by ROMIO at any time 
  can be obtained via MPI_File_get_info.



General Information 
-------------------

ROMIO is a high-performance, portable implementation of MPI-IO (the
I/O chapter in MPI). ROMIO's home page is at
http://www.mcs.anl.gov/romio .  The MPI standard is available at
http://www.mpi-forum.org/docs/docs.html .

This version of ROMIO includes everything defined in the MPI I/O
chapter except support for file interoperability and
user-defined error handlers for files.  The subarray and
distributed array datatype constructor functions from Chapter 4
(Sec. 4.14.4 & 4.14.5) have been implemented. They are useful for
accessing arrays stored in files. The functions MPI_File_f2c and
MPI_File_c2f (Sec. 4.12.4) are also implemented.

C, Fortran, and profiling interfaces are provided for all functions
that have been implemented. 

Please read the limitations of this version of ROMIO that are listed
below (e.g., MPIO_Request object, restriction to homogeneous
environments).

This version of ROMIO runs on at least the following machines: IBM SP;
Intel Paragon; HP Exemplar; SGI Origin2000; Cray T3E; NEC SX-4; other
symmetric multiprocessors from HP, SGI, DEC, Sun, and IBM; and networks of
workstations (Sun, SGI, HP, IBM, DEC, Linux, and FreeBSD).  Supported
file systems are IBM PIOFS, Intel PFS, HP/Convex HFS, SGI XFS, NEC
SFS, PVFS, NFS, and any Unix file system (UFS).

This version of ROMIO is included in MPICH 1.2.3; an earlier version
is included in at least the following MPI implementations: LAM, HP
MPI, SGI MPI, and NEC MPI.  

Note that proper I/O error codes and classes are returned and the
status variable is filled only when used with MPICH 1.2.1 or later.

You can open files on multiple file systems in the same program. The
only restriction is that the directory where the file is to be opened
must be accessible from the process opening the file. For example, a
process running on one workstation may not be able to access a
directory on the local disk of another workstation, and therefore
ROMIO will not be able to open a file in such a directory. NFS-mounted
files can be accessed.

An MPI-IO file created by ROMIO is no different than any other file
created by the underlying file system. Therefore, you may use any of
the commands provided by the file system to access the file, e.g., ls,
mv, cp, rm, ftp.


Using ROMIO on NFS
------------------

To use ROMIO on NFS, file locking with fcntl must work correctly on
the NFS installation. On some installations, fcntl locks don't work.
To get them to work, you need to use Version 3 of NFS, ensure that the
lockd daemon is running on all the machines, and have the system
administrator mount the NFS file system with the "noac" option (no
attribute caching). Turning off attribute caching may reduce
performance, but it is necessary for correct behavior.

The following are some instructions we received from Ian Wells of HP
for setting the noac option on NFS. We have not tried them
ourselves. We are including them here because you may find 
them useful. Note that some of the steps may be specific to HP
systems, and you may need root permission to execute some of the
commands. 
   
   >1. first confirm you are running nfs version 3
   >
   >rpcnfo -p `hostname` | grep nfs
   >
   >ie 
   >    goedel >rpcinfo -p goedel | grep nfs
   >    100003    2   udp   2049  nfs
   >    100003    3   udp   2049  nfs
   >
   >
   >2. then edit /etc/fstab for each nfs directory read/written by MPIO
   >   on each  machine used for multihost MPIO.
   >
   >    Here is an example of a correct fstab entry for /epm1:
   >
   >   ie grep epm1 /etc/fstab
   > 
   >      ROOOOT 11>grep epm1 /etc/fstab
   >      gershwin:/epm1 /rmt/gershwin/epm1 nfs bg,intr,noac 0 0
   >
   >   if the noac option is not present, add it 
   >   and then remount this directory
   >   on each of the machines that will be used to share MPIO files
   >
   >ie
   >
   >ROOOOT >umount /rmt/gershwin/epm1
   >ROOOOT >mount  /rmt/gershwin/epm1
   >
   >3. Confirm that the directory is mounted noac:
   >
   >ROOOOT >grep gershwin /etc/mnttab 
   >gershwin:/epm1 /rmt/gershwin/epm1 nfs
   >noac,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0 0 0 899911504




ROMIO Installation Instructions
-------------------------------

Since ROMIO is included in MPICH, LAM, HP MPI, SGI MPI, and NEC MPI,
you don't need to install it separately if you are using any of these
MPI implementations.  If you are using some other MPI, you can
configure and build ROMIO as follows:

Untar the tar file as

    gunzip -c romio.tar.gz | tar xvf -

OR

    zcat romio.tar.Z | tar xvf -

THEN

    cd romio
    ./configure
    make

Some example programs and a Makefile are provided in the romio/test directory.
Run the examples the way you would run any MPI program. Each program takes 
the filename as a command-line argument "-fname filename".  

The configure script by default configures ROMIO for the file systems
most likely to be used on the given machine. If you wish, you can
explicitly specify the file systems by using the "-file_system" option
to configure. Multiple file systems can be specified by using "+" as a
separator. For example,

    ./configure -file_system=xfs+nfs

For the entire list of options to configure do

    ./configure -h | more

After building a specific version as above, you can install it in a
particular directory with 

    make install PREFIX=/usr/local/romio    (or whatever directory you like)

or just

    make install           (if you used -prefix at configure time)

If you intend to leave ROMIO where you built it, you should NOT install it 
(install is used only to move the necessary parts of a built ROMIO to 
another location). The installed copy will have the include files,
libraries, man pages, and a few other odds and ends, but not the whole
source tree.  It will have a test directory for testing the
installation and a location-independent Makefile built during
installation, which users can copy and modify to compile and link
against the installed copy. 

To rebuild ROMIO with a different set of configure options, do

    make distclean

to clean everything including the Makefiles created by configure.
Then run configure again with the new options, followed by make.



Testing ROMIO
-------------

To test if the installation works, do

     make testing 

in the romio/test directory. This calls a script that runs the test
programs and compares the results with what they should be. By
default, "make testing" causes the test programs to create files in
the current directory and use whatever file system that corresponds
to. To test with other file systems, you need to specify a filename in
a directory corresponding to that file system as follows:

     make testing TESTARGS="-fname=/foo/piofs/test"



Compiling and Running MPI-IO Programs
-------------------------------------

If ROMIO is not already included in the MPI implementation, you need
to include the file mpio.h for C or mpiof.h for Fortran in your MPI-IO
program.  

Note that on HP machines running HPUX and on NEC SX-4, you need to
compile Fortran programs with mpifort, because the f77 compilers on
these machines don't support 8-byte integers. 

With MPICH, HP MPI, or NEC MPI, you can compile MPI-IO programs as 
   mpicc foo.c
or 
   mpif77 foo.f
or
   mpifort foo.f

As mentioned above, mpifort is preferred over mpif77 on HPUX and NEC
because the f77 compilers on those machines do not support 8-byte integers.

With SGI MPI, you can compile MPI-IO programs as 
   cc foo.c -lmpi
or
   f77 foo.f -lmpi
or
   f90 foo.f -lmpi

With LAM, you can compile MPI-IO programs as 
   hcc foo.c -lmpi
or
   hf77 foo.f -lmpi

If you have built ROMIO with some other MPI implementation, you can
compile MPI-IO programs by explicitly giving the path to the include
file mpio.h or mpiof.h and explicitly specifying the path to the
library libmpio.a, which is located in $(ROMIO_HOME)/lib/$(ARCH)/libmpio.a .


Run the program as you would run any MPI program on the machine. If
you use mpirun, make sure you use the correct mpirun for the MPI
implementation you are using. For example, if you are using MPICH on
an SGI machine, make sure that you use MPICH's mpirun and not SGI's
mpirun.

The Makefile in the romio/test directory illustrates how to compile
and link MPI-IO programs. 



Limitations of this version of ROMIO
------------------------------------

* When used with any MPI implementation other than MPICH 1.2.1 (or later),
the "status" argument is not filled in any MPI-IO function. Consequently,
MPI_Get_count and MPI_Get_elements will not work when passed the status
object from an MPI-IO operation.

* All nonblocking I/O functions use a ROMIO-defined "MPIO_Request"
object instead of the usual "MPI_Request" object. Accordingly, two
functions, MPIO_Test and MPIO_Wait, are provided to wait and test on
these MPIO_Request objects. They have the same semantics as MPI_Test
and MPI_Wait.

int MPIO_Test(MPIO_Request *request, int *flag, MPI_Status *status);
int MPIO_Wait(MPIO_Request *request, MPI_Status *status);

The usual functions MPI_Test, MPI_Wait, MPI_Testany, etc., will not
work for nonblocking I/O. 

* This version works only on a homogeneous cluster of machines,
and only the "native" file data representation is supported.

* When used with any MPI implementation other than MPICH 1.2.1 (or later),
all MPI-IO functions return only two possible error codes---MPI_SUCCESS
on success and MPI_ERR_UNKNOWN on failure.

* Shared file pointers are not supported on PVFS and IBM PIOFS file
systems because they don't support fcntl file locks, and ROMIO uses
that feature to implement shared file pointers.

* On HP machines running HPUX and on NEC SX-4, you need to compile
Fortran programs with mpifort instead of mpif77, because the f77
compilers on these machines don't support 8-byte integers.

* The file-open mode MPI_MODE_EXCL does not work on Intel PFS file system,
due to a bug in PFS.



Usage Tips
----------

* When using ROMIO with SGI MPI, you may sometimes get an error
message from SGI MPI: ``MPI has run out of internal datatype
entries. Please set the environment variable MPI_TYPE_MAX for
additional space.'' If you get this error message, add this line to
your .cshrc file:
    setenv MPI_TYPE_MAX 65536 
Use a larger number if you still get the error message.

* If a Fortran program uses a file handle created using ROMIO's C
interface, or vice-versa, you must use the functions MPI_File_c2f 
or MPI_File_f2c. Such a situation occurs,
for example, if a Fortran program uses an I/O library written in C
with MPI-IO calls. Similar functions MPIO_Request_f2c and
MPIO_Request_c2f are also provided.

* For Fortran programs on the Intel Paragon, you may need
to provide the complete path to mpif.h in the include statement, e.g.,
        include '/usr/local/mpich/include/mpif.h'
instead of 
        include 'mpif.h'
This is because the -I option to the Paragon Fortran compiler if77
doesn't work correctly. It always looks in the default directories first 
and, therefore, picks up Intel's mpif.h, which is actually the
mpif.h of an older version of MPICH. 



ROMIO Users Mailing List
------------------------

Please register your copy of ROMIO with us by sending email
to majordomo@mcs.anl.gov with the message

subscribe romio-users

This will enable us to notify you of new releases of ROMIO as well as
bug fixes.



Reporting Bugs
--------------

If you have trouble, first check the users guide (in
romio/doc/users-guide.ps.gz). Then check the on-line list of known
bugs and patches at http://www.mcs.anl.gov/romio .
Finally, if you still have problems, send a detailed message containing:

   The type of system (often, uname -a)
   The output of configure
   The output of make
   Any programs or tests

to romio-maint@mcs.anl.gov .



ROMIO Internals
---------------

A key component of ROMIO that enables such a portable MPI-IO
implementation is an internal abstract I/O device layer called
ADIO. Most users of ROMIO will not need to deal with the ADIO layer at
all. However, ADIO is useful to those who want to port ROMIO to some
other file system. The ROMIO source code and the ADIO paper
(see doc/README) will help you get started.

MPI-IO implementation issues are discussed in our IOPADS '99 paper,
"On Implementing MPI-IO Portably and with High Performance."
All ROMIO-related papers are available online from
http://www.mcs.anl.gov/romio.


Learning MPI-IO
---------------

The book "Using MPI-2: Advanced Features of the Message-Passing
Interface," published by MIT Press, provides a tutorial introduction to
all aspects of MPI-2, including parallel I/O. It has lots of example
programs. See http://www.mcs.anl.gov/mpi/usingmpi2 for further
information about the book.