\documentclass[11pt]{article}
\usepackage{times} % Necessary because Acrobat can't handle fonts properly
\let\file=\texttt
\let\code=\texttt
\let\program=\code
\usepackage{epsf}
\setlength\textwidth{6.5in}
\setlength\oddsidemargin{0in}
\setlength\evensidemargin{0in}
\setlength\marginparwidth{0.7in}
\def\bw{\texttt{\char`\\}}

% Add a discussion macro for parts of the paper that need discussion
\newenvironment{discussion}{\begin{quotation}\centerline{\textbf{Discussion}}}{\ifvmode\else\par\fi\centerline{\textbf{End
      of Discussion}}\end{quotation}}

\newcommand\onehalf{\ifmmode {\scriptstyle {\scriptstyle 1\over
  \scriptstyle 2}} \else $\onehalf$ \fi}
\renewcommand{\floatpagefraction}{0.9}

\begin{document}
\title{{\bf Process Management in MPICH}\\[.2in] DRAFT 2.1}
\author{The MPICH Team\\Argonne National Laboratory}
\maketitle

\begin{abstract}
  In this note we describe a process management interface that can be
  used by MPI implementations and other parallel processing libraries
  yet remain independent of both.  We define the specific interface we are
  developing, called PMI (Process Manager Interface), in the context of
  MPICH.  We describe the interface itself and a number of
  implementations.  We show how an MPI implementation built on MPICH
  can use PMI to make itself independent of the environment in which it
  executes, and how a process management environment can support MPI
  implementations based on MPICH.
\end{abstract}

\section{Introduction}
\label{sec:introduction}

This informal paper is intended to be useful for those using MPICH as the
basis of their own MPI implementations, or providing a process
management environment that will run MPI jobs linked against
MPICH itself or an MPICH-derived MPI implementation.  At the time of
writing, this audience potentially includes groups at IBM (for BG/L), Livermore
(SLURM process management), Cray (MPI implementation for Red Storm, with
YOD), Myricom (MPICH-GM), Viridian (PBSPro), and Globus/TeraGrid
implementors (MPICH-G2).  Others are welcome to contribute suggestions.

This paper is incomplete in the sense that it describes an interface
still being defined.  The part that is currently in use in the MPI-1
part of MPICH has proved adequate for our needs and has several
implementations.  We will refer to it as ``Part 1.''  The part of the
interface that is required for the dynamic process management part of
MPI-2 is still under development, and we will refer to it here as ``Part
2.''  Most of this paper is about Part 1, which is all that is necessary
to support an MPI implementation that does not include dynamic process
management.

We describe the problem we are trying to solve in
Section~\ref{sec:problem}, the approach we have taken so far in MPICH
in Section~\ref{sec:approach}, the PMI interface itself, as defined so
far, in Section~\ref{sec:interface}, and
implementations in Section~\ref{sec:implementing}.  In
Section~\ref{sec:implications} we outline the implications for those who
are collaborating with us in various ways related to the MPICH project.

\section{The Problem}
\label{sec:problem}

The problem this paper addresses is how to provide the processes of a
parallel job with the information they need in order to interact with
the process management environment where necessary, in particular to set
up communication with one another.  In many cases such information is
partly provided by the systems software that actually starts processes,
and also partly provided by the processes themselves, in which case the
process management system must aid in the dissemination of this
information to other processes.  A classic example occurs when a process
in a TCP implementation of MPI acquires a port on which it can be
contacted, and then must notify other processes of this port so that
they can establish MPI communication with this process by connecting to
the port.

Traditionally, parallel programming libraries, such as MPI
implementations, have been integrated with process management mechanisms
(LAM, MPICH-1 with the {\tt ch\_p4} device, POE) in order to solve this problem.
After a preliminary exploration of separating the library from the
process manager in MPICH-1 with the {\tt ch\_p4mpd} device, we have decided to
update the interface and then commit to this approach in MPICH.  We are
motivated by the challenges of implementing MPI-2 in a
system-independent way, but many of the ideas here might prove useful in
a non-MPI environment as well, either for other parallel libraries (such
as Global Arrays, GP-SHMEM, or GASNet) or language-based systems (such
as UPC, Co-Array Fortran, or Titanium).

The problem to be addressed has several components:
\begin{itemize}
\item Conveying to processes in a parallel job the information they need
  to establish communication with one another.  To focus the discussion,
  we assume that such communication is needed for implementing MPI.
  Thus this information could include hosts, ports, interfaces,
  shared-memory keys, and other information.
\item MPI-2 dynamic process management functions require extra support
  for implementing {\tt MPI\_Comm\_Spawn}, {\tt MPI\_Comm\_\{Connect/Accept\}},
  {\tt MPI\_\{Publish/Lookup\}\_name}, etc.
\item The interface should be simple and straightforward, particularly
  in the absence of dynamic process management.  MPICH will implement
  dynamic process management, but some other MPI implementations may
  not.
\item The interface must allow a scalable implementation for
  performance-critical operations.  Even in environments with only hundreds
  of processes, serial algorithms are inappropriate.
\end{itemize}

An earlier, similar version of this approach was described
in~\cite{butler-lusk-gropp:mpd-parcomp}, where the interface, called
BNR, was incorporated in the {\tt ch\_p4mpd} device of the original MPICH.  PMI
represents an evolution of that interface and its implementation in MPICH.

Note that existing process managers often do a scalable job of starting
processes, and this part of existing systems can be kept.  What is
sometimes lacking is a way of conveying the communication-establishment
information.  Although the approach we take in the implementations below
is to combine the process startup and information exchange functionality
in a single interface, a different implementation could separate these,
using an existing process-startup mechanism and adding a new component
to implement the other parts of the interface.  One approach along these
lines is outlined in Section~\ref{sec:adding}.

\section{Our Approach}
\label{sec:approach}

The approach we have taken to the problem is to define a Process
Management Interface (PMI).  The MPICH implementation of MPI is
written in terms of PMI rather than in terms of any particular
process management environment.  Multiple implementations of PMI are
then possible, independently of the MPICH implementation of MPI.
The key to a good design for PMI is to specify it in a way that allows for
scalable implementation without dictating any details of the
implementations.  This has worked out well so far, and in
Section~\ref{sec:implementing} we describe a number of quite different
implementations of the interfaces described in Section~\ref{sec:interface}.

\section{The PMI Interface}
\label{sec:interface}

We present the interface in two parts.  The first part is sufficient for
the implementation of MPI-1 and many parts of MPI-2.  The second part is
required for implementing the dynamic process management part of MPI-2.
MPICH is now using the first part, with multiple PMI implementations,
so we consider it relatively final at this point.  Since we have not yet
implemented the dynamic process management functions in MPICH, some
evolution of Part 2 of PMI may take place as we do so.

The fundamental idea of Part 1 is the {\em key-value space}, or KVS,
containing a set of (key, value) pairs of strings.  Processes acquire
access to one or more KVS's through PMI and can perform {\tt put/get}
operations on them.  Synchronization is defined in a scalable way via
the barrier operation, so that processes can be assured that the necessary
puts have been done before attempting the corresponding gets.

Thus the PMI interface (Part 1) consists of {\tt put/get/barrier} operations
together with housekeeping operations for managing the KVS's.  For
implementation of MPI-1, a single KVS, the default KVS for processes
started at the same time, is sufficient, but multiple KVS's will be
useful when we consider Part 2 and dynamic process management.

\subsection{Part 1:  Basic PMI Routines}
\label{sec:part1}

Part 1 of the interface is invoked in performance-critical parts of the
MPI implementation, during both initialization and connection setup.
Thus it is critical that this part of the interface allow scalable
implementation.  We accomplish this through the semantics of {\tt
  put/get/barrier}, since the only synchronizing operation is the
collective {\tt barrier}, which can have a scalable implementation.
The {\tt commit} operation allows batching of {\tt put} operations for
improved performance.

Part 1 has two subparts:  first, the functions associated with the process group
being started, and thus already implemented in some way in any MPI
implementation, and second, the functions associated with managing the
keyval spaces, used for communicating setup information.

\begin{small}
\begin{verbatim}
/* PMI Group functions */
int PMI_Init( int *spawned );  /* initialize PMI for this process group.
                                  The value of spawned indicates whether
                                  this process was created by
                                  PMI_Spawn_multiple. */
int PMI_Initialized( void );   /* Return true if PMI has been initialized */
int PMI_Get_size( int *size ); /* get size of process group */
int PMI_Get_rank( int *rank ); /* get rank in process group */
int PMI_Barrier( void );       /* barrier across processes in process group */
int PMI_Finalize( void );      /* finalize PMI for this process group */

/* PMI Keyval Space functions */
int PMI_KVS_Get_my_name( char *kvsname );       /* get name of keyval space */
int PMI_KVS_Get_name_length_max( void );        /* maximum name size */
int PMI_KVS_Get_key_length_max( void );         /* maximum key size */
int PMI_KVS_Get_value_length_max( void );       /* maximum value size */
int PMI_KVS_Create( char *kvsname );            /* make a new one, get name */
int PMI_KVS_Destroy( const char *kvsname );     /* finish with one */
int PMI_KVS_Put( const char *kvsname, const char *key,
                 const char *value );           /* put key and data */
int PMI_KVS_Commit( const char *kvsname );      /* block until all pending put
                                                   operations from this process
                                                   are complete.  This is a
                                                   process-local operation. */
int PMI_KVS_Get( const char *kvsname, const char *key, char *value );
                                /* get value associated with key */

int PMI_KVS_iter_first(const char *kvsname, char *key, char *val);
int PMI_KVS_iter_next(const char *kvsname, char *key, char *val);
                                /* loop through the pairs in the kvs */
\end{verbatim}
\end{small}

A scalable implementation of Part 1 of PMI could probably use existing software
for the group functions, and add some new functionality to support the
KVS-related functions.  One possible implementation is suggested below
in Section~\ref{sec:adding}.

\paragraph{Notes}
\begin{itemize}
\item The above routines (Part 1) are all that is needed for an
  implementation of MPI-1 and most of MPI-2.  Part 2 is needed only to
  support the MPI functions defined in the Dynamic Process Management
  section of the MPI Standard.
\item Similarly, multiple KVS's are really needed only in the dynamic
  process management case.  An initial implementation could omit {\tt
    PMI\_KVS\_Create} and {\tt PMI\_KVS\_Destroy}.  The iterators {\tt
    PMI\_KVS\_iter\_first} and {\tt PMI\_KVS\_iter\_next} are used to
  transfer KVS's in grid environments, and could also be omitted from
  some implementations.
\item The {\tt spawned} argument to {\tt PMI\_Init} is necessary to
  implement MPI-2 functionality, in particular {\tt MPI\_Comm\_get\_parent}.
  In a PMI implementation that does not support dynamic process
  management, it can simply set {\tt *spawned} to 0.
\item {\tt PMI\_KVS\_Commit} exists so that, in case {\tt PMI\_KVS\_Put} is an
  expensive operation involving communication with an external process,
  several {\tt PMI\_KVS\_Put}s can be batched locally and sent off only when
  the {\tt PMI\_KVS\_Commit} is done.
\item The notion of a KVS is reminiscent of Linda, in which processes
  execute {\tt read} and {\tt write} operations on a shared ``tuple
  space''.  Why not use the Linda interface?  The reason is scalability.
  Linda implements a blocking {\tt read}, in which the calling process blocks
  until data with the requested key is put into the tuple space by
  another process.  While this is a convenient synchronizing operation,
  and could in theory be used here, it would not be scalable.  Note that
  in PMI there is no point-to-point communication.  The only
  synchronization operation is {\tt PMI\_Barrier}, which can have a
  variety of scalable implementations, depending on the environment.
\end{itemize}

\textbf{To Do}: Provide minimum sizes for the various strings.  Here
is a proposal.

The minimum sizes of the names and values stored will depend on the
MPI implementation.  The following limits will work with most
implementations:
\begin{description}
\item[kvsname] 16
\item[key] 32
\item[value] 64
\item[Number of keys] Number of processes in an MPI program (size of
  \texttt{MPI\_COMM\_WORLD} in an MPI-1 program)
\item[Number of groups] Number of separate \texttt{MPI\_COMM\_WORLD}s
  managed by the process manager.  For a single MPI-1 code, this is one.
\end{description}

\subsection{Part 2:  Advanced PMI Routines}
\label{sec:part2}

This part of PMI is still under development.  If one assumes that the
dynamic process management functions in MPI-2 are not performance
critical, then the requirements for efficiency and scalability of these
operations are less crucial, although we expect {\tt MPI\_Comm\_Spawn} to be
implemented scalably, at least competitively with an initial {\tt mpiexec}.

\begin{small}
\begin{verbatim}
/* PMI Process Creation functions */

int PMI_Spawn_multiple(int count, const char *cmds[], const char **argvs[],
                       const int *maxprocs, const void *info, int *errors,
                       int *same_domain, const void *preput_info);

int PMI_Spawn(const char *cmd, const char *argv[], const int maxprocs,
              char *spawned_kvsname, const int kvsnamelen );

/* parse PMI implementation specific values into an info object that can
   then be passed to PMI_Spawn_multiple.  Remove PMI implementation
   specific arguments from argc and argv
*/

int PMI_Args_to_info(int *argcp, char ***argvp, void *infop);

/* Other PMI functions to be defined as necessary for other parts of
   dynamic process management */
\end{verbatim}
\end{small}

\section{Typical Usage}
\label{sec:usage}

In this section we give an example of typical usage of the PMI interface
in MPICH.  In the {\tt CH3\_TCP} device used on Linux clusters, TCP is
used to support MPI communication.  Connections between processes are
established by the normal socket {\tt connect}/{\tt accept} mechanism.
For this to work, before the first {\tt MPI\_SEND} from one process to
another, one process must have acquired a port from the operating system
and be listening on it with the normal {\tt socket}/{\tt bind}/{\tt listen}
sequence.  The other process, typically on a separate host, will execute
the corresponding {\tt socket}/{\tt connect} sequence, at which time the
first process will issue an {\tt accept}, establishing the TCP
connection.  Since we don't use reserved ports, the first process must
advertise in some way the port it is listening on.  Since for the sake
of scalability and rapid startup we don't establish these connections
until they are needed, the {\tt connect} operation is not executed until
the socket is needed, typically the first time a process issues an {\tt
  MPI\_SEND}.  At this point the MPICH implementation must determine from the
MPI rank of the destination process which host, and which port on that
host, to connect to in order to establish the connection.

PMI is the mechanism by which the first process advertises its host and
listening port, keyed by rank, and the second process finds out this
information.  Thus the sequence of events during {\tt MPI\_Init} goes
like this:
\begin{enumerate}
\item During {\tt MPI\_Init}, each process calls {\tt PMI\_Init}, in
  order to perform whatever initialization is needed by the PMI
  implementation.
\item Still in {\tt MPI\_Init}, each process calls {\tt PMI\_Get\_rank}
  to find out its rank in the MPI job.
\item Each process executes {\tt gethostname} to find out its host and
  {\tt socket}/{\tt bind} to obtain a port.
\item Each process creates a key from its rank and a value for that key
  from its host and port.  We actually use two pairs, with keys {\tt
    P<rank>-hostname} and {\tt P<rank>-port}.
\item Each process does a {\tt PMI\_KVS\_Put} to put its (key, value)
  pairs into the default KVS.  It may deposit other information with
  other calls to {\tt PMI\_KVS\_Put}.  It does a {\tt PMI\_KVS\_Commit} to
  flush all of the (key, value) pairs to the KVS.
\item All processes execute {\tt PMI\_Barrier} to synchronize.  It is
  assumed that this operation is implemented in a scalable way.  Note
  that MPI communication is not available yet, so this is not an {\tt
    MPI\_Barrier}.  We are still inside {\tt MPI\_Init}.
\end{enumerate}

At this point the only non-local communication that has taken place is
the barrier.  Now each process can exit from {\tt MPI\_Init}, having
made available the information that only it knows (the listening port).

When a process executes any form of {\tt MPI\_SEND}, the
implementation can check to see whether a connection to the destination
process already exists and, if not, use the rank to create the
appropriate key, do a {\tt PMI\_KVS\_Get} to find the host and port of the
destination process, and then do the {\tt connect}.  We currently keep these
sockets open for the rest of the job, but there is nothing to preclude
closing and reopening them as needed.

The above sequence is of course not the only way to use the PMI
interface, but it constitutes a typical example of its use.  Note that
even if more information is conveyed in this way, the actual size of the
KVS is not anticipated to be large.  Room for a few (key, value) pairs
for each process is all that is necessary.

\section{Implementing PMI}
\label{sec:implementing}

In this section we describe some implementations of PMI that we are
using, together with a design for one we are not.  Although this note is
about the interface itself and not any specific implementation, it might
be useful to understand the alternatives that we have explored and that are
in current use.  Also, the very existence of multiple implementations
demonstrates that PMI is a real interface with a purpose, not just a
design for part of MPICH.  All implementations are distributed with
MPICH, as described at the end of Section~\ref{sec:implications}.

It is useful to think of a PMI implementation as having two parts: a
{\em client\/} side and a {\em server\/} side.  The client side is the
direct implementation of the PMI functions defined above, linked together
with the MPI library in the application's executable.  In some cases this
part of the implementation communicates with other processes that are not
part of the application; we call these processes the server side of the PMI
implementation.  As we shall see, the server side may not exist at all,
or may be part of the client side; multiple architectures for PMI
implementations already exist.  In the following subsection we describe some
existing client-side PMI implementations, and in the one after that some
server-side implementations.  Currently most of these are
part of the MPICH distribution~\cite{mpich-web-page}.

\subsection{Client Side}
\label{sec:client-side}

We are currently using several implementations of the client side of PMI.
\begin{description}
%% \item[uni] is a stub used for debugging.  It assumes that there is only one
%%   process and so needs to provide no services.
\item[simple] is our primary implementation for Unix systems.  It
  assumes that a socket (the PMI socket) has been created that can be
  used to exchange commands with the server side of the implementation.
  It is used by multiple server implementations, as described below.
\end{description}


\subsection{Server Side}
\label{sec:server-side}

One can think of the server side of a PMI implementation as that part of
a process management system that supports the client.  Currently several
are in use or under development.

\begin{description}
\item[forker] implements the server side of the ``simple'' client side
  described above.  It is used primarily for debugging, although it can
  also be used in production on SMPs.  It consists of an {\tt mpiexec}
  script that simply forks the parallel processes after setting up the
  PMI socket.  Thus all processes must be on the same machine.
\item[remshell] uses {\tt rsh} or {\tt ssh} to start the processes from
  the {\tt mpiexec} process, after which they connect back to exchange the
  keyval information.  This illustrates the combination of an old ({\tt
    rsh}) process startup mechanism with a new data-exchange mechanism.
\end{description}

\subsection{Combined Client and Server}
\label{sec:combined}

The MPICH-G2~\cite{karonis02:mpich-g2} implementation of MPI illustrates
yet another approach.  MPICH-G2 is built on MPICH-1 and thus uses the BNR
interface, but the underlying principles are the same.  In MPICH-G2, the
{\tt put} operations are local, and the {\tt barrier} operation is a
global all-to-all exchange, implemented in a scalable way.  The
{\tt get}s can then be done without further communication.
\subsection{Adding a PMI Module to an Existing Process Starter}
\label{sec:adding}

In the implementations listed above, we have combined the PMI
implementation, particularly the server side, with a process startup
mechanism implemented at the same time.  Some systems, such as
SLURM, may already have scalable methods in place for starting processes
and might be looking for the simplest way to add PMI capabilities.
Although the best approach is likely to be to incorporate PMI
server-side capabilities into the process starter, the following
approach, though less scalable, might be serviceable:
\begin{enumerate}
\item At the time each process of the MPI job is started, it is passed
  its rank and the size of the job in environment variables, since
  these are things the process manager knows.  These could be used to
  implement {\tt PMI\_Get\_rank} and {\tt PMI\_Get\_size}.  (In
  practice, these values would probably be read from the environment
  during {\tt PMI\_Init} and cached.)
\item At the time the job is started, a separate ``KVS server'' process
  would be forked to hold all KVS data.
\item All processes would send their {\tt PMI\_KVS\_Put} data to this
  server.  Use of UDP rather than TCP would probably help with the
  obvious scalability problem that this server would receive data from
  each process in the job at approximately the same time.
\item {\tt PMI\_Barrier} would be implemented in the server with a
  simple counter.
\item Data requested by {\tt PMI\_KVS\_Get} would come from the server.
\item A variation would be to have all the data broadcast at the time of
  the barrier, so that subsequent gets would be local.
\end{enumerate}
This mechanism is not intrinsically scalable to thousands of nodes,
which is why we are not using it.  However, it might scale beyond a few
hundred nodes and would be a rather straightforward addition to an
existing process startup mechanism.
\section{Resource Registration}
\label{sec:register}
There are some resources that a program may need to allocate but
cannot guarantee will be released when the program exits,
particularly if the program exits as the result of an error or an
uncatchable signal.  These resources include other processes, SYSV
shared memory segments and semaphores, and temporary files.  The
routines in this section allow the program to notify the process
manager of these resources and provide a general way for the process
manager to free them when the program exits.

If the process manager does not provide these functions, there
are several options:
\begin{enumerate}
\item The calls can be ignored.  The program will do its best to free
  these resources when it exits.  This may include setting a cleanup
  handler on the catchable signals that normally cause an abort.
  Note that in this case the registration routine must return an error
  so that the application knows that it must handle this itself.
\item The calls can be directed to an alternate process, called a
  ``watchdog'', that will free the resources if the watched process
  terminates abnormally.
\end{enumerate}

Note that this interface provides a way for process managers to permit
a process to create new processes, since the processes will be
registered with the process manager.

The following is still in rough draft form:
\begin{verbatim}
int PMI_Resource_register( const char *name,
                           void *(*at_exit)(void *),
                           void *at_exit_extra_data,
                           void *(*at_abort)(void *),
                           void *at_abort_extra_data );

int PMI_Resource_release_begin( const char *name, int timeout );
int PMI_Resource_release_end( const char *name );
\end{verbatim}

The functions passed to \texttt{PMI\_Resource\_register} may need to be
command names or an enumerated list of known resources.

The release functions are split so that the process can indicate that
it is about to release a resource, together with a timeout after which
the watchdog may consider the process to have failed.  For example, when
removing a SYSV shared memory segment, the following code would be used:

\begin{verbatim}
   PMI_Resource_release_begin( "myipc", 10 );
   shmctl( memid, IPC_RMID, NULL );
   PMI_Resource_release_end( "myipc" );
\end{verbatim}

This interface still contains a small race condition: the window
between the time the resource is created and the time it is
registered.  This is a very narrow race, so it may not be important to
close it (and with registration, the much more likely resource leaks
have already been closed).  However, a two-phase registration process
that registers the intent to create a resource could be considered.
If the second phase of the registration is never completed, the
watchdog could try to hunt down the newly allocated resource.
\section{Topology Information}
\label{sec:topology}
The process manager often has some information about the process
topology.  For example, it is likely to know about multiprocessor
nodes and may know about the parallel machine layout.  The routines in
this section provide a way for the process manager to communicate that
information to the program.  As with the other PMI services, if the
process manager cannot provide this service, several alternatives
exist, including returning \texttt{PMI\_ERR\_UNSUPPORTED} and using
a separate service to provide this information.

\begin{verbatim}
int PMI_Topo_type( PMI_Group group, int *kind );
int PMI_Topo_cluster_info( PMI_Group group,
                           int *levels, int my_cluster[],
                           int my_rank[] );
int PMI_Topo_mesh_info( PMI_Group group, int ndims, int dims[] );
\end{verbatim}
These routines provide information on the specified PMI group.
\texttt{PMI\_Topo\_type} gets the type of topology.  The current
choices are \texttt{PMI\_TOPO\_CLUSTER}, \texttt{PMI\_TOPO\_MESH}, and
\texttt{PMI\_TOPO\_NONE}.
The other routines provide information about the cluster and mesh
topologies.  Other topologies can be added as necessary; these cover
most current systems.
\section{Resource Allocation on Behalf of Parallel Jobs}
\label{sec:request}
(I'm not sure that this section goes here.)

In some cases, resources must be allocated before a process is
created.  For example, if several processes on the same SMP node are
to share an anonymous mmap (for shared memory), this memory must be
allocated before the processes are created (strictly, before all but
the first process is created, if the first process creates the
others).  The purpose of the routines in this section is to allow a
startup program, such as \texttt{mpiexec}, to describe these
requirements to the process manager before any processes are started.
Question: it may be that the only routine here is used to answer the
question ``did you give me the resource?''  This leaves unanswered the
question of how a device lets an \texttt{mpiexec} know that it needs a
particular resource.
\section{Implications for Collaborators}
\label{sec:implications}

We hope that this brief discussion has made it easier to understand what
options and opportunities exist for implementors of parallel programming
libraries or process management environments that will interact with
MPICH or MPICH-derived MPI implementations.

We recommend that implementors of MPI and other libraries use the PMI
functions to exchange the data needed to set up the primary
communication mechanism.  MPICH already does this for setting up TCP
connections in the CH3 implementation of the Abstract Device Interface
(ADI-3).  If one links with the ``simple'' implementation of the PMI
client side in MPICH, then MPI jobs can be started by any process
management environment that implements the server side.

Process management systems, such as PBS, YOD, or SLURM, have two options
in the short run.

In the long run, implementations may prefer to implement both sides
themselves, meaning that one would link one's application with a PBS-,
SLURM-, or Myricom-specific object file implementing the client side.

The PMI-related code described here is available in the current MPICH
distribution~\cite{mpich-web-page}, in the {\tt
  src/pmi/\{simple,uni\}} (client side) and {\tt
  src/pm/forker} (server side) subdirectories.
Different process managers (the server side) and different PMI
implementations can be chosen when MPICH is configured.  The default is
as if one had specified
\begin{verbatim}
     configure --with-pmi=simple
\end{verbatim}
Please send questions and comments to mpich-discuss@mcs.anl.gov.

%\appendix
%\section{Wire Protocol for the Simple PMI Implementation}
%\texttt{PMI\_PORT} environment variable
%\section{Man Pages for PMI Routines}
% Use the man page generator and include the relevant files here

\bibliography{/home/MPI/allbib,paper}
\bibliographystyle{plain}

\end{document}