\documentclass[11pt]{article}
\usepackage{times} % Necessary because Acrobat can't handle fonts properly
\let\file=\texttt
\let\code=\texttt
\let\program=\code
\usepackage{epsf}
\setlength\textwidth{6.5in}
\setlength\oddsidemargin{0in}
\setlength\evensidemargin{0in}
\setlength\marginparwidth{0.7in}
\def\bw{\texttt{\char`\\}}

% Add a discussion macro for parts of the paper that need discussion
\newenvironment{discussion}{\begin{quotation}\centerline{\textbf{Discussion}}}{\ifvmode\else\par\fi\centerline{\textbf{End of Discussion}}\end{quotation}}

\newcommand\onehalf{\ifmmode {\scriptstyle {\scriptstyle 1\over
    \scriptstyle 2}} \else $\onehalf$ \fi}
\renewcommand{\floatpagefraction}{0.9}

\begin{document}
\title{{\bf Process Management in MPICH}\\[.2in] DRAFT 2.1}
\author{The MPICH Team\\Argonne National Laboratory}
\maketitle

\begin{abstract}
In this note we describe a process management interface that can be
used by MPI implementations and other parallel processing libraries
yet be independent of both.  We define the specific interface we are
developing, called PMI (Process Manager Interface), in the context of
MPICH.  We describe the interface itself and a number of
implementations.  We show how an MPI implementation built on MPICH
can use PMI to make itself independent of the environment in which it
executes, and how a process management environment can support MPI
implementations based on MPICH.
\end{abstract}

\section{Introduction}
\label{sec:introduction}

This informal paper is intended to be useful for those using MPICH as the
basis of their own MPI implementations, or providing a process
management environment that will run MPI jobs linked against
MPICH itself or an MPICH-derived MPI implementation.  At the time of
writing, this audience potentially includes groups at IBM (for BG/L), Livermore
(SLURM process management), Cray (MPI implementation for Red Storm, with
YOD), Myricom (MPICH-GM), Veridian (PBSPro), and Globus/Teragrid
implementors (MPICH-G2).  Others are welcome to contribute suggestions.

This paper is incomplete in the sense that it describes an interface
still being defined.  The part that is currently in use in the MPI-1
part of MPICH has proved adequate for our needs and has several
implementations.  We will refer to it as ``Part 1.''  That part of the
interface that is required for the dynamic process management part of
MPI-2 is still under development, and we will refer to it here as ``Part
2.''  Most of this paper is about Part 1, which is all that is necessary
to support an MPI implementation that does not include dynamic process
management.

We describe the problem we are trying to solve in
Section~\ref{sec:problem}, the approach we have taken so far in MPICH
in Section~\ref{sec:approach}, the PMI interface itself, as it has
been defined so far, in Section~\ref{sec:interface}, and
implementations in Section~\ref{sec:implementing}.  In
Section~\ref{sec:implications} we outline the implications for those who
are collaborating with us in various ways related to the MPICH project.

\section{The Problem}
\label{sec:problem}

The problem this paper addresses is how to provide the processes of a
parallel job with the information they need in order to interact with
the process management environment where necessary, in particular to set
up communication with one another.  In many cases such information is
partly provided by the systems software that actually starts processes,
and also partly provided by the processes themselves, in which case the
process management system must aid in the dissemination of this
information to other processes.  A classic example occurs when a process
in a TCP implementation of MPI acquires a port on which it can be
contacted, and then must notify other processes of this port so that
they can establish MPI communication with this process by connecting to
the port.

Traditionally, parallel programming libraries, such as MPI
implementations, have been integrated with process management mechanisms
(LAM, MPICH-1 with the {\tt ch\_p4} device, POE) in order to solve this problem.
After a preliminary exploration of separating the library from the
process manager in MPICH-1 with the {\tt ch\_p4mpd} device, we have decided to
update the interface and then commit to this approach in MPICH.  We are
motivated by the challenges of implementing MPI-2 in a
system-independent way, but many of the ideas here might prove useful in
a non-MPI environment as well, either for other parallel libraries (such
as Global Arrays, GP-SHMEM, or GASNet) or language-based systems (such
as UPC, Co-Array Fortran, or Titanium).

The problem to be addressed has several components:
\begin{itemize}
\item Conveying to processes in a parallel job the information they need
  to establish communication with one another.  To focus the discussion,
  we assume that such communication is needed for implementing MPI.
  Thus this information could include hosts, ports, interfaces,
  shared-memory keys, and other information.
\item MPI-2 dynamic process management functions require extra support
  for implementing {\tt MPI\_Comm\_Spawn}, {\tt MPI\_Comm\_\{Connect/Accept\}},
  {\tt MPI\_\{Publish/Lookup\}\_name}, etc.
\item The interface should be simple and straightforward, particularly
  in the absence of dynamic process management.  MPICH will implement
  dynamic process management, but some other MPI implementations may
  not.
\item The interface must allow a scalable implementation for
  performance-critical operations.  In environments with even only hundreds
  of processes, serial algorithms will be inappropriate.
\end{itemize}
An earlier, similar version of this approach was described
in~\cite{butler-lusk-gropp:mpd-parcomp}, where the interface, called
BNR, was incorporated in the {\tt ch\_p4mpd} device in the original MPICH.  PMI
represents an evolution of that interface and its implementation in MPICH.

Note that existing process managers often do a scalable job of starting
processes, and this part of existing systems can be kept.  What is
sometimes lacking is a way of conveying the communication-establishment
information.  Although the approach we take in the implementations below
is to combine the process startup and information exchange functionality
in a single interface, a different implementation could separate these,
using an existing process-startup mechanism and adding a new component
to implement the other parts of the interface.  One approach along these
lines is outlined in Section~\ref{sec:adding}.

\section{Our Approach}
\label{sec:approach}

The approach we have taken to the problem is to define a Process
Management Interface (PMI).  The MPICH implementation of MPI will be
implemented in terms of PMI rather than in terms of any particular
process management environment.  Multiple implementations of PMI will
then be possible, independently of the MPICH implementation of MPI.
The key to a good design for PMI is to specify it in a way that allows for
scalable implementation without dictating any details of the
implementations.  This has worked out well so far, and in
Section~\ref{sec:implementing} we describe a number of quite different
implementations of the interfaces described in Section~\ref{sec:interface}.

\section{The PMI Interface}
\label{sec:interface}

We present the interface in two parts.  The first part is sufficient for
the implementation of MPI-1 and many parts of MPI-2.  The second part is
required for implementing the dynamic process management part of MPI-2.
MPICH is now using the first part, with multiple PMI implementations,
so we consider it relatively final at this point.  Since we have not yet
implemented the dynamic process management functions in MPICH, some
evolution of Part 2 of PMI may take place as we do so.

The fundamental idea of Part 1 is the {\em key-value space}, or KVS,
containing a set of (key, value) pairs of strings.  Processes acquire
access to one or more KVS's through PMI and can perform {\tt put/get}
operations on them.  Synchronization is defined in a scalable way via
the barrier operation, so that processes can be assured that the necessary
puts have been done before attempting the corresponding gets.

Thus the PMI interface (Part 1) consists of {\tt put/get/barrier} operations
together with housekeeping operations for managing the KVS's.  For
implementation of MPI-1, a single KVS, the default KVS for processes
started at the same time, is sufficient, but multiple KVS's will be
useful when we consider Part 2 and dynamic process management.

\subsection{Part 1: Basic PMI Routines}
\label{sec:part1}

Part 1 of the interface is invoked in performance-critical parts of the
MPI implementation, both during initialization and connection setup.
Thus it is critical that this part of the interface allow scalable
implementation.  We accomplish this through the semantics of {\tt
put/get/barrier}, since the only synchronizing operation is the
collective {\tt barrier}, which can have a scalable implementation.
The {\tt commit} operation allows batching of {\tt put} operations for
improved performance.

Part 1 has two subparts: first, the functions associated with the process group
being started, and thus already implemented in some way in any MPI
implementation; and second, the functions associated with managing the
keyval spaces, used for communicating setup information.

\begin{small}
\begin{verbatim}
/* PMI Group functions */
int PMI_Init( int *spawned );    /* initialize PMI for this process group.
                                    The value of spawned indicates whether
                                    this process was created by
                                    PMI_Spawn_multiple. */
int PMI_Initialized( void );     /* Return true if PMI has been initialized */
int PMI_Get_size( int *size );   /* get size of process group */
int PMI_Get_rank( int *rank );   /* get rank in process group */
int PMI_Barrier( void );         /* barrier across processes in process group */
int PMI_Finalize( void );        /* finalize PMI for this process group */

/* PMI Keyval Space functions */
int PMI_KVS_Get_my_name( char *kvsname );   /* get name of keyval space */
int PMI_KVS_Get_name_length_max( void );    /* maximum name size */
int PMI_KVS_Get_key_length_max( void );     /* maximum key size */
int PMI_KVS_Get_value_length_max( void );   /* maximum value size */
int PMI_KVS_Create( char *kvsname );        /* make a new one, get name */
int PMI_KVS_Destroy( const char *kvsname ); /* finish with one */
int PMI_KVS_Put( const char *kvsname, const char *key,
                 const char *value );       /* put key and data */
int PMI_KVS_Commit( const char *kvsname );  /* block until all pending put
                                               operations from this process
                                               are complete.  This is a
                                               process local operation. */
int PMI_KVS_Get( const char *kvsname, const char *key, char *value );
                                            /* get value associated with key */

int PMI_KVS_iter_first( const char *kvsname, char *key, char *val );
int PMI_KVS_iter_next( const char *kvsname, char *key, char *val );
                                            /* loop through the pairs in the kvs */
\end{verbatim}
\end{small}

A scalable implementation of Part 1 of PMI could probably use existing software
for the group functions, and add some new functionality to support the
KVS-related functions.  One possible implementation is suggested below
in Section~\ref{sec:adding}.

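As a minimal illustration of how the routines above fit together, here
is a sketch of the {\tt put/commit/barrier/get} pattern (error checking
omitted; the key and value strings are hypothetical, and the buffer
sizes are arbitrary):
\begin{verbatim}
char kvsname[256], value[64];
PMI_KVS_Get_my_name( kvsname );     /* the default KVS for this job */
PMI_KVS_Put( kvsname, "mykey", "myvalue" );
PMI_KVS_Commit( kvsname );          /* flush the batched put */
PMI_Barrier();                      /* all puts now visible */
PMI_KVS_Get( kvsname, "mykey", value );
\end{verbatim}
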
\paragraph{Notes}
\begin{itemize}
\item The above routines (Part 1) are all that is needed for an
  implementation of MPI-1 and most of MPI-2.  Part 2 is only needed to
  support the MPI functions defined in the Dynamic Process Management
  section of the MPI Standard.
\item Similarly, multiple KVS's are only really needed in the dynamic
  process management case.  An initial implementation could omit {\tt
  PMI\_KVS\_Create} and {\tt PMI\_KVS\_Destroy}.  The iterators {\tt
  PMI\_KVS\_iter\_first} and {\tt PMI\_KVS\_iter\_next} are used to
  transfer KVS's in grid environments, and could also be omitted from
  some implementations.
\item The {\tt spawned} argument to {\tt PMI\_Init} is necessary to
  implement MPI-2 functionality, in particular {\tt MPI\_Get\_parent}.
  In a PMI implementation that does not support dynamic process
  management, it can always just set {\tt *spawned} to 0.
\item {\tt PMI\_KVS\_Commit} exists so that in case {\tt PMI\_KVS\_Put} is an
  expensive operation, involving communication with an external process,
  several {\tt PMI\_KVS\_Put}s can be batched locally and only sent off when
  the {\tt PMI\_KVS\_Commit} is done.
\item The notion of KVS is reminiscent of Linda, in which processes
  execute {\tt read} and {\tt write} operations on a shared ``tuple
  space''.  Why not use the Linda interface?  The reason is scalability.
  Linda implements a blocking {\tt read}, in which the calling process blocks
  until data with the requested key is put into the tuple space by
  another process.  While this is a convenient synchronizing operation,
  and could in theory be used here, it would not be scalable.  Note that
  in PMI, there is no point-to-point communication.  The only
  synchronization operation is {\tt PMI\_Barrier}, which can have a
  variety of scalable implementations, depending on the environment.
\end{itemize}

\textbf{To Do}: Provide minimum sizes for the various strings.  Here
is a proposal.

The minimum sizes of the names and values stored will depend on the
MPI implementation.  The following limits will work with most
implementations:
\begin{description}
\item[kvsname] 16
\item[key] 32
\item[value] 64
\item[Number of keys] Number of processes in an MPI program (size of
  \texttt{MPI\_COMM\_WORLD} in an MPI-1 program)
\item[Number of groups] Number of separate \texttt{MPI\_COMM\_WORLD}s
  managed by the process manager.  For a single MPI-1 code, this is one.
\end{description}

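Rather than hard-coding such limits, a portable client can size its
buffers from the maxima reported by the implementation.  A sketch
(treating the return values as the maxima, as the declaration comments
above suggest, and omitting error checking):
\begin{verbatim}
char *kvsname = malloc( PMI_KVS_Get_name_length_max() );
char *key     = malloc( PMI_KVS_Get_key_length_max() );
char *value   = malloc( PMI_KVS_Get_value_length_max() );
PMI_KVS_Get_my_name( kvsname );
\end{verbatim}
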
\subsection{Part 2: Advanced PMI Routines}
\label{sec:part2}

This part of PMI is still under development.  If one assumes that the
dynamic process management functions in MPI-2 are not performance
critical, then the requirements for efficiency and scalability of these
operations are less crucial, although we expect MPI\_Comm\_Spawn to be
implemented scalably, at least well enough to compete with the original {\tt mpiexec}.

\begin{small}
\begin{verbatim}
/* PMI Process Creation functions */

int PMI_Spawn_multiple(int count, const char *cmds[], const char **argvs[],
                       const int *maxprocs, const void *info, int *errors,
                       int *same_domain, const void *preput_info);

int PMI_Spawn(const char *cmd, const char *argv[], const int maxprocs,
              char *spawned_kvsname, const int kvsnamelen );

/* parse PMI implementation specific values into an info object that can
   then be passed to PMI_Spawn_multiple.  Remove PMI implementation
   specific arguments from argc and argv */

int PMI_Args_to_info(int *argcp, char ***argvp, void *infop);

/* Other PMI functions to be defined as necessary for other parts of
   dynamic process management */
\end{verbatim}
\end{small}

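Although Part 2 is still in flux, the intended use of {\tt
PMI\_Spawn\_multiple} is roughly as follows (a sketch only, following
the draft declarations above; the command name is hypothetical and the
info arguments are simply NULL here):
\begin{verbatim}
const char *cmds[1]   = { "child_program" };   /* hypothetical command */
const char **argvs[1] = { NULL };
int maxprocs[1] = { 4 };
int errors[4], same_domain;

PMI_Spawn_multiple( 1, cmds, argvs, maxprocs, NULL /* info */,
                    errors, &same_domain, NULL /* preput_info */ );
\end{verbatim}
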
\section{Typical Usage}
\label{sec:usage}

In this section we give an example of typical usage of the PMI interface
in MPICH.  In the {\tt CH3\_TCP} device used on Linux clusters, TCP is
used to support MPI communication.  Connections between processes are
established by the normal socket {\tt connect}/{\tt accept} mechanism.
For this to work, before the first {\tt MPI\_SEND} from one process to
another, one process must have acquired a port from the operating system
and be listening on it with the normal {\tt socket}/{\tt bind}/{\tt listen}
sequence.  The other process, typically on a separate host, will execute
the corresponding {\tt socket}/{\tt connect} sequence, at which time the
first process will issue an {\tt accept}, establishing the TCP
connection.  Since we don't use reserved ports, the first process must
advertise in some way the port it is listening on.  Since for the sake
of scalability and rapid startup we don't establish these connections
until they are needed, the {\tt connect} operation is not executed until
the socket is needed, typically the first time a process issues an {\tt
MPI\_SEND}.  At this point the MPICH implementation must determine from the
MPI rank of the destination process which host, and which port on that
host, to connect to in order to establish the connection.

PMI is the mechanism by which the first process advertises its host and
listening port, keyed by rank, and the second process finds out this
information.  Thus the sequence of events during {\tt MPI\_Init} goes
like this:
\begin{enumerate}
\item During {\tt MPI\_Init}, each process calls {\tt PMI\_Init}, in
  order to perform whatever initialization is needed by the PMI
  implementation.
\item Still in {\tt MPI\_Init}, each process calls {\tt PMI\_Get\_rank}
  to find out its rank in the MPI job.
\item Each process executes {\tt gethostname} to find out its host and
  {\tt socket}/{\tt bind} to obtain a port.
\item Each process creates a key from its rank and a value for that key
  from its host and port.  We actually use two pairs, using keys {\tt
  P<rank>-hostname} and {\tt P<rank>-port}.
\item Each process does a {\tt PMI\_KVS\_Put} to put its (key, value)
  pairs into the default KVS.  It may deposit other information with
  other calls to {\tt PMI\_KVS\_Put}.  It does a {\tt PMI\_KVS\_Commit} to
  flush all of the (key, value) pairs to the KVS.
\item All processes execute {\tt PMI\_Barrier} to synchronize.  It is
  assumed that this operation is implemented in a scalable way.  Note
  that MPI communication is not available yet, so this is not an {\tt
  MPI\_Barrier}.  We are still inside {\tt MPI\_Init}.
\end{enumerate}
At this point the only non-local communication that has taken place is
the barrier.  Now, each process can exit from {\tt MPI\_INIT}, having
made available the information that only it knows (the listening port).

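In code, this sequence inside {\tt MPI\_Init} looks roughly like the
following sketch (fixed-size buffers, no error checking; {\tt hostname}
and {\tt listen\_port} are assumed to have been obtained in step 3):
\begin{verbatim}
int rank, spawned;
char kvsname[256], key[64], value[64];

PMI_Init( &spawned );
PMI_Get_rank( &rank );
PMI_KVS_Get_my_name( kvsname );

sprintf( key, "P%d-hostname", rank );
PMI_KVS_Put( kvsname, key, hostname );
sprintf( key, "P%d-port", rank );
sprintf( value, "%d", listen_port );
PMI_KVS_Put( kvsname, key, value );

PMI_KVS_Commit( kvsname );   /* flush both pairs to the KVS */
PMI_Barrier();               /* everyone's pairs are now visible */
\end{verbatim}
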
When a process executes any form of {\tt MPI\_SEND}, the
implementation can check to see if a connection to the destination
process already exists, and if not, use the rank to create the
appropriate key and do {\tt PMI\_KVS\_Get} to find the host and port of the
destination process and do the {\tt connect}.  We currently keep these
sockets open for the rest of the job, but there is nothing to preclude
closing and reopening them as needed.

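The connect-time lookup is the mirror image of the put side (again a
sketch; {\tt dest} is the MPI rank of the destination process):
\begin{verbatim}
char key[64], host[64], port[64];

/* kvsname was obtained from PMI_KVS_Get_my_name during MPI_Init */
sprintf( key, "P%d-hostname", dest );
PMI_KVS_Get( kvsname, key, host );
sprintf( key, "P%d-port", dest );
PMI_KVS_Get( kvsname, key, port );
/* now do the socket()/connect() to (host, atoi(port)) */
\end{verbatim}
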
The above sequence is of course not the only way to use the PMI
interface, but it constitutes a typical example of its use.  Note that
even if more information is conveyed in this way, the actual size of the
KVS is not anticipated to be large.  Room for a few (key, value) pairs
for each process is all that is necessary.

\section{Implementing PMI}
\label{sec:implementing}

In this section we describe some implementations of PMI that we are
using, together with a design for one we are not.  Although this note is
about the interface itself and not any specific implementation, it might
be useful to understand the alternatives that we have explored and that are
in current use.  Also, the very existence of multiple implementations
demonstrates that PMI is a real interface with a purpose, not just a
design for part of MPICH.  All implementations are distributed with
MPICH, as described at the end of Section~\ref{sec:implications}.

It is useful to think of a PMI implementation as having two parts: a
{\em client\/} side and a {\em server\/} side.  The client side is the
direct implementation of the PMI functions defined above, linked together
with the MPI library in the application's executable.  In some cases this
part of the implementation communicates with other processes that are not part of
the application; we call these processes the server side of the PMI
implementation.  As we shall see, the server side may not exist at all,
or may be part of the client side; multiple architectures for PMI
implementations already exist.  In the following subsection we describe some
existing client-side PMI implementations.  In the next section we
describe some server-side implementations.  Currently most of these are
part of the MPICH distribution~\cite{mpich-web-page}.

\subsection{Client Side}
\label{sec:client-side}

We are currently using three separate implementations of the client side of PMI.
\begin{description}
%% \item[uni] is a stub used for debugging.  It assumes that there is only one
%% process and so needs to provide no services.
\item[simple] is our primary implementation for Unix systems.  It
  assumes that a socket (the PMI socket) has been created that can be
  used to exchange commands with the server side of the implementation.
  It is used by multiple server implementations, as described below.
\end{description}

\subsection{Server Side}
\label{sec:server-side}

One can think of the server side of a PMI implementation as that part of
a process management system that supports the client.  Currently several
are in use or under development.

\begin{description}
\item[forker] implements the server side of the ``simple'' client side
  described above.  It is used primarily for debugging, although it can
  also be used in production on SMP's.  It consists of an {\tt mpiexec}
  script that simply forks the parallel processes after setting up the
  PMI socket.  Thus all processes must be on the same machine.
\item[Remshell] uses {\tt rsh} or {\tt ssh} to start the processes from
  the {\tt mpiexec} process; the processes then connect back to exchange the
  keyval information.  This illustrates the combination of an old ({\tt
  rsh}) process startup mechanism with a new data-exchange mechanism.
\end{description}

\subsection{Combined Client and Server}
\label{sec:combined}

The MPICH-G2~\cite{karonis02:mpich-g2} implementation of MPI illustrates
yet another approach.  MPICH-G2 is built on MPICH-1 and thus uses the BNR
interface, but the underlying principles are the same.  In MPICH-G2, the
{\tt put} operations are local, and the {\tt barrier} operation is a
global all-to-all exchange, implemented in a scalable way.  Then the
{\tt get}s can be done without further communication.

\subsection{Adding a PMI Module to an Existing Process Starter}
\label{sec:adding}

In the implementations listed above, we have combined the PMI
implementation, particularly the server side, with a process startup
mechanism being implemented at the same time.  Some systems, such as
SLURM, may already have scalable methods in place for starting processes
and might be looking for the simplest way to add PMI capabilities.
Although the best approach is likely to be to incorporate PMI
server-side capabilities into the process starter, the following
approach, though less scalable, might be serviceable:
\begin{enumerate}
\item At the time each process of the MPI job is started, it is passed
  its rank and the size of the job in an environment variable, since
  these are things the process manager knows.  This could be used to
  implement {\tt PMI\_Get\_rank} and {\tt PMI\_Get\_size}.  (Actually
  these values would probably be read from the environment during {\tt
  PMI\_Init} and cached.)
\item At the time the job is started, a separate ``KVS server'' process
  would be forked to hold all KVS data.
\item All processes would send their {\tt PMI\_KVS\_Put} data to this
  server.  Use of UDP rather than TCP would probably help with the
  obvious scalability problem that this server would receive data from
  each process in the job at approximately the same time.
\item The {\tt PMI\_Barrier} would be implemented in the server with a
  simple counter.
\item Data requested by {\tt PMI\_KVS\_Get} would come from the server.
\item A variation would be to have all the data broadcast at the time of
  the barrier, so that subsequent gets would be local.
\end{enumerate}
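To make steps 2--5 concrete, the core of such a KVS server could be a
single message loop.  Here is a sketch, assuming a hypothetical message
structure and helper routines ({\tt receive\_next\_message}, {\tt
store}, {\tt lookup}, {\tt reply}, {\tt release\_all\_waiters});
framing, retransmission for UDP, and error handling are omitted:
\begin{verbatim}
struct kvs_msg msg;                       /* hypothetical message type */
int barrier_count = 0;

for (;;) {
    msg = receive_next_message();          /* from any process */
    if (strcmp( msg.cmd, "put" ) == 0)
        store( msg.key, msg.value );       /* e.g., into a hash table */
    else if (strcmp( msg.cmd, "get" ) == 0)
        reply( msg.src, lookup( msg.key ) );
    else if (strcmp( msg.cmd, "barrier" ) == 0 &&
             ++barrier_count == job_size) {  /* simple counter */
        release_all_waiters();             /* answer every process */
        barrier_count = 0;
    }
}
\end{verbatim}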
This mechanism is not intrinsically scalable to thousands of nodes,
which is why we are not using it.  However, it might scale farther than
a few hundred nodes, and be a rather straightforward addition to an
existing process startup mechanism.

\section{Resource Registration}
\label{sec:register}
There are some resources that a program may need to allocate that the
program cannot guarantee will be released when the program exits,
particularly if the program exits as the result of an error or an
uncatchable signal.  These resources include other processes, SYSV
shared memory segments and semaphores, and temporary files.  The
routines in this section allow the program to notify the process
manager of these resources and provide a general way for the process
manager to free them when the program exits.

If the process manager does not provide these functions, then there
are several options:
\begin{enumerate}
\item The calls can be ignored.  The program will do its best to free
  these resources when it exits.  This may include setting a cleanup
  handler on the catchable signals that normally cause an abort.
  Note that in this case the registration routine must return an error
  so that the application knows that it must handle this itself.
\item The calls can be directed to an alternate process, called a
  ``watchdog'', that will free the resources if the watched process
  terminates abnormally.
\end{enumerate}

Note that this interface provides a way for process managers to permit
a process to create new processes, since the processes will be
registered with the process manager.

The following is still in rough draft form:
\begin{verbatim}
int PMI_Resource_register( const char *name,
                           void *(*at_exit)(void *),
                           void *at_exit_extra_data,
                           void *(*at_abort)(void *),
                           void *at_abort_extra_data );

int PMI_Resource_release_begin( const char *name, int timeout );
int PMI_Resource_release_end( const char *name );
\end{verbatim}

The functions in \texttt{PMI\_Resource\_register} may need to be
command names or an enumerated list of known resources.

The release functions are split to allow the process to indicate that
it is about to release a resource, with a timeout after which the
watchdog may consider the process to have failed.  For example, when
removing a SYSV shared memory segment, the following code would be used:

\begin{verbatim}
PMI_Resource_release_begin( "myipc", 10 );
shmctl( memid, IPC_RMID, NULL );
PMI_Resource_release_end( "myipc" );
\end{verbatim}

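The registration itself would have happened when the segment was
created.  Under the draft interface above, that might look like the
following sketch (the callback {\tt remove\_segment} and its use of the
extra-data argument are hypothetical):
\begin{verbatim}
void *remove_segment( void *arg )
{
    shmctl( *(int *)arg, IPC_RMID, NULL );   /* free the segment */
    return NULL;
}
...
memid = shmget( IPC_PRIVATE, size, IPC_CREAT | 0600 );
PMI_Resource_register( "myipc", remove_segment, &memid,
                       remove_segment, &memid );
\end{verbatim}
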
This interface still contains a small race condition: the time between
when the resource is created and when it is registered.  This is a
very narrow race, so it may not be important to close it (and with
registration, much more likely resource leaks have been closed).
However, a two-phase registration process could be considered, one that
would register the intent to create a resource.  In the case of
failure to complete the second part of the two-phase registration, the
watchdog could try to hunt down the newly allocated resource.

\section{Topology Information}
\label{sec:topology}
The process manager often has some information about the process
topology.  For example, it is likely to know about multiprocessor
nodes and may know about parallel machine layout.  The routines in
this section provide a way for the process manager to communicate that
information to the program.  As with the other PMI services, if the
process manager cannot provide this service, several alternatives
exist, including returning a \texttt{PMI\_ERR\_UNSUPPORTED} and using
a separate service to provide this information.

\begin{verbatim}
int PMI_Topo_type( PMI_Group group, int *kind );
int PMI_Topo_cluster_info( PMI_Group group,
                           int *levels, int my_cluster[],
                           int my_rank[] );
int PMI_Topo_mesh_info( PMI_Group group, int ndims, int dims[] );
\end{verbatim}
These routines provide information on the specified PMI group.
\texttt{PMI\_Topo\_type} gets the type of topology.  The current
choices are \texttt{PMI\_TOPO\_CLUSTER}, \texttt{PMI\_TOPO\_MESH}, and
\texttt{PMI\_TOPO\_NONE}.
The other routines provide information about the cluster and mesh
topologies.  Other topologies can be added as necessary; these cover
most current systems.

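For example, a library might query the topology type and fall back
gracefully when the process manager cannot help (a sketch; {\tt group}
is the PMI group of interest, and {\tt MAX\_LEVELS} is a hypothetical
bound on the cluster depth):
\begin{verbatim}
int kind, levels;
int my_cluster[MAX_LEVELS], my_rank[MAX_LEVELS];

if (PMI_Topo_type( group, &kind ) == PMI_ERR_UNSUPPORTED
    || kind == PMI_TOPO_NONE) {
    /* no topology information available; assume a flat machine */
}
else if (kind == PMI_TOPO_CLUSTER) {
    PMI_Topo_cluster_info( group, &levels, my_cluster, my_rank );
    /* e.g., processes sharing my_cluster[0] are on the same node */
}
\end{verbatim}
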
\section{Resource Allocation on Behalf of Parallel Jobs}
\label{sec:request}
(I'm not sure that this section goes here.)

In some cases, resources must be allocated before a process is
created.  For example, if several processes on the same SMP node are
to share an anonymous mmap (for shared memory), this memory must be
allocated before the processes are created (strictly, before all but
the first process is created, if the first process creates the
others).  The purpose of the routines in this section is to allow a
startup program, such as \texttt{mpiexec}, to describe these
requirements to the process manager before any processes are started.
Question: it may be that the only routine here is used to answer the
question ``did you give me the resource?''  This leaves unanswered the
question of ``how does a device let an mpiexec know that it needs a
particular resource?''

\section{Implications for Collaborators}
\label{sec:implications}

We hope that this brief discussion has made it easier to understand what
options and opportunities exist for implementors of parallel programming
libraries or process management environments that will interact with
MPICH or MPICH-derived MPI implementations.

We recommend that MPI and other library implementors use the PMI
functions to exchange data with other processes related to the setting
up of the primary communication mechanism.  MPICH does this already
for setting up TCP connections in the CH3 implementation of the
Abstract Device Interface (ADI-3).  If one links with the ``simple''
implementation of the client side of the PMI implementation in MPICH,
then MPI jobs can be started by any process management environment
that implements the server side.

Process management systems, such as PBS, YOD, or SLURM, have two options
in the short run.

In the long run implementations may prefer to implement both sides
themselves, meaning that one would link one's application with a PBS- or
SLURM- or Myricom-specific object file implementing the client side.

The PMI-related code described here is available in the current MPICH
distribution~\cite{mpich-web-page}, in the {\tt
src/pmi/\{simple,uni\}} (client side) and {\tt src/pm/forker} (server
side) subdirectories.
Different process managers (the server side) and different PMI
implementations can be chosen when MPICH is configured.  The default is
as if one had specified
\begin{verbatim}
configure --with-pmi=simple
\end{verbatim}
Please send questions and comments to mpich-discuss@mcs.anl.gov.

%\appendix
%\section{Wire Protocol for the Simple PMI Implementation}
%\texttt{PMI\_PORT} environment variable
%\section{Man Pages for PMI Routines}
% Use the man page generator and include the relevant files here

\bibliography{/home/MPI/allbib,paper}
\bibliographystyle{plain}

\end{document}