\documentclass{report}
\usepackage{graphics}
\usepackage[dvipdfm]{hyperref}
%
% This is the new (September 2005) MPICH design document. The plan is
% to make this available both as a PDF document and split on the web in
% an easy to read fashion. We can use latex2html to tohtml to provide a
% simple version of this, and that may be enough. However, to retain the
% option of generating web pages directly from this source, only simple
% Latex should be used, and only the new forms (e.g., \texttt{...} instead of
% {\tt ...}
%
\makeindex
\begin{document}
\markright{MPICH Design Document}
\title{MPICH Design Document}
\author{William D. Gropp and Rajeev Thakur}
\maketitle
\pagenumbering{roman}
\tableofcontents
\clearpage
%\raggedright
%% raggedright resets parindent
%\parindent 1em
%% no parskip when parindent used
%\parskip 0pt
\pagenumbering{arabic}
\pagestyle{headings}
\part{MPICH organization for users}
\chapter{Overview}
<<up here, explain that MPICH is a framework for MPI implementation with many possible choices, including specialized communication devices and interfaces to process managers, not ch3+hydra. Duplicate this discussion in the top-level entry for developers>>
\chapter{Communication Devices}
\chapter{Process Managers}
other options
<<this chapter partially overlaps the installation manual, but that manual doesn't cover much of the motivation for the design>>
\chapter{Help Resources}
\section{Frequently Asked Questions (FAQ)}
\section{Diagnostic Programs}
\section{Buglist archives}
\part{MPICH organization for developers and hackers}
<<each section (as appropriate) contains a rationale for the choices and a brief discussion of alternatives and why they were not chosen>>
\chapter{Goals of MPICH}
\section{Principles}
No duplicated code (no cut and paste programming)
Uniform appearance of code
Uniform user interface for controls (e.g., standardized parameter handling, output, naming)
Coverage and Testing
\section{Major components}
devices
the ch3 device and channels
process managers and PM interface
logging
collectives
topology
\section{Build System}
rationale
why not automake
why not libtool
using the build system
using your own configure/make/build scripts
\section{Coding Style}
Code template for uniform error checking and reporting (e.g.,
common \texttt{fn\_fail} target)
Macros for common operations
error reporting
memory allocation
debugging
Safer or more efficient replacements for common routines
safe string routines
memcpy hooks
Error reporting
Rationale
Adding new error classes and codes
Tags for coverage analysis
Take advantage of compiler features to identify potential
problems, including warnings about easily misused features in
C (such as assignments within if tests).
Need to put somewhere - don't fix missing prototype messages
by adding a prototype to the C file; there should always be
one prototype in a header file somewhere.
\section{Major Structures}
request, comm, group, etc.
\section{Selecting Features at Compile Time}
use of macros to control which code is used
list of all macros
\section{Language Bindings}
rationale for buildiface
Outstanding issues
\section{Scripts and Standards for user interfaces}
command line and environment variables
\chapter{Adding your own implementation of a component}
General mechanism (--with-<<component-name>>=directory, e.g., --with-pm=/home/me/mypm)
configure and setup scripts
standardized variables (\texttt{MPI\_CFLAGS}, CFLAGS, etc.)
Specific components (describe, explain the API/ABI, and how to work with it):
Two types
configure/compile-time
link/runtime
\section{Specifying Components during Configure}
\section{Specifying Components at Runtime}
It is possible to override the default routines for any component at
runtime by calling one of the following routines.
\begin{verbatim}
MPIX_SetMethod( component, object, name, function, communicator,
version )
\end{verbatim}
where
component - identifier for the component (e.g., topology,
collectives)
object - object to change; null if changing the defaults for
all objects
name - name of the method (probably character string for function)
function - function to use
communicator - communicator scope of this change (MPI\_COMM\_SELF
and MPI\_COMM\_WORLD are popular choices)
version - version of the interface (see below)
This call is collective over the communicator. The intent of the
communicator option is to allow consistency checking for changes
that must be consistent across processes (e.g., changing the
collective algorithms).
The version number is used to ensure that the version of the
interface matches the one used in the library
A related call is
\begin{verbatim}
MPIX_SetAllMethods( component, object, struct-of-functions*,
communicator, version )
\end{verbatim}
This is similar to SetMethod, except all functions are set from those
defined in the struct-of-functions. The version is very important
here as it ensures that the struct is known.
One additional call is
\begin{verbatim}
MPIX_DLLLoadMethods( component, object, dll-name, communicator,
version )
\end{verbatim}
This loads the methods from a dynamic loaded shared object. It is
collective over the communicator
Implementation note: the ``component'' in these calls can be the name
of a routine that
\section{Collectives}
\section{Topology}
\section{Logging}
--with-logging=/pathname
looks for \texttt{setup\_logging} script. In addition, the
\texttt{Makefile} will be invoked.
clean, dist-clean, maintainer-clean, library target?
Must provide an mpilogging.h file (contents defined as...)
Example (give example directory on web)
Builtin versions...
\section{PM and PMI}
mpiexec
-pmiargs host port executable
special args used to allow singleton init.
\section{Name Server}
--with-nameserver=...
\section{mpid}
\section{ch3-channel}
Components still evolving (a placeholder for things that we want to do)
\chapter{Working with MPICH itself}
<<for people that edit MPICH rather than use the component interface>>
Coding standards
Keeping configure clean
Keeping MPICH modular
\section{Parameters within the MPICH code}
The purpose of this part is to enable the configure, compile, and
run-time control of parameters in the collective implementation.
Goals for the parameter handling routines:
\begin{enumerate}
\item All values (at least in comm\_world) must be the same (the
collective routines expect the same parameter values)
For communicators in different comm worlds (e.g., created by
spawn or connect/accept), it may be necessary to perform a
separate step when the communicator is created to negotiate
parameter choices. This may require a communicator-creation
hook.
The most likely implementation of this step is to check that the
different comm\_worlds have compatible (typically identical)
values for the parameters. There may be a
\texttt{MPIR\_Parm\_check\_consistent}(new intercomm) call for this.
\item The parameter routine should provide the following choices, in
order of decreasing priority:
\begin{enumerate}
\item Explicit control from within the routine (i.e., by a routine
call)
\item Command line parameter on mpiexec
\item Environment variable on the process with rank 0 in MPI\_COMM\_WORLD.
\item Value is user-specified parameter file (e.g., ~/.mpichrc)
\item Value in site-specified parameter file (e.g.,
/usr/local/mpich/.mpichrc)
\item Compile-time value set at configure time (e.g.,
--enable-collective-config=file)
\item Compile-time value set within the collective code (the
default values)
\end{enumerate}
The choice of parameter value should also have some configure-time
control to allow a trade-off between generality and absolute
best performance (particular for startup, if we want to allow
configurations files to be read).
Note that even process managers that can deliver the same
environment variable to all processes often allow the user to
change that behavior (e.g., with a command-line option to
prevent environment variables from being sent to other
processes, so we cannot assume that the values of the
environment variables are the same on all processes without some
additional information (e.g., the process manager could tell us
that the environment variables are the same on all processes).
\item Parameter documentation should (at least in part) be provided
where the parameters are used.
\item Overhead of using parameter routines should be low; particularly
after the first use (if that matters; that is, the first time
may involve an initialization phase)
\item Parameter routines should be a component, cleanly initialized by
MPI\_Init/MPI\_Init\_thread and shut down by MPI\_Finalize.
\item Parameter types that must be handled include integer (e.g.,
message size, group size). Others should be allowed for (such
as arrays of integers, characters) as needed. We should avoid
floating-point values because of possible problems in using them
consistently, particularly in a heterogeneous environment.
Note that this does not say anything about how the values may be
represented in a parameter database, as that is an
implementation issue. However, it does say that the value must
be delivered to the using routine in the form that it needs. We
do not want to call strtol() everytime a routine needs an
integer (atoi should not be used because it has no way to
indicate that the value contains non-digit characters).
\item The design should ensure that changes need be made in only one
place or alternately that inconsistencies (e.g., between an
initialization and a use) are detectable before runtime (before
runtime so that the problem can be detected and fixed without
depending on running a particular test case).
\end{enumerate}
Proposed Design that meets these goals
\begin{enumerate}
\item => Some setup at MPI\_Init time to ensure all processes have
consistent values. Also requires hook for communicators that span
multiple MPI\_COMM\_WORLDs.
\item => registration of names so that the data can be acquired scalably
(e.g., read from file, environment variable on process with rank zero)
\item => a description string either in the call or in the structured
comment at the point of use.
\item also => an initialization step
\item => using the finalize callback. To make them a component for the
initialization step, there needs to be some mechanism to load modules
\item ?
\item => Either a single point of use (which introduces efficiency
problems) or some sort of source-code preprocessing.
\end{enumerate}
Based on these, I propose the following:
Initialization module:\\
This routine is called from MPI\_Init. It details all parameters used
by the collective routines, the associated environment variables, and
the description strings. It might look something like:
\begin{verbatim}
int MPIR_Parm_register_collective( ) {
rc = MPIR_Parm_register_int( "MPICH:SCATTER_THRESHOLD",
&MPIR_Scatter_threshold,
2048,
"Maximum size of messages sent using \
doubling in scatter algorithm with a default of %d" );
...
return MPI_SUCCESS;
}
\end{verbatim}
The purpose of this routine is to communicate, at runtime, the names
of the parameters that some part of the system may need. The
parameters to this routine are:
\begin{description}
\item[\texttt{MPICH:SCATTER\_THRESHOLD}]The name of the parameter. The
environment variable that might correspond to this is
\texttt{MPICH\_SCATTER\_THRESHOLD}
\item[\texttt{\&MPIR\_Scatter\_threashold}]The address of an int (since this is
\texttt{MPIR\_Parm\_register\_int}) that will contain the value.
This integer can be accessed directly by the code; see the example of the use
below.
\item[2048]The default value if no value is provided
\item["Maximum ... \%d..."]A documentation string including a format
specifier for the default value. This will be used to create
help information on the parameter directly from the source code.
\end{description}
Possible additional features would be a valid range (e.g., 0 to 64K)
or a routine to test for valid input.
Once all of these routines are called (e.g., for any set of modules
that choose to use this method to acquire parameters), the routine
\begin{verbatim}
int MPIR_Parm_init( )
\end{verbatim}
is called from within MPI\_Init/MPI\_Init\_thread. This is collective
over COMM\_WORLD and performs the
necessary file reads, environment variable reads and broadcasts the
results to all processes. It handles all registered names, from all
modules.
In the code, all parameters are accessed through a macro that allows
the values to be either compile-time or run-time constants
\begin{verbatim}
MPIR_PARM_GET_INT( "MPICH:SCATTER_THRESHOLD", MPIR_Scatter_threshold,2048 )
\end{verbatim}
This expands into either:
\begin{verbatim}
2048 (compile-time only)
\end{verbatim}
or
\begin{verbatim}
MPIR_Scatter_threshold (run-time)
\end{verbatim}
or even (runtime, lazy evaluation):
\begin{verbatim}
(MPIR_Scatter_threshhold_not_set ? \
(MPIR_Scatter_threshhold_not_set=1,\
MPIR_Scatter_threshold=MPIR_ScatterMPIRParmGetInt(\
"MPICH:SCATTER_THRESHOLD", 2048 )) : MPIR_Scatter_threshhold )
\end{verbatim}
The name (the first argument) is provided to allow for consistency
checking against the registered names. And in practice, the "2048"
would itself be a macro, set in an include file in the coll directory
(not in src/include/mpiimpl.h).
The parm init routine would use the name in the register routine (the
first argument) in the following way:
\begin{verbatim}
The : in the name separates the prefix from the rest of the name.
The name after the : may be used in an init file
The name with the : replaced by an \_ may be used as an environment
variable or in the init file.
\end{verbatim}
A more sophisticated approach would use code similar to (and shared
with) the error message extraction to synthesize the
\texttt{MPIR\_Parm\_register\_collective()} routine. This would be requried to
address goal 7 above. An alternative would be to simply detect an
inconsistency by having a program read the source code and check that
the registration and use calls match.
The script
\texttt{maint/extractstrings}, combined with code similar to that for
error messages in \texttt{maint/extracterrmsgs}, can be used to
automate the collection of registration information from the source code.
% ---
% Old text from December 2004
%% 1) Control of parameterization
%% The purpose of this part is to enable the configure, compile, and
%% run-time control of parameters in the collective implementation.
%% Goals for the parameter handling routines:
%% 1) All values (at least in comm_world) must be the same (the
%% collective routines expect the same parameter values)
%% a) For communicators in different comm worlds (e.g., created by
%% spawn or connect/accept), it may be necessary to perform a
%% separate step when the communicator is created to negotiate
%% parameter choices. This may require a communicator-creation
%% hook.
%% 2) The parameter routine should provide the following choices, in
%% order of decreasing priority:
%% a) Explicit control from within the routine (i.e., by a routine
%% call)
%% b) Environment variable on rank-0 communicator
%% c) Value is user-specified parameter file (e.g., ~/.mpichrc)
%% d) Value in site-specified parameter file (e.g.,
%% /usr/local/mpich/.mpichrc)
%% e) Compile-time value set at configure time (e.g.,
%% --enable-collective-config=file)
%% f) Compile-time value set within the collective code (the
%% default values)
%% The choice of parameter value should also have some configure-time
%% control to allow a trade-off between generality and absolute
%% best performance (particular for startup, if we want to allow
%% configurations files to be read).
%% 3) Parameter documentation should (at least in part) be provided
%% where the parameters are used.
%% 4) Overhead of using parameter routines should be low; particularly
%% after the first use (if that matters; that is, the first time
%% may involve an initialization phase)
%% 5) Parameter routines should be a component, cleanly initialized by
%% MPI_Init/MPI_Init_thread and shut down by MPI_Finalize.
%% 6) Parameter types that must be handled include integer (e.g.,
%% message size, group size). Others should be allowed for (such
%% as arrays of integers, characters). We should avoid
%% floating-point values because of possible problems in using them
%% consistently, particularly in a heterogeneous environment.
%% Proposed Design that meets these goals
%% 1) => Some setup at MPI_Init time to ensure all processes have
%% consistent values. Also requires hook for communicators that span
%% multiple MPI_COMM_WORLDs.
%% 2) => registration of names so that the data can be acquired scalably
%% (e.g., read from file, environment variable on process with rank zero)
%% 3) => a description string either in the call or in the structured
%% comment at the point of use.
%% 4) also => an initialization step
%% 5) => using the finalize callback. To make them a component for the
%% initialization step, there needs to be some mechanism to load modules
%% Based on these, I propose the following:
%% Initialization module:
%% This routine is called from MPI_Init. It details all parameters used
%% by the collective routines, the associated environment variables, and
%% the description strings. It might look something like:
%% int MPIR_Parm_register_collective( ) {
%% rc = MPIR_Parm_register_int( "MPICH:SCATTER_THRESHOLD",
%% &MPIR_Scatter_threshold,
%% 2048,
%% "Maximum size of messages sent using \
%% doubling in scatter algorithm with a default of %d" );
%% ...
%% return MPI_SUCCESS;
%% }
%% The purpose of this routine is to communicate, at runtime, the names
%% of the parameters that some part of the system may need.
%% Once all of these routines are called (e.g., for any set of modules
%% that choose to use this method to acquire parameters), the routine
%% int MPIR_Parm_init( )
%% is called from within MPI_Init/MPI_Init_thread. This is collective
%% over COMM_WORLD and performs the
%% necessary file reads, environment variable reads and broadcasts the
%% results to all processes. It handles all registered names, from all
%% modules.
%% In the code, all parameters are accessed through a macro that allows
%% the values to be either compile-time or run-time constants
%% MPIR_PARM_GET_INT( "MPICH:SCATTER_THRESHOLD", MPIR_Scatter_threshold, 2048 )
%% This expands into either:
%% 2048 (compile-time only)
%% MPIR_Scatter_threshold (run-time)
%% or even (runtime, lazy evaluation):
%% (MPIR_Scatter_threshhold_not_set ? \
%% (MPIR_Scatter_threshhold_not_set=0,MPIR_Scatter_threshold=MPIR_ScatterMPIRParmGetInt("MPICH:SCATTER_THRESHOLD",
%% 2048 )) : MPIR_Scatter_threshhold )
%% The name (the first argument) is provided to allow for consistency
%% checking against the registered names. And in practice, the "2048"
%% would itself be a macro, set in an include file in the coll directory
%% (not in src/include/mpiimpl.h).
%% The parm init routine would use the name in the register routine (the
%% first argument) in the following way:
%% The : in the name separates the prefix from the rest of the name.
%% The name after the : may be used in an init file
%% The name with the : replaced by an _ may be used as an environment
%% variable or in the init file.
%% A more sophisticated approach would use code similar to (and shared
%% with) the error message extraction to synthesize the
%% MPIR_Parm_register_collective() routine.
\end{document}