\documentclass{report}
\usepackage{graphics}
\usepackage[dvipdfm]{hyperref}
%
% This is the new (September 2005) MPICH design document.  The plan is
% to make this available both as a PDF document and split on the web in
% an easy to read fashion.  We can use latex2html or tohtml to provide a 
% simple version of this, and that may be enough.  However, to retain the
% option of generating web pages directly from this source, only simple 
% Latex should be used, and only the new forms (e.g., \texttt{...} instead of 
% {\tt ...}).
%


\makeindex

\begin{document}

\markright{MPICH Design Document}

\title{MPICH Design Document}
\author{William D. Gropp and Rajeev Thakur}
\maketitle

\pagenumbering{roman}
\tableofcontents
\clearpage


%\raggedright
%% raggedright resets parindent
%\parindent 1em
%% no parskip when parindent used
%\parskip 0pt

\pagenumbering{arabic}
\pagestyle{headings}

\part{MPICH organization for users}

\chapter{Overview}
     <<up here, explain that MPICH is a framework for MPI implementation with many possible choices, including specialized communication devices and interfaces to process managers, not ch3+hydra.  Duplicate this discussion in the top-level entry for developers>>

\chapter{Communication Devices}

\chapter{Process Managers}

    other options 
    <<this chapter partially overlaps the installation manual, but that manual doesn't cover much of the motivation for the design>>

\chapter{Help Resources}

\section{Frequently Asked Questions (FAQ)}

\section{Diagnostic Programs}

\section{Buglist archives}


\part{MPICH organization for developers and hackers}


    <<each section (as appropriate) contains a rationale for the choices and a brief discussion of alternatives and why they were not chosen>>

\chapter{Goals of MPICH}


\section{Principles}
         No duplicated code (no cut and paste programming) 
         Uniform appearance of code
         Uniform user interface for controls (e.g., standardized parameter handling, output, naming)
         Coverage and Testing

\section{Major components}
         devices
              the ch3 device and channels
         process managers and PM interface
         logging
         collectives
         topology

\section{Build System}
         rationale
              why not automake
              why not libtool
         using the build system
         using your own configure/make/build scripts

\section{Coding Style}
         Code template for uniform error checking and reporting (e.g.,
         common \texttt{fn\_fail} target) 
         Macros for common operations
              error reporting
              memory allocation
              debugging
         Safer or more efficient replacements for common routines
              safe string routines
              memcpy hooks
         Error reporting
              Rationale
               Adding new error classes and codes
         Tags for coverage analysis
         Take advantage of compiler features to identify potential
         problems, including warnings about easily misused features in
         C (such as assignments within if tests).  
	 Need to put somewhere - don't fix missing prototype messages
         by adding a prototype to the C file; there should always be
         one prototype in a header file somewhere.

\section{Major Structures}
      request, comm, group, etc.

\section{Selecting Features at Compile Time}
     use of macros to control which code is used
     list of all macros

\section{Language Bindings}
           rationale for buildiface
           Outstanding issues

\section{Scripts and Standards for user interfaces}
           command line and environment variables

\chapter{Adding your own implementation of a component}
         General mechanism (\texttt{--with-<<component-name>>=directory}, e.g., \texttt{--with-pm=/home/me/mypm})
              configure and setup scripts
              standardized variables (\texttt{MPI\_CFLAGS}, \texttt{CFLAGS}, etc.)
         Specific components (describe, explain the API/ABI, and how to work with it):

Two types 

configure/compile-time

link/runtime

\section{Specifying Components during Configure}

\section{Specifying Components at Runtime}

It is possible to override the default routines for any component at
runtime by calling one of the following routines.

\begin{verbatim}
    MPIX_SetMethod( component, object, name, function, communicator,
    version )
\end{verbatim}

where
\begin{description}
\item[component] identifier for the component (e.g., topology,
    collectives)
\item[object] object to change; null if changing the defaults for
    all objects
\item[name] name of the method (probably a character string naming the function)
\item[function] function to use
\item[communicator] communicator scope of this change (MPI\_COMM\_SELF
    and MPI\_COMM\_WORLD are popular choices)
\item[version] version of the interface (see below)
\end{description}

    This call is collective over the communicator.  The intent of the 
    communicator option is to allow consistency checking for changes
    that must be consistent across processes (e.g., changing the
    collective algorithms).

    The version number is used to ensure that the version of the
    interface matches the one used in the library.
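
As a rough illustration of the bookkeeping behind such a call, the
following C sketch keeps a table of default routines keyed by component
and method name, and rejects a request whose version does not match.
It is only a sketch under stated assumptions: every name beginning with
\texttt{sketch\_} or \texttt{SKETCH\_} is hypothetical, and the object
and communicator arguments (and the collective consistency check) are
omitted.

\begin{verbatim}
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: none of these names are part of MPICH. */
#define SKETCH_IFACE_VERSION 1
#define SKETCH_MAX_METHODS   16

typedef int (*sketch_fn)(void);

typedef struct {
    const char *component;   /* e.g., "collectives" */
    const char *name;        /* e.g., "bcast" */
    sketch_fn   function;    /* routine that implements the method */
} sketch_method;

sketch_method sketch_defaults[SKETCH_MAX_METHODS];
int sketch_nmethods = 0;

/* Return the table index for component/name, or -1 if absent. */
int sketch_find(const char *component, const char *name)
{
    int i;
    for (i = 0; i < sketch_nmethods; i++) {
        if (strcmp(sketch_defaults[i].component, component) == 0 &&
            strcmp(sketch_defaults[i].name, name) == 0)
            return i;
    }
    return -1;
}

/* Replace (or add) the default routine for component/name.
   Returns 0 on success, -1 on a version mismatch or a full table. */
int sketch_set_method(const char *component, const char *name,
                      sketch_fn function, int version)
{
    int i;

    if (version != SKETCH_IFACE_VERSION) return -1;
    i = sketch_find(component, name);
    if (i < 0) {
        if (sketch_nmethods == SKETCH_MAX_METHODS) return -1;
        i = sketch_nmethods++;
        sketch_defaults[i].component = component;
        sketch_defaults[i].name      = name;
    }
    sketch_defaults[i].function = function;
    return 0;
}

/* Look up the current routine; NULL if none has been set. */
sketch_fn sketch_get_method(const char *component, const char *name)
{
    int i = sketch_find(component, name);
    return (i < 0) ? NULL : sketch_defaults[i].function;
}

/* Trivial example routine to install. */
int sketch_example_fn(void) { return 42; }
\end{verbatim}

In this sketch a mismatched version leaves the table untouched, so a
caller built against a different interface cannot silently install a
routine with the wrong signature.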

A related call is

\begin{verbatim}
    MPIX_SetAllMethods( component, object, struct-of-functions*,
    communicator, version )
\end{verbatim}

This is similar to SetMethod, except all functions are set from those
defined in the struct-of-functions.  The version is very important
here as it ensures that the struct is known.
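
The role of the version for a struct of functions can be sketched as
follows.  This is a minimal illustration, assuming a made-up topology
interface with two routines; the names \texttt{sketch\_topo\_fns},
\texttt{sketch\_set\_all\_methods}, and \texttt{SKETCH\_TOPO\_VERSION}
are hypothetical, and the object and communicator arguments are left out.

\begin{verbatim}
#include <stddef.h>

/* Hypothetical sketch; these names are not part of MPICH. */
#define SKETCH_TOPO_VERSION 2

/* Version 2 of a made-up topology interface. */
typedef struct {
    int (*cart_map)(int nranks);
    int (*graph_map)(int nranks);
} sketch_topo_fns;

sketch_topo_fns sketch_topo_current;   /* all NULL at start */

/* Install every routine in fns at once.  The version check guards
   against a caller whose struct has a different layout. */
int sketch_set_all_methods(const sketch_topo_fns *fns, int version)
{
    if (fns == NULL || version != SKETCH_TOPO_VERSION) return -1;
    sketch_topo_current = *fns;
    return 0;
}

/* Trivial example routine used to fill in the struct. */
int sketch_identity_map(int nranks) { return nranks; }
\end{verbatim}

Because the struct is copied as a whole, a caller compiled against an
older layout would otherwise place function pointers in the wrong
slots; refusing the call on a version mismatch avoids that.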

One additional call is

\begin{verbatim}
    MPIX_DLLLoadMethods( component, object, dll-name, communicator,
                         version )
\end{verbatim}

This loads the methods from a dynamically loaded shared object.  It is
collective over the communicator.

Implementation note: the ``component'' in these calls can be the name
of a routine that 

\section{Collectives}
\section{Topology}
\section{Logging}
    \texttt{--with-logging=/pathname}
looks for a \texttt{setup\_logging} script.  In addition, the
\texttt{Makefile} will be invoked.

    clean, dist-clean, maintainer-clean, library target?

    Must provide an mpilogging.h file (contents defined as...)

    Example (give example directory on web)

    Builtin versions...

\section{PM and PMI}

mpiexec 
    -pmiargs host port executable
special args used to allow singleton init.

\section{Name Server}

    \texttt{--with-nameserver=...}

\section{mpid}
\section{ch3-channel}

          Components still evolving (a placeholder for things that we want to do)

\chapter{Working with MPICH itself}
          <<for people that edit MPICH rather than use the component interface>>
          Coding standards
          Keeping configure clean
          Keeping MPICH modular
          

\section{Parameters within the MPICH code}

   The purpose of this part is to enable the configure, compile, and
   run-time control of parameters in the collective implementation.

   Goals for the parameter handling routines:



\begin{enumerate}
\item All values (at least in comm\_world) must be the same (the
      collective routines expect the same parameter values)


      For communicators in different comm worlds (e.g., created by
      spawn or connect/accept), it may be necessary to perform a
      separate step when the communicator is created to negotiate
      parameter choices.  This may require a communicator-creation
      hook.

      The most likely implementation of this step is to check that the
      different comm\_worlds have compatible (typically identical)
      values for the parameters.  There may be a
      \texttt{MPIR\_Parm\_check\_consistent}(new intercomm) call for this.

\item The parameter routine should provide the following choices, in
      order of decreasing priority:
     
\begin{enumerate}
\item Explicit control from within the routine (i.e., by a routine
      call)
\item Command line parameter on mpiexec
\item Environment variable on the process with rank 0 in MPI\_COMM\_WORLD.
\item Value in user-specified parameter file (e.g., \texttt{\~{}/.mpichrc})
\item Value in site-specified parameter file (e.g.,
      \texttt{/usr/local/mpich/.mpichrc})
\item Compile-time value set at configure time (e.g.,
      \texttt{--enable-collective-config=file}) 
\item Compile-time value set within the collective code (the
      default values)
\end{enumerate}
      The choice of parameter value should also have some configure-time
      control to allow a trade-off between generality and absolute
      best performance (particularly for startup, if we want to allow
      configuration files to be read).

      Note that even process managers that can deliver the same
      environment variable to all processes often allow the user to
      change that behavior (e.g., with a command-line option to
      prevent environment variables from being sent to other
      processes), so we cannot assume that the values of the
      environment variables are the same on all processes without some
      additional information (e.g., the process manager could tell us
      that the environment variables are the same on all processes).

\item Parameter documentation should (at least in part) be provided
      where the parameters are used.

\item Overhead of using parameter routines should be low, particularly 
      after the first use (if that matters; that is, the first use
      may involve an initialization phase).

\item Parameter routines should be a component, cleanly initialized by
      MPI\_Init/MPI\_Init\_thread and shut down by MPI\_Finalize.

\item Parameter types that must be handled include integer (e.g.,
      message size, group size).  Others should be allowed for (such
      as arrays of integers, characters) as needed.  We should avoid
      floating-point values because of possible problems in using them
      consistently, particularly in a heterogeneous environment.

      Note that this does not say anything about how the values may be
      represented in a parameter database, as that is an
      implementation issue.  However, it does say that the value must
      be delivered to the using routine in the form that it needs.  We
      do not want to call strtol() every time a routine needs an
      integer (atoi should not be used because it has no way to
      indicate that the value contains non-digit characters).

\item The design should ensure that changes need be made in only one
      place or alternately that inconsistencies (e.g., between an
      initialization and a use) are detectable before runtime (before
      runtime so that the problem can be detected and fixed without
      depending on running a particular test case).

\end{enumerate}
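
Goals 2 and 6 above can be illustrated together with a short C sketch:
each source supplies an optional string, the strings are examined in
order of decreasing priority, and each is converted once with
\texttt{strtol} so that malformed input is detected.  This is a minimal
sketch under stated assumptions, not the MPICH implementation; the names
\texttt{parm\_source}, \texttt{parm\_parse\_int}, and
\texttt{parm\_resolve\_int} are hypothetical, and skipping an invalid
value in favor of the next source is just one possible policy.

\begin{verbatim}
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical sketch; these names are not part of MPICH. */

/* Textual sources in order of decreasing priority (goal 2). */
enum parm_source {
    PARM_SRC_CALL, PARM_SRC_CMDLINE, PARM_SRC_ENV,
    PARM_SRC_USER_FILE, PARM_SRC_SITE_FILE, PARM_SRC_CONFIGURE,
    PARM_SRC_COUNT
};

/* Convert text to an int, rejecting NULL or empty strings, trailing
   non-digit characters, and out-of-range values (atoi cannot report
   any of these).  Returns 0 on success and -1 on failure. */
int parm_parse_int(const char *text, int *value)
{
    char *end;
    long v;

    if (text == NULL || *text == '\0') return -1;
    errno = 0;
    v = strtol(text, &end, 10);
    if (errno == ERANGE || *end != '\0' || v < INT_MIN || v > INT_MAX)
        return -1;
    *value = (int)v;
    return 0;
}

/* text[i] is the string supplied by source i, or NULL if that source
   supplied nothing.  The first source that yields a valid integer
   wins; otherwise the default compiled into the code is used. */
int parm_resolve_int(const char *text[PARM_SRC_COUNT],
                     int builtin_default)
{
    int i, value;

    for (i = 0; i < PARM_SRC_COUNT; i++) {
        if (parm_parse_int(text[i], &value) == 0) return value;
    }
    return builtin_default;
}
\end{verbatim}

The conversion happens once, when the value is resolved; the using
routine then reads a plain \texttt{int}, which keeps the per-use
overhead low.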
Proposed Design that meets these goals

\begin{enumerate}
\item => Some setup at MPI\_Init time to ensure all processes have
consistent values.  Also requires hook for communicators that span
multiple MPI\_COMM\_WORLDs.
\item => registration of names so that the data can be acquired scalably
(e.g., read from file, environment variable on process with rank zero)
\item => a description string either in the call or in the structured
   comment at the point of use.
\item also => an initialization step 
\item => using the finalize callback.  To make them a component for the
initialization step, there needs to be some mechanism to load modules
\item ?
\item => Either a single point of use (which introduces efficiency
problems) or some sort of source-code preprocessing.  
\end{enumerate}

Based on these, I propose the following:

Initialization module:\\
This routine is called from MPI\_Init.  It details all parameters used
by the collective routines, the associated environment variables, and
the description strings.  It might look something like:

\begin{verbatim}
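int MPIR_Parm_register_collective( void ) {
    int rc;

    /* Adjacent string literals are concatenated by the compiler, so
       no line-continuation whitespace ends up in the message. */
    rc = MPIR_Parm_register_int( "MPICH:SCATTER_THRESHOLD", 
                                 &MPIR_Scatter_threshold, 
                                 2048,
        "Maximum size of messages sent using "
        "doubling in scatter algorithm with a default of %d" );
    if (rc != MPI_SUCCESS) return rc;
    ...

    return MPI_SUCCESS;
}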
\end{verbatim}

The purpose of this routine is to communicate, at runtime, the names
of the parameters that some part of the system may need.  The
parameters to this routine are:
\begin{description}
\item[\texttt{MPICH:SCATTER\_THRESHOLD}]The name of the parameter.  The
    environment variable that might correspond to this is
    \texttt{MPICH\_SCATTER\_THRESHOLD}

\item[\texttt{\&MPIR\_Scatter\_threshold}]The address of an int (since this is
  \texttt{MPIR\_Parm\_register\_int}) that will contain the value.
  This integer can be accessed directly by the code; see the example of the use
    below.

\item[2048]The default value if no value is provided

\item["Maximum ... \%d..."]A documentation string including a format
    specifier for the default value.  This will be used to create
    help information on the parameter directly from the source code.
\end{description}
Possible additional features would be a valid range (e.g., 0 to 64K)
or a routine to test for valid input.
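
To make the registration step and the optional valid range concrete,
here is a minimal C sketch of a registration table.  It is an
illustration only: the routine names \texttt{sketch\_parm\_register\_int}
and \texttt{sketch\_parm\_set\_int}, the extra range arguments, and the
fixed table size are assumptions, not the interface proposed above.

\begin{verbatim}
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch; these names are not part of MPICH. */
#define SKETCH_PARM_MAX 64

typedef struct {
    const char *name;    /* e.g., "MPICH:SCATTER_THRESHOLD" */
    int        *value;   /* variable read directly by the using code */
    int         minval;  /* valid range, inclusive */
    int         maxval;
    const char *doc;     /* documentation string */
} sketch_parm_entry;

sketch_parm_entry sketch_parm_table[SKETCH_PARM_MAX];
int sketch_parm_count = 0;

/* Record a parameter and store its default in *value.
   Returns 0 on success, -1 if the table is full or the default
   lies outside [minval, maxval]. */
int sketch_parm_register_int(const char *name, int *value, int defval,
                             int minval, int maxval, const char *doc)
{
    sketch_parm_entry *e;

    if (sketch_parm_count == SKETCH_PARM_MAX) return -1;
    if (defval < minval || defval > maxval) return -1;
    e = &sketch_parm_table[sketch_parm_count++];
    e->name = name;     e->value = value;
    e->minval = minval; e->maxval = maxval;
    e->doc = doc;
    *value = defval;
    return 0;
}

/* Apply a value obtained from a file or the environment.
   Returns -1 (leaving the old value) if the name is unknown or the
   new value is out of range. */
int sketch_parm_set_int(const char *name, int newval)
{
    int i;

    for (i = 0; i < sketch_parm_count; i++) {
        sketch_parm_entry *e = &sketch_parm_table[i];
        if (strcmp(e->name, name) == 0) {
            if (newval < e->minval || newval > e->maxval) return -1;
            *e->value = newval;
            return 0;
        }
    }
    return -1;
}
\end{verbatim}

Because the table stores the address of the variable, later code can
read the variable directly without going through the table again.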

Once all of these routines are called (e.g., for any set of modules
that choose to use this method to acquire parameters), the routine

\begin{verbatim}
int MPIR_Parm_init( )
\end{verbatim}

is called from within MPI\_Init/MPI\_Init\_thread.  This call is collective
over MPI\_COMM\_WORLD; it performs the necessary file reads and
environment variable reads and broadcasts the results to all processes.
It handles all registered names, from all modules.

In the code, all parameters are accessed through a macro that allows
the values to be either compile-time or run-time constants:

\begin{verbatim}
MPIR_PARM_GET_INT( "MPICH:SCATTER_THRESHOLD", MPIR_Scatter_threshold,2048 )
\end{verbatim}

This expands into either:
\begin{verbatim}
    2048 (compile-time only)
\end{verbatim}
or
\begin{verbatim}
    MPIR_Scatter_threshold (run-time)
\end{verbatim}
or even (runtime, lazy evaluation):
\begin{verbatim}
    (MPIR_Scatter_threshold_not_set ? \
        (MPIR_Scatter_threshold_not_set=0,\
        MPIR_Scatter_threshold=MPIR_ScatterMPIRParmGetInt(\
          "MPICH:SCATTER_THRESHOLD", 2048 )) : MPIR_Scatter_threshold )
\end{verbatim}

The name (the first argument) is provided to allow for consistency
checking against the registered names.  In practice, the \texttt{2048}
would itself be a macro, set in an include file in the coll directory
(not in \texttt{src/include/mpiimpl.h}).  
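
The two simplest expansions can be selected with a preprocessor switch,
as in the following C sketch.  The configure-time symbol
\texttt{SKETCH\_PARM\_COMPILE\_TIME} and the use-site routine
\texttt{sketch\_use\_doubling} are hypothetical, and the variable is
defined here only so that the fragment is self-contained.

\begin{verbatim}
/* Hypothetical sketch; SKETCH_PARM_COMPILE_TIME and
   sketch_use_doubling are assumptions, not MPICH names. */

/* In the real design this would be filled in by MPIR_Parm_init. */
int MPIR_Scatter_threshold = 2048;

#ifdef SKETCH_PARM_COMPILE_TIME
/* Compile-time only: the use site becomes the literal default. */
#define MPIR_PARM_GET_INT(name, var, defval) (defval)
#else
/* Run-time: the use site reads the registered variable directly. */
#define MPIR_PARM_GET_INT(name, var, defval) (var)
#endif

/* Example use site: returns 1 if a message of nbytes is small
   enough for the doubling algorithm, 0 otherwise. */
int sketch_use_doubling(int nbytes)
{
    return nbytes <= MPIR_PARM_GET_INT("MPICH:SCATTER_THRESHOLD",
                                       MPIR_Scatter_threshold, 2048);
}
\end{verbatim}

In both expansions the name argument is discarded by the preprocessor,
so it costs nothing at run time; it exists for tools that check uses
against registrations.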

The parm init routine would use the name in the register routine (the
first argument) in the following way:

\begin{verbatim}
The : in the name separates the prefix from the rest of the name.
The name after the : may be used in an init file.
The name with the : replaced by an _ may be used as an environment
variable or in the init file.
\end{verbatim}
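
These naming rules can be sketched in C as two small helpers.  The
helper names \texttt{sketch\_parm\_env\_name} and
\texttt{sketch\_parm\_file\_name} are hypothetical; only the mapping
itself (replace the \texttt{:} by \texttt{\_}, or take the part after
the \texttt{:}) comes from the rules above.

\begin{verbatim}
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch; these names are not part of MPICH. */

/* Write the environment-variable form of a registered name into out
   by replacing the first ':' with '_'.  Returns 0 on success, -1 if
   the name has no ':' or does not fit in out (outlen bytes,
   including the terminating NUL). */
int sketch_parm_env_name(const char *name, char *out, size_t outlen)
{
    const char *colon = strchr(name, ':');
    size_t len = strlen(name);

    if (colon == NULL || len + 1 > outlen) return -1;
    memcpy(out, name, len + 1);
    out[colon - name] = '_';
    return 0;
}

/* Return the part of the name after the ':' (the short form usable
   in an init file), or NULL if there is no ':'. */
const char *sketch_parm_file_name(const char *name)
{
    const char *colon = strchr(name, ':');
    return colon ? colon + 1 : NULL;
}
\end{verbatim}

For the registered name \texttt{MPICH:SCATTER\_THRESHOLD}, the first
helper produces \texttt{MPICH\_SCATTER\_THRESHOLD} and the second
returns \texttt{SCATTER\_THRESHOLD}.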

A more sophisticated approach would use code similar to (and shared
with) the error message extraction to synthesize the
\texttt{MPIR\_Parm\_register\_collective()} routine.  This would be required to
address goal 7 above.  An alternative would be to simply detect an
inconsistency by having a program read the source code and check that
the registration and use calls match.  

The script
\texttt{maint/extractstrings}, combined with code similar to that for 
error messages in \texttt{maint/extracterrmsgs}, can be used to
automate the collection of registration information from the source code.
% ---
% Old text from December 2004
%% 1) Control of parameterization 
%%     The purpose of this part is to enable the configure, compile, and 
%%     run-time control of parameters in the collective implementation. 

%%     Goals for the parameter handling routines: 

%%     1) All values (at least in comm_world) must be the same (the 
%%        collective routines expect the same parameter values) 

%%        a) For communicators in different comm worlds (e.g., created by 
%%        spawn or connect/accept), it may be necessary to perform a 
%%        separate step when the communicator is created to negotiate 
%%        parameter choices.  This may require a communicator-creation 
%%        hook. 

%%     2) The parameter routine should provide the following choices, in 
%%        order of decreasing priority: 

%%        a) Explicit control from within the routine (i.e., by a routine 
%%        call) 
%%        b) Environment variable on rank-0 communicator 
%%        c) Value is user-specified parameter file (e.g., ~/.mpichrc) 
%%        d) Value in site-specified parameter file (e.g., 
%%        /usr/local/mpich/.mpichrc) 
%%        e) Compile-time value set at configure time (e.g., 
%%        --enable-collective-config=file) 
%%        f) Compile-time value set within the collective code (the 
%%        default values) 

%%        The choice of parameter value should also have some configure-time 
%%        control to allow a trade-off between generality and absolute 
%%        best performance (particular for startup, if we want to allow 
%%        configurations files to be read). 

%%     3) Parameter documentation should (at least in part) be provided 
%%        where the parameters are used. 

%%     4) Overhead of using parameter routines should be low; particularly 
%%        after the first use (if that matters; that is, the first time 
%%        may involve an initialization phase) 

%%     5) Parameter routines should be a component, cleanly initialized by 
%%        MPI_Init/MPI_Init_thread and shut down by MPI_Finalize. 

%%     6) Parameter types that must be handled include integer (e.g., 
%%        message size, group size).  Others should be allowed for (such 
%%        as arrays of integers, characters).  We should avoid 
%%        floating-point values because of possible problems in using them 
%%        consistently, particularly in a heterogeneous environment. 

%% Proposed Design that meets these goals 

%% 1) => Some setup at MPI_Init time to ensure all processes have 
%% consistent values.  Also requires hook for communicators that span 
%% multiple MPI_COMM_WORLDs. 
%% 2) => registration of names so that the data can be acquired scalably 
%% (e.g., read from file, environment variable on process with rank zero) 
%% 3) => a description string either in the call or in the structured 
%%     comment at the point of use. 
%% 4) also => an initialization step 
%% 5) => using the finalize callback.  To make them a component for the 
%% initialization step, there needs to be some mechanism to load modules 

%% Based on these, I propose the following: 

%% Initialization module: 
%% This routine is called from MPI_Init.  It details all parameters used 
%% by the collective routines, the associated environment variables, and 
%% the description strings.  It might look something like: 

%% int MPIR_Parm_register_collective( ) { 
%%      rc = MPIR_Parm_register_int( "MPICH:SCATTER_THRESHOLD", 
%%                                   &MPIR_Scatter_threshold, 
%%                                   2048, 
%%      "Maximum size of messages sent using \ 
%%      doubling in scatter algorithm with a default of %d" ); 
%%      ... 

%%      return MPI_SUCCESS; 
%% } 

%% The purpose of this routine is to communicate, at runtime, the names 
%% of the parameters that some part of the system may need. 

%% Once all of these routines are called (e.g., for any set of modules 
%% that choose to use this method to acquire parameters), the routine 

%% int MPIR_Parm_init( ) 

%% is called from within MPI_Init/MPI_Init_thread.  This is collective 
%% over COMM_WORLD and performs the 
%% necessary file reads, environment variable reads and broadcasts the 
%% results to all processes.  It handles all registered names, from all 
%% modules. 

%% In the code, all parameters are accessed through a macro that allows 
%% the values to be either compile-time or run-time constants 

%% MPIR_PARM_GET_INT( "MPICH:SCATTER_THRESHOLD", MPIR_Scatter_threshold, 2048 ) 

%% This expands into either: 

%% 2048 (compile-time only) 

%% MPIR_Scatter_threshold (run-time) 

%% or even (runtime, lazy evaluation): 

%% (MPIR_Scatter_threshhold_not_set ? \ 
%%      (MPIR_Scatter_threshhold_not_set=0,MPIR_Scatter_threshold=MPIR_ScatterMPIRParmGetInt("MPICH:SCATTER_THRESHOLD", 
%% 2048 )) : MPIR_Scatter_threshhold ) 

%% The name (the first argument) is provided to allow for consistency 
%% checking against the registered names.   And in practice, the "2048" 
%% would itself be a macro, set in an include file in the coll directory 
%% (not in src/include/mpiimpl.h). 

%% The parm init routine would use the name in the register routine (the 
%% first argument) in the following way: 

%% The : in the name separates the prefix from the rest of the name. 
%% The name after the : may be used in an init file 
%% The name with the : replaced by an _ may be used as an environment 
%% variable or in the init file. 

%% A more sophisticated approach would use code similar to (and shared 
%% with) the error message extraction to synthesize the 
%% MPIR_Parm_register_collective() routine. 

\end{document}