Blob Blame History Raw
MPI
- spawn/attach
- communicators
- pt2pt requests
- collective
- RMA
- error (handling, FT, reporting)

MPID
- messages
  - tradition MPI message
  - one-sided operations
  - control?

- streams
- process management (via BNR)

Communication Methods
- TCP
- VIA
- Shared Memory
- Loopback
- IMPI

===============================================================================

MPI layer

--------------------

Operations

- point-to-point
  - requests
  - datatypes
  - communicators
  - status
  - errors

- collective
  - datatypes
  - communicators
  - errors

- process management
  - communicators
  - info
  - errors

- RMA
  - datatypes
  - windows
  - communicators
  - groups
  - epoch
  - errors

- I/O
  - depends on
    - files
    - datatypes
    - requests
    - info
    - status
    - errors
  * implemented via ROMIO
    - dependent only on MPI functions
    - future enhancements may use low-level interfaces

- topology
  - communicators
  - errors
  * can be implemented entirely at the MPI layer

- generalized requests
  - errors
  * can be implemented entirely at the MPI layer

--------------------

Structures

- requests
- datatypes
- communicators
  - groups
- groups
  * are groups modified or augmented by low layers?
- windows
- files
  * defined and implemented via ROMIO
    - future enhancements may require access lower layers
- status
- errors
- attributes
  * can be defined and operations implemented entirely at the MPI layer
  - communicator
  - datatypes
  - windows

- info
  * can be defined and operations implemented entirely at the MPI layer

===============================================================================

MPID

--------------------

Operations

- point-to-point

- collective operations

- process management

- RMA

- generalized requests


Structures


--------------------

Concepts

- MPI buffer movement (moving buffers defined by an address, count and
  datatype)

- internal buffer management

- Connection management

  - virtual connection structures

  - low-level connnection management (sockets, etc.) should be handled
    entirely by the device and probably driven by a state machine


===============================================================================

Multi-method design

--------------------

Device-level objects

- group

  - data structures

  - methods

    - set_connection(group, rank, vc_ptr) - associate pointer to virtual
      connection structure with a (group,rank)
  
    - get_connection(group, rank) - returns pointer to virtual connection
      structure associated with (group,rank)

- communicators

  - data structures

    - group

  - methods

    - set_connection(dcomm, rank, vc_ptr) - associate pointer to virtual
      connection structure with a (dcomm,rank)
  
    - get_connection(dcomm, rank) - returns pointer to virtual connection
      structure associated with (dcomm,rank)

- virtual connections

  - alloc() - returns a pointer to a virtual connection structure

  - add_ref(vc) - increments the reference count (atomically)

  - release() - decrements the reference count; if the reference count reaches
    zero, the structure is freed

  NOTE: It may be useful to be able to locate a virtual connection based on a
  process group ID and rank, in part so we can detect when multiple virtual
  connections might be formed between a pair of processes.

  

can connect/accept be called multiple times between a set of processes?

--------------------

Method-level functions

Method descriptors are strings that are used to describe the capabilities of
the methods.  These descriptors can then be used to determine if two processes
miight be capable of "talking" using the method in question.  We say "might be
capable" because in the case of a method like VIA, it may be impossible provide
enough information in the descriptor to determine in two processes can talk.
It may be necessary to simply attempt to form the connection.  This implies
that binding to a particular protocol may need to be deferred until we are
ready to form a real connection.  However, some methods, such as shared memory,
can provide enough information and thus can be bound immediately.

- query/get_descriptor()

- match_descriptors()

===============================================================================


MPI_Init()

- create basic datatypes

  MPIR_Datatype_init()
  {
      foreach dt (all basic datatypes)
      {
          MPID_Datatype_init(dt, ...)
      }
  }


- initialize device

  - BNR initialization
  
    BNR_Init()
    BNR_Get_group(&my_bnr_group)
    BNR_Get_size(my_bnr_group, &size)
    BNR_Get_rank(my_bnr_group, &rank)
    BNR_Get_parent(&parent_bnr_group)
    BNR_Merge(my_bnr_group, parent_bnr_group, &inter_bnr_group)
  
  - loop through methods

    - initialize method
  
    - query for descriptor of method's capabilities
  
    Q: what about dymanicly loaded methods?  do they have to be initialized now
    or can they be added later?
  
  - publish capabilities of all known methods

  - initialize AQ, buffer management, etc.

- establish MPI_COMM_WORLD

  - create MPI_GROUP_WORLD (internal) from BNR my_group, etc.

    stores BNR info in group structure
    allocates virtual connection structures
    initializes virtual connections to stubs

  - create MPI_COMM_WORLD from MPI_GROUP_WORLD

- establish inter-communicator with parent (if parent exists)
  
  - create inter_group from inter_bnr_group
  
  - create inter-communicator from inter_group
  
- create inter_group from inter_bnr_gorup

- create inter-communicator from inter_group


MPI_Spawn()
{
    BNR_Open_group(my_bnr_group, &new_bnr_group)
    BNR_Spawn(remote_bnr_group, N, ..., func)
    BNR_Close(remote_bnr_group)
    BNR_Merge(my_bnr_group, remote_bnr_group, &inter_bnr_group);
}




need a BNR_Group_ID which is globally unique in order to implement MPI_Connect/Attach


------------------------------------------------------------------------

Structures that cross layers

- many of the information structures that are passed through the layers contain
  data sections from multiple layers

- ne option is to include device (and method) include files in the MPICH layer
  include file.  Rob and Brian feel this would be bad.

- David suggests that the structure definitions be supplied by the device
  header files and that method specific information be included in those
  definition using unions.  Rob and Brian feel this is ugly (from a software
  engineering standpoint).

- Rob and Brian suggest having each layer define their own portion of the
  structure.  The definitions of the higher layers are known to the lower
  layers, but not vice versa.  To increase cache locality and reduce memory
  allocation, the device (and methods) report the amount of space they need in
  these structures so that the highest layer can allocate sufficient space.
  pointer arithmatic, etc.

Virtual connection

- used by MM implementation to allow late binding to a method

  this implies that the VC contains a pointer to either the function pointer
  table for the method to which it is bound or the function pointer to table to
  a set of functions that perform the binding to such a method

- one-to-one correspondence between a virtual connection and a real connection

- Contains
  - state of binding
  - method specific information
  - function pointer table?

----------

Communicators
- Contains
  - communication group
  - local group (inter-communicator only)
  - send and receive context IDs - same for intra-communicator
  - attributes
  - reference count
  - error handlers
  - device specific information (needed for MPICH-G2)
  
----------

Groups
- Contains
  - virtual connection table
  - my rank
  - reference count
  - device specific information (???)

----------

Requests
- probably allocated and partially initialized above ADI
- initialization complete by device/method

- Request contains
  - Immutable after initialization
    - type of request
      - persistent request flag
      - send, bsend, rsend, ssend, recv, generalized
    - buffer
    - count
    - datatype
    - rank (src or dest depending on type of request)
    - tag
    - comm
----------

Connection resolution

- needs to talk to with BNR

----------

Communication agent