* Definitions
- MPI buffer - the (memory pointer, count, datatype) triple describing user data
* Communication subsystem capabilities (requirements)
- general MPI messaging (MPI_*send() and MPI_*recv())
- sending messages
- if the send-side MPI buffer is sufficiently contiguous, send data
directly from the MPI buffer
- if RMA capabilities exist and the MPI receive buffer is
sufficiently contiguous, write message data directly into the MPI
receive buffer
- receiving messages
- match incoming messages with posted receives
- handle posting and matching of wildcard (source) receives
- special handling for already posted receives
- if the MPI buffer is sufficiently contiguous, receive directly
into the MPI buffer
- if the user buffer is non-contiguous, unpack data as portions of
the message data are received
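The incremental unpack case above can be sketched in plain C. This is a hypothetical illustration, not the method interface: a flattened strided layout (`blocklen`/`stride` standing in for a simple MPI vector datatype) receives packed chunks as they arrive off the network.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sketch: unpack arriving chunks of packed message data
 * into a non-contiguous (strided) receive buffer.  blocklen/stride
 * stand in for a flattened MPI vector datatype. */
struct unpack_state {
    char  *buf;       /* user receive buffer */
    size_t blocklen;  /* contiguous bytes per block */
    size_t stride;    /* distance between block starts */
    size_t unpacked;  /* packed bytes consumed so far */
};

/* Consume one arriving chunk of packed data, copying it block by
 * block into the strided user buffer; chunks need not align with
 * block boundaries. */
static void unpack_chunk(struct unpack_state *s, const char *chunk, size_t n)
{
    while (n > 0) {
        size_t block  = s->unpacked / s->blocklen;  /* current block   */
        size_t offset = s->unpacked % s->blocklen;  /* offset in block */
        size_t take   = s->blocklen - offset;
        if (take > n) take = n;
        memcpy(s->buf + block * s->stride + offset, chunk, take);
        chunk       += take;
        n           -= take;
        s->unpacked += take;
    }
}
```

With blocklen 2 and stride 4, packed data "abcdef" arriving as "abc" then "def" lands at buffer offsets 0-1, 4-5, and 8-9, regardless of how the chunks split across blocks.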
- persistent MPI messaging (MPI_*_init())
- for some network interfaces, we should be able to perform
one-time initialization to eliminate unnecessary data copies
(manipulating the MPI buffer directly)
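One way to picture the one-time initialization is flattening the datatype once, at request-creation time, into an (offset, length) list that each start of the persistent request can hand straight to the network. This is a sketch under that assumption; the struct and fixed segment limit are illustrative, not a proposed interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of persistent-request setup: flatten a strided
 * layout into an (offset, length) list once, at MPI_*_init() time, so
 * each subsequent start can drive the network directly from the user
 * buffer without re-walking the datatype. */
#define MAX_SEGS 16   /* illustrative fixed limit */

struct persistent_req {
    size_t nsegs;
    size_t offset[MAX_SEGS];
    size_t length[MAX_SEGS];
};

/* One-time initialization: flatten "count" blocks of "blocklen" bytes
 * whose starts are "stride" bytes apart. */
static void req_init(struct persistent_req *r, size_t count,
                     size_t blocklen, size_t stride)
{
    r->nsegs = count;
    for (size_t i = 0; i < count; i++) {
        r->offset[i] = i * stride;
        r->length[i] = blocklen;
    }
}
```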
- collective operations
- send portions of an MPI buffer
- receive portions of an MPI buffer
- forward portions of an incoming message
Use pipelining instead of store-and-forward to increase network
utilization (and thus performance).
Potentially multicast the same portion to multiple remote
processes. Nick's prototype shows this is a big win for TCP and
vMPI. I suspect it is a big win in general.
- share buffers between methods to avoid copying data during forward
operations
- perform MPI computations (as defined by MPI_Reduce()) while
receiving/unpacking data
Computations may need to be performed at intermediate processes
(processes not receiving any of the results), which implies that
computations may need to be performed without the presence of a
user-provided MPI buffer (or datatype).
- handle multiple simultaneous collective operations on the same
communicator (multi-threaded MPI)
We should be able to use the tag field and a rolling counter to
separate messages from independent collective operations. This
would allow us to use the same matching mechanisms that we use
for general MPI messaging.
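The tag-plus-rolling-counter idea above can be sketched as a simple bit split. The split point and counter width here are assumptions for illustration, not part of any MPI standard or existing implementation:

```c
#include <assert.h>

/* Hypothetical sketch: separate concurrent collective operations on
 * one communicator by packing a per-communicator rolling counter into
 * the high bits of the internal tag.  COUNTER_BITS and the 31-bit tag
 * space are illustrative assumptions. */
#define COUNTER_BITS 8
#define BASE_BITS    (31 - COUNTER_BITS)
#define TAG_MASK     ((1 << BASE_BITS) - 1)

static int make_coll_tag(int counter, int base_tag)
{
    return ((counter & ((1 << COUNTER_BITS) - 1)) << BASE_BITS)
           | (base_tag & TAG_MASK);
}

static int coll_tag_counter(int tag) { return tag >> BASE_BITS; }
static int coll_tag_base(int tag)    { return tag & TAG_MASK;   }
```

Because the counter lands in ordinary tag bits, messages from independent collectives never match each other, and the existing point-to-point matching path can be reused unchanged.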
- remote memory operations
- aggregate operations (from the same exposure epoch?) into a
single communication
- perform MPI computations on remote memory
- match communicated operations with exposure epochs (either
explicit or implied)
Is context sufficient for this? Do we need a tag to separate
independent access/exposure epochs?
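The aggregation point above can be sketched as packing per-operation records into one outgoing buffer, which the target replays against its exposed window. The record layout and opcodes below are assumptions for illustration only:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sketch: aggregate several RMA operations destined for
 * the same exposure epoch into a single message.  Each record is a
 * small header (opcode, target offset, payload length) followed by
 * the payload bytes; opcodes are illustrative. */
enum { RMA_PUT = 1, RMA_ACC_SUM = 2 };

struct rma_hdr { int op; size_t offset; size_t len; };

/* Append one operation record to the aggregation buffer.  Returns the
 * new fill level, or 0 if the record does not fit (caller would flush
 * the buffer and retry). */
static size_t rma_append(char *buf, size_t fill, size_t cap,
                         int op, size_t offset,
                         const void *data, size_t len)
{
    struct rma_hdr h = { op, offset, len };
    if (fill + sizeof h + len > cap)
        return 0;
    memcpy(buf + fill, &h, sizeof h);
    memcpy(buf + fill + sizeof h, data, len);
    return fill + sizeof h + len;
}
```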
- unreliable communication and QoS
Theoretically, an MPI communicator could be tagged to allow
unreliable delivery, QoS, etc. We haven't thought much about what
impact this has on our design, but we probably don't want to
preclude these capabilities.
* Communication subsystem components
- virtual connections
allow late binding to a communication device (method)
provide a function table for all connection/communication
related interfaces
- progress engine
- matching incoming messages to posted requests
- message level flow control
- shared network buffer management
- network communication
- network flow control
* Message level flow control
For simple messaging operations, message envelope meta-data must be
sent to the remote process immediately; failure to do so may cause
the remote process to block indefinitely awaiting a particular
message. However, the method must also balance messaging
performance (sending the entire message immediately) against the
memory the remote process uses to buffer messages for which
receives have not yet been posted.
Messages are typically converted (by the method?) to one of three
types to obtain this balance: short, eager, and rendezvous.
Conversion to a particular message type may depend on the memory
availability of the remote process.
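The short/eager/rendezvous decision described above can be sketched as a threshold check. The thresholds and the remote-credit test are assumptions; a real method would tune both per network and per peer:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of message-type selection.  SHORT_MAX and
 * EAGER_MAX are illustrative thresholds; remote_credit stands in for
 * whatever the method knows about buffer space at the remote process. */
enum proto { PROTO_SHORT, PROTO_EAGER, PROTO_RNDV };

#define SHORT_MAX 128          /* data rides in the envelope packet  */
#define EAGER_MAX (16 * 1024)  /* remote side buffers unposted sends */

static enum proto choose_proto(size_t msg_len, size_t remote_credit)
{
    if (msg_len <= SHORT_MAX)
        return PROTO_SHORT;    /* envelope + data in one packet */
    if (msg_len <= EAGER_MAX && remote_credit >= msg_len)
        return PROTO_EAGER;    /* send now; remote buffers it   */
    return PROTO_RNDV;         /* envelope first, data on ack   */
}
```

Note how the eager path degrades to rendezvous when the remote process has no buffer space left, which is exactly the memory/performance balance discussed above.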
NOTE: Some communication interfaces such as vendor MPI will do this
automatically, which means we shouldn't force message level flow
control upon the method.
* Method
- definition of a method
A method presents an interface which allows upper layers to convey
the actions they wish the method to perform in the context of a
virtual connection. These actions consist of sending and
receiving messages, performing remote memory operations, and
providing data and buffers to other methods.
- flow control at the message level
- flow control at the network buffer (packet) level
Some methods need to worry about network buffer availability at
the remote process.
- reliability
In a default environment, MPI messages are inherently reliable,
which means that some methods may need to concern themselves with
acknowledgments and retransmission if the underlying network does
not guarantee reliability.
- matching incoming messages to requests
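The reliability point above can be sketched as a minimal receiver-side scheme: sequence-numbered packets accepted strictly in order, with the implied ack telling the sender what to retransmit. This is an illustrative simplification (out-of-order packets are dropped rather than buffered):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of method-level reliability: the receiver
 * accepts packets only in sequence order; next_seq doubles as the
 * cumulative ack value, so the sender retransmits anything at or
 * beyond it.  Buffering out-of-order packets is omitted for brevity. */
struct rel_recv { unsigned next_seq; };

/* Returns true if the packet is new and in order (deliver and ack);
 * false for duplicates or gaps (sender will retransmit). */
static bool rel_accept(struct rel_recv *r, unsigned seq)
{
    if (seq != r->next_seq)
        return false;   /* duplicate or out-of-order: drop */
    r->next_seq++;      /* cumulative ack advances         */
    return true;
}
```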