pt2pt requirement
- need to specify blocking vs. non-blocking for most routines
------------------------------------------------------------------------
MPI_Send_init(buf, count, datatype, dest, tag, comm, request, error)
MPI_Bsend_init(buf, count, datatype, dest, tag, comm, request, error)
MPI_Rsend_init(buf, count, datatype, dest, tag, comm, request, error)
MPI_Ssend_init(buf, count, datatype, dest, tag, comm, request, error)
MPI_Recv_init(buf, count, datatype, src, tag, comm, request, error)
{
request_p = MPIR_Request_alloc();
/* Fill in request structure based on parameters and type of operation */
request_p->buf = buf;
request_p->count = count;
request_p->datatype = datatype;
request_p->rank = dest/src;
request_p->tag = tag;
request_p->comm = comm;
request_p->type = persistent | <type>;
*request = MPIR_Request_handle(request_p);
}
MPI_Start(request, error)
{
switch(request_p->type)
{
send:
MPID_Isend(request_p->buf, request_p->count, request_p->datatype,
request_p->rank, request_p->tag, request_p->comm, request_p, error);
bsend:
MPID_Ibsend(...)
rsend:
MPID_Irsend(...)
ssend:
MPID_Issend(...)
recv:
MPID_Irecv(...)
}
}
- persistent requests require copying parameters into the request structure.
should we always fill in a request and simply pass the request as the only
parameter? this would eliminate optimizations on machines where large
numbers of parameters can be passed in registers, but the intel boxes will
just end up pushing the parameters on the stack anyway...
- there is an optimization here that allows registered memory to be maintained
as registered in the persistent case. to do this we will need to let the
method know that we do/do not want the memory unregistered.
- need to store request type in request structure so that MPI_Start() can do
the right thing (tm).
- we chose not to convert handles to structure pointers since the handles may
contain quick access to common information, avoiding pointer dereferences.
in some cases, an associated structure may not even exist.
the implication here is that many of the non-persistent MPI_Xsend routines
will do little work outside of calling an MPID function. Perhaps we should
not have separate MPI functions in those cases but rather map the MPI
functions directly to the MPID functions (through the use of macros or weak
symbols).
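The macro-based mapping mentioned above might look like the following sketch. All names here (MPIX_Send_demo, MPID_Send_demo) are hypothetical stand-ins, not the real MPI/MPID symbols; the point is only that the MPI-layer "function" can expand directly to the MPID call, eliminating one call layer.

```c
/* Hypothetical MPID-layer routine; stands in for MPID_Send and friends.
   Returns 0 (MPI_SUCCESS-style) on success. */
static int MPID_Send_demo(const void *buf, int count)
{
    (void) buf;
    return (count >= 0) ? 0 : -1;
}

/* Map the thin MPI-layer entry point directly onto the MPID routine via a
   macro, so no separate MPI-layer function body exists at all. */
#define MPIX_Send_demo(buf, count) MPID_Send_demo((buf), (count))

int demo_send_ok(void)
{
    int payload = 42;
    return MPIX_Send_demo(&payload, 1);   /* expands to MPID_Send_demo */
}
```

A weak-symbol scheme would achieve the same effect at link time instead of preprocessing time, at the cost of being toolchain dependent.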
------------------------------------------------------------------------
MPI_Send(buf, count, datatype, dest, tag, comm, error)
MPI_Bsend(buf, count, datatype, dest, tag, comm, error)
MPI_Rsend(buf, count, datatype, dest, tag, comm, error)
MPI_Ssend(buf, count, datatype, dest, tag, comm, error)
{
/* Map (comm,dest) to a virtual connection */
MPID_Comm_get_connection(comm, dest, &vc);
/* If virtual connection is not bound to a real connection, then perform
connection resolution. */
/* (atomically) If no other requests are queued on this connection, then send
as much data as possible. If the entire message could not be sent
"immediately", then queue the request for later processing. (We need a
progress engine to ensure that "later" happens.) */
/* Build up a segment unless the datatype is "trivial" */
/* Wait until entire message is sent */
}
- heterogeneity should be handled by the method. this allows methods which do
not require conversions, such as shared memory, to be fully optimized.
- who should setup the segment and convert the buffer (buf, count, datatype) to
one or more blocks of bytes? should that be a layer above the method or
should it be the method itself?
a method may or may not need to use segments depending on its capabilities.
there should only be one implementation of the segment API which will be
called by all of the method implementations.
- we noticed that the segment initialization code takes a (comm,rank) pair which
will have to be dereferenced to a virtual connection in order to determine if
data conversion is required. since we have already done the dereference, it
would be ideal if the segment took an ADI3 implementation (MPID) specific
connection type instead of a (comm,rank). Making this parameter type
implementation specific implies that the segment interface is never called
from the MPI layer or that the ADI3 interface provides a means of converting
a (comm, rank) to a connection type.
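The idea above can be sketched as follows. The MPIDX_VC and MPIDX_Segment types are hypothetical illustrations of a device-specific connection and segment; the point is that segment init is keyed off the already-dereferenced vc, so the conversion check costs no extra (comm,rank) lookup.

```c
#include <stddef.h>

/* Hypothetical ADI3 (MPID) specific virtual connection; the real layout
   would be up to the device implementation. */
typedef struct MPIDX_VC {
    int remote_format;   /* data representation of the peer */
    int local_format;    /* our own data representation */
} MPIDX_VC;

typedef struct MPIDX_Segment {
    const void *buf;
    size_t      count;
    int         needs_conversion;
} MPIDX_Segment;

/* Segment init takes the vc rather than a (comm,rank) pair, so the
   dereference already performed by the send path is not repeated. */
static void MPIDX_Segment_init(MPIDX_Segment *seg, const MPIDX_VC *vc,
                               const void *buf, size_t count)
{
    seg->buf = buf;
    seg->count = count;
    seg->needs_conversion = (vc->remote_format != vc->local_format);
}

int demo_homogeneous_needs_no_conversion(void)
{
    MPIDX_VC vc = { 0, 0 };          /* both sides use the same format */
    MPIDX_Segment seg;
    int data[4] = { 1, 2, 3, 4 };
    MPIDX_Segment_init(&seg, &vc, data, 4);
    return seg.needs_conversion;     /* 0: no conversion required */
}
```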
- David suggested that we might be able to use the xfer interface for
point-to-point messaging as well as for collective operations.
What should the xfer interface look like?
- David provided a write-up of the existing interface
- We questioned whether or not multiple receive blocks could be used to
receive a message sent from a single send block. We decided that blocks
define envelopes which match, where a single block defines an envelope (and
payload) per destination and/or source. So, a message sent to a particular
destination (from a single send block) must be received by a single receive
block. In other words, the message cannot be broken across receive blocks.
- there is an asymmetry in the existing interface which allows multiple
destinations but prevents multiple sources. the result of this is that
scattering operations can be naturally described, but aggregation
operations cannot. we believe that there are important cases where
aggregation would benefit collective operations.
- to address this we believe that we should extend the interface to
implement a many-to-one, in addition to the existing one-to-many
interface. we hope we don't need the many-to-many...
- perhaps we should call these scatter_init and gather_init (etc)?
- Nick proposed that the interface be split up such that send requests were
separate from receive requests. This implies that there would be an
xfer_send_init() and xfer_recv_init(). We later threw this out, as it
didn't make a whole lot of sense with forwards existing in the recv case.
- Brian wondered about aggregating sends into a single receive and whether
that could be used to reduce the overhead of message headers when
forwarding. We think that this can be done below the xfer interface when
converting into a dataflow-like structure (?)
- We think it may be necessary to describe dependencies, such as progress,
completion and buffer dependencies. These dependencies are frighteningly
close to dataflow...
- basically we see the xfer init...start calls as being converted into a set of
comm. agent requests and a dependency graph. we see the dependencies as
being possibly stored in a tabular format, so that ranges of the incoming
stream can have different dependencies on them -- specifically this allows
for progress dependencies on a range basis, which we see as a requirement.
completion dependencies (of which there may be > 1) would be listed at the
end of this table
the table describes what depends on THIS request, rather than the other way
around. this is tailored to a notification system rather than some sort of
search-for-ready approach (which would be a disaster).
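The tabular, range-based dependency idea might look like this sketch. The DepEntry layout and the deps_satisfied() helper are hypothetical; the sketch only illustrates the notification direction described above: each entry names what depends on THIS request's incoming byte ranges.

```c
#include <stddef.h>

/* Hypothetical dependency-table entry: a byte range of the incoming
   stream plus the CAR that is notified once that range has arrived.
   Completion dependencies would be listed at the end of the table. */
typedef struct DepEntry {
    size_t begin, end;      /* half-open byte range [begin, end) */
    int    dependent_car;   /* id of the CAR notified for this range */
} DepEntry;

/* Notification-style check: given how many bytes have arrived so far,
   count the entries whose ranges are fully satisfied (and would thus
   have their dependent CARs notified). */
static int deps_satisfied(const DepEntry *table, int n, size_t bytes_in)
{
    int done = 0;
    for (int i = 0; i < n; i++)
        if (table[i].end <= bytes_in)
            done++;
    return done;
}

int demo_range_deps(void)
{
    /* first 1 KB feeds CAR 7 (e.g. a forward can make progress early);
       the remainder completes CAR 9 */
    DepEntry table[] = { { 0, 1024, 7 }, { 1024, 4096, 9 } };
    return deps_satisfied(table, 2, 2048);  /* only the first range done */
}
```

This is tailored to notification on arrival, as the notes require, rather than scanning all requests for readiness.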
- for dependencies BETWEEN blocks, we propose waiting on the first block to
complete before starting the next block. you can still create blocks ahead
of time if desired. otherwise blocks may be processed in parallel
- blocks follow the same envelope matching rules as posted mpi send/recvs
(commit time order). this is the only "dependency" between blocks
reminder: envelope = (context (communicator), source_rank, tag)
QUESTION: what exactly are the semantics of a block? Sends to the same
destination are definitely ordered. Sends to different destinations could
proceed in parallel. Should they?
example:
init
rf(5)
rf(4)
r
start
a transfer block defines 0 or 1 envelope/payloads for sources and 0 to N
envelope/payloads for destinations, one per destination.
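The envelope matching rule recalled above (envelope = (context, source_rank, tag), with posted-receive wildcards) can be sketched directly. The Envelope type and wildcard constants here are illustrative stand-ins, not the real MPICH structures.

```c
/* Hypothetical envelope per the reminder above:
   (context (communicator), source_rank, tag). */
enum { ANY_SRC_DEMO = -1, ANY_TAG_DEMO = -1 };

typedef struct Envelope {
    int context;   /* communicator context id */
    int src;       /* source rank; ANY_SRC_DEMO is a receive-side wildcard */
    int tag;       /* message tag; ANY_TAG_DEMO is a receive-side wildcard */
} Envelope;

/* Posted-receive matching rule: context must match exactly; source and
   tag match either exactly or via a wildcard on the receive side. */
static int envelope_match(const Envelope *posted, const Envelope *incoming)
{
    return posted->context == incoming->context
        && (posted->src == ANY_SRC_DEMO || posted->src == incoming->src)
        && (posted->tag == ANY_TAG_DEMO || posted->tag == incoming->tag);
}

int demo_wildcard_match(void)
{
    Envelope posted   = { 5, ANY_SRC_DEMO, 3 };   /* recv from any source */
    Envelope incoming = { 5, 2, 3 };
    return envelope_match(&posted, &incoming);    /* 1: envelopes match */
}
```

Since blocks follow posted send/recv matching in commit-time order, this same rule governs the only "dependency" between blocks.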
- The communication agent will need to process these requests and data
dependencies. We see the agent having queues of requests similar in nature
to the run queue within an operating system. (We aren't really sure what
this means yet...) Queues might consist of the active queue, the wait queue,
and the still-to-be-matched queue.
- the "try to send right away" code will look to see if there is anything in
the active queue for the vc, and if not just put it in the run queue and call
the make progress function (whatever that is...)
- adaptive polling done at the agent level, perhaps with method supplied
min/max/increments. comm. agent must track outstanding requests (as
described above) in order to know WHAT to poll. we must also take into
account that there might be incoming active messages or error conditions, so
we should poll all methods (and all vcs) periodically.
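The adaptive polling with method-supplied min/max/increment might work as in this sketch (the PollState type and poll_update() are hypothetical): back off while a method shows no activity, and snap back to the minimum interval as soon as it does.

```c
/* Hypothetical adaptive polling state; min/max/incr are supplied by the
   method, interval is the agent's current polling interval. */
typedef struct PollState {
    int min, max, incr;
    int interval;
} PollState;

static void poll_update(PollState *ps, int had_activity)
{
    if (had_activity) {
        ps->interval = ps->min;        /* busy: poll aggressively again */
    } else {
        ps->interval += ps->incr;      /* idle: back off by the increment */
        if (ps->interval > ps->max)
            ps->interval = ps->max;    /* but never beyond the method max */
    }
}

int demo_backoff(void)
{
    PollState ps = { 1, 10, 4, 1 };
    poll_update(&ps, 0);   /* idle: 1 -> 5 */
    poll_update(&ps, 0);   /* idle: 5 -> 9 */
    poll_update(&ps, 0);   /* idle: 9 -> 13, clamped to 10 */
    return ps.interval;
}
```

The periodic sweep over all methods (and vcs) for active messages and errors would simply use the max interval as its floor.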
- We believe that an MPIR_Request might simply contain enough information for
signalling that one or more CARs have completed. This implies that an
MPIR_Request might consist of an integer counter of outstanding CARs. When
the counter reaches zero, the request is complete. David suggests making
CARs and MPIR_Requests reside in the same physical structure so that in the
MPI_Send/Recv() case, two logical allocations (one for MPIR_Request and CAR)
are combined into one.
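David's combined-allocation suggestion can be sketched as follows. The structure names and fields are hypothetical; the sketch shows the two ideas from the notes: completion is a counter of outstanding CARs hitting zero, and the first CAR is embedded in the request so MPI_Send/Recv needs one allocation instead of two.

```c
/* Hypothetical CAR (communication agent request). */
typedef struct CAR_demo {
    int active;
} CAR_demo;

/* Hypothetical MPIR_Request: completion is signalled by the counter of
   outstanding CARs reaching zero.  The first CAR is embedded so that the
   common MPI_Send/Recv case needs a single allocation. */
typedef struct MPIR_Request_demo {
    int      n_outstanding_cars;
    CAR_demo first_car;
} MPIR_Request_demo;

static void car_complete(MPIR_Request_demo *req, CAR_demo *car)
{
    car->active = 0;
    req->n_outstanding_cars--;       /* one fewer CAR outstanding */
}

static int request_is_complete(const MPIR_Request_demo *req)
{
    return req->n_outstanding_cars == 0;
}

int demo_single_car_request(void)
{
    MPIR_Request_demo req = { 1, { 1 } };   /* one outstanding CAR */
    car_complete(&req, &req.first_car);
    return request_is_complete(&req);       /* 1: request now complete */
}
```

In a threaded build the counter decrement would need to be atomic; the sketch ignores that.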
- operations within a block are prioritized by the order in which they are
added to the block. operations may proceed in parallel so long as higher
priority operations are not slowed down by lower priority operations. a
valid implementation is to serialize the operations, thus guaranteeing that
the current operation has all available resources at its disposal.
MPI_Isend(buf, count, datatype, dest, tag, comm, request, error)
MPI_Ibsend(buf, count, datatype, dest, tag, comm, request, error)
MPI_Irsend(buf, count, datatype, dest, tag, comm, request, error)
MPI_Issend(buf, count, datatype, dest, tag, comm, request, error)
{
request_p = MPIR_Request_alloc();
MPID_IXsend(buf, count, datatype, dest, tag, comm, request_p, error);
*request = MPIR_Request_handle(request_p);
}
MPI_Recv()
MPI_Irecv()
- need to cover wild card receive!
MPI_Sendrecv()
{
/* KISS */
MPI_Isend()
MPI_Irecv()
MPI_Waitall()
}
MPID_Send(buf, count, datatype, dest, tag, comm, group, error)
MPID_Isend(buf, count, datatype, dest, tag, comm, request, error)
MPID_Bsend(buf, count, datatype, dest, tag, comm, error)
MPID_Ibsend(buf, count, datatype, dest, tag, comm, request, error)
MPID_Rsend(buf, count, datatype, dest, tag, comm, error)
MPID_Irsend(buf, count, datatype, dest, tag, comm, request, error)
MPID_Ssend(buf, count, datatype, dest, tag, comm, error)
MPID_Issend(buf, count, datatype, dest, tag, comm, request, error)
-----
Items which make life more difficult:
-