pt2pt requirement - need to specify blocking vs. non-blocking for most routines

------------------------------------------------------------------------

MPI_Send_init(buf, count, datatype, dest, tag, comm, request, error)
MPI_Bsend_init(buf, count, datatype, dest, tag, comm, request, error)
MPI_Rsend_init(buf, count, datatype, dest, tag, comm, request, error)
MPI_Ssend_init(buf, count, datatype, dest, tag, comm, request, error)
MPI_Recv_init(buf, count, datatype, src, tag, comm, request, error)
{
    request_p = MPIR_Request_alloc();

    /* Fill in the request structure based on the parameters and the type of
       operation */
    request_p->buf = buf;
    request_p->count = count;
    request_p->datatype = datatype;
    request_p->rank = dest/src;
    request_p->tag = tag;
    request_p->comm = comm;
    request_p->type = persistent | <send/bsend/rsend/ssend/recv>;

    *request = MPIR_Request_handle(request_p);
}

MPI_Start(request, error)
{
    switch(request->type)
    {
        send:  MPID_Isend(buf, count, datatype, dest, tag, comm, request_p, error);
        bsend: MPID_Ibsend(...)
        rsend: MPID_Irsend(...)
        ssend: MPID_Issend(...)
        recv:  MPID_Irecv(...)
    }
}

- persistent requests require copying parameters into the request structure.
  Should we always fill in a request and simply pass the request as the only
  parameter?  This would eliminate optimizations on machines where large
  numbers of parameters can be passed in registers, but the Intel boxes will
  just end up pushing the parameters onto the stack anyway...

- there is an optimization here that allows registered memory to remain
  registered in the persistent case.  To do this we will need a way to tell
  the method whether or not we want the memory unregistered.

- we need to store the request type in the request structure so that
  MPI_Start() can do the right thing (tm).

- we chose not to convert handles to structure pointers since the handles may
  contain quick access to common information, avoiding pointer dereferences.
  In some cases, an associated structure may not even exist.  The implication
  here is that many of the non-persistent MPI_Xsend routines will do little
  work beyond calling an MPID function.  Perhaps we should not have separate
  MPI functions in those cases, but rather map the MPI functions directly to
  the MPID functions (through the use of macros or weak symbols).

------------------------------------------------------------------------

MPI_Send(buf, count, datatype, dest, tag, comm, error)
MPI_Bsend(buf, count, datatype, dest, tag, comm, error)
MPI_Rsend(buf, count, datatype, dest, tag, comm, error)
MPI_Ssend(buf, count, datatype, dest, tag, comm, error)
{
    /* Map the (comm, rank) handle to a virtual connection */
    MPID_Comm_get_connection(comm, rank, &vc);

    /* If the virtual connection is not bound to a real connection, then
       perform connection resolution. */

    /* (atomically) If no other requests are queued on this connection, then
       send as much data as possible.  If the entire message could not be sent
       "immediately", then queue the request for later processing.  (We need a
       progress engine to ensure that "later" actually happens.) */

    /* Build up a segment unless the datatype is "trivial" */

    /* Wait until the entire message has been sent */
}

- heterogeneity should be handled by the method.  This allows methods which do
  not require conversions, such as shared memory, to be fully optimized.

- who should set up the segment and convert the buffer (buf, count, datatype)
  into one or more blocks of bytes?  Should that be a layer above the method,
  or should it be the method itself?  A method may or may not need to use
  segments, depending on its capabilities.  (A sketch of one possibility
  follows.)
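- a minimal sketch of the "layer above the method" option, assuming a segment
  packing interface; every name and signature here (MPID_Segment_init,
  MPID_Segment_pack, method_send_bytes, BLOCK_SIZE) is a placeholder for
  illustration, not a settled part of the interface:

      enum { BLOCK_SIZE = 16384 };    /* arbitrary chunk size */

      MPID_Segment segment;
      char         block[BLOCK_SIZE];
      long         first, last, stream_size;

      /* The segment initialization currently takes a (comm, rank) pair so
         that it can determine whether data conversion is required. */
      MPID_Segment_init(buf, count, datatype, comm, rank, &segment, &stream_size);

      for (first = 0; first < stream_size; first = last)
      {
          /* Pack the next piece of the (buf, count, datatype) stream into a
             contiguous block of bytes, converting data if necessary.  On
             return, last indicates how far into the stream we actually got. */
          last = first + BLOCK_SIZE;
          MPID_Segment_pack(&segment, first, &last, block);

          /* Hand the contiguous block of bytes to the method. */
          method_send_bytes(vc, block, last - first);
      }

  the sketch assumes the caller has already resolved (comm, rank) to a virtual
  connection (vc) for the actual send, which is exactly the duplicated
  dereference noted below.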
- there should only be one implementation of the segment API, which will be
  called by all of the method implementations.

- we noticed that the segment initialization code takes a (comm, rank) pair
  which will have to be dereferenced to a virtual connection in order to
  determine whether data conversion is required.  Since we have already done
  that dereference, it would be ideal if the segment took an ADI3
  implementation (MPID) specific connection type instead of a (comm, rank).
  Making this parameter type implementation specific implies either that the
  segment interface is never called from the MPI layer or that the ADI3
  interface provides a means of converting a (comm, rank) into a connection
  type.

- David suggested that we might be able to use the xfer interface for
  point-to-point messaging as well as for collective operations.  What should
  the xfer interface look like?

- David provided a write-up of the existing interface.

- We questioned whether or not multiple receive blocks could be used to
  receive a message sent from a single send block.  We decided that blocks
  define envelopes which match, where a single block defines an envelope (and
  payload) per destination and/or source.  So, a message sent to a particular
  destination (from a single send block) must be received by a single receive
  block.  In other words, the message cannot be broken across receive blocks.

- there is an asymmetry in the existing interface which allows multiple
  destinations but prevents multiple sources.  The result is that scattering
  operations can be described naturally, but aggregation operations cannot.
  We believe that there are important cases where aggregation would benefit
  collective operations.

- to address this, we believe we should extend the interface to provide a
  many-to-one form in addition to the existing one-to-many form.  We hope we
  don't need many-to-many...

- perhaps we should call these scatter_init and gather_init (etc.)?

- Nick proposed that the interface be split up such that send requests were
  separate from receive requests.  This implies that there would be an
  xfer_send_init() and an xfer_recv_init().  We later threw this out, as it
  didn't make a whole lot of sense with forwards existing in the recv case.

- Brian wondered about aggregating sends into a single receive and whether
  that could be used to reduce the overhead of message headers when
  forwarding.  We think this can be done below the xfer interface when
  converting into a dataflow-like structure (?)

- We think it may be necessary to describe dependencies, such as progress,
  completion, and buffer dependencies.  These dependencies are frighteningly
  close to dataflow...

- basically, we see the xfer init...start calls as being converted into a set
  of comm. agent requests (CARs) and a dependency graph.  We see the
  dependencies as possibly being stored in a tabular format, so that ranges of
  the incoming stream can have different dependencies on them -- specifically,
  this allows for progress dependencies on a per-range basis, which we see as
  a requirement.  Completion dependencies (of which there may be more than
  one) would be listed at the end of this table.  The table describes what
  depends on THIS request, rather than the other way around.  This is tailored
  to a notification system rather than some sort of search-for-ready approach
  (which would be a disaster).  (A strawman layout for such a table is
  sketched below.)

- for dependencies BETWEEN blocks, we propose waiting for the first block to
  complete before starting the next block.  You can still create blocks ahead
  of time if desired.  Otherwise, blocks may be processed in parallel.
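- a strawman for the dependency table described above -- all type and field
  names here are invented for illustration and are not part of any agreed-upon
  interface:

      struct MPIDI_CAR;                      /* a comm. agent request */

      /* one progress dependency: a CAR to notify once a given range of the
         incoming stream has arrived */
      typedef struct MPIDI_CAR_dep_entry
      {
          long              first;           /* first byte of the range */
          long              last;            /* last byte of the range */
          struct MPIDI_CAR *dependent;       /* CAR to notify for this range */
      } MPIDI_CAR_dep_entry;

      /* the table hangs off of THIS request and lists what depends on it */
      typedef struct MPIDI_CAR_dep_table
      {
          int                  n_progress;   /* per-range progress dependencies */
          MPIDI_CAR_dep_entry *progress;

          int                  n_completion; /* completion dependencies (may be > 1), */
          struct MPIDI_CAR   **completion;   /* listed at the end of the table */
      } MPIDI_CAR_dep_table;

  as data arrives on a request, the agent would walk the progress entries
  covering the newly arrived range and notify the dependent CARs; when the
  request completes, it would notify the completion entries.  This is the
  notification style favored above, rather than a search-for-ready scan.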
- blocks follow the same envelope matching rules as posted MPI sends/recvs
  (commit-time order).  This is the only "dependency" between blocks.

  reminder: envelope = (context (communicator), source_rank, tag)

QUESTION: what exactly are the semantics of a block?  Sends to the same
destination are definitely ordered.  Sends to different destinations could
proceed in parallel.  Should they?

    example: init rf(5) rf(4) r start

- a transfer block defines 0 or 1 envelope/payloads for sources and 0 to N
  envelope/payloads for destinations, one per destination.

- The communication agent will need to process these requests and data
  dependencies.  We see the agent having queues of requests similar in nature
  to the run queue within an operating system.  (We aren't really sure what
  this means yet...)  The queues might consist of an active queue, a wait
  queue, and a still-to-be-matched queue.

- the "try to send right away" code will look to see if there is anything in
  the active queue for the vc; if not, it will just put the request in the run
  queue and call the make-progress function (whatever that is...)

- adaptive polling is done at the agent level, perhaps with method-supplied
  min/max/increment values.  The comm. agent must track outstanding requests
  (as described above) in order to know WHAT to poll.  We must also take into
  account that there might be incoming active messages or error conditions, so
  we should poll all methods (and all vcs) periodically.

- We believe that an MPIR_Request might simply contain enough information to
  signal that one or more CARs have completed.  This implies that an
  MPIR_Request might consist of an integer counter of outstanding CARs; when
  the counter reaches zero, the request is complete.  David suggests making
  the CAR and the MPIR_Request reside in the same physical structure so that
  in the MPI_Send/Recv() case, two logical allocations (one for the
  MPIR_Request and one for the CAR) are combined into one.

- operations within a block are prioritized by the order in which they are
  added to the block.  Operations may proceed in parallel so long as higher
  priority operations are not slowed down by lower priority operations.  A
  valid implementation is to serialize the operations, thus guaranteeing that
  the current operation has all available resources at its disposal.

MPI_Isend(buf, count, datatype, dest, tag, comm, request, error)
MPI_Ibsend(buf, count, datatype, dest, tag, comm, request, error)
MPI_Irsend(buf, count, datatype, dest, tag, comm, request, error)
MPI_Issend(buf, count, datatype, dest, tag, comm, request, error)
{
    request_p = MPIR_Request_alloc();
    MPID_IXsend(buf, count, datatype, dest, tag, comm, request_p, error);
    *request = MPIR_Request_handle(request_p);
}

MPI_Recv()
MPI_Irecv()

- need to cover wildcard receives!

MPI_Sendrecv()
{
    /* KISS */
    MPI_Isend()
    MPI_Irecv()
    MPI_Waitall()
}

MPID_Send(buf, count, datatype, dest, tag, comm, group, error)
MPID_Isend(buf, count, datatype, dest, tag, comm, request, error)
MPID_Bsend(buf, count, datatype, dest, tag, comm, error)
MPID_Ibsend(buf, count, datatype, dest, tag, comm, request, error)
MPID_Rsend(buf, count, datatype, dest, tag, comm, error)
MPID_Irsend(buf, count, datatype, dest, tag, comm, request, error)
MPID_Ssend(buf, count, datatype, dest, tag, comm, error)
MPID_Issend(buf, count, datatype, dest, tag, comm, request, error)

-----

Items which make life more difficult:

-