Single-threaded implementation of RMA for distributed memory
------------------------------------------------------------------------
Base Assumptions

* All of the local windows are located in process-local (not shared or
  remotely accessible) memory.
* Only basic datatypes are supported for the target.
* Only active (fence) synchronization is supported.
* The application is single threaded.
* The MPI runtime system is single threaded.
------------------------------------------------------------------------
General Notes

* "Lessons Learned from Implementing BSP" by J. Hill and D.B. Skillicorn
  suggests that we should not perform RMA operations as they are
  requested, but rather queue the entire set of operations and perform
  them at the next synchronization operation.
------------------------------------------------------------------------
Data Structures

* MPID_Win
  * struct MPIR_Win
    * handles - an array of local window handles (one per process)

      Q: Do we really need local window IDs?  We need to be able to map
      remote handler calls back to a particular window, but we might be
      able to do this using an attribute on a communicator.  Would an
      attribute lookup be too slow?
------------------------------------------------------------------------
MPID_Win_fence

* Since remote handler calls might be sent on another socket or processed
  in another thread, no natural synchronization occurs between RHCs and
  the collective operations.  Therefore, we need to know how many RHCs to
  expect so that we don't return from the fence prematurely.  Likewise,
  we need to tell the other processes how many RHCs we have issued.  (A
  sketch of this bookkeeping follows this section.)

* We need to block until all incoming RHCs have been handled and all
  local requests and flags have completed.

  Q: What is the right interface for this blocking operation?  The
  operation should block, but it needs to guarantee that forward progress
  is being made on both the incoming RHCs and the locally posted
  operations.

  NOTE: We either need to pass dwin to a function or declare/cast the
  counters used in the while statement as volatile; otherwise the
  compiler may not generate instructions to reload the counter values
  before each iteration of the while loop.

  Q: It would be useful if the MPID layer could increment a counter (or
  call a non-blocking function) when the asynchronous request or RHC
  completed.  This seems like a much better interface than requests and
  flags, at least for RMA.  Might something of this nature be possible
  without putting undue burden on the device or significantly
  complicating the ADI?

* Wait for all other processes in the window to complete

  Q: Should we perform a barrier here?  If we eliminate the barrier, then
  all processes still waiting for operations to complete will have to
  enqueue incoming requests from the next epoch until the operations from
  the current epoch are complete.  Not performing the barrier complicates
  the RMA operations, but the performance benefit may be significant for
  some cases.  (What are they?  How common are they?)
------------------------------------------------------------------------
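The counter bookkeeping above can be made concrete with a small sketch.
Everything below is illustrative only: the structure, its field names, and
the progress_poke() hook are assumptions standing in for the real MPIR_Win
and ADI progress interface, and the exchange of RHC counts is modeled with
MPI_Reduce_scatter_block.

    #include <mpi.h>

    /* Hypothetical per-window bookkeeping; these are not the real
       MPIR_Win fields.  The counters are volatile so that the wait loop
       in the fence reloads them on every iteration. */
    typedef struct {
        MPI_Comm      comm;            /* communicator backing the window    */
        int           comm_size;
        int          *rhc_issued;      /* RHCs issued to each target, this epoch */
        volatile int  rhc_expected;    /* RHCs other processes will send us  */
        volatile int  rhc_processed;   /* RHCs handled locally so far        */
        MPI_Request  *active_reqs;     /* isends/irecvs posted this epoch    */
        int           num_active_reqs;
    } win_sketch_t;

    /* Assumed progress hook: receives and dispatches incoming RHCs,
       incrementing win->rhc_processed as each handler runs. */
    void progress_poke(win_sketch_t *win);

    int win_fence_sketch(win_sketch_t *win)
    {
        int expected = 0, local_done = 0, mpi_errno;

        /* Tell each process how many RHCs we issued to it and learn how
           many RHCs to expect in return, so the fence is not left early. */
        mpi_errno = MPI_Reduce_scatter_block(win->rhc_issued, &expected, 1,
                                             MPI_INT, MPI_SUM, win->comm);
        if (mpi_errno != MPI_SUCCESS) return mpi_errno;
        win->rhc_expected = expected;

        /* Block until every incoming RHC has been handled and every
           locally posted request for this epoch has completed, while
           still making progress on both. */
        do {
            progress_poke(win);
            MPI_Testall(win->num_active_reqs, win->active_reqs,
                        &local_done, MPI_STATUSES_IGNORE);
        } while (win->rhc_processed < win->rhc_expected || !local_done);

        /* Wait for the other processes (see the open question above about
           eliminating this barrier). */
        mpi_errno = MPI_Barrier(win->comm);

        /* Reset the per-epoch state. */
        for (int i = 0; i < win->comm_size; i++)
            win->rhc_issued[i] = 0;
        win->rhc_processed = 0;
        win->num_active_reqs = 0;
        return mpi_errno;
    }
------------------------------------------------------------------------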
MPID_Get

* If the target and origin ranks are the same, then copy the data from
  the target buffer to the origin buffer.

* Otherwise, we are attempting to get data from a remote node.  (A
  round-trip sketch follows the MPIDI_Win_get_hdlr section below.)

  * Post an asynchronous receive for the data

    NOTE: the tag must be unique for this epoch to ensure that the
    soon-to-arrive incoming message is matched with this receive.

    NOTE: the request needs to be allocated from the window's active
    requests object so that it can be tracked.

  * Issue a remote handler call requesting the data from the remote
    process

    NOTE: the local completion flag needs to be allocated from the
    window's active flags object so that it can be tracked.
------------------------------------------------------------------------
MPIDI_Win_get_hdlr

* Post an asynchronous send of the requested data to the origin process

* Increment the "number of RHCs processed" counter
------------------------------------------------------------------------
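The MPID_Get / MPIDI_Win_get_hdlr pair amounts to a round trip, sketched
below.  The remote handler call is modeled here as an ordinary MPI
message; the packet layout, tag values, and function names are assumptions
made for illustration, not the actual ADI.

    #include <mpi.h>

    /* Hypothetical RHC packet asking the target to send back window data.
       A real packet would also carry a window handle and datatype info. */
    typedef struct {
        MPI_Aint target_disp;   /* displacement into the target's window  */
        int      nbytes;        /* number of bytes requested              */
        int      reply_tag;     /* epoch-unique tag the origin listens on */
    } get_rhc_t;

    enum { RHC_TAG = 1000 };    /* assumed tag reserved for RHC packets */

    /* Origin side: post the receive first, then issue the RHC.  In the
       real code both requests would come from the window's active
       requests object so that the fence can track them. */
    int get_sketch(void *origin_buf, int nbytes, int target_rank,
                   MPI_Aint target_disp, int epoch_tag,
                   MPI_Comm comm, MPI_Request reqs[2])
    {
        get_rhc_t rhc;
        rhc.target_disp = target_disp;
        rhc.nbytes = nbytes;
        rhc.reply_tag = epoch_tag;   /* must be unique within the epoch */

        /* Receive for the data the target's handler will send back. */
        MPI_Irecv(origin_buf, nbytes, MPI_BYTE, target_rank, epoch_tag,
                  comm, &reqs[0]);
        /* The "remote handler call", modeled as a small eager message. */
        return MPI_Isend(&rhc, (int) sizeof(rhc), MPI_BYTE, target_rank,
                         RHC_TAG, comm, &reqs[1]);
    }

    /* Target side: what MPIDI_Win_get_hdlr might do once the RHC has
       been received and decoded. */
    int get_hdlr_sketch(const get_rhc_t *rhc, void *win_base,
                        int origin_rank, MPI_Comm comm, MPI_Request *req,
                        volatile int *rhc_processed)
    {
        char *src = (char *) win_base + rhc->target_disp;
        int mpi_errno = MPI_Isend(src, rhc->nbytes, MPI_BYTE, origin_rank,
                                  rhc->reply_tag, comm, req);
        (*rhc_processed)++;   /* the "number of RHCs processed" counter */
        return mpi_errno;
    }
------------------------------------------------------------------------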
MPID_Put

* If the target and origin ranks are the same, then copy the data from
  the origin buffer to the target buffer.

* Otherwise, if the source and target buffers are contiguous and data
  conversion is not required

  NOTE: What I would like to do here is use MPID_Put_contig, but that
  would require that I communicate with the remote process in order to
  agree on a flag.  It would be much better if the target completion flag
  were a counter so that the counter could be prearranged and used for
  all Put operations.

* Otherwise, if the data is sufficiently small

  * If the data is not contiguous, then pack the data into a temporary
    buffer.

    NOTE: This assumes that MPID_Pack() does not add a header to the
    packed data.

  * Issue a remote handler call (MPIDI_Win_put_eager_hdlr) requesting
    that the data be written to the target's local window.  (An
    origin-side sketch of this eager path appears at the end of these
    notes.)

* Otherwise, the data is large enough to send in one or more separate
  messages

  * Issue a remote handler call (MPIDI_Win_put_hdlr) letting the target
    know that data is being sent that needs to be written into the
    target's local window

  * Post an asynchronous send of the origin buffer

    Q: Instead of using MPI_Isend(), should we use segments and multiple
    RHCs to send the data?  Would doing so imply that the RMA subsystem
    now needs to do flow control?

    Q: Should we have yet another case, where a rendezvous occurs,
    guaranteeing that the target is able to post a receive before the
    send is issued?  This would allow us to use MPID_Irsend(),
    potentially eliminating an extra message.  Rather than having another
    case, should we use this technique any time the data is larger than
    the eager message threshold?
------------------------------------------------------------------------
MPIDI_Win_put_eager_hdlr

* Unpack the data into the local window buffer, performing data
  conversion if necessary

  Q: How are the header and data obtained?  Depending on the interface
  and the datatype, we should be able to read the data directly into the
  window buffer.

* Increment the "number of RHCs processed" counter
------------------------------------------------------------------------
MPIDI_Win_put_hdlr

* Post an asynchronous receive of the data into the window buffer defined
  in the RHC header

  Q: What should we do if a communication failure occurs?  Is the origin
  somehow notified of the failure?

* Increment the "number of RHCs processed" counter
------------------------------------------------------------------------
MPI_Accumulate

* If the target and origin ranks are the same, then accumulate the data
  from the origin memory region into the target memory region using the
  specified operation.

  NOTE: For now, we are assuming the application and the message agent
  are single threaded, so we do not need to hold a mutex before
  performing the operation.

* Otherwise, if the data is sufficiently small:

  * If the data is not contiguous, then pack the data into a temporary
    buffer.

    NOTE: This assumes that MPID_Pack() does not add a header to the
    packed data.

  * Issue a remote handler call (MPIDI_Win_acc_eager_hdlr) requesting
    that the enclosed data be accumulated into the target buffer using
    the specified operation.

* Otherwise, the data is large enough that it needs to be segmented.

  Q: On the target side, we don't want to have to unpack the segment into
  a temporary buffer first.  We would like to do the data conversion and
  accumulation directly from the segment that will be received.  Does
  this make it impossible to use the MPIR_Segment API?
------------------------------------------------------------------------
MPIDI_Win_acc_eager_hdlr

* Perform the requested operation, converting the data on the fly if
  necessary (a sketch follows at the end of these notes)

* Increment the "number of RHCs processed" counter
------------------------------------------------------------------------
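For the eager MPID_Put branch above, the origin-side logic might look like
the following.  MPI_Pack stands in for MPID_Pack(), and the tag, header
layout, and function name are assumptions made only for illustration.

    #include <mpi.h>
    #include <stdlib.h>

    enum { PUT_EAGER_TAG = 1001 };   /* assumed tag for eager put RHCs */

    /* Hypothetical header that precedes the packed data in an eager put. */
    typedef struct {
        MPI_Aint target_disp;   /* where to write in the target's window */
        int      nbytes;        /* size of the packed payload            */
    } put_rhc_t;

    /* Origin side of the eager case: pack (if necessary) and send the
       header and data in a single message.  The temporary buffer must be
       kept alive, and freed, once the send completes. */
    int put_eager_sketch(void *origin_buf, int count, MPI_Datatype dtype,
                         int target_rank, MPI_Aint target_disp,
                         MPI_Comm comm, MPI_Request *req)
    {
        int data_size, pos = 0;
        MPI_Pack_size(count, dtype, comm, &data_size);

        char *msg = malloc(sizeof(put_rhc_t) + data_size);
        put_rhc_t *hdr = (put_rhc_t *) msg;
        hdr->target_disp = target_disp;

        /* Pack the (possibly non-contiguous) origin data after the header. */
        MPI_Pack(origin_buf, count, dtype, msg + sizeof(put_rhc_t),
                 data_size, &pos, comm);
        hdr->nbytes = pos;

        /* One eager RHC carries both the request and the data. */
        return MPI_Isend(msg, (int) (sizeof(put_rhc_t) + pos), MPI_BYTE,
                         target_rank, PUT_EAGER_TAG, comm, req);
    }

The large-message branch would instead send only the header (the
MPIDI_Win_put_hdlr call) and follow it with a separate asynchronous send
of the origin buffer.
------------------------------------------------------------------------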
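The core of MPIDI_Win_acc_eager_hdlr is an element-wise reduction into the
local window, as described above.  The sketch below shows only the
MPI_SUM / MPI_INT case and omits on-the-fly data conversion; the handler
signature and header layout are invented for illustration.

    #include <mpi.h>

    /* Hypothetical header for an eager accumulate RHC. */
    typedef struct {
        MPI_Aint target_disp;   /* displacement into the local window */
        int      count;         /* number of elements that follow     */
        MPI_Op   op;            /* requested reduction operation      */
    } acc_rhc_t;

    /* Target side: accumulate the enclosed data into the local window.
       Only MPI_SUM on MPI_INT is shown; a real handler would switch on
       the basic datatype and operation, converting the data on the fly
       if the origin's representation differs. */
    int acc_eager_hdlr_sketch(const acc_rhc_t *hdr, const int *data,
                              void *win_base, volatile int *rhc_processed)
    {
        int *target = (int *) ((char *) win_base + hdr->target_disp);

        if (hdr->op == MPI_SUM) {
            /* Single-threaded assumption: no mutex is needed here. */
            for (int i = 0; i < hdr->count; i++)
                target[i] += data[i];
        }

        (*rhc_processed)++;     /* the "number of RHCs processed" counter */
        return MPI_SUCCESS;
    }
------------------------------------------------------------------------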