Single-threaded implementation of RMA for distributed memory

------------------------------------------------------------------------

Base Assumptions

* All of the local windows are located in process local (not shared or
  remotely accessible) memory.

* Only basic datatypes are supported for the target.

* Only active (fence) synchronization is supported.

* The application is single threaded.

* The MPI runtime system is single threaded.

------------------------------------------------------------------------

General Notes

* "Lessons Learned from Implementing BSP" by J. Hill and
  D.B. Skillicorn suggests that we should not perform RMA operations
  as they are requested, but rather queue the entire set of operations
  and perform them at the next synchronization operation.
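
A minimal sketch of what such a deferred-operation queue might look
like; the type and field names here are hypothetical and not part of
any existing interface.

#include <stddef.h>

/* Hypothetical record describing one queued RMA operation.  MPID_Put,
   MPID_Get, and MPID_Accumulate would append to this list instead of
   communicating immediately; MPID_Win_fence would drain it. */
typedef struct rma_op {
    enum { RMA_OP_PUT, RMA_OP_GET, RMA_OP_ACC } kind;
    void          *origin_addr;   /* local buffer */
    int            origin_count;
    int            target_rank;
    size_t         target_disp;   /* displacement into the target window */
    int            target_count;
    struct rma_op *next;
} rma_op_t;

/* Push an operation onto a window's pending list; the list would be
   walked (and emptied) at the next synchronization operation. */
static void rma_queue_push(rma_op_t **head, rma_op_t *op)
{
    op->next = *head;
    *head = op;
}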

------------------------------------------------------------------------

Data Structures

* MPID_Win

  * struct MPIR_Win

  * handles - an array of local window handles (one per process)

    Q: Do we really need local window IDs?  We need to be able to map
    remote handler calls back to a particular window, but we might be
    able to do this using an attribute on a communicator.  Would an
    attribute lookup be too slow?
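
A sketch of how these pieces might fit together; the layout and the
counters shown are hypothetical, added only to make the later sections
concrete.

/* Hypothetical layout of the device window object; only the pieces
   discussed in these notes are shown. */
struct MPID_Win_sketch {
    /* struct MPIR_Win win; */  /* device-independent portion (see above) */
    int *handles;          /* local window handles, one per process */
    int  rhc_issued;       /* RHCs issued by this process in the current epoch */
    int  rhc_processed;    /* RHCs handled by this process in the current epoch */
    int  active_requests;  /* outstanding asynchronous requests */
    int  active_flags;     /* outstanding local completion flags */
};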

------------------------------------------------------------------------

MPID_Win_fence

* Since remote handler calls might be sent on another socket or
  processed in another thread, no natural synchronization occurs
  between RHCs and the collective operations.  Therefore, we need to
  know how many RHCs we should expect so that we don't prematurely
  return from the fence.  Likewise, we need to tell the other
  processes how many RHCs we have made.

* We need to block until all incoming RHCs have been handled and all
  local requests and flags have completed.

  Q: What is the right interface for this blocking operation?  The
  operation should block, but it needs to guarantee that forward
  progress is being made on both the incoming RHCs and locally posted
  operations.

  NOTE: We either need to pass dwin to a function or declare/cast the
  counters used in the while statement as volatile; otherwise the
  compiler may not generate instructions to reload the counter values
  before each iteration of the while loop.  (A sketch of such a loop
  appears at the end of this section.)

  Q: It would be useful if the MPID layer could increment a counter
  (or call a non-blocking function) when the asynchronous request or
  RHC completed.  This seems like a much more ideal interface than
  requests and flags, at least for RMA.  Might something of this
  nature be possible without putting undue burden on the device or
  significantly complicating the ADI?

* Wait for all other processes in the window to complete

  Q: Should we perform a barrier here?  If we eliminate the barrier,
  then all processes still waiting for operations to complete will
  have to enqueue incoming requests from the next epoch until the
  operations from the current epoch are complete.  Not performing the
  barrier complicates the RMA operations, but the performance benefit
  may be significant for some cases.  (What are they?  How common are
  they?)
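
A minimal sketch of the accounting and blocking described above, using
the hypothetical counters from the Data Structures sketch.
MPI_Reduce_scatter_block is used only to illustrate one way each
process could learn how many RHCs to expect, and progress_poke() is a
stand-in for whatever actually drives the progress engine.

#include <mpi.h>

static void progress_poke(void) { /* placeholder: drive the progress engine */ }

/* Sketch of the fence-time accounting described above.  rhc_counts[j]
   is the number of RHCs this process issued to rank j during the
   epoch; all of the names here are hypothetical. */
static int fence_wait_sketch(MPI_Comm win_comm, int *rhc_counts,
                             volatile const int *rhc_processed,
                             volatile const int *active_requests,
                             volatile const int *active_flags)
{
    int expected = 0;
    int mpi_errno;

    /* Tell the other processes how many RHCs we have made and learn
       how many incoming RHCs we should expect. */
    mpi_errno = MPI_Reduce_scatter_block(rhc_counts, &expected, 1,
                                         MPI_INT, MPI_SUM, win_comm);
    if (mpi_errno != MPI_SUCCESS)
        return mpi_errno;

    /* The counters are volatile so the compiler reloads them on each
       iteration (see the NOTE above). */
    while (*rhc_processed < expected ||
           *active_requests > 0 || *active_flags > 0)
        progress_poke();

    /* A barrier keeps the next epoch from starting early; see the
       question above about whether it can be eliminated. */
    return MPI_Barrier(win_comm);
}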

------------------------------------------------------------------------

MPID_Get

* If the target and origin ranks are the same, then copy the data from
  the target buffer to the origin buffer.

* Otherwise, we are attempting to get data from a remote node

  * Post an asynchronous receive for the data

    NOTE: the tag must be unique for this epoch so as to ensure that
    the soon-to-be incoming message is matched with this receive.

    NOTE: the request needs to be allocated from the window's active
    requests object so that it can be tracked.

  * Issue a remote handler call requesting the data from the remote
    process

    NOTE: the local completion flag needs to be allocated from the
    window's active flags object so that it can be tracked.
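
The remote branch might look roughly like the sketch below.  The
argument structure and rhc_issue() are hypothetical placeholders for
whatever RHC mechanism the device provides; request and flag tracking
are omitted.

#include <stddef.h>
#include <mpi.h>

/* Hypothetical arguments carried by the get RHC. */
typedef struct get_rhc_args {
    int    win_handle;   /* target's local window handle */
    size_t target_disp;  /* displacement into the target's local window */
    int    count;        /* number of (basic datatype) elements */
    int    reply_tag;    /* epoch-unique tag the reply will carry */
} get_rhc_args_t;

/* Placeholder for the device's remote-handler-call mechanism. */
static int rhc_issue(int target_rank, const get_rhc_args_t *args)
{
    (void) target_rank; (void) args;
    return MPI_SUCCESS;
}

/* Sketch of the remote branch of MPID_Get. */
static int get_remote_sketch(void *origin_addr, int count, MPI_Datatype dtype,
                             int target_rank, const get_rhc_args_t *args,
                             MPI_Comm win_comm, MPI_Request *req)
{
    /* Post the receive first, using the epoch-unique tag, so the
       reply cannot arrive unmatched. */
    int mpi_errno = MPI_Irecv(origin_addr, count, dtype, target_rank,
                              args->reply_tag, win_comm, req);
    if (mpi_errno != MPI_SUCCESS)
        return mpi_errno;

    /* Ask the target's MPIDI_Win_get_hdlr to send the data back. */
    return rhc_issue(target_rank, args);
}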

------------------------------------------------------------------------

MPIDI_Win_get_hdlr

* Post an asynchronous send of the requested data to the origin process

* Increment the "number of RHCs processed" counter
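
On the target side, those two steps amount to something like the
following; every name apart from MPI_Isend is a hypothetical
placeholder.

#include <stddef.h>
#include <mpi.h>

/* Sketch of the get handler: send the requested region of the local
   window back to the origin and account for the RHC. */
static int win_get_hdlr_sketch(char *win_base, size_t target_disp,
                               int count, MPI_Datatype dtype,
                               int origin_rank, int reply_tag,
                               MPI_Comm win_comm, MPI_Request *req,
                               int *rhc_processed)
{
    /* Reply with the data named by the RHC arguments; the tag matches
       the receive the origin posted in MPID_Get. */
    int mpi_errno = MPI_Isend(win_base + target_disp, count, dtype,
                              origin_rank, reply_tag, win_comm, req);
    if (mpi_errno == MPI_SUCCESS)
        (*rhc_processed)++;   /* "number of RHCs processed" counter */
    return mpi_errno;
}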

------------------------------------------------------------------------

MPID_Put

* If the target and origin ranks are the same, then copy the data from
  the origin buffer to the target buffer.

* Otherwise, if the source and target buffers are contiguous and data
  conversion is not required

  NOTE: What I would like to do here is use MPID_Put_contig, but that
  would require that I communicate with the remote process in order to
  agree on a flag.  It would be much better if the target completion
  flag were a counter so that the counter could be prearranged and
  used for all Put operations.

* Otherwise, if the data is sufficiently small

  * If the data is not contiguous, then pack the data into a temporary
    buffer.

    NOTE: This assumes that MPID_Pack() does not add a header to the
    packed data.

  * Issue a remote handler call (MPIDI_Win_put_eager_hdlr) requesting
    that the data be written to the target's local window

* Otherwise, the data is large enough to send in one or more separate
  messages

  * Issue a remote handler call (MPIDI_Win_put_hdlr) letting the
    target know that data is being sent that needs to be written into
    the target's local window

  * Post an asynchronous send of the origin buffer

  Q: Instead of using MPI_Isend(), should we instead use segments and
  multiple RHCs to send the data?  Would doing so imply that the RMA
  subsystem now needs to do flow control?

Q: Should we have yet another case, where a rendezvous occurs,
guaranteeing that the target is able to post a receive before the send
is issued?  This would allow us to use MPID_Irsend(), potentially
eliminating an extra message.  Rather than having another case, should
we use this technique anytime the data is larger than the eager
message threshold?
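
A sketch of the path selection described above.  PUT_EAGER_MAX, the
RHC helpers, and the tag argument are hypothetical; the contiguous
(MPID_Put_contig) case, data conversion, and request/flag tracking are
omitted.

#include <stdlib.h>
#include <mpi.h>

#define PUT_EAGER_MAX 16384   /* hypothetical eager threshold, in bytes */

/* Placeholders for the two RHC paths described above. */
static int put_eager_rhc(int rank, const void *data, int nbytes)
    { (void) rank; (void) data; (void) nbytes; return MPI_SUCCESS; }
static int put_rndv_rhc(int rank, int nbytes)
    { (void) rank; (void) nbytes; return MPI_SUCCESS; }

/* Sketch of the remote branch of MPID_Put. */
static int put_remote_sketch(const void *origin_addr, int count,
                             MPI_Datatype dtype, int target_rank,
                             int data_tag, MPI_Comm win_comm,
                             MPI_Request *req)
{
    int nbytes = 0;
    int mpi_errno;

    MPI_Pack_size(count, dtype, win_comm, &nbytes);

    if (nbytes <= PUT_EAGER_MAX) {
        /* Eager case: pack the data into a temporary buffer (a
           contiguity check could skip this step) and ship it inside
           the MPIDI_Win_put_eager_hdlr RHC itself. */
        void *tmp = malloc(nbytes);
        int pos = 0;
        if (tmp == NULL)
            return MPI_ERR_NO_MEM;
        MPI_Pack(origin_addr, count, dtype, tmp, nbytes, &pos, win_comm);
        mpi_errno = put_eager_rhc(target_rank, tmp, pos);
        free(tmp);
    } else {
        /* Large case: tell the target (MPIDI_Win_put_hdlr) that data
           is coming, then send the origin buffer in a separate
           message. */
        mpi_errno = put_rndv_rhc(target_rank, nbytes);
        if (mpi_errno == MPI_SUCCESS)
            mpi_errno = MPI_Isend(origin_addr, count, dtype, target_rank,
                                  data_tag, win_comm, req);
    }
    return mpi_errno;
}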

------------------------------------------------------------------------

MPIDI_Win_put_eager_hdlr

* Unpack the data into the local window buffer, performing data
  conversion if necessary

  Q: How are the header and data obtained?  Depending on the interface
  and the datatype, we should be able to read the data directly into
  the window buffer.

* Increment the "number of RHCs processed" counter
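
Assuming the eager RHC delivers the packed bytes alongside its header,
the handler body might look like this sketch; window lookup and
heterogeneous data conversion are omitted, and the names are
hypothetical.

#include <mpi.h>

/* Sketch of the eager put handler: unpack the bytes carried by the
   RHC into the local window at the requested displacement. */
static int win_put_eager_hdlr_sketch(char *win_base, MPI_Aint target_disp,
                                     int target_count, MPI_Datatype target_dtype,
                                     const void *packed, int packed_nbytes,
                                     MPI_Comm win_comm, int *rhc_processed)
{
    int pos = 0;

    /* MPI_Unpack scatters the packed bytes according to the target
       datatype; for a contiguous basic datatype this is effectively a
       memcpy. */
    int mpi_errno = MPI_Unpack(packed, packed_nbytes, &pos,
                               win_base + target_disp, target_count,
                               target_dtype, win_comm);
    if (mpi_errno == MPI_SUCCESS)
        (*rhc_processed)++;   /* "number of RHCs processed" counter */
    return mpi_errno;
}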

------------------------------------------------------------------------

MPIDI_Win_put_hdlr

* Post an asynchronous receive of data into the window buffer defined
  in the RHC header

  Q: What should we do if a communication failure occurs?  Is the
  origin somehow notified of the failure?

* Increment the "number of RHCs processed" counter

------------------------------------------------------------------------

MPID_Accumulate

* If the target and origin ranks are the same, then accumulate the
  data from the origin memory region into the target memory region
  using the specified operation.

  NOTE: For now, we are assuming the application and the message agent
  are single-threaded so we do not need to hold a mutex before
  performing the operation.

* Otherwise, if the data is sufficiently small:

  * If the data is not contiguous, then pack the data into a temporary
    buffer.

    NOTE: This assumes that MPID_Pack() does not add a header to the
    packed data.

  * Issue a remote handler call (MPIDI_Win_acc_eager_hdlr) requesting
    that the enclosed data be accumulated into the target buffer using
    the specified operation.

* Otherwise, the data is large enough that it needs to be segmented.

  Q: On the target side, we don't want to have to unpack the segment
  into a temporary buffer first.  We would like to do the data
  conversion and accumulation directly from the segment that will be
  received.  Does this make it impossible to use the MPIR_Segment API?
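
For the same-rank case, the element-wise combine is straightforward; a
sketch for the MPI_SUM / MPI_INT combination follows.  A full
implementation would dispatch on the operation and datatype, and would
hold a mutex if either the application or the message agent were
multi-threaded.

/* Sketch of the local (origin rank == target rank) accumulate for the
   MPI_SUM / MPI_INT combination; other op/datatype pairs would be
   handled by a similar dispatch. */
static void local_accumulate_sum_int(const int *origin, int *target, int count)
{
    int i;
    for (i = 0; i < count; i++)
        target[i] += origin[i];   /* combine origin data into the window */
}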

------------------------------------------------------------------------

MPIDI_Win_acc_eager_hdlr

* Perform the requested operation, converting the data on the fly if
  necessary

* Increment the "number of RHCs processed" counter

------------------------------------------------------------------------