doc/notes/rma/passive-acc.txt

Important issues with passive target accumulate

* Simple accumulations should combine lock/unlock with operation when
  latency is high

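  To illustrate the combined-message idea, here is a minimal sketch of a
  hypothetical single request that carries the lock type, the operation,
  and a small inline payload together, so one network transaction
  replaces three.  The struct layout and all names are illustrative
  assumptions, not an actual wire format.

  ```c
  /* Hypothetical packed "lock + accumulate + unlock" request: for small
   * accumulations under high latency, one message instead of three. */
  #include <assert.h>
  #include <stdio.h>
  #include <string.h>

  typedef struct {
      int lock_type;    /* e.g., shared vs. exclusive */
      int op;           /* e.g., a sum reduction */
      int target_disp;  /* displacement into the target's local window */
      int count;        /* number of elements in the inline payload */
      double data[8];   /* small payload carried inline with the request */
  } combined_acc_msg;

  /* Origin side: pack one small accumulate plus the implicit lock/unlock. */
  static size_t pack_combined(combined_acc_msg *m, int lock_type, int op,
                              int disp, const double *buf, int count)
  {
      m->lock_type = lock_type;
      m->op = op;
      m->target_disp = disp;
      m->count = count;
      memcpy(m->data, buf, count * sizeof(double));
      return sizeof(*m);
  }

  /* Target side: apply the operation, then release the implicit lock. */
  static void apply_combined(const combined_acc_msg *m, double *window)
  {
      for (int i = 0; i < m->count; i++)
          window[m->target_disp + i] += m->data[i];  /* sum reduction */
  }

  int main(void)
  {
      double window[16] = {0};
      double src[3] = {1.0, 2.0, 3.0};
      combined_acc_msg msg;

      pack_combined(&msg, 0, 0, 4, src, 3);
      apply_combined(&msg, window);

      assert(window[4] == 1.0 && window[5] == 2.0 && window[6] == 3.0);
      printf("combined request applied\n");
      return 0;
  }
  ```

  A real implementation would fall back to the separate lock/op/unlock
  protocol when the payload is too large to carry inline.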
* Separate accumulations into non-overlapping buffers (within the same
  local window) should be processed with as much parallelism as
  possible.

  For single-threaded MPI runtime systems, it is reasonable to
  restrict such optimizations to accumulations issued from separate
  processes, since communications for multiple accumulations from a
  single process are likely to be serialized by the networking device
  anyway.

  The trick to parallelizing multiple requests is identifying that
  their target buffers do not overlap.  Detecting this seems extremely
  difficult; however, an assertion at the time the lock is acquired
  could be used to inform the runtime system that no overlapping
  accumulations will be issued by other processes during this epoch.

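  A minimal sketch of how such a lock-time assertion could steer the
  runtime.  MPI_MODE_NOCHECK is a real MPI lock assertion, but the
  no-overlap hint and both constants below are stand-ins invented for
  illustration, not part of any MPI implementation.

  ```c
  /* Sketch: choose the accumulate-processing strategy from a lock-time
   * assertion.  MPIX_MODE_NO_OVERLAP is a hypothetical hint meaning "no
   * other process will issue overlapping accumulations this epoch". */
  #include <assert.h>
  #include <stdio.h>

  #define MPIX_MODE_NOCHECK    (1 << 0)  /* stand-in for the real value */
  #define MPIX_MODE_NO_OVERLAP (1 << 1)  /* hypothetical new assertion */

  typedef enum { ACC_SERIALIZED, ACC_PARALLEL } acc_strategy;

  /* The runtime may only parallelize accumulate processing when the
   * user has promised that target buffers do not overlap. */
  static acc_strategy choose_strategy(int lock_assert)
  {
      return (lock_assert & MPIX_MODE_NO_OVERLAP) ? ACC_PARALLEL
                                                  : ACC_SERIALIZED;
  }

  int main(void)
  {
      assert(choose_strategy(0) == ACC_SERIALIZED);
      assert(choose_strategy(MPIX_MODE_NO_OVERLAP) == ACC_PARALLEL);
      assert(choose_strategy(MPIX_MODE_NOCHECK) == ACC_SERIALIZED);
      printf("strategy selection ok\n");
      return 0;
  }
  ```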
  * Two-sided
    Unless multiple threads (and processors) are present on the target
    system to process accumulation operations, serialization is
    inevitable, and the best optimization is to avoid burdening the
    common path with multi-threading machinery that slows things down.
    This suggests that two implementations may be necessary: one that
    uses threads and one that does not.

    If multiple threads and processors are available on the target
    system, then one might be able to use separate threads to process
    accumulate requests from different processes.  Multiple threads
    might also be useful to help overlap communication and computation
    for a single request if that request is large enough.

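    A minimal pthreads sketch of the per-origin idea, assuming the
    no-overlap guarantee holds: each worker thread applies accumulates
    from one origin into a disjoint slice of the local window, so no
    locking is needed between workers.  The names and the fixed region
    layout are illustrative; a real runtime would pull requests from
    per-origin queues.

    ```c
    /* One worker per origin process applies accumulate requests into a
     * disjoint region of the target's local window, in parallel. */
    #include <assert.h>
    #include <pthread.h>
    #include <stdio.h>

    #define REGION 1024

    static double window[2 * REGION];   /* target's local window */

    typedef struct { int origin; } acc_req;

    static void *acc_worker(void *arg)
    {
        acc_req *req = arg;
        double *region = window + req->origin * REGION; /* disjoint slice */
        for (int i = 0; i < REGION; i++)
            region[i] += 1.0;           /* MPI_SUM-style reduction */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];
        acc_req req[2] = {{0}, {1}};

        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, acc_worker, &req[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);

        for (int i = 0; i < 2 * REGION; i++)
            assert(window[i] == 1.0);
        printf("parallel accumulates ok\n");
        return 0;
    }
    ```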
  * Remote memory
    The same issues that apply in the two-sided case apply here.
    The only real difference is that we might use get/put to fill
    buffers instead of send/recv.

  * Shared memory
  * Multi-method
* Large accumulations into the same buffer should be pipelined to
  achieve maximum parallelism.

  If it is possible to detect that the same target buffer was being
  used by a set of processes, and that no other overlapping target
  buffers were simultaneously being operated upon, then mutexes can
  be associated with areas of progress within that buffer rather
  than with particular regions of the target's local window.
  Defining the areas of progress may happen naturally as a result of
  limited buffer space necessitating the use of segments.  So,
  atomicity would need to be guaranteed for the operations
  associated with a particular segment of a buffer rather than a
  particular region within the local window.  In practice, detecting
  that the conditions for atomicity have been met may prove too
  difficult to do while still obtaining reasonable performance.

  * Two-sided
    If multiple threads are available to process multiple data
    streams, then it should be possible to pipeline the processing.
    It is critical, however, that the data be organized and sent in
    such a way as to maximize parallelism.

  * Remote memory
  * Shared memory
    For shared memory, this can be accomplished by dividing local
    windows into regions and providing a separate mutex for each
    region.

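    A minimal sketch of that scheme, assuming fixed-size regions: an
    accumulate holds only the mutex for the region it is currently
    writing, so operations on different regions of the same window can
    proceed in parallel and large operations are naturally pipelined
    region by region.  Sizes and names are illustrative.

    ```c
    /* Per-region mutexes over a shared local window: lock one region
     * at a time while accumulating, releasing it before moving on. */
    #include <assert.h>
    #include <pthread.h>
    #include <stdio.h>

    #define WIN_SIZE    1024
    #define REGION_SIZE 256
    #define NREGIONS    (WIN_SIZE / REGION_SIZE)

    static double window[WIN_SIZE];
    static pthread_mutex_t region_lock[NREGIONS] = {
        PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
        PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
    };

    /* Accumulate 'count' doubles at displacement 'disp', holding only
     * the lock of the region currently being updated. */
    static void region_accumulate(int disp, const double *buf, int count)
    {
        int done = 0;
        while (done < count) {
            int r = (disp + done) / REGION_SIZE;
            int end = (r + 1) * REGION_SIZE;  /* first index past region r */
            pthread_mutex_lock(&region_lock[r]);
            for (int i = disp + done; i < end && done < count; i++, done++)
                window[i] += buf[done];       /* sum reduction */
            pthread_mutex_unlock(&region_lock[r]);
        }
    }

    int main(void)
    {
        double buf[10];
        for (int i = 0; i < 10; i++) buf[i] = 1.0;

        region_accumulate(250, buf, 10);      /* spans regions 0 and 1 */
        for (int i = 250; i < 260; i++)
            assert(window[i] == 1.0);
        printf("region-locked accumulate ok\n");
        return 0;
    }
    ```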
  * Multi-method
* Datatype caching
  [traff00:mpi-impl] and [booth00:mpi-impl] discuss the need for
  datatype caching by the target process when the target process
  must be involved in the RMA operations (i.e., when the data cannot
  be directly read and interpreted by the origin process).

  Datatype caching can be either proactive or reactive.
  In the proactive case, the origin process would track whether the
  target process has a copy of the datatype and send the datatype to
  the target process when necessary.  This means that each datatype
  must contain tracking information.  Unfortunately, because of
  dynamic processes in MPI-2, something more complex than a simple
  bit vector must be used to track the processes already caching a
  datatype.  What is the correct, high-performance structure?

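  One candidate, sketched under stated assumptions rather than as an
  answer to that question: each datatype carries a small open-addressed
  hash set of 64-bit global process IDs, which (unlike a bit vector)
  does not require a fixed, pre-known process space.  Sizing, hashing,
  and names are all illustrative, and a real version would grow the
  table on overflow.

  ```c
  /* Hypothetical per-datatype tracking set for proactive caching:
   * records which processes already hold a cached copy. */
  #include <assert.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  #define CACHE_SLOTS 64            /* power of two; grow on overflow */

  typedef struct {
      uint64_t slot[CACHE_SLOTS];   /* 0 = empty (IDs start at 1) */
  } dtype_cache_set;

  static void cache_init(dtype_cache_set *s) { memset(s, 0, sizeof(*s)); }

  /* Returns 1 if 'gid' was already present; otherwise records it and
   * returns 0, telling the caller the datatype must still be sent. */
  static int cache_test_and_set(dtype_cache_set *s, uint64_t gid)
  {
      uint64_t h = (gid * 0x9e3779b97f4a7c15ULL) & (CACHE_SLOTS - 1);
      while (s->slot[h] != 0) {
          if (s->slot[h] == gid) return 1;
          h = (h + 1) & (CACHE_SLOTS - 1);  /* linear probing */
      }
      s->slot[h] = gid;
      return 0;
  }

  int main(void)
  {
      dtype_cache_set set;
      cache_init(&set);

      assert(cache_test_and_set(&set, 42) == 0);  /* must send datatype */
      assert(cache_test_and_set(&set, 42) == 1);  /* already cached */
      assert(cache_test_and_set(&set, 7)  == 0);
      printf("datatype cache tracking ok\n");
      return 0;
  }
  ```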
  Alternatively, a reactive approach could be used.  In the reactive
  approach, the origin process would assume that the target process
  already knows about the datatype.  If that assumption is false, the
  target process will request the datatype from the origin process.
  This simplifies the tracking on the origin, but does not completely
  eliminate it.  It is necessary for the origin to increase the
  reference count associated with the datatype until the next
  synchronization point to ensure that the datatype is not deleted
  before the target process has had sufficient opportunity to request
  a copy of the datatype.

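  The reference-count discipline above can be sketched as follows; the
  struct and function names are illustrative stand-ins, not an actual
  implementation.  The origin pins the datatype for the duration of the
  epoch and drops that reference only at the synchronization point.

  ```c
  /* Sketch of the reactive approach's lifetime rule: a datatype used
   * in an RMA epoch must survive until the next synchronization. */
  #include <assert.h>
  #include <stdio.h>

  typedef struct {
      int refcount;
      int freed;          /* stands in for actually releasing storage */
  } datatype;

  static void dtype_addref(datatype *d) { d->refcount++; }

  static void dtype_release(datatype *d)
  {
      if (--d->refcount == 0)
          d->freed = 1;   /* safe: no target can still request a copy */
  }

  int main(void)
  {
      datatype d = {1, 0};    /* user holds one reference */

      dtype_addref(&d);       /* origin pins it for the RMA epoch */
      dtype_release(&d);      /* user frees the datatype early */
      assert(!d.freed);       /* still alive for target requests */

      dtype_release(&d);      /* synchronization point: unpin */
      assert(d.freed);
      printf("datatype survives until synchronization\n");
      return 0;
  }
  ```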
  * Two-sided
  * Remote memory
  * Shared memory
    For shared memory, datatype caching is unnecessary if the origin
    process performs the work.  If the target is involved in the work,
    the necessary datatype information can be placed in shared memory.

  * Multi-method