Important issues with passive target accumulate

* Simple accumulations should combine lock/unlock with operation when
  latency is high

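The combining idea can be sketched as a small model (Python purely for illustration; the `Origin` class and its method names are invented here, not MPI API): the origin defers the lock request, queues operations issued during the epoch, and ships lock, operations, and unlock in a single message at unlock time, paying one network latency instead of three.

```python
# Illustrative model of combining a lock/accumulate/unlock sequence
# into one message. All names are invented for this sketch.
class Origin:
    def __init__(self):
        self.messages = []   # each entry models one network round trip
        self.pending = None

    def lock(self, target):
        # Defer: send no lock packet yet; just open a local epoch.
        self.pending = {"target": target, "ops": []}

    def accumulate(self, target, offset, data):
        # Queue the operation locally instead of sending it eagerly.
        self.pending["ops"].append((offset, data))

    def unlock(self, target):
        # Ship lock request, queued operations, and unlock together.
        self.messages.append(("lock+ops+unlock", self.pending["ops"]))
        self.pending = None

origin = Origin()
origin.lock(target=1)
origin.accumulate(target=1, offset=0, data=[1, 2, 3])
origin.unlock(target=1)
assert len(origin.messages) == 1   # one round trip instead of three
```

An eager implementation would send three separate messages for the same sequence; when latency dominates, the combined message wins even though it delays the lock acquisition to unlock time.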
* Separate accumulations into non-overlapping buffers (within the same
  local window) should be processed with as much parallelism as
  possible.

  For single-threaded MPI runtime systems, it is reasonable to
  restrict such optimizations to accumulations issued from separate
  processes, since communications for multiple accumulations from a
  single process are likely to be serialized by the networking
  device anyway.

  The trick to parallelizing multiple requests is identifying that
  their target buffers do not overlap. Detecting this seems
  extremely difficult; however, an assertion at the time the lock
  is acquired could be used to inform the runtime system that no
  overlapping accumulations will be issued by other processes
  during this epoch.

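The non-overlap test itself is cheap once the target ranges are known; the hard part, as noted above, is knowing them. A minimal sketch (Python for illustration; `regions_overlap` and the `(offset, length)` request form are invented here, though the hint could plausibly ride on the existing assert argument of MPI_Win_lock):

```python
# Two passive-target accumulate requests may be processed in parallel
# only if their byte ranges within the target window do not overlap.
def regions_overlap(off_a, len_a, off_b, len_b):
    # Standard half-open interval intersection test.
    return off_a < off_b + len_b and off_b < off_a + len_a

def can_parallelize(req_a, req_b):
    # req = (offset, length) of the target buffer in the local window
    return not regions_overlap(*req_a, *req_b)
```

Adjacent but non-overlapping ranges, such as (0, 100) and (100, 50), are safe to process concurrently.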
  * Two-sided

    Unless multiple threads (and processors) are present on the
    target system to process accumulation operations, serialization
    is imminent, and the best optimization is to keep
    multi-threading machinery out of the way so that it does not
    slow down the common case. This suggests that two
    implementations may be necessary: one that uses threads and one
    that does not.

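The two-implementation idea could be realized by selecting a progress path once at initialization, so the single-threaded case pays no synchronization cost. A hedged sketch (Python; the function names and the `sum`-as-accumulate stand-in are invented for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def process_accumulate_single(requests):
    # Serialized processing: no locks, no thread overhead.
    return [sum(req) for req in requests]

def process_accumulate_threaded(requests):
    # Parallel processing of independent requests.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(sum, requests))

def select_impl(have_threads):
    # Chosen once at startup; hot path then calls through one pointer.
    return process_accumulate_threaded if have_threads else process_accumulate_single
```

Both paths must produce identical results; only the cost model differs.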
    If multiple threads and processors are available on the target
    system, then separate threads might be used to process
    accumulate requests from different processes. Multiple threads
    might also be useful for overlapping communication and
    computation for a single request if that request is large
    enough.

  * Remote memory

    The same issues that apply to the two-sided case apply here.
    The only real difference is that we might use get/put to fill
    buffers instead of send/recv.

  * Shared memory

  * Multi-method

* Large accumulations into the same buffer should be pipelined to
  achieve maximum parallelism.

  If it is possible to detect that the same target buffer is being
  used by a set of processes, and that no other overlapping target
  buffers are simultaneously being operated upon, then mutexes can
  be associated with areas of progress within that buffer rather
  than with particular regions of the target's local window.
  Defining the areas of progress may happen naturally as a result
  of limited buffer space necessitating the use of segments. So,
  atomicity would need to be guaranteed for the operations on a
  particular segment of the buffer rather than for a particular
  region of the local window. In practice, detecting that the
  conditions for atomicity have been met may prove too difficult to
  obtain reasonable performance.

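The segment idea can be modelled as follows (Python; the `SEGMENT` size and all names are invented): each arriving segment of a large sum-accumulate is reduced into the window independently, so correctness only requires per-segment atomicity, and segments of one large operation can be in flight concurrently.

```python
# Model of a large accumulate arriving and being applied in segments.
SEGMENT = 4   # illustrative segment size (elements)

def segments(data):
    # Split a large contribution into (offset, chunk) pieces.
    for off in range(0, len(data), SEGMENT):
        yield off, data[off:off + SEGMENT]

def accumulate_pipelined(window, data):
    for off, chunk in segments(data):
        # In a real runtime each segment would be guarded by its own
        # mutex and could be processed as soon as it arrives, possibly
        # overlapping with the transfer of later segments.
        for i, v in enumerate(chunk):
            window[off + i] += v

window = [0] * 10
accumulate_pipelined(window, list(range(10)))
assert window == list(range(10))
```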
  * Two-sided

    If multiple threads are available to process multiple data
    streams, then it should be possible to pipeline the processing.
    It is critical, however, that the data be organized and sent in
    such a way as to maximize parallelism.

  * Remote memory

  * Shared memory

    For shared memory, this can be accomplished by dividing local
    windows into regions and providing a separate mutex for each
    region.

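One way to realize per-region mutexes, sketched with Python threading locks (the region size and class name are invented; acquiring region locks in ascending order avoids deadlock when a target range spans several regions):

```python
import threading

REGION = 1024   # illustrative region size in bytes

class RegionLocks:
    def __init__(self, window_bytes):
        # One mutex per fixed-size region of the local window.
        n = (window_bytes + REGION - 1) // REGION
        self.locks = [threading.Lock() for _ in range(n)]

    def regions_for(self, offset, length):
        # Indices of every region the byte range [offset, offset+length) touches.
        return range(offset // REGION, (offset + length - 1) // REGION + 1)

    def acquire(self, offset, length):
        for r in self.regions_for(offset, length):   # fixed order: no deadlock
            self.locks[r].acquire()

    def release(self, offset, length):
        for r in self.regions_for(offset, length):
            self.locks[r].release()
```

Accumulates into disjoint regions then contend on different mutexes and proceed in parallel.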
  * Multi-method

* Datatype caching

  [traff00:mpi-impl] and [booth00:mpi-impl] discuss the need for
  datatype caching by the target process when the target process
  must be involved in the RMA operations (i.e., when the data
  cannot be directly read and interpreted by the origin process).

  Datatype caching can be either proactive or reactive.

  In the proactive case, the origin process would track whether the
  target process has a copy of the datatype and send the datatype
  to the target process when necessary. This means that each
  datatype must contain tracking information. Unfortunately,
  because of dynamic processes in MPI-2, something more complex
  than a simple bit vector must be used to track the processes
  already caching a datatype. What is the correct,
  high-performance structure?

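One candidate structure, sketched in Python (all names invented): since dynamic processes mean there is no single dense rank space to index a bit vector with, track cached copies in a hash set keyed by a (connection, rank) pair. Whether a hash lookup per operation is fast enough is exactly the open question above.

```python
# Per-datatype tracking of which remote processes hold a cached copy.
class DatatypeCacheTracker:
    def __init__(self):
        self.cached_at = set()   # {(connection_id, rank), ...}

    def needs_send(self, conn, rank):
        # True if the datatype description must accompany this operation.
        return (conn, rank) not in self.cached_at

    def mark_cached(self, conn, rank):
        # Record that the target now holds a copy.
        self.cached_at.add((conn, rank))
```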
  Alternatively, a reactive approach could be used. In the reactive
  approach, the origin process would assume that the target process
  already knows about the datatype. If that assumption is false,
  the target process will request the datatype from the origin
  process. This simplifies the tracking on the origin, but does not
  completely eliminate it. The origin must increase the reference
  count associated with the datatype and hold that reference until
  the next synchronization point, to ensure that the datatype is
  not deleted before the target process has had sufficient
  opportunity to request a copy of it.

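The origin-side bookkeeping for the reactive scheme might look like this (a Python model with invented names; a real implementation would hook the datatype free routine and window synchronization): each RMA operation takes a reference on the datatype, and those references are dropped only at the next synchronization point.

```python
# Model of holding datatype references until the next synchronization.
class Datatype:
    def __init__(self):
        self.refs = 1          # the user's own reference
        self.freed = False

    def addref(self):
        self.refs += 1

    def release(self):
        self.refs -= 1
        if self.refs == 0:
            self.freed = True  # storage actually reclaimed here

held = []                      # references pinned by outstanding RMA ops

def start_rma(dtype):
    dtype.addref()             # pin until the next synchronization
    held.append(dtype)

def synchronize():             # e.g. at unlock/fence time
    while held:
        held.pop().release()

dt = Datatype()
start_rma(dt)
dt.release()                   # user frees the datatype mid-epoch
assert not dt.freed            # still pinned for the target's benefit
synchronize()
assert dt.freed                # safe to reclaim after synchronization
```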
  * Two-sided

  * Remote memory

  * Shared memory

    For shared memory, datatype caching is unnecessary if the
    origin process performs the work. If the target is involved in
    the work, the necessary datatype information can be placed in
    shared memory.

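Placing datatype information in shared memory could amount to storing a flattened (displacement, blocklength) representation that the target reads directly; a hedged sketch of that representation (Python; the flattened form and function names are invented for illustration):

```python
# Flatten a noncontiguous layout into (displacement, blocklength) pairs
# that any process mapping the shared region could interpret.
def flatten(blocklengths, displacements):
    return list(zip(displacements, blocklengths))

def apply_accumulate(window, flat, data):
    # Target-side sum-accumulate driven by the flattened description.
    i = 0
    for disp, blk in flat:
        for j in range(blk):
            window[disp + j] += data[i]
            i += 1
```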
  * Multi-method