Single-threaded implementation of RMA for shared memory
------------------------------------------------------------------------
Basic Assumptions
* All of the local windows associated with the specified window object
are located in and accessible through shared memory.
* All processors involved in the communicator are homogeneous.
* Only basic datatypes are supported.
------------------------------------------------------------------------
General Notes
------------------------------------------------------------------------
Data Structures
------------------------------------------------------------------------
MPID_shm_Win_create
* If the shared memory is not cache coherent, then initialize the
  preceding put flag
If the local window is located in non-cache coherent shared memory,
then we need to track put operations to the local window which
(might) have occurred since the last fence. This tracking is
required so that cache lines associated with the local window can be
invalidated, ensuring that the local process sees the changes.
Q: Can puts happen before the first fence? In other words, is an
exposure epoch implicitly opened as part of the window creation
process?
* Initialize the inter-process (shared memory) mutex
Mutexes are required in order to ensure that accumulate operations
on any given element (basic datatype) in the local window are
atomic.
NOTE: multiple mutexes may be needed if the local window is broken
into multiple regions. For details, see the discussion in
MPID_shm_Accumulate().
------------------------------------------------------------------------
MPID_shm_Win_fence
* If the shared memory is not cache coherent, flush cache and/or write
buffer as necessary
If the shared memory is not cache coherent and stores were
performed to the local window, then (depending on the
architecture specifics and the RMA implementation) we might need
to perform the following operations.
1) if system is using a write-back caching strategy, then flush
the cache
2) flush the write buffer
NOTE: It may be possible to defer these operations when
NOSUCCEED is also supplied. It's currently unclear if this
would be beneficial.
* barrier
We need a barrier to ensure that all remote puts and local stores to
the local window have completed so the results are available to
operations performed after the fence operation. We also
need to ensure that any remote gets and local loads from the local
window are complete before any future remote puts or local stores
are allowed to affect the local window.
* If the shared memory is not cache coherent
* invalidate cache
If the shared memory is not cache coherent and RMA puts were
performed to the local window, then (depending on the
architecture specifics and the RMA implementation) we might need
to invalidate any cache lines associated with the shared memory
bound to this window.
* set (or clear) preceding put flag based on the assertions
NOTE: To reduce unnecessary cache and write buffer flushes, the
barrier (above) could be replaced with an all-to-all exchange of
information about which operations occurred between node pairs.
Using this information, we could eliminate flushes except when an
operation actually affected the local window.
------------------------------------------------------------------------
MPID_shm_Get
* Copy data directly from the target buffer (located in shared memory)
to the origin buffer.
------------------------------------------------------------------------
MPID_shm_Put
* Copy data directly from the origin buffer to the target buffer
(located in shared memory).
------------------------------------------------------------------------
MPID_shm_Accumulate
* Lock target local window
The standard says that operations on elements (basic datatypes) need
to be atomic, but the entire accumulate operation need not be atomic
with respect to other accumulate operations. The simple solution is
to lock the whole window when performing an operation; however, this
serializes all operations, which will seriously hurt performance
when multiple processes/threads are attempting to
accumulate data into a single window (or even a single large buffer
in that window).
TODO: Develop an algorithm for performing the operations when the
local window is broken into multiple regions, with a mutex per
region. Care must be taken to ensure that if an element spans two
regions, then the mutexes for both regions must be locked before the
operation is performed on that element. Performing these lock
operations is likely to be somewhat expensive, so we will want a
tuneable parameter for specifying the minimum size of a region.
Q: Do inter-process mutexes also ensure mutual exclusion for threads
within the same process? If not, then we need to acquire both a
thread lock and a process lock. We probably want to acquire the thread
lock first to minimize the contention at the process lock.
* Perform requested accumulation
We need an algorithm for performing accumulations when the
datatypes are non-contiguous. Ideally, the two dataloops and the
accumulation operations could be processed without requiring any
extra copying, packing, or temporary buffers.
NOTE: While it may be possible to write a function to perform the
requested operations, it is likely that such functionality will need
to be inlined so that appropriate locking of local window regions
occurs as data is being processed. Also, the dataloops will need to
be optimized so that it is not necessary to acquire a region's mutex
more than once per request.
* Unlock target local window
------------------------------------------------------------------------