\section{Error Checking}

It would be useful to detect erroneous uses of the RMA interface by application
codes. It should be possible to detect all errors involving a single process
at the origin. Errors involving multiple processes, however, will undoubtedly
need to be detected at the target. To aid in this detection at the target, one
could either mark bytes in the local window to detect collisions or log each
operation and compare their target buffers for overlaps. In either case, one
only needs to detect access collisions for the duration of an exposure epoch.
This is complicated slightly by the ability of a shared lock to effectively
join an exposure epoch already in progress.

The logging technique would provide slightly more detail since the exact
processes issuing the illegal operations would be known. Logging has the
disadvantage of potentially unbounded memory consumption. Detecting overlaps in
any two target buffers will also be computationally expensive. For passive
target synchronization, this extra computation will almost certainly change the
relative timing between the processes and thus decrease the chances of
detecting an error. For active target synchronization, the extra overhead of
logging should not affect the ability to detect an error since the epochs are
well defined.

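The overlap check for the logging technique can be sketched as follows. This is only an illustration under simplifying assumptions: the names \texttt{rma\_log\_entry} and \texttt{rma\_log\_has\_overlap} are hypothetical, each logged operation is reduced to a single contiguous byte range, and any overlap is treated as a collision regardless of operation type. Sorting the log first avoids comparing every pair of entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One logged RMA operation, reduced to the byte range it touches
 * in the target window (hypothetical layout, for illustration only). */
typedef struct {
    size_t offset;   /* first byte touched */
    size_t length;   /* number of bytes touched, must be > 0 */
    int    origin;   /* rank that issued the operation */
} rma_log_entry;

static int cmp_by_offset(const void *a, const void *b)
{
    const rma_log_entry *x = a, *y = b;
    return (x->offset > y->offset) - (x->offset < y->offset);
}

/* Sorts the log by offset, then returns 1 if any two entries overlap
 * and 0 otherwise. After sorting, an overlap exists exactly when some
 * entry starts before the largest end offset seen so far. */
int rma_log_has_overlap(rma_log_entry *log, size_t n)
{
    size_t i, max_end;
    if (n < 2) return 0;
    qsort(log, n, sizeof log[0], cmp_by_offset);
    max_end = log[0].offset + log[0].length;
    for (i = 1; i < n; i++) {
        assert(log[i].length > 0);
        if (log[i].offset < max_end) return 1;
        if (log[i].offset + log[i].length > max_end)
            max_end = log[i].offset + log[i].length;
    }
    return 0;
}
```

The log would be checked once at the end of the exposure epoch and then discarded, which bounds how long entries have to be kept. Noncontiguous target datatypes would need one entry per contiguous piece.
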
The byte marking technique should provide better performance than logging.
And, while its memory consumption is bounded, that consumption is guaranteed to
be a large fraction of the window. For each byte in the window, we really need
three bits, one for each of the possible operations (put, get, and accumulate).
It might be useful to keep a list of the processes that accessed the window so
that the potential violators can be reported to the user.

Ok. We can do this with 2 bits actually, indicating what has occurred already:
00 - nothing
01 - put
10 - get
11 - accumulate

Then you can test to see if future operations break the rules.

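A minimal sketch of the 2-bit scheme, packing four marks per byte of bookkeeping. The function names are hypothetical, and the conflict rules coded here are an assumption about what ``break the rules'' means: a put conflicts with any other access to the same byte, while get with get and accumulate with accumulate are allowed. Two bits cannot record which reduction operation an accumulate used, so mismatched accumulate operations go undetected.

```c
#include <assert.h>
#include <stddef.h>

/* The four states from the list above. */
enum { MARK_NONE = 0, MARK_PUT = 1, MARK_GET = 2, MARK_ACC = 3 };

/* Read the 2-bit mark for window byte i (four marks per unsigned char). */
static unsigned mark_read(const unsigned char *marks, size_t i)
{
    return (marks[i / 4] >> ((i % 4) * 2)) & 3u;
}

/* Overwrite the 2-bit mark for window byte i with m. */
static void mark_write(unsigned char *marks, size_t i, unsigned m)
{
    unsigned shift = (unsigned)(i % 4) * 2;
    marks[i / 4] = (unsigned char)((marks[i / 4] & ~(3u << shift)) | (m << shift));
}

/* Record operation op on window bytes [off, off+len).
 * Returns 0 if no rule is broken, -1 at the first conflicting byte. */
int mark_access(unsigned char *marks, size_t off, size_t len, unsigned op)
{
    size_t i;
    assert(op == MARK_PUT || op == MARK_GET || op == MARK_ACC);
    for (i = off; i < off + len; i++) {
        unsigned prev = mark_read(marks, i);
        if (prev == MARK_NONE)
            mark_write(marks, i, op);       /* first access to this byte */
        else if (prev == MARK_PUT || prev != op)
            return -1;                      /* put vs. anything, or mixed ops */
    }
    return 0;
}
```

With this packing the bookkeeping costs one quarter of the window size, and the marks would be cleared at the start of each exposure epoch.
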
proposal:
- allow people to compile OUT the additional debugging stuff at compile time
  with a flag (in by default)
- allow it to be turned on and off at runtime via env. variable or whatever

----------

\section{passive target accumulate, exclusive lock}

you have an exclusive lock, you're doing a get on one set of things and
an accumulate on another set of things.

assume they are nonoverlapping datatypes.

mpi_win_lock(exclusive, rank, assert, win)
mpi_get(A,...)
mpi_accumulate(B,...,sum)
mpi_win_unlock()

for cache coherent shared memory (only)
---------------------------------------
one option:
  lock the appropriate window with interprocess_lock()
  do the get
  do the accumulate
  interprocess_unlock()

- if the regions of the window didn't overlap, it might be better to
  lock only the region(s) of interest.
- if the window is local, then this is the option to use

another option:
  don't do anything much on win_lock
  cache get
  cache acc
  interprocess_lock(), do get, do acc, unlock() as a result of win_unlock()

reordering of cached access could maybe be a win...

for remotely accessible memory (only)
-------------------------------------
there will be a lock on the remote system
there will be some sort of agent

one option:
  agent lock request
  do the get, either directly or through the agent
  do the accumulate, again through the agent or directly
  agent unlock request

option two:
  agent lock/start request, including some inefficient stuff
  do direct accesses directly that are efficient
  agent complete/unlock request

option three:
  single message that defines the entire access epoch, start to finish.

these cover the majority of the issues

side note:
if we have a numa system in some cases it might be more efficient to
have the process local to the memory region pack data for a get into a
contiguous region, then the remote process can grab that instead of
some set of discontiguous elements. the same process could be used
for puts or accumulates.

if we want to pipeline things, then we need to have something between
options two and three. we want to be able to get overlap of computation
(of buffer packing) and communication of rdma options.

we can use the win info structure to help tune when we try to pipeline, when
we wait to pack at end, and so on.

side note on lapi:
things like lapi have atomic counters which we might be able to use
to avoid explicit unlock calls. set to one when i'm done

we might be able to use these same counters to perform locks, but that would
be a nasty polling problem i think.

likewise we can use the same lapi stuff to have agents set values local to
the process performing operations in order to let it know that a set of
operations that make up an epoch (which they have previously described to the
agent) have been completed (and thus the process's unlock can complete)

note:
there is more optimization that can occur here than what we have discussed so
far; in particular nonoverlapping exclusive locks don't HAVE to be serialized
like we have implied we would here. we should think more about how we can
allow these things to continue simultaneously.

----------

\section{passive target accumulate, shared lock}

you have a shared lock, you're doing a get on one set of things and
an accumulate on another set of things.

assume they are nonoverlapping datatypes.

mpi_win_lock(shared, rank, assert, win)
mpi_accumulate(B,...,sum)
mpi_win_unlock()

remember that you can call accumulate over and over, even to the same data
element. and others can be calling accumulate to that element as well.

atomicity is maintained on the per-data-element basis only in the shared case.

brian's implementation idea:
- bust a window into a number of contiguous regions
- allow locks on each one of these regions separately
- in some cases there will be atomic operations supported by the processor, and
  in those cases we might be able to avoid the lock.
- always acquire locks in sequential order

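a small sketch of the ``always acquire locks in sequential order'' rule from the list above. the function name and the fixed region size are assumptions made for illustration. given the byte offsets an operation will touch, it produces the region indices in ascending order with duplicates removed, which is the order the per-region locks would be taken in so that two origins can never hold locks in conflicting orders.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

static int cmp_size(const void *a, const void *b)
{
    size_t x = *(const size_t *)a, y = *(const size_t *)b;
    return (x > y) - (x < y);
}

/* Given the byte offsets an operation touches, write the indices of the
 * regions to lock into `regions`, ascending and without duplicates.
 * `regions` must have room for n entries. Returns the number written. */
size_t region_lock_order(const size_t *offsets, size_t n,
                         size_t region_size, size_t *regions)
{
    size_t i, count = 0;
    assert(region_size > 0);
    for (i = 0; i < n; i++)
        regions[i] = offsets[i] / region_size;   /* offset -> region index */
    qsort(regions, n, sizeof regions[0], cmp_size);
    for (i = 0; i < n; i++)                      /* drop repeated regions */
        if (count == 0 || regions[count - 1] != regions[i])
            regions[count++] = regions[i];
    return count;
}
```

this takes all the locks up front; a strided datatype that loops back over the window shows up here as offsets that are not monotonically increasing before the sort.
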
problems with implementation:
- nasty strided datatypes which pass over the region and loop back would be
  really slow.
- for those you want to get all the locks ahead of time maybe - how do you
  detect? maybe nonmonotonically increasing -> ``bad datatype''?
- maybe cache incoming data and apply operations on elements in monotonically
  increasing order?
- rather than this, maybe you reorganize both the source and destination
  datatypes so that the elements arrive in monotonically increasing order?
  THIS IS A BETTER BUT SCARIER IDEA.
- this is an interesting problem of applying identical transformations on
  the two separate data types
- worst case this could be done by breaking out a datatype into a huge
  struct; there should be a better/more efficient way.
- caching transformed datatypes?
- this could be done on portions of the datatypes as well to increase
  the granularity of operations with respect to the locking

it sounds like we're going to want to define the mechanism for locking at
window create, or later if possible.

the number of processes within the group on which a window is created should
also be a factor when determining how to implement locking on the window.

another possibility is locking on datatypes instead -- then we need functions
which can look for overlaps between datatypes, which is kinda nasty...

we don't need single writer/mult. reader because writes to elements that are
being read are illegal. likewise with put you don't have to lock, because it is
illegal for someone to write to the same location twice in the same epoch.

illegal operations will be detected by the bit code above.

so really it's only the accumulate that causes locking issues.

proposal:
start at random offsets into the target datatype when processing. i (rob)
think that this is an interesting but problematic idea. it's nondeterministic.
each guy has to get a random number. the idea though is to try to space out
operations on the target to get better lock utilization.

alternatively you could use the rank as a unique number. divide the datatype
by N, where N = number of guys in the window object. each guy uses his rank to
determine which block to start in. there are all sorts of assumptions on the
nature of the datatype here.

it will help with better utilizing locks on startup in some cases.

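the rank-based variant can be sketched like this. the function names are hypothetical, and the sketch assumes the target datatype can be treated as a flat sequence of equally sized elements, which is exactly one of the assumptions mentioned above. each process starts in its own block and wraps around, so all elements are still visited exactly once.

```c
#include <assert.h>
#include <stddef.h>

/* Index of the first element that process `rank` (of `nprocs`) processes:
 * the start of its block when the datatype is divided into nprocs blocks. */
size_t start_element(size_t nelems, int rank, int nprocs)
{
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);
    return (nelems / (size_t)nprocs) * (size_t)rank;
}

/* The k-th element visited (k = 0 .. nelems-1), wrapping past the end. */
size_t visit_order(size_t nelems, int rank, int nprocs, size_t k)
{
    assert(nelems > 0 && k < nelems);
    return (start_element(nelems, rank, nprocs) + k) % nelems;
}
```

unlike the random-offset idea this is deterministic, but it only spreads the origins out at startup; nothing keeps them apart once they proceed at different speeds.
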
you have a shared lock, you're doing a get on one set of things and
an accumulate on another set of things.

back to the example
-------------------
assume they are nonoverlapping datatypes.

mpi_win_lock(shared, rank, assert, win)
mpi_accumulate(B,...,sum)
mpi_win_unlock()

process asks for shared lock (exposure epoch) on remote process

another example, looking at shmem combined with remote access
-------------------------------------------------------------
two pairs of processes on same nodes

shared memory lock on single node allows one process to look directly into the
window information so that he doesn't have to go through the agent of the other
process on the local node. that same lock will be used by the communication
agent for a given process to ensure that things are kept sane in the off-node
case; in other words, this lock is utilized by processes on the same node
directly, but is also used by a single agent in the case of off-node access, in
both cases to coordinate access to the local window information. by ``direct
access'' this may mean the process directly, or via that process's agent.

implication/question:
starting an exposure or access epoch will need support functions within the
methods. this would possibly be used as an alternative to going through the
agent in order to have a fast path for these things.

idea for assertion (valid for shared or exclusive):
- NONOVERLAPPING
  says that i guarantee that none of the operations in the epoch will overlap.
  is this there already? if we have this, we can avoid locking entirely,
  relying on the user to have done the right thing (tm) :).

post-wait/start-complete
------------------------
(target)                    (origin)
post(group, assert, win)    start(group, assert, win)
...                         ...
wait(win)                   complete(win)

complete(win) - only ensures that all operations have been ``locally
completed''; they might not have yet completed at the target.

wait(win) - blocks on all complete()s, and completes all operations at the
target before returning.

start(group, assert, win) - can block until matching post(), but isn't required
to.

this is different from lock/unlock, which ensures that things are completed on
the target.

observation: there is an option passed via info on create that says no_locks;
there will be no passive target stuff. great!

post(group, assert, win) - doesn't block.

note: all the groups here don't have to be identical; processes wanting to
perform operations on multiple windows will have the targets in their window,
while the ones that are targets will have all the sources in their groups...I
have explained this poorly...

if a post() only contains one member, it is equivalent to a win lock exclusive.

you can't have two outstanding post()s, as best we can tell, because you only
wait() on the window. you could have different post()s on different windows...

so this approach is similar to the lock/unlock, only the post() doesn't HAVE
to finish the operations on the target (or wait for them to finish).

todo: we need to look at the overlapping window stuff and learn about the
private/public views...

approaches
----------
for little sets of messages, we can use a single message from the origin to
move all the operations across.

for large messages, we want to pipeline.

david: if we treat them like MPI messages, we could just do sends/receives as
necessary. this keeps new concepts out of the CA code. otherwise we have this
``multiple operations in one message'' concept.

brian: people might find themselves doing the for loop of put()s instead of
using a datatype. so aggregation makes sense in this case.

rob: those people are stupid and they should use datatypes.

anyway, we don't know what the intention is. we should try to help people
if possible by using aggregation.

aggregation approaches:
- keep track of a size, and then cache to that size
- david points out this will make little things slow. he wants to be
  aggressive about performing operations locally in order to keep small
  things fast
- keep a count?
- dynamic optimization? watch patterns on a window and try to cache only
  when it seems like the right thing?

examples:
- for loop with little puts and computation between puts
- vs. for loop with little puts and no computation

in the first case you have the opportunity for making communication progress,
while in the second case you want to wait and aggregate.

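a rough sketch of the ``keep track of a size, and then cache to that size'' approach. everything here is hypothetical: the \texttt{agg\_queue} structure, the function names, and the idea of counting flushes in place of actually sending a message. small puts are held back until the queued bytes reach a threshold, then go out as one aggregated message; the end of the epoch would force a final flush.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t threshold;   /* flush once this many bytes are queued */
    size_t queued;      /* bytes currently held back */
    int    flushes;     /* number of aggregated messages sent so far */
} agg_queue;

void agg_init(agg_queue *q, size_t threshold)
{
    assert(threshold > 0);
    q->threshold = threshold;
    q->queued = 0;
    q->flushes = 0;
}

/* Send everything queued as one message (here the send is only counted). */
void agg_flush(agg_queue *q)
{
    if (q->queued > 0) {
        q->flushes++;
        q->queued = 0;
    }
}

/* Queue a put of nbytes; flush as soon as the threshold is reached. */
void agg_put(agg_queue *q, size_t nbytes)
{
    q->queued += nbytes;
    if (q->queued >= q->threshold)
        agg_flush(q);
}
```

ten 8-byte puts with a 64-byte threshold give one message after the eighth put and leave 16 bytes for the flush at the end of the epoch. this is the behavior wanted for the loop with no computation between puts; for the loop with computation it delays communication that could have made progress, which is david's objection.
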
i (rob) think we're going to want both an aggressive and a caching/combining
mode, because there are situations where one or the other is obviously
optimal.

brian: we need to be able to support both of these modes in our interface in
order to be able to test and learn which of them really works.

david: there's not really much coordination here, so doing optimal ordering is
going to be tough. BUT we do have the group on the target...but we'll never
have as much as we do in the fence case.

the accumulate is odd as usual. if you're the only person in the target's
group, then you don't have to do the element-wise locking/atomicity (or
window-wise locking/unlocking depending on implementation).

brian: it would be helpful for the origin to know if he needs to lock or not
during accumulate operations. this is particularly useful in the shmem
scenario.
- something in the window structure, stored in shared memory, could allow for
  this optimization

otherwise the accumulate is basically the same as in the lock/unlock window
op case (gets/puts, lock requests, etc.).

fence
-----
Q: does win_create() fill in for the first fence, or do you need a first fence?
A: you need the first fence (you have to in order to have created an exposure
epoch).

there are optimizations here that we can get from the BSP people (collecting
and scheduling communication).

assertions:
no store - no local stores in previous epoch
no put - no remote updates to local window in new epoch
no precede - no local rma calls in previous epoch, collective assert only
no succeed - no local rma calls in new epoch, collective assert only

those last two are obviously useful for the first/last fence cases. they are
one way to know you're done with fences. you can also do local load/stores in
those epochs too...it's a way of saying ``i'm only doing local stuff for a
moment''.

Q: how does one cleanly switch between synchronization modes? in particular
how does one switch cleanly OUT of the fence mode? I think we're just not
reading the fence stuff carefully enough.
A: ``fence starts an exposure epoch IF followed by another fence call and the
local window is the target of RMA ops between fence calls''. ``the call starts
an access epoch IF it is followed by another fence call and by RMA
communications calls issued between the two calls''.

HA! that's nasty. So hitting a fence really doesn't tell us as much as we
originally thought it did. We need to figure out what we can do in the context
of these goofy rules.

bad example we think is legal
-----------------------------
(0) (1) (2) (3)
fence fence fence fence
put(1) put(0) start(3) post(2)
put(3) put(2)
complete() wait()
fence fence fence fence

Is this legal? The epochs are not created on 2 and 3, but are created on 0 and
1, as a result of the fences. We know the fences are collective, but is the
creation of the epochs?

even worse example
------------------
(0) (1) (2) (3)
fence fence fence fence
put(1) put(0) start(3) post(2)
put(3) put(2)
complete() wait()
barrier(2,3) barrier(2,3)
put(3) put(2)
fence fence fence fence

What about that one?

Bill: FIX THE TEXT! (meaning we should propose a clarification to the standard)

goals:
1) no mixed-mode stuff between fence and the others...

approach:
0) comb through chapter and see if something is already there.
1) description of problem (our examples, building up w/ 2)
2) we know this wasn't intended
3) propose clarifications

what next?
----------
david: looking at this as a building-block sort of thing as we did with xfer.
is there a way to approach this in the same way?

the obvious blocks would be access and/or exposure epochs.

exposure epochs can be thought of as having a reference count, with the wait()
(or fence i guess) blocking until the refcount hits 0.

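the reference-count view of an exposure epoch can be sketched as below. the structure and function names are hypothetical, and the sketch assumes the count starts at the size of the group passed to post() and drops by one as each origin's complete() is processed at the target. a real wait() would block or poll the progress engine instead of just testing the counter.

```c
#include <assert.h>

typedef struct {
    int pending;   /* origins that have not yet signalled completion */
} exposure_epoch;

/* post(): open the epoch for a group of group_size origins. */
void epoch_post(exposure_epoch *e, int group_size)
{
    assert(group_size >= 0);
    e->pending = group_size;
}

/* Called when one origin's complete() has been processed at the target. */
void epoch_origin_done(exposure_epoch *e)
{
    assert(e->pending > 0);
    e->pending--;
}

/* wait() may return once this is true. */
int epoch_can_finish(const exposure_epoch *e)
{
    return e->pending == 0;
}
```

for fence the same counter would work only if the target knows how many origins touched its window, which is the open question listed under the scenarios in these notes.
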
Q: do we, in our code, need to explicitly define epochs? Is it harder to
follow the rules with or without them?

scenarios:
- fence, how do you know who did ops/created epochs?
- which is the right approach?
- aggressive vs. combined?

we are probably going to punt on detecting errors between overlapping windows
at first. later we could detect overlapping windows at create time and then do
error checking for invalid operations between windows on the destination at the
time the epochs are serviced (or whatever we call that)

brainstorm:

for non-aggregating case, a put creates a special car (including datatype etc.)
which sends a special header across to the target. the target understands how
to receive these cars and will create a matching recv car to receive and store
the data appropriately (after creating the datatype if necessary...this is
still an unsolved problem).

an accumulate can be performed in the same manner, with a recv_mop being
created on the target instead of just a recv.

we can use the counter decrement capability we have created for use with cars
and requests in order to decrement counters in exposure epochs. this will
allow for easy waits on epochs.

in the aggregating case, we would send a special header describing the
aggregated operations across to the target. the target parses this header and
creates an appropriate car string. the local side has already created the rest
of the cars necessary to perform the data transfer as well, relying on
completion dependencies on the local side to get the operations in the right
order.

we can serialize a datatype in a deterministic manner.

datatype caching is done on a demand basis. we've talked about this before.
how does the need for retrieving a datatype fit into this special header scheme
laid out above?

For the heterogeneous case, the datatype definitions need to be expressed in
terms of element offsets, not byte offsets. For example, if an indexed type is
automatically converted and stored in terms of an hindexed type, the definition
sent to a remote process (with different type sizes) will contain incorrect
byte offsets for the remote machine. We need to make sure to store the
original element displacements/offsets in the cases (vector, etc.) where this
is how the datatype is originally defined, even if we use byte offsets locally.

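To illustrate why element displacements must be kept, here is a small sketch
in which the stored definition holds displacements in elements and each machine
derives its own byte displacements from its local element size. The struct
layout and function name are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* hypothetical stored form of an indexed type: displacements in elements */
typedef struct {
    int count;              /* number of blocks */
    const int *blocklens;   /* elements per block */
    const int *elem_disps;  /* displacement of each block, in elements */
} indexed_def_t;

/* derive byte displacements for a machine whose element is elem_size bytes */
void indexed_to_byte_disps(const indexed_def_t *def, size_t elem_size,
                           long *byte_disps)
{
    assert(def != NULL && byte_disps != NULL);
    for (int i = 0; i < def->count; i++)
        byte_disps[i] = (long)def->elem_disps[i] * (long)elem_size;
}
```

A block at element displacement 3 lands at byte 12 on a machine with 4-byte
elements and at byte 24 on a machine with 8-byte elements, so shipping the
origin's byte offsets unchanged would be wrong on the second machine.
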
With reactive caching, we cannot allow the datatypes sent to the target process
to be freed before the target has definitely completed operating with the
datatype. In the lock/unlock and fence cases, the local process implicitly
knows that the target is done with the datatype when the unlock/fence returns.
This leaves us with the start/complete case as the only problem case.

This final case will be handled with a lazy ack based on access epochs in which
the datatype was used. In other words, the reference count on a datatype is
incremented the first time the datatype is used by an RMA operation in an
access epoch and the datatype is "logged" in the access epoch structure. The
access epoch structure also contains a flag stating whether a put or accumulate
operation was requested during this access epoch. After Win_complete() detects
that all get operations have completed, if the flag is not set, it will
decrement the reference counts of the logged datatypes and free the access
epoch structure. If the flag is set, the datatype reference counts may only be
decremented once an explicit acknowledgement has been received from the target
informing the origin that all operations requested by that access
epoch have been completed.

david proposes that rather than a single flag we would instead use a flag on
each datatype. this would allow us to free the datatypes only used for gets
immediately, delaying only for the puts/accs.

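A rough sketch of the per-datatype flag variant, assuming a fixed-size log in
the access epoch structure. All names here (datatype_t, access_epoch_t,
MAX_LOGGED, and the functions) are hypothetical and only show the bookkeeping.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_LOGGED 16   /* arbitrary bound for the sketch */

typedef struct { int refcount; } datatype_t;

typedef struct {
    datatype_t *dt;
    bool needs_ack;     /* set if used by a put or accumulate in this epoch */
} logged_dt_t;

typedef struct {
    logged_dt_t log[MAX_LOGGED];
    int nlogged;
} access_epoch_t;

/* log a datatype on first use in the epoch, taking one reference */
void ae_use_datatype(access_epoch_t *ae, datatype_t *dt, bool is_put_or_acc)
{
    for (int i = 0; i < ae->nlogged; i++) {
        if (ae->log[i].dt == dt) {
            ae->log[i].needs_ack = ae->log[i].needs_ack || is_put_or_acc;
            return;
        }
    }
    assert(ae->nlogged < MAX_LOGGED);
    ae->log[ae->nlogged].dt = dt;
    ae->log[ae->nlogged].needs_ack = is_put_or_acc;
    ae->nlogged++;
    dt->refcount++;
}

/* at complete time, once all gets are done: drop get-only datatypes now */
void ae_release_get_only(access_epoch_t *ae)
{
    int kept = 0;
    for (int i = 0; i < ae->nlogged; i++) {
        if (ae->log[i].needs_ack)
            ae->log[kept++] = ae->log[i];   /* keep until the target acks */
        else
            ae->log[i].dt->refcount--;
    }
    ae->nlogged = kept;
}

/* when the target's acknowledgement arrives: drop the remaining references */
void ae_release_on_ack(access_epoch_t *ae)
{
    for (int i = 0; i < ae->nlogged; i++)
        ae->log[i].dt->refcount--;
    ae->nlogged = 0;
}
```

The single-flag scheme corresponds to treating every logged datatype as
needs_ack whenever any put or accumulate occurred in the epoch.
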
it's reasonable for the origin to force the send of the datatype when it knows
that the target hasn't seen it yet. we should consider this.

we don't have to send basic types. that's important to remember too.

oops! the fence isn't as well-behaved as we thought. fence only implies local
completion of the last epoch. so we're going to have to keep up with things
for fence as well.

oops! the lock/unlock isn't either :). the public copy on the other side is
assured to have been "updated", but that doesn't mean that you are necessarily
done with the datatype.

we could do lazy release consistency a la TreadMarks, and it would give us
performance advantages in some situations, but we aren't going to do that.

we plan to have a single copy of our data. thus our lock/unlock case will be
ok, as there is no public/private copy issue.

the fence will be implemented in a similar manner to a complete/wait on the
previous epochs (access, exposure). we have to ensure that we don't
inadvertently create epochs that are empty.

brian's notes mention using a counter per target which is exchanged at
each fence. this tells the target how many operations need to be completed
before leaving the fence. this works well in the eager case, but is probably
overkill for the aggregated case, where you're going to pass all the operations
over anyway. this can also be done with N reductions; we might be able to work
out an all-to-all reduce that does the right thing for this.

there are a couple of asserts which will be useful for reducing communication
here. and we can do a little extra debugging checking based on these as well.

we know we can leave the fence when we have completed the total number of
operations counted in the reduction operation.

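The per-target counts can be pictured as a matrix whose column sums give each
target the number of operations it must complete. The sketch below simulates
that sum in one address space. NPROCS and both function names are made up for
illustration. In a real run each process would contribute only its own row,
and something like MPI_Reduce_scatter_block with MPI_SUM is one possible way
to get the column sums.

```c
#include <assert.h>

#define NPROCS 3   /* illustrative process count */

/* issued[o][t]: operations origin o issued to target t during the epoch;
 * expected[t] receives the total number of operations target t must complete */
void expected_incoming(int issued[NPROCS][NPROCS], int expected[NPROCS])
{
    for (int t = 0; t < NPROCS; t++) {
        expected[t] = 0;
        for (int o = 0; o < NPROCS; o++)
            expected[t] += issued[o][t];
    }
}

/* a target may leave the fence once it has completed everything it was told
 * to expect; returns nonzero in that case */
int can_leave_fence(int completed, int expected)
{
    assert(completed <= expected);
    return completed == expected;
}
```
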
aside: we could use mprotect() to detect local loads/stores on a local window if
we wanted to for debugging purposes.

scenario: start/complete (sort of)
----------------------------------

we can get a context id from dup'ing the communicator at create time, or we can
get a new context id based on the old communicator. generate_new_context_id()
or something like that. everyone participating in the win create must agree on
the context_id.

start must create an access epoch which can be matched to an exposure epoch on
the other side. we don't think we need to match anything special at start/post
time in order to match epochs, but we aren't sure.

our context ids can be used to match the AE to the appropriate window on the
target.

targets are going to have to track pending AEs until they hit a point where a
post (or whatever) has occurred. there will be situations where multiple AEs
are queued for a single window, and we must handle this as well. all tracking
is associated with a local window.

access epochs are just created on the fly by the target as placeholders for
what is going on. there doesn't have to be anything special about how these
are identified. some time prior to the origin locally completing an AE, an
origin-assigned ID is passed to the target. this ID is returned to the origin
by the target when the AE has been completed on the target side.

puts/gets/accs don't have to have an origin-assigned id or be matched with more
than the context id and origin, under the assumption that there are no
overtaking messages and only one active AE from an origin at one time.

there is no valid case where one origin has more than one outstanding and
active AE for the same target window.

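Under those assumptions, matching an incoming operation on the target reduces
to a lookup by (context id, origin). A minimal sketch follows, in which the
window_t and pending_ae_t layouts and the MAX_ORIGINS bound are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ORIGINS 8   /* arbitrary bound for the sketch */

typedef struct {
    int active;      /* nonzero while an AE from this origin is in progress */
    int origin_id;   /* origin-assigned ID, echoed back on completion */
    int ops_seen;    /* operations received so far for this AE */
} pending_ae_t;

typedef struct {
    int context_id;                 /* agreed on by all at win create time */
    pending_ae_t ae[MAX_ORIGINS];   /* at most one active AE per origin */
} window_t;

/* find the window by context id, then the placeholder AE by origin rank;
 * the placeholder is created on the fly when the first operation arrives */
pending_ae_t *match_op(window_t *wins, int nwins, int context_id, int origin)
{
    assert(origin >= 0 && origin < MAX_ORIGINS);
    for (int i = 0; i < nwins; i++) {
        if (wins[i].context_id == context_id) {
            pending_ae_t *ae = &wins[i].ae[origin];
            ae->active = 1;
            ae->ops_seen++;
            return ae;
        }
    }
    return NULL;   /* no window with that context id */
}
```

The origin_id field is filled in whenever the origin passes its ID, and is
sent back once the AE has completed on the target side.
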
------------------------------------------------------------------------

RMA requirements

- we need the option of aggregating operations within an epoch

Window object

- states

  - local

  - public

- exposure epoch tracking (for operation on the local window)

  - need a queue for ordering exposure epochs and ensuring proper
    shared/exclusive semantics for passive target case

  - each epoch needs a queue for storing incoming operation requests associated
    with the exposure epoch

- access epoch tracking (per local window?)