\section{Error Checking}

It would be useful to detect erroneous uses of the RMA interface by application
codes.  It should be possible to detect all errors involving a single process
at the origin.  Errors involving multiple processes, however, will undoubtedly
need to be detected at the target.  To aid in this detection at the target, one
could either mark bytes in the local window to detect collisions or log each
operation and compare their target buffers for overlaps.  In either case, one
only needs to detect access collisions for the duration of an exposure epoch.
This is complicated slightly by the ability of a shared lock to effectively
join an exposure epoch already in progress.

The logging technique would provide slightly more detail since the exact
processes issuing the illegal operations would be known.  Logging has the
disadvantage of potentially unbounded memory consumption.  Detecting overlaps
between any two target buffers will also be computationally expensive.  For
passive target synchronization, this extra computation will almost certainly
change the relative timing between the processes and thus decrease the chances
of detecting an error.  For active target synchronization, the extra overhead
of logging should not affect the ability to detect an error since the epochs
are well defined.
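
A minimal sketch of the logging approach, assuming each logged operation is
reduced to a single contiguous byte range in the target window.  All names
here (rma_log_entry_t, log_find_conflict, and so on) are hypothetical, and the
conflict rule (only get/get and accumulate/accumulate may share bytes) is an
assumption about what the checker should enforce.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration; not taken from any MPI implementation. */
typedef enum { OP_PUT, OP_GET, OP_ACC } rma_op_t;

typedef struct {
    size_t   offset;  /* first byte touched in the target window */
    size_t   len;     /* number of bytes touched */
    rma_op_t op;
    int      origin;  /* rank that issued the operation */
} rma_log_entry_t;

/* Half-open ranges [offset, offset+len) overlap iff each starts before the
 * other ends. */
static int ranges_overlap(const rma_log_entry_t *a, const rma_log_entry_t *b)
{
    return a->offset < b->offset + b->len && b->offset < a->offset + a->len;
}

/* Assumed rule: get/get and acc/acc may share bytes, any other pair may not. */
static int ops_conflict(rma_op_t a, rma_op_t b)
{
    if (a == OP_GET && b == OP_GET) return 0;
    if (a == OP_ACC && b == OP_ACC) return 0;
    return 1;
}

/* Returns the index of the first logged entry that collides with e, or -1.
 * The linear scan is what makes the log expensive: n entries cost O(n) per
 * new operation and O(n^2) per epoch. */
static int log_find_conflict(const rma_log_entry_t *log, size_t n,
                             const rma_log_entry_t *e)
{
    assert(log != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (ranges_overlap(&log[i], e) && ops_conflict(log[i].op, e->op))
            return (int)i;
    return -1;
}
```

Because the colliding entry carries its origin rank, both offenders can be
reported, which is the extra detail the log buys over byte marking.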

The byte marking technique should provide better performance than logging.
And, while its memory consumption is bounded, that consumption is guaranteed to
be a large fraction of the window.  For each byte in the window, we really need
three bits, one for each of the possible operations (put, get, and accumulate).
It might be useful to keep a list of the processes that accessed the window so
that the potential violators can be reported to the user.

Ok.  We can do this with 2 bits actually, indicating what has occurred already:
00 - nothing
01 - put
10 - get
11 - accumulate

Then you can test to see if future operations break the rules.
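
The 2-bit encoding above can be sketched as a packed map with four window
bytes per map byte.  The function names are hypothetical, and the rule in
mark_allows() is an assumption: a byte may be touched again only by the same
kind of operation, and never twice by a put.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same encoding as the list above: 00 nothing, 01 put, 10 get, 11 accumulate. */
enum { MARK_NONE = 0, MARK_PUT = 1, MARK_GET = 2, MARK_ACC = 3 };

/* map must hold at least (window_bytes + 3) / 4 bytes and be zeroed at the
 * start of each exposure epoch. */
static unsigned mark_read(const uint8_t *map, size_t i)
{
    return (map[i / 4] >> ((i % 4) * 2)) & 3u;
}

static void mark_write(uint8_t *map, size_t i, unsigned m)
{
    unsigned shift = (unsigned)(i % 4) * 2;
    map[i / 4] = (uint8_t)((map[i / 4] & ~(3u << shift)) | (m << shift));
}

/* Assumed rule: repeat access is legal only for get/get and acc/acc. */
static int mark_allows(unsigned prev, unsigned next)
{
    if (prev == MARK_NONE) return 1;
    return prev == next && next != MARK_PUT;
}

/* Marks [offset, offset+len) with op.  Returns 0 on success, or -1 at the
 * first colliding byte (reported through *bad if bad is non-NULL).  Bytes
 * before the collision stay marked. */
static int mark_range(uint8_t *map, size_t offset, size_t len, unsigned op,
                      size_t *bad)
{
    assert(op == MARK_PUT || op == MARK_GET || op == MARK_ACC);
    for (size_t i = offset; i < offset + len; i++) {
        if (!mark_allows(mark_read(map, i), op)) {
            if (bad) *bad = i;
            return -1;
        }
        mark_write(map, i, op);
    }
    return 0;
}
```

The map costs a quarter of the window size, and it records what happened to a
byte but not who did it, so a separate list of accessing processes would still
be needed to name the violators.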

proposal:
- allow people to compile OUT the additional debugging checks at compile time
  with a flag (checks are in by default)
- allow the checks to be turned on and off at runtime via an env. variable or
  whatever

----------

\section{passive target accumulate, exclusive lock}

you have an exclusive lock, you're doing a get on one set of things and
an accumulate on another set of things.

assume they are nonoverlapping datatypes.

mpi_win_lock(exclusive, rank, assert, win)
mpi_get(A,...)
mpi_accumulate(B,...,sum)
mpi_win_unlock()

for cache coherent shared memory (only)
---------------------------------------
one option:
  lock the appropriate window with interprocess_lock()
  do the get
  do the accumulate
  interprocess_unlock()

- if the regions of the window didn't overlap, it might be better to
  lock only the region(s) of interest.
- if the window is local, then this is the option to use

another option:
  don't do anything much on win_lock
  cache get
  cache acc
  interprocess_lock(), do get, do acc, unlock() as a result of win_unlock()

reordering of cached access could maybe be a win...
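
A rough sketch of the second option, assuming an int window and element-wise
sum: get and accumulate only record a request, and everything is applied in
one pass when win_unlock() runs.  The names (deferred_op_t, epoch_defer,
epoch_flush) and the fixed-size queue are hypothetical, and the caller is
assumed to hold the interprocess lock around epoch_flush().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef enum { DEFER_GET, DEFER_ACC_SUM } defer_kind_t;

typedef struct {
    defer_kind_t kind;
    size_t       index;   /* element index in the target window */
    int         *origin;  /* get: result stored here; acc: addend read here */
} deferred_op_t;

#define MAX_DEFERRED 16   /* arbitrary bound for the sketch */

typedef struct {
    deferred_op_t ops[MAX_DEFERRED];
    size_t        count;
} deferred_epoch_t;

/* Called from get/accumulate: just remember the request. */
static int epoch_defer(deferred_epoch_t *ep, defer_kind_t kind, size_t index,
                       int *origin)
{
    if (ep->count == MAX_DEFERRED) return -1;
    ep->ops[ep->count++] = (deferred_op_t){ kind, index, origin };
    return 0;
}

static int by_index(const void *a, const void *b)
{
    const deferred_op_t *x = a, *y = b;
    return (x->index > y->index) - (x->index < y->index);
}

/* Called from win_unlock with the interprocess lock held.  Sorting by window
 * index is the "reordering" idea; it is safe here only because the gets and
 * accumulates are assumed not to overlap. */
static void epoch_flush(deferred_epoch_t *ep, int *window, size_t window_len)
{
    qsort(ep->ops, ep->count, sizeof ep->ops[0], by_index);
    for (size_t i = 0; i < ep->count; i++) {
        deferred_op_t *op = &ep->ops[i];
        assert(op->index < window_len);
        if (op->kind == DEFER_GET)
            *op->origin = window[op->index];
        else
            window[op->index] += *op->origin;
    }
    ep->count = 0;
}
```

The lock is then held only for the duration of the flush rather than for the
whole epoch, at the cost of buffering the requests.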

for remotely accessible memory (only)
-------------------------------------
there will be a lock on the remote system
there will be some sort of agent

one option:
  agent lock request
  do the get, either directly or through the agent
  do the accumulate, again through the agent or directly
  agent unlock request

option two:
   agent lock/start request, including some inefficient stuff
   do direct accesses directly that are efficient
   agent complete/unlock request

option three:
   single message that defines the entire access epoch, start to finish.

these cover the majority of the issues

side note:
  if we have a numa system in some cases it might be more efficient to
  have the process local to the memory region pack data for a get into a
  contiguous region, then the remote process can grab that instead of
  some set of discontiguous elements.  the same process could be used
  for puts or accumulates.

if we want to pipeline things, then we need to have something between
options two and three.  we want to be able to get overlap of computation
(of buffer packing) and communication of rdma operations.

we can use the win info structure to help tune when we try to pipeline, when
we wait to pack at end, and so on.

side note on lapi:
  things like lapi have atomic counters which we might be able to use
  to avoid explicit unlock calls.  set to one when i'm done

  we might be able to use these same counters to perform locks, but that would
  be a nasty polling problem i think.

  likewise we can use the same lapi stuff to have agents set values local to
  the process performing operations in order to let it know that a set of
  operations that make up an epoch (which they have previously described to the
  agent) have been completed (and thus the process's unlock can complete)

note:
  there is more optimization that can occur here than what we have discussed so
  far; in particular nonoverlapping exclusive locks don't HAVE to be serialized
  like we have implied we would here.  we should think more about how we can
  allow these things to continue simultaneously.

----------

\section{passive target accumulate, shared lock}

you have a shared lock, you're doing a get on one set of things and
an accumulate on another set of things.

assume they are nonoverlapping datatypes.

mpi_win_lock(shared, rank, assert, win)
mpi_accumulate(B,...,sum)
mpi_win_unlock()

remember that you can call accumulate over and over, even to the same data
element.  and others can be calling accumulate to that element as well.

atomicity is maintained on the per-data-element basis only in the shared case.

brian's implementation idea:
- bust a window into a number of contiguous regions
- allow locks on each one of these regions separately
- in some cases there will be atomic operations supported by the processor, and
  in those cases we might be able to avoid the lock.
- always acquire locks in sequential order
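
The region idea can be sketched as follows, assuming fixed-size regions and a
plain int per region standing in for a real interprocess lock.  REGION_BYTES,
region_locks_t, and the function names are hypothetical.  A real lock would
block where this sketch gives up and returns -1.

```c
#include <assert.h>
#include <stddef.h>

#define REGION_BYTES 64   /* assumed region size */
#define MAX_REGIONS  32   /* arbitrary bound for the sketch */

typedef struct {
    int    held[MAX_REGIONS];  /* stand-in for a per-region lock */
    size_t nregions;
} region_locks_t;

/* Inclusive range of regions touched by the bytes [offset, offset+len). */
static void regions_for(size_t offset, size_t len, size_t *first, size_t *last)
{
    assert(len > 0);
    *first = offset / REGION_BYTES;
    *last  = (offset + len - 1) / REGION_BYTES;
}

/* Acquire in ascending region order.  Since every origin uses the same
 * order, two origins can never each hold a region the other is waiting on.
 * Returns 0 on success; on failure nothing stays held and -1 is returned. */
static int regions_acquire(region_locks_t *rl, size_t offset, size_t len)
{
    size_t first, last;
    regions_for(offset, len, &first, &last);
    if (last >= rl->nregions) return -1;
    for (size_t r = first; r <= last; r++) {
        if (rl->held[r]) {                        /* a real lock would block */
            while (r-- > first) rl->held[r] = 0;  /* undo what we took */
            return -1;
        }
        rl->held[r] = 1;
    }
    return 0;
}

static void regions_release(region_locks_t *rl, size_t offset, size_t len)
{
    size_t first, last;
    regions_for(offset, len, &first, &last);
    for (size_t r = first; r <= last && r < rl->nregions; r++)
        rl->held[r] = 0;
}
```

A contiguous access maps to one ascending run of regions, which is the easy
case.  A strided datatype that loops back over the window does not, so it
cannot use this acquire path as written.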

problems with implementation:
- nasty strided datatypes which pass over the region and loop back would be
  really slow.
  - for those you want to get all the locks ahead of time maybe - how do you
    detect?  maybe nonmonotonically increasing -> ``bad datatype''?
  - maybe cache incoming data and apply operations on elements in monotonically
    increasing order?
  - rather than this, maybe you reorganize both the source and destination
    datatypes so that the elements arrive in monotonically increasing order?
    THIS IS A BETTER BUT SCARIER IDEA.
    - this is an interesting problem of applying identical transformations on
      the two separate data types
    - worst case this could be done by breaking out a datatype into a huge
      struct; there should be a better/more efficient way.
    - caching transformed datatypes?
    - this could be done on portions of the datatypes as well to increase
      the granularity of operations with respect to the locking

it sounds like we're going to want to define the mechanism for locking at
window create, or later if possible.

the number of processes within the group on which a window is created should
also be a factor when determining how to implement locking on the window.

another possibility is locking on datatypes instead -- then we need functions
which can look for overlaps between datatypes, which is kinda nasty...

we don't need single writer/mult. reader because writes to elements that are
being read are illegal.  likewise with put you don't have to lock, because it
is illegal for someone to write to the same location twice in the same epoch.

illegal operations will be detected by the bit code above.

so really it's only the accumulate that causes locking issues.

proposal:
start at random offsets into the target datatype when processing.  i (rob)
think that this is an interesting but problematic idea.  it's nondeterministic.
each guy has to get a random number.  the idea though is to try to space out
operations on the target to get better lock utilization.

alternatively you could use the rank as a unique number.  divide the datatype
by N, where N = # of guys in the window object.  each guy uses his rank to
determine which block to start in.  there are all sorts of assumptions on the
nature of the datatype here.

it will help with better utilizing locks on startup in some cases.
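
The rank-based variant comes down to a little arithmetic.  This sketch assumes
the target datatype has already been cut into nblocks equal blocks, which is
one of the "assumptions on the nature of the datatype" mentioned above.  The
function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Block in which `rank` starts when nblocks blocks are spread over nprocs
 * processes. */
static size_t start_block(int rank, int nprocs, size_t nblocks)
{
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);
    return ((size_t)rank * nblocks) / (size_t)nprocs;
}

/* The k-th block visited by `rank`: begin at its own start block and wrap
 * around the end of the datatype. */
static size_t nth_block(int rank, int nprocs, size_t nblocks, size_t k)
{
    assert(nblocks > 0 && k < nblocks);
    return (start_block(rank, nprocs, nblocks) + k) % nblocks;
}
```

With 4 processes and 8 blocks, the ranks start at blocks 0, 2, 4, and 6, so
they initially contend for different locks.  The wrap-around means the visit
order is no longer monotonically increasing, so this would have to be
reconciled with the "always acquire locks in sequential order" rule.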

you have a shared lock, you're doing a get on one set of things and
an accumulate on another set of things.

back to the example
-------------------
assume they are nonoverlapping datatypes.

mpi_win_lock(shared, rank, assert, win)
mpi_accumulate(B,...,sum)
mpi_win_unlock()

process asks for shared lock (exposure epoch) on remote process

another example, looking at shmem combined with remote access
-------------------------------------------------------------
two pairs of processes on same nodes

shared memory lock on single node allows one process to look directly into the
window information so that he doesn't have to go through the agent of the other
process on the local node.  that same lock will be used by the communication
agent for a given process to ensure that things are kept sane in the off-node
case; in other words, this lock is utilized by processes on the same node
directly, but is also used by a single agent in the case of off-node access, in
both cases to coordinate access to the local window information.  by ``direct
access'' this may mean the process directly, or via that process's agent.

implication/question:
starting an exposure or access epoch will need support functions within the
methods.  this would possibly be used as an alternative to going through the
agent in order to have a fast path for these things.


idea for assertion (valid for shared or exclusive):
- NONOVERLAPPING
  says that i guarantee that none of the operations in the epoch will overlap.
  is this there already?  if we have this, we can avoid locking entirely,
  relying on the user to have done the right thing (tm) :).

post-wait/start-complete
------------------------
(target)                            (origin)
post(group, assert, win)            start(group, assert, win)
...                                 ...
wait(win)                           complete(win)

complete(win) - only ensures that all operations have been ``locally
completed''; they might not have yet completed at the target.

wait(win) - blocks on all complete()s, and completes all operations at the
target before returning.

start(group, assert, win) - can block until matching post(), but isn't required
to.

this is different from lock/unlock, which ensures that things are completed on
the target.

observation: there is an option passed via info on create that says no_locks;
there will be no passive target stuff.  great!

post(group, assert, win) - doesn't block.

note: all the groups here don't have to be identical; processes wanting to
perform operations on multiple windows will have the targets in their window,
while the ones that are targets will have all the sources in their groups...I
have explained this poorly...

if a post() only contains one member, it is equivalent to a win lock exclusive.

you can't have two outstanding post()s, as best we can tell, because you only
wait() on the window.  you could have different post()s on different windows...

so this approach is similar to the lock/unlock, only the post() doesn't HAVE
to finish the operations on the target (or wait for them to finish).

todo: we need to look at the overlapping window stuff and learn about the
private/public views...

approaches
----------
for little sets of messages, we can use a single message from the origin to
move all the operations across.

for large messages, we want to pipeline.

david: if we treat them like MPI messages, we could just do sends/receives as
necessary.  this keeps new concepts out of the CA code. otherwise we have this
``multiple operations in one message'' concept.

brian: people might find themselves doing the for loop of put()s instead of
using a datatype.  so aggregation makes sense in this case.

rob: those people are stupid and they should use datatypes.

anyway, we don't know what the intention is.  we should try to help people
if possible by using aggregation.

aggregation approaches:
- keep track of a size, and then cache to that size
  - david points out this will make little things slow.  he wants to be
    aggressive about performing operations locally in order to keep small
    things fast
- keep a count?
- dynamic optimization?  watch patterns on a window and try to cache only
  when it seems like the right thing?

examples:
- for loop with little puts and computation between puts
- vs. for loop with little puts and no computation

in the first case you have the opportunity for making communication progress,
while in the second case you want to wait and aggregate.

i (rob) think we're going to want both an aggressive and a caching/combining
mode, because there are situations where one or the other is obviously
optimal.

brian: we need to be able to support this ^^^ in our interface in order to be
able to test and learn which of these modes really works.

david: there's not really much coordination here, so doing optimal ordering is
going to be tough.  BUT we do have the group on the target...but we'll never
have as much as we do in the fence case.

the accumulate is odd as usual.  if you're the only person in the target's
group, then you don't have to do the element-wise locking/atomicity (or
window-wise locking/unlocking depending on implementation).

brian: it would be helpful for the origin to know whether he needs to lock
during accumulate operations.  this is particularly useful in the shmem
scenario.
- something in the window structure, stored in shared memory, could allow for
  this optimization

otherwise the accumulate is basically the same as in the lock/unlock window
op case (gets/puts, lock requests, etc.).

fence
-----
Q: does win_create() fill in for the first fence, or do you need a first fence?
A: you need the first fence (you have to in order to have created an exposure
   epoch).

there are optimizations here that we can get from the BSP people (collecting
and scheduling communication).

assertions:
no store - no local stores in previous epoch
no put - no remote updates to local window in new epoch
no precede - no local rma calls in previous epoch, collective assert only
no succeed - no local rma calls in new epoch, collective assert only

those last two are obviously useful for the first/last fence cases.  they are
one way to know you're done with fences.  you can also do local load/stores in
those epochs too...it's a way of saying ``i'm only doing local stuff for a
moment''.

Q: how does one cleanly switch between synchronization modes?  in particular
how does one switch cleanly OUT of the fence mode?  I think we're just not
reading the fence stuff carefully enough.
A: ``fence starts an exposure epoch IF followed by another fence call and the
local window is the target of RMA ops between fence calls''.  ``the call starts
an access epoch IF it is followed by another fence call and by RMA
communications calls issued between the two calls''.

HA!  that's nasty.  So hitting a fence really doesn't tell us as much as we
originally thought it did.  We need to figure out what we can do in the context
of these goofy rules.

bad example we think is legal
-----------------------------
(0)               (1)                 (2)                (3)
fence             fence               fence              fence
put(1)            put(0)              start(3)           post(2)
                                      put(3)             put(2)
                                      complete()         wait()
fence             fence               fence              fence

Is this legal?  The epochs are not created on 2 and 3, but are on 0 and 1, as
a result of the fences.  We know the fences are collective, but is the creation
of the epochs?

even worse example
------------------
(0)               (1)                 (2)                (3)
fence             fence               fence              fence
put(1)            put(0)              start(3)           post(2)
                                      put(3)             put(2)
                                      complete()         wait()
                                      barrier(2,3)       barrier(2,3)
                                      put(3)             put(2)
fence             fence               fence              fence

What about that one?

Bill: FIX THE TEXT! (meaning we should propose a clarification to the standard)

goals:
1) no mixed-mode stuff between fence and the others...

approach:
0) comb through chapter and see if something is already there.
1) description of problem (our examples, building up w/ 2)
2) we know this wasn't intended
3) propose clarifications


what next?
----------
david: looking at this as a building-block sort of thing as we did with xfer.
is there a way to approach this in the same way?

the obvious blocks would be access and/or exposure epochs.

exposure epochs can be thought of as having a reference count, with the wait()
(or fence i guess) blocking until the refcount hits 0.
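
The reference-count view of an exposure epoch can be sketched like this,
assuming the count starts at the size of the post() group and drops by one
per origin complete().  The type and function names are hypothetical, and the
sketch is single-threaded: a real implementation would need an atomic counter
or a lock around it.

```c
#include <assert.h>

typedef struct {
    int pending;  /* origins in the group that have not completed yet */
} exposure_epoch_t;

/* post(group, ...): one reference per origin in the group. */
static void exposure_post(exposure_epoch_t *ep, int group_size)
{
    assert(group_size >= 0);
    ep->pending = group_size;
}

/* Called at the target when an origin's complete() has been processed.
 * Returns the number of origins still outstanding. */
static int exposure_origin_done(exposure_epoch_t *ep)
{
    assert(ep->pending > 0);
    return --ep->pending;
}

/* wait() would keep making communication progress until this is nonzero. */
static int exposure_test(const exposure_epoch_t *ep)
{
    return ep->pending == 0;
}
```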

Q: do we, in our code, need to explicitly define epochs?  Is it harder to
follow the rules with or without them?

scenarios:
- fence, how do you know who did ops/created epochs?
- which is the right approach?
- aggressive vs. combined?


we are probably going to punt on detecting errors between overlapping windows
at first.  later we could detect overlapping windows at create time and then do
error checking for invalid operations between windows on the destination at the
time the epochs are serviced (or whatever we call that)


brainstorm:

for non-aggregating case, a put creates a special car (including datatype etc.)
which sends a special header across to the target.  the target understands how
to receive these cars and will create a matching recv car to receive and store
the data appropriately (after creating the datatype if necessary...this is
still an unsolved problem).

an accumulate can be performed in the same manner, with a recv_mop being
created on the target instead of just a recv.

we can use the counter decrement capability we have created for use with cars
and requests in order to decrement counters in exposure epochs.  this will
allow for easy waits on epochs.

in the aggregating case, we would send a special header describing the
aggregated operations across to the target.  the target parses this header and
creates an appropriate car string.  the local side has already created the rest
of the cars necessary to perform the data transfer as well, relying on
completion dependencies on the local side to get the operations in the right
order.

we can serialize a datatype in a deterministic manner.

datatype caching is done on a demand basis.  we've talked about this before.
how does the need for retrieving a datatype fit into this special header scheme
laid out above?

For the heterogeneous case, the datatype definitions need to be expressed in
terms of element offsets, not byte offsets.  For example, if an indexed type is
automatically converted and stored in terms of an hindexed type, the definition
sent to a remote process (with different type sizes) will contain incorrect
byte offsets for the remote machine.  We need to make sure to store the
original element displacement/offsets in the vector etc. cases where this is
how the datatype is originally defined, even if we use byte offsets locally.
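
A small illustration of the problem, assuming an indexed-style type whose
displacements are counted in elements.  The helper name is hypothetical.  The
same element displacements produce different byte displacements once the
element size differs between origin and target.

```c
#include <assert.h>
#include <stddef.h>

/* Converts element displacements (indexed style) into byte displacements
 * (hindexed style) for one machine's element size. */
static void elem_to_byte_displs(const int *elem_displs, size_t n,
                                size_t elem_size, long *byte_displs)
{
    assert(elem_size > 0);
    for (size_t i = 0; i < n; i++)
        byte_displs[i] = (long)elem_displs[i] * (long)elem_size;
}
```

For element displacements {0, 3, 10}, a machine with 4-byte elements gets byte
displacements {0, 12, 40} while one with 8-byte elements gets {0, 24, 80}.
Shipping the origin's byte displacements would therefore address the wrong
elements on the target, which is why the element form has to be the one that
is stored and sent.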

With reactive caching, we cannot allow the datatypes sent to the target process
to be freed before the target has definitely completed operating with the
datatype.  In the lock/unlock and fence cases, the local process implicitly
knows that the target is done with the datatype when the unlock/fence returns.
This leaves us with the start/complete case as the only problem case.

This final case will be handled with a lazy ack based on access epochs in which
the datatype was used.  In other words, the reference count on a datatype is
incremented the first time the datatype is used by an RMA operation in an
access epoch and the datatype is "logged" in the access epoch structure.  The
access epoch structure also contains a flag stating whether a put or accumulate
operation was requested during this access epoch.  After Win_complete() detects
that all get operations have completed, if the flag is not set, it will
decrement the reference counts of the logged datatypes and free the access
epoch structure.  If the flag is set, the datatype reference counts may only be
decremented once an explicit acknowledgement has been received from the target
informing the origin that all operations requested by that access
epoch have been completed.
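
The lazy-ack bookkeeping described above might look roughly like this.  It is
a sketch under stated assumptions: dtype_t is a stand-in for a datatype object
that carries only a reference count, the log has an arbitrary fixed size, and
all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int refcount; } dtype_t;  /* stand-in for a datatype */

#define MAX_LOGGED 8   /* arbitrary bound for the sketch */

typedef struct {
    dtype_t *logged[MAX_LOGGED];  /* datatypes used in this access epoch */
    size_t   nlogged;
    int      put_or_acc_seen;     /* the flag described in the notes */
    int      released;
} dt_epoch_t;

/* Record that an RMA operation in this epoch uses dt.  The reference count
 * is bumped only the first time dt appears in the epoch. */
static int dt_epoch_use(dt_epoch_t *ep, dtype_t *dt, int is_put_or_acc)
{
    if (is_put_or_acc) ep->put_or_acc_seen = 1;
    for (size_t i = 0; i < ep->nlogged; i++)
        if (ep->logged[i] == dt) return 0;
    if (ep->nlogged == MAX_LOGGED) return -1;
    dt->refcount++;
    ep->logged[ep->nlogged++] = dt;
    return 0;
}

static void dt_epoch_release(dt_epoch_t *ep)
{
    assert(!ep->released);
    for (size_t i = 0; i < ep->nlogged; i++)
        ep->logged[i]->refcount--;
    ep->nlogged = 0;
    ep->released = 1;
}

/* Win_complete(), after all gets have completed.  Returns 1 if the datatypes
 * were released now, 0 if release must wait for the target's ack. */
static int dt_epoch_complete(dt_epoch_t *ep)
{
    if (ep->put_or_acc_seen) return 0;
    dt_epoch_release(ep);
    return 1;
}

/* Explicit acknowledgement from the target has arrived. */
static void dt_epoch_ack(dt_epoch_t *ep)
{
    dt_epoch_release(ep);
}
```

A get-only epoch drops its references inside dt_epoch_complete(), while an
epoch with a put or accumulate keeps them until dt_epoch_ack() runs.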

david proposes that rather than a single flag we would instead use a flag on
each datatype.  this would allow us to free the datatypes only used for gets
immediately, delaying only for the puts/accs.

it's reasonable for the origin to force the send of the datatype when he knows
that the target hasn't seen it yet.  we should consider this.

we don't have to send basic types.  that's important to remember too.

oops!  the fence isn't as well-behaved as we thought.  fence only implies local
completion of the last epoch.  so we're going to have to keep up with things
for fence as well.

oops!  the lock/unlock isn't either :).  the public copy on the other side is
assured to have been "updated", but that doesn't mean that you are necessarily
done with the datatype.
Packit Service c5cf8c
Packit Service c5cf8c
we could do lazy release consistency a la treadmarks, and it would give us
Packit Service c5cf8c
performance advantages in some situations, but we aren't going to do that.
Packit Service c5cf8c
Packit Service c5cf8c
we plan to have a single copy of our data.  thus our lock/unlock case will be
Packit Service c5cf8c
ok, as there is no public/private copy issue.
Packit Service c5cf8c
Packit Service c5cf8c
the fence will be implemented in a similar manner to a complete/wait on the
Packit Service c5cf8c
previous epochs (access, exposure).  we have to ensure that we don't
Packit Service c5cf8c
inadvertently create epochs that are empty.

Brian's notes mention using a counter per target, which is exchanged at each
fence.  this tells the target how many operations need to be completed before
leaving the fence.  this works well in the eager case, but is probably overkill
for the aggregated case, where you're going to pass all the operations over
anyway.  this can also be done with N reductions; we might be able to work out
an all-to-all reduce that does the right thing for this.

there are a couple of asserts which will be useful for reducing communication
here.  and we can do a little extra debugging checking based on these as well.

we know we can leave the fence when we have completed the total number of
operations counted in the reduction operation.
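
to make the counting concrete, here is a small single-process simulation of
the per-target counters.  it is a sketch under assumptions: NPROCS, the
issued[][] matrix, and the function names are invented for illustration, and
the loop stands in for whatever collective would actually be used (a sum
reduce-scatter such as MPI_Reduce_scatter has this shape, but nothing here
calls MPI).

```c
#include <assert.h>
#include <stdbool.h>

#define NPROCS 3   /* hypothetical job size for the sketch */

/* issued[o][t] is the number of RMA operations origin o sent to target t
 * during the epoch being closed.  Each origin owns one row.  Summing a column
 * gives the number of operations target t must complete before it may leave
 * the fence. */
void fence_expected_counts(int issued[NPROCS][NPROCS], int expected[NPROCS])
{
    int o, t;

    for (t = 0; t < NPROCS; t++) {
        expected[t] = 0;
        for (o = 0; o < NPROCS; o++)
            expected[t] += issued[o][t];
    }
}

/* A target may leave the fence once it has completed as many incoming
 * operations as the reduction says were sent to it. */
bool fence_may_leave(int completed, int expected)
{
    assert(completed <= expected);
    return completed == expected;
}
```

each process contributes only its own row and needs only its own column sum,
which is why a single reduce-and-scatter style collective would be enough in
place of N separate reductions.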

aside: we could use mprotect() to detect local load/stores on a local window if
we wanted to for debugging purposes.
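
a minimal POSIX sketch of that idea, for debugging only.  assumptions: the
window memory is page-aligned (obtained from mmap() here), and the helper names
(window_alloc, window_guard, local_store_traps) are hypothetical.  jumping out
of a SIGSEGV handler with siglongjmp() is a common debugging trick, not
something to rely on in production code.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS under strict -std modes */
#endif
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf fault_env;

/* Fault handler: jump back to the access site instead of crashing. */
static void on_fault(int sig)
{
    (void)sig;
    siglongjmp(fault_env, 1);
}

/* Allocate a page-aligned "window" of len bytes; returns NULL on failure. */
void *window_alloc(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Forbid (on != 0) or allow (on == 0) local loads and stores to the window.
 * Returns 0 on success, like mprotect(). */
int window_guard(void *win, size_t len, int on)
{
    return mprotect(win, len, on ? PROT_NONE : (PROT_READ | PROT_WRITE));
}

/* Try a local store to addr.  Returns 1 if the store trapped, 0 if it went
 * through.  The previous signal handlers are restored before returning. */
int local_store_traps(volatile char *addr)
{
    struct sigaction sa, old_segv, old_bus;
    volatile int trapped = 0;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_fault;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old_segv);
    sigaction(SIGBUS, &sa, &old_bus);  /* some systems raise SIGBUS instead */

    if (sigsetjmp(fault_env, 1) == 0)
        *addr = 1;                     /* faults if the window is guarded */
    else
        trapped = 1;                   /* reached via siglongjmp() */

    sigaction(SIGSEGV, &old_segv, NULL);
    sigaction(SIGBUS, &old_bus, NULL);
    return trapped;
}
```

a debugging build could guard the window while an exposure epoch is open and
report the faulting access instead of just jumping back.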

scenario: start/complete (sort of)
----------------------------------

we can get a context id from dup'ing the communicator at create time, or we can
get a new context id based on the old communicator: generate_new_context_id()
or something like that.  everyone participating in the win create must agree on
the context_id.

start must create an access epoch which can be matched to an exposure epoch on
the other side.  we don't think we need to match anything special at start/post
time in order to match epochs, but we aren't sure.

our context ids can be used to match the AE (access epoch) to the appropriate
window on the target.

targets are going to have to track pending AEs until they hit a point where a
post (or whatever) has occurred.  there will be situations where multiple AEs
are queued for a single window, and we must handle this as well.  all tracking
is associated with a local window.

access epochs are just created on the fly by the target as placeholders for
what is going on.  there doesn't have to be anything special about how these
are identified.  some time prior to the origin locally completing an AE, an
origin-assigned ID is passed to the target.  this ID is returned to the origin
by the target when the AE has been completed on the target side.

puts/gets/accs don't have to have an origin-assigned id or be matched with more
than the context id and origin, under the assumption that there are no
overtaking messages and only one active AE from an origin at one time.

there is no valid case where one origin has more than one outstanding and
active AE for the same target window.
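
the target-side matching described above could look roughly like this.  it is
a sketch under the stated assumptions (no overtaking messages, at most one
active AE per origin and window); the table layout and all names
(target_ae_t, ae_table_t, ae_lookup, and so on) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_AES 16   /* arbitrary cap chosen for this sketch */

/* Placeholder the target creates on the fly for an origin's access epoch. */
typedef struct {
    int      active;        /* slot in use */
    int      context_id;    /* identifies the window */
    int      origin;        /* origin rank */
    int      have_id;       /* has the origin-assigned ID arrived yet? */
    unsigned origin_ae_id;  /* valid only when have_id is set */
    int      ops_applied;   /* puts/gets/accs seen so far */
} target_ae_t;

typedef struct { target_ae_t slots[MAX_AES]; } ae_table_t;

void ae_table_init(ae_table_t *tab)
{
    memset(tab, 0, sizeof *tab);
}

/* Find the active AE for (context_id, origin), creating one if none exists.
 * Nothing more is needed to match, given one active AE per origin. */
target_ae_t *ae_lookup(ae_table_t *tab, int context_id, int origin)
{
    target_ae_t *free_slot = NULL;
    size_t i;

    for (i = 0; i < MAX_AES; i++) {
        target_ae_t *ae = &tab->slots[i];
        if (ae->active && ae->context_id == context_id && ae->origin == origin)
            return ae;
        if (!ae->active && free_slot == NULL)
            free_slot = ae;
    }
    assert(free_slot != NULL);
    memset(free_slot, 0, sizeof *free_slot);
    free_slot->active = 1;
    free_slot->context_id = context_id;
    free_slot->origin = origin;
    return free_slot;
}

/* An incoming put/get/acc carries only the context id and the origin. */
void ae_apply_op(ae_table_t *tab, int context_id, int origin)
{
    ae_lookup(tab, context_id, origin)->ops_applied++;
}

/* The origin passes its ID some time before it completes the AE locally. */
void ae_set_origin_id(ae_table_t *tab, int context_id, int origin, unsigned id)
{
    target_ae_t *ae = ae_lookup(tab, context_id, origin);

    ae->origin_ae_id = id;
    ae->have_id = 1;
}

/* The AE is finished on the target: retire the placeholder and return the
 * origin-assigned ID so it can be sent back in the acknowledgement. */
unsigned ae_complete(ae_table_t *tab, int context_id, int origin)
{
    target_ae_t *ae = ae_lookup(tab, context_id, origin);
    unsigned id;

    assert(ae->have_id);
    id = ae->origin_ae_id;
    ae->active = 0;
    return id;
}
```

a linear table keeps the sketch short; the notes call for per-window queues,
which would replace the slot array.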

------------------------------------------------------------------------

RMA requirements

- we need the option of aggregating operations within an epoch

Window object

- states

  - local

  - public

- exposure epoch tracking (for operation on the local window)

  - need a queue for ordering exposure epochs and ensuring proper
    shared/exclusive semantics for passive target case

  - each epoch needs a queue for storing incoming operation requests associated
    with the exposure epoch

- access epoch tracking (per local window?)