\section{Error Checking}

It would be useful to detect erroneous uses of the RMA interface by application codes. It should be possible to detect all errors involving a single process at the origin. Errors involving multiple processes, however, will undoubtedly need to be detected at the target. To aid in this detection at the target, one could either mark bytes in the local window to detect collisions or log each operation and compare their target buffers for overlaps. In either case, one only needs to detect access collisions for the duration of an exposure epoch. This is complicated slightly by the fact that a shared lock can effectively join an exposure epoch already in progress.

The logging technique would provide slightly more detail, since the exact processes issuing the illegal operations would be known. Logging has the disadvantage of potentially unbounded memory consumption. Detecting overlaps in any two target buffers will also be computationally expensive. For passive target synchronization, this extra computation will almost certainly change the relative timing between the processes and thus decrease the chances of detecting an error. For active target synchronization, the extra overhead of logging should not affect the ability to detect an error, since the epochs are well defined.

The byte marking technique should provide better performance than logging. And, while its memory consumption is bounded, that consumption is guaranteed to be a large fraction of the window size. For each byte in the window, we really need three bits, one for each of the possible operations (put, get, and accumulate). It might be useful to keep a list of the processes that accessed the window so that the potential violators can be reported to the user.

Ok, we can actually do this with 2 bits, indicating what has occurred already:

  00 - nothing
  01 - put
  10 - get
  11 - accumulate

Then you can test whether future operations break the rules.

proposal:
- allow people to compile OUT the additional debugging stuff at compile time with a flag (in by default)
- allow it to be turned on and off at runtime via an env. variable or whatever

----------

\section{passive target accumulate, exclusive lock}

you have an exclusive lock, you're doing a get on one set of things and an accumulate on another set of things. assume they are nonoverlapping datatypes.

  mpi_win_lock(exclusive, rank, assert, win)
  mpi_get(A,...)
  mpi_accumulate(B,...,sum)
  mpi_win_unlock()

for cache coherent shared memory (only)
---------------------------------------
one option (sketched below):
  lock the appropriate window with interprocess_lock()
  do the get
  do the accumulate
  interprocess_unlock()

- if the regions of the window don't overlap, it might be better to lock only the region(s) of interest.
- if the window is local, then this is the option to use.

another option:
  don't do anything much on win_lock
  cache the get
  cache the accumulate
  as a result of win_unlock(): interprocess_lock(), do get, do acc, unlock()

reordering of the cached accesses could maybe be a win...
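to make the first option above concrete, here is a minimal sketch for the cache coherent shared memory case. the names (win_t, interprocess_lock(), interprocess_unlock()) and the fixed-size arrays are made up for illustration; it assumes the target window is mapped into the origin's address space and that the accumulate op is a sum on doubles.

\begin{verbatim}
#include <string.h>   /* memcpy */
#include <stddef.h>   /* size_t */

/* win_t, interprocess_lock(), and interprocess_unlock() are hypothetical */
typedef struct {
    double *shm_base[64];   /* each target's window, mapped locally */
    void   *locks[64];      /* one interprocess lock per window */
} win_t;

void interprocess_lock(void *lock);
void interprocess_unlock(void *lock);

/* exclusive-lock get + accumulate against a window in cache coherent
 * shared memory: take the window lock, do both operations directly, unlock */
static void shm_exclusive_get_acc(win_t *win, int target,
                                  double *get_buf, size_t get_off, size_t get_cnt,
                                  const double *acc_buf, size_t acc_off, size_t acc_cnt)
{
    double *base = win->shm_base[target];
    size_t i;

    interprocess_lock(win->locks[target]);

    /* the get: copy out of the target's window */
    memcpy(get_buf, base + get_off, get_cnt * sizeof(double));

    /* the accumulate (sum): apply element-wise into the window */
    for (i = 0; i < acc_cnt; i++)
        base[acc_off + i] += acc_buf[i];

    interprocess_unlock(win->locks[target]);
}
\end{verbatim}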
for remotely accessible memory (only)
-------------------------------------
there will be a lock on the remote system
there will be some sort of agent

one option:
  agent lock request
  do the get, either directly or through the agent
  do the accumulate, again through the agent or directly
  agent unlock request

option two:
  agent lock/start request, including some inefficient stuff
  do the accesses that are efficient directly
  agent complete/unlock request

option three:
  single message that defines the entire access epoch, start to finish

these cover the majority of the issues.

side note: if we have a numa system, in some cases it might be more efficient to have the process local to the memory region pack data for a get into a contiguous region; then the remote process can grab that instead of some set of discontiguous elements. the same approach could be used for puts or accumulates.

if we want to pipeline things, then we need something between options two and three. we want to be able to overlap computation (of buffer packing) with the communication of rdma operations. we can use the win info structure to help tune when we try to pipeline, when we wait and pack at the end, and so on.

side note on lapi: things like lapi have atomic counters which we might be able to use to avoid explicit unlock calls (set to one when i'm done). we might be able to use these same counters to perform locks, but that would be a nasty polling problem i think. likewise we can use the same lapi stuff to have agents set values local to the process performing operations in order to let it know that a set of operations that make up an epoch (which it has previously described to the agent) have been completed (and thus the process's unlock can complete).

note: there is more optimization that can occur here than what we have discussed so far; in particular, nonoverlapping exclusive locks don't HAVE to be serialized like we have implied here. we should think more about how we can allow these things to continue simultaneously.

----------

\section{passive target accumulate, shared lock}

you have a shared lock, you're doing a get on one set of things and an accumulate on another set of things. assume they are nonoverlapping datatypes.

  mpi_win_lock(shared, rank, assert, win)
  mpi_accumulate(B,...,sum)
  mpi_win_unlock()

remember that you can call accumulate over and over, even to the same data element, and others can be calling accumulate to that element as well. atomicity is maintained on a per-data-element basis only in the shared case.

brian's implementation idea (sketched below):
- bust a window into a number of contiguous regions
- allow locks on each one of these regions separately
- in some cases there will be atomic operations supported by the processor, and in those cases we might be able to avoid the lock
- always acquire locks in sequential order

problems with the implementation:
- nasty strided datatypes which pass over a region and loop back would be really slow
  - for those you maybe want to get all the locks ahead of time
  - how do you detect them? maybe nonmonotonically increasing -> ``bad datatype''?
- maybe cache incoming data and apply operations on elements in monotonically increasing order?
- rather than this, maybe you reorganize both the source and destination datatypes so that the elements arrive in monotonically increasing order? THIS IS A BETTER BUT SCARIER IDEA.
  - this is an interesting problem of applying identical transformations to the two separate datatypes
  - worst case this could be done by breaking a datatype out into a huge struct; there should be a better/more efficient way
  - caching transformed datatypes?
  - this could be done on portions of the datatypes as well, to increase the granularity of operations with respect to the locking
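a minimal sketch of the region-lock idea, assuming the target datatype has already been flattened into (offset, length) segments and that the window keeps one lock per contiguous region; all names here are hypothetical. the key point is acquiring the per-region locks in sequential (increasing) order so that concurrent accumulates from different origins cannot deadlock.

\begin{verbatim}
#include <stddef.h>

/* hypothetical types/functions; only the ordering discipline matters here */
typedef struct {
    size_t region_size;      /* window split into NREGIONS contiguous regions */
    /* ... one lock object per region lives here ... */
} region_win_t;

void region_lock(region_win_t *win, int r);

#define NREGIONS 64

typedef struct { size_t offset, length; } segment_t;

/* acquire the locks for every region touched by a (flattened) target
 * datatype, always in increasing region order to avoid deadlock */
static void lock_regions(region_win_t *win, const segment_t *segs, int nsegs)
{
    int needed[NREGIONS] = {0};
    int i, r;

    for (i = 0; i < nsegs; i++) {
        int first = (int)(segs[i].offset / win->region_size);
        int last  = (int)((segs[i].offset + segs[i].length - 1) / win->region_size);
        for (r = first; r <= last && r < NREGIONS; r++)
            needed[r] = 1;
    }

    for (r = 0; r < NREGIONS; r++)
        if (needed[r])
            region_lock(win, r);
}
\end{verbatim}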
it sounds like we're going to want to define the mechanism for locking at window create time, or later if possible. the number of processes within the group on which a window is created should also be a factor when determining how to implement locking on the window.

another possibility is locking on datatypes instead -- then we need functions which can look for overlaps between datatypes, which is kinda nasty...

we don't need single writer/multiple reader locks, because writing to elements that are being read is illegal. likewise with put you don't have to lock, because it is illegal for someone to write to the same location twice in the same epoch. illegal operations will be detected by the bit code above. so really it's only the accumulate that causes locking issues.

proposal: start at random offsets into the target datatype when processing. i (rob) think that this is an interesting but problematic idea. it's nondeterministic, and each guy has to get a random number. the idea, though, is to try to space out operations on the target to get better lock utilization.

alternatively you could use the rank as a unique number: divide the datatype into N blocks, where N = # of guys in the window object, and each guy uses his rank to determine which block to start in (sketched below). there are all sorts of assumptions on the nature of the datatype here, but it will help with better lock utilization at startup in some cases.

back to the example
-------------------
you have a shared lock, you're doing a get on one set of things and an accumulate on another set of things. assume they are nonoverlapping datatypes.

  mpi_win_lock(shared, rank, assert, win)
  mpi_accumulate(B,...,sum)
  mpi_win_unlock()

the process asks for a shared lock (exposure epoch) on the remote process.

another example, looking at shmem combined with remote access
--------------------------------------------------------------
two pairs of processes on the same nodes. a shared memory lock on a single node allows one process to look directly into the window information, so that it doesn't have to go through the agent of the other process on the local node. that same lock will be used by the communication agent for a given process to ensure that things are kept sane in the off-node case; in other words, this lock is used by processes on the same node directly, but is also used by a single agent in the case of off-node access, in both cases to coordinate access to the local window information. by ``direct access'' this may mean the process directly, or via that process's agent.

implication/question: starting an exposure or access epoch will need support functions within the methods. these would possibly be used as an alternative to going through the agent, in order to have a fast path for these things.

idea for an assertion (valid for shared or exclusive):
- NONOVERLAPPING says that i guarantee that none of the operations in the epoch will overlap. is this there already? if we have this, we can avoid locking entirely, relying on the user to have done the right thing (tm) :).
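going back to the rank-based starting-offset proposal above, a minimal sketch of how an origin might pick where to begin processing its accumulate. the names and the simple contiguous view of the target data are made up for illustration; the real version would work on datatype blocks rather than raw elements.

\begin{verbatim}
#include <stddef.h>

/* each origin starts its accumulate in its own block of the target data
 * (block chosen by rank), wrapping around, to spread out lock contention
 * at startup */
static void rank_offset_accumulate(double *target, const double *src,
                                   size_t nelems, int rank, int nprocs)
{
    size_t block = (nelems + nprocs - 1) / nprocs;   /* elements per block */
    size_t start = (size_t)rank * block;             /* this origin's block */
    size_t done;

    for (done = 0; done < nelems; done++) {
        size_t i = (start + done) % nelems;
        /* ... take whatever lock covers element i ... */
        target[i] += src[i];
        /* ... release it ... */
    }
}
\end{verbatim}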
post-wait/start-complete
------------------------
  (target)                    (origin)
  post(group, assert, win)    start(group, assert, win)
  ...                         ...
  wait(win)                   complete(win)

complete(win) - only ensures that all operations have been ``locally completed''; they might not have completed at the target yet.
wait(win) - blocks on all complete()s, and completes all operations at the target before returning.
start(group, assert, win) - can block until the matching post(), but isn't required to.
post(group, assert, win) - doesn't block.

this is different from lock/unlock, which ensures that things are completed on the target.

observation: there is an option passed via info at create time that says no_locks; then there will be no passive target stuff. great!

note: the groups here don't have to be identical: origins will have their targets in their start() group, while the targets will have all of their origins in their post() group. if a post() group only contains one member, it is equivalent to an exclusive win lock. you can't have two outstanding post()s, as best we can tell, because you only wait() on the window. you could have different post()s on different windows... so this approach is similar to lock/unlock, only the post() doesn't HAVE to finish the operations on the target (or wait for them to finish).

todo: we need to look at the overlapping window stuff and learn about the private/public views...

approaches
----------
for little sets of messages, we can use a single message from the origin to move all the operations across. for large messages, we want to pipeline.

david: if we treat them like MPI messages, we could just do sends/receives as necessary. this keeps new concepts out of the CA code. otherwise we have this ``multiple operations in one message'' concept.

brian: people might find themselves doing a for loop of put()s instead of using a datatype, so aggregation makes sense in this case.

rob: those people are stupid and they should use datatypes. anyway, we don't know what the intention is; we should try to help people if possible by using aggregation.

aggregation approaches:
- keep track of a size, and then cache up to that size
  - david points out this will make little things slow. he wants to be aggressive about performing operations locally in order to keep small things fast
- keep a count?
- dynamic optimization? watch patterns on a window and try to cache only when it seems like the right thing?

examples:
- for loop with little puts and computation between the puts
- vs. for loop with little puts and no computation

in the first case you have the opportunity to make communication progress, while in the second case you want to wait and aggregate. i (rob) think we're going to want both an aggressive and a caching/combining mode, because there are situations where one or the other is obviously optimal.

brian: we need to be able to support this ^^^ in our interface in order to be able to test and learn which of these modes really works.

david: there's not really much coordination here, so doing optimal ordering is going to be tough. BUT we do have the group on the target... but we'll never have as much as we do in the fence case.

the accumulate is odd as usual. if you're the only person in the target's group, then you don't have to do the element-wise locking/atomicity (or window-wise locking/unlocking, depending on implementation).

brian: it would be helpful for the origin to know whether he needs to lock during accumulate operations. this is particularly useful in the shmem scenario.
- something in the window structure, stored in shared memory, could allow for this optimization (see the sketch below)
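a tiny sketch of the optimization brian is describing: keep enough in the shared-memory window structure for the origin to decide up front whether its accumulate needs any locking. the field and function names are hypothetical.

\begin{verbatim}
/* hypothetical fields kept in the window structure in shared memory */
typedef struct {
    int exposure_group_size;   /* size of the target's post() group */
    /* ... locks, base address, etc. ... */
} shared_win_info_t;

/* if we are the only member of the target's exposure group, no one else can
 * be accumulating concurrently, so element-wise locking can be skipped */
static int accumulate_needs_lock(const shared_win_info_t *info)
{
    return info->exposure_group_size > 1;
}
\end{verbatim}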
otherwise the accumulate is basically the same as in the lock/unlock window op case (gets/puts, lock requests, etc.).

fence
-----
Q: does win_create() fill in for the first fence, or do you need a first fence?
A: you need the first fence (you have to in order to have created an exposure epoch).

there are optimizations here that we can get from the BSP people (collecting and scheduling communication).

assertions:
  no store   - no local stores in previous epoch
  no put     - no remote updates to local window in new epoch
  no precede - no local rma calls in previous epoch, collective assert only
  no succeed - no local rma calls in new epoch, collective assert only

those last two are obviously useful for the first/last fence cases; they are one way to know you're done with fences. you can also do local loads/stores in those epochs too... it's a way of saying ``i'm only doing local stuff for a moment''.

Q: how does one cleanly switch between synchronization modes? in particular, how does one switch cleanly OUT of fence mode? I think we're just not reading the fence stuff carefully enough.
A: ``fence starts an exposure epoch IF followed by another fence call and the local window is the target of RMA ops between fence calls''. ``the call starts an access epoch IF it is followed by another fence call and by RMA communication calls issued between the two calls''.

HA! that's nasty. so hitting a fence really doesn't tell us as much as we originally thought it did. we need to figure out what we can do in the context of these goofy rules.

bad example we think is legal
-----------------------------
  (0)       (1)       (2)          (3)
  fence     fence     fence        fence
  put(1)    put(0)    start(3)     post(2)
                      put(3)       put(2)
                      complete()   wait()
  fence     fence     fence        fence

Is this legal? The epochs are created on 0 and 1, but not on 2 and 3, as a result of the fences. We know the fences are collective, but is the creation of the epochs?

even worse example
------------------
  (0)       (1)       (2)            (3)
  fence     fence     fence          fence
  put(1)    put(0)    start(3)       post(2)
                      put(3)         put(2)
                      complete()     wait()
                      barrier(2,3)   barrier(2,3)
                      put(3)         put(2)
  fence     fence     fence          fence

What about that one?

Bill: FIX THE TEXT! (meaning we should propose a clarification to the standard)

goals:
1) no mixed-mode stuff between fence and the others...

approach:
0) comb through the chapter and see if something is already there
1) description of the problem (our examples, building up w/ 2)
2) we know this wasn't intended
3) propose clarifications

what next?
----------
david: looking at this as a building-block sort of thing, as we did with xfer. is there a way to approach this in the same way? the obvious blocks would be access and/or exposure epochs. exposure epochs can be thought of as having a reference count, with the wait() (or fence, i guess) blocking until the refcount hits 0.

Q: do we, in our code, need to explicitly define epochs? is it harder to follow the rules with or without them?

scenarios:
- fence: how do you know who did ops/created epochs?
- which is the right approach? aggressive vs. combined?

we are probably going to punt on detecting errors between overlapping windows at first. later we could detect overlapping windows at create time and then do error checking for invalid operations between windows on the destination at the time the epochs are serviced (or whatever we call that).

brainstorm: for the non-aggregating case, a put creates a special car (including datatype etc.)
which sends a special header across to the target. the target understands how to receive these cars and will create a matching recv car to receive and store the data appropriately (after creating the datatype if necessary... this is still an unsolved problem). an accumulate can be performed in the same manner, with a recv_mop being created on the target instead of just a recv. we can use the counter decrement capability we have created for use with cars and requests in order to decrement counters in exposure epochs. this will allow for easy waits on epochs.

in the aggregating case, we would send a special header describing the aggregated operations across to the target. the target parses this header and creates an appropriate car string. the local side has already created the rest of the cars necessary to perform the data transfer as well, relying on completion dependencies on the local side to get the operations in the right order.

we can serialize a datatype in a deterministic manner. datatype caching is done on a demand basis; we've talked about this before. how does the need for retrieving a datatype fit into the special header scheme laid out above?

For the heterogeneous case, the datatype definitions need to be expressed in terms of element offsets, not byte offsets. For example, if an indexed type is automatically converted and stored in terms of an hindexed type, the definition sent to a remote process (with different type sizes) will contain incorrect byte offsets for the remote machine. We need to make sure to store the original element displacements/offsets in the vector etc. cases where that is how the datatype is originally defined, even if we use byte offsets locally.

With reactive caching, we cannot allow the datatypes sent to the target process to be freed before the target has definitely completed operating with the datatype. In the lock/unlock and fence cases, the local process implicitly knows that the target is done with the datatype when the unlock/fence returns. This leaves us with the start/complete case as the only problem case. This final case will be handled with a lazy ack based on the access epochs in which the datatype was used. In other words, the reference count on a datatype is incremented the first time the datatype is used by an RMA operation in an access epoch, and the datatype is "logged" in the access epoch structure. The access epoch structure also contains a flag stating whether a put or accumulate operation was requested during this access epoch. After Win_complete() detects that all get operations have completed, if the flag is not set, it will decrement the reference counts of the logged datatypes and free the access epoch structure. If the flag is set, the datatype reference counts may only be decremented once an explicit acknowledgement has been received from the target informing the origin that all operations requested by that access epoch have been completed.

david proposes that rather than a single flag we instead use a flag on each datatype. this would allow us to free the datatypes only used for gets immediately, delaying only for the ones used by puts/accumulates (see the sketch below).

it's reasonable for the origin to force the send of the datatype when he knows that the target hasn't seen it yet. we should consider this. we don't have to send basic types; that's important to remember too.

oops! the fence isn't as well-behaved as we thought. fence only implies local completion of the last epoch. so we're going to have to keep up with things for fence as well. oops!
the lock/unlock isn't either :). the public copy on the other side is assured to have been "updated", but that doesn't mean that you are necessarily done with the datatype.

we could do lazy release consistency a la treadmarks, and it would give us performance advantages in some situations, but we aren't going to do that. we plan to have a single copy of our data. thus our lock/unlock case will be ok, as there is no public/private copy issue.
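a sketch of the datatype logging described above, using david's per-datatype flag instead of a single per-epoch flag. the structure and function names (logged_dtype_t, dtype_decr_ref(), and so on) are hypothetical. entries whose datatypes were only used for gets are released as soon as Win_complete() sees local completion; the rest wait for the lazy ack from the target.

\begin{verbatim}
#include <stdlib.h>

struct dtype;                            /* cached derived datatype (opaque here) */
void dtype_decr_ref(struct dtype *d);    /* hypothetical reference-count release */

typedef struct logged_dtype {
    struct dtype        *dtype;
    int                  used_for_update;   /* used by a put or accumulate */
    struct logged_dtype *next;
} logged_dtype_t;

typedef struct access_epoch {
    logged_dtype_t *dtypes;      /* datatypes first used in this epoch */
    /* ... origin-assigned id, counters, ... */
} access_epoch_t;

/* called once from Win_complete() after all gets locally complete
 * (have_remote_ack = 0), and again when the target's ack arrives
 * (have_remote_ack = 1); released entries are unlinked so nothing is
 * decremented twice */
static void release_epoch_dtypes(access_epoch_t *ae, int have_remote_ack)
{
    logged_dtype_t **lp = &ae->dtypes;
    while (*lp != NULL) {
        logged_dtype_t *ld = *lp;
        if (!ld->used_for_update || have_remote_ack) {
            dtype_decr_ref(ld->dtype);   /* get-only datatypes go first */
            *lp = ld->next;
            free(ld);
        } else {
            lp = &ld->next;              /* keep until the target acks */
        }
    }
}
\end{verbatim}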
the fence will be implemented in a manner similar to a complete/wait on the previous epochs (access, exposure). we have to ensure that we don't inadvertently create epochs that are empty.

brian's notes mention using a counter per target which is exchanged at each fence. this tells the target how many operations need to be completed before it can leave the fence. this works well in the eager case, but is probably overkill for the aggregated case, where you're going to pass all the operations over anyway. this can also be done with N reductions; we might be able to work out an all-to-all reduce that does the right thing for this. there are a couple of asserts which will be useful for reducing communication here, and we can do a little extra debugging checking based on these as well. we know we can leave the fence when we have completed the total number of operations counted in the reduction operation.

aside: we could use mprotect() to detect local loads/stores on a local window if we wanted to, for debugging purposes.

scenario: start/complete (sort of)
----------------------------------
we can get a context id from dup'ing the communicator at create time, or we can get a new context id based on the old communicator -- generate_new_context_id() or something like that. everyone participating in the win create must agree on the context id.

start must create an access epoch which can be matched to an exposure epoch on the other side. we don't think we need to match anything special at start/post time in order to match epochs, but we aren't sure. our context ids can be used to match the AE to the appropriate window on the target.

targets are going to have to track pending AEs until they hit a point where a post (or whatever) has occurred. there will be situations where multiple AEs are queued for a single window, and we must handle this as well. all tracking is associated with a local window. access epochs are just created on the fly by the target as placeholders for what is going on; there doesn't have to be anything special about how these are identified.

some time prior to the origin locally completing an AE, an origin-assigned ID is passed to the target. this ID is returned to the origin by the target when the AE has been completed on the target side. puts/gets/accs don't have to have an origin-assigned id or be matched with more than the context id and origin, under the assumption that there are no overtaking messages and only one active AE from an origin at a time. there is no valid case where one origin has more than one outstanding and active AE for the same target window.
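a small sketch of the target-side tracking just described; all names are hypothetical. incoming puts/gets/accumulates are matched to a pending access epoch record using only the context id and origin rank, per the no-overtaking / one-active-AE assumption above.

\begin{verbatim}
#include <stdlib.h>

/* hypothetical target-side record for a pending access epoch (AE) */
typedef struct pending_ae {
    int                context_id;    /* identifies the target window */
    int                origin_rank;   /* at most one active AE per origin */
    int                origin_ae_id;  /* origin-assigned id, echoed back on completion */
    int                ops_pending;   /* operations received but not yet applied */
    struct pending_ae *next;
} pending_ae_t;

/* find (or lazily create) the AE record an incoming operation belongs to */
static pending_ae_t *find_or_create_ae(pending_ae_t **list,
                                       int context_id, int origin_rank)
{
    pending_ae_t *ae;
    for (ae = *list; ae != NULL; ae = ae->next)
        if (ae->context_id == context_id && ae->origin_rank == origin_rank)
            return ae;

    ae = calloc(1, sizeof(*ae));      /* placeholder created on the fly */
    ae->context_id   = context_id;
    ae->origin_rank  = origin_rank;
    ae->origin_ae_id = -1;            /* not known until the origin sends it */
    ae->next = *list;
    *list = ae;
    return ae;
}
\end{verbatim}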
------------------------------------------------------------------------

RMA requirements
- we need the option of aggregating operations within an epoch

Window object
- states
  - local
  - public
- exposure epoch tracking (for operations on the local window)
  - need a queue for ordering exposure epochs and ensuring proper shared/exclusive semantics for the passive target case
  - each epoch needs a queue for storing incoming operation requests associated with the exposure epoch
- access epoch tracking (per local window?)
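a rough sketch of a window object covering the requirements above (the access epoch tracking could reuse the pending_ae_t record sketched earlier); every name and field here is hypothetical.

\begin{verbatim}
#include <stddef.h>

struct pending_ae;                       /* access epoch record, as sketched earlier */

typedef enum { WIN_STATE_LOCAL, WIN_STATE_PUBLIC } win_state_t;

typedef struct rma_op {                  /* one queued put/get/accumulate request */
    int            origin_rank;
    struct rma_op *next;
} rma_op_t;

typedef struct exposure_epoch {          /* one exposure epoch on the local window */
    int                    exclusive;    /* shared vs. exclusive semantics */
    rma_op_t              *op_queue;     /* incoming operation requests for this epoch */
    struct exposure_epoch *next;         /* ordering queue (passive target case) */
} exposure_epoch_t;

typedef struct window {
    win_state_t        state;            /* local vs. public */
    void              *base;
    size_t             size;
    exposure_epoch_t  *exposure_queue;   /* ordered exposure epochs */
    struct pending_ae *access_epochs;    /* access epoch tracking (per window?) */
    int                aggregate_ops;    /* option: aggregate operations within an epoch */
} window_t;
\end{verbatim}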