Implementation of passively synchronized RMA for shared memory machines ------------------------------------------------------------------------ Base Assumptions * All of the local windows are located in shared memory. * Only basic datatypes are supported for the target. ------------------------------------------------------------------------ General Notes ------------------------------------------------------------------------ Data Structures * MPID_Win * struct MPIR_Win * lwin_rwmutexes[np] * region_mutexes[nregions] ------------------------------------------------------------------------ MPID_Win_lock * if MPI_MODE_NOCHECK is not set * if lock_type is MPI_LOCK_SHARED, then acquire proc_rwmutexes[rank] as a reader, otherwise acquire it as a writer NOTE: the read-write mutex should be implemented fairly so that writers are not starved by continual overlapping lock requests by readers. * set process local state to indicate whether or not the rw-mutex is held * set process local state to indicate if the lock is shared or exclusive ------------------------------------------------------------------------ MPID_Win_unlock * release proc_rwmutex, if it is being held ------------------------------------------------------------------------ MPID_Accumulate NOTE: When the lock is shared, we can achieve more parallelism by dividing the local window into regions. Each region would have a separate mutex to guarantee that all data elements with that region were processed atomically. Ideally, dataloops would be optimized such that a region mutex would never be acquired more than once per accumulate operation. NOTE: For machines where intrinsic types must be aligned on boundaries of that types size, we can ensure that a type does not cross a region boundary by aligning the region boundaries at addresses divisible by the size of the largest type and forcing all regions to contain at least as many bytes as the largest type. For all other machines, extra logic will be required to hold multiple mutexes when a type crosses a region boundary. NOTE: When the lock is exclusive, slicing up the local window and optimizing the dataloops to increase region locality is unnecessary.