Example of using LD_PRELOAD with the CUDA component.  
Asim YarKhan (2015)

A short example of using LD_PRELOAD on a Linux system to intercept
function calls and PAPI-enable an un-instrumented CUDA binary.

Several CUDA events (e.g. SM PM counters) require a CUcontext handle
to be a provided since they are context switched. This means that we
cannot use a PAPI_attach from an external process to measure those
events in a preexisting executable.  These events can only be measured
from within the CUcontext, that is, within the CUDA enabled code we
are trying to measure.  If the user is unable to change the source
code, they may be able to use LD_PRELOAD's ability to trap functions
and measure the events for within the executable.

This example is designed to work with the simpleMultiGPU_no_counters
binary in the PAPI CUDA component tests directory.  We use ltrace to
figure out where to attach the PAPI start, PAPI eventset management
and PAPI_stop.  Please note that this is a rough example; return codes
are not be checked and other changes may be required to make sure that
the calls are intercepted at the right moment.

First trace the library calls in simpleMultiGPU_no_counters binary
were traced using ltrace.  Note in the ltrace output that the CUDA C
APIs are different from the CUDA calls visible to nvcc. Then figure
out appropriate place to attach the PAPI calls.  The initialization is
attached to the first entry to cudaSetDevice.  Each cudaSetDevice is
also used to setup the PAPI events for that device.  It was harder to
figure out where to attach the PAPI_start.  After running some tests,
I attached it to the 18th invocation of gettimeofday (kind of
arbitrary! Sorry! May need tweaking).  The PAPI_stop was attached to
the first invocation of cudaFreeHost.


[Note: There are other events that do not require a CUcontext.  The PM
counter for TEX, L2, and FB are not context switched so it would be
possible to sample these values from any context as long as the
context is on the same CUDA device. These events could be measured
using a PAPI_attach from another process using the same CUDA device.]


--------------------------------------------------
How to use this example... please read carefully to make sense of the following.

Build:
make cuda_ld_preload_example.so

Trace the executable using ltrace to figure out where to intercept the calls: 
# Do the tracing with a small example!
# ( export PAPI_DIR=`pwd`/../../.. && export LIBPFM_LIBDIR=`pwd`/../../../libpfm4/lib && export LD_LIBRARY_PATH=./:${PAPI_DIR}:${LIBPFM_LIBDIR}:${LD_LIBRARY_PATH}  && ltrace --output ltrace.out --library /usr/lib64/libcuda.so.1 ./simpleMultiGPU_no_counters  )
# ( export PAPI_DIR=`pwd`/../../.. && export LIBPFM_LIBDIR=`pwd`/../../../libpfm4/lib && export LD_LIBRARY_PATH=./:${PAPI_DIR}:${LIBPFM_LIBDIR}:${LD_LIBRARY_PATH} && LD_PRELOAD=./cuda_ld_preload_example.so ltrace ./simpleMultiGPU_no_counters )

Run using dynamic linking to find the correct libraries:
( export PAPI_DIR=`pwd`/../../.. && export LIBPFM_LIBDIR=`pwd`/../../../libpfm4/lib && export LD_LIBRARY_PATH=./:${PAPI_DIR}:${LIBPFM_LIBDIR}:${LD_LIBRARY_PATH} && LD_PRELOAD=./cuda_ld_preload_example.so ./simpleMultiGPU_no_counters )

make cuda_ld_preload_example.so && ( export PAPI_DIR=`pwd`/../../.. && export LIBPFM_LIBDIR=`pwd`/../../../libpfm4/lib && export LD_LIBRARY_PATH=./:${PAPI_DIR}:${LIBPFM_LIBDIR}:${LD_LIBRARY_PATH} && LD_PRELOAD=./cuda_ld_preload_example.so ./simpleMultiGPU_no_counters )