Example of using LD_PRELOAD with the CUDA component.
Asim YarKhan (2015)

A short example of using LD_PRELOAD on a Linux system to intercept function calls and PAPI-enable an un-instrumented CUDA binary.

Several CUDA events (e.g. SM PM counters) require a CUcontext handle to be provided, since they are context switched. This means that we cannot use PAPI_attach from an external process to measure those events in a preexisting executable. These events can only be measured from within the CUcontext, that is, within the CUDA-enabled code we are trying to measure. If the user is unable to change the source code, they may be able to use LD_PRELOAD's ability to trap functions and measure the events from within the executable.

This example is designed to work with the simpleMultiGPU_no_counters binary in the PAPI CUDA component tests directory. We use ltrace to figure out where to attach PAPI_start, the PAPI eventset management, and PAPI_stop. Please note that this is a rough example; return codes are not checked, and other changes may be required to make sure that the calls are intercepted at the right moment.

First, the library calls in the simpleMultiGPU_no_counters binary were traced using ltrace. Note in the ltrace output that the CUDA C APIs are different from the CUDA calls visible to nvcc. Then I figured out appropriate places to attach the PAPI calls. The initialization is attached to the first entry to cudaSetDevice. Each cudaSetDevice call is also used to set up the PAPI events for that device. It was harder to figure out where to attach PAPI_start. After running some tests, I attached it to the 18th invocation of gettimeofday (kind of arbitrary! Sorry! May need tweaking). PAPI_stop was attached to the first invocation of cudaFreeHost.

[Note: There are other events that do not require a CUcontext.
The PM counters for TEX, L2, and FB are not context switched, so it would be possible to sample these values from any context as long as the context is on the same CUDA device. These events could be measured using PAPI_attach from another process using the same CUDA device.]

--------------------------------------------------

How to use this example... please read carefully to make sense of the following.

Build:

  make cuda_ld_preload_example.so

Trace the executable using ltrace to figure out where to intercept the calls:

  # Do the tracing with a small example!
  # ( export PAPI_DIR=`pwd`/../../.. && export LIBPFM_LIBDIR=`pwd`/../../../libpfm4/lib && export LD_LIBRARY_PATH=./:${PAPI_DIR}:${LIBPFM_LIBDIR}:${LD_LIBRARY_PATH} && ltrace --output ltrace.out --library /usr/lib64/libcuda.so.1 ./simpleMultiGPU_no_counters )
  # ( export PAPI_DIR=`pwd`/../../.. && export LIBPFM_LIBDIR=`pwd`/../../../libpfm4/lib && export LD_LIBRARY_PATH=./:${PAPI_DIR}:${LIBPFM_LIBDIR}:${LD_LIBRARY_PATH} && LD_PRELOAD=./cuda_ld_preload_example.so ltrace ./simpleMultiGPU_no_counters )

Run using dynamic linking to find the correct libraries:

  ( export PAPI_DIR=`pwd`/../../.. && export LIBPFM_LIBDIR=`pwd`/../../../libpfm4/lib && export LD_LIBRARY_PATH=./:${PAPI_DIR}:${LIBPFM_LIBDIR}:${LD_LIBRARY_PATH} && LD_PRELOAD=./cuda_ld_preload_example.so ./simpleMultiGPU_no_counters )

Build and run in one step:

  make cuda_ld_preload_example.so && ( export PAPI_DIR=`pwd`/../../.. && export LIBPFM_LIBDIR=`pwd`/../../../libpfm4/lib && export LD_LIBRARY_PATH=./:${PAPI_DIR}:${LIBPFM_LIBDIR}:${LD_LIBRARY_PATH} && LD_PRELOAD=./cuda_ld_preload_example.so ./simpleMultiGPU_no_counters )
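
--------------------------------------------------

For readers new to this kind of interception, the call-counting trick described in the notes above (attaching the start of measurement to the Nth invocation of gettimeofday) might look roughly like the C sketch below. This is not the code in cuda_ld_preload_example.c, just a minimal illustration under some assumptions: the names START_ON_CALL, gettimeofday_calls, counters_started and start_counters_hook are hypothetical, the hook only sets a flag where the real example would call PAPI_start, and the wrapper's prototype assumes a recent glibc in which the second argument of gettimeofday is declared as void *. No PAPI or CUDA headers are needed to compile it.

```c
#define _GNU_SOURCE            /* needed for RTLD_NEXT */
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>
#include <sys/time.h>

/* Which gettimeofday call triggers the start hook. 18 is the value the
 * notes above arrived at by experiment; it may need tweaking. */
#define START_ON_CALL 18

static int gettimeofday_calls = 0;  /* number of times the wrapper was entered */
static int counters_started = 0;    /* set once the start hook has fired */

/* Hypothetical placeholder: the real example would call PAPI_start here. */
static void start_counters_hook(void)
{
    counters_started = 1;
    fprintf(stderr, "preload sketch: start hook fired on gettimeofday call %d\n",
            gettimeofday_calls);
}

/* Same name as the libc function, so the dynamic linker resolves the
 * application's calls to this wrapper when the library is preloaded. */
int gettimeofday(struct timeval *tv, void *tz)
{
    static int (*real_gettimeofday)(struct timeval *, void *) = NULL;

    if (real_gettimeofday == NULL) {
        /* Find the next definition in the search order (normally libc's). */
        real_gettimeofday =
            (int (*)(struct timeval *, void *))dlsym(RTLD_NEXT, "gettimeofday");
        assert(real_gettimeofday != NULL);
    }

    gettimeofday_calls++;
    if (gettimeofday_calls == START_ON_CALL)
        start_counters_hook();

    /* Always forward to the real function so the application is unaffected. */
    return real_gettimeofday(tv, tz);
}
```

Built as a shared object (for instance with gcc -shared -fPIC, plus -ldl on older glibc versions) and named in LD_PRELOAD, a wrapper like this would be entered in place of the libc function. The same pattern, with a counter compared against 1, would cover the "first entry to cudaSetDevice" and "first invocation of cudaFreeHost" cases.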