Blob Blame History Raw
Capturing Intel(R) Processor Trace (Intel PT) {#capture}
=============================================

<!---
 ! Copyright (c) 2015-2017, Intel Corporation
 !
 ! Redistribution and use in source and binary forms, with or without
 ! modification, are permitted provided that the following conditions are met:
 !
 !  * Redistributions of source code must retain the above copyright notice,
 !    this list of conditions and the following disclaimer.
 !  * Redistributions in binary form must reproduce the above copyright notice,
 !    this list of conditions and the following disclaimer in the documentation
 !    and/or other materials provided with the distribution.
 !  * Neither the name of Intel Corporation nor the names of its contributors
 !    may be used to endorse or promote products derived from this software
 !    without specific prior written permission.
 !
 ! THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
 ! AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 ! IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 ! ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
 ! LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
 ! CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
 ! SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
 ! INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
 ! CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
 ! ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
 ! POSSIBILITY OF SUCH DAMAGE.
 !-->

This chapter describes how to capture Intel PT for processing with libipt.  For
illustration, we use the sample tools ptdump and ptxed.


## Capturing Intel PT on Linux

Starting with version 4.1, the Linux kernel supports Intel PT via the perf_event
kernel interface.  Starting with version 4.3, the perf user-space tool will
support Intel PT as well.


### Capturing Intel PT via Linux perf_event

We start with setting up a perf_event_attr object for capturing Intel PT.  The
structure is declared in `/usr/include/linux/perf_event.h`.

The Intel PT PMU type is dynamic.  Its value can be read from
`/sys/bus/event_source/devices/intel_pt/type`.

~~~{.c}
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = <read type>();

    attr.exclude_kernel = 1;
    ...
~~~


Once all desired fields have been set, we can open a perf_event counter for
Intel PT.  See `man 2 perf_event_open` for details.  In our example, we
configure it for tracing a single thread.

The system call returns a file descriptor on success, `-1` otherwise.

~~~{.c}
    int fd;

    fd = syscall(SYS_perf_event_open, &attr, <pid>, -1, -1, 0);
~~~


The Intel PT trace is captured in the AUX area, which has been introduced with
kernel 4.1.  The DATA area contains sideband information such as image changes
that are necessary for decoding the trace.

In theory, both areas can be configured as circular buffers or as linear buffers
by mapping them read-only or read-write, respectively.  When configured as
circular buffer, new data will overwrite older data.  When configured as linear
buffer, the user is expected to continuously read out the data and update the
buffer's tail pointer.  New data that do not fit into the buffer will be
dropped.

When using the AUX area, its size and offset have to be filled into the
`perf_event_mmap_page`, which is mapped together with the DATA area.  This
requires the DATA area to be mapped read-write and hence configured as linear
buffer.  In our example, we configure the AUX area as circular buffer.

Note that the size of both the AUX and the DATA area has to be a power of two
pages.  The DATA area needs one additional page to contain the
`perf_event_mmap_page`.

~~~{.c}
    struct perf_event_mmap_page *header;
    void *base, *data, *aux;

    base = mmap(NULL, (1+2**n) * PAGE_SIZE, PROT_WRITE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED)
        return <handle data mmap error>();

    header = base;
    data = base + header->data_offset;

    header->aux_offset = header->data_offset + header->data_size;
    header->aux_size   = (2**m) * PAGE_SIZE;

    aux = mmap(NULL, header->aux_size, PROT_READ, MAP_SHARED, fd,
               header->aux_offset);
    if (aux == MAP_FAILED)
        return <handle aux mmap error>();
~~~


### Capturing Intel PT via the perf user-space tool

Starting with kernel 4.3, the perf user-space tool can be used to capture Intel
PT with the `intel_pt` event.  See tools/perf/Documentation in the Linux kernel
tree for further information.  In this text, we describe how to use the captured
trace with the ptdump and ptxed sample tools.

We start with capturing some Intel PT trace using the intel_pt event.

~~~{.sh}
    $ perf record -e intel_pt//u --per-thread -- grep -r foo /usr/include
    [ perf record: Woken up 26 times to write data ]
    [ perf record: Captured and wrote 51.969 MB perf.data ]
~~~


This generates a `perf.data` file that contains the Intel PT trace, the sideband
information, and some metadata.  To process the trace with libipt, we need to
extract the Intel PT trace into one file per thread or cpu.

Looking at the raw trace dump of `perf script -D`, we notice
`PERF_RECORD_AUXTRACE` records.  The raw Intel PT trace is contained directly
after such records.  We can extract it with the `dd` command.  The arguments to
`dd` can be computed from the record's fields.  This can be done automatically,
for example with an AWK script.

~~~{.awk}
  /PERF_RECORD_AUXTRACE / {
    offset = strtonum($1)
    hsize  = strtonum(substr($2, 2))
    size   = strtonum($5)
    idx    = strtonum($11)

    ofile = sprintf("perf.data-aux-idx%d.bin", idx)
    begin = offset + hsize

    cmd = sprintf("dd if=perf.data of=%s conv=notrunc oflag=append ibs=1 \
                  skip=%d count=%d status=none", ofile, begin, size)

    system(cmd)
  }
~~~

The libipt tree contains such a script in `script/perf-read-aux.bash`.

In addition to the Intel PT trace, we need the traced memory image.  When
tracing a single process where the memory image does not change during tracing,
we can construct the memory image by examining `PERF_RECORD_MMAP` and
`PERF_RECORD_MMAP2` records.  This can again be done automatically, for example
with an AWK script.

~~~{.awk}
  function handle_mmap(file, vaddr) {
    if (match(file, /\[.*\]/) != 0) {
      # ignore 'virtual' file names like [kallsyms]
    }
    else if (match(file, /\.ko$/) != 0) {
      # ignore kernel objects
      #
      # use /proc/kcore
    }
    else {
      printf(" --elf %s:0x%x", file, vaddr)
    }
  }

  /PERF_RECORD_MMAP / {
    vaddr = strtonum(substr($5, 2))
    file = $9

    handle_mmap(file, vaddr)
  }

  /PERF_RECORD_MMAP2 / {
    vaddr = strtonum(substr($5, 2))
    file = $12

    handle_mmap(file, vaddr)
  }
~~~

The above script generates options for the `ptxed` sample tool.  The libipt tree
contains such a script in `script/perf-read-image.bash`.

Let's put it all together.

~~~{.sh}
    $ perf record -e intel_pt//u --per-thread -- grep -r foo /usr/include
    [ perf record: Woken up 26 times to write data ]
    [ perf record: Captured and wrote 51.969 MB perf.data ]
    $ script/perf-read-aux.bash
    $ script/perf-read-image.bash | xargs ptxed --cpu 6/61 --pt perf.data-aux-idx0.bin
~~~


### Sideband support

The above example does not consider sideband information.  It therefore only
works for not-too-complicated single-threaded applications.  For tracing
multi-threaded applications or for system-wide tracing (including ring-3),
sideband information is required for decoding the trace.

Sideband information can be defined as any information necessary for decoding
Intel PT that is not contained in the trace stream itself.  We already supply:

  * the binary files whose execution was traced and the virtual address at which
    each file was loaded
  * the family/model/stepping of the processor on which the trace was recorded
  * some information regarding timing


What's missing is information about changes to the traced memory image while the
trace is being recorded:

  * memory map/unamp information
  * context switch information


On Linux, this information can be found in the form of PERF_EVENT records in the
DATA buffer or in the perf.data file respectively.

Collection and interpretation of this information is currently left completely
to the user.


### Capturing Intel PT via Simple-PT

The Simple-PT project on github supports capturing Intel PT on Linux with an
alternative kernel driver.  The spt decoder supports sideband information.

See the project's page at https://github.com/andikleen/simple-pt for more
information including examples.