Blob Blame History Raw
$Id: virtual.txt,v 1.3 2004/08/09 09:42:22 mikpe Exp $

VIRTUAL PER-PROCESS PERFORMANCE COUNTERS
========================================
This document describes the virtualised per-process performance
counters kernel extension. See "General Model" in low-level-api.txt
for the model of the processor's performance counters.

Contents
========
- Summary
- Design & Implementation Notes
  * State
  * Thread Management Hooks
  * Synchronisation Rules
  * The Pseudo File System
- API For User-Space
  * Opening/Creating the State
  * Updating the Control
  * Unlinking the State
  * Reading the State
  * Resuming After Handling Overflow Signal
  * Reading the Counter Values
- Limitations / TODO List

Summary
=======
The virtualised per-process performance counters facility
(virtual perfctrs) is a kernel extension which extends the
thread state to record perfctr settings and values, and augments
the context-switch code to save perfctr values at suspends and
restore them at resumes. This "virtualises" the performance
counters in much the same way as the kernel already virtualises
general-purpose and floating-point registers.

Virtual perfctrs also adds an API allowing non-privileged
user-space processes to set up and access their perfctrs.

As this facility is primarily intended to support developers
of user-space code, both virtualisation and allowing access
from non-privileged code are essential features.

Design & Implementation Notes
=============================

State
-----
The state of a thread's perfctrs is packaged up in an object of
type 'struct vperfctr'. It consists of CPU-dependent state, a
sampling timer, and some auxiliary administrative data. This is
an independent object, with its own lifetime and access rules.

The state object is attached to the thread via a pointer in its
thread_struct. While attached, the object records the identity
of its owner thread: this is used for user-space API accesses
from threads other than the owner.

The state is separate from the thread_struct for several resons:
- It's potentially large, hence it's allocated only when needed.
- It can outlive its owner thread. The state can be opened as
  a pseudo file: as long as that file is live, so is the object.
- It can be mapped, via mmap() on the pseudo file's descriptor.
  To facilitate this, a full page is allocated and reserved.

Thread Management Hooks
-----------------------
Virtual perfctrs hooks into several thread management events:

- exit_thread(): Calls perfctr_exit_thread() to stop the counters
  and mark the vperfctr object as dead.

- copy_thread(): Calls perfctr_copy_thread() to initialise
  the child's vperfctr pointer. The child gets a new vperfctr
  object containing the same control data as its parent.
  Kernel-generated threads do not inherit any vperfctr state.

- release_task(): Calls perfctr_release_task() to detach the
  vperfctr object from the thread. If the child and its parent
  still have the same perfctr control settings, then the child's
  final counts are propagated back into its parent.

- switch_to():
  * Calls perfctr_suspend_thread() on the previous thread, to
    suspend its counters.
  * Calls perfctr_resume_thread() on the next thread, to resume
    its counters. Also resets the sampling timer (see below).

- update_process_times(): Calls perfctr_sample_thread(), which
  decrements the sampling timer and samples the counters if the
  timer reaches zero.

  Sampling is normally only done at switch_to(), but if too much
  time passes before the next switch_to(), a hardware counter may
  increment by more than its range (usually 2^32). If this occurs,
  the difference from its start value will be incorrect, causing
  its updated sum to also be incorrect. The sampling timer is used
  to prevent this problem, which has been observed on SMP machines,
  and on high clock frequency UP machines.

- set_cpus_allowed(): Calls perfctr_set_cpus_allowed() to detect
  attempts to migrate the thread to a "forbidden" CPU, in which
  case a flag in the vperfctr object is set. perfctr_resume_thread()
  checks this flag, and if set, marks the counters as stopped and
  sends a SIGILL to the thread.

  The notion of forbidden CPUs is a workaround for a design flaw
  in hyper-threaded Pentium 4s and Xeons. See low-level-x86.txt
  for details.

To reduce overheads, these hooks are implemented as inline functions
that check if the thread is using perfctrs before calling the code
that implements the behaviour. The hooks also reduce to no-ops if
CONFIG_PERFCTR_VIRTUAL is disabled.

Synchronisation Rules
---------------------
There are five types of accesses to a thread's perfctr state:

1. Thread management events (see above) done by the thread itself.
   Suspend, resume, and sample are lock-less.

2. API operations done by the thread itself.
   These are lock-less, except when an individual operation
   has specific synchronisation needs. For instance, preemption
   is often disabled to prevent accesses due to context switches.

3. API operations done by a different thread ("monitor thread").
   The owner thread must be suspended for the duration of the operation.
   This is ensured by requiring that the monitor thread is ptrace()ing
   the owner thread, and that the owner thread is in TASK_STOPPED state.

4. set_cpus_allowed().
   The kernel does not lock the target during set_cpus_allowed(),
   so it can execute concurrently with the owner thread or with
   some monitor thread. In particular, the state may be deallocated.

   To solve this problem, both perfctr_set_cpus_allowed() and the
   operations that can change the owner thread's perfctr pointer
   (creat, unlink, exit) perform a task_lock() on the owner thread
   before accessing the perfctr pointer.

5. release_task().
   Reaping a child may or may not be done by the parent of that child.
   When done by the parent, no lock is taken. Otherwise, a task_lock()
   on the parent is done before accessing its thread's perfctr pointer.

The Pseudo File System
----------------------
The perfctr state is accessed from user-space via a file descriptor.

The main reason for this is to enable mmap() on the file descriptor,
which gives read-only access to the state.

The file descriptor is a handle to the perfctr state object. This
allows a very simple implementation of the user-space 'perfex'
program, which runs another program with given perfctr settings
and reports their final values. Without this handle, monitoring
applications like perfex would have to be implemented like debuggers
in order to catch the target thread's exit and retrieve the counter
values before the exit completes and the state disappears.

The file for a perfctr state object belongs to the vperfctrs pseudo
file system. Files in this file system support only a few operations:
- mmap()
- release() decrements the perfctr object's reference count and
  deallocates the object when no references remain
- the listing of a thread's open file descriptors identifies
  perfctr state file descriptors as belonging to "vperfctrfs"
The implementation is based on the code for pipefs.

In previous versions of the perfctr package, the file descriptors
for perfctr state objects also supported the API's ioctl() method.

API For User-Space
==================

Opening/Creating the State
--------------------------
int fd = sys_vperfctr_open(int tid, int creat);

'tid' must be the id of a thread, or 0 which is interpreted as an
alias for the current thread.

This operation returns an open file descriptor which is a handle
on the thread's perfctr state object.

If 'creat' is non-zero and the object did not exist, then it is
created and attached to the thread. The newly created state object
is inactive, with all control fields disabled and all counters
having the value zero. If 'creat' is non-zero and the object
already existed, then an EEXIST error is signalled.

If 'tid' does not denote the current thread, then it must denote a
thread that is stopped and under ptrace control by the current thread.

Notes:
- The access rule in the non-self case is the same as for the
  ptrace() system call. It ensures that no other thread, including
  the target thread itself, can access or change the target thread's
  perfctr state during the operation.
- An open file descriptor for a perfctr state object counts as a
  reference to that object; even if detached from its thread the
  object will not be deallocated until the last reference is gone.
- The file descriptor can be passed to mmap(), for low-overhead
  counter sampling. See "READING THE COUNTER VALUES" for details.
- The file descriptor can be passed to another thread. Accesses
  from threads other than the owner are permitted as long as they
  posses the file descriptor and use ptrace() for synchronisation.

Updating the Control
--------------------
int err = sys_vperfctr_control(int fd, const struct vperfctr_control *control);

'fd' must be the return value from a call to sys_vperfctr_open(),
The perfctr object must still be attached to its owner thread.

This operation stops and samples any currently running counters in
the thread, and then updates the control settings. If the resulting
state has any enabled counters, then the counters are restarted.

Before restarting, the counter sums are reset to zero. However,
if a counter's bit is set in the control object's 'preserve'
bitmask field, then that counter's sum is not reset. The TSC's
sum is only reset if the TSC is disabled in the new state.

If any of the programmable counters are enabled, then the thread's
CPU affinity mask is adjusted to exclude the set of forbidden CPUs.

If the control data activates any interrupt-mode counters, then
a signal (specified by the 'si_signo' control field) will be sent
to the owner thread after an overflow interrupt. The documentation
for sys_vperfctr_iresume() describes this mechanism.

If 'fd' does not denote the current thread, then it must denote a
thread that is stopped and under ptrace control by the current thread.
The perfctr state object denoted by 'fd' must still be attached
to its owner thread.

Notes:
- It is strongly recommended to memset() the vperfctr_control object
  to all-bits-zero before setting the fields of interest.
- Stopping the counters is done by invoking the control operation
  with a control object that activates neither the TSC nor any PMCs.

Unlinking the State
-------------------
int err = sys_vperfctr_unlink(int fd);

'fd' must be the return value from a call to sys_vperfctr_open().

This operation stops and samples the thread's counters, and then
detaches the perfctr state object from the thread. If the object
already had been detached, then no action is performed.

If 'fd' does not denote the current thread, then it must denote a
thread that is stopped and under ptrace control by the current thread.

Reading the State
-----------------
int err = sys_vperfctr_read(int fd, struct perfctr_sum_ctrs *sum,
			    struct vperfctr_control *control,
			    struct perfctr_sum_ctrs *children);

'fd' must be the return value from a call to sys_vperfctr_open().

This operation copies data from the perfctr state object to
user-space. If 'sum' is non-NULL, then the counter sums are
written to it. If 'control' is non-NULL, then the control data
is written to it. If 'children' is non-NULL, then the sums of
exited childrens' counters are written to it.

If the perfctr state object is attached to the current thread,
then the counters are sampled and updated first.

If 'fd' does not denote the current thread, then it must denote a
thread that is stopped and under ptrace control by the current thread.

Notes:
- An alternate and faster way to retrieve the counter sums is described
  below. This system call can be used if the hardware does not permit
  user-space reads of the counters.

Resuming After Handling Overflow Signal
---------------------------------------
int err = sys_vperfctr_iresume(int fd);

'fd' must be the return value from a call to sys_vperfctr_open().
The perfctr object must still be attached to its owner thread.

When an interrupt-mode counter has overflowed, the counters
are sampled and suspended (TSC remains active). Then a signal,
as specified by the 'si_signo' control field, is sent to the
owner thread: the associated 'struct siginfo' has 'si_code'
equal to 'SI_PMC_OVF', and 'si_pmc_ovf_mask' equal to the set
of overflown counters.

The counters are suspended to avoid generating new performance
counter events during the execution of the signal handler, but
the previous settings are saved. Calling sys_vperfctr_iresume()
restores the previous settings and resumes the counters. Doing
this is optional.

If 'fd' does not denote the current thread, then it must denote a
thread that is stopped and under ptrace control by the current thread.

Reading the Counter Values
--------------------------
The value of a counter is computed from three components:

	value = sum + (now - start);

Two of these (sum and start) reside in the kernel's state object,
and the third (now) is the contents of the hardware counter.
To perform this computation in user-space requires access to
the state object. This is achieved by passing the file descriptor
from sys_vperfctr_open() to mmap():

	volatile const struct vperfctr_state *kstate;
	kstate = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, fd, 0);

Reading the three components is a non-atomic operation. If the
thread is scheduled during the operation, the three values will
not be consistent and the wrong result will be computed.
To detect this situation, user-space should check the kernel
state's TSC start value before and after the operation, and
retry the operation in case of a mismatch.

The algorithm for retrieving the value of counter 'i' is:

	tsc0 = kstate->cpu_state.tsc_start;
	for(;;) {
		rdpmcl(kstate->cpu_state.pmc[i].map, now);
		start = kstate->cpu_state.pmc[i].start;
		sum = kstate->cpu_state.pmc[i].sum;
		tsc1 = kstate->cpu_state.tsc_start;
		if (likely(tsc1 == tsc0))
			break;
		tsc0 = tsc1;
	}
	return sum + (now - start);

The algorithm for retrieving the value of the TSC is similar,
as is the algorithm for retrieving the values of all counters.

Notes:
- Since the state's TSC time-stamps are used, the algorithm requires
  that user-space enables TSC sampling.
- The algorithm requires that the hardware allows user-space reads
  of the counter registers. If this property isn't statically known
  for the architecture, user-space should retrieve the kernel's
  'struct perfctr_info' object and check that the PERFCTR_FEATURE_RDPMC
  flag is set.

Limitations / TODO List
=======================
- Buffering of overflow samples is not implemented. So far, not a
  single user has requested it.