$Id: virtual.txt,v 1.3 2004/08/09 09:42:22 mikpe Exp $

VIRTUAL PER-PROCESS PERFORMANCE COUNTERS
========================================
This document describes the virtualised per-process performance
counters kernel extension. See "General Model" in low-level-api.txt
for the model of the processor's performance counters.

Contents
========
- Summary
- Design & Implementation Notes
  * State
  * Thread Management Hooks
  * Synchronisation Rules
  * The Pseudo File System
- API For User-Space
  * Opening/Creating the State
  * Updating the Control
  * Unlinking the State
  * Reading the State
  * Resuming After Handling Overflow Signal
  * Reading the Counter Values
- Limitations / TODO List

Summary
=======
The virtualised per-process performance counters facility
(virtual perfctrs) is a kernel extension which extends the
thread state to record perfctr settings and values, and augments
the context-switch code to save perfctr values at suspends and
restore them at resumes. This "virtualises" the performance
counters in much the same way as the kernel already virtualises
general-purpose and floating-point registers.

Virtual perfctrs also adds an API allowing non-privileged
user-space processes to set up and access their perfctrs.

As this facility is primarily intended to support developers
of user-space code, both virtualisation and allowing access
from non-privileged code are essential features.

Design & Implementation Notes
=============================

State
-----
The state of a thread's perfctrs is packaged up in an object of
type 'struct vperfctr'. It consists of CPU-dependent state, a
sampling timer, and some auxiliary administrative data. This is
an independent object, with its own lifetime and access rules.

The state object is attached to the thread via a pointer in its
thread_struct. While attached, the object records the identity
of its owner thread: this is used for user-space API accesses
from threads other than the owner.

The state is separate from the thread_struct for several reasons:
- It's potentially large, hence it's allocated only when needed.
- It can outlive its owner thread. The state can be opened as
  a pseudo file: as long as that file is live, so is the object.
- It can be mapped, via mmap() on the pseudo file's descriptor.
  To facilitate this, a full page is allocated and reserved.

Thread Management Hooks
-----------------------
Virtual perfctrs hooks into several thread management events:

- exit_thread(): Calls perfctr_exit_thread() to stop the counters
  and mark the vperfctr object as dead.

- copy_thread(): Calls perfctr_copy_thread() to initialise
  the child's vperfctr pointer. The child gets a new vperfctr
  object containing the same control data as its parent.
  Kernel-generated threads do not inherit any vperfctr state.

- release_task(): Calls perfctr_release_task() to detach the
  vperfctr object from the thread. If the child and its parent
  still have the same perfctr control settings, then the child's
  final counts are propagated back into its parent.

- switch_to():
  * Calls perfctr_suspend_thread() on the previous thread, to
    suspend its counters.
  * Calls perfctr_resume_thread() on the next thread, to resume
    its counters. Also resets the sampling timer (see below).

- update_process_times(): Calls perfctr_sample_thread(), which
  decrements the sampling timer and samples the counters if the
  timer reaches zero.

  Sampling is normally only done at switch_to(), but if too much
  time passes before the next switch_to(), a hardware counter may
  increment by more than its range (usually 2^32). If this occurs,
  the difference from its start value will be incorrect, causing
  its updated sum to also be incorrect. The sampling timer is used
  to prevent this problem, which has been observed on SMP machines,
  and on high clock frequency UP machines.

- set_cpus_allowed(): Calls perfctr_set_cpus_allowed() to detect
  attempts to migrate the thread to a "forbidden" CPU, in which
  case a flag in the vperfctr object is set. perfctr_resume_thread()
  checks this flag, and if set, marks the counters as stopped and
  sends a SIGILL to the thread.

  The notion of forbidden CPUs is a workaround for a design flaw
  in hyper-threaded Pentium 4s and Xeons. See low-level-x86.txt
  for details.

To reduce overheads, these hooks are implemented as inline functions
that check if the thread is using perfctrs before calling the code
that implements the behaviour. The hooks also reduce to no-ops if
CONFIG_PERFCTR_VIRTUAL is disabled.

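The inline-guard pattern described above might look roughly like the
following user-space model. It is a sketch, not the kernel source: the
struct layouts are reduced stand-ins, CONFIG_PERFCTR_VIRTUAL is given
a default value here so the file compiles on its own, and
model_vperfctr_suspend() is a hypothetical out-of-line helper.

```c
#include <assert.h>
#include <stddef.h>

#ifndef CONFIG_PERFCTR_VIRTUAL
#define CONFIG_PERFCTR_VIRTUAL 1	/* assumed enabled for this model */
#endif

/* Reduced stand-ins for the real structures. */
struct vperfctr {
	int suspend_calls;
};

struct thread_struct {
	struct vperfctr *perfctr;	/* NULL if the thread has no perfctrs */
};

#if CONFIG_PERFCTR_VIRTUAL
/* Out-of-line slow path: only reached by threads that use perfctrs. */
static void model_vperfctr_suspend(struct vperfctr *perfctr)
{
	assert(perfctr != NULL);
	perfctr->suspend_calls++;
}

/* Inline fast path: a single pointer test for all other threads. */
static inline void perfctr_suspend_thread(struct thread_struct *prev)
{
	struct vperfctr *perfctr = prev->perfctr;

	if (perfctr)
		model_vperfctr_suspend(perfctr);
}
#else
/* With the option disabled the hook compiles away entirely. */
static inline void perfctr_suspend_thread(struct thread_struct *prev)
{
	(void)prev;
}
#endif
```

The point of the split is that threads without perfctrs pay only for
one load and one branch in the context-switch path.
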
Synchronisation Rules
---------------------
There are five types of accesses to a thread's perfctr state:

1. Thread management events (see above) done by the thread itself.
   Suspend, resume, and sample are lock-less.

2. API operations done by the thread itself.
   These are lock-less, except when an individual operation
   has specific synchronisation needs. For instance, preemption
   is often disabled to prevent accesses due to context switches.

3. API operations done by a different thread ("monitor thread").
   The owner thread must be suspended for the duration of the operation.
   This is ensured by requiring that the monitor thread is ptrace()ing
   the owner thread, and that the owner thread is in TASK_STOPPED state.

4. set_cpus_allowed().
   The kernel does not lock the target during set_cpus_allowed(),
   so it can execute concurrently with the owner thread or with
   some monitor thread. In particular, the state may be deallocated.

   To solve this problem, both perfctr_set_cpus_allowed() and the
   operations that can change the owner thread's perfctr pointer
   (creat, unlink, exit) perform a task_lock() on the owner thread
   before accessing the perfctr pointer.

5. release_task().
   Reaping a child may or may not be done by the parent of that child.
   When done by the parent, no lock is taken. Otherwise, a task_lock()
   on the parent is done before accessing its thread's perfctr pointer.

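The access rule for monitor threads (case 3 above) can be summarised
as a predicate. The following is a simplified user-space model, not
the kernel's actual check: 'struct model_task', its fields, and
model_may_access() are hypothetical names introduced only for this
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_state { MODEL_RUNNING, MODEL_STOPPED };

/* Hypothetical, minimal stand-in for a task. */
struct model_task {
	enum model_state state;
	const struct model_task *tracer;	/* ptrace()ing task, or NULL */
};

/* May 'current' operate on the perfctr state owned by 'owner'?
 * Self-access is always allowed; a monitor must be ptrace()ing the
 * owner, and the owner must be stopped. */
static bool model_may_access(const struct model_task *current,
			     const struct model_task *owner)
{
	assert(current != NULL && owner != NULL);
	if (current == owner)
		return true;
	return owner->tracer == current && owner->state == MODEL_STOPPED;
}
```

Because the owner is stopped while a monitor operates on its state,
the two can never touch the state object at the same time, which is
why no further locking is needed in that case.
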
The Pseudo File System
----------------------
The perfctr state is accessed from user-space via a file descriptor.

The main reason for this is to enable mmap() on the file descriptor,
which gives read-only access to the state.

The file descriptor is a handle to the perfctr state object. This
allows a very simple implementation of the user-space 'perfex'
program, which runs another program with given perfctr settings
and reports their final values. Without this handle, monitoring
applications like perfex would have to be implemented like debuggers
in order to catch the target thread's exit and retrieve the counter
values before the exit completes and the state disappears.

The file for a perfctr state object belongs to the vperfctrs pseudo
file system. Files in this file system support only a few operations:
- mmap()
- release() decrements the perfctr object's reference count and
  deallocates the object when no references remain
- the listing of a thread's open file descriptors identifies
  perfctr state file descriptors as belonging to "vperfctrfs"
The implementation is based on the code for pipefs.

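The lifetime rule implied by release() can be modelled with a plain
reference count. This is a sketch under assumptions: it supposes that
the owner thread's attachment and each open file each hold one
reference, and 'struct model_vperfctr' and the model_*() helpers are
hypothetical names. The real object is managed by the kernel, not
with malloc() and free().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model of the state object's lifetime. */
struct model_vperfctr {
	int refs;	/* one per attached thread, one per open file */
	int *freed;	/* set to 1 on deallocation, for observation */
};

static struct model_vperfctr *model_create(int *freed)
{
	struct model_vperfctr *p = malloc(sizeof *p);

	if (p) {
		p->refs = 1;	/* reference held by the owner thread */
		p->freed = freed;
	}
	return p;
}

/* Take an extra reference, e.g. when the state is opened as a file. */
static void model_get(struct model_vperfctr *p)
{
	assert(p != NULL && p->refs > 0);
	p->refs++;
}

/* Drop one reference; deallocate when none remain. */
static void model_put(struct model_vperfctr *p)
{
	assert(p != NULL && p->refs > 0);
	if (--p->refs == 0) {
		*p->freed = 1;
		free(p);
	}
}
```

In this model the object survives the owner's exit as long as a file
reference is outstanding, which is what lets a monitor such as perfex
read the final counts after its target has terminated.
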
In previous versions of the perfctr package, the file descriptors
for perfctr state objects also supported the API's ioctl() method.

API For User-Space
==================

Opening/Creating the State
--------------------------
	int fd = sys_vperfctr_open(int tid, int creat);

'tid' must be the id of a thread, or 0 which is interpreted as an
alias for the current thread.

This operation returns an open file descriptor which is a handle
on the thread's perfctr state object.

If 'creat' is non-zero and the object did not exist, then it is
created and attached to the thread. The newly created state object
is inactive, with all control fields disabled and all counters
having the value zero. If 'creat' is non-zero and the object
already existed, then an EEXIST error is signalled.

If 'tid' does not denote the current thread, then it must denote a
thread that is stopped and under ptrace control by the current thread.

Notes:
- The access rule in the non-self case is the same as for the
  ptrace() system call. It ensures that no other thread, including
  the target thread itself, can access or change the target thread's
  perfctr state during the operation.
- An open file descriptor for a perfctr state object counts as a
  reference to that object; even if detached from its thread the
  object will not be deallocated until the last reference is gone.
- The file descriptor can be passed to mmap(), for low-overhead
  counter sampling. See "Reading the Counter Values" for details.
- The file descriptor can be passed to another thread. Accesses
  from threads other than the owner are permitted as long as they
  possess the file descriptor and use ptrace() for synchronisation.

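The 'creat' rules above can be restated as a small decision function.
This is a user-space model only: 'struct model_thread' and
model_vperfctr_open() are hypothetical, the model returns 0 where the
real call would return a file descriptor, and it reports errors as
negative errno values. The case of 'creat' being zero with no existing
object is not covered by the text above, so the -ENOENT used for it
is a placeholder chosen for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-thread slot: nonzero once a state object is attached. */
struct model_thread {
	int has_state;
};

/* Model of the 'creat' rules.  Returns 0 where the real call would
 * return a file descriptor, or a negative errno value. */
static int model_vperfctr_open(struct model_thread *t, int creat)
{
	assert(t != NULL);
	if (creat) {
		if (t->has_state)
			return -EEXIST;	/* object already existed */
		t->has_state = 1;	/* created inactive, counters zero */
		return 0;
	}
	/* creat == 0 with no object: not specified in the text above;
	 * -ENOENT is a placeholder chosen for this sketch only. */
	return t->has_state ? 0 : -ENOENT;
}
```
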
Updating the Control
--------------------
	int err = sys_vperfctr_control(int fd, const struct vperfctr_control *control);

'fd' must be the return value from a call to sys_vperfctr_open().
The perfctr object must still be attached to its owner thread.

This operation stops and samples any currently running counters in
the thread, and then updates the control settings. If the resulting
state has any enabled counters, then the counters are restarted.

Before restarting, the counter sums are reset to zero. However,
if a counter's bit is set in the control object's 'preserve'
bitmask field, then that counter's sum is not reset. The TSC's
sum is only reset if the TSC is disabled in the new state.

If any of the programmable counters are enabled, then the thread's
CPU affinity mask is adjusted to exclude the set of forbidden CPUs.

If the control data activates any interrupt-mode counters, then
a signal (specified by the 'si_signo' control field) will be sent
to the owner thread after an overflow interrupt. The documentation
for sys_vperfctr_iresume() describes this mechanism.

If 'fd' does not denote the current thread, then it must denote a
thread that is stopped and under ptrace control by the current thread.
The perfctr state object denoted by 'fd' must still be attached
to its owner thread.

Notes:
- It is strongly recommended to memset() the vperfctr_control object
  to all-bits-zero before setting the fields of interest.
- Stopping the counters is done by invoking the control operation
  with a control object that activates neither the TSC nor any PMCs.

Unlinking the State
-------------------
	int err = sys_vperfctr_unlink(int fd);

'fd' must be the return value from a call to sys_vperfctr_open().

This operation stops and samples the thread's counters, and then
detaches the perfctr state object from the thread. If the object
already had been detached, then no action is performed.

If 'fd' does not denote the current thread, then it must denote a
thread that is stopped and under ptrace control by the current thread.

Reading the State
-----------------
	int err = sys_vperfctr_read(int fd, struct perfctr_sum_ctrs *sum,
				    struct vperfctr_control *control,
				    struct perfctr_sum_ctrs *children);

'fd' must be the return value from a call to sys_vperfctr_open().

This operation copies data from the perfctr state object to
user-space. If 'sum' is non-NULL, then the counter sums are
written to it. If 'control' is non-NULL, then the control data
is written to it. If 'children' is non-NULL, then the sums of
exited children's counters are written to it.

If the perfctr state object is attached to the current thread,
then the counters are sampled and updated first.

If 'fd' does not denote the current thread, then it must denote a
thread that is stopped and under ptrace control by the current thread.

Notes:
- An alternate and faster way to retrieve the counter sums is described
  below. This system call can be used if the hardware does not permit
  user-space reads of the counters.

Resuming After Handling Overflow Signal
---------------------------------------
	int err = sys_vperfctr_iresume(int fd);

'fd' must be the return value from a call to sys_vperfctr_open().
The perfctr object must still be attached to its owner thread.

When an interrupt-mode counter has overflowed, the counters
are sampled and suspended (TSC remains active). Then a signal,
as specified by the 'si_signo' control field, is sent to the
owner thread: the associated 'struct siginfo' has 'si_code'
equal to 'SI_PMC_OVF', and 'si_pmc_ovf_mask' equal to the set
of overflowed counters.

The counters are suspended to avoid generating new performance
counter events during the execution of the signal handler, but
the previous settings are saved. Calling sys_vperfctr_iresume()
restores the previous settings and resumes the counters. Doing
this is optional.

If 'fd' does not denote the current thread, then it must denote a
thread that is stopped and under ptrace control by the current thread.

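A signal handler that receives 'si_pmc_ovf_mask' typically needs to
turn the mask into a list of counter indices. The helper below is a
hypothetical sketch of that step only: it assumes that bit i of the
mask stands for counter i and that the mask fits in an unsigned int,
and it does not touch the signal or perfctr APIs themselves.

```c
#include <assert.h>
#include <stddef.h>

/* Decode an overflow mask such as 'si_pmc_ovf_mask' into counter
 * indices.  Writes at most 'max' indices to 'out' and returns how
 * many bits were set.  Assumes bit i stands for counter i. */
static size_t model_decode_ovf_mask(unsigned int mask,
				    unsigned int *out, size_t max)
{
	size_t n = 0;
	unsigned int i;

	assert(out != NULL || max == 0);
	for (i = 0; mask != 0; i++, mask >>= 1) {
		if (!(mask & 1u))
			continue;
		if (n < max)
			out[n] = i;
		n++;
	}
	return n;
}
```

A handler would process each reported counter and then, if counting
should continue, call sys_vperfctr_iresume() as described above.
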
Reading the Counter Values
--------------------------
The value of a counter is computed from three components:

	value = sum + (now - start);

Two of these (sum and start) reside in the kernel's state object,
and the third (now) is the contents of the hardware counter.
To perform this computation in user-space requires access to
the state object. This is achieved by passing the file descriptor
from sys_vperfctr_open() to mmap():

	volatile const struct vperfctr_state *kstate;
	kstate = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, fd, 0);

Reading the three components is a non-atomic operation. If the
thread is rescheduled (switched out and back in) during the
operation, the three values will not be consistent and the wrong
result will be computed.
To detect this situation, user-space should check the kernel
state's TSC start value before and after the operation, and
retry the operation in case of a mismatch.

The algorithm for retrieving the value of counter 'i' is:

	tsc0 = kstate->cpu_state.tsc_start;
	for(;;) {
		/* read the hardware counter, then the kernel's view */
		rdpmcl(kstate->cpu_state.pmc[i].map, now);
		start = kstate->cpu_state.pmc[i].start;
		sum = kstate->cpu_state.pmc[i].sum;
		/* an unchanged TSC start means no context switch occurred */
		tsc1 = kstate->cpu_state.tsc_start;
		if (likely(tsc1 == tsc0))
			break;
		tsc0 = tsc1;
	}
	return sum + (now - start);

The algorithm for retrieving the value of the TSC is similar,
as is the algorithm for retrieving the values of all counters.

Notes:
- Since the state's TSC time-stamps are used, the algorithm requires
  that user-space enables TSC sampling.
- The algorithm requires that the hardware allows user-space reads
  of the counter registers. If this property isn't statically known
  for the architecture, user-space should retrieve the kernel's
  'struct perfctr_info' object and check that the PERFCTR_FEATURE_RDPMC
  flag is set.

Limitations / TODO List
=======================
- Buffering of overflow samples is not implemented. So far, not a
  single user has requested it.