$Id: overview.txt,v 1.2 2004/07/17 00:30:49 mikpe Exp $
AN OVERVIEW OF PERFCTR
======================
The perfctr package adds support to the Linux kernel for using
the performance-monitoring counters found in many processors.
Perfctr is internally organised in three layers:
- The low-level drivers, one for each supported architecture.
Currently there are two, one for 32 and 64-bit x86 processors,
and one for 32-bit PowerPC processors.
low-level-api.txt documents the model of the performance counters
used in this package, and the internal API to the low-level drivers.
low-level-{x86,ppc}.txt provide documentation specific for those
architectures and their low-level drivers.
- The high-level services.
There is currently one, a kernel extension adding support for
virtualised per-process performance counters.
See virtual.txt for documentation on this kernel extension.
[There used to be a second high-level service, a simple driver
to control and access all performance counters in all processors.
This driver is currently removed, pending an acceptable new API.]
- The top-level, which performs initialisation and implements
common procedures and system calls.
Rationale
---------
The perfctr package solves three problems:
- Hardware invariably restricts programming of the performance
counter registers to kernel-level code, and sometimes also
restricts reading the counters to kernel-level code.
Perfctr adds APIs allowing user-space code access the counters.
In the case of the per-process counters kernel extension,
even non-privileged processes are allowed access.
- Hardware often limits the precision of the hardware counters,
making them unsuitable for storing total event counts.
The counts are instead maintained as 64-bit values in software,
with the hardware counters used to derive increments over given
time periods.
- In a non-modified kernel, the thread state does not include the
performance monitoring counters, and the context switch code
does not save and restore them. In this situation the counters
are system-wide, making them unreliable and inaccurate when used
for monitoring specific processes or specific segments of code.
The per-process counters kernel extension treats the counter state as
part of the thread state, solving the reliability and accuracy problems.
Non-goals
---------
Providing high-level interfaces that abstract and hide the
underlying hardware is a non-goal. Such abstractions can
and should be implemented in user-space, for several reasons:
- The complexity and variability of the hardware means that
any abstraction would be inaccurate. There would be both
loss of functionality, and presence of functionality which
isn't supportable on any given processor. User-space tools
and libraries can implement this, on top of the processor-
specific interfaces provided by the kernel.
- The implementation of such an abstraction would be large
and complex. (Consider ESCR register assignment on P4.)
Performing complex actions in user-space simplifies the
kernel, allowing it to concentrate on validating control
data, managing processes, and driving the hardware.
(C.f. the role of compilers.)
- The abstraction is purely a user-convenience thing. The
kernel-level components have no need for it.
Common System Calls
===================
This lists those system calls that are not tied to
a specific high-level service/driver.
Querying CPU and Driver Information
-----------------------------------
int err = sys_perfctr_info(struct perfctr_info *info,
struct perfctr_cpu_mask *cpus,
struct perfctr_cpu_mask *forbidden);
This operation retrieves information from the kernel about
the processors in the system.
If non-NULL, '*info' will be updated with information about the
capabilities of the processor and the low-level driver.
If non-NULL, '*cpus' will be updated with a bitmask listing the
set of processors in the system. The size of this bitmask is not
statically known, so the protocol is:
1. User-space initialises cpus->nrwords to the number of elements
allocated for cpus->mask[].
2. The kernel reads cpus->nrwords, and then writes the required
number of words to cpus->nrwords.
3. If the required number of words is less than the original value
of cpus->nrwords, then an EOVERFLOW error is signalled.
4. Otherwise, the kernel converts its internal cpumask_t value
to the external format and writes that to cpus->mask[].
If non-NULL, '*forbidden' will be updated with a bitmask listing
the set of processors in the system on which users must not try
to use performance counters. This is currently only relevant for
hyper-threaded Pentium 4/Xeon systems. The protocol is the same
as for '*cpus'.
Notes:
- The internal representation of a cpumask_t is as an array of
unsigned long. This representation is unsuitable for user-space,
because it is not binary-compatible between 32 and 64-bit
variants of a big-endian processor. The 'struct perfctr_cpu_mask'
type uses an array of unsigned 32-bit integers.
- The protocol for retrieving a 'struct perfctr_cpu_mask' was
designed to allow user-space to quickly determine the correct
size of the 'mask[]' array. Other system calls use weaker protocols,
which force user-space to guess increasingly larger values in a
loop, until finally an acceptable value was guessed.