$Id: overview.txt,v 1.2 2004/07/17 00:30:49 mikpe Exp $

AN OVERVIEW OF PERFCTR
======================
The perfctr package adds support to the Linux kernel for using
the performance-monitoring counters found in many processors.

Perfctr is internally organised in three layers:

- The low-level drivers, one for each supported architecture.
  Currently there are two, one for 32-bit and 64-bit x86 processors,
  and one for 32-bit PowerPC processors.

  low-level-api.txt documents the model of the performance counters
  used in this package, and the internal API to the low-level drivers.

  low-level-{x86,ppc}.txt provide documentation specific to those
  architectures and their low-level drivers.

- The high-level services.
  There is currently one, a kernel extension adding support for
  virtualised per-process performance counters.
  See virtual.txt for documentation on this kernel extension.

  [There used to be a second high-level service, a simple driver
  to control and access all performance counters in all processors.
  This driver is currently removed, pending an acceptable new API.]

- The top-level, which performs initialisation and implements
  common procedures and system calls.

Rationale
---------
The perfctr package solves three problems:

- Hardware invariably restricts programming of the performance
  counter registers to kernel-level code, and sometimes also
  restricts reading the counters to kernel-level code.

  Perfctr adds APIs allowing user-space code to access the counters.
  In the case of the per-process counters kernel extension,
  even non-privileged processes are allowed access.

- Hardware often limits the precision of the hardware counters,
  making them unsuitable for storing total event counts.

  The counts are instead maintained as 64-bit values in software,
  with the hardware counters used to derive increments over given
  time periods.

- In a non-modified kernel, the thread state does not include the
  performance monitoring counters, and the context switch code
  does not save and restore them. In this situation the counters
  are system-wide, making them unreliable and inaccurate when used
  for monitoring specific processes or specific segments of code.

  The per-process counters kernel extension treats the counter state as
  part of the thread state, solving the reliability and accuracy problems.
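
The last two points can be illustrated with a minimal sketch. It is
not the perfctr implementation: the names soft_counter,
soft_counter_init(), soft_counter_resume() and soft_counter_suspend()
are hypothetical, and the hardware counter width is a parameter
because real widths vary between processors. The sketch keeps a
64-bit sum in software and adds the hardware increment observed
between the points where a thread is scheduled in and out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-thread software state for one counter. */
struct soft_counter {
    uint64_t mask;   /* all-ones mask covering the hardware counter width */
    uint64_t sum;    /* accumulated 64-bit event count */
    uint64_t start;  /* hardware value sampled when the thread resumed */
};

/* hw_bits is the (assumed) width of the hardware counter in bits. */
static void soft_counter_init(struct soft_counter *c, unsigned int hw_bits)
{
    assert(hw_bits >= 1 && hw_bits <= 63);
    c->mask = (UINT64_C(1) << hw_bits) - 1;
    c->sum = 0;
    c->start = 0;
}

/* Thread is scheduled in: remember the current hardware value. */
static void soft_counter_resume(struct soft_counter *c, uint64_t hw_now)
{
    c->start = hw_now & c->mask;
}

/* Thread is scheduled out: add the increment since resume to the sum. */
static void soft_counter_suspend(struct soft_counter *c, uint64_t hw_now)
{
    /* Modular subtraction yields the correct increment even if the
     * hardware counter wrapped around once since resume. */
    uint64_t delta = ((hw_now & c->mask) - c->start) & c->mask;

    c->sum += delta;
}
```

Because only increments are added, events counted while other threads
run never reach this thread's sum. The hardware counter may wrap at
most once between resume and suspend for the result to stay exact.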

Non-goals
---------
Providing high-level interfaces that abstract and hide the
underlying hardware is a non-goal. Such abstractions can
and should be implemented in user-space, for several reasons:

- The complexity and variability of the hardware means that
  any abstraction would be inaccurate. It would both lose
  functionality and expose functionality which isn't
  supportable on any given processor. User-space tools and
  libraries can implement such abstractions on top of the
  processor-specific interfaces provided by the kernel.

- The implementation of such an abstraction would be large
  and complex. (Consider ESCR register assignment on P4.)
  Performing complex actions in user-space simplifies the
  kernel, allowing it to concentrate on validating control
  data, managing processes, and driving the hardware.
  (Cf. the role of compilers.)

- The abstraction is purely a convenience for users. The
  kernel-level components have no need for it.

Common System Calls
===================
This section lists the system calls that are not tied to
a specific high-level service or driver.

Querying CPU and Driver Information
-----------------------------------
int err = sys_perfctr_info(struct perfctr_info *info,
			   struct perfctr_cpu_mask *cpus,
			   struct perfctr_cpu_mask *forbidden);

This operation retrieves information from the kernel about
the processors in the system.

If non-NULL, '*info' will be updated with information about the
capabilities of the processor and the low-level driver.

If non-NULL, '*cpus' will be updated with a bitmask listing the
set of processors in the system. The size of this bitmask is not
statically known, so the protocol is:

1. User-space initialises cpus->nrwords to the number of elements
   allocated for cpus->mask[].
2. The kernel reads cpus->nrwords, and then writes the required
   number of words to cpus->nrwords.
3. If the required number of words is greater than the original value
   of cpus->nrwords, then an EOVERFLOW error is signalled.
4. Otherwise, the kernel converts its internal cpumask_t value
   to the external format and writes that to cpus->mask[].
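
The steps above can be sketched as a self-contained simulation. It
does not call the real system call: mock_get_cpus() is a hypothetical
stand-in for the kernel side, query_cpus() is a hypothetical
user-space helper, and struct cpu_mask only mirrors the two field
names mentioned in the text (its exact layout is an assumption). The
sketch assumes EOVERFLOW is signalled when the caller's buffer is too
small for the required number of words.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed layout; only the names nrwords and mask[] come from the text. */
struct cpu_mask {
    unsigned int nrwords;
    uint32_t mask[];    /* C99 flexible array member */
};

/* Hypothetical stand-in for the kernel side (steps 2 to 4). */
static int mock_get_cpus(struct cpu_mask *cpus,
                         const uint32_t *kmask, unsigned int k_nrwords)
{
    unsigned int u_nrwords = cpus->nrwords;  /* step 2: caller's capacity */
    unsigned int i;

    assert(k_nrwords > 0);
    cpus->nrwords = k_nrwords;               /* step 2: required size */
    if (u_nrwords < k_nrwords)
        return -EOVERFLOW;                   /* step 3: buffer too small */
    for (i = 0; i < k_nrwords; i++)
        cpus->mask[i] = kmask[i];            /* step 4: copy the mask out */
    return 0;
}

/* User-space side: start with a guess, and on EOVERFLOW retry once
 * with the size the "kernel" reported. Caller frees the result. */
static struct cpu_mask *query_cpus(const uint32_t *kmask,
                                   unsigned int k_nrwords)
{
    unsigned int nrwords = 1;                /* deliberately small guess */
    int attempt;

    for (attempt = 0; attempt < 2; attempt++) {
        struct cpu_mask *cpus;
        int err;

        cpus = malloc(sizeof(*cpus) + nrwords * sizeof(cpus->mask[0]));
        if (!cpus)
            return NULL;
        cpus->nrwords = nrwords;             /* step 1 */
        err = mock_get_cpus(cpus, kmask, k_nrwords);
        if (err == 0)
            return cpus;
        nrwords = cpus->nrwords;             /* exact size for the retry */
        free(cpus);
        if (err != -EOVERFLOW)
            return NULL;
    }
    return NULL;
}
```

Since the required size is written back even when the call fails, at
most two attempts are needed, whatever the initial guess was.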

If non-NULL, '*forbidden' will be updated with a bitmask listing
the set of processors in the system on which users must not try
to use performance counters. This is currently only relevant for
hyper-threaded Pentium 4/Xeon systems. The protocol is the same
as for '*cpus'.

Notes:
- The internal representation of a cpumask_t is as an array of
  unsigned long. This representation is unsuitable for user-space,
  because it is not binary-compatible between 32-bit and 64-bit
  variants of a big-endian processor. The 'struct perfctr_cpu_mask'
  type uses an array of unsigned 32-bit integers.
- The protocol for retrieving a 'struct perfctr_cpu_mask' was
  designed to allow user-space to quickly determine the correct
  size of the 'mask[]' array. Other system calls use weaker protocols,
  which force user-space to guess increasingly large values in a
  loop until an acceptable value is found.
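
The first note can be made concrete with a small conversion sketch.
It is an illustration, not the kernel's code: longs_to_words() and
WORDS_PER_LONG are hypothetical names. The function extracts 32-bit
words from each unsigned long by value (mask and shift) instead of
reinterpreting memory, so bit n of the set always lands in word
n / 32, bit n % 32, independent of the width and byte order of
unsigned long.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of 32-bit words per unsigned long: 1 or 2 on common ABIs. */
#define WORDS_PER_LONG (sizeof(unsigned long) / sizeof(uint32_t))

/* Convert nrlongs unsigned longs into nrlongs * WORDS_PER_LONG
 * 32-bit words, least significant word of each long first. */
static void longs_to_words(const unsigned long *in, unsigned int nrlongs,
                           uint32_t *out)
{
    unsigned int li;
    size_t j;

    assert(in != NULL && out != NULL);
    for (li = 0; li < nrlongs; li++) {
        unsigned long v = in[li];

        for (j = 0; j < WORDS_PER_LONG; j++) {
            out[li * WORDS_PER_LONG + j] = (uint32_t)(v & 0xffffffffUL);
            /* Shift by 32 in two steps: a single shift by 32 would be
             * undefined when unsigned long is only 32 bits wide. */
            v = (v >> 16) >> 16;
        }
    }
}
```

A 32-bit and a 64-bit program reading the resulting array see the
same CPU numbers, which is not the case if a big-endian 64-bit
unsigned long is copied out byte for byte.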