Blob Blame History Raw
$Id: low-level-x86.txt,v 1.2 2004/07/11 17:12:28 mikpe Exp $

PERFCTRS X86 LOW-LEVEL API
==========================

See low-level-api.txt for the common low-level API.
This document only describes x86-specific behaviour.
For detailed hardware control register layouts, see
the manufacturers' documentation.

Contents
========
- Supported processors
- Contents of <asm-i386/perfctr.h>
- Processor-specific Notes
- Implementation Notes

Supported processors
====================
- Intel P5, P5MMX, P6, P4.
- AMD K7, K8. (P6 clones, with some changes)
- Cyrix 6x86MX, MII, and III. (good P5 clones)
- Centaur WinChip C6, 2, and 3. (bad P5 clones)
- VIA C3. (bad P6 clone)
- Any generic x86 with a TSC.

Contents of <asm-i386/perfctr.h>
================================

"struct perfctr_sum_ctrs"
-------------------------
struct perfctr_sum_ctrs {
	unsigned long long tsc;
	unsigned long long pmc[18];
};

The pmc[] array has room for 18 counters.

"struct perfctr_cpu_control"
----------------------------
struct perfctr_cpu_control {
	unsigned int tsc_on;
	unsigned int nractrs;		/* # of a-mode counters */
	unsigned int nrictrs;		/* # of i-mode counters */
	unsigned int pmc_map[18];
	unsigned int evntsel[18];	/* one per counter, even on P5 */
	struct {
		unsigned int escr[18];
		unsigned int pebs_enable;	/* for replay tagging */
		unsigned int pebs_matrix_vert;	/* for replay tagging */
	} p4;
	int ireset[18];			/* < 0, for i-mode counters */
	unsigned int _reserved1;
	unsigned int _reserved2;
	unsigned int _reserved3;
	unsigned int _reserved4;
};

The per-counter arrays have room for 18 elements.

ireset[] values must be negative, since overflow occurs on
the negative-to-non-negative transition.

The p4 sub-struct contains P4-specific control data:
- escr[]: the control data to write to the ESCR register
  associatied with the counter
- pebs_enable: the control data to write to the PEBS_ENABLE MSR
- pebs_matrix_vert: the control data to write to the
  PEBS_MATRIX_VERT MSR

"struct perfctr_cpu_state"
--------------------------
struct perfctr_cpu_state {
	unsigned int cstatus;
	struct {	/* k1 is opaque in the user ABI */
		unsigned int id;
		int isuspend_cpu;
	} k1;
	/* The two tsc fields must be inlined. Placing them in a
	   sub-struct causes unwanted internal padding on x86-64. */
	unsigned int tsc_start;
	unsigned long long tsc_sum;
	struct {
		unsigned int map;
		unsigned int start;
		unsigned long long sum;
	} pmc[18];	/* the size is not part of the user ABI */
#ifdef __KERNEL__
	struct perfctr_cpu_control control;
	unsigned int p4_escr_map[18];
#endif
};

The k1 sub-struct is used by the low-level driver for
caching purposes. "id" identifies the control data, and
"isuspend_cpu" identifies the CPU on which the i-mode
counters were last suspended.

The pmc[] array has room for 18 elements.

p4_escr_map[] is computed from control by the low-level driver,
and provides the MSR number for the counter's associated ESCR.

User-space overflow signal handler items
----------------------------------------
#ifdef __KERNEL__
#define SI_PMC_OVF	(__SI_FAULT|'P')
#else
#define SI_PMC_OVF	('P')
#endif
#define si_pmc_ovf_mask	_sifields._pad[0]

Kernel-internal API
-------------------

In perfctr_cpu_update_control(), the is_global parameter controls
whether monitoring the other thread (T1) on HT P4s is permitted
or not. On other processors the parameter is ignored.

SMP kernels define CONFIG_PERFCTR_CPUS_FORBIDDEN_MASK and
"extern cpumask_t perfctr_cpus_forbidden_mask;".
On HT P4s, resource conflicts can occur because both threads
(T0 and T1) in a processor share the same perfctr registers.
To prevent conflicts, only thread 0 in each processor is allowed
to access the counters. perfctr_cpus_forbidden_mask contains the
smp_processor_id()s of each processor's thread 1, and it is the
responsibility of the high-level driver to ensure that it never
accesses the perfctr state from a forbidden thread.

Overflow interrupt handling requires local APIC support in the kernel.

Processor-specific Notes
========================

General
-------
pmc_map[] contains a counter number, as used by the RDPMC instruction.
It never contains an MSR number.

Counters are 32, 40, or 48 bits wide. The driver always only
reads the low 32 bits. This avoids performance issues, and
errata on some processors.

Writing to counters or their control registers tends to be
very expensive. This is why a-mode counters only use read
operations on the counter registers. Caching of control
register contents is done to avoid writing them. "Suspend CPU"
is recorded for i-mode counters to avoid writing the counter
registers when the counters are resumed (their control
registers must be written at both suspend and resume, however).

Some processors are unable to stop the counters (Centaur/VIA),
and some are unable to reinitialise them to arbitrary values (P6).
Storing the counters' total counts in the hardware counters
would break as soon as context-switches occur. This is another
reason why the accumulate-differences method for maintaining the
counter values is used.

Intel P5
--------
The hardware stores both counters' control data in a single
control register, the CESR MSR. The evntsel values are
limited to 16 bits each, and are combined by the low-level
driver to form the value for the CESR. Apart from that,
the evntsel values are direct images of the CESR.

Bits 0xFE00 in an evntsel value are reserved.
At least one evntsel CPL bit (0x00C0) must be set.

For Cyrix' P5 clones, evntsel bits 0xFA00  are reserved.

For Centaur's P5 clones, evntsel bits 0xFF00 are reserved.
It has no CPL bits to set. The TSC is broken and cannot be used.

Intel P6
--------
The evntsel values are mapped directly onto the counters'
EVNTSEL control registers.

The global enable bit (22) in EVNTSEL0 must be set. That bit is
reserved in EVNTSEL1.

Bits 21 and 19 (0x00280000) in each evntsel are reserved.

For an i-mode counter, bit 20 (0x00100000) of its evntsel must be
set. For a-mode counters, that bit must not be set.

Hardware quirk: Counters are 40 bits wide, but writing to a
counter only writes the low 32 bits: remaining bits are
sign-extended from bit 31.

AMD K7/K8
---------
Similar to Intel P6. The main difference is that each evntsel has
its own enable bit, which must be set.

VIA C3
------
Superficially similar to Intel P6, but only PERFCTR1/EVNTSEL1
are programmable. pmc_map[0] must be 1, if nractrs == 1.

Bits 0xFFFFFE00 in the evntsel are reserved. There are no auxiliary
control bits to set.

Generic
-------
Only permits TSC sampling, with tsc_on == 1 and nractrs == nrictrs == 0
in the control data.

Intel P4
--------
For each counter, its evntsel[] value is mapped onto its CCCR
control register, and its p4.escr[] value is mapped onto its
associated ESCR control register.

The ESCR register number is computed from the hardware counter
number (from pmc_map[]) and the ESCR SELECT field in the CCCR,
and is cached in p4_escr_map[].

pmc_map[] contains the value to pass to RDPMC when reading the
counter. It is strongly recommended to set bit 31 (fast rdpmc).

In each evntsel/CCCR value:
- the OVF, OVF_PMI_T1 and hardware-reserved bits (0xB80007FF)
  are reserved and must not be set
- bit 11 (EXTENDED_CASCADE) is only permitted on P4 models >= 2,
  and for counters 12 and 15-17
- bits 16 and 17 (ACTIVE_THREAD) must both be set on non-HT processors
- at least one of bits 12 (ENABLE), 30 (CASCADE), or 11 (EXTENDED_CASCADE)
  must be set
- bit 26 (OVF_PMI_T0) must be clear for a-mode counters, and set
  for i-mode counters; if bit 25 (FORCE_OVF) also is set, then
  the corresponding ireset[] value must be exactly -1

In each p4.escr[] value:
- bit 32 is reserved and must not be set
- the CPL_T1 field (bits 0 and 1) must be zero except on HT processors
  when global-mode counters are used
- IQ_ESCR0 and IQ_ESCR1 can only be used on P4 models <= 2

PEBS is not supported, but the replay tagging bits in PEBS_ENABLE
and PEBS_MATRIX_VERT may be used.

If p4.pebs_enable is zero, then p4.pebs_matrix_vert must also be zero.

If p4.pebs_enable is non-zero:
- only bits 24, 10, 9, 2, 1, and 0 may be set; note that in contrast
  to Intel's documentation, bit 25 (ENABLE_PEBS_MY_THR) is not needed
  and must not be set
- bit 24 (UOP_TAG) must be set
- at least one of bits 10, 9, 2, 1, or 0 must be set
- in p4.pebs_matrix_vert, all bits except 1 and 0 must be clear,
  and at least one of bits 1 and 0 must be set

Implementation Notes
====================

Caching
-------
Each 'struct perfctr_cpu_state' contains two cache-related fields:
- 'id': a unique identifier for the control data contents
- 'isuspend_cpu': the identity of the CPU on which a state containing
  interrupt-mode counters was last suspended

To this the driver adds a per-CPU cache, recording:
- the 'id' of the control data currently in that CPU
- the current contents of each control register

When perfctr_cpu_update_control() has validated the new control data,
it also updates the id field.

The driver's internal 'write_control' function, called from the
perfctr_cpu_resume() API function, first checks if the state's id
matches that of the CPU's cache, and if so, returns. Otherwise
it checks each control register in the state and updates those
that do not match the cache. Finally, it writes the state's id
to the cache. Tests on various x86 processor types have shown that
MSR writes are very expensive: the purpose of these cache checks
is to avoid MSR writes whenever possible.

Unlike accumulation-mode counters, interrupt-mode counters must be
physically stopped when suspended, primilarly to avoid overflow
interrupts in contexts not expecting them, and secondarily to avoid
increments to the counters themselves (see below).

When suspending interrupt-mode counters, the driver:
- records the CPU identity in the per-CPU cache
- stops each interrupt-mode counter by disabling its control register
- lets the cache and state id values remain the same

Later, when resuming interrupt-mode counters, the driver:
- if the state and cache id values match:
  * the cache id is cleared, to force a reload of the control
    registers stopped at suspend (see below)
  * if the state's "suspend" CPU identity matches the current CPU,
    the counter registers are still valid, and the procedure returns
- if the procedure did not return above, it then loops over each
  interrupt-mode counter:
  * the counter's control register is physically disabled, unless
    the cache indicates that it already is disabled; this is necessary
    to prevent premature events and overflow interrupts if the CPU's
    registers previously belonged to some other state
  * then the counter register itself is restored
After this interrupt-mode specific resume code is complete, the
driver continues by calling 'write_control' as described above.
The state and cache ids will not match, forcing write_control to
reload the disabled interrupt-mode control registers.

Call-site Backpatching
----------------------
The x86 family of processors is quite diverse in how their
performance counters work and are accessed. There are three
main designs (P5, P6, and P4) with several variations.
To handle this the processor type detection and initialisation
code sets up a number of function pointers to point to the
correct procedures for the actual CPU type.

Calls via function pointers are more expensive than direct calls,
so the driver actually performs direct calls to wrappers that
backpatch the original call sites to instead call the actual
CPU-specific functions in the future.

Unsynchronised code backpatching in SMP systems doesn't work
on Intel P6 processors due to an erratum, so the driver performs
a "finalise backpatching" step after the CPU-specific function
pointers have been set up. This step invokes the API procedures
on a temporary state object, set up to force every backpatchable
call site to be invoked and adjusted.

Several low-level API procedures are called in the context-switch
path by the per-process perfctrs kernel extension, which motivates
the efforts to reduce runtime overheads as much as possible.

Overflow Interrupts
-------------------
The x86 hardware enables overflow interrupts via the local
APIC's LVTPC entry, which is only present in P6/K7/K8/P4.

The low-level driver supports overflow interrupts as follows:
- It reserves a local APIC vector, 0xee, as LOCAL_PERFCTR_VECTOR.
- It adds a local APIC exception handler to entry.S, which
  invokes the driver's smp_perfctr_interrupt() procedure.
- It adds code to i8259.c to bind the LOCAL_PERFCTR_VECTOR
  interrupt gate to the exception handler in entry.S.
- During processor type detection, it records whether the
  processor supports the local APIC, and sets up function pointers
  for the suspend and resume operations on interrupt-mode counters.
- When the low-level driver is activated, it enables overflow
  interrupts by writing LOCAL_PERFCTR_VECTOR to each CPU's APIC_LVTPC.
- Overflow interrupts now end up in smp_perfctr_interrupt(), which
  ACKs the interrupt and invokes the interrupt handler installed
  by the high-level service/driver.
- When the low-level driver is deactivated, it disables overflow
  interrupts by masking APIC_LVTPC in each CPU. It then releases
  the local APIC back to the NMI watchdog.

At compile-time, the low-level driver indicates overflow interrupt
support by enabling CONFIG_PERFCTR_INTERRUPT_SUPPORT. If the feature
is also available at runtime, it sets the PERFCTR_FEATURE_PCINT flag
in the perfctr_info object.