$Id: overview.txt,v 1.2 2004/07/17 00:30:49 mikpe Exp $

AN OVERVIEW OF PERFCTR
======================

The perfctr package adds support to the Linux kernel for using
the performance-monitoring counters found in many processors.

Perfctr is internally organised in three layers:

- The low-level drivers, one for each supported architecture.
  Currently there are two, one for 32 and 64-bit x86 processors,
  and one for 32-bit PowerPC processors.

  low-level-api.txt documents the model of the performance counters
  used in this package, and the internal API to the low-level drivers.

  low-level-{x86,ppc}.txt provide documentation specific to those
  architectures and their low-level drivers.

- The high-level services.
  There is currently one, a kernel extension adding support for
  virtualised per-process performance counters.
  See virtual.txt for documentation on this kernel extension.

  [There used to be a second high-level service, a simple driver
  to control and access all performance counters in all processors.
  This driver is currently removed, pending an acceptable new API.]

- The top-level, which performs initialisation and implements
  common procedures and system calls.

Rationale
---------
The perfctr package solves three problems:

- Hardware invariably restricts programming of the performance
  counter registers to kernel-level code, and sometimes also
  restricts reading the counters to kernel-level code.

  Perfctr adds APIs allowing user-space code to access the counters.
  In the case of the per-process counters kernel extension,
  even non-privileged processes are allowed access.

- Hardware often limits the precision of the hardware counters,
  making them unsuitable for storing total event counts.

  The counts are instead maintained as 64-bit values in software,
  with the hardware counters used to derive increments over given
  time periods.

- In a non-modified kernel, the thread state does not include the
  performance monitoring counters, and the context switch code
  does not save and restore them. In this situation the counters
  are system-wide, making them unreliable and inaccurate when used
  for monitoring specific processes or specific segments of code.

  The per-process counters kernel extension treats the counter state as
  part of the thread state, solving the reliability and accuracy problems.

Non-goals
---------
Providing high-level interfaces that abstract and hide the
underlying hardware is a non-goal. Such abstractions can
and should be implemented in user-space, for several reasons:

- The complexity and variability of the hardware means that
  any abstraction would be inaccurate. There would be both
  loss of functionality, and presence of functionality which
  isn't supportable on any given processor. User-space tools
  and libraries can implement this, on top of the
  processor-specific interfaces provided by the kernel.

- The implementation of such an abstraction would be large
  and complex. (Consider ESCR register assignment on P4.)
  Performing complex actions in user-space simplifies the
  kernel, allowing it to concentrate on validating control
  data, managing processes, and driving the hardware.
  (Cf. the role of compilers.)

- The abstraction is purely a user-convenience thing. The
  kernel-level components have no need for it.

Common System Calls
===================
This lists those system calls that are not tied to
a specific high-level service/driver.

Querying CPU and Driver Information
-----------------------------------
int err = sys_perfctr_info(struct perfctr_info *info,
                           struct perfctr_cpu_mask *cpus,
                           struct perfctr_cpu_mask *forbidden);

This operation retrieves information from the kernel about
the processors in the system.

If non-NULL, '*info' will be updated with information about the
capabilities of the processor and the low-level driver.

If non-NULL, '*cpus' will be updated with a bitmask listing the
set of processors in the system. The size of this bitmask is not
statically known, so the protocol is:

1. User-space initialises cpus->nrwords to the number of elements
   allocated for cpus->mask[].
2. The kernel reads cpus->nrwords, and then writes the required
   number of words to cpus->nrwords.
3. If the required number of words is greater than the original value
   of cpus->nrwords, then an EOVERFLOW error is signalled.
4. Otherwise, the kernel converts its internal cpumask_t value
   to the external format and writes that to cpus->mask[].

If non-NULL, '*forbidden' will be updated with a bitmask listing
the set of processors in the system on which users must not try
to use performance counters. This is currently only relevant for
hyper-threaded Pentium 4/Xeon systems. The protocol is the same
as for '*cpus'.

Notes:
- The internal representation of a cpumask_t is as an array of
  unsigned long. This representation is unsuitable for user-space,
  because it is not binary-compatible between 32 and 64-bit
  variants of a big-endian processor. The 'struct perfctr_cpu_mask'
  type uses an array of unsigned 32-bit integers.
- The protocol for retrieving a 'struct perfctr_cpu_mask' was
  designed to allow user-space to quickly determine the correct
  size of the 'mask[]' array. Other system calls use weaker protocols,
  which force user-space to guess increasingly large values in a
  loop, until finally an acceptable value is guessed.
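
Example: the size-negotiation protocol for a CPU mask can be sketched in
plain user-space C. This is only an illustration, not the perfctr
implementation: 'struct demo_cpu_mask', demo_kernel_get_cpus() and
demo_query_cpus() are hypothetical names, the struct layout is an
assumption, and the "kernel" side is a mock that pretends CPUs
0..ncpus-1 are present. The point it shows is that a caller can probe
with nrwords set to 0, read back the required size after the EOVERFLOW
failure, and then succeed on the second call without guessing.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed layout: a word count followed by unsigned 32-bit mask words. */
struct demo_cpu_mask {
    unsigned int nrwords;
    uint32_t mask[];            /* flexible array member */
};

/* Hypothetical stand-in for the kernel side of the protocol.
 * Pretends that CPUs 0..ncpus-1 exist. Returns 0 or -EOVERFLOW. */
int demo_kernel_get_cpus(struct demo_cpu_mask *cpus, unsigned int ncpus)
{
    unsigned int avail, needed, i;

    assert(cpus != NULL);
    avail = cpus->nrwords;              /* read the caller's capacity */
    needed = (ncpus + 31) / 32;         /* 32 CPUs per mask word */
    cpus->nrwords = needed;             /* always report the required size */
    if (needed > avail)
        return -EOVERFLOW;              /* caller's buffer is too small */

    for (i = 0; i < needed; i++)
        cpus->mask[i] = 0;
    for (i = 0; i < ncpus; i++)         /* convert to the external format */
        cpus->mask[i / 32] |= (uint32_t)1 << (i % 32);
    return 0;
}

/* User-space side: probe with nrwords = 0, then retry with the exact size.
 * Returns a malloc()ed mask that the caller must free(), or NULL. */
struct demo_cpu_mask *demo_query_cpus(unsigned int ncpus)
{
    struct demo_cpu_mask probe = { .nrwords = 0 };
    struct demo_cpu_mask *cpus;
    int err;

    err = demo_kernel_get_cpus(&probe, ncpus);
    if (err != 0 && err != -EOVERFLOW)
        return NULL;

    /* probe.nrwords now holds the number of words required. */
    cpus = malloc(sizeof(*cpus) + probe.nrwords * sizeof(cpus->mask[0]));
    if (cpus == NULL)
        return NULL;
    cpus->nrwords = probe.nrwords;

    err = demo_kernel_get_cpus(cpus, ncpus);
    if (err != 0) {
        free(cpus);
        return NULL;
    }
    return cpus;
}
```

With a real kernel the mock would be replaced by the sys_perfctr_info
call, and the same two-call pattern would apply to both '*cpus' and
'*forbidden'. Because the mask words are fixed at 32 bits, bit N of the
set is always found in mask[N / 32] at bit position N % 32, whatever the
word size of the kernel.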