This document summarizes the common approaches for performance fine-tuning with
jemalloc (as of 5.1.0).  The default configuration of jemalloc tends to work
reasonably well in practice, and most applications should not have to tune any
options.  However, in order to cover a wide range of applications and avoid
pathological cases, the default setting is sometimes kept conservative and
suboptimal, even for many common workloads.  When jemalloc is properly tuned for
a specific application / workload, it is common to improve system-level metrics
by a few percent, or make favorable trade-offs.


## Notable runtime options for performance tuning

Runtime options can be set via
[malloc_conf](http://jemalloc.net/jemalloc.3.html#tuning).
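
As a minimal sketch of the compile-time route, a program that links jemalloc can
define the `malloc_conf` global, which jemalloc reads during initialization.
The option values below are only illustrative, and the sketch assumes an
unprefixed jemalloc build; a build configured with a symbol prefix would likely
expose a prefixed name instead.

```c
/*
 * Sketch: embed default jemalloc options in the binary.  jemalloc reads this
 * global during initialization when it is linked in; without jemalloc it is
 * just an unused string.  The option values are illustrative only.
 */
const char *malloc_conf = "background_thread:true,metadata_thp:auto";
```

The same option string can also be supplied without recompiling through the
`MALLOC_CONF` environment variable, which is interpreted after the compiled-in
global and therefore overrides it.
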

* [background_thread](http://jemalloc.net/jemalloc.3.html#background_thread)

    Enabling jemalloc background threads generally improves the tail latency for
    application threads, since unused memory purging is shifted to the dedicated
    background threads.  In addition, unintended purging delay caused by
    application inactivity is avoided with background threads.

    Suggested: `background_thread:true` when jemalloc-managed threads can be
    allowed.

* [metadata_thp](http://jemalloc.net/jemalloc.3.html#opt.metadata_thp)

    Allowing jemalloc to utilize transparent huge pages for its internal
    metadata usually reduces TLB misses significantly, especially for programs
    with a large memory footprint and frequent allocation / deallocation
    activities.  Metadata memory usage may increase due to the use of huge
    pages.

    Suggested for allocation-intensive programs: `metadata_thp:auto` or
    `metadata_thp:always`, which is expected to improve CPU utilization at a
    small memory cost.

* [dirty_decay_ms](http://jemalloc.net/jemalloc.3.html#opt.dirty_decay_ms) and
  [muzzy_decay_ms](http://jemalloc.net/jemalloc.3.html#opt.muzzy_decay_ms)

    Decay time determines how fast jemalloc returns unused pages back to the
    operating system, and therefore provides a fairly straightforward trade-off
    between CPU and memory usage.  A shorter decay time purges unused pages
    faster to reduce memory usage (usually at the cost of more CPU cycles spent
    on purging), and vice versa.

    Suggested: tune the values based on the desired trade-offs.

* [narenas](http://jemalloc.net/jemalloc.3.html#opt.narenas)

    By default jemalloc uses multiple arenas to reduce internal lock contention.
    However, a high arena count may also increase overall memory fragmentation,
    since arenas manage memory independently.  When a high degree of parallelism
    is not expected at the allocator level, a lower number of arenas often
    improves memory usage.

    Suggested: if low parallelism is expected, try a lower arena count while
    monitoring CPU and memory usage.

* [percpu_arena](http://jemalloc.net/jemalloc.3.html#opt.percpu_arena)

    Enables dynamic thread-to-arena association based on the running CPU.  This
    has the potential to improve locality, e.g. when thread-to-CPU affinity is
    present.

    Suggested: try `percpu_arena:percpu` or `percpu_arena:phycpu` if
    thread migration between processors is expected to be infrequent.
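
One way to check which of the options above actually took effect is to read
them back through the `opt.*` entries of the `mallctl()` namespace.  The sketch
below is only illustrative: the function name `print_effective_options` is
hypothetical, and `mallctl` is declared as a weak symbol (a GCC/Clang
extension, assumed here) so that the code still builds and runs when jemalloc
is not linked in.  It also assumes an unprefixed jemalloc build.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/types.h> /* ssize_t */

/*
 * Weak declaration (GCC/Clang extension): the symbol is NULL when jemalloc is
 * not linked in.  Assumes an unprefixed jemalloc build.
 */
extern int mallctl(const char *name, void *oldp, size_t *oldlenp, void *newp,
                   size_t newlen) __attribute__((weak));

/*
 * Hypothetical helper: print the effective values of a few tuning options.
 * Returns 0 on success, -1 if jemalloc is unavailable, or the nonzero error
 * code reported by mallctl().
 */
int print_effective_options(FILE *out)
{
    bool background_thread = false;
    unsigned narenas = 0;
    ssize_t dirty_decay_ms = 0;
    ssize_t muzzy_decay_ms = 0;
    size_t len;
    int err;

    if (mallctl == NULL) {
        fprintf(out, "jemalloc is not linked in\n");
        return -1;
    }

    /* Each read passes a buffer and its size; nothing is written (newp NULL). */
    len = sizeof(background_thread);
    err = mallctl("opt.background_thread", &background_thread, &len, NULL, 0);
    if (err != 0)
        return err;

    len = sizeof(narenas);
    err = mallctl("opt.narenas", &narenas, &len, NULL, 0);
    if (err != 0)
        return err;

    len = sizeof(dirty_decay_ms);
    err = mallctl("opt.dirty_decay_ms", &dirty_decay_ms, &len, NULL, 0);
    if (err != 0)
        return err;

    len = sizeof(muzzy_decay_ms);
    err = mallctl("opt.muzzy_decay_ms", &muzzy_decay_ms, &len, NULL, 0);
    if (err != 0)
        return err;

    fprintf(out,
            "background_thread:%s narenas:%u dirty_decay_ms:%zd "
            "muzzy_decay_ms:%zd\n",
            background_thread ? "true" : "false", narenas, dirty_decay_ms,
            muzzy_decay_ms);
    return 0;
}
```

Calling `print_effective_options(stderr)` early in `main()` is one simple way
to confirm that a tuning string was parsed as intended before comparing CPU
and memory measurements.
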

Examples:

* High resource consumption application, prioritizing CPU utilization:

    `background_thread:true,metadata_thp:auto` combined with relaxed decay time
    (increased `dirty_decay_ms` and / or `muzzy_decay_ms`,
    e.g. `dirty_decay_ms:30000,muzzy_decay_ms:30000`).

* High resource consumption application, prioritizing memory usage:

    `background_thread:true` combined with shorter decay time (decreased
    `dirty_decay_ms` and / or `muzzy_decay_ms`,
    e.g. `dirty_decay_ms:5000,muzzy_decay_ms:5000`), and a lower arena count
    (e.g. the number of CPUs).

* Low resource consumption application:

    `narenas:1,lg_tcache_max:13` combined with shorter decay time (decreased
    `dirty_decay_ms` and / or `muzzy_decay_ms`, e.g.
    `dirty_decay_ms:1000,muzzy_decay_ms:0`).

* Extremely conservative -- minimize memory usage at all costs, only suitable when
  allocation activity is very rare:

    `narenas:1,tcache:false,dirty_decay_ms:0,muzzy_decay_ms:0`

It is recommended to combine these options with `abort_conf:true`, which
aborts immediately on invalid options.
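
As a rough sketch of how one of the example profiles might be applied together
with `abort_conf:true`, a small launcher can export the combined string through
`MALLOC_CONF` and then exec the target program.  The helper names
`with_abort_conf` and `exec_with_profile` are hypothetical.  The exec step is
there because jemalloc parses its options during initialization, so changing
the environment of an already running process does not retune that process.

```c
#define _POSIX_C_SOURCE 200809L /* for setenv() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: write "<profile>,abort_conf:true" into out.
 * Returns 0 on success, -1 if the buffer is too small.
 */
int with_abort_conf(char *out, size_t out_len, const char *profile)
{
    int n;

    if (strlen(profile) == 0)
        n = snprintf(out, out_len, "abort_conf:true");
    else
        n = snprintf(out, out_len, "%s,abort_conf:true", profile);

    /* snprintf() reports the untruncated length, so compare it to the size. */
    return (n >= 0 && (size_t)n < out_len) ? 0 : -1;
}

/*
 * Hypothetical launcher: export the profile via MALLOC_CONF and replace the
 * current process with argv[0].  Only returns (with -1) on failure.
 */
int exec_with_profile(const char *profile, char *const argv[])
{
    char conf[512];

    if (with_abort_conf(conf, sizeof(conf), profile) != 0)
        return -1;
    if (setenv("MALLOC_CONF", conf, 1) != 0)
        return -1;

    execvp(argv[0], argv);
    return -1; /* execvp() only returns on error */
}
```

For instance, `exec_with_profile("narenas:1,tcache:false", argv)` would start
the target with `MALLOC_CONF=narenas:1,tcache:false,abort_conf:true`, so a
misspelled option stops the program at startup instead of being silently
ignored.
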

## Beyond runtime options

In addition to the runtime options, there are a number of programmatic ways to
improve application performance with jemalloc.

* [Explicit arenas](http://jemalloc.net/jemalloc.3.html#arenas.create)

    Manually created arenas can help performance in various ways, e.g. by
    managing locality and contention for specific usages.  For example,
    applications can explicitly allocate frequently accessed objects from a
    dedicated arena with
    [mallocx()](http://jemalloc.net/jemalloc.3.html#MALLOCX_ARENA) to improve
    locality.  In addition, explicit arenas often benefit from individually
    tuned options, e.g. relaxed [decay
    time](http://jemalloc.net/jemalloc.3.html#arena.i.dirty_decay_ms) if
    frequent reuse is expected.

* [Extent hooks](http://jemalloc.net/jemalloc.3.html#arena.i.extent_hooks)

    Extent hooks allow customization for managing underlying memory.  One
    performance-oriented use case is to utilize huge pages -- for example,
    [HHVM](https://github.com/facebook/hhvm/blob/master/hphp/util/alloc.cpp)
    uses explicit arenas with customized extent hooks to manage 1GB huge pages
    for frequently accessed data, which reduces TLB misses significantly.

* [Explicit thread-to-arena
  binding](http://jemalloc.net/jemalloc.3.html#thread.arena)

    It is common for some threads in an application to have different memory
    access / allocation patterns.  Threads with heavy workloads often benefit
    from explicit binding, e.g. binding very active threads to dedicated arenas
    may reduce contention at the allocator level.
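
The explicit arena and thread binding techniques above can be sketched with
the `arenas.create`, `arena.<i>.dirty_decay_ms` and `thread.arena` entries of
the `mallctl()` namespace.  This is a minimal illustration, not a recommended
implementation: the function name `bind_thread_to_new_arena` is hypothetical,
`mallctl` is declared weak (a GCC/Clang extension, assumed here) so the code
still builds without jemalloc, and an unprefixed jemalloc build is assumed.

```c
#include <stddef.h>
#include <stdio.h>
#include <sys/types.h> /* ssize_t */

/* Weak declaration: NULL when jemalloc is not linked in (GCC/Clang). */
extern int mallctl(const char *name, void *oldp, size_t *oldlenp, void *newp,
                   size_t newlen) __attribute__((weak));

/*
 * Hypothetical helper: create a new arena, optionally set its dirty decay
 * time, and bind the calling thread to it.  Pass NULL for dirty_decay_ms to
 * keep the default.  On success, stores the arena index in *arena_ind and
 * returns 0.  Returns -1 if jemalloc is unavailable, or the nonzero error code
 * reported by mallctl(); *arena_ind is left untouched on failure.  For
 * brevity, an arena created before a later failure is not destroyed.
 */
int bind_thread_to_new_arena(const ssize_t *dirty_decay_ms, unsigned *arena_ind)
{
    unsigned ind;
    size_t len = sizeof(ind);
    char name[64];
    int err;

    if (mallctl == NULL)
        return -1;

    /* Reading "arenas.create" yields the index of a newly created arena. */
    err = mallctl("arenas.create", &ind, &len, NULL, 0);
    if (err != 0)
        return err;

    if (dirty_decay_ms != NULL) {
        ssize_t decay = *dirty_decay_ms;

        /* Per-arena settings are addressed by index: arena.<i>.<setting>. */
        snprintf(name, sizeof(name), "arena.%u.dirty_decay_ms", ind);
        err = mallctl(name, NULL, NULL, &decay, sizeof(decay));
        if (err != 0)
            return err;
    }

    /* Writing "thread.arena" rebinds the calling thread to that arena. */
    err = mallctl("thread.arena", NULL, NULL, &ind, sizeof(ind));
    if (err != 0)
        return err;

    *arena_ind = ind;
    return 0;
}
```

A very active worker thread could call this once at startup, for example with
a decay time of 30000 ms if frequent reuse is expected.  The returned index
could also be passed to `mallocx()` through `MALLOCX_ARENA()` to place
specific objects in the same arena from other threads.
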