Storage system notes
--------------------

extstore.h defines the API.

extstore_write() is a synchronous call which memcpy's the input buffer into a
write buffer for an active page. A failure is usually not a hard failure; it
indicates the caller can try again later, e.g. the engine might be busy
freeing pages or assigning new ones.

As of this writing the write() implementation doesn't have an internal retry
loop, so it can return spurious failures (which is useful for testing the
integration).
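
The caller-side pattern this implies can be sketched as below. This is a minimal illustration only: stub_extstore_write() is a stand-in that simulates the spurious-failure behavior, and the function names and EXTSTORE_* return codes are assumptions, not the real extstore.h API.

```c
#include <stddef.h>

/* Hypothetical soft-failure return codes for illustration. */
#define EXTSTORE_OK    0
#define EXTSTORE_BUSY -1

/* Stub standing in for extstore_write(): pretend the engine is busy
 * (freeing or assigning pages) for the first few calls. */
static int fail_countdown = 2;

static int stub_extstore_write(const void *buf, size_t len) {
    (void)buf; (void)len;
    if (fail_countdown-- > 0)
        return EXTSTORE_BUSY; /* soft failure: try again later */
    return EXTSTORE_OK;       /* data copied into an active write buffer */
}

/* Caller treats failure as "retry later", not as a fatal error. */
static int write_with_retry(const void *buf, size_t len, int max_tries) {
    for (int i = 0; i < max_tries; i++) {
        if (stub_extstore_write(buf, len) == EXTSTORE_OK)
            return 0;
        /* a real caller would typically requeue the item and move on */
    }
    return -1;
}
```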

extstore_read() is an asynchronous call which takes a stack of IO objects and
adds it to the end of a queue, then signals the IO thread to run. Once an IO
stack is submitted, the caller must not touch the submitted objects anymore
(they are relinked internally).
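
A rough sketch of what "a stack of IO objects" looks like: requests chained through a next pointer and handed over as one unit. The struct and function names here are illustrative, not the real extstore.h types.

```c
#include <stddef.h>

/* Illustrative IO request: chained via ->next into a stack. */
typedef struct io_sketch {
    struct io_sketch *next;
    int page_id;
    size_t offset, len;
} io_sketch;

/* Walk a submitted stack and count entries. A real engine would instead
 * relink these onto its internal queue and signal the IO thread; after
 * submission the caller must not touch the objects. */
static int submit_stack(io_sketch *head) {
    int n = 0;
    for (io_sketch *io = head; io != NULL; io = io->next)
        n++;
    return n;
}
```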

extstore_delete() is a synchronous call which informs the storage engine that
an item has been removed from its page. It's important to call this as items
are actively deleted or passively reaped due to TTL expiration, since it
allows the engine to intelligently reclaim pages.

The IO threads execute each object in turn (or in bulk, if running in the
future libaio mode).

Callbacks are issued from the IO threads, so it's important to keep callback
processing to a minimum. Callbacks may be issued out of order, and it is the
caller's responsibility to know when its stack has been fully processed so it
may reclaim the memory.

With DIRECT_IO support, buffers submitted for read/write will need to be
aligned with posix_memalign() or similar.
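
For example, an aligned allocation might look like the following. The 4096-byte alignment is an assumption (a common logical sector size); the real requirement comes from the underlying device and filesystem.

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate a buffer suitable for O_DIRECT-style I/O: posix_memalign
 * guarantees the returned address is a multiple of 'align'. */
static void *alloc_direct_buf(size_t align, size_t len) {
    void *buf = NULL;
    if (posix_memalign(&buf, align, len) != 0)
        return NULL;
    return buf;
}
```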

Buckets
-------

During extstore_init(), a number of active buckets is specified. Pages are
handled overall as a global pool, but writes can be redirected to specific
active pages.

This allows a lot of flexibility, e.g.:

1) An idea of "high TTL" and "low TTL" being two buckets: TTL < 86400
goes into bucket 0, the rest into bucket 1. Co-locating low TTL items means
those pages can reach zero objects and free up more easily.
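
That policy reduces to a one-line selector. The function and bucket names below are illustrative; only the 86400-second threshold comes from the example above.

```c
/* Hypothetical bucket ids for the two-bucket TTL policy. */
#define LOW_TTL_BUCKET  0
#define HIGH_TTL_BUCKET 1

/* Route items with TTL under one day into the low-TTL bucket so those
 * pages empty out (and free up) together. */
static int choose_bucket(long ttl_seconds) {
    return (ttl_seconds < 86400) ? LOW_TTL_BUCKET : HIGH_TTL_BUCKET;
}
```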

2) Extended: "low TTL" is one bucket, and then one bucket per slab class.
If TTLs are low, mixed-size objects can go together as they are likely to
expire before cycling out of flash (depending on workload, of course).
For higher TTL items, pages are stored on chunk barriers. This means less
space is wasted, as items should fit nearly exactly into write buffers and
pages. It also means you can blindly read items back if the system wants to
free a page, and we can indicate to the caller somehow which pages are up
for probation: e.g. issue a read against page 3 version 1 for byte range
0->1MB, then chunk it and look up objects, then read the next 1MB chunk, and
so on. If there's anything we want to keep, pull it back into RAM before the
page is freed.

Pages are assigned to buckets on demand, so if you create 30 buckets but
only use 1, there will only be a single active page with write buffers.

Memcached integration
---------------------

With the POC: items.c's lru_maintainer_thread issues writes to storage if
all memory has been allocated out to slab classes, and there is less than a
set amount of memory free. Original objects are swapped with items marked
with the ITEM_HDR flag. An ITEM_HDR item contains copies of the original key
and most of the header data. The ITEM_data() section of an ITEM_HDR object
contains an (item_hdr *), which holds enough information to retrieve the
original object from storage.

To get the best performance, it is important that reads can be deeply
pipelined. As much processing as possible is done ahead of time, IOs are
submitted, and once IOs are done processing, a minimal amount of code is
executed before transmit() is possible. This should amortize the latency
incurred by hopping threads and waiting on IO.

Recaching
---------

If a header is hit twice overall, and the second time within ~60s of the
first time, it has a chance of being recached. "recache_rate" is a simple
"counter % rate == 0" check. Setting it to 1000 means that in one out of
every 1000 instances of an item being hit twice within ~60s, it will be
recached into memory. Very hot items will get pulled out of storage
relatively quickly.
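
The check itself is trivial; a sketch follows. The counter bookkeeping is simplified (a single global, no ~60s window tracking); only the "counter % rate == 0" rule comes from the text.

```c
/* Incremented once per "item hit twice within ~60s" event. */
static unsigned int recache_counter = 0;

/* Returns nonzero when this event wins the 1-in-rate recache lottery. */
static int should_recache(unsigned int recache_rate) {
    return (++recache_counter % recache_rate) == 0;
}
```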

Compaction
----------

A target fragmentation limit is set, e.g. "0.9", meaning "run compactions if
pages exist which have less than 90% of their bytes used".

This value is slewed based on the number of free pages in the system, and
activates once half of the pages are used. The fraction of pages in use is
multiplied against the target fragmentation limit, e.g. with a limit of 0.9
and 50% of pages free: 0.9 * 0.5 -> 0.45. If a page is 64 megabytes, pages
with less than 28.8 megabytes used would be targeted for compaction. If 0
pages are free, anything less than 90% used is targeted, which means it may
have to rewrite as many as 10 pages to free one page.
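
The slewed threshold works out as below. The function name is illustrative; the math (limit multiplied by the used fraction, scaled by page size) follows the two worked examples above: 50% used gives 0.9 * 0.5 = 0.45, and 100% used gives the full 0.9 limit.

```c
/* Bytes-used threshold below which a page is targeted for compaction.
 * used_fraction is the fraction of total pages currently in use; as free
 * pages run out, the threshold slews up toward the full frag limit. */
static double compact_threshold_bytes(double frag_limit, double used_fraction,
                                      double page_size_bytes) {
    return frag_limit * used_fraction * page_size_bytes;
}
```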

In memcached's integration, a second bucket is used for objects rewritten
via the compactor. Objects that have stuck around long enough to be
compacted are likely to continue sticking around, so co-locating them can
reduce future fragmentation work.

If an exclusive lock can be taken on a valid object header, the flash
locations are rewritten directly in the object. As of this writing, if an
object header is busy for some reason, the write is dropped (COW needs to be
implemented). This is an unlikely scenario, however.

Objects are read back along the boundaries of a write buffer. If an 8 meg
write buffer is used, 8 megs are read back at once and iterated for objects.

This needs a fair amount of tuning, and possibly more throttling. It will
still evict pages if the compactor gets behind.

TODO
----

Sharing my broad TODO items here. While doing the work they get split up
further into local notes. Adding this so others can follow along:

(a bunch of the TODO has been completed and removed)
- O_DIRECT support
- libaio support (requires O_DIRECT)
- JBOD support (not in first pass)
- 1-2 active pages per device, potentially with dedicated IO threads per
  device. With a RAID setup you risk any individual disk doing a GC pause
  stalling all writes. Could also simply rotate devices on a per-bucket
  basis.

On the memcached end:
- O_DIRECT support; mostly memalign'ing pages, but also making chunks grow
  aligned to sector sizes once they are >= a single sector.
- specify storage size by MB/G/T/etc instead of page count