Storage system notes
--------------------

extstore.h defines the API.

extstore_write() is a synchronous call which memcpy's the input buffer into a
write buffer for an active page. A failure is not usually a hard failure; it
indicates the caller can try again later, as the engine might be busy freeing
pages or assigning new ones. As of this writing the write() implementation
doesn't have an internal retry loop, so it can return spurious failures (good
for testing integration).

extstore_read() is an asynchronous call which takes a stack of IO objects and
adds it to the end of a queue, then signals the IO thread to run. Once an IO
stack is submitted the caller must not touch the submitted objects anymore
(they are relinked internally).

extstore_delete() is a synchronous call which informs the storage engine that
an item has been removed from the page it lives on. It's important to call
this as items are actively deleted or passively reaped due to TTL expiration,
as it allows the engine to intelligently reclaim pages.

The IO threads execute each object in turn (or in bulk if running in the
future libaio mode). Callbacks are issued from the IO threads, so it's
important to keep callback processing to a minimum. Callbacks may be issued
out of order, and it is the caller's responsibility to know when its stack
has been fully processed so it may reclaim the memory.

With DIRECT_IO support, buffers submitted for read/write will need to be
aligned with posix_memalign() or similar.
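To make the calling pattern concrete, here is a minimal C sketch. The obj_io
fields and prototypes below are simplified stand-ins invented for
illustration (the real definitions live in extstore.h); what it shows is the
soft-failure write and the callback-counted read stack described above.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdlib.h>

    /* Stand-in for the real obj_io; only the fields this sketch needs. */
    typedef struct obj_io obj_io;
    struct obj_io {
        void *data;          /* caller cookie handed back to the callback */
        obj_io *next;        /* reads are submitted as a linked stack */
    };

    /* Hypothetical prototypes; see extstore.h for the real signatures. */
    extern int extstore_write(void *engine, unsigned int bucket, obj_io *io);
    extern int extstore_read(void *engine, obj_io *io_stack);

    /* Synchronous write: a nonzero return is a soft failure (the engine
     * may be busy freeing or assigning pages), so the caller tries again
     * on a later maintenance pass instead of treating it as fatal. */
    bool try_write(void *engine, unsigned int bucket, obj_io *io) {
        return extstore_write(engine, bucket, io) == 0;
    }

    /* Asynchronous read: hand the whole stack to the engine. After this
     * the objects must not be touched until every callback has fired. */
    int submit_reads(void *engine, obj_io *stack) {
        return extstore_read(engine, stack);
    }

    /* Callbacks fire on the IO threads and may arrive out of order, so
     * the caller keeps an outstanding count per submitted stack and only
     * reclaims memory once it hits zero. Keep work here minimal. */
    struct read_batch {
        atomic_int pending;  /* IOs still in flight */
    };

    void read_done_cb(obj_io *io) {
        struct read_batch *b = io->data;
        if (atomic_fetch_sub(&b->pending, 1) == 1)
            free(b);         /* last callback: stack fully processed */
    }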
Buckets
-------

During extstore_init(), a number of active buckets is specified. Pages are
handled overall as a global pool, but writes can be redirected to specific
active pages. This allows a lot of flexibility, ie:

1) An idea of "high TTL" and "low TTL" being two buckets: TTL < 86400 goes
into bucket 0, the rest into bucket 1. Co-locating low-TTL items means those
pages can reach zero objects and free up more easily.

2) Extended: "low TTL" is one bucket, and then one bucket per slab class. If
TTLs are low, mixed-size objects can go together as they are likely to expire
before cycling out of flash (depending on workload, of course). For
higher-TTL items, objects within pages are stored on chunk barriers, so less
space is wasted as items should fit nearly exactly into write buffers and
pages. It also means items can be blindly read back if the system wants to
free a page, provided we can indicate to the caller somehow which pages are
up for probation: ie, issue a read against page 3 version 1 for byte range
0->1MB, chunk it and look up the objects, then read the next 1MB chunk, etc.
If there's anything we want to keep, pull it back into RAM before the page
is freed.

Pages are assigned into buckets on demand, so if you configure 30 buckets
but use only 1, there will only be a single active page with write buffers.

Memcached integration
---------------------

With the POC: items.c's lru_maintainer_thread writes items to storage if all
memory has been allocated out to slab classes and less than a configured
amount of memory is free. Original objects are swapped with items marked with
the ITEM_HDR flag. An ITEM_HDR item contains copies of the original key and
most of the header data. The ITEM_data() section of an ITEM_HDR object
contains an (item_hdr *), which holds enough information to retrieve the
original object from storage.

To get the best performance it is important that reads can be deeply
pipelined: as much processing as possible is done ahead of time, IOs are
submitted, and once IOs are done processing, a minimal amount of code is
executed before transmit() is possible. This should amortize the latency
incurred by hopping threads and waiting on IO.

Recaching
---------

If a header is hit twice overall, and the second hit is within ~60s of the
first, it has a chance of being recached. "recache_rate" is a simple
"counter % rate == 0" check; setting it to 1000 means that one out of every
1000 instances of an item being hit twice within ~60s will be recached into
memory. Very hot items will get pulled out of storage relatively quickly.

Compaction
----------

A target fragmentation limit is set: "0.9", meaning "run compactions if
pages exist which have less than 90% of their bytes used". This value is
slewed based on how full the system is, and activates once half of the pages
are in use: the fraction of pages in use is multiplied against the target
fragmentation limit. Ie, with a limit of 0.9 and half the pages in use:
0.9 * 0.5 -> 0.45. If a page is 64 megabytes, pages with less than 28.8
megabytes used would be targeted for compaction. If 0 pages are free,
anything less than 90% used is targeted, which means the compactor has to
rewrite up to 10 pages to free one page.

In memcached's integration, a second bucket is used for objects rewritten
via the compactor. Objects that have been around long enough to get
compacted might continue to stick around, so co-locating them could reduce
future fragmentation work.

If an exclusive lock can be taken on a valid object header, the flash
locations are rewritten directly in the object. As of this writing, if an
object header is busy for some reason, the write is dropped (COW needs to be
implemented). This is an unlikely scenario, however.

Objects are read back along the boundaries of a write buffer. If an 8 meg
write buffer is used, 8 megs are read back at once and iterated for objects.
This needs a fair amount of tuning, and possibly more throttling. It will
still evict pages if the compactor gets behind.

TODO
----

Sharing my broad TODO items here; while doing the work they get split up
further into local notes. Adding this so others can follow along (a bunch of
the TODO has already been completed and removed):

- O_DIRECT support
- libaio support (requires O_DIRECT)
- JBOD support (not in first pass):
  1-2 active pages per device, potentially with dedicated IO threads per
  device. With a RAID setup you risk any individual disk doing a GC pause
  stalling all writes. Could also simply rotate devices on a per-bucket
  basis.

On the memcached end:

- O_DIRECT support; mostly memalign'ing pages, but also making chunks grow
  aligned to sector sizes once they are >= a single sector (see the sketch
  below).
- Specify storage size by MB/G/T/etc instead of page count.
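For the O_DIRECT items above, a minimal sketch of the allocation side,
assuming a 4096-byte alignment (the real value should come from the device's
logical sector size):

    #include <stdlib.h>

    #define SECTOR_ALIGN 4096   /* assumed; query the device in practice */

    /* O_DIRECT requires the buffer address, transfer length, and file
     * offset to all be sector-aligned. posix_memalign handles the address;
     * rounding the length up to a sector multiple handles the rest for
     * whole-buffer transfers. */
    void *alloc_direct_buf(size_t len) {
        void *buf = NULL;
        size_t aligned = (len + SECTOR_ALIGN - 1) & ~((size_t)SECTOR_ALIGN - 1);
        if (posix_memalign(&buf, SECTOR_ALIGN, aligned) != 0)
            return NULL;
        return buf;
    }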