Storage system notes
--------------------

extstore.h defines the API.

extstore_write() is a synchronous call which memcpy's the input buffer into a
write buffer for an active page. A failure is not usually a hard failure, but
indicates the caller can try again another time, e.g. the engine might be
busy freeing pages or assigning new ones.

As of this writing the write() implementation doesn't have an internal loop,
so it can give spurious failures (good for testing integration).
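
Since a write failure is usually soft, a caller can wrap the call in a small
bounded retry. A minimal sketch, assuming a stand-in function pointer rather
than the real extstore API; try_write_fn, write_with_retry and flaky_write are
hypothetical names used only for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a storage write: returns 0 on success and
 * non-zero when the engine is temporarily busy. Not the real extstore API. */
typedef int (*try_write_fn)(void *ctx, const void *buf, size_t len);

/* Retry a soft-failing write up to max_attempts times.
 * Returns the number of attempts used on success, or -1 if every attempt
 * failed and the caller should try again another time. */
static int write_with_retry(try_write_fn try_write, void *ctx,
                            const void *buf, size_t len, int max_attempts) {
    assert(try_write != NULL && max_attempts > 0);
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        if (try_write(ctx, buf, len) == 0) {
            return attempt;
        }
        /* Soft failure: the engine may be freeing or assigning pages. */
    }
    return -1;
}

/* Example stand-in that fails a fixed number of times before succeeding;
 * ctx points at the remaining failure count. */
static int flaky_write(void *ctx, const void *buf, size_t len) {
    int *failures_left = ctx;
    (void)buf;
    (void)len;
    if (*failures_left > 0) {
        (*failures_left)--;
        return -1;
    }
    return 0;
}
```

A background thread could equally skip the item and come back to it on its
next pass instead of looping in place.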

extstore_read() is an asynchronous call which takes a stack of IO objects and
adds it to the end of a queue. It then signals the IO thread to run. Once an
IO stack is submitted the caller must not touch the submitted objects anymore
(they are relinked internally).

extstore_delete() is a synchronous call which informs the storage engine that
an item has been removed from its page. It's important to call this as items
are actively deleted or passively reaped due to TTL expiration. This allows
the engine to intelligently reclaim pages.

The IO threads execute each object in turn (or in bulk if running in the
future libaio mode).

Callbacks are issued from the IO threads. It's thus important to keep
processing to a minimum. Callbacks may be issued out of order, and it is the
caller's responsibility to know when its stack has been fully processed so it
may reclaim the memory.
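
One way for a caller to know when its stack has been fully processed, even
with out-of-order callbacks, is to count outstanding objects. A minimal
sketch, assuming C11 atomics; io_batch and its functions are hypothetical
names and not part of extstore.h:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-request bookkeeping owned by the caller. */
typedef struct {
    atomic_int pending; /* IO objects submitted but not yet called back */
} io_batch;

/* Record how many IO objects were submitted in this stack. */
static void io_batch_init(io_batch *b, int submitted) {
    assert(submitted > 0);
    atomic_init(&b->pending, submitted);
}

/* Call once from each IO callback. Returns true for exactly one caller:
 * the one that completed the last outstanding object, whatever the order
 * in which callbacks arrive. Only that caller may reclaim the memory. */
static bool io_batch_complete_one(io_batch *b) {
    return atomic_fetch_sub(&b->pending, 1) == 1;
}
```

Keeping the callback down to a single atomic decrement fits the advice to do
as little work as possible on the IO threads.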

With DIRECT_IO support, buffers submitted for read/write will need to be
aligned with posix_memalign() or similar.
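
A minimal sketch of allocating such a buffer with posix_memalign(). The
alignment value is an assumption (4096 is common, but the real requirement
depends on the device and filesystem), and alloc_aligned is a hypothetical
helper name:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* exposes posix_memalign() */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate a buffer whose address is a multiple of `alignment`.
 * `alignment` must be a power of two and a multiple of sizeof(void *).
 * The size is rounded up to a multiple of the alignment, since direct IO
 * generally wants aligned lengths as well as aligned addresses.
 * Returns NULL on failure; release the buffer with free(). */
static void *alloc_aligned(size_t alignment, size_t size) {
    void *buf = NULL;
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    size_t rounded = (size + alignment - 1) / alignment * alignment;
    if (posix_memalign(&buf, alignment, rounded) != 0) {
        return NULL;
    }
    return buf;
}
```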

Buckets
-------

During extstore_init(), a number of active buckets is specified. Pages are
handled overall as a global pool, but writes can be redirected to specific
active pages.

This allows a lot of flexibility, e.g.:

1) "High TTL" and "low TTL" as two buckets. Items with TTL < 86400 go into
bucket 0, the rest into bucket 1. Co-locating low TTL items means those pages
can reach zero objects and free up more easily.

2) Extended: "low TTL" is one bucket, and then one bucket per slab class.
If TTLs are low, mixed-size objects can go together as they are likely to
expire before cycling out of flash (depending on workload, of course).
For higher TTL items, pages are stored on chunk barriers. This means less
space is wasted as items should fit nearly exactly into write buffers and
pages. It also means you can blindly read items back if the system wants to
free a page and we can indicate to the caller somehow which pages are up for
probation. E.g. issue a read against page 3 version 1 for byte range 0->1MB,
then chunk and look up objects. Then read the next 1MB chunk, etc. If there's
anything we want to keep, pull it back into RAM before the page is freed.

Pages are assigned into buckets on demand, so if you make 30 buckets but use
only 1, there will only be a single active page with write buffers.
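
The two bucketing schemes above can be sketched as small selection functions.
This is an illustration only: the function names are hypothetical, the TTL is
assumed to be the remaining lifetime in seconds (a caller would have to map
"never expires" to a large value), and slab classes are assumed to be numbered
from zero:

```c
#include <assert.h>

#define LOW_TTL_SECONDS 86400u /* cutoff from the text: one day */

/* Scheme 1: two buckets, low TTL in bucket 0 and everything else in 1. */
static unsigned int bucket_two_way(unsigned int ttl_seconds) {
    return ttl_seconds < LOW_TTL_SECONDS ? 0 : 1;
}

/* Scheme 2: bucket 0 for low TTL items of any size, then one bucket per
 * slab class for the rest, so that needs num_classes + 1 buckets. */
static unsigned int bucket_per_class(unsigned int ttl_seconds,
                                     unsigned int slab_class,
                                     unsigned int num_classes) {
    assert(slab_class < num_classes);
    if (ttl_seconds < LOW_TTL_SECONDS) {
        return 0;
    }
    return 1 + slab_class;
}
```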

Memcached integration
---------------------

With the POC: items.c's lru_maintainer_thread issues writes to storage if all
memory has been allocated out to slab classes, and less than a certain amount
of memory is free. Original objects are swapped with items marked with the
ITEM_HDR flag. An ITEM_HDR contains copies of the original key and most of the
header data. The ITEM_data() section of an ITEM_HDR object contains (item_hdr
*), which describes enough information to retrieve the original object from
storage.

To get the best performance it is important that reads can be deeply
pipelined. As much processing as possible is done ahead of time, IOs are
submitted, and once the IOs are done processing a minimal amount of code is
executed before transmit() is possible. This should amortize the amount of
latency incurred by hopping threads and waiting on IO.

Recaching
---------

If a header is hit twice overall, and the second time within ~60s of the first
time, it will have a chance of getting recached. "recache_rate" is a simple
"counter % rate == 0" check. Setting it to 1000 means that one out of every
1000 times an item is hit twice within ~60s, it will be recached into memory.
Very hot items will get pulled out of storage relatively quickly.
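
The check described above can be sketched as follows. The struct and function
names are hypothetical and not memcached's own; the sketch also assumes the
counter only advances on hits that fall inside the ~60s window:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RECACHE_WINDOW_SECONDS 60 /* the "~60s" from the text */

/* Hypothetical state; in practice this might be kept per worker thread. */
typedef struct {
    uint64_t counter;  /* counts hits that qualified for a recache chance */
    unsigned int rate; /* "recache_rate": recache one in `rate` such hits */
} recache_state;

/* Decide whether a hit on a header should pull the item back into RAM.
 * last_hit is 0 if the header has not been hit before. */
static bool should_recache(recache_state *s, int64_t now, int64_t last_hit) {
    assert(s->rate > 0);
    if (last_hit == 0 || now - last_hit > RECACHE_WINDOW_SECONDS) {
        return false; /* first hit, or the second hit came too late */
    }
    /* The "counter % rate == 0" check from the text. */
    return (s->counter++ % s->rate) == 0;
}
```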

Compaction
----------

A target fragmentation limit is set: "0.9", meaning "run compactions if pages
exist which have less than 90% of their bytes used".

This value is slewed based on the number of free pages in the system, and
activates when half of the pages are used. The fraction of pages that are not
free is multiplied against the target fragmentation limit, e.g.:
limit of 0.9: 50% of pages free -> 0.9 * 0.5 -> 0.45 (45%). If a page is 64
megabytes, pages with less than 28.8 megabytes used would be targeted for
compaction. If 0 pages are free, anything less than 90% used is targeted,
which means it may have to rewrite 10 pages to free one page.
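
The arithmetic above can be sketched as a single function. This is an
illustration consistent with the two worked examples in the text, not a copy
of memcached's code; compact_threshold and its parameters are hypothetical
names:

```c
#include <assert.h>
#include <stdint.h>

/* Slewed compaction threshold: pages with fewer used bytes than this are
 * candidates for compaction. max_frag is the target limit (0.9 in the
 * text). The result is in the same unit as page_size. */
static double compact_threshold(double max_frag, uint64_t pages_free,
                                uint64_t page_count, uint64_t page_size) {
    assert(page_count > 0 && pages_free <= page_count);
    /* Fraction of pages that are not free: 0.5 at 50% free, 1.0 at 0 free. */
    double used_fraction = 1.0 - (double)pages_free / (double)page_count;
    return max_frag * used_fraction * (double)page_size;
}
```

With a limit of 0.9, 5 of 10 pages free and a page size of 64 (megabytes),
this gives the 28.8 megabytes from the example; with no pages free it gives
57.6 megabytes, i.e. 90% of a page.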

In memcached's integration, a second bucket is used for objects rewritten via
the compactor. Objects which have been around long enough to get compacted
might continue to stick around, so co-locating them could reduce
fragmentation work.

If an exclusive lock is made on a valid object header, the flash locations are
rewritten directly in the object. As of this writing, if an object header is
busy for some reason, the write is dropped (COW needs to be implemented). This
is an unlikely scenario however.

Objects are read back along the boundaries of a write buffer. If an 8 meg
write buffer is used, 8 megs are read back at once and iterated for objects.

This needs a fair amount of tuning, possibly more throttling. It will still
evict pages if the compactor gets behind.

TODO
----

Sharing my broad TODO items into here. While doing the work they get split up
more into local notes. Adding this so others can follow along:

(a bunch of the TODO has been completed and removed)
- O_DIRECT support
- libaio support (requires O_DIRECT)
- JBOD support (not in first pass)
  - 1-2 active pages per device. potentially dedicated IO threads per device.
    with a RAID setup you risk any individual disk doing a GC pause stalling
    all writes. could also simply rotate devices on a per-bucket basis.

on memcached end:
- O_DIRECT support; mostly memalign pages, but also making chunks grow
  aligned to sector sizes once they are >= a single sector.
- specify storage size by MB/G/T/etc instead of page count