Storage system notes
--------------------

extstore.h defines the API.

extstore_write() is a synchronous call which memcpy's the input buffer into a
write buffer for an active page. A failure is not usually a hard failure, but
indicates that the caller can try again later; e.g. the engine might be busy
freeing pages or assigning new ones.

As of this writing the write() implementation doesn't have an internal loop,
so it can give spurious failures (good for testing integration).
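
Since a failed write is usually transient, a caller could wrap the call in a
small bounded retry loop of its own. A minimal sketch, assuming a hypothetical
try_write callback standing in for the real write call (the actual signature
lives in extstore.h and is not reproduced here); flaky_write is only a test
stand-in that simulates spurious failures:

```c
#include <stddef.h>

/* Hypothetical stand-in for one storage write attempt: returns 0 on success
 * and non-zero on a (possibly transient) failure. Not the extstore API. */
typedef int (*try_write_fn)(void *ctx, const void *buf, size_t len);

/* Attempt the write up to max_tries times. Returns 0 on success, or -1 if
 * every attempt failed and the caller should skip the item for now. */
static int write_with_retry(try_write_fn try_write, void *ctx,
                            const void *buf, size_t len, int max_tries) {
    for (int i = 0; i < max_tries; i++) {
        if (try_write(ctx, buf, len) == 0)
            return 0;
    }
    return -1;
}

/* Illustration only: fails until the counter behind ctx reaches zero. */
static int flaky_write(void *ctx, const void *buf, size_t len) {
    int *failures_left = ctx;
    (void)buf;
    (void)len;
    if (*failures_left > 0) {
        (*failures_left)--;
        return -1;
    }
    return 0;
}
```

Keeping the retry count small matters here: the failure means "try again
another time", so giving up and leaving the item in memory is a valid outcome.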

extstore_read() is an asynchronous call which takes a stack of IO objects and
adds it to the end of a queue. It then signals the IO thread to run. Once an
IO stack is submitted the caller must not touch the submitted objects anymore
(they are relinked internally).

extstore_delete() is a synchronous call which informs the storage engine that
an item has been removed from a page. It's important to call this as items are
actively deleted or passively reaped due to TTL expiration. This allows the
engine to intelligently reclaim pages.

The IO threads execute each object in turn (or in bulk if running in the
future libaio mode).

Callbacks are issued from the IO threads. It's thus important to keep
processing to a minimum. Callbacks may be issued out of order, and it is the
caller's responsibility to know when its stack has been fully processed so it
may reclaim the memory.
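
One cheap way for a caller to track completion when callbacks arrive out of
order is an atomic countdown per submitted stack. A minimal sketch using C11
atomics; struct io_batch and both function names are hypothetical and not part
of extstore:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-stack bookkeeping owned by the caller. */
struct io_batch {
    atomic_int outstanding; /* IO objects not yet called back */
};

/* Call once, before submitting a stack of n IO objects. */
static void io_batch_init(struct io_batch *b, int n) {
    atomic_init(&b->outstanding, n);
}

/* Call from each IO callback. Returns true for exactly one caller: the one
 * that completed the last object. That caller may then hand the batch back
 * to its owner so the memory can be reclaimed. */
static bool io_batch_complete_one(struct io_batch *b) {
    /* atomic_fetch_sub returns the value held before the subtraction. */
    return atomic_fetch_sub(&b->outstanding, 1) == 1;
}
```

The callback itself does a single atomic operation, which keeps the work done
on the IO thread to a minimum.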

With DIRECT_IO support, buffers submitted for read/write will need to be
aligned with posix_memalign() or similar.
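
A minimal sketch of such an allocation. The helper name alloc_aligned is
hypothetical, and the alignment to pass in is an assumption left to the
caller: the real requirement depends on the device and filesystem (often the
logical block size).

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* for posix_memalign() */
#endif
#include <stdlib.h>
#include <string.h>

/* Allocate a zeroed buffer of at least len bytes whose address and size are
 * both multiples of align. align must be a power of two and a multiple of
 * sizeof(void *). Returns NULL on failure; release with free(). */
static void *alloc_aligned(size_t align, size_t len) {
    void *p = NULL;
    /* Round the size up so whole-block reads and writes stay in bounds. */
    size_t rounded = (len + align - 1) / align * align;
    if (posix_memalign(&p, align, rounded) != 0)
        return NULL;
    memset(p, 0, rounded);
    return p;
}
```

For example, alloc_aligned(4096, 100) would return a 4096 byte buffer on a
4096 byte boundary.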

Buckets
-------

During extstore_init(), a number of active buckets is specified. Pages are
handled overall as a global pool, but writes can be redirected to specific
active pages.

This allows a lot of flexibility, e.g.:

1) an idea of "high TTL" and "low TTL" being two buckets. TTL < 86400
(one day) goes into bucket 0, rest into bucket 1. Co-locating low TTL items
means those pages can reach zero objects and free up more easily.

2) Extended: "low TTL" is one bucket, and then one bucket per slab class.
If TTLs are low, mixed sized objects can go together as they are likely to
expire before cycling out of flash (depending on workload, of course).
For higher TTL items, pages are stored on chunk barriers. This means less
space is wasted as items should fit nearly exactly into write buffers and
pages. It also means you can blindly read items back if the system wants to
free a page and we can indicate to the caller somehow which pages are up for
probation. E.g. issue a read against page 3 version 1 for byte range 0->1MB,
then chunk and look up objects. Then read the next 1MB chunk, etc. If there's
anything we want to keep, pull it back into RAM before the page is freed.
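
The second scheme could be sketched as a small selection function. The bucket
numbering, the pick_bucket name, and the treatment of a zero or negative TTL
as "never expires" are all assumptions made for illustration:

```c
/* Illustrative bucket layout for scheme 2; these numbers are assumptions. */
enum { BUCKET_LOW_TTL = 0, BUCKET_SLAB_BASE = 1 };

#define LOW_TTL_CUTOFF 86400 /* seconds; the cutoff used in example 1 */

/* Pick a bucket for an item. All low TTL items share bucket 0 regardless of
 * size; everything else gets one bucket per slab class. slab_class is
 * assumed to be a zero-based class index. A TTL <= 0 is assumed to mean the
 * item does not expire, so it is treated as high TTL. */
static unsigned int pick_bucket(long ttl_seconds, unsigned int slab_class) {
    if (ttl_seconds > 0 && ttl_seconds < LOW_TTL_CUTOFF)
        return BUCKET_LOW_TTL;
    return BUCKET_SLAB_BASE + slab_class;
}
```

With this layout the number of buckets passed at init time would be one more
than the number of slab classes in use.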

Pages are assigned into buckets on demand, so if you make 30 but use 1 there
will only be a single active page with write buffers.

Memcached integration
---------------------

With the POC: items.c's lru_maintainer_thread issues writes to storage if all
memory has been allocated out to slab classes, and there is less than a
threshold amount of memory free. Original objects are swapped with items
marked with the ITEM_HDR flag. An ITEM_HDR contains copies of the original key
and most of the header data. The ITEM_data() section of an ITEM_HDR object
contains an (item_hdr *), which describes enough information to retrieve the
original object from storage.

To get the best performance it is important that reads can be deeply
pipelined. As much processing as possible is done ahead of time, IOs are
submitted, and once the IOs are done processing a minimal amount of code is
executed before transmit() is possible. This should amortize the latency
incurred by hopping threads and waiting on IO.

Recaching
---------

If a header is hit twice overall, and the second time within ~60s of the first
time, it will have a chance of getting recached. "recache_rate" is a simple
"counter % rate == 0" check. Setting it to 1000 means that in one out of every
1000 instances of an item being hit twice within ~60s, the item will be
recached into memory. Very hot items will get pulled out of storage relatively
quickly.
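
The check itself can be sketched in a few lines. The function name, where the
counter lives, whether it is bumped before or after the test, and treating a
rate of 0 as "never recache" are all assumptions for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical recache decision. counter is shared state bumped on every
 * qualifying hit (a second hit within the ~60s window); rate is the
 * configured recache_rate. A rate of 0 is assumed to disable recaching. */
static bool should_recache(uint64_t *counter, unsigned int rate) {
    if (rate == 0)
        return false;
    /* True on every rate-th qualifying hit. */
    return (++(*counter) % rate) == 0;
}
```

Because the counter is shared across items rather than kept per item, an item
that is hit very often simply gets more chances to land on the winning count.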

Compaction
----------

A target fragmentation limit is set: "0.9", meaning "run compactions if pages
exist which have less than 90% of their bytes used".

This value is slewed based on the number of free pages in the system, and
activates when half of the pages are used. The fraction of pages in use (one
minus the fraction free) is multiplied against the target fragmentation limit,
e.g. with a limit of 0.9: 50% of pages free -> 0.9 * 0.5 -> 0.45. If a page is
64 megabytes, pages with less than 28.8 megabytes used would be targeted for
compaction. If 0 pages are free, anything less than 90% used is targeted,
which means it may have to rewrite 10 pages to free one page.
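
The arithmetic above can be written out as a small function. This is a sketch
that follows the two worked examples (50% free and 0 free), not a copy of
memcached's code; the function and parameter names are made up:

```c
/* Bytes-used threshold below which a page is a compaction candidate.
 * frag_limit is the configured target (e.g. 0.9). Returns 0, meaning
 * "compact nothing", while more than half of the pages are still free. */
static double compact_threshold_bytes(double frag_limit,
                                      unsigned int pages_free,
                                      unsigned int page_count,
                                      double page_size) {
    if (page_count == 0)
        return 0.0;
    double free_frac = (double)pages_free / page_count;
    if (free_frac > 0.5)
        return 0.0;
    /* Slew the limit by the fraction of pages in use. */
    return page_size * frag_limit * (1.0 - free_frac);
}
```

With frag_limit 0.9 and 64 megabyte pages this gives 28.8 megabytes at 50%
free and 57.6 megabytes (90% of a page) once no pages are free.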

In memcached's integration, a second bucket is used for objects rewritten via
the compactor. Potentially, objects that have been around long enough to get
compacted will continue to stick around, so co-locating them could reduce
fragmentation work.

If an exclusive lock is made on a valid object header, the flash locations are
rewritten directly in the object. As of this writing, if an object header is
busy for some reason, the write is dropped (COW needs to be implemented). This
is an unlikely scenario, however.

Objects are read back along the boundaries of a write buffer. If an 8 meg
write buffer is used, 8 megs are read back at once and iterated for objects.

This needs a fair amount of tuning, possibly more throttling. It will still
evict pages if the compactor gets behind.

TODO
----

Sharing my broad TODO items here. While doing the work they get split up
more into local notes. Adding this so others can follow along:

(a bunch of the TODO has been completed and removed)
- O_DIRECT support
- libaio support (requires O_DIRECT)
- JBOD support (not in first pass)
  - 1-2 active pages per device. Potentially dedicated IO threads per device.
    With a RAID setup you risk any individual disk doing a GC pause stalling
    all writes. Could also simply rotate devices on a per-bucket basis.

On the memcached end:
- O_DIRECT support; mostly memalign pages, but also making chunks grow
  aligned to sector sizes once they are >= a single sector.
- specify storage size by MB/G/T/etc instead of page count