Storage system notes
--------------------

extstore.h defines the API.

extstore_write() is a synchronous call which memcpy's the input buffer into a
write buffer for an active page. A failure is usually not a hard failure; it
indicates the caller can try again later, ie: the engine might be busy freeing
pages or assigning new ones.

As of this writing the write() implementation doesn't have an internal loop,
so it can give spurious failures (which is good for testing the integration).
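
A minimal sketch of how a caller might treat such a failure. The signature
below is an assumption for illustration only; the real entry point and obj_io
layout are defined in extstore.h.

    /* Illustrative declaration only; see extstore.h for the real API. */
    extern int extstore_write(void *engine, unsigned int bucket, void *io);

    /* A non-zero return is usually a soft failure (the engine may be busy
     * freeing or assigning pages), so the caller keeps the item in RAM and
     * tries again rather than treating it as fatal. */
    static int try_flush(void *engine, unsigned int bucket, void *io) {
        for (int attempt = 0; attempt < 3; attempt++) {
            if (extstore_write(engine, bucket, io) == 0)
                return 0; /* data now memcpy'd into an active write buffer */
        }
        return -1; /* still busy; leave the item in memory for a later pass */
    }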

extstore_read() is an asynchronous call which takes a stack of IO objects and
adds it to the end of a queue. It then signals the IO thread to run. Once an
IO stack is submitted the caller must not touch the submitted objects anymore
(they are relinked internally).
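
A sketch of how a caller might build and submit a read stack, using stand-in
names; the real obj_io structure and submit call are defined in extstore.h.

    /* Stand-in for extstore's obj_io; field and function names here are
     * illustrative, not the real definitions. */
    struct io_sketch {
        struct io_sketch *next;     /* IOs are chained into a stack */
        char *buf;                  /* destination buffer for the read */
        unsigned int offset, len;   /* byte range within the page */
        unsigned int page_id, page_version;
        void (*cb)(struct io_sketch *io, int ret); /* completion callback */
    };

    extern void extstore_read(void *engine, struct io_sketch *stack);

    /* Chain the IOs together and hand the whole stack over in one call.
     * After this returns the engine owns (and relinks) the objects, so the
     * caller must not touch them until every callback has fired. */
    static void submit_reads(void *engine, struct io_sketch *ios, int n) {
        for (int i = 0; i < n - 1; i++)
            ios[i].next = &ios[i + 1];
        ios[n - 1].next = NULL;
        extstore_read(engine, ios);
    }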

extstore_delete() is a synchronous call which informs the storage engine that
an item has been removed from the page it was written to. It's important to
call this as items are actively deleted or passively reaped due to TTL
expiration. This allows the engine to intelligently reclaim pages.
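
For example (the parameter list here is an assumption; check extstore.h for
the actual signature), the integration might report each removal along with
its size so the engine can track how full each page still is:

    #include <stdint.h>

    /* Illustrative declaration only; see extstore.h for the real API. */
    extern int extstore_delete(void *engine, unsigned int page_id,
                               uint64_t page_version, unsigned int count,
                               unsigned int bytes);

    /* Called when a flash-backed item is actively deleted or reaped on TTL
     * expiry; reporting the object's size lets the engine notice pages that
     * have drained and reclaim them. */
    static void on_item_removed(void *engine, unsigned int page_id,
                                uint64_t page_version, unsigned int obj_bytes) {
        extstore_delete(engine, page_id, page_version, 1, obj_bytes);
    }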

The IO threads execute each object in turn (or in bulk, if running in the
future libaio mode).

Callbacks are issued from the IO threads. It's thus important to keep
processing to a minimum. Callbacks may be issued out of order, and it is the
caller's responsibility to know when its stack has been fully processed so it
may reclaim the memory.
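
For example, a submitter might track completion with an atomic counter; this
is a sketch of caller-side bookkeeping, not extstore code.

    #include <stdatomic.h>

    /* Hypothetical per-request bookkeeping: callbacks run on the IO threads
     * and may arrive out of order, so do as little as possible here. */
    struct read_batch {
        atomic_int pending;   /* IOs still outstanding for this request */
        /* ... pointers back to the request/connection ... */
    };

    static void io_done_cb(struct read_batch *batch, int ret) {
        /* stash ret (bytes read, or an error code) somewhere cheap ... */
        if (atomic_fetch_sub(&batch->pending, 1) == 1) {
            /* last callback: it is now safe for the owning thread to
             * reclaim the IO objects and continue processing. */
        }
    }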

With DIRECT_IO support, buffers submitted for read/write will need to be
aligned with posix_memalign() or similar.
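
For example (4096 bytes is a common alignment; the real requirement depends on
the device's sector size):

    #include <stdlib.h>

    /* Allocate a buffer suitable for O_DIRECT reads/writes. */
    static void *alloc_dio_buffer(size_t len) {
        void *buf = NULL;
        if (posix_memalign(&buf, 4096, len) != 0)
            return NULL;
        return buf;
    }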

Buckets
-------

During extstore_init(), a number of active buckets is specified. Pages are
handled overall as a global pool, but writes can be redirected to specific
active pages.

This allows a lot of flexibility, ie:

1) "high TTL" and "low TTL" as two buckets: TTL < 86400 goes into bucket 0,
the rest into bucket 1. Co-locating low-TTL items means those pages can reach
zero objects and free up more easily (see the sketch at the end of this
section).

2) Extended: "low TTL" is one bucket, and then one bucket per slab class.
If TTLs are low, mixed-size objects can go together as they are likely to
expire before cycling out of flash (depending on workload, of course).
For higher-TTL items, objects are stored on chunk barriers. This means less
space is wasted, as items should fit nearly exactly into write buffers and
pages. It also means items can be blindly read back when the system wants to
free a page, and we can indicate to the caller somehow which pages are up for
probation. ie: issue a read against page 3 version 1 for byte range 0->1MB,
split it into objects and look them up, then read the next 1MB chunk, etc. If
there's anything we want to keep, pull it back into RAM before the page is
freed.

Pages are assigned into buckets on demand, so if you create 30 buckets but
only use one, there will only be a single active page with write buffers.
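
As a sketch of example 1 above (the cutoff and bucket numbers are just the
values from that example):

    #define LOW_TTL_BUCKET  0
    #define HIGH_TTL_BUCKET 1

    /* Co-locate short-lived items so their pages empty out together and can
     * be reclaimed without compaction. TTL 0 means "never expires". */
    static unsigned int pick_bucket(unsigned int ttl_seconds) {
        return (ttl_seconds > 0 && ttl_seconds < 86400) ? LOW_TTL_BUCKET
                                                        : HIGH_TTL_BUCKET;
    }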

Memcached integration
---------------------

With the POC: items.c's lru_maintainer_thread issues writes to storage if all
memory has been allocated out to slab classes and there is less than a
configured amount of memory free. Original objects are swapped with items
marked with the ITEM_HDR flag. An ITEM_HDR item contains copies of the
original key and most of the header data. The ITEM_data() section of an
ITEM_HDR object contains an (item_hdr *), which holds enough information to
retrieve the original object from storage.
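
Conceptually the stored pointer looks something like the struct below; the
real item_hdr is defined in the memcached source and its exact fields may
differ.

    /* Sketch of the data an ITEM_HDR stub needs to find the object again. */
    typedef struct {
        unsigned int page_id;      /* which extstore page holds the object */
        unsigned int page_version; /* detects pages freed and reused since */
        unsigned int offset;       /* byte offset of the object in the page */
    } item_hdr_sketch;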

To get the best performance it is important that reads can be deeply pipelined.
As much processing as possible is done ahead of time, IO's are submitted, and
once IO's are done processing a minimal amount of code is executed before
transmit() is possible. This should amortize the amount of latency incurred by
hopping threads and waiting on IO.

Recaching
---------

If a header is hit twice overall, and the second time is within ~60s of the
first, it has a chance of being recached. "recache_rate" is a simple
"counter % rate == 0" check: setting it to 1000 means that in one out of every
1000 instances of an item being hit twice within ~60s, the item will be
recached into memory. Very hot items will get pulled out of storage relatively
quickly.
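
The rate check itself is trivial; a sketch:

    /* One out of every recache_rate qualifying double-hits is recached. */
    static int should_recache(unsigned int counter, unsigned int recache_rate) {
        return recache_rate != 0 && (counter % recache_rate) == 0;
    }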

Compaction
----------

A target fragmentation limit is set: "0.9", meaning "run compactions if pages
exist which have less than 90% of their bytes used".

This value is slewed based on the number of free pages in the system, and
activates once half of the pages are in use. The percentage of free pages is
multiplied against the target fragmentation limit, ie: with a limit of 0.9 and
50% of pages free, 0.9 * 0.5 = 0.45. If a page is 64 megabytes, pages with
less than 28.8 megabytes used would be targeted for compaction. If 0 pages are
free, anything less than 90% used is targeted, which in the worst case means
rewriting 10 pages to free one page.
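
The arithmetic from the example above, worked through:

    #include <stdio.h>

    int main(void) {
        double frag_limit = 0.9;    /* target fragmentation limit */
        double free_ratio = 0.5;    /* half of the pages are free */
        double page_size_mb = 64.0; /* page size used in the example */

        double threshold = frag_limit * free_ratio; /* 0.45 */
        printf("compact pages under %.1f MB used (%.0f%% of %g MB)\n",
               threshold * page_size_mb, threshold * 100, page_size_mb);
        /* prints: compact pages under 28.8 MB used (45% of 64 MB) */
        return 0;
    }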

In memcached's integration, a second bucket is used for objects rewritten via
the compactor. Objects that have been around long enough to get compacted will
potentially continue to stick around, so co-locating them could reduce future
fragmentation work.

If an exclusive lock can be taken on a valid object header, the new flash
location is rewritten directly into the object. As of this writing, if an
object header is busy for some reason, the write is dropped (COW needs to be
implemented). This is an unlikely scenario, however.

Objects are read back along the boundaries of a write buffer. If an 8 meg
write buffer is used, 8 megs are read back at once and iterated for objects.

This needs a fair amount of tuning, possibly more throttling. It will still
evict pages if the compactor gets behind.

TODO
----

Sharing my broad TODO items here. While doing the work they get split up
further into local notes. Adding this so others can follow along:

(a bunch of the TODO has been completed and removed)
- O_DIRECT support
- libaio support (requires O_DIRECT)
- JBOD support (not in first pass)
  - 1-2 active pages per device. potentially dedicated IO threads per device.
    with a RAID setup you risk any individual disk doing a GC pause stalling
    all writes. could also simply rotate devices on a per-bucket basis.

on memcached end:
- O_DIRECT support; mostly memalign pages, but also making chunks grow
  aligned to sector sizes once they are >= a single sector.
- specify storage size by MB/G/T/etc instead of page count