Storage system notes
--------------------

extstore.h defines the API.

extstore_write() is a synchronous call which memcpy's the input buffer into a
write buffer for an active page. A failure is usually not a hard failure; it
indicates the caller can try again later, e.g. the engine might be busy
freeing pages or assigning new ones.

As of this writing the write() implementation doesn't have an internal retry
loop, so it can return spurious failures (which is useful for testing the
integration).
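
The caller-side pattern this implies can be sketched as below. This is a minimal illustration only: stub_extstore_write() is a stand-in that simulates the spurious-failure behavior, and the function names and EXTSTORE_* return codes are assumptions, not the real extstore.h API.

```c
#include <stddef.h>

/* Hypothetical soft-failure return codes for illustration. */
#define EXTSTORE_OK    0
#define EXTSTORE_BUSY -1

/* Stub standing in for extstore_write(): pretend the engine is busy
 * (freeing or assigning pages) for the first few calls. */
static int fail_countdown = 2;

static int stub_extstore_write(const void *buf, size_t len) {
    (void)buf; (void)len;
    if (fail_countdown-- > 0)
        return EXTSTORE_BUSY; /* soft failure: try again later */
    return EXTSTORE_OK;       /* data copied into an active write buffer */
}

/* Caller treats failure as "retry later", not as a fatal error. */
static int write_with_retry(const void *buf, size_t len, int max_tries) {
    for (int i = 0; i < max_tries; i++) {
        if (stub_extstore_write(buf, len) == EXTSTORE_OK)
            return 0;
        /* a real caller would typically requeue the item and move on */
    }
    return -1;
}
```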

extstore_read() is an asynchronous call which takes a stack of IO objects and
adds it to the end of a queue, then signals the IO thread to run. Once an IO
stack is submitted, the caller must not touch the submitted objects anymore
(they are relinked internally).
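
A rough sketch of what "a stack of IO objects" looks like: requests chained through a next pointer and handed over as one unit. The struct and function names here are illustrative, not the real extstore.h types.

```c
#include <stddef.h>

/* Illustrative IO request: chained via ->next into a stack. */
typedef struct io_sketch {
    struct io_sketch *next;
    int page_id;
    size_t offset, len;
} io_sketch;

/* Walk a submitted stack and count entries. A real engine would instead
 * relink these onto its internal queue and signal the IO thread; after
 * submission the caller must not touch the objects. */
static int submit_stack(io_sketch *head) {
    int n = 0;
    for (io_sketch *io = head; io != NULL; io = io->next)
        n++;
    return n;
}
```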

extstore_delete() is a synchronous call which informs the storage engine that
an item has been removed from its page. It's important to call this as items
are actively deleted or passively reaped due to TTL expiration, since it
allows the engine to intelligently reclaim pages.

The IO threads execute each object in turn (or in bulk, if running in the
future libaio mode).

Callbacks are issued from the IO threads, so it's important to keep callback
processing to a minimum. Callbacks may be issued out of order, and it is the
caller's responsibility to know when its stack has been fully processed so it
may reclaim the memory.

With DIRECT_IO support, buffers submitted for read/write will need to be
aligned with posix_memalign() or similar.
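
For example, an aligned allocation might look like the following. The 4096-byte alignment is an assumption (a common logical sector size); the real requirement comes from the underlying device and filesystem.

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate a buffer suitable for O_DIRECT-style I/O: posix_memalign
 * guarantees the returned address is a multiple of 'align'. */
static void *alloc_direct_buf(size_t align, size_t len) {
    void *buf = NULL;
    if (posix_memalign(&buf, align, len) != 0)
        return NULL;
    return buf;
}
```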

Buckets
-------

During extstore_init(), a number of active buckets is specified. Pages are
handled overall as a global pool, but writes can be redirected to specific
active pages.

This allows a lot of flexibility, e.g.:

1) An idea of "high TTL" and "low TTL" being two buckets: TTL < 86400
goes into bucket 0, the rest into bucket 1. Co-locating low TTL items means
those pages can reach zero objects and free up more easily.
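
That policy reduces to a one-line selector. The function and bucket names below are illustrative; only the 86400-second threshold comes from the example above.

```c
/* Hypothetical bucket ids for the two-bucket TTL policy. */
#define LOW_TTL_BUCKET  0
#define HIGH_TTL_BUCKET 1

/* Route items with TTL under one day into the low-TTL bucket so those
 * pages empty out (and free up) together. */
static int choose_bucket(long ttl_seconds) {
    return (ttl_seconds < 86400) ? LOW_TTL_BUCKET : HIGH_TTL_BUCKET;
}
```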

2) Extended: "low TTL" is one bucket, and then one bucket per slab class.
If TTLs are low, mixed-size objects can go together as they are likely to
expire before cycling out of flash (depending on workload, of course).
For higher TTL items, pages are stored on chunk barriers. This means less
space is wasted, as items should fit nearly exactly into write buffers and
pages. It also means you can blindly read items back if the system wants to
free a page, and we can indicate to the caller somehow which pages are up
for probation: e.g. issue a read against page 3 version 1 for byte range
0->1MB, then chunk it and look up objects, then read the next 1MB chunk, and
so on. If there's anything we want to keep, pull it back into RAM before the
page is freed.

Pages are assigned to buckets on demand, so if you create 30 buckets but
only use 1, there will only be a single active page with write buffers.

Memcached integration
---------------------

With the POC: items.c's lru_maintainer_thread issues writes to storage if
all memory has been allocated out to slab classes, and there is less than a
set amount of memory free. Original objects are swapped with items marked
with the ITEM_HDR flag. An ITEM_HDR item contains copies of the original key
and most of the header data. The ITEM_data() section of an ITEM_HDR object
contains an (item_hdr *), which holds enough information to retrieve the
original object from storage.

To get the best performance, it is important that reads can be deeply
pipelined. As much processing as possible is done ahead of time, IOs are
submitted, and once IOs are done processing, a minimal amount of code is
executed before transmit() is possible. This should amortize the latency
incurred by hopping threads and waiting on IO.

Recaching
---------

If a header is hit twice overall, and the second time within ~60s of the
first time, it has a chance of being recached. "recache_rate" is a simple
"counter % rate == 0" check. Setting it to 1000 means that in one out of
every 1000 instances of an item being hit twice within ~60s, it will be
recached into memory. Very hot items will get pulled out of storage
relatively quickly.
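
The check itself is trivial; a sketch follows. The counter bookkeeping is simplified (a single global, no ~60s window tracking); only the "counter % rate == 0" rule comes from the text.

```c
/* Incremented once per "item hit twice within ~60s" event. */
static unsigned int recache_counter = 0;

/* Returns nonzero when this event wins the 1-in-rate recache lottery. */
static int should_recache(unsigned int recache_rate) {
    return (++recache_counter % recache_rate) == 0;
}
```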

Compaction
----------

A target fragmentation limit is set, e.g. "0.9", meaning "run compactions if
pages exist which have less than 90% of their bytes used".

This value is slewed based on the number of free pages in the system, and
activates once half of the pages are used. The fraction of pages in use is
multiplied against the target fragmentation limit, e.g. with a limit of 0.9
and 50% of pages free: 0.9 * 0.5 -> 0.45. If a page is 64 megabytes, pages
with less than 28.8 megabytes used would be targeted for compaction. If 0
pages are free, anything less than 90% used is targeted, which means it may
have to rewrite as many as 10 pages to free one page.
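
The slewed threshold works out as below. The function name is illustrative; the math (limit multiplied by the used fraction, scaled by page size) follows the two worked examples above: 50% used gives 0.9 * 0.5 = 0.45, and 100% used gives the full 0.9 limit.

```c
/* Bytes-used threshold below which a page is targeted for compaction.
 * used_fraction is the fraction of total pages currently in use; as free
 * pages run out, the threshold slews up toward the full frag limit. */
static double compact_threshold_bytes(double frag_limit, double used_fraction,
                                      double page_size_bytes) {
    return frag_limit * used_fraction * page_size_bytes;
}
```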

In memcached's integration, a second bucket is used for objects rewritten
via the compactor. Objects that have stuck around long enough to be
compacted are likely to continue sticking around, so co-locating them can
reduce future fragmentation work.

If an exclusive lock can be taken on a valid object header, the flash
locations are rewritten directly in the object. As of this writing, if an
object header is busy for some reason, the write is dropped (COW needs to be
implemented). This is an unlikely scenario, however.

Objects are read back along the boundaries of a write buffer. If an 8 meg
write buffer is used, 8 megs are read back at once and iterated for objects.

This needs a fair amount of tuning, and possibly more throttling. It will
still evict pages if the compactor gets behind.

TODO
----

Sharing my broad TODO items here. While doing the work they get split up
further into local notes. Adding this so others can follow along:

(a bunch of the TODO has been completed and removed)
- O_DIRECT support
- libaio support (requires O_DIRECT)
- JBOD support (not in first pass)
- 1-2 active pages per device, potentially with dedicated IO threads per
  device. With a RAID setup you risk any individual disk doing a GC pause
  stalling all writes. Could also simply rotate devices on a per-bucket
  basis.

On the memcached end:
- O_DIRECT support; mostly memalign'ing pages, but also making chunks grow
  aligned to sector sizes once they are >= a single sector.
- specify storage size by MB/G/T/etc instead of page count