Blame doc/journaling.txt

Packit Service 360c39
o  Journaling & Replay

The fundamental problem with a journaled cluster filesystem is
handling journal replay with multiple journals.  A single block of
metadata can be modified sequentially by many different nodes in the
cluster.  As the block is modified by each node, it gets logged in the
journal for each node.  If care is not taken, it's possible to get
into a situation where a journal replay can actually corrupt a
filesystem.  The error scenario is:

1) Node A modifies a metadata block by putting an updated copy into its
   incore log.
2) Node B wants to read and modify the block, so it requests the lock
   and a blocking callback is sent to Node A.
3) Node A flushes its incore log to disk, and then syncs out the
   metadata block to its inplace location.
4) Node A then releases the lock.
5) Node B reads in the block and puts a modified copy into its ondisk
   log and then the inplace block location.
6) Node A crashes.

At this point, Node A's journal needs to be replayed.  Since there is
a newer version of the block inplace, replaying that block from Node
A's journal overwrites Node B's update and corrupts the filesystem.
There are a few different ways of avoiding this problem.
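
The failure can be reproduced with a toy model.  This is only a
sketch, not GFS code: the disk is a dict, each journal is a list of
(block, contents) records, and log_and_write() and naive_replay() are
hypothetical helpers invented for the illustration.

```python
# Toy model of the error scenario above; all names are illustrative.
disk = {}                      # inplace blocks: block number -> contents
journal_a, journal_b = [], []  # one ondisk journal per node


def log_and_write(journal, block, contents):
    """Log a block to a node's journal, then write it inplace."""
    journal.append((block, contents))
    disk[block] = contents


def naive_replay(journal):
    """Replay every journal record inplace with no version check."""
    for block, contents in journal:
        disk[block] = contents


log_and_write(journal_a, 7, "version from A")  # steps 1 and 3
log_and_write(journal_b, 7, "version from B")  # step 5
naive_replay(journal_a)                        # recovery after step 6
print(disk[7])  # prints "version from A": Node B's update is lost
```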

1) Generation Numbers (GFS1)

   Each metadata block has a header in it that contains a 64-bit
   generation number.  As each block is logged into a journal, the
   generation number is incremented.  This provides a strict ordering
   of the different versions of the block as they are logged in the FS'
   different journals.  When journal replay happens, a block in the
   journal is not replayed if the generation number in the journal is
   less than the generation number in place.  This ensures that a newer
   version of a block is never replaced with an older version.  So,
   this solution basically allows multiple copies of the same block in
   different journals, but it allows you to always know which is the
   correct one.
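
   The replay rule can be sketched as follows.  This is a minimal
   model under stated assumptions, not the GFS1 implementation: the
   generation number is stored next to the contents in a tuple rather
   than in a block header, blocks are written inplace as soon as they
   are logged, and log_block() and replay() are hypothetical names.

```python
# Sketch of generation-number replay; names and layout are hypothetical.
disk = {}  # block number -> (generation, contents)


def log_block(journal, block, contents):
    """Bump the block's generation, log it, and write it inplace."""
    generation = disk.get(block, (0, None))[0] + 1
    journal.append((block, generation, contents))
    disk[block] = (generation, contents)


def replay(journal):
    """Replay a journal, skipping records older than the inplace copy."""
    for block, generation, contents in journal:
        inplace_generation = disk.get(block, (0, None))[0]
        if generation < inplace_generation:
            continue  # a newer version is already inplace
        disk[block] = (generation, contents)


journal_a, journal_b = [], []
log_block(journal_a, 7, "version from A")  # generation 1
log_block(journal_b, 7, "version from B")  # generation 2
replay(journal_a)                          # Node A crashed
print(disk[7])  # prints (2, 'version from B')
```

   Note that replay() reads the inplace block before deciding whether
   to write it, which is the read-modify-write step discussed in the
   cons of this method.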

   Pros:

   A) This method allows the fastest callbacks.  To release a lock,
      the incore log for the lock must be flushed and then the inplace
      data and metadata must be synced.  That's it.  The sync
      operations involved are: start the log body and wait for it to
      become stable on the disk, synchronously write the commit block,
      start the inplace metadata and wait for it to become stable on
      the disk.

   Cons:

   A) Maintaining the generation numbers is expensive.  Every newly
      allocated metadata block must be read off the disk in order to
      figure out what the previous value of the generation number was.
      When deallocating metadata, extra work and care must be taken to
      make sure dirty data isn't thrown away in such a way that the
      generation numbers stop ordering the block versions correctly.
   B) You can't continue to modify the filesystem during journal
      replay.  Basically, replay of a block is a read-modify-write
      operation: the block is read from disk, the generation number is
      compared, and (maybe) the new version is written out.  Replay
      requires that the R-M-W operation is atomic with respect to
      other R-M-W operations that might be happening (say by a normal
      I/O process).  Since journal replay doesn't (and can't) play by
      the normal metadata locking rules, you can't count on them to
      protect replay.  Hence, GFS1 quiesces all writes on a filesystem
      before starting replay.  This provides the mutual exclusion
      required, but it's slow and unnecessarily interrupts service on
      the whole cluster.

2) Total Metadata Sync (OCFS2)

   This method is really simple in that it uses exactly the same
   infrastructure that a local journaled filesystem uses.  Every time
   a node receives a callback, it stops all metadata modification,
   syncs out the whole incore journal, syncs out any dirty data, marks
   the journal as being clean (unmounted), and then releases the lock.
   Because the journal is marked as clean and recovery won't look at
   any of the journaled blocks in it, a valid copy of any particular
   block only exists in one journal at a time, and that journal is
   always the journal of the node that modified it last.

   Pros:

   A) Very simple to implement.
   B) You can reuse journaling code from other places (such as JBD).
   C) No quiesce necessary for replay.
   D) No need for generation numbers sprinkled throughout the metadata.

   Cons:

   A) This method has the slowest possible callbacks.  The sync
      operations are: stop all metadata operations, start and wait for
      the log body, write the log commit block, start and wait for all
      the FS' dirty metadata, write an unmount block.  Writing the
      metadata for the whole filesystem can be particularly expensive
      because it can be scattered all over the disk and there can be a
      whole journal's worth of it.

3) Revocation of a lock's buffers (GFS2)

   This method prevents a block from appearing in more than one
   journal by canceling out the metadata blocks in the journal that
   belong to the lock being released.  Journaling works very similarly
   to a local filesystem or to #2 above.

   The biggest difference is you have to keep track of buffers in the
   active region of the ondisk journal, even after the inplace blocks
   have been written back.  This is done in GFS2 by adding a second
   part to the Active Items List.  The first part (in GFS2 called
   AIL1) contains a list of all the blocks which have been logged to
   the journal, but not written back to their inplace location.  Once
   an item in AIL1 has been written back to its inplace location, it
   is moved to AIL2.  Once the tail of the log moves past the block's
   transaction in the log, it can be removed from AIL2.

   When a callback occurs, the log is flushed to the disk and the
   metadata for the lock is synced to disk.  At this point, any
   metadata blocks for the lock that are in the current active region
   of the log will be in the AIL2 list.  We then build a transaction
   that contains revoke tags for each buffer in the AIL2 list that
   belongs to that lock.
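
   The bookkeeping described above can be sketched as a toy model.
   Nothing here is taken from the GFS2 source: the Node class, its
   fields, and the ("data" | "revoke", block, contents) record layout
   are all assumptions made for illustration.  Movement of the log
   tail is not modeled, and the sketch assumes a block can be dropped
   from AIL2 as soon as its revoke has been logged.

```python
# Toy model of the two-part Active Items List; not the GFS2 structures.
from dataclasses import dataclass, field


@dataclass
class Node:
    log: list = field(default_factory=list)    # ondisk journal records
    ail1: set = field(default_factory=set)     # logged, not written back
    ail2: set = field(default_factory=set)     # written back, still in log
    owner: dict = field(default_factory=dict)  # block -> protecting lock

    def log_block(self, lock, block, contents):
        """Log a block under a lock; it now awaits write-back (AIL1)."""
        self.log.append(("data", block, contents))
        self.owner[block] = lock
        self.ail2.discard(block)
        self.ail1.add(block)

    def write_back(self, disk, lock):
        """Sync the lock's logged blocks inplace; move them AIL1 -> AIL2."""
        for block in [b for b in self.ail1 if self.owner[b] == lock]:
            # The newest logged copy of the block is the one to write.
            disk[block] = next(c for kind, b, c in reversed(self.log)
                               if kind == "data" and b == block)
            self.ail1.remove(block)
            self.ail2.add(block)

    def release_lock(self, disk, lock):
        """Callback: sync the lock's metadata, then log revokes for it."""
        self.write_back(disk, lock)
        revoked = sorted(b for b in self.ail2 if self.owner[b] == lock)
        self.log.extend(("revoke", b, None) for b in revoked)
        # Assumption of this model: revoked blocks need no more tracking.
        self.ail2.difference_update(revoked)
        return revoked


disk = {}
node_a = Node()
node_a.log_block("lock1", 7, "version from A")
node_a.log_block("lock2", 9, "other block")
print(node_a.release_lock(disk, "lock1"))  # prints [7]
```

   Only block 7 is revoked; block 9 belongs to a lock that Node A
   still holds, so it stays in AIL1 and remains replayable.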

   Pros:

   A) No quiesce necessary for replay.
   B) No need for generation numbers sprinkled throughout the
      metadata.
   C) The sync operations are: stop all metadata operations, start and
      wait for the log body, write the log commit block, start and
      wait for all the FS' dirty metadata, start and wait for the log
      body of a transaction that revokes any of the lock's metadata
      buffers in the journal's active region, and write the commit
      block for that transaction.

   Cons:

   A) Recovery takes two passes, one to find all the revoke tags in
      the log and one to replay the metadata blocks using the revoke
      tags as a filter.  This is necessary for a local filesystem and
      the total sync method, too.  It's just that there will probably
      be more tags.
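
   The two passes can be sketched as below.  The record layout and the
   recover() helper are hypothetical, and the sketch assumes that a
   revoke tag cancels only the records for that block that precede it
   in the log, so a block logged again after its revoke is still
   replayed.

```python
# Sketch of two-pass replay with revoke filtering; layout is assumed.
def recover(log, disk):
    """Replay `log` onto `disk`, honoring revoke tags."""
    # Pass 1: find, for each block, the position of its last revoke tag.
    last_revoke = {}
    for position, (kind, block, _) in enumerate(log):
        if kind == "revoke":
            last_revoke[block] = position
    # Pass 2: replay only data records logged after the last revoke.
    for position, (kind, block, contents) in enumerate(log):
        if kind == "data" and position > last_revoke.get(block, -1):
            disk[block] = contents


disk = {7: "version from B"}  # Node B's newer copy is already inplace
log_a = [("data", 7, "version from A"),
         ("revoke", 7, None),
         ("data", 9, "still owned by A")]
recover(log_a, disk)
print(disk)  # prints {7: 'version from B', 9: 'still owned by A'}
```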

Comparing #2 and #3, both do extra I/O during a lock callback to make
sure that any metadata blocks in the log for that lock will be
removed.  I believe #2 will be slower because syncing out all the
dirty metadata for the entire filesystem requires lots of little,
scattered I/O across the whole disk.  The extra I/O done by #3 is a
log write to the disk.  So, not only should it be less I/O, but it
should also be better suited to get good performance out of the disk
subsystem.

KWP 07/06/05

Further notes (Steven Whitehouse)
-------------

Number 3 is slow due to having to do two write/wait transactions
in the log each time we release a glock. So far as I can see there
is no way around that, but it should be possible, if we so wish, to
change to using #2 at some future date and still remain backward
compatible. So that option is open to us, but I'm not sure that we
want to take it yet. There may well be other ways to speed things
up in this area. More work remains to be done.