o  Journaling & Replay

The fundamental problem with a journaled cluster filesystem is
handling journal replay with multiple journals.  A single block of
metadata can be modified sequentially by many different nodes in the
cluster.  As the block is modified by each node, it gets logged in the
journal for each node.  If care is not taken, it's possible to get
into a situation where a journal replay can actually corrupt a
filesystem.  The error scenario is:

1) Node A modifies a metadata block by putting an updated copy into its
   incore log.
2) Node B wants to read and modify the block so it requests the lock
   and a blocking callback is sent to Node A.
3) Node A flushes its incore log to disk, and then syncs out the
   metadata block to its inplace location.
4) Node A then releases the lock.
5) Node B reads in the block and puts a modified copy into its ondisk
   log and then the inplace block location.
6) Node A crashes.

At this point, Node A's journal needs to be replayed.  Since there is
a newer version of the block in place, if that block is replayed, the
filesystem will be corrupted.  There are a few different ways of
avoiding this problem.

1) Generation Numbers (GFS1)

   Each metadata block has a header in it that contains a 64-bit
   generation number.  As each block is logged into a journal, the
   generation number is incremented.  This provides a strict ordering
   of the different versions of the block as they are logged in the FS'
   different journals.  When journal replay happens, a block in the
   journal is not replayed if the generation number in the journal is
   less than the generation number in place.  This ensures that a newer
   version of a block is never replaced with an older version.  So,
   this solution basically allows multiple copies of the same block in
   different journals, but it allows you to always know which is the
   correct one.
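   The replay filter described above can be sketched in C.  The
   structure layout and names below are illustrative assumptions, not
   GFS1's actual on-disk format:

```c
#include <stdint.h>
#include <string.h>

/* Illustrative block-with-header layout; not GFS1's real format. */
struct meta_header {
    uint64_t generation;       /* bumped each time the block is logged */
};

struct block {
    struct meta_header hdr;
    char data[64];
};

/* Replay one journaled copy of a block: skip it if the in-place copy
 * carries a newer generation number, otherwise write it back. */
static int replay_block(struct block *inplace, const struct block *journaled)
{
    if (journaled->hdr.generation < inplace->hdr.generation)
        return 0;              /* in-place version is newer: don't replay */
    memcpy(inplace, journaled, sizeof(*inplace));
    return 1;                  /* journaled version replayed */
}
```

   Because the check is per-block, replaying the same journal twice is
   harmless: the second pass finds equal-or-newer generations in place.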


   Pros:

   A) This method allows the fastest callbacks.  To release a lock,
      the incore log for the lock must be flushed and then the inplace
      data and metadata must be synced.  That's it.  The sync
      operations involved are: start the log body and wait for it to
      become stable on the disk, synchronously write the commit block,
      start the inplace metadata and wait for it to become stable on
      the disk.


   Cons:

   A) Maintaining the generation numbers is expensive.  All newly
      allocated metadata blocks must be read off the disk in order to
      figure out what the previous value of the generation number was.
      When deallocating metadata, extra work and care must be taken to
      make sure dirty data isn't thrown away in such a way that the
      generation numbers stop being a reliable ordering.
   B) You can't continue to modify the filesystem during journal
      replay.  Basically, replay of a block is a read-modify-write
      operation: the block is read from disk, the generation number is
      compared, and (maybe) the new version is written out.  Replay
      requires that the R-M-W operation is atomic with respect to
      other R-M-W operations that might be happening (say by a normal
      I/O process).  Since journal replay doesn't (and can't) play by
      the normal metadata locking rules, you can't count on them to
      protect replay.  Hence, GFS1 quiesces all writes on a filesystem
      before starting replay.  This provides the mutual exclusion
      required, but it's slow and unnecessarily interrupts service on
      the whole cluster.

2) Total Metadata Sync (OCFS2)

   This method is really simple in that it uses exactly the same
   infrastructure that a local journaled filesystem uses.  Every time
   a node receives a callback, it stops all metadata modification,
   syncs out the whole incore journal, syncs out any dirty data, marks
   the journal as being clean (unmounted), and then releases the lock.
   Because the journal is marked clean, recovery won't look at any of
   the journaled blocks in it; a valid copy of any particular block
   exists in only one journal at a time, and that journal is always
   the journal of the node that modified it last.
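   As a toy model, the callback sequence might be expressed like this;
   the state, names, and single-threaded simplification are all
   assumptions for illustration, not OCFS2's real code:

```c
/* Toy in-memory model of the total-sync callback (method #2). */
enum journal_state { JOURNAL_DIRTY, JOURNAL_CLEAN };

struct node {
    enum journal_state journal;
    int dirty_metadata;        /* metadata blocks not yet in place */
    int holds_lock;
};

/* Blocking callback: flush the whole journal and all dirty metadata,
 * mark the journal clean (unmounted), then drop the lock. */
static void total_sync_callback(struct node *n)
{
    /* 1. stop metadata modification (implicit in this toy model)   */
    /* 2. sync the whole incore journal and all dirty data/metadata */
    n->dirty_metadata = 0;
    /* 3. mark the journal clean so recovery will skip it entirely  */
    n->journal = JOURNAL_CLEAN;
    /* 4. release the lock */
    n->holds_lock = 0;
}
```

   The key invariant is step 3: once the journal is marked clean, no
   block in it can be replayed, so the block's only valid journaled
   copy lives wherever it is logged next.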


   Pros:

   A) Very simple to implement.
   B) You can reuse journaling code from other places (such as JBD).
   C) No quiesce necessary for replay.
   D) No need for generation numbers sprinkled throughout the metadata.


   Cons:

   A) This method has the slowest possible callbacks.  The sync
      operations are: stop all metadata operations, start and wait for
      the log body, write the log commit block, start and wait for all
      the FS' dirty metadata, write an unmount block.  Writing the
      metadata for the whole filesystem can be particularly expensive
      because it can be scattered all over the disk and there can be a
      whole journal's worth of it.

3) Revocation of a lock's buffers (GFS2)

   This method prevents a block from appearing in more than one
   journal by canceling out the metadata blocks in the journal that
   belong to the lock being released.  Journaling works very similarly
   to a local filesystem or to #2 above.

   The biggest difference is that you have to keep track of buffers in the
   active region of the ondisk journal, even after the inplace blocks
   have been written back.  This is done in GFS2 by adding a second
   part to the Active Items List.  The first part (in GFS2 called
   AIL1) contains a list of all the blocks which have been logged to
   the journal, but not written back to their inplace location.  Once
   an item in AIL1 has been written back to its inplace location, it
   is moved to AIL2.  Once the tail of the log moves past the block's
   transaction in the log, it can be removed from AIL2.
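   The AIL1 -> AIL2 lifecycle can be sketched as below.  The list
   helpers stand in for the kernel's struct list_head, and the field
   and function names are illustrative, not GFS2's actual ones:

```c
/* Toy doubly-linked list standing in for the kernel's struct list_head. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_del(struct list_head *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
}

static void list_add_tail(struct list_head *e, struct list_head *h)
{
    e->prev = h->prev;
    e->next = h;
    h->prev->next = e;
    h->prev = e;
}

static int list_empty(const struct list_head *h) { return h->next == h; }

/* A logged buffer; 'lsn' marks its transaction's position in the log. */
struct ail_buf {
    struct list_head list;     /* must stay the first member (see casts) */
    unsigned long long lsn;
};

struct journal {
    struct list_head ail1;     /* logged, not yet written in place */
    struct list_head ail2;     /* written in place, still in active log */
};

/* The buffer's in-place write has completed: move it AIL1 -> AIL2. */
static void ail1_to_ail2(struct journal *j, struct ail_buf *b)
{
    list_del(&b->list);
    list_add_tail(&b->list, &j->ail2);
}

/* The log tail has advanced past 'tail_lsn': drop stale AIL2 entries. */
static void ail2_trim(struct journal *j, unsigned long long tail_lsn)
{
    struct list_head *p = j->ail2.next;
    while (p != &j->ail2) {
        struct ail_buf *b = (struct ail_buf *)p;
        p = p->next;
        if (b->lsn < tail_lsn)
            list_del(&b->list);
    }
}
```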

   When a callback occurs, the log is flushed to the disk and the
   metadata for the lock is synced to disk.  At this point, any
   metadata blocks for the lock that are in the current active region
   of the log will be in the AIL2 list.  We then build a transaction
   that contains revoke tags for each buffer in the AIL2 list that
   belongs to that lock.
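   Building the revoke transaction might look like the following
   sketch.  ail2_entry, lock_id, and build_revokes are hypothetical
   names, and the real code walks its AIL2 list rather than an array:

```c
#include <stddef.h>

/* Hypothetical flattened view of AIL2: each entry remembers which
 * lock its buffer belongs to. */
struct ail2_entry {
    unsigned long long blkno;  /* on-disk block number */
    int lock_id;               /* owning lock, simplified to an int */
};

/* Collect a revoke tag for every AIL2 buffer owned by 'lock_id'.
 * Returns the number of tags written into 'revokes'. */
static size_t build_revokes(const struct ail2_entry *ail2, size_t n,
                            int lock_id, unsigned long long *revokes)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (ail2[i].lock_id == lock_id)
            revokes[count++] = ail2[i].blkno;
    return count;
}
```

   The resulting tags are then logged as one more transaction before
   the lock is released, canceling the lock's earlier journal entries.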


   Pros:

   A) No quiesce necessary for replay.
   B) No need for generation numbers sprinkled throughout the
      metadata.
   C) The sync operations are: stop all metadata operations, start and
      wait for the log body, write the log commit block, start and
      wait for all the FS' dirty metadata, start and wait for the log
      body of a transaction that revokes any of the lock's metadata
      buffers in the journal's active region, and write the commit
      block for that transaction.


   Cons:

   A) Recovery takes two passes, one to find all the revoke tags in
      the log and one to replay the metadata blocks using the revoke
      tags as a filter.  This is necessary for a local filesystem and
      the total sync method, too.  It's just that there will probably
      be more tags.
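   The two passes can be sketched as below.  This is a deliberately
   simplified model that treats the whole active log region as one
   array and applies a revoke to every copy of the block; it ignores
   ordering subtleties of a real log:

```c
#include <stddef.h>

/* One record in the active region of the log: either a journaled
 * metadata block or a revoke tag for a block number. */
struct log_rec {
    int is_revoke;
    unsigned long long blkno;
};

/* Pass 1 gathers revoke tags; pass 2 replays only unrevoked blocks.
 * Returns how many blocks would be written to their in-place location. */
static size_t replay_two_pass(const struct log_rec *rec, size_t n,
                              unsigned long long *revoked, size_t *nrevoked)
{
    size_t replayed = 0;
    *nrevoked = 0;

    /* Pass 1: find all the revoke tags in the log. */
    for (size_t i = 0; i < n; i++)
        if (rec[i].is_revoke)
            revoked[(*nrevoked)++] = rec[i].blkno;

    /* Pass 2: replay metadata blocks, using the tags as a filter. */
    for (size_t i = 0; i < n; i++) {
        if (rec[i].is_revoke)
            continue;
        int skip = 0;
        for (size_t r = 0; r < *nrevoked; r++)
            if (revoked[r] == rec[i].blkno)
                skip = 1;
        if (!skip)
            replayed++;        /* would write the block in place here */
    }
    return replayed;
}
```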

Comparing #2 and #3, both do extra I/O during a lock callback to make
sure that any metadata blocks in the log for that lock will be
removed.  I believe #2 will be slower because syncing out all the
dirty metadata for the entire filesystem requires lots of little,
scattered I/O across the whole disk.  The extra I/O done by #3 is a
log write to the disk.  So, not only should it be less I/O, but it
should also be better suited to get good performance out of the disk.

KWP 07/06/05

Further notes (Steven Whitehouse)

Number 3 is slow due to having to do two write/wait transactions
in the log each time we release a glock. So far as I can see there
is no way around that, but it should be possible, if we so wish, to
change to using #2 at some future date and still remain backward
compatible. So that option is open to us, but I'm not sure that we
want to take it yet. There may well be other ways to speed things
up in this area. More work remains to be done.