o  Journaling & Replay

The fundamental problem with a journaled cluster filesystem is
handling journal replay with multiple journals.  A single block of
metadata can be modified sequentially by many different nodes in the
cluster.  As the block is modified by each node, it gets logged in the
journal for each node.  If care is not taken, it's possible to get
into a situation where a journal replay can actually corrupt a
filesystem.  The error scenario is:

1) Node A modifies a metadata block by putting an updated copy into its
   incore log.
2) Node B wants to read and modify the block so it requests the lock
   and a blocking callback is sent to Node A.
3) Node A flushes its incore log to disk, and then syncs out the
   metadata block to its inplace location.
4) Node A then releases the lock.
5) Node B reads in the block and puts a modified copy into its ondisk
   log and then the inplace block location.
6) Node A crashes.

At this point, Node A's journal needs to be replayed.  Since there is
a newer version of the block inplace, replaying Node A's older copy of
that block will corrupt the filesystem.  There are a few different
ways of avoiding this problem.
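
The error scenario can be reproduced with a toy in-memory model.  This
is only a sketch: the dictionary standing in for the disk, the two
journal lists, and the helper names are all hypothetical and do not
correspond to any GFS code.

```python
# Hypothetical model: none of these names come from GFS.
disk = {}                      # block number -> contents at the inplace location
journal_a, journal_b = [], []  # per-node ondisk journals: (block number, contents)

def log_and_write_back(journal, blkno, contents):
    """Log a block in one node's journal, then sync it to its inplace location."""
    journal.append((blkno, contents))
    disk[blkno] = contents

def naive_replay(journal):
    """Replay every journaled block unconditionally (the unsafe approach)."""
    for blkno, contents in journal:
        disk[blkno] = contents

log_and_write_back(journal_a, 7, "A's version")  # steps 1, 3 and 4
log_and_write_back(journal_b, 7, "B's version")  # step 5
naive_replay(journal_a)                          # step 6: A crashed, replay its journal

# Node B's newer update has been overwritten by Node A's stale copy.
corrupted = disk[7] == "A's version"
```

Each of the three methods below is a different way of making sure
this stale copy is never written over the newer one.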

1) Generation Numbers (GFS1)

   Each metadata block has a header in it that contains a 64-bit
   generation number.  As each block is logged into a journal, the
   generation number is incremented.  This provides a strict ordering
   of the different versions of the block as they are logged in the
   FS' different journals.  When journal replay happens, a block in
   the journal is not replayed if the generation number in the journal
   is less than the generation number in place.  This ensures that a
   newer version of a block is never replaced with an older version.
   So, this solution basically allows multiple copies of the same
   block in different journals, but it allows you to always know which
   is the correct one.
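
   The replay rule can be sketched as follows.  This is a minimal
   illustration under assumed names (Block, log_block, replay); it is
   not the GFS1 implementation, and it ignores locking and the ondisk
   header layout.

```python
from dataclasses import dataclass

@dataclass
class Block:
    generation: int   # stands in for the 64-bit generation number in the header
    payload: str

def log_block(journal, inplace, blkno, payload):
    """Bump the generation number, log the new version, and write it back."""
    new = Block(inplace[blkno].generation + 1, payload)
    journal.append((blkno, new))
    inplace[blkno] = new

def replay(journal, inplace):
    """Replay a journal, skipping blocks that are older than the inplace copy."""
    for blkno, logged in journal:
        if logged.generation < inplace[blkno].generation:
            continue  # a newer version is already in place
        inplace[blkno] = logged

inplace = {7: Block(0, "initial")}
journal_a, journal_b = [], []
log_block(journal_a, inplace, 7, "A's version")  # generation 1
log_block(journal_b, inplace, 7, "B's version")  # generation 2
replay(journal_a, inplace)                       # Node A crashed
```

   After the replay, block 7 still holds Node B's version because the
   copy in Node A's journal carries the lower generation number.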

   Pros:

   A) This method allows the fastest callbacks.  To release a lock,
      the incore log for the lock must be flushed and then the inplace
      data and metadata must be synced.  That's it.  The sync
      operations involved are: start the log body and wait for it to
      become stable on the disk, synchronously write the commit block,
      start the inplace metadata and wait for it to become stable on
      the disk.

   Cons:

   A) Maintaining the generation numbers is expensive.  All newly
      allocated metadata blocks must be read off the disk in order to
      figure out what the previous value of the generation number was.
      When deallocating metadata, extra work and care must be taken to
      make sure dirty data isn't thrown away in such a way that the
      generation numbers stop providing a valid ordering.
   B) You can't continue to modify the filesystem during journal
      replay.  Basically, replay of a block is a read-modify-write
      operation: the block is read from disk, the generation number is
      compared, and (maybe) the new version is written out.  Replay
      requires that the R-M-W operation is atomic with respect to
      other R-M-W operations that might be happening (say by a normal
      I/O process).  Since journal replay doesn't (and can't) play by
      the normal metadata locking rules, you can't count on them to
      protect replay.  Hence GFS1 quiesces all writes on a filesystem
      before starting replay.  This provides the mutual exclusion
      required, but it's slow and unnecessarily interrupts service on
      the whole cluster.

2) Total Metadata Sync (OCFS2)

   This method is really simple in that it uses exactly the same
   infrastructure that a local journaled filesystem uses.  Every time
   a node receives a callback, it stops all metadata modification,
   syncs out the whole incore journal, syncs out any dirty data, marks
   the journal as being clean (unmounted), and then releases the lock.
   Because the journal is marked as clean, recovery won't look at any
   of the journaled blocks in it.  A valid copy of any particular
   block therefore only exists in one journal at a time, and that
   journal is always the journal of the node that modified it last.
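
   A rough model of this behaviour is shown below.  The Node class,
   the clean_upto index standing in for the unmount block, and the
   recover helper are all assumptions made for illustration; they are
   not taken from OCFS2 or JBD.

```python
class Node:
    def __init__(self, disk):
        self.disk = disk      # shared inplace storage: blkno -> contents
        self.journal = []     # this node's journal records: (blkno, contents)
        self.clean_upto = 0   # records before this index precede an unmount block

    def modify(self, blkno, contents):
        """Log a metadata change in this node's journal."""
        self.journal.append((blkno, contents))

    def callback(self):
        """Sync all logged metadata in place, then mark the journal clean."""
        for blkno, contents in self.journal[self.clean_upto:]:
            self.disk[blkno] = contents
        self.clean_upto = len(self.journal)  # models writing the unmount block

def recover(node):
    """Replay only the records logged after the last unmount block."""
    replayed = node.journal[node.clean_upto:]
    for blkno, contents in replayed:
        node.disk[blkno] = contents
    return replayed

disk = {}
a, b = Node(disk), Node(disk)
a.modify(7, "A's version")
a.callback()                 # Node A syncs everything and releases the lock
b.modify(7, "B's version")
b.callback()                 # Node B later syncs its own update
replayed = recover(a)        # Node A crashes: nothing is left to replay
```

   Because Node A's journal was marked clean before the lock moved to
   Node B, recovery of Node A finds no records to replay and Node B's
   version of block 7 stays in place.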

   Pros:

   A) Very simple to implement.
   B) You can reuse journaling code from other places (such as JBD).
   C) No quiesce necessary for replay.
   D) No need for generation numbers sprinkled throughout the metadata.

   Cons:

   A) This method has the slowest possible callbacks.  The sync
      operations are: stop all metadata operations, start and wait for
      the log body, write the log commit block, start and wait for all
      the FS' dirty metadata, write an unmount block.  Writing the
      metadata for the whole filesystem can be particularly expensive
      because it can be scattered all over the disk and there can be a
      whole journal's worth of it.

3) Revocation of a lock's buffers (GFS2)

   This method prevents a block from appearing in more than one
   journal by canceling out the metadata blocks in the journal that
   belong to the lock being released.  Journaling works very similarly
   to a local filesystem or to #2 above.

   The biggest difference is you have to keep track of buffers in the
   active region of the ondisk journal, even after the inplace blocks
   have been written back.  This is done in GFS2 by adding a second
   part to the Active Items List.  The first part (in GFS2 called
   AIL1) contains a list of all the blocks which have been logged to
   the journal, but not written back to their inplace location.  Once
   an item in AIL1 has been written back to its inplace location, it
   is moved to AIL2.  Once the tail of the log moves past the block's
   transaction in the log, it can be removed from AIL2.

   When a callback occurs, the log is flushed to the disk and the
   metadata for the lock is synced to disk.  At this point, any
   metadata blocks for the lock that are in the current active region
   of the log will be in the AIL2 list.  We then build a transaction
   that contains revoke tags for each buffer in the AIL2 list that
   belongs to that lock.
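
   The bookkeeping can be sketched like this.  The Journal class and
   its method names are hypothetical, the log tail is not modelled,
   and dropping a block from AIL2 once its revoke has been logged is
   an assumption of this sketch rather than something stated above.

```python
class Journal:
    def __init__(self):
        self.ail1 = {}   # blkno -> lock: logged, not yet written back inplace
        self.ail2 = {}   # blkno -> lock: written back, still in the active region
        self.log = []    # ondisk log records: ("block", blkno) or ("revoke", blkno)

    def log_block(self, blkno, lock):
        """Log a metadata block on behalf of a lock; it now awaits write-back."""
        self.log.append(("block", blkno))
        self.ail2.pop(blkno, None)
        self.ail1[blkno] = lock

    def write_back(self, lock):
        """Sync the lock's metadata inplace: its blocks move from AIL1 to AIL2."""
        for blkno in [b for b, l in self.ail1.items() if l == lock]:
            self.ail2[blkno] = self.ail1.pop(blkno)

    def release(self, lock):
        """Callback: sync the lock's metadata, then log revokes for its AIL2 blocks."""
        self.write_back(lock)
        revoked = sorted(b for b, l in self.ail2.items() if l == lock)
        for blkno in revoked:
            self.log.append(("revoke", blkno))
            del self.ail2[blkno]   # assumption: revoked blocks leave AIL2
        return revoked

j = Journal()
j.log_block(7, "lock-x")
j.log_block(9, "lock-y")
revoked = j.release("lock-x")   # only lock-x's buffers are revoked
```

   Only the blocks belonging to the released lock are revoked; block
   9, which belongs to a different lock, stays in AIL1 untouched.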

   Pros:

   A) No quiesce necessary for replay.
   B) No need for generation numbers sprinkled throughout the
      metadata.
   C) The sync operations are: stop all metadata operations, start and
      wait for the log body, write the log commit block, start and
      wait for all the FS' dirty metadata, start and wait for the log
      body of a transaction that revokes any of the lock's metadata
      buffers in the journal's active region, and write the commit
      block for that transaction.

   Cons:

   A) Recovery takes two passes, one to find all the revoke tags in
      the log and one to replay the metadata blocks using the revoke
      tags as a filter.  This is necessary for a local filesystem and
      the total sync method, too.  It's just that there will probably
      be more tags.
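
   A two-pass recovery might look roughly like the sketch below.  The
   record format and the recover function are hypothetical, and the
   rule that a revoke only cancels copies logged before it is an
   assumption of this sketch, not something the text above specifies.

```python
def recover(log, disk):
    """Replay a journal given as a list of records, oldest first.

    Records are ("block", blkno, contents) or ("revoke", blkno).
    """
    # Pass 1: find the position of the latest revoke tag for each block.
    revoked_at = {}
    for pos, record in enumerate(log):
        if record[0] == "revoke":
            revoked_at[record[1]] = pos

    # Pass 2: replay blocks, using the revoke tags as a filter.
    for pos, record in enumerate(log):
        if record[0] != "block":
            continue
        _, blkno, contents = record
        if revoked_at.get(blkno, -1) > pos:
            continue  # cancelled by a later revoke: leave the inplace copy alone
        disk[blkno] = contents
    return disk

disk = {7: "B's version", 9: "old"}
log_a = [("block", 7, "A's version"), ("block", 9, "A's nine"), ("revoke", 7)]
recover(log_a, disk)
```

   Block 7 was revoked when Node A released its lock, so Node B's
   version survives, while block 9 is replayed from Node A's journal
   as usual.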

Comparing #2 and #3, both do extra I/O during a lock callback to make
sure that any metadata blocks in the log for that lock will be
removed.  I believe #2 will be slower because syncing out all the
dirty metadata for the entire filesystem requires lots of little,
scattered I/O across the whole disk.  The extra I/O done by #3 is a
log write to the disk.  So, not only should it be less I/O, but it
should also be better suited to get good performance out of the disk
subsystem.

KWP 07/06/05

Further notes (Steven Whitehouse)
-------------

Number 3 is slow due to having to do two write/wait transactions
in the log each time we release a glock. So far as I can see there
is no way around that, but it should be possible, if we so wish, to
change to using #2 at some future date and still remain backward
compatible. So that option is open to us, but I'm not sure that we
want to take it yet. There may well be other ways to speed things
up in this area. More work remains to be done.