o Journaling & Replay
The fundamental problem with a journaled cluster filesystem is
handling journal replay with multiple journals. A single block of
metadata can be modified sequentially by many different nodes in the
cluster. As the block is modified by each node, it gets logged in the
journal for each node. If care is not taken, it's possible to get
into a situation where a journal replay can actually corrupt a
filesystem. The error scenario is:
1) Node A modifies a metadata block by putting an updated copy into its
incore log.
2) Node B wants to read and modify the block so it requests the lock
and a blocking callback is sent to Node A.
3) Node A flushes its incore log to disk, and then syncs out the
metadata block to its inplace location.
4) Node A then releases the lock.
5) Node B reads in the block and puts a modified copy into its ondisk
log and then the inplace block location.
6) Node A crashes.
At this point, Node A's journal needs to be replayed. Since there is
a newer version of the block inplace, if that block is replayed, the
filesystem will be corrupted.
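To make the hazard concrete, here is a minimal user-space sketch of
such a naive replay loop. The journal_entry layout is invented for
illustration; a real journal obviously carries far more state.

  #include <stdint.h>
  #include <string.h>

  #define BLOCK_SIZE 4096

  struct journal_entry {
          uint64_t blkno;            /* inplace location of the block */
          uint8_t  data[BLOCK_SIZE]; /* block image as it was logged  */
  };

  /*
   * Naive replay: blindly copy every logged block back to its inplace
   * location.  If another node wrote a newer version of a block after
   * this journal was written (step 5 above), that newer version is
   * silently overwritten and the filesystem is corrupted.
   */
  static void naive_replay(const struct journal_entry *log, size_t n,
                           uint8_t *disk)
  {
          for (size_t i = 0; i < n; i++)
                  memcpy(disk + log[i].blkno * BLOCK_SIZE,
                         log[i].data, BLOCK_SIZE);
  }

There are a few different ways of avoiding this problem.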
1) Generation Numbers (GFS1)
Each metadata block has a header in it that contains a 64-bit
generation number. As each block is logged into a journal, the
generation number is incremented. This provides a strict ordering
of the different versions of the block as they are logged in the FS'
different journals. When journal replay happens, a block in the
journal is not replayed if the generation number in the journal is
less than the generation number inplace. This ensures that a newer
version of a block is never replaced with an older version. So,
this solution basically allows multiple copies of the same block in
different journals, but it allows you to always know which is the
correct one.
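A sketch of that replay decision, assuming a hypothetical header at
the start of every metadata block (real GFS1 headers hold more than
just the generation number):

  #include <stdint.h>
  #include <string.h>

  #define BLOCK_SIZE 4096

  struct meta_header {
          uint64_t generation;  /* bumped each time the block is logged */
  };

  /*
   * Replay a logged block unless the inplace copy is strictly newer.
   * This read-compare-(maybe)write sequence is the R-M-W operation
   * discussed under Cons below; it must be atomic with respect to
   * other writers.
   */
  static void replay_block(const uint8_t *logged, uint8_t *inplace)
  {
          const struct meta_header *jh = (const void *)logged;
          const struct meta_header *ih = (const void *)inplace;

          if (jh->generation >= ih->generation)
                  memcpy(inplace, logged, BLOCK_SIZE);
          /* else: a newer version is already inplace, so skip it */
  }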
Pros:
A) This method allows the fastest callbacks. To release a lock,
the incore log for the lock must be flushed and then the inplace
data and metadata must be synced. That's it. The sync
operations involved are: start the log body and wait for it to
become stable on the disk, synchronously write the commit block,
start the inplace metadata and wait for it to become stable on
the disk.
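As a sketch, the whole release path under this scheme could be
written as below; every type and function name here is hypothetical,
standing in for the real log and sync primitives:

  /* Hypothetical lock type and log/sync primitives. */
  struct lock;
  void log_write_body(struct lock *lk);        /* start the log body       */
  void log_wait_body(struct lock *lk);         /* ... wait for stability   */
  void log_write_commit_sync(struct lock *lk); /* synchronous commit block */
  void meta_write_inplace(struct lock *lk);    /* start inplace metadata   */
  void meta_wait_inplace(struct lock *lk);     /* ... wait for stability   */
  void lock_hand_off(struct lock *lk);

  static void gfs1_release_lock(struct lock *lk)
  {
          log_write_body(lk);
          log_wait_body(lk);
          log_write_commit_sync(lk);
          meta_write_inplace(lk);
          meta_wait_inplace(lk);
          lock_hand_off(lk);      /* now safe to grant the lock elsewhere */
  }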
Cons:
A) Maintaining the generation numbers is expensive. All newly
allocated metadata blocks must be read off the disk in order to
figure out what the previous value of the generation number was.
When deallocating metadata, extra work and care must be taken to
make sure dirty data isn't thrown away in such a way that the
generation number sequence is broken.
B) You can't continue to modify the filesystem during journal
replay. Basically, replay of a block is a read-modify-write
operation: the block is read from disk, the generation number is
compared, and (maybe) the new version is written out. Replay
requires that the R-M-W operation is atomic with respect to
other R-M-W operations that might be happening (say by a normal
I/O process). Since journal replay doesn't (and can't) play by
the normal metadata locking rules, you can't count on them to
protect replay. Hence GFS1 quiesces all writes on a filesystem
before starting replay. This provides the mutual exclusion
required, but it's slow and unnecessarily interrupts service on
the whole cluster.
2) Total Metadata Sync (OCFS2)
This method is really simple in that it uses exactly the same
infrastructure that a local journaled filesystem uses. Every time
a node receives a callback, it stops all metadata modification,
syncs out the whole incore journal, syncs out any dirty data, marks
the journal as being clean (unmounted), and then releases the lock.
Because the journal is marked as clean and recovery won't look at
any of the journaled blocks in it, a valid copy of any particular
block only exists in one journal at a time, and that journal is
always the journal of the node that modified it last.
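A sketch of that callback sequence, with hypothetical helpers
standing in for the journaling layer (JBD in OCFS2's case):

  /* Hypothetical filesystem handle and helpers. */
  struct fs;
  void stop_metadata_ops(struct fs *fs);
  void journal_flush_all(struct fs *fs);  /* sync the whole incore journal */
  void sync_dirty_data(struct fs *fs);
  void journal_mark_clean(struct fs *fs); /* write the "unmounted" record  */
  void lock_hand_off_fs(struct fs *fs);
  void resume_metadata_ops(struct fs *fs);

  /*
   * Once the journal is marked clean, recovery will never replay
   * anything from it, so each block has a valid copy in at most one
   * journal: the last writer's.
   */
  static void blocking_callback(struct fs *fs)
  {
          stop_metadata_ops(fs);
          journal_flush_all(fs);
          sync_dirty_data(fs);
          journal_mark_clean(fs);
          lock_hand_off_fs(fs);
          resume_metadata_ops(fs);
  }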
Pros:
A) Very simple to implement.
B) You can reuse journaling code from other places (such as JBD).
C) No quiesce necessary for replay.
D) No need for generation numbers sprinkled throughout the metadata.
Cons:
A) This method has the slowest possible callbacks. The sync
operations are: stop all metadata operations, start and wait for
the log body, write the log commit block, start and wait for all
the FS' dirty metadata, write an unmount block. Writing the
metadata for the whole filesystem can be particularly expensive
because it can be scattered all over the disk and there can be a
whole journal's worth of it.
3) Revocation of a lock's buffers (GFS2)
This method prevents a block from appearing in more than one
journal by canceling out the metadata blocks in the journal that
belong to the lock being released. Journaling works very similarly
to a local filesystem or to #2 above.
The biggest difference is that you have to keep track of buffers in the
active region of the ondisk journal, even after the inplace blocks
have been written back. This is done in GFS2 by adding a second
part to the Active Items List. The first part (in GFS2 called
AIL1) contains a list of all the blocks which have been logged to
the journal, but not written back to their inplace location. Once
an item in AIL1 has been written back to its inplace location, it
is moved to AIL2. Once the tail of the log moves past the block's
transaction in the log, it can be removed from AIL2.
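A minimal sketch of the two-part Active Items List; the structures
are modeled loosely on GFS2 but heavily simplified:

  /* Minimal doubly-linked list, standing in for the kernel's list_head. */
  struct list_head {
          struct list_head *next, *prev;
  };

  struct buffer {
          struct list_head ail_list; /* links the buffer into AIL1/AIL2 */
          unsigned long    blkno;    /* inplace location                */
          unsigned long    tr_seq;   /* log sequence of its transaction */
  };

  struct ail {
          struct list_head ail1; /* logged, inplace write-back pending  */
          struct list_head ail2; /* written back, but still inside the  */
                                 /* active region of the ondisk journal */
  };

  static void list_move(struct list_head *e, struct list_head *head)
  {
          e->prev->next = e->next;       /* unlink ...           */
          e->next->prev = e->prev;
          e->next = head->next;          /* ... reinsert at head */
          e->prev = head;
          head->next->prev = e;
          head->next = e;
  }

  /* Inplace write-back finished: keep tracking the buffer on AIL2. */
  static void buffer_written_back(struct ail *ail, struct buffer *b)
  {
          list_move(&b->ail_list, &ail->ail2);
  }

  /* The log tail moved past this buffer's transaction: forget it. */
  static void log_tail_passed(struct buffer *b)
  {
          b->ail_list.prev->next = b->ail_list.next;
          b->ail_list.next->prev = b->ail_list.prev;
  }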
When a callback occurs, the log is flushed to the disk and the
metadata for the lock is synced to disk. At this point, any
metadata blocks for the lock that are in the current active region
of the log will be in the AIL2 list. We then build a transaction
that contains revoke tags for each buffer in the AIL2 list that
belongs to that lock.
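Continuing the sketch above, building that revoke transaction might
look like the following (the transaction helpers are again
hypothetical):

  struct lock;
  struct transaction;
  struct transaction *transaction_begin(void);
  void transaction_add_revoke(struct transaction *tr, unsigned long blkno);
  void transaction_commit_sync(struct transaction *tr);
  int  buffer_belongs_to(const struct buffer *b, const struct lock *lk);

  /*
   * Called after the log flush and the metadata sync for the lock:
   * every logged block of the lock still in the journal's active
   * region is now on AIL2, and a revoke tag keeps recovery from ever
   * replaying it.
   */
  static void revoke_lock_buffers(struct ail *ail, const struct lock *lk)
  {
          struct transaction *tr = transaction_begin();
          struct list_head *p;

          for (p = ail->ail2.next; p != &ail->ail2; p = p->next) {
                  /* ail_list is first in struct buffer: cast is safe */
                  const struct buffer *b = (const struct buffer *)p;

                  if (buffer_belongs_to(b, lk))
                          transaction_add_revoke(tr, b->blkno);
          }
          transaction_commit_sync(tr);
  }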
Pros:
A) No quiesce necessary for replay.
B) No need for generation numbers sprinkled throughout the
metadata.
C) The sync operations are: stop all metadata operations, start and
wait for the log body, write the log commit block, start and
wait for all the FS' dirty metadata, start and wait for the log
body of a transaction that revokes any of the lock's metadata
buffers in the journal's active region, and write the commit
block for that transaction.
Cons:
A) Recovery takes two passes, one to find all the revoke tags in
the log and one to replay the metadata blocks using the revoke
tags as a filter (as sketched below). This is necessary for a
local filesystem and the total sync method, too. It's just that
there will probably be more tags.
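A sketch of that two-pass replay; the log-entry layout is invented,
and a real implementation would also compare sequence numbers so
that a revoke only cancels log entries written before it:

  #include <stdbool.h>
  #include <stdint.h>
  #include <stddef.h>

  #define MAX_REVOKES 4096

  struct log_entry {
          bool     is_revoke; /* revoke tag vs. logged metadata block */
          uint64_t blkno;     /* inplace location                     */
          const void *data;   /* block image (NULL for revoke tags)   */
  };

  static uint64_t revoked[MAX_REVOKES];
  static size_t   nrevoked;

  static bool is_revoked(uint64_t blkno)
  {
          for (size_t i = 0; i < nrevoked; i++)
                  if (revoked[i] == blkno)
                          return true;
          return false;
  }

  static void replay(const struct log_entry *log, size_t n,
                     void (*write_inplace)(uint64_t, const void *))
  {
          /* Pass 1: gather every revoke tag in the active region. */
          for (size_t i = 0; i < n; i++)
                  if (log[i].is_revoke && nrevoked < MAX_REVOKES)
                          revoked[nrevoked++] = log[i].blkno;

          /* Pass 2: replay blocks, using the tags as a filter. */
          for (size_t i = 0; i < n; i++)
                  if (!log[i].is_revoke && !is_revoked(log[i].blkno))
                          write_inplace(log[i].blkno, log[i].data);
  }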
Comparing #2 and #3, both do extra I/O during a lock callback to make
sure that any metadata blocks in the log for that lock will be
removed. I believe #2 will be slower because syncing out all the
dirty metadata for the entire filesystem requires lots of little,
scattered I/O across the whole disk. The extra I/O done by #3 is a
log write to the disk. So, not only should it be less I/O, but it
should also be better suited to get good performance out of the disk
subsystem.
KWP 07/06/05
Further notes (Steven Whitehouse)
---------------------------------
Number 3 is slow due to having to do two write/wait transactions
in the log each time we release a glock. So far as I can see there
is no way around that, but it should be possible, if we so wish,
to change to using #2 at some future date and still remain backward
compatible. So that option is open to us, but I'm not sure that we
want to take it yet. There may well be other ways to speed things
up in this area. More work remains to be done.