o Journaling & Replay

The fundamental problem with a journaled cluster filesystem is
handling journal replay with multiple journals. A single block of
metadata can be modified sequentially by many different nodes in the
cluster. As the block is modified by each node, it gets logged in the
journal for each node. If care is not taken, it's possible to get
into a situation where a journal replay can actually corrupt a
filesystem. The error scenario is:

1) Node A modifies a metadata block by putting an updated copy into its
   incore log.
2) Node B wants to read and modify the block, so it requests the lock
   and a blocking callback is sent to Node A.
3) Node A flushes its incore log to disk, and then syncs out the
   metadata block to its inplace location.
4) Node A then releases the lock.
5) Node B reads in the block and puts a modified copy into its ondisk
   log and then the inplace block location.
6) Node A crashes.

At this point, Node A's journal needs to be replayed. Since there is
a newer version of the block inplace, if that block is replayed, the
filesystem will be corrupted. There are a few different ways of
avoiding this problem.
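
The error scenario can be reproduced with a toy model. This is only a
sketch: the dictionary standing in for the disk, the two journal lists,
and the naive_replay() helper are hypothetical names invented for
illustration, not part of any real filesystem code.

```python
# Hypothetical model: the disk maps block numbers to contents, and each
# node's on-disk journal is a list of (block number, contents) entries.
disk = {7: "v0"}
journal_a, journal_b = [], []


def naive_replay(journal, disk):
    """Copy every journaled block back in place, with no filtering."""
    for blkno, contents in journal:
        disk[blkno] = contents


# Steps 1 and 3: Node A logs an updated copy, flushes it, writes in place.
journal_a.append((7, "v1 from A"))
disk[7] = "v1 from A"
# Step 5: Node B logs a newer copy and writes it in place.
journal_b.append((7, "v2 from B"))
disk[7] = "v2 from B"
# Step 6: Node A crashes, and its journal is replayed.
naive_replay(journal_a, disk)
print(disk[7])  # prints "v1 from A": Node B's newer update is lost
```

Each of the methods below is a different way of stopping that final
replay from overwriting Node B's copy.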

1) Generation Numbers (GFS1)

Each metadata block has a header in it that contains a 64-bit
generation number. As each block is logged into a journal, the
generation number is incremented. This provides a strict ordering
of the different versions of the block as they are logged in the
filesystem's different journals. When journal replay happens, a block
in the journal is not replayed if the generation number in the journal
is less than the generation number in place. This ensures that a newer
version of a block is never replaced with an older version. So,
this solution allows multiple copies of the same block in
different journals, but it always lets you know which is the
correct one.
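
A minimal sketch of the replay rule, assuming a simplified in-memory
model. The Block class and the log_block() and replay() helpers are
hypothetical and do not reflect the GFS1 on-disk format.

```python
from dataclasses import dataclass


@dataclass
class Block:
    generation: int
    payload: str


def log_block(journal, blkno, current, payload):
    """Log a new version of a block, bumping its generation number."""
    new = Block(current.generation + 1, payload)
    journal.append((blkno, new))
    return new


def replay(journal, disk):
    """Replay each journaled block unless the in-place copy is newer."""
    for blkno, logged in journal:
        if logged.generation < disk[blkno].generation:
            continue  # a newer version is already in place
        disk[blkno] = logged


disk = {7: Block(0, "v0")}
journal_a, journal_b = [], []
disk[7] = log_block(journal_a, 7, disk[7], "from A")  # generation 1
disk[7] = log_block(journal_b, 7, disk[7], "from B")  # generation 2
replay(journal_a, disk)  # Node A crashed: its older copy is skipped
```

Note that replay() has to read the in-place block before deciding
whether to write, which is the read-modify-write behaviour discussed
under the cons of this method.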

Pros:

A) This method allows the fastest callbacks. To release a lock,
   the incore log for the lock must be flushed and then the inplace
   data and metadata must be synced. That's it. The sync
   operations involved are: start the log body and wait for it to
   become stable on the disk, synchronously write the commit block,
   start the inplace metadata and wait for it to become stable on
   the disk.

Cons:

A) Maintaining the generation numbers is expensive. Every newly
   allocated metadata block must be read off the disk in order to
   figure out what the previous value of the generation number was.
   When deallocating metadata, extra work and care must be taken to
   make sure dirty data isn't thrown away in such a way that the
   generation numbers no longer order the block's versions correctly.
B) You can't continue to modify the filesystem during journal
   replay. Replay of a block is a read-modify-write operation: the
   block is read from disk, the generation number is compared, and
   (maybe) the new version is written out. Replay requires that the
   R-M-W operation be atomic with respect to other R-M-W operations
   that might be happening (say, by a normal I/O process). Since
   journal replay doesn't (and can't) play by the normal metadata
   locking rules, you can't count on them to protect replay. Hence,
   GFS1 quiesces all writes on a filesystem before starting replay.
   This provides the mutual exclusion required, but it's slow and
   unnecessarily interrupts service on the whole cluster.

2) Total Metadata Sync (OCFS2)

This method is really simple in that it uses exactly the same
infrastructure that a local journaled filesystem uses. Every time
a node receives a callback, it stops all metadata modification,
syncs out the whole incore journal, syncs out any dirty data, marks
the journal as being clean (unmounted), and then releases the lock.
Because the journal is marked as clean, recovery won't look at any
of the journaled blocks in it. So a valid copy of any particular
block only exists in one journal at a time, and that journal is
always the journal of the node that modified it last.
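
The callback path can be sketched as follows. This is a toy model under
stated assumptions: the Node class, its field names, and the clean_upto
index standing in for the "unmounted" marker are all hypothetical, and
the step of stopping metadata modification is not modelled because the
example is single-threaded.

```python
class Node:
    def __init__(self):
        self.incore_log = []   # logged in memory, not yet on disk
        self.journal = []      # on-disk journal entries
        self.clean_upto = 0    # entries before this index are marked clean
        self.dirty = {}        # blkno -> contents not yet written in place

    def modify(self, blkno, contents):
        self.incore_log.append((blkno, contents))
        self.dirty[blkno] = contents

    def flush_log(self):
        self.journal.extend(self.incore_log)
        self.incore_log.clear()

    def callback(self, disk):
        """Total sync before releasing a lock."""
        self.flush_log()                     # sync the whole incore journal
        disk.update(self.dirty)              # sync out everything dirty
        self.dirty.clear()
        self.clean_upto = len(self.journal)  # mark the journal clean


def recover(node, disk):
    """Replay only entries logged since the journal was last marked clean."""
    pending = node.journal[node.clean_upto:]
    for blkno, contents in pending:
        disk[blkno] = contents
    return len(pending)


disk = {7: "v0"}
a, b = Node(), Node()
a.modify(7, "from A")
a.callback(disk)             # A syncs everything, then releases the lock
b.modify(7, "from B")
b.flush_log()
disk.update(b.dirty)         # B's newer copy reaches the inplace location
replayed = recover(a, disk)  # A crashes afterwards: nothing to replay
```

Because Node A's journal was marked clean at the callback, recovery
replays no entries from it and Node B's copy survives.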

Pros:

A) Very simple to implement.
B) You can reuse journaling code from other places (such as JBD).
C) No quiesce necessary for replay.
D) No need for generation numbers sprinkled throughout the metadata.

Cons:

A) This method has the slowest possible callbacks. The sync
   operations are: stop all metadata operations, start and wait for
   the log body, write the log commit block, start and wait for all
   the filesystem's dirty metadata, write an unmount block. Writing
   the metadata for the whole filesystem can be particularly
   expensive because it can be scattered all over the disk and there
   can be a whole journal's worth of it.

3) Revocation of a lock's buffers (GFS2)

This method prevents a block from appearing in more than one
journal by canceling out the metadata blocks in the journal that
belong to the lock being released. Journaling works very similarly
to a local filesystem or to #2 above.

The biggest difference is that you have to keep track of buffers in
the active region of the ondisk journal, even after the inplace blocks
have been written back. This is done in GFS2 by adding a second
part to the Active Items List. The first part (in GFS2 called
AIL1) contains a list of all the blocks which have been logged to
the journal, but not written back to their inplace location. Once
an item in AIL1 has been written back to its inplace location, it
is moved to AIL2. Once the tail of the log moves past the block's
transaction in the log, it can be removed from AIL2.
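
The life cycle of an item on the two lists can be sketched like this.
The ActiveItems class, its method names, and the use of a log sequence
number to stand for a block's transaction are assumptions made for
illustration, not the GFS2 data structures.

```python
class ActiveItems:
    def __init__(self):
        self.ail1 = set()  # (blkno, seq): logged, not yet written in place
        self.ail2 = set()  # (blkno, seq): written in place, still in the
                           # active region of the log

    def logged(self, blkno, seq):
        """The block was logged in the transaction with sequence seq."""
        self.ail1.add((blkno, seq))

    def written_back(self, blkno):
        """The block reached its inplace location: move AIL1 -> AIL2."""
        done = {item for item in self.ail1 if item[0] == blkno}
        self.ail1 -= done
        self.ail2 |= done

    def tail_moved(self, tail_seq):
        """Drop AIL2 items whose transaction is now behind the log tail."""
        self.ail2 = {item for item in self.ail2 if item[1] >= tail_seq}


ail = ActiveItems()
ail.logged(7, seq=1)
ail.logged(9, seq=2)
ail.written_back(7)
after_writeback = (set(ail.ail1), set(ail.ail2))
ail.tail_moved(2)  # the tail has moved past transaction 1
```

After the write-back, block 7 sits on AIL2 and block 9 is still on
AIL1. Once the tail moves past transaction 1, block 7 is no longer
tracked at all.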

When a callback occurs, the log is flushed to the disk and the
metadata for the lock is synced to disk. At this point, any
metadata blocks for the lock that are in the current active region
of the log will be in the AIL2 list. We then build a transaction
that contains revoke tags for each buffer in the AIL2 list that
belongs to that lock.
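
Building the revoke transaction might look like the sketch below. It
assumes the log flush and inplace sync have already happened, so only
the last step is shown. The revoke_for_lock() helper, the (blkno, lock)
pairs, and the journal record layout are hypothetical, and dropping the
revoked buffers from AIL2 afterwards is an assumption of this model.

```python
def revoke_for_lock(lock, ail2, journal):
    """Append one revoke transaction covering the lock's AIL2 buffers.

    ail2 is a list of (blkno, lock) pairs; journal is a list of records.
    """
    revoked = [blkno for blkno, owner in ail2 if owner == lock]
    if revoked:
        journal.append(("revoke", revoked))
    # Assumption: revoked buffers no longer need tracking in AIL2.
    ail2[:] = [(blkno, owner) for blkno, owner in ail2 if owner != lock]
    return revoked


journal = [("block", 7), ("block", 8), ("block", 9)]
ail2 = [(7, "lock-1"), (8, "lock-2"), (9, "lock-1")]
revoked = revoke_for_lock("lock-1", ail2, journal)
```

Only the buffers belonging to the released lock are revoked. Buffers
held under other locks stay on AIL2 and stay replayable.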

Pros:

A) No quiesce necessary for replay.
B) No need for generation numbers sprinkled throughout the
   metadata.
C) The sync operations are: stop all metadata operations, start and
   wait for the log body, write the log commit block, start and
   wait for all the filesystem's dirty metadata, start and wait for
   the log body of a transaction that revokes any of the lock's
   metadata buffers in the journal's active region, and write the
   commit block for that transaction.

Cons:

A) Recovery takes two passes, one to find all the revoke tags in
   the log and one to replay the metadata blocks using the revoke
   tags as a filter. This is necessary for a local filesystem and
   the total sync method, too. It's just that there will probably
   be more tags.
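
A sketch of the two passes, assuming a journal modelled as a list of
(kind, block number, payload) records. The record layout and the rule
that a revoke only cancels entries logged before it are assumptions of
this example, not a description of the GFS2 log format.

```python
def recover(journal, disk):
    """Two-pass replay that uses revoke tags as a filter."""
    # Pass 1: find the position of the last revoke tag for each block.
    revoked_at = {}
    for pos, (kind, blkno, _payload) in enumerate(journal):
        if kind == "revoke":
            revoked_at[blkno] = pos
    # Pass 2: replay metadata blocks not cancelled by a later revoke.
    for pos, (kind, blkno, payload) in enumerate(journal):
        if kind == "block" and revoked_at.get(blkno, -1) < pos:
            disk[blkno] = payload


# Block 7 was logged, written back, and revoked when the lock was
# released; another node has since written a newer copy in place.
journal = [("block", 7, "old from A"),
           ("revoke", 7, None),
           ("block", 9, "from A")]
disk = {7: "newer from B", 9: "stale"}
recover(journal, disk)
```

Block 7 keeps the newer inplace copy because its journal entry was
revoked, while block 9 is replayed as usual.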

Comparing #2 and #3, both do extra I/O during a lock callback to make
sure that any metadata blocks in the log for that lock will be
removed. I believe #2 will be slower because syncing out all the
dirty metadata for the entire filesystem requires lots of little,
scattered I/O across the whole disk. The extra I/O done by #3 is a
log write to the disk. So, not only should it be less I/O, but it
should also be better suited to get good performance out of the disk
subsystem.

KWP 07/06/05

Further notes (Steven Whitehouse)
-------------

Number 3 is slow due to having to do two write/wait transactions
in the log each time we release a glock. So far as I can see there
is no way around that, but it should be possible, if we so wish, to
change to using #2 at some future date and still remain backward
compatible. So that option is open to us, but I'm not sure that we
want to take it yet. There may well be other ways to speed things
up in this area. More work remains to be done.