o Journaling & Replay

The fundamental problem with a journaled cluster filesystem is handling
journal replay with multiple journals.  A single block of metadata can be
modified sequentially by many different nodes in the cluster.  As the block
is modified by each node, it gets logged in the journal for each node.  If
care is not taken, it's possible to get into a situation where a journal
replay can actually corrupt a filesystem.  The error scenario is:

1) Node A modifies a metadata block by putting an updated copy into its
   incore log.
2) Node B wants to read and modify the block, so it requests the lock and
   a blocking callback is sent to Node A.
3) Node A flushes its incore log to disk, and then syncs out the metadata
   block to its inplace location.
4) Node A then releases the lock.
5) Node B reads in the block and puts a modified copy into its ondisk log
   and then the inplace block location.
6) Node A crashes.

At this point, Node A's journal needs to be replayed.  Since there is a
newer version of the block in place, if that block is replayed, the
filesystem will be corrupted.

There are a few different ways of avoiding this problem.

1) Generation Numbers (GFS1)

Each metadata block has a header that contains a 64-bit generation number.
As each block is logged into a journal, the generation number is
incremented.  This provides a strict ordering of the different versions of
the block as they are logged in the FS' different journals.  When journal
replay happens, a block in the journal is not replayed if the generation
number in the journal is less than the generation number in place.  This
ensures that a newer version of a block is never replaced with an older
version.  So, this solution allows multiple copies of the same block in
different journals, but you always know which one is correct.

Pros:

A) This method allows the fastest callbacks.
   To release a lock, the incore log for the lock must be flushed and then
   the inplace data and metadata must be synced.  That's it.  The sync
   operations involved are: start the log body and wait for it to become
   stable on the disk, synchronously write the commit block, start the
   inplace metadata and wait for it to become stable on the disk.

Cons:

A) Maintaining the generation numbers is expensive.  All newly allocated
   metadata blocks must be read off the disk in order to figure out what
   the previous value of the generation number was.  When deallocating
   metadata, extra work and care must be taken to make sure dirty data
   isn't thrown away in such a way that the generation numbers stop doing
   their job.

B) You can't continue to modify the filesystem during journal replay.
   Basically, replay of a block is a read-modify-write operation: the
   block is read from disk, the generation number is compared, and (maybe)
   the new version is written out.  Replay requires that the R-M-W
   operation is atomic with respect to other R-M-W operations that might
   be happening (say, by a normal I/O process).  Since journal replay
   doesn't (and can't) play by the normal metadata locking rules, you
   can't count on them to protect replay.  Hence GFS1 quiesces all writes
   on a filesystem before starting replay.  This provides the mutual
   exclusion required, but it's slow and unnecessarily interrupts service
   on the whole cluster.

2) Total Metadata Sync (OCFS2)

This method is really simple in that it uses exactly the same
infrastructure as a local journaled filesystem.  Every time a node
receives a callback, it stops all metadata modification, syncs out the
whole incore journal, syncs out any dirty data, marks the journal as
clean (unmounted), and then releases the lock.  Because the journal is
marked as clean, recovery won't look at any of the journaled blocks in
it, so a valid copy of any particular block exists in only one journal at
a time, and that journal is always the one belonging to the node that
modified it last.
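The clean-journal invariant that makes this scheme safe can be sketched
as a toy simulation.  This is purely illustrative; none of the names below
correspond to actual OCFS2 code, and the "disk" is just a shared dict:

```python
# Toy model of the total-metadata-sync scheme (method #2).  All names
# are invented for illustration; this is not OCFS2 code.

class Journal:
    def __init__(self):
        self.entries = []   # (block_nr, data) pairs logged by this node
        self.clean = False  # "unmounted" marker: recovery skips a clean journal

class Node:
    def __init__(self, disk):
        self.disk = disk        # shared dict: block_nr -> data
        self.journal = Journal()
        self.dirty = {}         # incore dirty metadata not yet in place

    def modify(self, block_nr, data):
        self.journal.clean = False
        self.journal.entries.append((block_nr, data))
        self.dirty[block_nr] = data

    def lock_callback(self):
        """Release path: sync ALL dirty metadata to its inplace location,
        then mark the journal clean before handing the lock over."""
        for nr, data in self.dirty.items():
            self.disk[nr] = data        # the scattered, expensive I/O
        self.dirty.clear()
        self.journal.clean = True       # write the unmount record

def replay(disk, journal):
    """Crash recovery: a clean journal is ignored entirely."""
    if journal.clean:
        return
    for nr, data in journal.entries:
        disk[nr] = data
```

If Node A modifies a block, answers a callback, and Node B then modifies
the same block and crashes, replaying A's journal is a no-op (it is
clean), so only B's journal, the one that modified the block last, is
replayed.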
Pros:

A) Very simple to implement.
B) You can reuse journaling code from other places (such as JBD).
C) No quiesce necessary for replay.
D) No need for generation numbers sprinkled throughout the metadata.

Cons:

A) This method has the slowest possible callbacks.  The sync operations
   are: stop all metadata operations, start and wait for the log body,
   write the log commit block, start and wait for all the FS' dirty
   metadata, write an unmount block.  Writing the metadata for the whole
   filesystem can be particularly expensive because it can be scattered
   all over the disk and there can be a whole journal's worth of it.

3) Revocation of a lock's buffers (GFS2)

This method prevents a block from appearing in more than one journal by
canceling out the metadata blocks in the journal that belong to the lock
being released.  Journaling works very similarly to a local filesystem or
to #2 above.

The biggest difference is that you have to keep track of buffers in the
active region of the ondisk journal, even after the inplace blocks have
been written back.  This is done in GFS2 by adding a second part to the
Active Items List.  The first part (in GFS2 called AIL1) contains a list
of all the blocks which have been logged to the journal, but not written
back to their inplace location.  Once an item in AIL1 has been written
back to its inplace location, it is moved to AIL2.  Once the tail of the
log moves past the block's transaction in the log, it can be removed from
AIL2.

When a callback occurs, the log is flushed to the disk and the metadata
for the lock is synced to disk.  At this point, any metadata blocks for
the lock that are in the current active region of the log will be in the
AIL2 list.  We then build a transaction that contains revoke tags for
each buffer in the AIL2 list that belongs to that lock.

Pros:

A) No quiesce necessary for replay.
B) No need for generation numbers sprinkled throughout the metadata.
C) The sync operations are: stop all metadata operations, start and wait
   for the log body, write the log commit block, start and wait for all
   the FS' dirty metadata, start and wait for the log body of a
   transaction that revokes any of the lock's metadata buffers in the
   journal's active region, and write the commit block for that
   transaction.

Cons:

A) Recovery takes two passes: one to find all the revoke tags in the log
   and one to replay the metadata blocks, using the revoke tags as a
   filter.  This is necessary for a local filesystem and the total sync
   method, too.  It's just that there will probably be more tags.

Comparing #2 and #3, both do extra I/O during a lock callback to make
sure that any metadata blocks in the log for that lock will be removed.
I believe #2 will be slower because syncing out all the dirty metadata
for the entire filesystem requires lots of little, scattered I/O across
the whole disk.  The extra I/O done by #3 is a log write to the disk.
So, not only should it be less I/O, but it should also be better suited
to getting good performance out of the disk subsystem.

KWP 07/06/05

Further notes (Steven Whitehouse)
---------------------------------

Number 3 is slow due to having to do two write/wait transactions in the
log each time we release a glock.  So far as I can see there is no way
around that, but it should be possible, if we so wish, to change to
using #2 at some future date and still remain backward compatible.  So
that option is open to us, but I'm not sure that we want to take it yet.
There may well be other ways to speed things up in this area.  More work
remains to be done.
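The two-pass, revoke-filtered recovery described for method #3 can be
sketched as a toy simulation.  This is a model of the idea only, not GFS2
code: a journal is a list of entries that are either logged blocks or
revoke tags, and replay skips any block that a revoke tag cancels:

```python
# Toy model of revoke-based recovery (method #3).  Entry layout and names
# are invented for illustration; this is not the GFS2 on-disk format.
# Each journal entry is a tuple: ("block", block_nr, data) for a logged
# metadata block, or ("revoke", block_nr, None) for a revoke tag.

def two_pass_replay(disk, journal):
    """Pass 1 collects revoke tags; pass 2 replays only unrevoked blocks."""
    revoked = set()
    for kind, block_nr, _ in journal:       # pass 1: find the revoke tags
        if kind == "revoke":
            revoked.add(block_nr)
    for kind, block_nr, data in journal:    # pass 2: filtered replay
        if kind == "block" and block_nr not in revoked:
            disk[block_nr] = data
```

In the error scenario from the top of this document, Node A's callback
would have logged a revoke tag for the block before releasing the lock,
so replaying A's journal after the crash leaves Node B's newer in-place
copy untouched, while A's unrevoked blocks are still replayed normally.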