Tree - source-git/gfs2-utils - CentOS Git server

Blame doc/journaling.txt

Packit	6ef888	`o Journaling & Replay`
Packit	6ef888
Packit	6ef888	`The fundamental problem with a journaled cluster filesystem is`
Packit	6ef888	`handling journal replay with multiple journals. A single block of`
Packit	6ef888	`metadata can be modified sequentially by many different nodes in the`
Packit	6ef888	`cluster. As the block is modified by each node, it gets logged in the`
Packit	6ef888	`journal for each node. If care is not taken, it's possible to get`
Packit	6ef888	`into a situation where a journal replay can actually corrupt a`
Packit	6ef888	`filesystem. The error scenario is:`
Packit	6ef888
Packit	6ef888	`1) Node A modifies a metadata block by putting an updated copy into its`
Packit	6ef888	`incore log.`
Packit	6ef888	`2) Node B wants to read and modify the block so it requests the lock`
Packit	6ef888	`and a blocking callback is sent to Node A.`
Packit	6ef888	`3) Node A flushes its incore log to disk, and then syncs out the`
Packit	6ef888	`metadata block to its inplace location.`
Packit	6ef888	`4) Node A then releases the lock.`
Packit	6ef888	`5) Node B reads in the block and puts a modified copy into its ondisk`
Packit	6ef888	`log and then the inplace block location.`
Packit	6ef888	`6) Node A crashes.`
Packit	6ef888
Packit	6ef888	`At this point, Node A's journal needs to be replayed. Since there is`
Packit	6ef888	`a newer version of the block inplace, if that block is replayed, the`
Packit	6ef888	`filesystem will be corrupted. There are a few different ways of`
Packit	6ef888	`avoiding this problem.`
Packit	6ef888
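The scenario above can be reproduced with a toy model. This is only a sketch: the dictionaries standing in for the disk and the journals, and the helper names log_and_write and naive_replay, are hypothetical and are not taken from any GFS code.

```python
# Toy model of the six-step scenario. All names are hypothetical.
disk = {}                      # inplace blocks: block number -> contents
journals = {"A": [], "B": []}  # ondisk journal per node: (block, contents)

BLOCK = 7

def log_and_write(node, block, contents):
    """Flush a logged copy to the node's journal, then sync it inplace."""
    journals[node].append((block, contents))
    disk[block] = contents

def naive_replay(node):
    """Replay every journaled block unconditionally."""
    for block, contents in journals[node]:
        disk[block] = contents

log_and_write("A", BLOCK, "version written by A")  # steps 1-4
log_and_write("B", BLOCK, "version written by B")  # step 5
naive_replay("A")                                  # step 6: A crashes, its journal is replayed

print(disk[BLOCK])  # A's older copy has overwritten B's update
```

Replaying A's journal without any filtering puts A's stale copy back over B's newer one, which is the corruption the methods below are meant to prevent.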
Packit	6ef888	`1) Generation Numbers (GFS1)`
Packit	6ef888
Packit	6ef888	`Each metadata block has a header in it that contains a 64-bit`
Packit	6ef888	`generation number. As each block is logged into a journal, the`
Packit	6ef888	`generation number is incremented. This provides a strict ordering`
Packit	6ef888	`of the different versions of the block as they are logged in the FS'`
Packit	6ef888	`different journals. When journal replay happens, a block in the`
Packit	6ef888	`journal is not replayed if the generation number in the journal is less`
Packit	6ef888	`than the generation number in place. This ensures that a newer`
Packit	6ef888	`version of a block is never replaced with an older version. So,`
Packit	6ef888	`this solution basically allows multiple copies of the same block in`
Packit	6ef888	`different journals, but it allows you to always know which is the`
Packit	6ef888	`correct one.`
Packit	6ef888
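A minimal sketch of the generation check during replay follows. It assumes a toy disk that maps a block number to a (generation, contents) pair; the function names are hypothetical and this is not the GFS1 implementation.

```python
# Toy model of generation-number replay. All names are hypothetical.
disk = {7: (0, "initial")}     # block number -> (generation, contents)
journals = {"A": [], "B": []}  # per-node journal: (block, generation, contents)

def log_and_write(node, block, contents):
    """Log a new version of a block, bumping its generation number."""
    generation = disk[block][0] + 1  # previous value is read from the inplace copy
    journals[node].append((block, generation, contents))
    disk[block] = (generation, contents)

def replay(node):
    """Replay a journal, skipping entries older than the inplace copy."""
    replayed = []
    for block, generation, contents in journals[node]:
        if generation < disk[block][0]:
            continue  # the inplace copy is newer, leave it alone
        disk[block] = (generation, contents)
        replayed.append(block)
    return replayed

log_and_write("A", 7, "from A")  # generation 1
log_and_write("B", 7, "from B")  # generation 2
skipped_for_a = replay("A")      # nothing is replayed from A's journal
```

Note that both logging and replay have to read the inplace generation number first, which is the read-modify-write cost discussed under the cons below.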
Packit	6ef888	`Pros:`
Packit	6ef888
Packit	6ef888	`A) This method allows the fastest callbacks. To release a lock,`
Packit	6ef888	`the incore log for the lock must be flushed and then the inplace`
Packit	6ef888	`data and metadata must be synced. That's it. The sync`
Packit	6ef888	`operations involved are: start the log body and wait for it to`
Packit	6ef888	`become stable on the disk, synchronously write the commit block,`
Packit	6ef888	`start the inplace metadata and wait for it to become stable on`
Packit	6ef888	`the disk.`
Packit	6ef888
Packit	6ef888	`Cons:`
Packit	6ef888
Packit	6ef888	`A) Maintaining the generation numbers is expensive. All newly`
Packit	6ef888	`allocated metadata blocks must be read off the disk in order to`
Packit	6ef888	`figure out what the previous value of the generation number was.`
Packit	6ef888	`When deallocating metadata, extra work and care must be taken to`
Packit	6ef888	`make sure dirty data isn't thrown away in such a way that the`
Packit	6ef888	`generation numbers no longer order the versions of a block.`
Packit	6ef888	`B) You can't continue to modify the filesystem during journal`
Packit	6ef888	`replay. Basically, replay of a block is a read-modify-write`
Packit	6ef888	`operation: the block is read from disk, the generation number is`
Packit	6ef888	`compared, and (maybe) the new version is written out. Replay`
Packit	6ef888	`requires that the R-M-W operation is atomic with respect to`
Packit	6ef888	`other R-M-W operations that might be happening (say by a normal`
Packit	6ef888	`I/O process). Since journal replay doesn't (and can't) play by`
Packit	6ef888	`the normal metadata locking rules, you can't count on them to`
Packit	6ef888	`protect replay. Hence, GFS1 quiesces all writes on a filesystem`
Packit	6ef888	`before starting replay. This provides the mutual exclusion`
Packit	6ef888	`required, but it's slow and unnecessarily interrupts service on`
Packit	6ef888	`the whole cluster.`
Packit	6ef888
Packit	6ef888	`2) Total Metadata Sync (OCFS2)`
Packit	6ef888
Packit	6ef888	`This method is really simple in that it uses exactly the same`
Packit	6ef888	`infrastructure that a local journaled filesystem uses. Every time`
Packit	6ef888	`a node receives a callback, it stops all metadata modification,`
Packit	6ef888	`syncs out the whole incore journal, syncs out any dirty data, marks`
Packit	6ef888	`the journal as being clean (unmounted), and then releases the lock.`
Packit	6ef888	`Because the journal is marked as clean, recovery won't look at any`
Packit	6ef888	`of the journaled blocks in it. So a valid copy of any particular block`
Packit	6ef888	`only exists in one journal at a time, and that journal is always the`
Packit	6ef888	`journal of the node that modified it last.`
Packit	6ef888
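The callback sequence for this method might be modelled as below. This is a sketch under simplifying assumptions: the Node class, its fields and the recover function are hypothetical, and a single boolean stands in for the unmount block.

```python
# Toy model of the total metadata sync method. All names are hypothetical.
class Node:
    def __init__(self, disk):
        self.disk = disk           # shared inplace blocks
        self.incore_log = []       # logged copies not yet in the ondisk journal
        self.journal = []          # ondisk journal entries
        self.journal_clean = True  # stands in for the unmount block
        self.dirty = {}            # modified blocks not yet written inplace

    def modify(self, block, contents):
        self.incore_log.append((block, contents))
        self.dirty[block] = contents

    def callback(self):
        """Sync everything, mark the journal clean, then release the lock."""
        self.journal_clean = False
        self.journal.extend(self.incore_log)  # sync out the whole incore journal
        self.incore_log.clear()
        self.disk.update(self.dirty)          # sync out all dirty blocks
        self.dirty.clear()
        self.journal_clean = True             # write the unmount block

def recover(node):
    """Replay a crashed node's journal; a clean journal is skipped entirely."""
    if node.journal_clean:
        return 0
    for block, contents in node.journal:
        node.disk[block] = contents
    return len(node.journal)

disk = {}
a, b = Node(disk), Node(disk)
a.modify(7, "from A")
a.callback()          # Node B asked for the lock
b.modify(7, "from B")
b.callback()
replayed = recover(a)  # Node A crashes afterwards
```

Because A's journal was marked clean before the lock was released, recovery never looks at A's stale copy of the block.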
Packit	6ef888	`Pros:`
Packit	6ef888
Packit	6ef888	`A) Very simple to implement.`
Packit	6ef888	`B) You can reuse journaling code from other places (such as JBD).`
Packit	6ef888	`C) No quiesce necessary for replay.`
Packit	6ef888	`D) No need for generation numbers sprinkled throughout the metadata.`
Packit	6ef888
Packit	6ef888	`Cons:`
Packit	6ef888
Packit	6ef888	`A) This method has the slowest possible callbacks. The sync`
Packit	6ef888	`operations are: stop all metadata operations, start and wait for`
Packit	6ef888	`the log body, write the log commit block, start and wait for all`
Packit	6ef888	`the FS' dirty metadata, write an unmount block. Writing the`
Packit	6ef888	`metadata for the whole filesystem can be particularly expensive`
Packit	6ef888	`because it can be scattered all over the disk and there can be a`
Packit	6ef888	`whole journal's worth of it.`
Packit	6ef888
Packit	6ef888	`3) Revocation of a lock's buffers (GFS2)`
Packit	6ef888
Packit	6ef888	`This method prevents a block from appearing in more than one`
Packit	6ef888	`journal by canceling out the metadata blocks in the journal that`
Packit	6ef888	`belong to the lock being released. Journaling works very similarly`
Packit	6ef888	`to a local filesystem or to #2 above.`
Packit	6ef888
Packit	6ef888	`The biggest difference is you have to keep track of buffers in the`
Packit	6ef888	`active region of the ondisk journal, even after the inplace blocks`
Packit	6ef888	`have been written back. This is done in GFS2 by adding a second`
Packit	6ef888	`part to the Active Items List. The first part (in GFS2 called`
Packit	6ef888	`AIL1) contains a list of all the blocks which have been logged to`
Packit	6ef888	`the journal, but not written back to their inplace location. Once`
Packit	6ef888	`an item in AIL1 has been written back to its inplace location, it`
Packit	6ef888	`is moved to AIL2. Once the tail of the log moves past the block's`
Packit	6ef888	`transaction in the log, it can be removed from AIL2.`
Packit	6ef888
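The movement of a block through the two lists can be sketched as follows. The Journal class and its method names are hypothetical, log positions are plain integers, and each block is assumed to be logged at most once while it is in the active region.

```python
# Toy model of the two-part Active Items List. All names are hypothetical.
class Journal:
    def __init__(self):
        self.ail1 = {}  # block -> log position; logged, not yet written inplace
        self.ail2 = {}  # block -> log position; written inplace, still in the active region
        self.head = 0   # next free position in the log

    def log_block(self, block):
        """A block is logged to the journal: it enters AIL1."""
        self.ail1[block] = self.head
        self.head += 1

    def write_back(self, block):
        """The block reaches its inplace location: it moves to AIL2."""
        self.ail2[block] = self.ail1.pop(block)

    def advance_tail(self, new_tail):
        """The tail moves forward: blocks logged before it leave AIL2."""
        self.ail2 = {b: pos for b, pos in self.ail2.items() if pos >= new_tail}

j = Journal()
j.log_block(7)     # position 0
j.log_block(9)     # position 1
j.write_back(7)    # 7 is now in AIL2, 9 is still in AIL1
j.advance_tail(1)  # the tail passes position 0, so 7 is dropped from AIL2
```

AIL2 is what makes revocation possible: it records which already-written blocks still have a copy in the active region of the journal.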
Packit	6ef888	`When a callback occurs, the log is flushed to the disk and the`
Packit	6ef888	`metadata for the lock is synced to disk. At this point, any`
Packit	6ef888	`metadata blocks for the lock that are in the current active region`
Packit	6ef888	`of the log will be in the AIL2 list. We then build a transaction`
Packit	6ef888	`that contains revoke tags for each buffer in the AIL2 list that`
Packit	6ef888	`belongs to that lock.`
Packit	6ef888
Packit	6ef888	`Pros:`
Packit	6ef888
Packit	6ef888	`A) No quiesce necessary for replay.`
Packit	6ef888	`B) No need for generation numbers sprinkled throughout the`
Packit	6ef888	`metadata.`
Packit	6ef888	`C) The sync operations are: stop all metadata operations, start and`
Packit	6ef888	`wait for the log body, write the log commit block, start and`
Packit	6ef888	`wait for all the FS' dirty metadata, start and wait for the log`
Packit	6ef888	`body of a transaction that revokes any of the lock's metadata`
Packit	6ef888	`buffers in the journal's active region, and write the commit`
Packit	6ef888	`block for that transaction.`
Packit	6ef888
Packit	6ef888	`Cons:`
Packit	6ef888
Packit	6ef888	`A) Recovery takes two passes, one to find all the revoke tags in`
Packit	6ef888	`the log and one to replay the metadata blocks using the revoke`
Packit	6ef888	`tags as a filter. This is necessary for a local filesystem and`
Packit	6ef888	`the total sync method, too. It's just that there will probably`
Packit	6ef888	`be more tags.`
Packit	6ef888
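The two-pass recovery can be sketched like this. The journal is modelled as a list of (kind, block, contents) tuples, which is a hypothetical layout. Treating a revoke as cancelling only the entries that precede it in the log is an assumption of this sketch, not something stated above.

```python
# Toy model of two-pass replay with revoke tags. All names are hypothetical.
def recover(journal, disk):
    # Pass 1: find all revoke tags and remember where the latest one is.
    revoked = {}
    for pos, (kind, block, _contents) in enumerate(journal):
        if kind == "revoke":
            revoked[block] = pos

    # Pass 2: replay metadata blocks, using the revoke tags as a filter.
    replayed = []
    for pos, (kind, block, contents) in enumerate(journal):
        if kind != "metadata":
            continue
        if revoked.get(block, -1) > pos:
            continue  # cancelled by a later revoke (assumed ordering rule)
        disk[block] = contents
        replayed.append(block)
    return replayed

# Node A logged blocks 7 and 9, then revoked 7 when it released the lock.
journal_a = [
    ("metadata", 7, "7 from A"),
    ("metadata", 9, "9 from A"),
    ("revoke", 7, None),
]
disk = {7: "7 from B", 9: "old 9"}
replayed = recover(journal_a, disk)
```

Block 7 is filtered out, so Node B's newer copy survives, while block 9, whose lock A never released, is replayed as usual.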
Packit	6ef888	`Comparing #2 and #3, both do extra I/O during a lock callback to make`
Packit	6ef888	`sure that any metadata blocks in the log for that lock will be`
Packit	6ef888	`removed. I believe #2 will be slower because syncing out all the`
Packit	6ef888	`dirty metadata for the entire filesystem requires lots of little,`
Packit	6ef888	`scattered I/O across the whole disk. The extra I/O done by #3 is a`
Packit	6ef888	`log write to the disk. So, not only should it be less I/O, but it`
Packit	6ef888	`should also be better suited to get good performance out of the disk`
Packit	6ef888	`subsystem.`
Packit	6ef888
Packit	6ef888	`KWP 07/06/05`
Packit	6ef888
Packit	6ef888	`Further notes (Steven Whitehouse)`
Packit	6ef888	`-------------`
Packit	6ef888
Packit	6ef888	`Number 3 is slow due to having to do two write/wait transactions`
Packit	6ef888	`in the log each time we release a glock. So far as I can see there`
Packit	6ef888	`is no way around that, but it should be possible, if we so wish, to`
Packit	6ef888	`change to using #2 at some future date and still remain backward`
Packit	6ef888	`compatible. So that option is open to us, but I'm not sure that we`
Packit	6ef888	`want to take it yet. There may well be other ways to speed things`
Packit	6ef888	`up in this area. More work remains to be done.`
Packit	6ef888
