Crash Recovery

Ordering Disk Writes

The idea: order disk writes so that a crash can never leave a dangling pointer on disk; disk block leaks are acceptable. On reboot after a crash, optionally perform a global filesystem check (fsck) to identify inconsistencies (like space leaks). Because writes were ordered, fsck can assume that inconsistencies are limited to certain types (e.g., space leaks are possible but dangling pointers are not), and if it finds one (e.g., a space leak), it fixes it. But can we always avoid filesystem inconsistencies? Consider the following example:
$ mv a/foo b/foo
In this case, the pointer to foo's inode needs to be removed from directory a, and added to directory b, involving two disk block writes. If we remove the pointer from directory a first, we risk losing the file foo on a crash. On the other hand, if we add the pointer to directory b first, we risk an inconsistency, where the same file foo appears in two directories (this may not be allowed depending on the filesystem semantics).
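
To make the two orderings concrete, here is a toy sketch (purely illustrative; bwrite stands in for one synchronous disk write, and the "block contents" are just labels):

#include <stdio.h>

/* Toy model: each block is just a label, and bwrite stands in for one
   synchronous disk write; only the order of the two writes matters here. */
static void bwrite(const char *block, const char *contents)
{
    printf("write %s: %s\n", block, contents);
}

int main(void)
{
    /* mv a/foo b/foo, in the "add first" order: */
    bwrite("dir b block", "entry foo -> foo's inode added");
    /* a crash here leaves foo reachable from both a and b: an
       inconsistency that fsck can detect but cannot resolve on its own */
    bwrite("dir a block", "entry foo removed");

    /* In the opposite ("remove first") order, a crash between the two
       writes loses the file foo entirely. */
    return 0;
}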

In this case, on a reboot after the crash, the fsck program will notice the inconsistency but will not know how to resolve it: should it remove the pointer from directory a or from directory b? In such cases, fsck can only ask the user to resolve the inconsistency, perhaps suggesting a choice.

Let's see how long fsck takes. A random read on disk takes about 10 milliseconds. Descending the directory hierarchy might involve a random read per inode, so roughly (n-inodes / 100) seconds; for 2 million inodes, that is about 20,000 seconds, i.e., more than five hours. This can be made much faster if all inodes (and directory blocks) are read sequentially in one go, and the hierarchy is then descended in memory.

In practice, fsck takes around 10 minutes on a 70 GB disk with 2 million inodes; clearly, this performance is possible only by reading many inodes sequentially. Even so, the time taken by fsck is roughly linear in the size of the disk, and as disks grew in size, the cost of running fsck became impractical. We will next study a more modern method of ensuring crash recovery, called logging.

But before that: ordering of disk writes, as we have discussed it so far, would involve synchronous writes to disk. A write-back cache could reorder the disk writes (i.e., writes to buffers in the cache may happen in a different order from the actual writes to the disk), which is not acceptable from a crash-recovery standpoint. On the other hand, a write-through buffer cache is unacceptable from a performance standpoint: recall that creating a file and writing a few bytes takes about 8 writes, i.e., roughly 80 ms, so we could create only about a dozen small files per second (think about what this means for un-tar or rm *).
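
As a hedged sketch of the two cache policies (the struct and function names are invented for illustration, not taken from any particular kernel):

#include <stdbool.h>

/* Minimal buffer-cache entry. */
struct buf {
    int  blockno;
    bool dirty;
    unsigned char data[512];
};

/* Stand-in for one synchronous, ~10 ms disk write. */
void disk_write(struct buf *b) { (void)b; }

/* Write-through: every update is pushed to disk immediately, so the disk sees
   updates in exactly the order they were made -- but each small modification
   costs a full disk write (hence ~8 writes, ~80 ms, per small file created). */
void update_write_through(struct buf *b)
{
    disk_write(b);
    b->dirty = false;
}

/* Write-back: just mark the buffer dirty and return; dirty buffers are written
   out later, possibly in a different order than the updates were made, which
   is exactly what makes naive write-back unsafe for crash recovery. */
void update_write_back(struct buf *b)
{
    b->dirty = true;
}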

One solution is to maintain a dependency graph in memory for the buffer cache. The dependency graph indicates an ordering dependency between flushes of different blocks to disk. For example, for the command, mv a/foo b/foo, a dependency edge should be drawn from directory b's block to directory a's block, indicating that b should be flushed before a.
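
A hedged sketch of this idea, extending the buffer structure from the previous sketch with a dependency pointer (a single flush_before_me pointer is a big simplification; real schemes such as BSD's soft updates track much finer-grained dependencies):

#include <stdbool.h>
#include <stddef.h>

struct buf {
    int  blockno;
    bool dirty;
    struct buf *flush_before_me;   /* this buffer must reach disk before us */
    unsigned char data[512];
};

void disk_write(struct buf *b);    /* one synchronous disk write, as before */

/* Flush a buffer, honoring its ordering dependency first. For mv a/foo b/foo
   (done in memory as add-to-b, then remove-from-a), the OS would set
   a->flush_before_me = b, so b's block always reaches disk before a's. */
void flush(struct buf *b)
{
    if (!b->dirty)
        return;
    if (b->flush_before_me != NULL)
        flush(b->flush_before_me);
    disk_write(b);
    b->dirty = false;
    b->flush_before_me = NULL;     /* the ordering requirement is now satisfied */
}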

However, consider the following example, where the dependency graph can contain a cycle:

$ mv a/foo b/foo
$ mv b/bar a/bar
The first command draws a dependency edge from b to a, while the second command draws a dependency edge from a to b. At this point, no matter which block is flushed first, there is a chance that one file may get lost (either foo or bar). The solution to this problem is to identify potential cycles in the dependency graph (at the time of making in-memory updates), and avoid them by flushing existing dirty buffers to disk. In this example, before the second command is executed, the OS should flush the blocks for directories a and b (thus eliminating the dependency edge between them) so that the addition of a new dependency edge does not cause a cycle.
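
Continuing the same sketch, the OS can check for a would-be cycle at the moment it records a new dependency edge, and break it by flushing first (again, the names and the single-pointer chain are illustrative simplifications):

/* Is 'earlier' already required, directly or transitively, to be flushed
   before 'later'? With single pointers, just walk the chain from 'later'. */
bool must_flush_before(struct buf *earlier, struct buf *later)
{
    for (struct buf *b = later->flush_before_me; b != NULL; b = b->flush_before_me)
        if (b == earlier)
            return true;
    return false;
}

/* Record that 'first' must reach disk before 'second'. If the reverse
   requirement already exists -- as after "mv a/foo b/foo; mv b/bar a/bar" --
   adding the edge would create a cycle, so flush the dirty blocks now (which
   also clears the old edge) and only then record the new dependency. */
void add_dependency(struct buf *first, struct buf *second)
{
    if (must_flush_before(second, first)) {
        flush(first);
        flush(second);
    }
    second->flush_before_me = first;
}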

Logging

Goals:

The basic idea behind logging:

Simple logging (as in xv6):
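
The details are only outlined in these notes; as a rough, hedged sketch of the core idea (loosely modeled on xv6's log, with made-up names and a simplified layout; see also the diagram placeholder below): each transaction's block writes first go to a dedicated log area on disk, and a single write of the log header is the commit point.

#define MAXLOG 30

/* Simplified on-disk log header: if n > 0, the first n blocks of the log area
   hold committed copies that still need to be installed at their home
   locations, given by block[i]. */
struct logheader {
    int n;
    int block[MAXLOG];
};

/* Assumed helpers, each built out of synchronous disk writes. */
void write_log_blocks(struct logheader *h);  /* copy modified cached blocks to the log area */
void write_header(struct logheader *h);      /* write the header block itself */
void install_blocks(struct logheader *h);    /* copy log blocks to their home locations */

void commit(struct logheader *h)
{
    if (h->n == 0)
        return;
    write_log_blocks(h);   /* 1. modified blocks go to the log, not the FS tree */
    write_header(h);       /* 2. commit point: one write makes the whole set durable */
    install_blocks(h);     /* 3. now copy the blocks to their real locations */
    h->n = 0;
    write_header(h);       /* 4. erase the log */
}

/* On reboot, recovery redoes steps 3 and 4 if the header says n > 0; a crash
   before step 2 simply discards the transaction, leaving the FS tree intact. */
void recover(struct logheader *h)
{
    if (h->n > 0) {
        install_blocks(h);
        h->n = 0;
        write_header(h);
    }
}
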
[diagram: buffer cache, FS tree on disk, log on disk]

What's wrong with xv6's logging?

Fixing xv6 Logging (as done in Linux ext3):

Closing a transaction:

In doing this, we alleviate many of the problems with xv6's logging: