$ mv a/foo b/foo
In this case, the pointer to foo's inode needs to be removed from directory a, and added to directory b, involving two disk block writes. If we remove the pointer from directory a first, we risk losing the file foo on a crash. On the other hand, if we add the pointer to directory b first, we risk an inconsistency, where the same file foo appears in two directories (this may not be allowed depending on the filesystem semantics).
In this case, on a reboot after a crash, the fsck program will notice the inconsistency, but will not know how to resolve it, i.e., should it remove the pointer from directory a or from directory b? At such points, the fsck program can ask the user to resolve the inconsistency by suggesting a choice.
Let's see how long fsck takes. A random read on disk takes about 10 milliseconds. Descending the directory hierarchy might involve a random read per inode. So maybe (n_inodes / 100) seconds? This can be made faster if all inodes (and directory blocks) are read sequentially (in one go), and the hierarchy is then descended in memory.
fsck takes around 10 minutes per 70GB disk w/ 2 million inodes. Clearly this performance is possible only by reading many inodes sequentially. The time taken by fsck is roughly linear in the size of the disk. As disks grew in size, the cost of running fsck started becoming impractical. We will next study a more modern method of ensuring crash recovery, called logging.
But before that, note that ordering of disk writes, as we have discussed it so far, would involve synchronous writes to disk. A write-back cache could potentially reorder the disk writes (i.e., writes to buffers in the cache may be in a different order from the actual writes to the disk). This may not be acceptable from a crash-recovery standpoint. On the other hand, a write-through buffer cache is unacceptable from a performance standpoint. (Recall that creating a file and writing a few bytes takes 8 writes, i.e., roughly 80 ms, so we can create only about a dozen small files per second; think about the performance of un-tar or rm *.)
One solution is to maintain a dependency graph in memory for the buffer
cache. The dependency graph indicates an ordering dependency between flushes
of different blocks to disk. For example, for the command,
mv a/foo b/foo,
a dependency edge should be drawn from directory b's block to a's block, indicating that b's block should be flushed to disk before a's block.
However, consider the following example, where the dependency graph can contain a cycle:
$ mv a/foo b/foo
$ mv b/bar a/bar
The first command draws a dependency edge from b to a, while the second command draws a dependency edge from a to b. At this point, no matter which block is flushed first, there is a chance that one file may get lost (either foo or bar). The solution to this problem is to identify potential cycles in the dependency graph (at the time of making in-memory updates), and avoid them by flushing existing dirty buffers to disk. In this example, before the second command is executed, the OS should flush the blocks for directories a and b (thus eliminating the dependency edge between them) so that the addition of a new dependency edge does not cause a cycle.
begin_trans()
bp = bread()
bp->data = ...
log_write(bp)
// more writes ...
commit_trans()
In xv6, the log is written to disk and committed at commit_trans() time. This allows only 4-5 disk writes to get batched at a time. It would be much better if hundreds of disk writes got batched together.
Fixing xv6 Logging (as done in Linux ext3):
Each system call's updates are bracketed by begin_atomic_op() and commit_atomic_op(). The commit of an atomic operation does not cause a commit of the whole transaction.