$ mv a/foo b/foo

In this case, the pointer to foo's inode needs to be removed from directory a, and added to directory b, involving two disk block writes. If we remove the pointer from directory a first, we risk losing the file foo on a crash. On the other hand, if we add the pointer to directory b first, we risk an inconsistency, where the same file foo appears in two directories (this may not be allowed depending on the filesystem semantics).
In this case, on a reboot after a crash, the fsck program will notice the inconsistency, but will not know how to resolve it, i.e., should it remove the pointer from directory a or from directory b? At such points, the fsck program can ask the user to resolve the inconsistency by suggesting a choice.
Let's see how long fsck takes. A random read on disk takes about 10 milliseconds, i.e., about 100 random reads per second. Descending the directory hierarchy might involve a random read per inode, so a full check might take roughly (n_inodes / 100) seconds. This can be made faster if all inodes (and directory blocks) are read sequentially (in one go), and the hierarchy is then descended in memory.
In practice, fsck takes around 10 minutes on a 70GB disk with 2 million inodes. Clearly this performance is possible only by reading many inodes sequentially. The time taken by fsck is roughly linear in the size of the disk, so as disks grew in size, the cost of running fsck became impractical. We will next study a more modern method of ensuring crash recovery, called logging.
But before that, note that ordering of disk writes, as we have discussed it so far, would involve synchronous writes to disk. A write-back cache could potentially reorder the disk writes (i.e., writes to buffers in the cache may be in a different order from the actual writes to the disk). This may not be acceptable from a crash-recovery standpoint. On the other hand, a write-through buffer cache is unacceptable from a performance standpoint. (Recall that creating a file and writing a few bytes takes 8 writes, i.e., roughly 80 ms, so we can create only about a dozen small files per second. Think about what this means for un-tar or rm *.)
One solution is to maintain a dependency graph in memory for the buffer cache. The dependency graph indicates an ordering dependency between flushes of different blocks to disk. For example, for the command mv a/foo b/foo, a dependency edge should be drawn from directory b's block to directory a's block, indicating that b should be flushed before a.
However, consider the following example, where the dependency graph can contain a cycle:

$ mv a/foo b/foo
$ mv b/bar a/bar

The first command draws a dependency edge from b to a, while the second command draws a dependency edge from a to b.
At this point, no matter which block is flushed first, there is a chance
that one file may get lost (either foo or bar). The solution to this problem
is to identify potential cycles in the dependency graph (at the time of making in-memory
updates), and avoid them by flushing existing dirty buffers to disk.
In this example, before the second command is executed, the OS should flush the blocks for directories a and b (thus eliminating the dependency edge between them), so that the addition of a new dependency edge does not cause a cycle.
Recall the structure of a filesystem operation under xv6's logging:

  begin_trans()
  bp = bread(...)
  bp->data[...] = ...
  log_write(bp)
  ... more writes ...
  commit_trans()

The log is committed to disk at commit_trans() time, i.e., once per system call. This allows only 4-5 disk writes to get batched at a time. It would be much better if hundreds of disk writes get batched together.
Fixing xv6 Logging (as done in Linux ext3): replace begin_trans() and commit_trans() with begin_atomic_op() and commit_atomic_op(). The commit of an atomic operation does not cause a commit of the whole transaction. Instead, many atomic operations are batched into a single transaction, which is committed separately (e.g., periodically), independent of any individual commit_atomic_op().