$ echo > a
  1. allocate an inode (write to inode region)
  2. update the size/status of the inode (write to inode region)
  3. add a directory-entry record to the parent directory's data blocks (write to data block region)
  4. update the parent directory's inode (size)

$ echo x > a
  1. allocate a block (write to block bitmap region)
  2. zero out the block contents (write to data block region)
  3. write "x" to the block (write to data block region)
  4. update "a"'s inode (size, addrs)

$ rm a
  1. zero out "a"'s directory-entry record in the parent directory's data blocks (write to data block region)
  2. update the parent directory's inode (size)
  3. update "a"'s inode to mark it free (write to inode region)
  4. update the free block bitmap to mark "a"'s data blocks free (write to block bitmap region)
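For concreteness, the on-disk structures these writes touch look roughly as follows. This is a sketch modeled on xv6's fs.h (uint/ushort are xv6 typedefs; field names may differ slightly across xv6 versions):

    // Disk layout (xv6-style):
    // | boot | superblock | inode region | block bitmap | data blocks |
    #define NDIRECT 12

    struct dinode {            // on-disk inode (inode region)
      short type;              // file/directory/device; 0 means free
      short major, minor;      // device numbers (devices only)
      short nlink;             // number of directory entries pointing here
      uint  size;              // file size in bytes
      uint  addrs[NDIRECT+1];  // data block addresses, plus one indirect
    };

    struct dirent {            // directory entry (lives in data blocks)
      ushort inum;             // inode number; 0 means this entry is free
      char   name[14];         // file name (xv6's DIRSIZ is 14)
    };

For example, step 3 of "echo > a" appends a struct dirent to the parent directory's data blocks, and step 4 grows the parent's dinode size accordingly.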
Notice that each operation involves 4-8 disk writes, and that these writes go to different regions of the disk, i.e., they involve multiple seeks/rotations. Notice also that we are assuming a write-through cache. Further, the number of outstanding I/Os (in-flight I/Os at the disk controller) is small (typically one), so this is not very efficient use of the disk.
Solution 1: Use a coarse-grained lock for the entire filesystem.
This is bad because it serializes concurrent accesses to different files. Given that file accesses are already very slow, this can be a huge performance problem. Moreover, it restricts the number of outstanding I/Os to a very small number (typically one), as only one thread can be executing within the filesystem at any time.
Question: does the coarse-grained global filesystem lock need to be stored on-disk, or does an in-memory lock suffice? Answer: An in-memory lock suffices, as all threads must access the disk through the OS interfaces, which can synchronize using the in-memory lock.
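To make Solution 1 concrete, here is a minimal sketch, assuming a hypothetical blocking-lock API (lock_acquire/lock_release; these names are not xv6's). The lock must be a blocking lock, since the critical section includes disk I/O:

    // Hypothetical: one in-memory lock serializing the whole filesystem.
    struct lock fslock;

    int sys_read(int fd, char *buf, int n)
    {
      int r;
      lock_acquire(&fslock);    // every filesystem syscall takes this lock
      r = do_read(fd, buf, n);  // may sleep on disk I/O holding fslock
      lock_release(&fslock);
      return r;
    }

Every other filesystem syscall (write, create, unlink, ...) would bracket its work with the same lock, which is exactly what makes this design slow.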
Solution 2: Use fine-grained locks --- in particular, a lock per block. Use the buffer cache to synchronize: mandate that a block can only be accessed by first bringing it into the buffer cache, and then use a lock per buffer in the buffer cache (recall that a buffer corresponds to a disk block). The locks need to be blocking locks, as the critical sections are likely to be very long (they include disk accesses).
xv6 implements per-buffer locks by using a BUSY bit per buffer. Accesses to the BUSY bit are protected by a global buffer cache spinlock, bcache.lock. Thus, to access a block:
  1. Acquire bcache.lock.
  2. Look up the block's buffer in the buffer cache. If the buffer is marked BUSY, call sleep(buffer address, &bcache.lock). This will release bcache.lock. A thread that is currently working on this buffer (and has thus marked it BUSY) should call wakeup after clearing the BUSY bit; on wakeup, retry from the lookup.
  3. If the buffer is not BUSY, mark it BUSY and release bcache.lock. (Notice that I now hold a fine-grained lock on the desired buffer, and so do not need bcache.lock any longer.)
  4. Perform the disk read/write on the buffer. Notice that disk operations are not protected by bcache.lock; instead, disk operations are protected by the per-buffer locks (BUSY bits).
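This protocol is essentially xv6's bget (simplified from bio.c; the code that recycles a non-BUSY buffer for a new block is omitted):

    // Find the buffer for (dev, sector), mark it BUSY, and return it
    // with the per-buffer "lock" (the BUSY bit) held.
    static struct buf*
    bget(uint dev, uint sector)
    {
      struct buf *b;

      acquire(&bcache.lock);              // protects all BUSY bits
    loop:
      for(b = bcache.head.next; b != &bcache.head; b = b->next){
        if(b->dev == dev && b->sector == sector){
          if(!(b->flags & B_BUSY)){
            b->flags |= B_BUSY;           // take the fine-grained lock
            release(&bcache.lock);        // global lock no longer needed
            return b;
          }
          sleep(b, &bcache.lock);         // atomically releases bcache.lock
          goto loop;                      // re-check everything on wakeup
        }
      }
      panic("bget: no buffers");          // (buffer recycling omitted)
    }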
But doesn't the BUSY bit protect only against concurrent accesses to a single block? What about operations that require accesses to multiple blocks? How are they made atomic?
Simple: mark all the blocks (on which an atomic operation needs to be performed) BUSY, just like taking multiple locks. But... we know that fine-grained locking can deadlock. So we need a global order on the setting of BUSY bits for buffers; xv6 follows some global invariants that enforce such an order (check them on your own).
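As an illustration of such an ordering discipline (a sketch of the idea only, not xv6's actual invariant), one deadlock-free rule is to always mark buffers BUSY in increasing block-number order:

    // Sketch: acquire two distinct blocks deadlock-free by always
    // locking the lower-numbered block first (assumes b1 != b2).
    void
    lock_pair(uint dev, uint b1, uint b2, struct buf **p1, struct buf **p2)
    {
      if(b1 < b2){
        *p1 = bget(dev, b1);   // lower block number first...
        *p2 = bget(dev, b2);   // ...then the higher one
      } else {
        *p2 = bget(dev, b2);
        *p1 = bget(dev, b1);
      }
    }

If every thread follows the same order, no cycle of threads waiting on each other's BUSY bits can form.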
xv6 implements LRU for buffer cache replacement.
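The mechanism: brelse moves a buffer to the front of the doubly-linked buffer list each time it is released, so the list stays ordered from most to least recently used, and bget can recycle from the tail. This is essentially xv6's brelse from bio.c:

    void
    brelse(struct buf *b)
    {
      if((b->flags & B_BUSY) == 0)
        panic("brelse");

      acquire(&bcache.lock);

      b->next->prev = b->prev;        // unlink from current position
      b->prev->next = b->next;
      b->next = bcache.head.next;     // relink at the head (MRU end)
      b->prev = &bcache.head;
      bcache.head.next->prev = b;
      bcache.head.next = b;

      b->flags &= ~B_BUSY;            // drop the per-buffer lock
      wakeup(b);                      // wake threads sleeping in bget

      release(&bcache.lock);
    }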
Notice that because xv6 uses a write-through buffer cache, many consistency problems become easy. We will next look at the consistency issues that can arise due to a write-back buffer cache.
Recall that concurrent accesses to the disk are synchronized using idelock. Also, recall that the xv6 IDE device driver allows only one outstanding disk request at any time (the front of the idequeue). Clearly, this can be improved so that there are multiple outstanding disk requests and the disk device can schedule them efficiently (recall the elevator algorithm).
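As a sketch of the idea (not xv6 code; a real elevator also tracks the arm's current position and sweep direction), the driver could insert pending requests into the queue in sector order instead of appending at the tail:

    // Sketch: keep the pending request queue sorted by sector, so
    // requests are serviced in one sweep of the disk arm (SCAN-like).
    struct buf *idequeue;            // linked through b->qnext, as in xv6

    static void
    ide_enqueue_sorted(struct buf *b)
    {
      struct buf **pp;
      for(pp = &idequeue; *pp && (*pp)->sector < b->sector; pp = &(*pp)->qnext)
        ;                            // find the first request past our sector
      b->qnext = *pp;                // splice in, keeping sorted order
      *pp = b;
    }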
How fast can an xv6 application read big files? Even if a file's blocks are contiguous on disk, xv6 does no prefetching: it waits for the first block's read to complete before it issues the command for the second block. In doing so, we waste a full rotation of the disk! A better approach would be to prefetch, or to allow multiple outstanding requests to the disk device.
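A sketch of one-block read-ahead, assuming a hypothetical bread_async that issues a disk request and returns without waiting (xv6 has no such call):

    // Hypothetical: sequential read with one-block read-ahead. By the
    // time the caller asks for sector+1, its I/O is already in flight,
    // so we avoid waiting out a full disk rotation between blocks.
    struct buf*
    bread_prefetch(uint dev, uint sector)
    {
      struct buf *b = bread(dev, sector);  // blocking read, as in xv6
      bread_async(dev, sector + 1);        // hypothetical non-blocking read
      return b;
    }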
Q: Why does it make sense to have a double copy on each I/O? First we read data from disk into the buffer cache; then we copy it from the buffer cache to user space. Can we do better, e.g., pass the user-space buffer directly to the disk device driver? A few issues:
Q: How much RAM should we dedicate to disk buffers? Tradeoff: if the buffer cache is too big, there is less space for virtual-memory pages, and vice-versa. The split can be adapted at runtime based on usage patterns.
Example: file creation involves two main steps (in terms of disk writes): (a) allocate/initialize the file's inode (write to inode region), and (b) add a directory-entry record for the file in its parent directory (write to data block region). If a power failure occurs after the directory entry has been written but before the inode has been initialized, the directory contains a dangling pointer to an unallocated inode.
On the other hand, if at power failure time, the opposite is true, i.e., the inode has been allocated/initialized but its entry has not been created in the parent directory, then there is no problem of a dangling pointer in the filesystem tree. In this case, we have a space leak because an inode has been marked allocated but it is not pointed-to by the filesystem tree, and so it will never be used.
A space leak is more acceptable than a dangling pointer. Using this observation, a filesystem can enforce an order on its disk writes. The order should ensure that there will never be a dangling pointer in the filesystem tree. In this file-creation example, this means that an inode should be initialized before an entry is created for it in its parent directory. Similarly, at file-unlink time, the entry must be removed from the parent directory before the inode is deallocated (if done in the opposite order, a dangling pointer can result).
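A sketch of this ordering at create time, using xv6's real helpers ialloc, iupdate, and dirlink (simplified from create in sysfile.c; locking of the parent directory and error handling are omitted):

    // Write the inode to disk BEFORE the directory entry that points
    // to it. A crash between steps 2 and 3 leaks an inode (acceptable);
    // the reverse order could leave a dangling directory entry.
    struct inode*
    create_ordered(struct inode *dp, char *name)
    {
      struct inode *ip = ialloc(dp->dev, T_FILE);  // 1. allocate inode
      ilock(ip);
      ip->nlink = 1;
      iupdate(ip);                                 // 2. inode hits disk first
      if(dirlink(dp, name, ip->inum) < 0)          // 3. then the parent's
        panic("create_ordered");                   //    directory entry
      return ip;
    }

With a write-through cache, each of these calls reaches the disk before the next begins, so the on-disk order matches the program order.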