File systems

Required reading: readi, writei, fileread, filewrite, create, and dirlink, and code related to these calls in fs.c, bio.c, ide.c, file.c, and sysfile.c

Overview

The next 3 lectures are about file systems:

Basic file system implementation
Naming
Performance
Correctness

Users desire to store their data durably, so that data survives when the user turns off their computer. The primary media for doing so are: magnetic disks, flash memory, and tapes. We focus on magnetic disks (e.g., through the IDE interface in xv6).

To allow users to remember where they stored a file, they can assign a symbolic name to a file, which appears in a directory.

The data in a file can be organized in a structured way or not. The structured variant is often called a database. UNIX uses the unstructured variant: files are streams of bytes. Any particular structure is likely to be useful to only a small class of applications, and other applications will have to work hard to fit their data into one of the pre-defined structures. Besides, if you want structure, you can easily write a user-mode library program that imposes that format on any file. The end-to-end argument in action. (Databases have special requirements and support an important class of applications, and thus have a specialized plan.)

The API for a minimal file system consists of: open, read, write, seek, close, and stat. Dup duplicates a file descriptor. For example:

  fd = open("x/y", O_RDWR);
  read (fd, buf, 100);
  write (fd, buf, 512);
  close (fd)

There's a couple of different things going on here.

Names: how do we get from the name "x/y" to the file we're actually talking about. The next lecture will be all about naming. Kernel interface relies on these names a lot.
Directories: provide a way of assigning user-meaningful names to files. Part of the kernel name interface.
In Unix (and almost all other filesystems) there's some notion of an on-disk file that's independent of the file's name, called an "inode". Seen by the user (e.g. stat) but cannot access a given inode.
Disk blocks: the inode translates offsets in our conceptual file into concrete locations on disk. Not seen by the user.
File descriptors: in Unix, the FD binds to the inode, rather than the pathname, so the pathname lookup is done only when the file is first opened. Application handle for an inode in the Unix API.

We're going to talk about files, blocks, and FDs today; names and directories will be next time.

Maintaining the file offset behind the read/write interface is an interesting design decision. The alternative is that the state of a read operation should be maintained by the process doing the reading (i.e., that the pointer should be passed as an argument to read). This argument is compelling in view of the UNIX fork() semantics, which clones a process which shares the file descriptors of its parent.

With offsets in the file descriptor, a read by the parent of a shared file descriptor (e.g., stdin) changes the read pointer seen by the child. This isn't always desirable: for example, consider a data file, in which the program seeks around and reads various records. If we fork(), the child and parent might interfere. On the other hand, the alternative (no offset in FD) would make it difficult to get "(echo one; echo two) > x" right. Easy to implement separate-offsets if kernel provides shared-offsets (re-open file, mostly), but not the other way around.

The file API turned out to be quite a useful abstraction. Unix uses it for many things that aren't just files on local disk, e.g. pipes, devices in /dev, network storage, etc. Plan9 took this further, and a few of those ideas came back to Linux, like the /proc filesystem.

Unix API doesn't specify that the effects of write are immediately on the disk before a write returns. It is up to the implementation of the file system within certain bounds. Choices include (that aren't non-exclusive):

Before the write returns [most conservative, but bad for performance];
Before close returns [AFS];
At some point in the future, if the system stays up (e.g., after 30 seconds) [highest performance];
Application specified (e.g., before fsync returns) [requires careful application coding];
Before external things (screen, network) indicate the write returned [will read paper about this later].

Some examples of file systems:

Contiguous Allocation: file a = (base=10, len=12)
- Pro: Fast sequential access, easy random access
- Con: External Fragmentation/hard to grow files
Linked-list Allocation: file = linked list of blocks
- Pro: Easy to grow file. Free list can be implemented as another file.
- Con: Bad sequential access performance! Unreliable: Loose block, loose rest of file
- Serious Con: Bad random access
DOS FAT: Cute modification to linked list. Links reside in fixed size File Allocation Table (FAT). Still need to do pointer chasing but the entire FAT can fit in memory so chasing cheap.
- Discussion (Entry size = 16 bits):
  - What's the maximum size of the FAT?
  - Given a 512 byte block, what's the maximum size of FS?
  - Option: go to bigger blocks (called "Allocation Unit Size" at Format Time). Pro? Con?
- Space Overhead of FAT: 2 bytes/512 byte block = approx .4%
- Reliability: create duplicate copies of FAT to protect against errors
- Bootstrapping: Where is root directory? Fixed location on disk / have a table in the bootsector (sector 0)
Indexed Files: Each file has an array holding all it's block pointers.
- Pros: Fast random access, easy growth
- Cons: Slow sequential access, growth beyond index size clumsy. Index size?
Multi-level Indexed Files (BSD/Unix/xv6): Direct blocks, Indirect blocks, Double-indirect blocks. Caters to common case of small files. Handles large files.

What makes filesystems hard/interesting? At one level, it's much like any other part of an OS: trying to achieve good performance and provide useful functionality for lots of different applications with very different workload characteristics. However, more specifically:

A design issue is the semantics of a file system operation that requires multiple disk writes. In particular, what happens if the logical update requires writing multiple disks blocks and the power fails during the update? For example, to create a new file, requires allocating an inode (which requires updating the list of free inodes on disk), writing a directory entry to record the allocated i-node under the name of the new file (which may require allocating a new block and updating the directory inode). If the power fails during the operation, the list of free inodes and blocks may be inconsistent with the blocks and inodes in use. Again this is up to implementation of the file system to keep on disk data structures consistent:

Don't worry about it much, but use a recovery program to bring file system back into a consistent state. Linux ext2 and FAT are such examples. xv6 is almost in this category, except that it has no recovery program.
Journaling state that maintains kernel invariants. Never let the file system metadata get into an inconsistent state. In some sense, this resembles how the kernel deals with killing user processes: it cleans up kernel structures, so kernel invariants are upheld, but ignores application invariants. NTFS does this, as does Linux ext3 by default.
Journaling all file system state (kernel-managed structure as well as user-managed data). This tends to be quite a bit more expensive, in terms of performance. Linux's ext3 has a full data journaling mode that behaves in this fashion.

Another design issue is the semantics are of concurrent writes to the same data item. What is the order of two updates that happen at the same time? For example, two processes open the same file and write to it. Modern Unix operating systems allow the application to lock a file to get exclusive access. If file locking is not used and if the file descriptor is shared, then the bytes of the two writes will get into the file in some order (this happens often for log files). If the file descriptor is not shared, the end result is not defined. For example, one write may overwrite the other one (e.g., if they are writing to the same part of the file.) Lots of other examples with directories, names, etc.

An implementation issue is performance, because writing to magnetic disk is relatively expensive compared to computing. Three primary ways to improve performance are: careful file system layout and data structure design that induces few seeks (locality, btrees, logging, etc), an in-memory cache of frequently-accessed blocks (or even prefetching), and overlap I/O with computation so that file operations don't have to wait until their completion and so that that the disk driver has more data to write, which allows disk scheduling. (We will talk about performance in detail later.)

xv6 code examples

xv6 implements a minimal Unix file system interface. xv6 doesn't pay attention to file system layout. It overlaps computation and I/O, but doesn't do any disk scheduling. Its cache is write-through, which simplifies keeping on disk datastructures consistent, but is bad for performance.

On disk files are represented by an inode (struct dinode in fs.h), and blocks. Small files have up to 12 block addresses in their inode; large files use files the last address in the inode as a disk address for a block with 128 disk addresses (512/4). The size of a file is thus limited to 12 * 512 + 128*512 bytes. What would you change to support larger files? (Ans: e.g., double indirect blocks.)

Directories are files with a bit of structure to them. The file contains of records of the type struct dirent. The entry contains the name for a file (or directory) and its corresponding inode number. How many files can appear in a directory? (max file size / sizeof(struct dirent))

In memory files are represented by struct inode in file.h. What is the role of the additional fields in struct inode (dev, inum, refcount, flags (IBUSY, IVALID))? Why don't we embed dinode in the directory? (Links from multiple directories to the same file.)

Kernel allows users to look at the inode state using stat() or fstat(), which returns a struct stat, from stat.h.

What is xv6's disk layout? Who determines how many inodes, blocks, etc. (mkfs.c). How does xv6 keep track of free blocks and inodes? See balloc()/bfree() and ialloc()/ifree(). Is this layout a good one for performance? What are other options?

Block 0 is unused
Block 1 is super block
Inodes start at block 2
Free-block bitmap at block ninodes/IPB + 3
Free blocks after that

Free blocks are tracked using bitmap. Free inodes are tracked by linearly scanning the inode array to find one that is free. Other options (bitmap for inodes, better locality among inodes and blocks)

let's talk about xv6

FS software layers
  system calls
  name ops | file descriptors
  inode ops
  inode cache (really just active inodes)
  transactions
  buffer cache
  disk driver

on-disk layout
  we will view disk as linear array of 512-byte sectors
    though, when thinking about performance, remember concentric tracks
  0: unused
  1: super block (size, ninodes)
  2: array of inodes, packed into blocks
  X: block in-used bitmap (0=free, 1=inuse)
  Y: file/dir content blocks
  Z: log for transactions
  end of disk

on-disk inode
  type (free, file, directory, device)
  nlink
  size
  addrs[12+1]

direct and indirect blocks

each i-node has an i-number
  easy to turn i-number into inode
  location on disk: sector 2 + 64*inum

directory contents
  directory much like a file
    but user can't directly write
  content is array of dirents
  dirent:
    inum
    14-byte file name
  dirent is free if inum is zero

you should view FS as an on-disk data structure
  [tree: dirs, inodes, blocks]
  two allocation pools: inodes and blocks

let's look at xv6 in action
  focus on disk writes, as in homework
  illustrate on-disk data structures via how updated

[hand out the following]

Q: how does xv6 create a file?

$ echo > a
log_write 4 ialloc (44, from create 54)
log_write 4 iupdate (44, from create 54)
log_write 29 writei (47, from dirlink 48, from create 54)
log_write 2 iupdate

Q: what is writei writing?

Q: what is the 2nd iupdate about?

Q: what if there are concurrent calls to ialloc?
   will they get the same inode?
   note bread / bwrite / brelse in ialloc
   bread sheet 39
   diagram of block cache
   focus on B_BUSY and sleep()
   Q: why goto loop?

Q: how does xv6 write data to a file?

$ echo x > a
log_write 28 balloc (43, from bmap 46, from writei 47)
log_write 417 bzero
log_write 417 writei
log_write 4 iupdate
log_write 417 writei
log_write 4 iupdate

Q: why the iupdate?

Q: why *two* iupdates?

Q: how does xv6 delete a file?

$ rm a
log_write 29 writei
log_write 4 iupdate
log_write 28 bfree
log_write 4 iupdate
log_write 4 iupdate

Q: what's in block 29?

Q: what are the iupdates doing?

Q: what is the block cache replacement policy?
   bget and brelse, sheet 39

Q: is that the best replacement policy?

Q: when does xv6 write user data to disk?
   writei 47, bwrite 39, iderw 38 (which sleeps)

Q: is that a good write policy?
   performance?
   correctness? what's the danger?

Q: when does xv6 write meta-data to disk?
   same policy as user data
   is that a good meta-data write policy?
     performance
     correctness (order)
   tune in next lecture...

Q: what if lots of processes need to read the disk? who goes first?
  iderw 38 appends to idequeue
  idestart looks at head of list
  ideintr pops head, starts next
  so FIFO

Q: is FIFO a good disk scheduling policy?
  priority to interactive programs?
  elevator sort?

Q: how fast can an xv6 application read big files?
  contiguous blocks?
  blow a rotation -- no prefetch?

Q: why does it make sense to have a double copy of I/O?
  disk to buffer cache
  buffer cache to user space
  can we fix it to get better performance?

Q: how much RAM should we dedicate to disk buffers?