When designing a system we desire clean abstractions and good modularity. We'd like a caller not to have to know how a callee implements a particular function. We'd like to hide locking behavior in this way -- for example, it would be nice if the lock on each file-system inode was the private business of methods on inodes.
It turns out it's hard to use locks in a modular way. Two problems come up that require non-modular global knowledge: multi-module invariants and deadlock.
First, multi-module invariants. You could imagine the code
implementing directories exporting various operations like
dir_lookup(d, name)
,
dir_add(d, name, inumber)
, and
dir_del(d, name)
.
These directory operations would each acquire the lock on d
,
do their work, and release the lock.
Then higher-level code could implement operations like
moving a file from one directory to another (one case of
the UNIX rename()
system call):
move(olddir, oldname, newdir, newname){ inumber = dir_lookup(olddir, oldname) dir_del(olddir, oldname) dir_add(newdir, newname, inumber) }This code is pleasingly simple, but has a number of problems. One of them is that there's a period of time when the file is visible in neither directory. Fixing that, sadly, seems to require that the directory locks not be hidden inside the
dir_
operations, thus:
move(olddir, oldname, newdir, newname){ acquire(olddir.lock) acquire(newdir.lock) inumber = dir_lookup(olddir, oldname) dir_del(olddir, oldname) dir_add(newdir, newname, inumber) release(newdir.lock) release(olddir.lock) }
The above code is ugly in that it exposes the implementation of
directories to move()
, but (if all you have is locks)
you have to do it this way.
The second big problem is deadlock. Notice that move()
holds two locks. What would happen if one process called
move(dirA, ..., dir B, ...)
while another process called
move(dirB, ..., dir A, ...)
?
In an operating system the usual solution is to avoid deadlocks by establishing an order on all locks and making sure that every thread acquires its locks in that order. For the file system the rule might be to lock the directory with the lowest inode first. Picking and obeying the order on all locks requires that modules make public their locking behavior, and requires them to know about other module's locking. This can be painful and error-prone.
Fine-grained locks usually make both of these problems worse.
Locks are a way to implement "mutual exclusion", which is the property that only one CPU at a time can be executing a certain piece of code. This is also called "isolation atomicity".
The code between an acquire() and a release() is sometimes called a "critical section."
The more general idea of ensuring one thread waits for another when neccessary for correctness is called "coordination." People have invented a huge number of coordination primitives; locks are just one example, which we use because they are simple. We'll see some more coordination primitives next.
xchg R,M
instruction is expensive because it usually
involves a bus transaction, to ensure that the read and write both happen
atomically. Here is a computer system diagram to explain this better:
------------- | Main Memory | (containing l->locked) ------------- | | --------|-----------------|----------------|------ --------|-----------------|----------------|------ Bus --------|-----------------|----------------|------ | | | | cache0 cache1 -------- -------- | CPU0 | | CPU1 | -------- --------Each CPU usually has a private L1 cache. Consistency between the caches is maintained using a "cache coherence protocol", whereby the system ensures that each memory location has only one valid value at any time.
Thus if CPU0 access location X, a 100 times (without any intervening access to X by CPU1), the first access will result in a bus transaction (whereby X will be fetched either from the main memory, or from CPU1's cache through the cache coherence protocol). All the other 99 accesses will be served locally from cache0, without incurring a bus transaction. In a multiprocessor system, the bus often becomes a performance bottleneck (especially if the number of CPUs is high), and so reducing the number of bus transactions is an important optimization.
In our implementation of acquire()
and release()
,
we use the atomic xchg
instruction that needs to perform
two memory accesses (one read and one write) in an atomic fashion. This
can only be achieved by a bus transaction, because all other processors
need to be informed that the two memory accesses need to be done atomically.
Hence, if a CPU is spin-waiting inside a lock acquire function, it is
constantly making bus transactions, which can severely affect performance.
void acquire(struct lock *l) { Reg = 1; while (xchg(Reg, &l->locked) == 1); } void release(struct lock *l) { l->locked = 0; }
Instead, it may be better to write the code like this:
void acquire(struct lock *l) { Reg = 1; while (xchg(Reg, &l->locked) == 1) { while (l->locked == 1); } } void release(struct lock *l) { l->locked = 0; }In this case, while waiting, we use a regular memory read instruction (in the inner loop) to test the current value of
l->locked
.
For the time that l->locked
is not changed by the other
CPU, l->locked
will get cached into cache0, and the accesses
by a CPU will be only to its local cache (thus avoiding bus transactions).
If the local cache reports that the value has changed (which means another
CPU must have changed its value), this code again tries the xchg
instruction, which may succeed this time. If it fails again, we again
go into the inner waiting loop. This is a more efficient method of waiting
as it reduces bus transactions significantly. Modern processors provide
special instructions (e.g., pause
on x86) to make this
waiting operation even more efficient.
Compiler optimizations typically do not work well with code that
can be executed concurrently. Using locks, we have ensured that the
only code that can execute concurrently are the acquire()
and release()
functions. Let's see what could happen if
the compiler tries to optimize our acquire()
and release()
implementation.
a
gets
register allocated to register R
.
a = 1; b = a + 2; a = a + 1;may get compiled to
load a, Reg //load 'a' from memory to register b = Reg + 2 Reg = Reg + 1 store Reg, a //store current value of 'a' from register to memory
Intelligent register allocation reduces the number of memory accesses, and
thus reduces memory accesses. But what would happen if the
l->locked
variable in our acquire()
function
gets register allocated. It will result in an infinite loop inside
acquire, causing a deadlock.
Because compiler optimizations assume sequential code, they do not
play well with code that could be executed concurrently. Languages typically
provide ways to tell the compiler to not optimize accesses to certain
variables (for example, the volatile
keyword in C). In this
case, we should either write the waiting loop in assembly, or use the
volatile keyword appropriately to ensure that the l->locked
variable does not get register allocated.
a
and b
like this:
a = 2; b = 3;becomes
b = 3; a = 2;For a compiler which only looks at sequential semantics, this is a valid optimization. Similarly, even if the compiler does not reorder the instructions, the hardware is free to reorder the memory accesses at runtime. For example, if the access to variable
"a"
is a cache miss, while
the access to variable "b"
is a cache hit, the access to
"b"
will complete first. This is true for most
modern processors, and this behaviour is called "out-of-order" execution.
In our lock implementation, if the access to a shared variable
gets reordered with the access to the lock variable (l->locked
),
our semantics get violated. For example,
acquire(l); hits++; release(l);may become equivalent to (either at compile-time or at runtime)
acquire(l); release(l); hits++;
To avoid this, it is possible to tell the compiler and the hardware
to not reorder instructions. Many architectures that support out-of-order
execution, also provide "fence
" instructions, which can
be used to instruct the hardware to not reorder memory accesses across
the fence instruction. On the x86 architecture, the locked xchg
instruction implicitly also acts as a fence, so our acquire
implementation is correct even in the presence of memory access reordering.
However, we need to explicitly insert a fence
instruction
in our release()
implementation like this:
void acquire(struct lock *l) { Reg = 1; while (xchg(Reg, &l->locked) == 1) { while (l->locked == 1); } } void release(struct lock *l) { fence; //assembly instruction l->locked = 0; }
For a multiprocessor kernel, spin locks as implemented above
suffice for ensuring mutual exclusion among different CPUs. However, this
still does not protect a CPU from its own interrupt handler. For
example, if an interrupt can occur within a critical section, and the
interrupt handler could access the shared data (which was also being
accessed within the critical section), mutual exclusion would get violated.
On xv6, this is true for the process table ptable
, for example,
where the timer handler may need to inspect the ptable
and
may find it in an inconsistent state, if the timer interrupt occurred
in the middle of an access to the ptable
. Worse, if the
timer handler also tries to acquire the ptable.lock
, it
would result in a deadlock.
To deal with this, xv6's spinlock also
disables the interrupts on a call to acquire
, and reenables
them on a call to release
. To allow a thread to acquire multiple
locks simultaneously, the interrupts are enabled and disabled in a
recursive manner, as follows:
void acquire(struct lock *l) { Reg = 1; pushcli(); while (xchg(Reg, &l->locked) == 1) { while (l->locked == 1); } } void release(struct lock *l) { fence; //assembly instruction l->locked = 0; popcli(); }The
pushcli()
function uses the cli
instruction
internally to disable interrupts. The popcli()
function
uses the sti
instruction internally to re-enable interrupts.
They also maintain a per-CPU counter cpu->ncli
to correctly
handle nested calls to pushcli()
and popcli()
.
void pushcli(void) { if (cpu->ncli == 0) { cli; } cpu->ncli++; } void popcli(void) { cpu->ncli--; if (cpu->ncli == 0) { sti; } }
xchg
instruction
is an unprivileged instruction, and has identical semantics in both user
and kernel modes.
ptable
array is the process table structure. A kernel thread that needs to wait on a lock
acquire
, will set its status to "BLOCKED on lock l", and call the
scheduler. Similarly, when a thread calls release(l)
, it scans the
ptable
and changes the status of one of the blocked processes (if any),
blocked on l
from BLOCKED to READY.
Notice that the accesses to the ptable
also need to be mutually
exclusive. This is done using a spinlock (ptable.lock
in the case of
xv6). Hence, a blocking lock typically uses a spinlock internally. Of course, the
spinlock is released, as soon as the ptable is updated.
If the threads are user-level threads however, all this can be done entirely
in userspace, without any kernel involvement. For user-level threads, the
ptable
itself will be maintained (and manipulated) at the user-level.
One proposal to solve this issue is to allow the same thread to acquire a lock multiple times, but still disallow different threads to acquire the lock at the same time. Here is an example implementation of recursive locks, which uses regular locks internally.
void recursive_acquire(struct rlock *rl) { if (rl->owner == cur_thread) { rl->count++; } else { acquire(&rl->lock); rl->count = 0; rl->owner = cur_thread; } } void recursive_release(struct rlock *rl) { rl->count--; if (rl->count == 0) { release(&rl->lock); rl->owner = -1; } }Usually, recursive locks are a bad idea. Most programmers maintain the invariant that a critical section always begins in a consistent state, i.e., immediately after successfully returning from a lock acquire, the state of the system is consistent. Similarly, they maintain the invariant that the state of the system is left in a consistent state on a lock release, i.e., the state is consistent just before a call to
release()
. (By consistency,
we mean that certain invariants hold about the program state. For example, in our
bank account program, consistency may mean that the total money in the bank
is constant).
Recursive locks encourage the programmer to allow a critical section to begin
in an inconsistent state. Consider a function foo()
which acquires
a lock l
, and then calls another function bar()
. The
call to the function bar()
may be synchronous (i.e., the programmer
has a call chain in her program that eventually leads to
bar
from foo
), or asynchronous (e.g., if bar()
is called within an interrupt handler, and the interrupt is received within
foo()
). In either case, bar()
may be assuming that
the state is consistent when it begins its critical section. However, because the
function was called in the middle of the critical section of foo()
,
bar
may unexpectedly see inconsistent state.
foo() { acquire(); .... bar(); //at this point, state is inconsistent .... release(); } bar() { acquire(); .... // assumes that if acquire succeeds, the state here is consistent. .... release(); }Because recursive locks encourage such bugs, they are not considered very useful.
acquire()
function
returns a SUCCESS or a FAILURE value, depending on whether the acquisition was
successful or not. Here is a sample implementation of a trylock.
bool try_acquire(struct trylock *tl) { Reg = 1; xchg(Reg, &tl->locked); if (Reg == 1) { return false; //FAILURE } else { return true; //SUCCESS } } void try_release(struct trylock *tl) { fence; tl->locked = 0; }The caller of
try_acquire
may decide its action depending on the
return value. A regular acquire can be implemented trivially by calling
try_acquire
in a loop till it succeeds. However, try_acquire
gives the flexibility to the caller to do something else, if the lock is not
currently available.