Let us first understand the typical hardware characteristics. The webserver is likely running on a modern processor clocked at 1-2 GHz, i.e., each instruction that executes locally on the CPU (e.g., a register access or an L1 cache hit) takes on the order of a nanosecond to complete. Instructions that miss the cache, and thus need to access main memory, take 10-100 nanoseconds. In contrast, a disk access takes roughly 1-10 milliseconds to complete, i.e., around one million instructions can execute in the time it takes to complete one disk access!
Why is a disk access so expensive? As we will also see later in our discussion on file systems, a magnetic disk is a mechanical device organized around a cylindrical spindle, with multiple circular platters stacked one above another. Each platter has concentric tracks, which hold the data as magnetic impressions. To read this data, the disk has a head associated with each platter that can move radially and position itself over a particular track. The spindle rotates around its central axis, and the head reads data off the track as it passes underneath.
Notice that reading data involves a mechanical radial movement of the disk head arm, and also requires the head to be positioned at the right point within a track, which incurs a rotational latency. The time taken for the radial movement of the head to reach the desired track is called the seek time. Typical average seek times range from 4 milliseconds (for enterprise drives) to 15 milliseconds (for mobile drives). Typical disks rotate at 5000 RPM (for mobile drives) to 15000 RPM (for enterprise drives), so the average rotational latencies range between 2-6 milliseconds. Overall, a disk access can take 5-20 milliseconds to complete.
[Figure: a magnetic disk. Platters are stacked on a spindle that rotates; a disk head on each platter moves radially under a disk head controller.]
Let's look at the pseudo-code of a webserver:
```
while (1) {
    read message from incoming network queue;
    URL = parse message;
    read URL file from disk;
    write reply on outgoing network queue;
}
```

Here we assume that there is an incoming network queue, where incoming packets are buffered and read in FIFO order. If multiple clients are accessing the webserver, their requests get buffered in the incoming queue in arrival order. This webserver picks and serves one request at a time. Because each request needs to go to the disk device, and assuming each disk access takes 10 milliseconds, we can serve only 100 requests per second. Modern webservers serve tens of thousands of requests per second. We can make this system faster by introducing a cache of disk files in memory.
```
while (1) {
    read message from incoming network queue;
    URL = parse message;
    if (requested file is not in cache) {
        data = read URL file from disk;
        new.name = URL;   //new is the current cache pointer.
        new.data = data;
        new = new + 1;    //mark the current cache pointer as used.
    }
    write reply on outgoing network queue;
}
```

This server is much better, as we have optimized for the common case where there is locality of access, i.e., the same set of pages are accessed close together in time, and most of these accesses can be served from memory.
Still, if even one request results in a disk access (a cache miss), then all the following requests in the network queue have to wait for a long time (even though those requests could have been served much faster from memory). So, let's use multiple threads, each serving one request.
```
for (i = 0; i < 10; i++) {
    fork(server);   //create a server thread
}

server() {
    while (1) {
        read message from incoming network queue;
        URL = parse message;
        if (requested file is not in cache) {   // A
            data = read URL file from disk;
            //new is the current cache pointer.
            new.name = URL;                     // B
            new.data = data;                    // C
            //mark the current cache pointer as used.
            new = new + 1;                      // D
        }
        write reply on outgoing network queue;
    }
}
```

Each thread executes the server function, but now we have concurrency problems. Firstly, assuming there is only one network queue, multiple server threads will be reading from the same input queue, and need correct synchronization for that; we will discuss this later. Similarly, the outgoing network queue is shared by multiple server threads. Let's first focus our attention on the shared cache, which is accessed in statements A, B, C, and D.
Many race conditions exist in this code, and many bad things can happen: we can end up with a cache entry that is empty, a corrupted entry, or, worse still, a page belonging to URL1 cached under URL2.
One way to solve this problem is to use a global cache lock and protect the entire critical section with it (statements A, B, C, and D). However, that brings us back to the very problem we wanted to solve using multiple threads: if even one thread misses the cache and accesses the disk, all other concurrent threads have to wait (at the lock acquire statement) for a long time. (Notice that we have just moved the waiting point from the input queue to the lock acquisition.) So a full solution will require fine-grained locking (e.g., a lock per file/page, a lock for the disk, etc.), and will need careful lock acquisition for correctness.
Let's now look at the shared network queues. On the incoming queue, data (packets) is produced by a "network thread" (or multiple network threads) and consumed by the "server threads". The network thread reads incoming packets from the network wire and appends them to the queue's buffer (which is consumed in FIFO order by the server threads). Similarly, on the outgoing queue, data is produced by the server threads and consumed by another set of outgoing network threads. This is a common situation: a shared queue accessed by producer and consumer threads. It is commonly known as the producer-consumer problem.
```
network() {
    packet = read from wire;
    queue[head] = packet;
    head++;
}

server() {
    packet = queue[tail];
    tail++;
    read packet and process request;
}
```
Queues are often implemented as circular buffers, with head and tail pointers, as follows:
```
char queue[MAX];          //global
int head = 0, tail = 0;   //global

void produce(char data) {
    if ((head + 1) % MAX == tail) {
        error("producing to full queue!");
        return;
    }
    queue[head] = data;
    head = (head + 1) % MAX;
}

char consume(void) {
    char e;
    if (tail == head) {
        error("consuming from empty queue!");
        return -1;
    }
    e = queue[tail];
    tail = (tail + 1) % MAX;
    return e;
}
```

The code above implements produce() and consume() functions for a queue implemented as a circular buffer. In this version, additions to a full queue and consumption from an empty queue are simply not allowed.
In a multi-threaded scenario, one of the threads may be consuming from the queue, while another thread may be producing to the queue. Also, if a consumer tries to consume from an empty queue, it may want to wait till the queue becomes non-empty, and then consume an element. Similarly, if a producer wants to produce to a full queue, it may want to wait till the queue becomes not-full, and then produce the element, in the following way:
```
char queue[MAX];          //global
int head = 0, tail = 0;   //global

void produce(char data) {
    if ((head + 1) % MAX == tail) {
        wait for queue to have some empty space;
    }
    queue[head] = data;
    head = (head + 1) % MAX;
}

char consume(void) {
    if (tail == head) {
        wait for queue to have some elements;
    }
    e = queue[tail];
    tail = (tail + 1) % MAX;
    return e;
}
```

How can a thread wait for a condition to become true? For example, how does a thread wait for the queue to have some empty space? This requires coordination between the producer and the consumer threads. Such coordination is not possible using locks alone, as locks only implement mutual exclusion. It is done using a special construct called condition variables.
Just like locks, condition variables are variables, usually shared by multiple threads, and they provide two functions, wait() and notify().
```
struct cv;   //condition variable type
void wait(struct cv *, struct lock *);
void notify(struct cv *);
```

(Let's ignore the second argument of the wait function, of type lock, for now.) The wait function causes the calling thread to be suspended (or blocked). It is woken up only if another thread subsequently calls notify on the same condition variable.
In our producer-consumer example, the producer thread may wait on a condition variable if it finds that the buffer is full. The consumer thread, after consuming an element, may call notify to wake up any sleeping producer thread (as the queue may now be not-full), in the following way:
```
char queue[MAX];          //global
int head = 0, tail = 0;   //global
struct cv not_full, not_empty;

void produce(char data) {
    if ((head + 1) % MAX == tail) {
        wait(&not_full, ...);
    }
    queue[head] = data;
    head = (head + 1) % MAX;
    notify(&not_empty);
}

char consume(void) {
    if (tail == head) {
        wait(&not_empty, ...);
    }
    e = queue[tail];
    tail = (tail + 1) % MAX;
    notify(&not_full);
    return e;
}
```

Here, we declared two condition variables, not_full and not_empty. If the queue is full, the producer thread waits on not_full. The consumer thread calls notify(&not_full) after it consumes an element. The effect of notify is to wake up all threads waiting on that condition variable; if no threads are currently waiting on it, notify has no effect. Also, notice that we have left the second argument of the wait calls blank for now.
Firstly, this code is incorrect because the threads concurrently read and write the shared variables queue, head, and tail, so bad interleavings can cause incorrect results; we need to use locks to fix this. Secondly, we do not need to call notify on every produce and consume operation. Instead, notify only needs to be called if the queue transitioned from full to not-full in the case of the consumer (or from empty to not-empty in the case of the producer). Let's try fixing this code, as follows:
```
char queue[MAX];          //global
int head = 0, tail = 0;   //global
struct cv not_full, not_empty;
struct lock qlock;

void produce(char data) {
    acquire(&qlock);
    if ((head + 1) % MAX == tail) {
        wait(&not_full, ...);
    }
    queue[head] = data;
    head = (head + 1) % MAX;
    if (head == (tail + 1) % MAX) {
        // the queue just became not-empty
        notify(&not_empty);
    }
    release(&qlock);
}

char consume(void) {
    acquire(&qlock);
    if (tail == head) {
        wait(&not_empty, ...);
    }
    e = queue[tail];
    tail = (tail + 1) % MAX;
    if ((head + 2) % MAX == tail) {
        // the queue just became not-full
        notify(&not_full);
    }
    release(&qlock);
    return e;
}
```

Notice that now, if a producer starts waiting on the condition variable, the system will deadlock: the producer is holding the lock, so no other thread can access the queue, and the producer will never get notified. A better solution may perhaps be to release the lock before calling wait, and reacquire it afterwards, as follows:
```
char queue[MAX];          //global
int head = 0, tail = 0;   //global
struct cv not_full, not_empty;
struct lock qlock;

void produce(char data) {
    acquire(&qlock);
    if ((head + 1) % MAX == tail) {
        release(&qlock);
        wait(&not_full, ...);
        acquire(&qlock);
    }
    queue[head] = data;
    head = (head + 1) % MAX;
    notify(&not_empty);
    release(&qlock);
}

char consume(void) {
    acquire(&qlock);
    if (tail == head) {
        release(&qlock);
        wait(&not_empty, ...);
        acquire(&qlock);
    }
    e = queue[tail];
    tail = (tail + 1) % MAX;
    notify(&not_full);
    release(&qlock);
    return e;
}
```

This code is also incorrect. Let's see what can happen: it is possible that the producer thread checks the queue, finds it full, releases the lock, and is about to call wait(). However, before the producer calls wait(), the consumer thread can acquire the lock, consume an element, notify not_full, and release the lock. In this case, the notify call is lost, as the producer has not called wait yet. When the producer does call wait after this, it will sleep forever, as the consumer will not call notify again, resulting in a deadlock. This is also called the lost-notify problem.
The lost-notify problem occurred because the release of the lock and the act of going to sleep were not atomic. Hence, the wait() function takes an additional argument, the lock. Internally, wait() releases the lock before going to sleep (in an atomic manner), and tries to reacquire the lock after being notified. The wait function ensures that no other thread can call notify between the release of the lock and the act of going to sleep. The (almost) correct code looks like the following:
```
char queue[MAX];          //global
int head = 0, tail = 0;   //global
struct cv not_full, not_empty;
struct lock qlock;

void produce(char data) {
    acquire(&qlock);
    if ((head + 1) % MAX == tail) {
        wait(&not_full, &qlock);
    }
    queue[head] = data;
    head = (head + 1) % MAX;
    notify(&not_empty);
    release(&qlock);
}

char consume(void) {
    acquire(&qlock);
    if (tail == head) {
        wait(&not_empty, &qlock);
    }
    e = queue[tail];
    tail = (tail + 1) % MAX;
    notify(&not_full);
    release(&qlock);
    return e;
}
```

Notice that now we do not need explicit calls to release() and acquire() around the calls to wait(), as they are performed internally by wait itself. In this code, if a producer finds the buffer full, it goes to sleep, releasing the lock atomically. A consumer that consumes an element can then call notify to wake up the producer. The woken-up producer is now ready to run (though it may not immediately get the CPU). The first thing the producer does is try to reacquire the lock (inside wait). Because the lock may still be held by the consumer (notice that the call to release(&qlock) comes after the call to notify), it may have to wait again. As soon as the lock is released, the producer may acquire it and can continue to produce the next element.
To summarize: wait(cv, lock) releases the lock and goes to sleep on cv. notify(cv) wakes up all threads that are currently waiting on that cv. After waking up, the wait() function tries to reacquire the lock before returning.