Case study: an example of non-linear effects of scheduling: "Receive livelock in an interrupt-driven kernel"
- An interrupt-driven kernel triggers a high-priority interrupt on every
receive event.
- Interrupt handler processes the received message (network packet) and
adds it to the desired queue for further processing (e.g., sending a reply).
Thus the interrupt handler acts as a producer for this queue.
- Another lower-priority thread consumes from the queue. However, the
lower priority thread can get interrupted if the network device generates
another interrupt for a received packet.
- If the rate of received packets is high, the lower-priority consumer
thread never gets to run, so packets are dropped at the queue and the
total throughput falls to zero.
- Draw the curve between input packet rate and the throughput of the
kernel.
- It would have been better to poll the network device -- excess input is
then dropped at the network device's queue (no wasted work!)
- Result: Graceful saturation of throughput with increasing input network rate.
- Tradeoff between polling and interrupts
- Interrupts provide lower latency, and less overhead at small input
rates. But behave poorly at high input rates. e.g., telephone calls
- Polling provides higher throughput at high input rates. But results
in very high latencies for low input rates. e.g., emails
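The shape of the two curves can be seen even in a toy analytical model (our illustration; the cost constants below are made up): receiving a packet costs RX_COST of CPU, processing it costs PROC_COST, total CPU capacity is 1.0, and under interrupts the receive work always preempts the consumer.

```c
#include <assert.h>

/* Toy model (made-up constants): each packet costs RX_COST of CPU
 * to receive (interrupt handler or poll) and PROC_COST for the
 * consumer thread to process; total CPU capacity is 1.0. */
#define RX_COST   0.2
#define PROC_COST 0.3

static double dmin(double a, double b) { return a < b ? a : b; }

/* Interrupt-driven: handlers preempt the consumer, so receive work
 * is always done first and the consumer gets only the leftover CPU.
 * Past rate = 1/RX_COST, all CPU time goes to handlers: livelock. */
double throughput_interrupt(double rate) {
    double left = 1.0 - rate * RX_COST;  /* CPU left for the consumer */
    if (left < 0)
        left = 0;                        /* livelock: zero throughput */
    return dmin(rate, left / PROC_COST);
}

/* Polling: receive and process one packet at a time; excess packets
 * are dropped at the device for free, so throughput saturates. */
double throughput_polling(double rate) {
    return dmin(rate, 1.0 / (RX_COST + PROC_COST));
}
```

Plotting these two functions against rate gives exactly the curves above: the interrupt curve rises, peaks, then collapses to zero, while the polling curve rises and then stays flat at its saturation point.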
Multi-processor coordination without locks
How to avoid using locks. Even with scalable locks, their use can
be expensive. Let's start with a simple data structure and try to
allow for concurrent access, so that we get a feel for the issues.
Searchable stack using locks
struct element {
int key;
int value;
struct element *next;
};
struct element *top;
void push(struct element *e) {
e->next = top;
top = e;
}
struct element *pop(void) {
struct element *e = top;
top = e->next;
return e;
}
int search(int key) {
struct element *e = top;
while (e) {
if (e->key == key)
return e->value;
e = e->next;
}
return -1;
}
This is clearly not going to work on a concurrent system.
global spinlock: performance
- how many concurrent ops? just one CPU at a time
- bus interactions? bounce cache line for both reads, writes
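For concreteness, a minimal sketch of the stack guarded by one global lock (we use a pthread mutex here; a spinlock would have the same structure, and the NULL check in pop is our addition):

```c
#include <pthread.h>
#include <stddef.h>

struct element {
    int key;
    int value;
    struct element *next;
};

static struct element *top;
static pthread_mutex_t stack_lock = PTHREAD_MUTEX_INITIALIZER;

void push(struct element *e) {
    pthread_mutex_lock(&stack_lock);
    e->next = top;
    top = e;
    pthread_mutex_unlock(&stack_lock);
}

struct element *pop(void) {
    pthread_mutex_lock(&stack_lock);
    struct element *e = top;
    if (e)                           /* empty-stack check: our addition */
        top = e->next;
    pthread_mutex_unlock(&stack_lock);
    return e;
}

int search(int key) {
    pthread_mutex_lock(&stack_lock);
    for (struct element *e = top; e; e = e->next) {
        if (e->key == key) {
            pthread_mutex_unlock(&stack_lock);
            return e->value;
        }
    }
    pthread_mutex_unlock(&stack_lock);
    return -1;
}
```

Note that search holds the lock for its entire traversal: even read-only operations serialize, and the cache line holding stack_lock bounces between CPUs.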
global read-write lock: performance
- how many concurrent ops? theoretically, could do one per CPU
- bus interactions? bounce cache line for both reads, writes
- draw timing diagram of CPU interactions
- we might be slower on two CPUs than on a single CPU!
is it going to get better if we allocate a read-write lock for each item?
- concurrency still possible
- but penalty even worse: N_{elem} cache line bounces for each search traversal
more scalable locking schemes
- example: brlock
- N cpus (or threads) would have N locks, one per CPU
- readers lock their CPU's lock, writers must lock each CPU's lock
- must use N 64-byte cache lines, even though each lock is just a bit... (why? because anything at a finer granularity would still cause cache-line bouncing)
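A sketch of the brlock idea, assuming 64-byte cache lines and a fixed CPU count (the names and the use of pthread mutexes are ours):

```c
#include <pthread.h>

#define NCPU 4
#define CACHE_LINE 64

/* One lock per CPU, each padded out to its own cache line; without
 * the padding, neighbouring locks would share a line and readers on
 * different CPUs would still bounce it. */
union brlock_slot {
    pthread_mutex_t m;
    char pad[CACHE_LINE];
} __attribute__((aligned(CACHE_LINE)));

static union brlock_slot brlock[NCPU];

void br_init(void) {
    for (int i = 0; i < NCPU; i++)
        pthread_mutex_init(&brlock[i].m, NULL);
}

/* A reader touches only its own CPU's cache line: no bouncing. */
void br_read_lock(int cpu)   { pthread_mutex_lock(&brlock[cpu].m); }
void br_read_unlock(int cpu) { pthread_mutex_unlock(&brlock[cpu].m); }

/* A writer must take every CPU's lock, in a fixed order to avoid
 * deadlocking against other writers. */
void br_write_lock(void) {
    for (int i = 0; i < NCPU; i++)
        pthread_mutex_lock(&brlock[i].m);
}

void br_write_unlock(void) {
    for (int i = NCPU - 1; i >= 0; i--)
        pthread_mutex_unlock(&brlock[i].m);
}
```

Reads scale (each CPU spins only on its local line), but writes become O(N) lock acquisitions: brlock only pays off for read-mostly data.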
Avoiding locks
why do we want to avoid locks?
- performance
- complexity
- deadlock
- priority inversion
Recall compare-and-swap (pseudocode for the semantics; the hardware executes the whole read-compare-write sequence atomically, e.g., x86 lock cmpxchg):
int cmpxchg(int *addr, int old, int new) {
int was = *addr;
if (was == old)
*addr = new;
return was;
}
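The pseudocode above is not itself atomic as C code; in a real program one might express the same interface with the C11 atomics library, whose compare-exchange has exactly this "compare, conditionally store, report what was there" shape:

```c
#include <stdatomic.h>

/* Same interface as the pseudocode, but actually atomic.
 * atomic_compare_exchange_strong compares *addr with old and, if
 * equal, stores new; on failure it writes the observed value back
 * into old. Either way, old ends up holding the value *addr had
 * before the operation, so we return it. */
int cmpxchg(_Atomic int *addr, int old, int new) {
    atomic_compare_exchange_strong(addr, &old, new);
    return old;
}
```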
Usage:
void push(struct element *e) {
do {
e->next = top;
} while (cmpxchg(&top, e->next, e) != e->next);
}
struct element *pop(void) {
struct element *e;
do {
e = top;
} while (cmpxchg(&top, e, e->next) != e);
return e;
}
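In real C the compare-and-swap must operate on pointer-sized values, not ints; a runnable sketch of the lock-free push/pop using the GCC/Clang __sync_val_compare_and_swap builtin (the empty-stack check in pop is our addition):

```c
#include <stddef.h>

struct element {
    int key;
    int value;
    struct element *next;
};

static struct element *top;

/* Lock-free push: snapshot top, link the new element in front of the
 * snapshot, and retry if top changed underneath us. */
void push(struct element *e) {
    struct element *old;
    do {
        old = top;
        e->next = old;
    } while (__sync_val_compare_and_swap(&top, old, e) != old);
}

struct element *pop(void) {
    struct element *e, *next;
    do {
        e = top;
        if (!e)                 /* empty-stack check: our addition */
            return NULL;
        next = e->next;
    } while (__sync_val_compare_and_swap(&top, e, next) != e);
    return e;
}
```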
What about search? Can it stay the same as in the non-concurrent case?
But before that:
why is this better than using locks?
- readers no longer generate spurious memory updates (lock acquisitions), which incurred a performance hit
- not having to think about deadlock is great
problem 1: lock-free data structures require careful design, hw support
- suppose we want to remove arbitrary elements, not just the first one
- what could go wrong if we try to use the same cmpxchg?
- need DCAS (double-compare-and-swap) to implement remove properly
- must make sure neither previous nor next element changed
- x86 hardware doesn't have DCAS
- can do some tricks because x86 provides 8-byte cmpxchg
- only for sequential memory addresses
- can potentially play tricks by staggering the 8 bytes across two pages
- unlikely to win in performance once we start invalidating TLB entries
problem 2: memory reuse
when can we free a memory block, if other CPUs could be accessing it?
other CPU's search might be traversing any of the elements
- One candidate fix: bump a refcount on each access -- but that reintroduces cache-line bounces
- and what about CPUs that are just about to bump the refcount?
What if the memory is freed (and reused) while a search() thread is still
holding a reference to it?
Worse, reusing a memory block can corrupt list
- stack contains three elements
top -> A -> B -> C
- CPU 1 about to pop off the top of the stack,
preempted just before cmpxchg(&top, A, B)
- CPU 2 pops off A, B, frees both of them.
top -> C
- CPU 2 allocates another item (malloc reuses A) and pushes onto stack
top -> A -> C
- CPU 1: cmpxchg succeeds, stack now looks like
top -> B -> C
- This is wrong: the operations were two pops (CPU 2), one push (CPU 2), and one
pop (CPU 1) -- three pops and one push in all -- which should leave exactly one
element (C) on the stack. Instead the freed element B is back on top.
The failure (the classic "ABA problem") occurred because A was reused, and cmpxchg
disambiguates only by address, not by logical element: it cannot tell the new A
from the old one.
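A common mitigation (our illustration, not from the notes above) is to tag the top-of-stack value with a generation counter that is bumped on every update, so a recycled element fails the compare even though its address matches. Packing a 32-bit element index and a 32-bit generation into one 64-bit word lets an ordinary 8-byte cmpxchg cover both:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Elements live in a fixed pool and are named by 32-bit index, so an
 * index plus a 32-bit generation counter fit in one 64-bit word that
 * a plain 8-byte compare-and-swap can update atomically. */
#define POOL 64
#define NIL  0xffffffffu        /* "no element" index */

struct element { int key; int value; uint32_t next; };
static struct element pool[POOL];

/* low 32 bits: index of top element; high 32 bits: generation */
static _Atomic uint64_t top = NIL;

static uint64_t pack(uint32_t idx, uint32_t gen) {
    return ((uint64_t)gen << 32) | idx;
}

void push(uint32_t idx) {
    uint64_t old, new;
    do {
        old = atomic_load(&top);
        pool[idx].next = (uint32_t)old;
        /* bump the generation: a recycled index now compares unequal */
        new = pack(idx, (uint32_t)(old >> 32) + 1);
    } while (!atomic_compare_exchange_weak(&top, &old, new));
}

uint32_t pop(void) {
    uint64_t old, new;
    uint32_t idx;
    do {
        old = atomic_load(&top);
        idx = (uint32_t)old;
        if (idx == NIL)
            return NIL;
        new = pack(pool[idx].next, (uint32_t)(old >> 32) + 1);
    } while (!atomic_compare_exchange_weak(&top, &old, new));
    return idx;
}
```

Note this only repairs the cmpxchg disambiguation: a concurrent search can still dereference memory that was freed underneath it, which is exactly the reclamation question discussed next.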
Memory must not be reused until we are sure that no thread can still hold a
reference to it (in its stack or registers); i.e., free() must be delayed past
the point where any thread could still be holding a reference to an element
that was just popped off.
But how do we know about when it is safe to reuse memory? Would waiting
for a long time (say one hour) before freeing work? Well almost -- at least
the probability that another thread could be holding a reference to such
a location becomes smaller with time, and so the probability of a failure
is very low. How does one decide how long to wait before reusing memory?
If too short, we risk incorrect behaviour. If too long, memory cannot be reused
for a long time, which can be a problem if the available memory is small.