RCU differs from read-write locks in that it allows the read path to execute without any synchronization, i.e., at full speed, and yet ensures correct behaviour in the presence of occasional concurrent updates.
As we saw in our last lecture, this can be achieved by ensuring that updates are performed atomically; e.g., if an update can be expressed as a single value swap, an atomic compare-and-swap (CAS) instruction can be used.
However, even if an update to a global data structure can be expressed as a single CAS instruction, other operations, such as free() (memory reuse), may be required to complete the update. In the searchable stack example, free() can only be called once we are sure that no other thread could be holding a reference to the location being freed.
One way to ensure this is through reference counting, wherein a thread increments a location's reference count each time it obtains a pointer to it. However, maintaining reference counts involves write operations on the read path, and thus suffers from the same problems as read-write locks.
The other approach is to use information about the software of which this data structure is a part. For example, if the data structure were implemented as part of the Linux kernel, and we know that the Linux kernel is non-preemptible (i.e., a thread cannot be preempted while it is executing in kernel space), then we can devise an algorithm that exploits this property.
First, let's understand what we mean by the Linux kernel being non-preemptible: a thread cannot be context-switched out in the middle of kernel execution. Even if a timer interrupt occurs while a thread is executing, the interrupt is recorded but the current thread resumes execution (until it reaches a safe point) before a context switch to another thread takes place.
Assuming that the searchable stack structure is a part of the Linux kernel and that a thread cannot be context switched out in the middle of the search() procedure, we can be sure of the following things after a pop() update:

- Threads that are not currently executing search() cannot be holding a reference to oldtop, because references to stack nodes are obtained only inside the search() procedure. If they ever call search() (e.g., in future), they will start at newtop.
- Only threads that were already executing search() at the time of the pop() may still be holding a reference to oldtop.

Hence, before freeing oldtop after performing the pop() using CAS, we just need to wait for all other executing CPUs to execute a context switch. After a context switch, that CPU cannot be holding a reference to oldtop (as it must have exited search()).

The remaining question is how to tell that every other CPU has executed a context switch since the update to top. One way to accomplish this is to schedule a small dummy thread on each CPU. If all the dummy threads execute, a context switch has definitely taken place on each CPU.
In a way, RCU trades space for time: it allows the common case (read accesses) to execute faster, at the cost of leaving unused locations unfreed for an extended period (i.e., longer than strictly necessary). This approach is useful only if memory is plentiful, which is true of modern hardware environments.
Consider, for example, a router's forwarding table, which is read on every packet but updated only occasionally. Using read-write locks would penalize routing table lookups. RCU, on the other hand, allows lookups to execute unsynchronized (at full speed) and yet maintains correctness with respect to occasional updates.
Another example is the loading and unloading of kernel modules: a module's code must not be freed while some thread may still be executing it. This can be accomplished using read-write locks, but that would mean that every module access (e.g., a function call into the module) is preceded by a read-acquire. A much faster method is to use RCU, where function calls can execute without synchronization, while updates (loading/unloading) are done carefully (update pointers with CAS, and wait for all other CPUs to context switch before freeing).
Figure showing monolithic kernel and Apps. The monolithic kernel has subsystems like filesystem, virtual memory, scheduler, device drivers, etc.
The primary weaknesses of the monolithic organization are:

- Lack of fault isolation: a bug or a security hole in any subsystem (e.g., a device driver) can corrupt or crash the entire kernel.
- Lack of flexibility: all applications must use the same implementation of each subsystem (e.g., the same filesystem and scheduler), even if a different implementation would suit an application better.
There are alternate OS organizations that have been proposed to alleviate these limitations of the monolithic kernel, two of which are called the microkernel and the exokernel.
Figure showing microkernel (IPC+protection), servers (FS, VM, Drivers, Scheduler, TCP/IP, etc.), and apps.
The kernel becomes much thinner, and isolated from all the other services. A bug or a security hole in one service does not affect the other services or the kernel. Also, different implementations of subsystems can be used for different applications: e.g., different FS implementations can co-exist (as executables and processes) and different apps can use different FS implementations (perhaps tuned for themselves), assuming that the FS implementations are operating on different parts of the disk. While monolithic kernels also support multiple filesystems, the flexibility of a microkernel is far greater. For example, a database-like application that cares about persistence and has structured records, can choose its own filesystem where a file looks very similar to a database table and has stronger persistence guarantees.
The downside of the microkernel architecture is performance: IPC, no matter how fast, is decidedly slower than direct interaction with the kernel. For example, monolithic kernels allow an application and a kernel service to communicate through simple pointer exchanges; in a microkernel, such communication requires marshalling and unmarshalling of data structures across address-space boundaries.
While microkernels are not very popular in modern desktop and server environments, they (e.g., L4) are often used in embedded environments that require strong isolation, reliability, and security guarantees.
Figure showing App with FS, VM, scheduler subsystems within the App, and the kernel providing low-level syscall API.
Exokernel example for virtual memory:
Here is the set of kernel abstractions:
pa = AllocPage()
DeallocPage(pa)
CreateMapping(va, pa)
PageFault(va)
PleaseReleaseAPage()
A process uses AllocPage() and DeallocPage() to control its physical memory footprint. The CreateMapping() syscall is used for creating a mapping from a process-private virtual address to a physical address; the process must be authorized to map the physical address for the CreateMapping() call to succeed.
Similarly, there are upcalls that allow the kernel to invoke process-registered functions (similar to signal handlers on UNIX) on a page fault and to request the process to release a page. On a page fault, the process's page-fault handler gets called (unlike UNIX, where the page-fault handler is inside the kernel) and decides what to do: e.g., choose a physical page for replacement, create a new mapping, etc. This allows a process to decide its own replacement policy and other mechanisms, providing far greater flexibility. Also, the PleaseReleaseAPage() upcall can be used by the kernel to request the process to release a page (using DeallocPage()). The kernel may expect the process to respond to an upcall within a bounded time, after which it may take coercive action (e.g., forcibly take a page away, kill the process, etc.).