CSL862 : Virtualization and Cloud Computing : HW2 : Binary Translation and Shadow Page Tables

In this homework, you will study Monee, an experimental under-development VMM for 32-bit x86 based on binary translation.

Checkout the source code from the SVN repository https://svn.iitd.ernet.in/~sbansal/monitor. You can find instructions on using Subversion here (see Subversion heading). The source code is not publicly available and only available internally to CSL862 students; please do not redistribute the source code. Build and run the VMM over Qemu using the instructions given in the README file. Read the following documentation and answer the questions.

Monee is a research prototype VMM which can run only one guest OS. Monee is a bare-metal hypervisor designed to be installable on an existing system. It works by overwriting the boot-sector of an existing disk. Monee can run an unmodified 32-bit x86 guest and performs binary translation to efficiently virtualize the guest. The purpose of this research prototype is to investigate ideas like dynamic compiler optimizations, dynamic OS optimizations, security, reliability, and other applications of client-side virtualization.

The research prototype can successfully boot an unmodified pintos guest. To test some sample pintos guests, type make test. This should run a few pintos tests natively and on the VMM. It also reports some performance characteristics. Because our prototype is under-development, you will find that its performance might be lower than expected.

Binary Translator

The file peep/peep.tab has all the binary translation rules for 32-bit x86. Each rule begins with entry: keyword. The assembly code between entry: and -- marker represents the guest code, and the assembly code between the -- marker and the == marker represents the equivalent translated code.

Example Translation Rule

Here is an example of a translation rule:
entry:
  cli
  --
  movb $0, %gs:(vcpu + VCPU_IF_OFF)
  ==
specifies that the cli instruction should be translated to the corresponding movb instruction. The vcpu struct maintains the software state of the virtual CPU. The %gs prefix is used to access any VMM memory to distinguish these accesses from other guest memory accesses (similar to how it is done in [1]). The constant VCPU_IF_OFF represents the offset of the emulated IF flag in the vcpu struct. We call these translation rules, peephole translation rules.

Another Example

A more complex example of a peephole translation rule is:
entry:
  call *%vr0d
  --
  %tr0d: eax
  %vr0d: no_eax
  --
  pushl $fallthrough_addr
  JUMP_INDIRECT_USE_EAX_TEMP(%vr0d, 0)
  ==
This rule specifies that any indirect function call using the value of register vr0d (vr0d is a placeholder for the actual register that will be substituted at translation time.) as the target should be translated as given. The register placeholders %vr0d and %tr0d are substituted with real register names at translation time. For example, in this case, %vr0d will be substituted with the register name that occurred in the source instruction. %tr0d stands for a placeholder for a temporary register that is used by the translated code. The translator picks the substitution for %tr0d itself. For example, it could pick a dead register to substitute for %tr0d. If all registers are potentially live, then the translator picks one of the live registers arbitrarily and emits appropriate save and restore code for that register before and after the translated code.

The lines "%tr0d: eax" and "%vr0d: no_eax" specify constraints on the placeholders %tr0d and %vr0d respectively. The first line specifies that %tr0d must be substituted only with %eax before using this translation. This means no other register may be used as a temporary. This is necessary because the code in macro JUMP_INDIRECT_USE_EAX_TEMP clobbers register eax and so it must be saved and restored by the translator appropriately (if live). Similarly, the second constraint specifies that this rule can only be used if the source instruction used a register which is not eax (that's what the tag no_eax means).

The translation code in this example also makes use of a special variable called fallthrough_addr which stands for the address of the source instruction just after this call instruction. For example, in this case, fallthrough_addr needs to be pushed onto the stack to emulate the call instruction (done by the first instruction of the translated code). The next line is a GCC macro which gets expanded just like regular macros in C code. The expansion of this macro can be found in peep/peeptab_defs.h which is included from peep/peep.tab. In this case, this macro is used to jump to the indirect address contained in vr0d.

Naming of Placeholders

Register placeholder are named %vr0d, %vr1d, %tr1d, %vr0w, etc. We describe the meaning of these register placeholder names below:

Constant placeholders are named C0, C1, C2, etc. The constants placeholders are substituted using the values occurring in the source instructions. Apart from constant placeholders, constants 0, 1, etc. can also be used in the translation rules. In this case, the translation rule can be activated only if that particular constant appears in the source instruction code.

Segment Registers can also use placeholders. For example, consider the following translation rule:

entry:
  mov %vr0d, %vseg0
  --
  %tr0d: no_eax_esp
  %vseg0: no_cs_gs
  --
  MOV_REG_TO_SEG_USE_NO_EAX_TEMP(%vr0, %vseg0, $vseg0, tr0)
  ==
The input instruction in the translation rule mov %vr0d, %vseg0 specifies that the second operand (%vseg0) can be replaced by one of the segment registers cs, ds, es, fs, gs, ss. The segment register placeholders are named %vseg0, %vseg1, and so on.

Memory accesses can be performed using many different addressing modes. Some addressing modes supported by 32-bit x86 are scale-index-base, base+offset, offset-only, etc. The full list of translation codes can be found in Intel Reference Manual 2A [2]. In most cases, the same translation rule is applicable irrespective of the addressing mode used. Hence, it is useful to have a placeholder for any memory access which can be replaced by the appropriate addressing mode used in the source instruction. We use the MEM32 identifier to specify a placeholder for any memory access using 32-bit addressing. Similarly, MEM16 is used to specify memory access using 16-bit addressing. For example, consider the following rule:

entry:
  lgdt %vseg0:MEM32
  --
  %tr0d: no_eax
  --
  leal MEM32, %tr0d
  CALLOUT2(callout_restore_tr0_and_lgdt, $tr0d, $vseg0)
  ==
This rule pattern matches any lgdt instruction which uses any segment register (specified by placeholder %vseg0) and any 32-bit memory addressing mode (specified by MEM32). (Recall that if no segment is specified in input code, it defaults to %ds). The same identifier MEM32 can be used in the translated code. Because only one identifier MEM32 is supported, one translation rule cannot contain more than one memory access.

Finally, register and segment placeholders can be used in two ways:

Register Constraint Tags

The following register constraint tags can be used to specify constraints on register placeholders: These tags are defined in peep/insntypes.h.

Multiple Peephole Rules Matching a Source Instruction Sequence

If multiple peephole rules match a source instruction (or sequence of instructions), then the one with minimum cost of the translated code is chosen. The choice does not depend on the order of occurrence of the peephole rules.

Source Code of Binary Translator

The function translate() in peep/peep.c performs the actual translation. Here is a description of the arguments to this function and their meaning:
The translate() function consults the peephole table represented as peep_tab_entries[] in peep/peep.c to translate the guest code. The table peep_tab_entries[] is generated using the peep.tab files. The code to parse the translation rules in peep.tab and generate the peephole table can be found in peep/peepgen.c. This file encodes the central logic of the binary translator. peepgen.c compiles to a user-level program called peepgen. peepgen takes the peephole translation rules encoded in peep.tab as input and generates the following files in monee-build/mon build directory: Callouts are implemented for complex functionality. The callouts can be found in peep/callouts.c. References to the callouts can be found in peep/peep.tab. The macros CALLOUT0, CALLOUT1, CALLOUT2, ... are used to make calls to functions with 0, 1, 2, ... arguments respectively. Register contents (e.g., %vr0d, %eax, etc.), register identifiers (e.g., $vr0d, $tr0d,etc.) or constants can be passed as arguments to the callout functions. Before entry to the callout function, all CPU state is saved to the vcpu struct. The code in the callout function can manipulate vcpu state. On return from a callout, the vcpu state is loaded back into hardware registers before executing the next instruction in the translation cache.

Shadow Page Tables

Monee uses shadow page tables to virtualize memory. Because we run only one guest at a time, we only need to protect the monitor (VMM) memory from guest accesses.

The monitor occupies physical pages from LOADER_MONITOR_BASE to LOADER_MONITOR_END. These constants have been defined in sys/loader.h. LOADER_MONITOR_BASE is at 4MB. The monitor memory consists of two parts: VMM and VMX. VMM is the part of the monitor that lives in the guest's virtual address space starting at LOADER_VMM_VIRT_BASE. VMX is the part of the monitor that is used for other data structures of the monitor which do not need to be mapped in the guest address space. Both VMM and VMX are sized at 4MB each. The VMX memory is not currently used.

Here is the physical address space layout for the monitor:

   PHYS_MAX    +-----------------------------------+
               |                                   |
               |                                   |
               |                                   |
               |            Guest                  |
               |                                   |
               |                                   |
               |                                   |
       12MB    +-----------------------------------+
               |                                   |
               |              VMX                  |
               |                                   |
        8MB    +-----------------------------------+
               |                                   |
               |              VMM                  |
               |                                   |
        4MB    +-----------------------------------+
               |                                   |
               |             Guest                 |
               |                                   |
          0    +-----------------------------------+
The monitor steals 8MB of memory from the guest's physical address space. It maintains a swap space of 8MB on disk to store the displaced guest pages. Each time the guest accesses it's physical memory located in the displaced region, it causes a page fault (#PF) and the monitor serves the page from the disk. Hence, before transferring control to the guest, the shadow page tables need to be appropriately configured so that the guest faults on an access to pages in the displaced region.

The code for swapping pages to and from disk for this displaced region can be found in mem/swap.c. Standard page replacement algorithms are used to ensure that only a small number of page faults occur by maintaining the hot displaced pages in the swap cache (backed by physical memory). The swap cache lives in the monitor address space. The hot pages are maintained in the cache and the shadow page table points to them.

The VMM needs to be mapped into the guest's virtual address space because the translated code of the guest is stored in VMM memory and can access the guest's data. To do this, we always keep the VMM mapped to the top 4MB of the guest's virtual address space. Here is a layout of the virtual address space at any time:

 0xffffffff    +-----------------------------------+
               |                                   |
               |              VMM                  |
               |                                   |
 0xffc00000    +-----------------------------------+
               |                                   |
               |                                   |
               |                                   |
               |                                   |
               |                                   |
               |             Guest                 |
               |                                   |
               |                                   |
               |                                   |
               |                                   |
               |                                   |
               |                                   |
               |                                   |
          0    +-----------------------------------+
The VMM contains the translation cache, the code to perform the translation, and other VMM-specific data structures. The guest code in the guest's address space is translated and stored in the translation cache. Only the translated code is executed (the original guest code is never executed directly).

VMM memory needs to be protected from guest accesses. Because it is impossible to know the memory addresses accessed by the guest at translation time, the only way to ensure protection is through runtime checks. We use segmentation hardware to ensure that the guest can never read/write VMM memory. All guest segments are truncated at 0xffc00000 so that any access to a virtual address above this value causes a general-purpose exception #GP. The only exceptions to this are the %cs and %gs segments which are not truncated. The %cs segment is needed to execute translated code which lives above 0xffc00000 and hence cannot be truncated. Similarly, %gs is deliberately reserved to be able to access VMM data structures (e.g., vcpu struct) from the translated code. This works because most guests rarely use segmentation, and even if they do, they rarely use the %cs and %gs registers. If any guest instruction uses these registers, we binary translate that instruction appropriately so that the guest is still unable to access VMM memory (see uses of operand_accesses_cs_gs() in peep/peep.c). This mechanism is similar to that used by VMware's binary translator[1].

Each time the guest loads a page table (for it's process for example) through the mov %eax, %cr3 instruction, the monitor translates it into a call to callout_mov_to_cr3(). This function checks to see if the new page table is different from the current one. If so, it creates a shadow page table for the newly loaded page table through a call to the shadow_pagedir_sync() function.

The monitor maintains a page table called phys_map. This page table is used to access guest physical memory. The code to initialize phys_map in phys_map_init() function is helpful to understand the need for this table. The phys_map structure contains an identity mapping (using large 4MB pages) for all physical memory, except for pages between LOADER_MONITOR_BASE and LOADER_MONITOR_END. For pages in this range, regular 4KB pages are used, and an empty page table is initialized. If the pages in this range are accessed, they are appropriately swapped-in in 4KB chunks. The swap_get_page() function is used for swapping in a page corresponding to a physical address.

The shadow_pagedir_sync() function first switches to phys_map (because all accesses in this routine will use guest physical addresses). It then calls swap_load_shadow_page_dirs() to load shadow page table, and then switches to it using switch_to_shadow() function. The swap pages and pages containing the shadow page table are allocated from the POOL_SWAP memory pool. The shadow page tables are also cached in memory (to avoid reconstruction on every page table switch), just like other swap pages. This logic is implemented in mem/swap.c.

Memory Traces

It is possible for the guest to modify its page tables or modify its code regions (for self-modifying code). These modifications are dangerous because they can cause our shadow structures or translated code to become stale (invalid). For this reason, we need to trace any write to these memory locations which we have shadowed. This is implemented using mtraces in mem/swap.c.

Device Emulation

We avoid writing detailed device emulation code by passing through the hardware devices directly to the guest. Because we run only one guest, this is usually possible. All hardware devices are transparently made visible to the guest by simply letting the guest's I/O instructions (e.g., in, out) directly access hardware I/O ports. This considerably simplifies the design of our monitor and lets us concentrate on the binary translation and memory virtualization aspects.

There are three exceptions to our device model, namely keyboard, disk, and serial port. To be able to control our monitor, we emulate the keyboard. Most keystrokes are passed through directly to hardware. Certain combinations of keystrokes are intended to be caught by the monitor. This is not currently implemented.

Similarly, the monitor itself lives on the disk and hence the disk cannot be made fully visible to the guest. For example, the monitor occupies the boot sector of the disk. If the guest tries to read it's boot sector, it should see it's own boot sector contents and not the monitor's. Hence, we emulate an IDE disk (hw/ide.c) and perform appropriate translation for the guest's disk accesses before accessing the physical disk.

We use the second serial port (at ioport address 0x2f8 and irq 0x23) to log monitor's activities for debugging. We assume that the guest will not use this serial port.

Logging

The printf() function logs to the second serial port. The contents of this serial port can be obtained by redirecting the serial port output to a file. For example, the following command redirects the first serial port (possibly controlled by the guest) to file1 and second serial port (controlled by the monitor) to file2:
qemu -hda os.dsk -serial file:file1 -serial file:file2

Running

In this homework, you will first boot the unvirtualized and virtualized Pintos guests on four different platforms, namely, QEMU, Bochs and bare-metal hardware (on a machine called tapas). The Pintos guest will be virtualized using Monee. The machine tapas supports HP iLO (Integrated Lights Out), which allows access and control to tapas remotely in an OS-independent way. Some examples of the capabilities of the iLO interface are: We will call the iLO network interface tapas-ilo. We use tapas-ilo to remotely attach a virtual disk device to tapas and observe it's output using the virtual serial port. By imaging our desired OS image in the virtual disk device, we can boot tapas into our desired OS image remotely in a scriptable manner.

QEMU

You will see the performance statistics in the following format:
bubsort.                                 [11082 ticks (11082 k, 0 u), 27961998677 tsc]
cat tests/threads/bubsort.default.stats
bubsort.default.m                        [16870 ticks (16874 k, 0 u), 42601778662 tsc]
                                         [124 m, 17088 g, 432348832214 tsc]
                                         [tb: 0: 1125, 1125, 1125: 141/1204]
                                         [swap: pd-s: 0: 2, 2, 2]
                                         [swap: pd-u: 0: 2, 2, 2]
                                         [swap: pt-s: 0: 3, 3, 3]
                                         [swap: pt-u: 0: 0, 0, 0]
                                         [swap: pg: 0: 0, 0, 0]
                                         [page-faults: 263: 0 t, 0 m, 263 s, 0 p]
                                         [callouts: 51756 all, 16656 forced]
The first line (bubsort.) represents the performance of unvirtualized Pintos guest running the bubblesort algorithm. The total number of timer ticks are 11082, of which all occur while the CPU was in kernel mode (because the bubblesort program runs inside the kernel in this test). The number of CPU cycles (obtained using rdtsc) is 27961998677.

The next set of results (labeled bubsort.default.m) is for the bubblesort guest running virtualized in Monee. When virtualized, the bubblesort kernel takes 16870 ticks, all of which are again in kernel. The number of CPU cycles is also around 1.5x more than unvirtualized kernel.

The virtualized stats also give other information. You can ignore the second line titled 'm', 'g', and 'tsc'. The third line gives statistics on the translation cache. For this test, there were 0 replacements in the translation cache (the translation cache was big enough to hold all translation blocks). The next three numbers give the average, minimum, and maximum sizes respectively of the translation cache (in terms of the number of translation blocks) over the course of this execution.

The next five lines give statistics on the swap cache (to implement shadow page tables). The statistics are divided into different types of swap pages: pd-s stands for page-directory pages with supervisor privileges (kernel) and pd-u stands for page-directory pages with user privileges (user). Similarly, pt-s and pt-u stand for page-table (L2) pages with supervisor and user privileges respectively. The last swap cache statistics are on regular pages (pg). In all these five cases, the first number (0 in this example) gives the number of replacements that happened. The next three numbers represent the average, minimum, and maximum sizes of the swap cache respectively during the execution.

The next line on page-faults gives statistics on the number of page faults that occurred. The first number (263) is the total number of page faults. The second number labeled 't' is the number of true page faults. The third number labeled 'm' is the number of page faults due to memory traces. The fourth number labeled 's' is the number of hidden page faults due to accesses to removed entries in the shadow page table. The fifth and lasst number labeled 'p' is the number of hidden page faults due to accesses to removed entries in phys_map.

The final statistics are on the number of callouts executed during this execution. This example executed 51756 callouts in all. Forced callouts are needed to handle exceptions and interrupts, and you can ignore this number for now.

The code of the bubblesort program and other benchmarks used in this homework can be found in test/pintos/tests/threads/compute.c.

Notice that the overhead of virtualization for bubsort is 1.5x. This sounds too high for a compute-intensive application. The reason is that these results are obtained on QEMU which is itself an emulator. On QEMU, the overhead of extra executed instructions due to binary translation gets magnified. On real hardware, these extra executed instructions are relatively cheaper and hence you will find the overhead to be less than 10%.

Bochs

Install Bochs, and repeat the steps followed for testing with QEMU, except you need to set the sim variable in the configure script to bochs instead of qemu.

Tapas

Tapas is a physical machine which can run our OS/VMM images. Running our images on Tapas gives us true performance

Exercises

References

[1] A Comparison of Software and Hardware Techniques for x86 Virtualization, K. Adams, O. Agesen. ASPLOS 2006

[2] x86 Architecture:

Frequently Asked Questions (FAQ)