CSL862 : Virtualization and Cloud Computing : HW2 : Binary Translation and Shadow Page Tables

In this homework, you will study Monee, an experimental under-development VMM for 32-bit x86 based on binary translation.

Check out the source code from the SVN repository https://svn.iitd.ernet.in/~sbansal/monitor. You can find instructions on using Subversion here (see Subversion heading). The source code is not publicly available; it is available only to CSL862 students, so please do not redistribute it. Build and run the VMM over Qemu using the instructions given in the README file. Read the following documentation and answer the questions.

Monee is a research prototype VMM which can run only one guest OS. Monee is a bare-metal hypervisor designed to be installable on an existing system. It works by overwriting the boot-sector of an existing disk. Monee can run an unmodified 32-bit x86 guest and performs binary translation to efficiently virtualize the guest. The purpose of this research prototype is to investigate ideas like dynamic compiler optimizations, dynamic OS optimizations, security, reliability, and other applications of client-side virtualization.

The research prototype can successfully boot an unmodified pintos guest. To test some sample pintos guests, type make test. This should run a few pintos tests natively and on the VMM. It also reports some performance characteristics. Because our prototype is under-development, you will find that its performance might be lower than expected.

Binary Translator

The file peep/peep.tab has all the binary translation rules for 32-bit x86. Each rule begins with the entry: keyword. The assembly code between entry: and the -- marker represents the guest code, and the assembly code between the -- marker and the == marker represents the equivalent translated code.

Example Translation Rule

Here is an example of a translation rule:

entry:
  cli
  --
  movb $0, %gs:(vcpu + VCPU_IF_OFF)
  ==

This rule specifies that the cli instruction should be translated to the corresponding movb instruction. The vcpu struct maintains the software state of the virtual CPU. The %gs prefix is used for every access to VMM memory, to distinguish these accesses from guest memory accesses (similar to how it is done in [1]). The constant VCPU_IF_OFF represents the offset of the emulated IF flag in the vcpu struct. We call such rules peephole translation rules.
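To make the role of VCPU_IF_OFF concrete, here is a minimal C sketch. The struct layout, the field names, and the SKETCH_ prefix are hypothetical; the real vcpu struct in Monee is larger and laid out differently. The sketch only shows that the translated cli boils down to a one-byte store at a fixed offset.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified vcpu layout (illustration only). */
struct vcpu_sketch {
    uint32_t regs[8];   /* emulated general-purpose registers */
    uint32_t eip;       /* emulated instruction pointer */
    uint8_t  IF;        /* emulated interrupt-enable flag */
};

/* Byte offset of the emulated IF flag, in the spirit of VCPU_IF_OFF. */
#define SKETCH_VCPU_IF_OFF offsetof(struct vcpu_sketch, IF)

/* What the translated `movb $0, %gs:(vcpu + VCPU_IF_OFF)` achieves:
 * a one-byte store of 0 at a fixed offset inside the vcpu struct. */
static void emulate_cli(struct vcpu_sketch *v)
{
    *((uint8_t *)v + SKETCH_VCPU_IF_OFF) = 0;
}
```

The hardware IF flag is never touched; only the software copy changes, which is why the guest cannot actually disable interrupts for the monitor.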

Another Example

A more complex example of a peephole translation rule is:

entry:
  call *%vr0d
  --
  %tr0d: eax
  %vr0d: no_eax
  --
  pushl $fallthrough_addr
  JUMP_INDIRECT_USE_EAX_TEMP(%vr0d, 0)
  ==

This rule specifies that any indirect function call whose target is the value of register vr0d should be translated as given. The register placeholders %vr0d and %tr0d are substituted with real register names at translation time. In this case, %vr0d is substituted with the register name that occurred in the source instruction, and %tr0d is a placeholder for a temporary register used by the translated code. The translator picks the substitution for %tr0d itself. For example, it could pick a dead register. If all registers are potentially live, the translator picks one of the live registers arbitrarily and emits appropriate save and restore code for that register before and after the translated code.

The lines "%tr0d: eax" and "%vr0d: no_eax" specify constraints on the placeholders %tr0d and %vr0d respectively. The first line specifies that %tr0d must be substituted only with %eax before using this translation. This means no other register may be used as a temporary. This is necessary because the code in macro JUMP_INDIRECT_USE_EAX_TEMP clobbers register eax and so it must be saved and restored by the translator appropriately (if live). Similarly, the second constraint specifies that this rule can only be used if the source instruction used a register which is not eax (that's what the tag no_eax means).
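The temporary-selection policy described above can be sketched in C as follows. This is an illustration under assumptions, not Monee's code: the function name, the bitmask representation (bit i stands for register number i), and the return convention are all hypothetical.

```c
#include <stdint.h>

enum { R_COUNT = 8 };   /* eight 32-bit general-purpose registers */

/*
 * Pick a register for a temporary placeholder.
 *   allowed: bitmask of registers permitted by the constraint tag
 *   live:    bitmask of registers that may be live at this point
 * Returns the chosen register number, or -1 if the constraint allows none.
 * *needs_save is set to 1 when the chosen register is live, so the caller
 * must emit save/restore code around the translated code.
 */
static int pick_temporary(uint8_t allowed, uint8_t live, int *needs_save)
{
    int r;
    /* Prefer a dead register: no save/restore needed. */
    for (r = 0; r < R_COUNT; r++) {
        if ((allowed & (1u << r)) && !(live & (1u << r))) {
            *needs_save = 0;
            return r;
        }
    }
    /* Otherwise fall back to any allowed (live) register. */
    for (r = 0; r < R_COUNT; r++) {
        if (allowed & (1u << r)) {
            *needs_save = 1;
            return r;
        }
    }
    return -1;
}
```

With a constraint that allows only one register (as the eax tag does), the first loop succeeds only if that register is dead; otherwise the second loop returns it and requests save/restore code.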

The translation code in this example also makes use of a special variable called fallthrough_addr which stands for the address of the source instruction just after this call instruction. For example, in this case, fallthrough_addr needs to be pushed onto the stack to emulate the call instruction (done by the first instruction of the translated code). The next line is a GCC macro which gets expanded just like regular macros in C code. The expansion of this macro can be found in peep/peeptab_defs.h which is included from peep/peep.tab. In this case, this macro is used to jump to the indirect address contained in vr0d.

Naming of Placeholders

Register placeholders are named %vr0d, %vr1d, %tr1d, %vr0w, etc. We describe the meaning of these names below:

The first character "v" or "t" specifies if this is an input register (register that occurred in the source instructions) or if this is a temporary register (register that did not occur in the source instructions but is needed by translated code to store temporary data). The value of the temporary register is decided at translation time depending on the constraints and the liveness of available registers.
The second character "r" specifies that this is a register.
The third character is a number (0, 1, ...) to name different placeholders of the same type.
The last character can be either d (32-bit doubleword), w (16-bit word), or b (8-bit byte). For example %vr0w could be replaced with one of ax, bx, cx, dx, sp, bp, si, or di. On the other hand %vr0d will be replaced with one of eax, ebx, etc.
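The naming scheme above is regular enough to decode mechanically. The following C sketch parses a register placeholder name into its three components. The struct and function names are hypothetical, and the parser accepts only the full four-part form (for example %vr0d), not the abbreviated forms that appear in some macro arguments.

```c
#include <ctype.h>
#include <stdlib.h>

struct reg_placeholder {
    int is_temp;   /* 1 for "t" (temporary), 0 for "v" (input) */
    int index;     /* 0, 1, ... */
    int bits;      /* 32, 16, or 8 */
};

/* Parse names such as "%vr0d" or "tr1w". Returns 0 on success, -1 on error. */
static int parse_reg_placeholder(const char *s, struct reg_placeholder *out)
{
    char *end;
    long idx;

    if (*s == '%')
        s++;
    if (*s != 'v' && *s != 't')
        return -1;
    out->is_temp = (*s == 't');
    s++;
    if (*s != 'r')
        return -1;
    s++;
    if (!isdigit((unsigned char)*s))
        return -1;
    idx = strtol(s, &end, 10);
    out->index = (int)idx;
    switch (*end) {
    case 'd': out->bits = 32; break;
    case 'w': out->bits = 16; break;
    case 'b': out->bits = 8;  break;
    default:  return -1;
    }
    return end[1] == '\0' ? 0 : -1;
}
```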

Constant placeholders are named C0, C1, C2, etc. The constant placeholders are substituted using the values occurring in the source instructions. Apart from constant placeholders, literal constants 0, 1, etc. can also be used in the translation rules. In this case, the translation rule can be activated only if that particular constant appears in the source instruction code.

Segment Registers can also use placeholders. For example, consider the following translation rule:

entry:
  mov %vr0d, %vseg0
  --
  %tr0d: no_eax_esp
  %vseg0: no_cs_gs
  --
  MOV_REG_TO_SEG_USE_NO_EAX_TEMP(%vr0, %vseg0, $vseg0, tr0)
  ==

The input instruction in the translation rule mov %vr0d, %vseg0 specifies that the second operand (%vseg0) can be replaced by one of the segment registers cs, ds, es, fs, gs, ss. The segment register placeholders are named %vseg0, %vseg1, and so on.

Memory accesses can be performed using many different addressing modes. Some addressing modes supported by 32-bit x86 are scale-index-base, base+offset, offset-only, etc. The full list of addressing modes can be found in Intel Reference Manual 2A [2]. In most cases, the same translation rule is applicable irrespective of the addressing mode used. Hence, it is useful to have a placeholder for any memory access which can be replaced by the appropriate addressing mode used in the source instruction. We use the MEM32 identifier to specify a placeholder for any memory access using 32-bit addressing. Similarly, MEM16 is used to specify memory access using 16-bit addressing. For example, consider the following rule:

entry:
  lgdt %vseg0:MEM32
  --
  %tr0d: no_eax
  --
  leal MEM32, %tr0d
  CALLOUT2(callout_restore_tr0_and_lgdt, $tr0d, $vseg0)
  ==

This rule pattern matches any lgdt instruction which uses any segment register (specified by placeholder %vseg0) and any 32-bit memory addressing mode (specified by MEM32). (Recall that if no segment is specified in input code, it defaults to %ds). The same identifier MEM32 can be used in the translated code. Because only one identifier MEM32 is supported, one translation rule cannot contain more than one memory access.

Finally, register and segment placeholders can be used in two ways:

With the "%" sign (%vr0d, %tr0d, %vseg0, etc.): In this case, the placeholder is replaced with register %eax, %ebx, %cs, etc., depending on the substitution value.
With the "$" sign ($vr0d, $tr0d, $vseg0, etc.): In this case, the placeholder is replaced by a numeric value representing the name of the substituted register. For example, $vr0d will be replaced with 0, 1, or 2 for substitutions eax, ecx, and edx respectively. Similarly, $vseg0 will be replaced with 0, 1, or 2 for es, cs, or ss respectively. The numeric representation of register names can be found in sys/vcpu_consts.h. The numeric values of register names are useful for callout arguments, if the callout function is interested in reading and writing to that particular register (see lgdt example above).

Register Constraint Tags

The following register constraint tags can be used to specify constraints on register placeholders:

eax: The placeholder must be substituted with eax. This is useful if the translated code is known to clobber eax. By using such a temporary variable, the programmer ensures that eax is properly saved and restored before and after the translated code respectively.
no_eax: The placeholder must not be substituted with eax. This is again useful if the translated code clobbers eax. In this case, we do not want other registers (vr or tr) to be substituted with eax as that could result in incorrect translation.
abcd: The placeholder can be substituted with only one of eax, ebx, ecx, and edx. This tag disallows substitutions of esp, ebp, esi, and edi for this register placeholder. This tag is useful for instructions that only operate on the first four registers. This tag is also useful while using 8-bit registers al, bl, cl, and dl that only exist for the first four registers.
no_esp: The placeholder must not be substituted with esp. This is useful if the translated code is known to clobber esp.
no_eax_esp: The placeholder must not be substituted with either eax or esp. This is useful if the translated code is known to clobber both eax and esp.
cs_gs: This tag is used for placeholders of segment registers. This identifier specifies that this register should only be substituted with either cs or gs. This is useful because cs and gs are both emulated using memory locations and not using hardware registers. Here is an example peephole translation rule using cs_gs tag:
```
entry:
  jmp *%vseg0:MEM32
  --
  %tr0d: eax
  %tr1d: no_eax
  %vseg0: cs_gs
  --
  MOV_SEG_TO_GS_USE_EAX_TEMP0_NO_EAX_TEMP1(vseg0, tr0, tr1)
  movl %gs:MEM32, %tr1d
  RESTORE_GS
  JUMP_INDIRECT_AFTER_TR1_RESTORE_USE_EAX_TEMP(%tr1d, tr0)
  ==
  
```
This rule specifies that if the guest tries to jump to a far address specified using one of cs or gs, first load the emulated segment register (cs or gs) from memory to the hardware register %gs, then read the memory using %gs, before restoring %gs and jumping to the destination.
no_cs_gs: The segment register placeholder must not be substituted with either cs or gs. For the example given above, the following corresponding rule exists for the case where the indirect jmp instruction uses a segment register other than cs or gs:
```
entry:
  jmp *%vseg0:MEM32
  --
  %tr0d: eax
  %tr1d: no_eax
  %vseg0: no_cs_gs
  --
  movl %vseg0:MEM32, %tr1d
  JUMP_INDIRECT_AFTER_TR1_RESTORE_USE_EAX_TEMP(%tr1d, tr0)
  ==
```
Notice that the translation in this example is identical to the previous one except that the original segment register is used (instead of gs as in the previous example) as the guest's segment register should be loaded in hardware in this case.

These tags are defined in peep/insntypes.h.
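As a rough illustration of how the general-purpose register tags listed above restrict substitutions, consider the following C sketch. The enum names and their numbering are hypothetical and arbitrary; they do not correspond to the definitions in peep/insntypes.h or sys/vcpu_consts.h.

```c
/* Hypothetical register and tag identifiers (numbering is arbitrary). */
enum sk_reg { SK_EAX, SK_EBX, SK_ECX, SK_EDX, SK_ESP, SK_EBP, SK_ESI, SK_EDI };
enum sk_tag { SK_TAG_VAR, SK_TAG_EAX, SK_TAG_NO_EAX, SK_TAG_ABCD,
              SK_TAG_NO_ESP, SK_TAG_NO_EAX_ESP };

/* Return 1 if register `r` is a legal substitution under constraint `t`. */
static int tag_allows(enum sk_tag t, enum sk_reg r)
{
    switch (t) {
    case SK_TAG_VAR:        return 1;
    case SK_TAG_EAX:        return r == SK_EAX;
    case SK_TAG_NO_EAX:     return r != SK_EAX;
    case SK_TAG_ABCD:       return r == SK_EAX || r == SK_EBX ||
                                   r == SK_ECX || r == SK_EDX;
    case SK_TAG_NO_ESP:     return r != SK_ESP;
    case SK_TAG_NO_EAX_ESP: return r != SK_EAX && r != SK_ESP;
    }
    return 0;
}
```

A rule is applicable to a source instruction only if every input placeholder's actual register passes its tag; for temporaries, the tag narrows the set of registers the translator may choose from.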

Multiple Peephole Rules Matching a Source Instruction Sequence

If multiple peephole rules match a source instruction (or sequence of instructions), then the one with minimum cost of the translated code is chosen. The choice does not depend on the order of occurrence of the peephole rules.
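A minimum-cost selection of this kind can be sketched as below. The struct, the function name, and the integer cost are assumptions made for illustration; the text does not say how Monee measures the cost of translated code or how it breaks ties.

```c
#include <stddef.h>

struct rule_match {
    int label;   /* identifies the peephole rule */
    int cost;    /* cost of its translated code (metric is an assumption) */
};

/* Return the index of the cheapest match, or -1 if there are none. */
static int cheapest_match(const struct rule_match *m, size_t n)
{
    int best = -1;
    size_t i;
    for (i = 0; i < n; i++) {
        if (best < 0 || m[i].cost < m[best].cost)
            best = (int)i;
    }
    return best;
}
```

As long as the costs are distinct, the same rule wins regardless of the order in which the matches are listed. In this sketch, equal costs are resolved in favor of the earlier entry, which is a simplification.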

Source Code of Binary Translator

The function translate() in peep/peep.c performs the actual translation. Here is a description of the arguments to this function and their meaning:

code: The first argument code is a pointer to the guest's code that needs to be translated.
eip_virt: The second argument eip_virt is the virtual address at which this code lives inside the guest. Note that it is possible for multiple virtual addresses to point to the same code (e.g., aliasing within the same page table, or page sharing across multiple page tables).
tpage: The third argument tpage is a pointer to a region of VMM memory where the translated code must be stored.
tpage_size: The fourth argument tpage_size is the maximum size of the region pointed-to by tpage.
rollbacks: This is a pointer to the rollback code. You can ignore this for now.
tb_len: The sixth argument tb_len is an output argument. The length of the translation block which was translated is returned in this argument. This length will depend on the location of the next control flow instruction starting from code. Recall that we perform the translation of guest code one translation block at a time.
jmp_offsets, edge_offsets: jmp_offsets are offsets in the translated code at which jump target addresses should be patched. Similarly, edge_offsets are offsets in the translated code at which edge-specific code starts (each translation block can have up to 2 outgoing edges).
eip_boundaries: The offsets in the guest code at which each instruction starts in the translation block.
tc_boundaries: The offsets in the translated code at which the translation of each instruction starts in the translation block.
peep_string: A string to identify which peephole rules were applied to translate this block of code. This is used for debugging/logging purposes only.
num_insns: This is an output argument which contains the number of instructions that were translated.
cpu_constraints: This encodes the cpu constraints that should be used to translate this code. For example, this specifies if the code should be translated assuming 16-bit mode or 32-bit mode. Similarly, separate translations can occur depending on whether this code is being translated to handle a trap or not.

The translate() function consults the peephole table represented as peep_tab_entries[] in peep/peep.c to translate the guest code. The table peep_tab_entries[] is generated using the peep.tab files. The code to parse the translation rules in peep.tab and generate the peephole table can be found in peep/peepgen.c. This file encodes the central logic of the binary translator. peepgen.c compiles to a user-level program called peepgen. peepgen takes the peephole translation rules encoded in peep.tab as input and generates the following files in the monee-build/mon build directory:

peepgen_offsets.h: This file contains the offsets of vcpu fields.
peepgen_defs.h: This file contains definitions for the labels for each peephole translation rule. The label is chosen based on the line number at which the peephole rule appears, among other things. This label uniquely identifies a peephole translation rule. This file uses a macro called DEF() with each label. This file is included in two different places in peep/peep.c with different meanings of DEF(). For example, this file is included for defining peepgen_label_t by appropriately declaring DEF such that it concatenates PEEP_PREFIX to every label. Similarly, it is included while defining the array peep_label_str such that each label has a corresponding string.
peepgen_entries.h: This file contains the peephole table entries generated from peep.tab. The structure of a peephole entry (struct peep_entry_t) can be found in peep/peeptab.h. Here is a description of the fields inside a peephole table entry:
- n_tmpl: The number of guest instructions (templates) to match
- tmpl[]: The actual guest instructions that should be matched. The format of each instruction can be found in peep/insntypes.h.
- label: The label identifying this peephole translation rule
- n_temporaries: The number of temporaries used by this rule
- temporaries[]: The constraints on these temporaries as specified by tag_t. The temporaries are labeled tr0, tr1, tr2, .... Each entry in this array specifies the type of the corresponding temporary. For example, if temporaries[0] is tag_eax, this means that only the eax register can be used as a temporary for tr0. Apart from the tags specified in peep.tab (as explained earlier), the other default tags are tag_var (represents a register or constant placeholder with no constraints) and tag_const (represents a register or constant value, not placeholder).
- cpu_constraints: This specifies the CPU constraints (e.g., 16-bit or 32-bit) under which this peephole translation rule can be used.
- nomatch_pairs: Ignore this for now.
peepgen_gencode.h: The peephole rules use placeholders for registers and constants. For example, the identifier %vr0d in snippet mov %vr0d, %cr3 can be replaced with any of the eight x86 registers (eax, ecx, ...). Depending on which register it is, the output code needs to be appropriately patched. The output code for each peephole rule is stored in mon/out.o. The patching code is generated in mon/peepgen_gencode.h.
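The DEF() technique described for peepgen_defs.h is a common C idiom (often called an X-macro). The sketch below imitates it in a single file: a list macro stands in for the generated header, and the label names and SKETCH_ identifiers are made up. It is meant only to show how one list can yield both an enum and a matching string table.

```c
#include <string.h>

/* Stand-in for the contents of peepgen_defs.h: one DEF() per rule label.
 * The label names here are made up. */
#define SKETCH_PEEPGEN_DEFS \
    DEF(cli_19)             \
    DEF(call_indirect_31)

/* First expansion: build an enum, prefixing every label. */
#define DEF(x) SKETCH_PEEP_##x,
typedef enum { SKETCH_PEEPGEN_DEFS SKETCH_PEEP_COUNT } sketch_label_t;
#undef DEF

/* Second expansion: build the matching table of strings. */
#define DEF(x) #x,
static const char *const sketch_label_str[] = { SKETCH_PEEPGEN_DEFS };
#undef DEF

/* Map a label string back to its enum value;
 * returns SKETCH_PEEP_COUNT if the name is unknown. */
static sketch_label_t sketch_label_from_name(const char *name)
{
    int i;
    for (i = 0; i < SKETCH_PEEP_COUNT; i++)
        if (strcmp(sketch_label_str[i], name) == 0)
            return (sketch_label_t)i;
    return SKETCH_PEEP_COUNT;
}
```

Because both expansions come from the same list, the enum values and the strings cannot drift out of sync when rules are added to or removed from peep.tab.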

Callouts are implemented for complex functionality. The callouts can be found in peep/callouts.c. References to the callouts can be found in peep/peep.tab. The macros CALLOUT0, CALLOUT1, CALLOUT2, ... are used to make calls to functions with 0, 1, 2, ... arguments respectively. Register contents (e.g., %vr0d, %eax, etc.), register identifiers (e.g., $vr0d, $tr0d, etc.), or constants can be passed as arguments to the callout functions. Before entry to the callout function, all CPU state is saved to the vcpu struct. The code in the callout function can manipulate vcpu state. On return from a callout, the vcpu state is loaded back into hardware registers before executing the next instruction in the translation cache.

Shadow Page Tables

Monee uses shadow page tables to virtualize memory. Because we run only one guest at a time, we only need to protect the monitor (VMM) memory from guest accesses.

The monitor occupies physical pages from LOADER_MONITOR_BASE to LOADER_MONITOR_END. These constants have been defined in sys/loader.h. LOADER_MONITOR_BASE is at 4MB. The monitor memory consists of two parts: VMM and VMX. VMM is the part of the monitor that lives in the guest's virtual address space starting at LOADER_VMM_VIRT_BASE. VMX is the part of the monitor that is used for other data structures of the monitor which do not need to be mapped in the guest address space. Both VMM and VMX are sized at 4MB each. The VMX memory is not currently used.

Here is the physical address space layout for the monitor:

   PHYS_MAX    +-----------------------------------+
               |                                   |
               |                                   |
               |                                   |
               |            Guest                  |
               |                                   |
               |                                   |
               |                                   |
       12MB    +-----------------------------------+
               |                                   |
               |              VMX                  |
               |                                   |
        8MB    +-----------------------------------+
               |                                   |
               |              VMM                  |
               |                                   |
        4MB    +-----------------------------------+
               |                                   |
               |             Guest                 |
               |                                   |
          0    +-----------------------------------+

The monitor steals 8MB of memory from the guest's physical address space. It maintains a swap space of 8MB on disk to store the displaced guest pages. Each time the guest accesses its physical memory located in the displaced region, it causes a page fault (#PF) and the monitor serves the page from the disk. Hence, before transferring control to the guest, the shadow page tables need to be appropriately configured so that the guest faults on an access to pages in the displaced region.
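The address arithmetic implied by the physical layout above can be sketched in C. The constants mirror the 4MB and 12MB boundaries from the diagram (the real values are defined in sys/loader.h), the end address is assumed to be exclusive, and the linear mapping from displaced pages to swap slots is a hypothetical choice made for illustration.

```c
#include <stdint.h>

/* Values taken from the layout above; the real constants live in sys/loader.h. */
#define SKETCH_MONITOR_BASE 0x00400000u   /* 4MB  */
#define SKETCH_MONITOR_END  0x00c00000u   /* 12MB, assumed exclusive */
#define SKETCH_PAGE_SIZE    4096u

/* Does this guest-physical address fall in the region taken by the monitor? */
static int is_displaced(uint32_t guest_paddr)
{
    return guest_paddr >= SKETCH_MONITOR_BASE && guest_paddr < SKETCH_MONITOR_END;
}

/* Index of the 4KB swap slot backing a displaced guest-physical address
 * (hypothetical linear layout of the 8MB swap space). */
static uint32_t swap_slot(uint32_t guest_paddr)
{
    return (guest_paddr - SKETCH_MONITOR_BASE) / SKETCH_PAGE_SIZE;
}
```

An 8MB region holds 2048 pages of 4KB each, so slot indices range from 0 to 2047.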

The code for swapping pages to and from disk for this displaced region can be found in mem/swap.c. Standard page replacement algorithms are used to ensure that only a small number of page faults occur by maintaining the hot displaced pages in the swap cache (backed by physical memory). The swap cache lives in the monitor address space. The hot pages are maintained in the cache and the shadow page table points to them.

The VMM needs to be mapped into the guest's virtual address space because the translated code of the guest is stored in VMM memory and can access the guest's data. To do this, we always keep the VMM mapped to the top 4MB of the guest's virtual address space. Here is a layout of the virtual address space at any time:

 0xffffffff    +-----------------------------------+
               |                                   |
               |              VMM                  |
               |                                   |
 0xffc00000    +-----------------------------------+
               |                                   |
               |                                   |
               |                                   |
               |                                   |
               |                                   |
               |             Guest                 |
               |                                   |
               |                                   |
               |                                   |
               |                                   |
               |                                   |
               |                                   |
               |                                   |
          0    +-----------------------------------+

The VMM contains the translation cache, the code to perform the translation, and other VMM-specific data structures. The guest code in the guest's address space is translated and stored in the translation cache. Only the translated code is executed (the original guest code is never executed directly).

VMM memory needs to be protected from guest accesses. Because it is impossible to know the memory addresses accessed by the guest at translation time, the only way to ensure protection is through runtime checks. We use segmentation hardware to ensure that the guest can never read/write VMM memory. All guest segments are truncated at 0xffc00000 so that any access to a virtual address at or above this value causes a general-protection exception (#GP). The only exceptions to this are the %cs and %gs segments, which are not truncated. The %cs segment is needed to execute translated code, which lives above 0xffc00000, and hence cannot be truncated. Similarly, %gs is deliberately reserved to be able to access VMM data structures (e.g., vcpu struct) from the translated code. This works because most guests rarely use segmentation, and even if they do, they rarely use the %cs and %gs registers. If any guest instruction uses these registers, we binary translate that instruction appropriately so that the guest is still unable to access VMM memory (see uses of operand_accesses_cs_gs() in peep/peep.c). This mechanism is similar to that used by VMware's binary translator [1].
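The effect of truncating a segment at 0xffc00000 can be illustrated with a small C model. It assumes a flat guest segment with base 0 and page granularity (G=1), which is an assumption about the guest and not something the text states; the function names are hypothetical.

```c
#include <stdint.h>

#define SKETCH_VMM_VIRT_BASE 0xffc00000u   /* start of the VMM region */

/* Limit field for a page-granular (G=1), base-0 segment that ends just below
 * the VMM: the limit counts 4KB pages and is inclusive. */
static uint32_t truncated_limit_pages(void)
{
    return (SKETCH_VMM_VIRT_BASE >> 12) - 1;
}

/* Would an access of `len` bytes at `addr` pass the truncated limit check? */
static int guest_access_ok(uint32_t addr, uint32_t len)
{
    uint64_t last = (uint64_t)addr + len - 1;   /* last byte touched */
    return len != 0 && last < SKETCH_VMM_VIRT_BASE;
}
```

The hardware performs this comparison on every access through the segment, which is why no per-instruction software check is needed in the translated code.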

Each time the guest loads a page table (for one of its processes, for example) through the mov %eax, %cr3 instruction, the monitor translates it into a call to callout_mov_to_cr3(). This function checks to see if the new page table is different from the current one. If so, it creates a shadow page table for the newly loaded page table through a call to the shadow_pagedir_sync() function.

The monitor maintains a page table called phys_map. This page table is used to access guest physical memory. The code to initialize phys_map in phys_map_init() function is helpful to understand the need for this table. The phys_map structure contains an identity mapping (using large 4MB pages) for all physical memory, except for pages between LOADER_MONITOR_BASE and LOADER_MONITOR_END. For pages in this range, regular 4KB pages are used, and an empty page table is initialized. If the pages in this range are accessed, they are appropriately swapped-in in 4KB chunks. The swap_get_page() function is used for swapping in a page corresponding to a physical address.
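The structure of phys_map described above can be sketched as a 32-bit x86 page directory. This is not the code of phys_map_init(): the function name and the empty_pt_phys parameter are hypothetical, the sketch maps the full 4GB range for simplicity, and the monitor range is hard-coded to the 4MB to 12MB region shown in the physical layout.

```c
#include <stdint.h>

#define PDE_PRESENT  0x001u
#define PDE_WRITABLE 0x002u
#define PDE_LARGE    0x080u            /* PS bit: entry maps a 4MB page */
#define SKETCH_MONITOR_BASE 0x00400000u
#define SKETCH_MONITOR_END  0x00c00000u

/* Fill a 1024-entry page directory the way the text describes phys_map.
 * empty_pt_phys[i] is the physical address of a zeroed page table to use
 * for directory slot i inside the monitor range (hypothetical parameter). */
static void phys_map_sketch(uint32_t pd[1024], const uint32_t empty_pt_phys[1024])
{
    uint32_t i;
    for (i = 0; i < 1024; i++) {
        uint32_t base = i << 22;   /* each slot covers 4MB */
        if (base >= SKETCH_MONITOR_BASE && base < SKETCH_MONITOR_END)
            pd[i] = empty_pt_phys[i] | PDE_PRESENT | PDE_WRITABLE;
        else
            pd[i] = base | PDE_PRESENT | PDE_WRITABLE | PDE_LARGE;
    }
}
```

Because the page tables for the monitor range start out empty, the first access to any 4KB page in that range faults, which gives the monitor the opportunity to swap the page in.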

The shadow_pagedir_sync() function first switches to phys_map (because all accesses in this routine will use guest physical addresses). It then calls swap_load_shadow_page_dirs() to load the shadow page table, and then switches to it using the switch_to_shadow() function. The swap pages and pages containing the shadow page table are allocated from the POOL_SWAP memory pool. The shadow page tables are also cached in memory (to avoid reconstruction on every page table switch), just like other swap pages. This logic is implemented in mem/swap.c.

Memory Traces

It is possible for the guest to modify its page tables or modify its code regions (for self-modifying code). These modifications are dangerous because they can cause our shadow structures or translated code to become stale (invalid). For this reason, we need to trace any write to these memory locations which we have shadowed. This is implemented using mtraces in mem/swap.c.

Device Emulation

We avoid writing detailed device emulation code by passing through the hardware devices directly to the guest. Because we run only one guest, this is usually possible. All hardware devices are transparently made visible to the guest by simply letting the guest's I/O instructions (e.g., in, out) directly access hardware I/O ports. This considerably simplifies the design of our monitor and lets us concentrate on the binary translation and memory virtualization aspects.

There are three exceptions to our device model, namely keyboard, disk, and serial port. To be able to control our monitor, we emulate the keyboard. Most keystrokes are passed through directly to hardware. Certain combinations of keystrokes are intended to be caught by the monitor. This is not currently implemented.

Similarly, the monitor itself lives on the disk and hence the disk cannot be made fully visible to the guest. For example, the monitor occupies the boot sector of the disk. If the guest tries to read its boot sector, it should see its own boot sector contents and not the monitor's. Hence, we emulate an IDE disk (hw/ide.c) and perform appropriate translation for the guest's disk accesses before accessing the physical disk.

We use the second serial port (at ioport address 0x2f8 and irq 0x23) to log the monitor's activities for debugging. We assume that the guest will not use this serial port.

Logging

The printf() function logs to the second serial port. The contents of this serial port can be obtained by redirecting the serial port output to a file. For example, the following command redirects the first serial port (possibly controlled by the guest) to file1 and second serial port (controlled by the monitor) to file2:

qemu -hda os.dsk -serial file:file1 -serial file:file2

Running

In this homework, you will first boot the unvirtualized and virtualized Pintos guests on three different platforms, namely QEMU, Bochs, and bare-metal hardware (on a machine called tapas). The Pintos guest will be virtualized using Monee. The machine tapas supports HP iLO (Integrated Lights Out), which allows remote access to and control of tapas in an OS-independent way. Some examples of the capabilities of the iLO interface are:

Remote power off and power on: A network interface is dedicated to iLO and remains always-on. Among other things, this network interface runs an SSH server and a webserver. The OS remains unaware of this interface.
Ability to attach remote devices to the machine. iLO supports attaching a Virtual Keyboard, Virtual USB drive, and Virtual Serial Port. By attaching these "virtual" devices using iLO, it is possible to remotely control the device (both I/O and storage) even when there is no OS installed on it.

We will call the iLO network interface tapas-ilo. We use tapas-ilo to remotely attach a virtual disk device to tapas and observe its output using the virtual serial port. By writing our desired OS image to the virtual disk device, we can boot tapas into that image remotely in a scriptable manner.

QEMU

Install QEMU and ensure that it is in your execution path.
In the monee/ subdirectory, edit the fourth line of the configure file to set sim=qemu.
cd monee && ./configure
make clean && make && make test

You will see the performance statistics in the following format:

bubsort.                                 [11082 ticks (11082 k, 0 u), 27961998677 tsc]
cat tests/threads/bubsort.default.stats
bubsort.default.m                        [16870 ticks (16874 k, 0 u), 42601778662 tsc]
                                         [124 m, 17088 g, 432348832214 tsc]
                                         [tb: 0: 1125, 1125, 1125: 141/1204]
                                         [swap: pd-s: 0: 2, 2, 2]
                                         [swap: pd-u: 0: 2, 2, 2]
                                         [swap: pt-s: 0: 3, 3, 3]
                                         [swap: pt-u: 0: 0, 0, 0]
                                         [swap: pg: 0: 0, 0, 0]
                                         [page-faults: 263: 0 t, 0 m, 263 s, 0 p]
                                         [callouts: 51756 all, 16656 forced]

The first line (bubsort.) represents the performance of the unvirtualized Pintos guest running the bubblesort algorithm. The total number of timer ticks is 11082, all of which occur while the CPU is in kernel mode (because the bubblesort program runs inside the kernel in this test). The number of CPU cycles (obtained using rdtsc) is 27961998677.

The next set of results (labeled bubsort.default.m) is for the bubblesort guest running virtualized in Monee. When virtualized, the bubblesort kernel takes 16870 ticks, all of which are again in kernel. The number of CPU cycles is also around 1.5x more than unvirtualized kernel.
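The 1.5x figure follows directly from the two tsc values in the sample output. A small helper (the function name is hypothetical) makes the arithmetic explicit:

```c
/* Slowdown factor of a virtualized run relative to a native run. */
static double slowdown(unsigned long long virt_tsc, unsigned long long native_tsc)
{
    return (double)virt_tsc / (double)native_tsc;
}
```

For the numbers shown, slowdown(42601778662ULL, 27961998677ULL) is approximately 1.52, and the same ratio computed from the tick counts (16870 and 11082) is also approximately 1.52.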

The virtualized stats also give other information. You can ignore the second line titled 'm', 'g', and 'tsc'. The third line gives statistics on the translation cache. For this test, there were 0 replacements in the translation cache (the translation cache was big enough to hold all translation blocks). The next three numbers give the average, minimum, and maximum sizes respectively of the translation cache (in terms of the number of translation blocks) over the course of this execution.

The next five lines give statistics on the swap cache (to implement shadow page tables). The statistics are divided into different types of swap pages: pd-s stands for page-directory pages with supervisor privileges (kernel) and pd-u stands for page-directory pages with user privileges (user). Similarly, pt-s and pt-u stand for page-table (L2) pages with supervisor and user privileges respectively. The last swap cache statistics are on regular pages (pg). In all these five cases, the first number (0 in this example) gives the number of replacements that happened. The next three numbers represent the average, minimum, and maximum sizes of the swap cache respectively during the execution.

The next line on page-faults gives statistics on the number of page faults that occurred. The first number (263) is the total number of page faults. The second number labeled 't' is the number of true page faults. The third number labeled 'm' is the number of page faults due to memory traces. The fourth number labeled 's' is the number of hidden page faults due to accesses to removed entries in the shadow page table. The fifth and last number labeled 'p' is the number of hidden page faults due to accesses to removed entries in phys_map.

The final statistics are on the number of callouts executed during this execution. This example executed 51756 callouts in all. Forced callouts are needed to handle exceptions and interrupts, and you can ignore this number for now.

The code of the bubblesort program and other benchmarks used in this homework can be found in test/pintos/tests/threads/compute.c.

Notice that the overhead of virtualization for bubsort is 1.5x. This sounds too high for a compute-intensive application. The reason is that these results are obtained on QEMU, which is itself an emulator. On QEMU, the cost of the extra instructions executed due to binary translation gets magnified. On real hardware, these extra instructions are relatively cheaper, and you will find the overhead to be less than 10%.
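The paragraph above quotes overhead once as a multiplier (1.5x) and once as a percentage (10%). The helpers below are a small sketch for converting between the two when you report your own numbers; the function names are hypothetical, and the sketch reads "1.5x" as the ratio of virtualized to native cost, which is an assumption.

```c
#include <assert.h>

/* Ratio of virtualized to native cost; 1.0 means no overhead. */
static double slowdown(unsigned long long virt, unsigned long long native)
{
    assert(native > 0);
    return (double)virt / (double)native;
}

/* Overhead as a percentage above native; 0.0 means no overhead. */
static double overhead_percent(unsigned long long virt,
                               unsigned long long native)
{
    return (slowdown(virt, native) - 1.0) * 100.0;
}
```

With these definitions, a slowdown of 1.5 corresponds to a 50% overhead, and an overhead below 10% corresponds to a slowdown below 1.1.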

Bochs

Install Bochs, and repeat the steps followed for testing with QEMU, except you need to set the sim variable in the configure script to bochs instead of qemu.

Tapas

Tapas is a physical machine which can run our OS/VMM images. Running our images on Tapas gives us true performance numbers.

To boot Tapas into your OS/VMM images, you need a webserver (systems.cse.iitd.ernet.in) and an ILO interface to the machine, tapas-ilo (10.20.3.27). Ask the TAs for an account on the webserver. Set your environment variables USER and PASSWORD to the username and password of your account on the webserver. The webserver hosts the disk image that is used to boot Tapas.
Set the sim variable in the configure script to tapas and repeat all the steps you performed for running QEMU. Refer to the README file in the top-level directory for troubleshooting.
We use Oracle Grid Engine to implement first-in-first-out queueing to arbitrate multiple simultaneous requests on Tapas. Here are some tips on using and querying the queueing system. Log on to systems.cse.iitd.ernet.in and try the following commands:
```
To see status of your jobs:
bash$ qstat -f -r

To see status of all users' jobs:
bash$ qstat -f -r -u \* | less

To submit a job manually (please do not submit more than one job at a time):
bash$ qsub -b y  
```

Exercises

Run the tests in test/pintos/tests/threads/Make.tests (namely bubsort, fibo-recursive, fibo-iter, hanoi1, hanoi2, hanoi3, printf, emptyloop). Report and briefly explain (one line each) the performance of unvirtualized and virtualized executions on Tapas on these benchmarks. (There is no need to report performance numbers on QEMU and Bochs; use these platforms only for debugging and similar tasks.)
Count the number of function calls made by the guest for bubsort by modifying the translation of the call instruction. When modifying the translation, take care that the added code does not overwrite the hardware flags. Hint: The lea instruction might be useful.
Write a benchmark in compute.c that triggers tb replacements. Study the effect of the number of tb replacements on the performance of the system.
Write a benchmark in compute.c that triggers swap cache replacements. Study the effect of the number of swap cache replacements on the performance of the system.
Estimate the performance slowdown caused by a callout using the following strategy:
- Add a peephole rule in peep.tab which performs a callout for any instruction (e.g., nop) that would otherwise have executed natively.
- Write a benchmark which executes this instruction in a loop.
- Compare the performance of the guest with and without the callout to estimate the performance cost of a callout.
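The arithmetic behind the callout-cost strategy can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the cycle counts are whatever your two runs report, and the loop body uses GCC-style inline assembly for the nop instruction.

```c
#include <assert.h>

/* Benchmark body: execute `iterations` nop instructions. The volatile asm
 * statement keeps the compiler from optimizing the loop away. */
static void nop_loop(unsigned long iterations)
{
    unsigned long i;
    for (i = 0; i < iterations; i++)
        __asm__ volatile ("nop");
}

/* Estimated cost of one callout, in cycles.
 * cycles_with:    cycles for the run where nop triggers a callout
 * cycles_without: cycles for the run where nop executes natively
 * iterations:     number of nops executed (one callout each) */
static double callout_cost(unsigned long long cycles_with,
                           unsigned long long cycles_without,
                           unsigned long iterations)
{
    assert(iterations > 0);
    return ((double)cycles_with - (double)cycles_without)
           / (double)iterations;
}
```

Because both runs execute the same loop, the loop's own bookkeeping cost appears in both measurements and cancels in the subtraction. A large iteration count helps keep fixed startup costs and measurement noise small relative to the difference.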

References

[1] A Comparison of Software and Hardware Techniques for x86 Virtualization, K. Adams, O. Agesen. ASPLOS 2006

[2] x86 Architecture:

A Guide to Programming Intel IA32 PC Architecture by Kai Li
IA-32 Intel Architecture Software Developer's Manual
Brennan's Guide to Inline Assembly: details how to embed assembly language in GCC source code.
The GNU Assembler
The Art of Assembly Language by Randall Hyde
PC Assembly Language by Paul A. Carter

Frequently Asked Questions (FAQ)

binutils-2.19.tar.bz2 not found: download the binutils-2.19.tar.bz2 tarball and copy it to the tars/ directory.