make test. This should run a few Pintos tests natively and on the VMM, and report some performance characteristics. Because our prototype is still under development, its performance may be lower than expected.
peep/peep.tab has all the binary translation rules for 32-bit x86. Each rule begins with the
entry: keyword. The assembly code between the entry: keyword and the
-- marker represents the guest code, and the assembly code between the
-- marker and the
== marker represents the equivalent translated code.
For example, the rule

    entry:
      cli
    --
      movb $0, %gs:(vcpu + VCPU_IF_OFF)
    ==

specifies that the cli instruction should be translated to the corresponding movb instruction. The
vcpu struct maintains the software state of the virtual CPU. The
%gs prefix is used to access any VMM memory to distinguish these accesses from other guest memory accesses (similar to how it is done in ). The constant VCPU_IF_OFF represents the offset of the emulated
IF flag in the
vcpu struct. We call these translation rules peephole translation rules.
    entry:
      call *%vr0d
    --
      %tr0d: eax
      %vr0d: no_eax
    --
      pushl $fallthrough_addr
      JUMP_INDIRECT_USE_EAX_TEMP(%vr0d, 0)
    ==

This rule specifies that any indirect function call using the value of register %vr0d (
%vr0d is a placeholder for the actual register that will be substituted at translation time) as the target should be translated as given. The register placeholders %vr0d and
%tr0d are substituted with real register names at translation time. For example, in this case,
%vr0d will be substituted with the register name that occurred in the source instruction.
%tr0d stands for a placeholder for a temporary register that is used by the translated code. The translator picks the substitution for
%tr0d itself. For example, it could pick a dead register to substitute for
%tr0d. If all registers are potentially live, then the translator picks one of the live registers arbitrarily and emits appropriate save and restore code for that register before and after the translated code.
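The choice of a temporary register described above can be sketched in C as follows. This is only an illustration, not the code in peep/peep.c: the register numbering, the bitmask representation of constraints and liveness, and the names pick_temp and temp_choice are all assumptions made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical register numbering, used only in this sketch. */
enum { R_EAX, R_ECX, R_EDX, R_EBX, R_ESP, R_EBP, R_ESI, R_EDI, NUM_REGS };

struct temp_choice {
    int  reg;          /* register chosen for the temporary placeholder */
    bool needs_spill;  /* true if the register is live and must be saved/restored */
};

/* allowed: bitmask of registers permitted by the rule's constraint.
 * live:    bitmask of registers that may be live at this point. */
static struct temp_choice pick_temp(uint8_t allowed, uint8_t live)
{
    struct temp_choice c = { -1, false };

    assert(allowed != 0);

    /* Prefer a dead register: no save/restore code is needed. */
    for (int r = 0; r < NUM_REGS; r++) {
        if ((allowed & (1u << r)) && !(live & (1u << r))) {
            c.reg = r;
            return c;
        }
    }
    /* Every allowed register is live: take the first one and spill it. */
    for (int r = 0; r < NUM_REGS; r++) {
        if (allowed & (1u << r)) {
            c.reg = r;
            c.needs_spill = true;
            return c;
        }
    }
    return c; /* not reached, since allowed != 0 */
}
```

With all registers allowed and only eax live, this sketch returns ecx with no spill. With only eax allowed and every register live, it returns eax and asks for save and restore code around the translation.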
The lines "%tr0d: eax" and "%vr0d: no_eax" specify constraints
on the placeholders %tr0d and %vr0d respectively. The first line specifies that
%tr0d must be substituted only with eax when
using this translation. This means no other register may be used as a temporary. This
is necessary because the code in macro JUMP_INDIRECT_USE_EAX_TEMP uses
eax and so it must be saved and restored by the translator
appropriately (if live). Similarly, the second constraint specifies that this rule
can only be used if the source instruction used a register which is not
eax (that's what the tag no_eax means). The
translation code in this example also makes use of a special variable called fallthrough_addr
which stands for the address of the source instruction just after this call instruction.
For example, in this case,
fallthrough_addr needs to be pushed onto
the stack to emulate the
call instruction (done by the first instruction
of the translated code). The next line is a GCC macro which gets expanded
just like regular macros in C code. The expansion of this macro can be found in
peep/peeptab_defs.h which is included from
In this case, this macro is used to jump to the indirect address contained in %vr0d.
Register placeholders have names such as %vr0d, %tr0d, %vr0w, etc. We describe the meaning of these register placeholder names below:
"v" or "
t" specifies whether this is an input register (a register that occurred in the source instructions) or a temporary register (a register that did not occur in the source instructions but is needed by the translated code to store temporary data). The value of the temporary register is decided at translation time depending on the constraints and the liveness of available registers.
"r" specifies that this is a register
1, ...) to name different placeholders of the same type.
w (16-bit word), or
b (8-bit byte). For example,
%vr0w could be replaced with one of
di. On the other hand,
%vr0d will be replaced with one of
Constant placeholders are named
C2, etc. Constant
placeholders are substituted using the values occurring in the source instructions. Apart from constant
1, etc. can also be used in the translation
rules. In this case, the translation rule can be activated only if that particular constant appears
in the source instruction code.
Segment Registers can also use placeholders. For example, consider the following translation rule:
    entry:
      mov %vr0d, %vseg0
    --
      %tr0d: no_eax_esp
      %vseg0: no_cs_gs
    --
      MOV_REG_TO_SEG_USE_NO_EAX_TEMP(%vr0, %vseg0, $vseg0, tr0)
    ==

The input instruction in the translation rule,
mov %vr0d, %vseg0, specifies that the second operand (
%vseg0) can be replaced by one of the segment registers
ss. The segment register placeholders are named
%vseg1, and so on.
Memory accesses can be performed using many different addressing modes. Some addressing modes
supported by 32-bit x86 are
etc. The full list of translation codes can be found in Intel Reference Manual 2A.
In most cases, the same translation rule is applicable irrespective of the addressing mode used.
Hence, it is useful to have a placeholder for any memory access which can be replaced by
the appropriate addressing mode used in the source instruction. We use the MEM32
identifier to specify a placeholder for any memory access using 32-bit addressing. Similarly,
MEM16 is used to specify memory access using 16-bit addressing. For example,
consider the following rule:
    entry:
      lgdt %vseg0:MEM32
    --
      %tr0d: no_eax
    --
      leal MEM32, %tr0d
      CALLOUT2(callout_restore_tr0_and_lgdt, $tr0d, $vseg0)
    ==

This rule pattern matches any
lgdt instruction which uses any segment register (specified by placeholder
%vseg0) and any 32-bit memory addressing mode (specified by
MEM32). (Recall that if no segment is specified in input code, it defaults to
%ds.) The same identifier
MEM32 can be used in the translated code. Because only one identifier
MEM32 is supported, one translation rule cannot contain more than one memory access.
Finally, register and segment placeholders can be used in two ways:
%" sign (
%vseg0, etc.): In this case, the placeholder is replaced with register
%cs, etc., depending on the substitution value.
$" sign (
$vseg0, etc.): In this case, the placeholder is replaced by a numeric value representing the name of the substituted variable. For example,
$vr0d will be replaced with
2, for substitutions
$vseg0 will be replaced with
ss respectively. The numeric representation of register names can be found in
sys/vcpu_consts.h. The numeric values of register names are useful for callout arguments, if the callout function is interested in reading and writing to that particular register (see
eax: The placeholder must be substituted with
eax. This is useful if the translated code is known to clobber
eax. By using such a temporary variable, the programmer ensures that
eax is properly saved and restored before and after the translated code respectively.
no_eax: The placeholder must not be substituted with
eax. This is again useful if the translated code clobbers
eax. In this case, we do not want other registers (
tr) to be substituted with
eax as that could result in incorrect translation.
abcd: The placeholder can be substituted with only one of
edx. This tag disallows substitutions of
edi for this register placeholder. This tag is useful for instructions that only operate on the first four registers. This tag is also useful while using 8-bit registers
dl that only exist for the first four registers.
no_esp: The placeholder must not be substituted with
esp. This is useful if the translated code is known to clobber
no_eax_esp: The placeholder must not be substituted with either
esp. This is useful if the translated code is known to clobber both
cs_gs: This tag is used for placeholders of segment registers. This identifier specifies that this register should only be substituted with either
gs. This is useful because
gs are both emulated using memory locations and not using hardware registers. Here is an example peephole translation rule using cs_gs:
    entry:
      jmp *%vseg0:MEM32
    --
      %tr0d: eax
      %tr1d: no_eax
      %vseg0: cs_gs
    --
      MOV_SEG_TO_GS_USE_EAX_TEMP0_NO_EAX_TEMP1(vseg0, tr0, tr1)
      movl %gs:MEM32, %tr1d
      RESTORE_GS
      JUMP_INDIRECT_AFTER_TR1_RESTORE_USE_EAX_TEMP(%tr1d, tr0)
    ==

This rule specifies that if the guest tries to jump to a far address specified using one of cs or
gs, first load the emulated segment register (cs or
gs) from memory to the hardware register
%gs, then read the memory using
%gs, before restoring
%gs and jumping to the destination.
no_cs_gs: The segment register placeholder must not be substituted with one of cs or
gs. For the example given above, the following corresponding rule exists if the indirect
jmp instruction uses a segment register other than cs or gs:

    entry:
      jmp *%vseg0:MEM32
    --
      %tr0d: eax
      %tr1d: no_eax
      %vseg0: no_cs_gs
    --
      movl %vseg0:MEM32, %tr1d
      JUMP_INDIRECT_AFTER_TR1_RESTORE_USE_EAX_TEMP(%tr1d, tr0)
    ==

Notice that the translation in this example is identical to the previous one except that the original segment register is used (instead of
gs as in the previous example) as the guest's segment register should be loaded in hardware in this case.
The translate() function in peep/peep.c performs the actual translation. Here is a description of the arguments to this function and their meaning:
code: The first argument
code is a pointer to the guest's code that needs to be translated.
eip_virt: The second argument
eip_virt is the virtual address at which this code lives inside the guest. Note that it is possible for multiple virtual addresses to point to the same code (e.g., aliasing within the same page table, or page sharing across multiple page tables).
tpage: The third argument
tpage is a pointer to a region of VMM memory where the translated code must be stored.
tpage_size: The fourth argument
tpage_size is the maximum size of the region pointed to by tpage.
rollbacks: This is a pointer to the rollback code. You can ignore this for now.
tb_len: The sixth argument
tb_len is an output argument. The length of the translation block which was translated is returned in this argument. This length will depend on the location of the next control flow instruction starting from
code. Recall that we perform the translation of guest code one translation block at a time.
jmp_offsets are offsets in the translated code at which jump target addresses should be patched. Similarly,
edge_offsets are offsets in the translated code at which edge-specific code starts (each translation block can have up to 2 outgoing edges).
eip_boundaries: The offsets in the guest code at which each instruction starts in the translation block.
tc_boundaries: The offsets in the translated code at which the translation of each instruction starts in the translation block.
peep_string: A string to identify which peephole rules were applied to translate this block of code. This is used for debugging/logging purposes only.
num_insns: This is an output argument which contains the number of instructions that were translated.
cpu_constraints: This encodes the cpu constraints that should be used to translate this code. For example, this specifies if the code should be translated assuming 16-bit mode or 32-bit mode. Similarly, separate translations can occur depending on whether this code is being translated to handle a trap or not.
The translate() function consults the peephole table, represented as peep_tab_entries in
peep/peep.c, to translate the guest code. The table
peep_tab_entries is generated using the
peep.tab files. The code to parse the translation rules in
peep.tab and generate the peephole table can be found in
peep/peepgen.c. This file encodes the central logic of the binary translator.
peepgen.c compiles to a user-level program called peepgen.
peepgen takes the peephole translation rules encoded in
peep.tab as input and generates the following files in
peepgen_offsets.h: This file contains the offsets of
peepgen_defs.h: This file contains definitions for the labels for each peephole translation rule. The label is chosen based on the line number at which the peephole rule appears, among other things. This label uniquely identifies a peephole translation rule. This file uses a macro called
DEF() with each label. This file is included in two different places in
peep/peep.c with different meanings of
DEF(). For example, this file is included for defining
peepgen_label_t by appropriately declaring
DEF such that it concatenates
PEEP_PREFIX to every label. Similarly, it is included while defining the array
peep_label_str such that each label has a corresponding string.
peepgen_entries.h: This file contains the peephole table entries generated from
peep.tab. The structure of a peephole entry (
struct peep_entry_t) can be found in
peep/peeptab.h. Here is a description of the fields inside a peephole table entry:
n_tmpl: The number of guest instructions (templates) to match
tmpl: The actual guest instructions that should be matched. The format of each instruction can be found in
label: The label identifying this peephole translation rule
n_temporaries: The number of temporaries used by this rule
temporaries: The constraints on these temporaries as specified by
tag_t. The temporaries are labeled
tr0, tr1, tr2, .... Each entry in this array specifies the type of the corresponding temporary. For example, if
tag_eax, this means that only the
eax register can be used as a temporary for
tr0. Apart from the tags specified in
peep.tab (as explained earlier), the other default tags are
tag_var (represents a register or constant placeholder with no constraints) and
tag_const (represents a register or constant value, not a placeholder).
cpu_constraints: This specifies the CPU constraints (e.g., 16-bit or 32-bit) under which this peephole translation rule can be used.
nomatch_pairs: Ignore this for now.
peepgen_gencode.h: The peephole rules use placeholders for registers and constants. For example, the identifier %vr0d in
mov %vr0d, %cr3 can be replaced with any of the eight x86 registers (
ecx, ...). Depending on which register it is, the output code needs to be appropriately patched. The output code for each peephole rule is stored in
mon/out.o. The patching code is generated in
peep/callouts.c. References to the callouts can be found in
peep/peep.tab. The macros
CALLOUT2, ... are used to make calls to functions with
2, ... arguments respectively. Register contents (e.g.,
%eax, etc.), register identifiers (e.g.,
$tr0d, etc.) or constants can be passed as arguments to the callout functions. Before entry to the callout function, all CPU state is saved to the
vcpu struct. The code in the callout function can manipulate
vcpu state. On return from a callout, the
vcpu state is loaded back into hardware registers before executing the next instruction in the translation cache.
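The double use of DEF() described earlier for peepgen_defs.h is an instance of the "X-macro" idiom. The following is a minimal sketch of that idiom, not a copy of peep/peep.c: the list macro PEEP_LABELS stands in for the generated header so that the example fits in one file, the two labels are made up, and the exact way the real code builds the prefixed name is an assumption.

```c
#include <assert.h>
#include <string.h>

/* Stand-in for the generated peepgen_defs.h: one DEF() per rule label.
 * Both labels are invented for this sketch. */
#define PEEP_LABELS \
    DEF(cli_12)     \
    DEF(call_ind_40)

/* First expansion: build the enum, prefixing every label. */
#define DEF(x) PEEP_PREFIX_##x,
typedef enum { PEEP_LABELS PEEP_NUM_LABELS } peepgen_label_t;
#undef DEF

/* Second expansion: build the parallel table of label strings. */
#define DEF(x) #x,
static const char *const peep_label_str[] = { PEEP_LABELS };
#undef DEF

/* Map a label to its printable name. */
static const char *label_name(peepgen_label_t l)
{
    assert(l < PEEP_NUM_LABELS);
    return peep_label_str[l];
}

/* Reverse lookup: returns PEEP_NUM_LABELS if the name is unknown. */
static peepgen_label_t label_from_name(const char *name)
{
    for (int i = 0; i < PEEP_NUM_LABELS; i++)
        if (strcmp(peep_label_str[i], name) == 0)
            return (peepgen_label_t)i;
    return PEEP_NUM_LABELS;
}
```

Because both the enum and the string table are expanded from the same list, they cannot drift apart when a rule is added to or removed from peep.tab.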
The monitor occupies physical pages from LOADER_MONITOR_BASE to
LOADER_MONITOR_END. These constants are defined in
sys/loader.h. LOADER_MONITOR_BASE is at 4MB. The monitor
memory consists of two parts:
VMM is the part of the monitor that lives in the guest's
virtual address space starting at
VMX is the part of the monitor that is used for other data
structures of the monitor which do not need to be mapped in the guest
address space. Both VMM and
VMX are sized
at 4MB each. The
VMX memory is not currently used.
Here is the physical address space layout for the monitor:
    PHYS_MAX +-----------------------------------+
             |                                   |
             |               Guest               |
             |                                   |
        12MB +-----------------------------------+
             |                                   |
             |                VMX                |
             |                                   |
         8MB +-----------------------------------+
             |                                   |
             |                VMM                |
             |                                   |
         4MB +-----------------------------------+
             |                                   |
             |               Guest               |
             |                                   |
           0 +-----------------------------------+

The monitor steals 8MB of memory from the guest's physical address space. It maintains a swap space of 8MB on disk to store the displaced guest pages. Each time the guest accesses its physical memory located in the displaced region, it causes a page fault (#PF) and the monitor serves the page from the disk. Hence, before transferring control to the guest, the shadow page tables need to be appropriately configured so that the guest faults on an access to pages in the displaced region.
The code for swapping pages to and
from disk for this displaced region can be found
Standard page replacement algorithms
are used to ensure that only a small number of page faults occur
by maintaining the hot displaced pages in the
swap cache (backed by physical memory). The swap cache lives in
the monitor address space. The hot pages are maintained
in the cache and the shadow page table
points to them.
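As a rough illustration of the displaced region, the sketch below classifies a guest physical address and maps it to a slot in the 8MB swap space. The value of LOADER_MONITOR_END is derived here from the sizes stated in the text (4MB base plus 4MB each for VMM and VMX) rather than copied from sys/loader.h, and the linear slot layout and the helper names are assumptions for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MB                  (1024u * 1024u)
#define PGSIZE              4096u
#define LOADER_MONITOR_BASE (4u * MB)   /* stated in the text */
#define LOADER_MONITOR_END  (12u * MB)  /* derived: base + VMM (4MB) + VMX (4MB) */

/* True if the guest physical address falls in the range the monitor occupies,
 * so the guest's own page must be served from swap. */
static bool is_displaced(uint32_t guest_paddr)
{
    return guest_paddr >= LOADER_MONITOR_BASE && guest_paddr < LOADER_MONITOR_END;
}

/* Hypothetical layout: slot i of the swap space backs the guest page at
 * LOADER_MONITOR_BASE + i * PGSIZE. */
static uint32_t swap_slot(uint32_t guest_paddr)
{
    assert(is_displaced(guest_paddr));
    return (guest_paddr - LOADER_MONITOR_BASE) / PGSIZE;
}
```

Under these assumptions the swap space holds 2048 slots of 4KB each, which matches the 8MB figure given above.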
The VMM needs to be mapped into the guest's virtual address space because the translated code of the guest is stored in VMM memory and can access the guest's data. To do this, we always keep the VMM mapped to the top 4MB of the guest's virtual address space. Here is a layout of the virtual address space at any time:
    0xffffffff +-----------------------------------+
               |                                   |
               |                VMM                |
               |                                   |
    0xffc00000 +-----------------------------------+
               |                                   |
               |                                   |
               |               Guest               |
               |                                   |
               |                                   |
             0 +-----------------------------------+

The VMM contains the translation cache, the code to perform the translation, and other VMM-specific data structures. The guest code in the guest's address space is translated and stored in the translation cache. Only the translated code is executed (the original guest code is never executed directly).
VMM memory needs to be protected from guest accesses. Because it
is impossible to know the memory addresses accessed by the guest
at translation time, the only way to ensure protection is through
runtime checks. We use segmentation hardware to ensure that
the guest can never read/write VMM memory. All guest
segments are truncated at
0xffc00000 so that any
access to a virtual address above this value causes a general-protection
fault (#GP). The only exceptions to this are the %cs and
%gs segments, which are not truncated. The
%cs segment is needed to execute
translated code which lives above
0xffc00000 and hence
cannot be truncated. Similarly, %gs
is deliberately reserved to be able to access VMM data structures
(e.g., the vcpu struct) from the translated code. This
works because most guests rarely use segmentation, and even if they
do, they rarely use the
registers. If any guest
instruction uses these registers, we binary translate
that instruction appropriately so that the guest is still unable
to access VMM memory (see uses of
peep/peep.c). This mechanism is similar to that
used by VMware's binary translator.
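The effect of truncating guest segments can be modelled in software as a bounds check on every guest access. The sketch below is only a model of what the segmentation hardware enforces for a flat segment truncated at 0xffc00000; the monitor does not run such a check itself, and the names VMM_BASE and guest_access_ok are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VMM_BASE 0xffc00000u  /* start of the top 4MB reserved for the VMM */

/* Returns true if an access of `size` bytes at `addr` lies entirely below
 * VMM_BASE, i.e. the access would pass the limit check of a truncated
 * flat segment. An access that fails would raise #GP on real hardware. */
static bool guest_access_ok(uint32_t addr, uint32_t size)
{
    uint64_t last;

    assert(size > 0);
    last = (uint64_t)addr + size - 1;  /* 64-bit math avoids wrap-around */
    return last < VMM_BASE;
}
```

For example, a 4-byte access at 0xffbffffc passes, while a 4-byte access at 0xffbffffd would touch the first byte of VMM memory and fail.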
Each time the guest loads a page table (for its process, for
example) through the
mov %eax, %cr3 instruction, the
monitor translates it into a call to
callout_mov_to_cr3(). This function checks to see
if the new page table is different from the current one. If so, it
creates a shadow page table for the newly loaded page table
through a call to the
The monitor maintains a page table called
phys_map. This page
table is used to access guest physical memory. The code to initialize
phys_map_init() function is helpful
to understand the need for this table. The
contains an identity mapping (using large 4MB pages) for all physical memory, except for
For pages in this range, regular 4KB pages are used, and an empty page table
is initialized. If the pages in this range are accessed, they are appropriately
swapped-in in 4KB chunks. The
swap_get_page() function is
used for swapping in a page corresponding to a physical address.
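To make the shape of this table concrete, here is a minimal sketch of how a page directory with these properties could be filled in. It is not the code of phys_map_init(): the function name build_phys_map, the displaced range of 4MB to 12MB (taken from the physical layout described earlier), and the caller-supplied array of empty page-table addresses are assumptions. The flag bits are the standard 32-bit x86 page-directory bits (present, writable, 4MB page size).

```c
#include <assert.h>
#include <stdint.h>

#define LARGE_PAGE   (4u * 1024u * 1024u)
#define NUM_PDES     1024u
#define PDE_P        0x001u  /* present */
#define PDE_W        0x002u  /* writable */
#define PDE_PS       0x080u  /* 4MB page */
#define MONITOR_BASE (4u * 1024u * 1024u)   /* assumed start of displaced range */
#define MONITOR_END  (12u * 1024u * 1024u)  /* assumed end of displaced range */

/* pd:             the page directory to fill (1024 entries).
 * phys_max:       amount of physical memory, a multiple of 4MB, below 4GB.
 * empty_pt_paddr: physical addresses of zeroed page tables, one per 4MB
 *                 chunk of the displaced range (hypothetical parameter). */
static void build_phys_map(uint32_t pd[NUM_PDES], uint32_t phys_max,
                           const uint32_t *empty_pt_paddr)
{
    unsigned next_pt = 0;

    assert(phys_max % LARGE_PAGE == 0);
    for (uint32_t i = 0; i < NUM_PDES; i++) {
        uint32_t base = i * LARGE_PAGE;
        if (base >= phys_max) {
            pd[i] = 0;  /* no physical memory here */
        } else if (base >= MONITOR_BASE && base < MONITOR_END) {
            /* Displaced range: point at an empty 4KB-granularity page table,
             * so each access faults until the page is swapped in. */
            pd[i] = empty_pt_paddr[next_pt++] | PDE_P | PDE_W;
        } else {
            /* Everything else: identity-map with one 4MB page. */
            pd[i] = base | PDE_P | PDE_W | PDE_PS;
        }
    }
}
```

With 64MB of physical memory, entries 0 and 3 through 15 become 4MB identity mappings, entries 1 and 2 point at the two empty page tables, and the remaining entries stay not-present.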
The shadow_pagedir_sync() function first switches to
phys_map (because all accesses in this routine will use guest physical
addresses). It then calls
swap_load_shadow_page_dirs() to load
shadow page table, and then switches to it using
function. The swap pages and pages containing the shadow page table are allocated from the
POOL_SWAP memory pool. The shadow page tables are also
cached in memory (to avoid reconstruction on every page table switch), just like
other swap pages. This logic is implemented in
out) directly access hardware I/O ports. This considerably simplifies the design of our monitor and lets us concentrate on the binary translation and memory virtualization aspects.
There are three exceptions to our device model, namely keyboard, disk, and serial port. To be able to control our monitor, we emulate the keyboard. Most keystrokes are passed through directly to hardware. Certain combinations of keystrokes are intended to be caught by the monitor. This is not currently implemented.
Similarly, the monitor
itself lives on the disk and hence the disk cannot be made fully visible to
the guest. For example, the monitor occupies the boot sector of the disk. If
the guest tries to read its boot sector, it should see its own boot sector
contents and not the monitor's. Hence, we emulate an IDE disk (
and perform appropriate translation for the guest's disk accesses before accessing
the physical disk.
We use the second serial port (at ioport address
0x23) to log monitor's activities for debugging.
We assume that the guest will not use this serial port.
printf() function logs to the second serial port. The contents of this serial port can be obtained by redirecting the serial port output to a file. For example, the following command redirects the first serial port (possibly controlled by the guest) to
file1 and the second serial port (controlled by the monitor) to file2:

    qemu -hda os.dsk -serial file:file1 -serial file:file2
tapas). The Pintos guest will be virtualized using Monee. The machine
tapas supports HP iLO (Integrated Lights Out), which allows remote access to and control of
tapas in an OS-independent way. Some examples of the capabilities of the iLO interface are:
tapas-ilo. We use
tapas-ilo to remotely attach a virtual disk device to
tapas and observe its output using the virtual serial port. By imaging our desired OS image in the virtual disk device, we can boot tapas into our desired OS image remotely in a scriptable manner.
monee/ subdirectory, edit the fourth line of
configure file to set
cd monee && ./configure
make clean && make && make test
    bubsort. [11082 ticks (11082 k, 0 u), 27961998677 tsc]
    cat tests/threads/bubsort.default.stats
    bubsort.default.m [16870 ticks (16874 k, 0 u), 42601778662 tsc]
    [124 m, 17088 g, 432348832214 tsc]
    [tb: 0: 1125, 1125, 1125: 141/1204]
    [swap: pd-s: 0: 2, 2, 2]
    [swap: pd-u: 0: 2, 2, 2]
    [swap: pt-s: 0: 3, 3, 3]
    [swap: pt-u: 0: 0, 0, 0]
    [swap: pg: 0: 0, 0, 0]
    [page-faults: 263: 0 t, 0 m, 263 s, 0 p]
    [callouts: 51756 all, 16656 forced]

The first line (
bubsort.) represents the performance of the unvirtualized Pintos guest running the bubblesort algorithm. The total number of timer ticks is 11082, all of which occur while the CPU is in kernel mode (because the bubblesort program runs inside the kernel in this test). The number of CPU cycles (obtained using
rdtsc) is 27961998677.
The next set of results (labeled
bubsort.default.m) is for the
bubblesort guest running virtualized in Monee. When virtualized, the bubblesort
kernel takes 16870 ticks, all of which are again in kernel mode. The number of
CPU cycles is also around 1.5x that of the unvirtualized kernel.
The virtualized stats also give other information. You can ignore the second line titled 'm', 'g', and 'tsc'. The third line gives statistics on the translation cache. For this test, there were 0 replacements in the translation cache (the translation cache was big enough to hold all translation blocks). The next three numbers give the average, minimum, and maximum sizes respectively of the translation cache (in terms of the number of translation blocks) over the course of this execution.
The next five lines give statistics on the swap cache (used to implement shadow page
tables). The statistics are divided into different types of swap pages: pd-s
stands for page-directory pages with supervisor privileges (kernel) and pd-u
stands for page-directory pages with user privileges (user). Similarly, pt-s and
pt-u stand for page-table (L2) pages with supervisor and user
privileges respectively. The last swap cache statistics are on regular
pages (pg). In all these five cases, the first number (0
in this example) gives the number of replacements that happened. The next three
numbers represent the average, minimum, and maximum sizes of the swap cache
respectively during the execution.
The next line on
page-faults gives statistics on the number of
page faults that occurred. The first number (263) is the total number of page faults.
The second number labeled '
t' is the number of true page faults.
The third number labeled '
m' is the number of page faults due
to memory traces. The fourth number labeled '
s' is the number of
hidden page faults due to accesses to removed entries in the shadow page table.
The fifth and last number labeled 'p' is the number of hidden page faults due to
accesses to removed entries in
The final statistics are on the number of callouts executed during this execution. This example executed 51756 callouts in all. Forced callouts are needed to handle exceptions and interrupts, and you can ignore this number for now.
The code of the bubblesort program and other benchmarks used in this
homework can be found in
Notice that the overhead of virtualization for
bubsort is 1.5x. This sounds too high for a compute-intensive application.
The reason is that these results are obtained on QEMU which is itself an emulator.
On QEMU, the overhead of extra executed instructions due to binary translation
gets magnified. On real hardware, these extra executed instructions are relatively
cheaper and hence you will find the overhead to be less than 10%.
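The 1.5x figure can be checked directly from the two stats lines shown earlier. The helper below is a small illustration (the name slowdown is made up); it simply divides the virtualized measurement by the native one.

```c
#include <assert.h>
#include <stdint.h>

/* Ratio of a virtualized measurement to the corresponding native one.
 * A value of 1.0 means no overhead. */
static double slowdown(uint64_t virtualized, uint64_t native)
{
    assert(native != 0);
    return (double)virtualized / (double)native;
}
```

Using the numbers from the bubsort run, slowdown(16870, 11082) for timer ticks and slowdown(42601778662, 27961998677) for CPU cycles both come out at roughly 1.52.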
simvariable in the
tapas-ilo (10.20.3.27). Ask the TAs for an account on the webserver. Set your environment variables
PASSWORD to the username and password of your account on the webserver. The webserver is used to host your disk image which can be used to boot
tapas and repeat all the steps you performed for running QEMU. Refer to the
README file in the top-level directory for troubleshooting.
systems.cse.iitd.ernet.in and try the following commands:

To see the status of your jobs:
    bash$ qstat -f -r
To see the status of all users' jobs:
    bash$ qstat -f -r -u \* | less
To submit a job manually (please do not submit more than one job at a time):
    bash$ qsub -b y
emptyloop). Report and briefly explain (one-line each) the performance of unvirtualized and virtualized executions on Tapas on these benchmarks. (There is no need to report performance numbers on QEMU and Bochs. You should just use these platforms for debugging, etc.).
bubsort by modifying the translation of the
call instruction. Notice that in modifying the translation you need to be careful to ensure that you do not overwrite the hardware flags. Hint: The
lea instruction might be useful.
compute.c that triggers tb replacements. Study the effect of the number of tb replacements on the performance of the system.
compute.c that triggers swap cache replacements. Study the effect of the number of swap cache replacements on the performance of the system.
callout using the following strategy:
peep.tab which performs a callout for any instruction (e.g.,
nop) that would otherwise have executed natively.
binutils-2.19.tar.bz2 not found. Download here and copy to