|[ PLDI 2020, SAT 2018, APLAS 2017, HVC 2017 ]||
|[ ASPLOS 2019 ]||
HawkEye: Efficient Fine-grained OS Support for Huge Pages [ slides, lightning talk ]
This paper discusses simple yet effective algorithms for huge-page management in the OS kernel. The primary ideas are: access-coverage based huge-page promotion, asynchronous page pre-zeroing, exploring the notion of fairness in dealing with huge-page allocations, and the effect of using hardware performance counters to estimate the benefits of huge-page allocations. The ideas are evaluated over a wide range of workloads for several metrics.
|[ TPDS 2017 ]||
The Unicorn Runtime: Efficient distributed shared memory programming for hybrid CPU-GPU clusters
How can a programmer harness the combined computing power of CPUs and GPUs in a computing cluster without dealing with all the nitty gritties of the underlying system and its optimizations? The paper presents an intuitive programming model (which is perhaps not entirely new). The contribution of the paper is in algorithms to automatically map a program written in this programming model on a cluster of CPUs and GPUs through compile-time and run-time support in a system called Unicorn.
|[ SOSP 2013 ]||
Fast Dynamic Binary Translation for the Kernel [ talk video ]
Dynamic binary translation (DBT) is a technique with a wide variety of applications; innovative ways to employ DBT appear almost regularly in the research literature. This paper explains how to perform DBT for an OS kernel efficiently even in the presence of high rates of interrupt and exception activity. The capability of turning DBT on and off during system execution is quite cool, and did not exist in any previous DBT system. The orchestration of translated instructions in the processor's instruction cache to improve the performance of the system (i.e., the translated system can be faster than the native system) is quite interesting. We use our DBT implementation to monitor the sharing behaviour of the Linux kernel.
This was the first SOSP paper ever from India.
|[ ASPLOS 2013 ]||
Efficient Virtualization on Embedded Power Architecture Platforms
I quote one of our ASPLOS 2013 reviewers:
---begin quote--- It is almost like asking, how would the universe differ if the fine
structure constant had a value of 0.00829 instead of 0.00729.
Answer: a lot!
For this reason, I think the paper is a very good fit for ASPLOS. Even
if architects do their best to think carefully about ramifications of
their choices, this paper is an eye opener into a world of, probably,
unforeseen consequences. As such, like it very much. And I think it should
be a must-read for current and future architects (and not just narrowly
from a virtualization point of view).
It is almost like asking, how would the universe differ if the fine structure constant had a value of 0.00829 instead of 0.00729. Answer: a lot!
For this reason, I think the paper is a very good fit for ASPLOS. Even if architects do their best to think carefully about ramifications of their choices, this paper is an eye opener into a world of, probably, unforeseen consequences. As such, like it very much. And I think it should be a must-read for current and future architects (and not just narrowly from a virtualization point of view).
Thank you dear (anonymous) reviewer for liking our work!
|[ OSDI 2008 ]||
Binary Translation Using Peephole Superoptimizers
Can we automatically infer translations from one architecture to another using synthesis and superoptimization techniques? This paper demonstrates such capability for PowerPC to x86 translation. The performance of the translated code is better than the widely-used commercial Rosetta translator (shipped with Apple's OS X) at that time. The paper provides techniques to dynamically map the state of one machine to the state of another machine in tandem with synthesis-based code generation, for better overall quality of the generated code.
|[ ASPLOS 2006 ]||
Automatic Generation of Peephole Superoptimizers
How to automatically generate a peephole optimizer for x86 using brute-force superoptimization? This paper has received significant attention since it was published. Our current research is targetted towards generalizing and scaling the ideas presented in this paper.
|[ FAST 2004 ]||
CAR: Clock with Adaptive Replacement
This paper extends the well-known CLOCK algorithm for cache replacement (used in most operating systems) with ideas presented in previous work on Adaptive Replacement Cache (ARC). As mentioned on ARC's wikipedia page, variants of this approach have been implemented and deployed in several systems. The primary advantage of CAR (CLOCK with Adaptive Replacement) over ARC is that it gets rid of the LRU lock, while retaining all the benefits of ARC.