Virtual
Osaigbovo Timothy
School of ICT, Federal University of Technology Minna.
[email protected]
Abstract
Virtual machines provide an abstraction of the underlying physical system to the guest operating system running on it. Based on the level of abstraction the VMM provides and on whether the guest and host systems use the same ISA, we can classify virtual machines into many different types. For system virtual machines, there are two major development approaches: full system virtualization and paravirtualization. Because virtual machines provide desirable features such as software flexibility, better protection and hardware independence, they are applied in various research areas and have great potential.
General Terms
New Machine Level, Architecture, Portability.
Keywords
Virtualization
1. Introduction
Standard computer systems are hierarchically constructed from three components: bare hardware, operating system, and application software. To improve software compatibility, a standard Instruction Set Architecture (ISA) was proposed to precisely define the interface between hardware and software. In other words, the ISA is the part of the processor that is visible to the programmer or compiler writer. It includes both user and system instructions. User instructions are the set accessible to both the operating system and application programs, while system instructions are special privileged instructions for managing and protecting shared hardware resources, e.g., the processor, memory and the I/O system. Application programs can access these resources only through system calls. This standard architecture has many advantages. Since the interfaces are well defined, application developers can skip hardware details such as I/O and memory allocation, and the hardware and software designs can be decoupled. Within the same ISA, software can be reused across different hardware configurations and even across generations. But this architecture also has its disadvantages in flexibility, protection, and performance.
A virtual machine monitor sits between the bare system hardware and operating systems. Usually the underlying platform comprising the virtual machine monitor and the bare machine, which provides the virtual machine environment, is called the host, and the operating system and the applications running on it are called guests. This is just one of many possible virtual machine models, and we will return to it later in this paper. The abstract interfaces that VMMs provide can also be of different types. Some virtual machine monitors perform whole-system virtualization, which means the guest operating system does not need any changes to run on the virtualized system hardware, while other VMMs do not do full system virtualization, and some code of the guest operating system must be changed to fit the abstract interface. This type of virtual machine mechanism is called paravirtualization.
Since the architecture of third-generation computers cannot be virtualized directly, virtualization has to be done by software maneuvers, which is very difficult. Some researchers therefore proposed an approach to address this problem: virtualizable architectures, which directly support virtual machines, including Goldberg's Hardware Virtualizer. The basic idea is to do away with the software trap-and-simulate mechanism, which makes the VMM smaller and simpler, and the machine more efficient. This sounds like a great idea, but it does not seem to be the main trend for virtual machines. Currently there are still no virtualizable architectures, and implementing a virtual machine still takes a lot of effort.
High Level Virtual Machines
The last type of process-level virtual machine is the most commonly recognized one, partly due to the popularity of Java. The purpose of the previous three virtual machine types, except for the dynamic optimizer, is to improve cross-platform portability. But their approaches require great effort for every ISA, so a better way is to move the virtualization to a higher level: bring a process-level virtual machine into the high-level language design. Two good examples of this type of virtual machine are Pascal and Java. In a conventional system, HLL programs are compiled to abstract intermediate code, which a code generator then turns into object code for a specific ISA/OS. But in Pascal/Java, the code to be distributed is not the object code but the intermediate code: P-code for Pascal and bytecode for Java. On every ISA/OS, there is a virtual machine to interpret the intermediate code into platform-specific host instructions. So this type of process virtual machine provides maximal platform independence.
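The interpretation step described above can be sketched with a toy stack-based interpreter. The opcode names (PUSH, ADD, MUL, PRINT) are invented for illustration; real P-code and Java bytecode instruction sets are far richer, but the principle is the same: the distributed program is portable intermediate code, and only the interpreter is platform-specific.

```python
# A toy stack-based interpreter in the spirit of P-code/bytecode
# execution. Opcodes here are hypothetical, not real Java bytecode.

def interpret(program):
    """Execute a list of (opcode, operand) pairs on a simple stack machine."""
    stack = []
    output = []
    for op, arg in program:
        if op == "PUSH":
            stack.append(arg)
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "PRINT":
            output.append(stack.pop())
        else:
            raise ValueError("unknown opcode: %s" % op)
    return output

# The same "distributed" intermediate code runs unchanged on any host
# that has an interpreter: compute (2 + 3) * 4 and print the result.
program = [("PUSH", 2), ("PUSH", 3), ("ADD", None),
           ("PUSH", 4), ("MUL", None), ("PRINT", None)]
print(interpret(program))  # [20]
```

The point of the sketch is the division of labor: the program above never changes across platforms, while each ISA/OS ships its own `interpret`.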
In hosted virtual machines, the ISA of the guest OS is the same as that of the underlying hardware.
Whole-System Virtual Machines
Sometimes we need to run operating systems and applications on a different ISA. In these cases, because of the different ISA, complete emulation and translation of the whole OS and application are required, so these are called whole-system virtual machines. Usually the VMM sits on top of a host OS running on the underlying ISA hardware.
Co-designed Virtual Machines
The above three system virtual machines are all built on a well-developed ISA. Co-designed virtual machines focus on improving performance or efficiency for non-standard ISAs. There are no native ISA implementations, so no native execution is possible. Usually the VMM uses a binary translator to convert guest instructions into native ISA instructions. The VMM works like part of the hardware implementation, providing the guest operating system and applications a VM platform just like a native hardware platform. The native ISA is totally concealed from the guest OS and software.
machine, you are ready to run all commodity x86 operating systems and applications. However, because of the strict requirement of a complete mirror of the underlying physical system, full virtualization usually pays a performance penalty. Here I will first discuss the details of the implementation of VMware [6], a representative example of the full virtualization approach. There are two types of full virtualization in VMware's solutions [6]: the hosted architecture and the hypervisor architecture. Both are for the IA-32 architecture and support running unmodified commodity operating systems, such as Windows 2000, Windows XP and Red Hat Linux. VMware Workstation [2] uses the hosted approach, in which the VM and the guest OS are installed and run on top of a standard commodity operating system. It uses the host OS to support a broad range of hardware devices. The hypervisor architecture, in contrast, installs a layer of software, called the hypervisor, directly on top of the bare hardware, and the guest OS runs on top of the hypervisor. VMware ESX Server [7][8] is the representative of the hypervisor architecture. Next I will explain the techniques used in full virtualization by comparing and contrasting the above two VMware products.
However, the downside of this hosted approach is also obvious. Because the host OS has full control of the hardware, the VMM cannot perform full-fledged scheduling even though it has full system and hardware privileges. For example, the VM cannot guarantee a certain CPU share because the VMM itself is scheduled by the host OS. Secondly, to achieve acceptable performance, the guest OS needs to run on the physical hardware directly as much as possible, so a context switch between the guest OS/VM and the host OS is even more expensive than a process switch. I/O performance then becomes a big issue, because I/O operations in the guest OS have to be forwarded to the device drivers in the host OS, and context switches are inevitable there. Figure 3 [2] illustrates the structure of a guest OS in a virtual machine in the hosted architecture of VMware Workstation. The installation process of VMware Workstation in the host OS is the same as installing a normal application. When it runs, the application portion (VMApp) uses a driver (VMDriver) loaded into the host OS to create a privileged virtual machine monitor component (VMM), which runs directly on the physical hardware. The physical processor now switches between two worlds: the host OS world and the VMM world. The guest OS and the applications on it all run in user mode. Execution is a combination of direct execution and binary translation. Most non-privileged instructions run on the physical hardware directly. Privileged instructions are translated at run time into another sequence of instructions that traps into the VMM, which emulates the effect of executing the privileged instruction. When the guest OS performs an I/O operation, it is intercepted by the VMM.
Instead of accessing the physical hardware directly, the VMM switches to the host OS world and calls the VMApp to perform the I/O operation on behalf of the VM via a normal system call. The VMM may also yield control to the host OS when necessary, so that the host OS can handle the interrupt sent back from the hardware when the I/O operation finishes. Only the host OS deals with the hardware via the normal device drivers. The VMApp bridges requests and replies back and forth between the VM and the host OS.
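The mix of direct execution and run-time translation described above can be sketched in a few lines. The instruction mnemonics, the set of "privileged" operations, and the virtual-CPU record below are all invented for illustration; a real VMM translator operates on x86 machine code, not strings.

```python
# Sketch of trap-and-emulate via run-time binary translation.
# Mnemonics and the PRIVILEGED set are hypothetical.

PRIVILEGED = {"cli", "sti", "out"}  # stand-ins for privileged instructions

def emulate(insn, vcpu):
    """Emulate a privileged instruction against virtual CPU state,
    the way the VMM does after the translated code traps into it."""
    if insn == "cli":
        vcpu["interrupts_enabled"] = False
    elif insn == "sti":
        vcpu["interrupts_enabled"] = True
    elif insn == "out":
        # I/O never touches hardware here: forwarded to the host world.
        vcpu["io_log"].append("forwarded to VMApp via host OS")

def translate_and_run(code, vcpu):
    """Non-privileged instructions take the direct-execution path;
    privileged ones are replaced by a call into the VMM."""
    for insn in code:
        if insn in PRIVILEGED:
            emulate(insn, vcpu)            # trap into the VMM
        else:
            vcpu["executed"].append(insn)  # runs directly on hardware

vcpu = {"interrupts_enabled": True, "io_log": [], "executed": []}
translate_and_run(["mov", "add", "cli", "out", "sti"], vcpu)
print(vcpu["executed"])  # ['mov', 'add']
print(vcpu["io_log"])    # ['forwarded to VMApp via host OS']
```

The design point the sketch captures is that the common case (ordinary instructions) pays nothing, while only the rare privileged instruction pays the emulation cost.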
The VMM never touches the physical I/O device. Memory virtualization is another interesting topic in full virtualization. A shadow page table must be introduced to map physical addresses to machine addresses, giving the virtual machine the illusion of a contiguous, zero-based physical memory.
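The shadow page table is, in essence, the composition of two mappings: the guest's own page table (virtual to guest-"physical") and a VMM-maintained map from guest-physical to machine pages. A minimal sketch, with arbitrary illustrative page numbers:

```python
# Sketch of shadow page table construction. The shadow table composes
# the guest's virtual->physical mapping with the VMM's
# physical->machine mapping, so the hardware can translate virtual
# addresses to machine addresses directly. Page numbers are invented.

guest_page_table = {0: 7, 1: 3, 2: 9}   # virtual page  -> physical page
phys_to_machine  = {7: 42, 3: 13, 9: 8} # physical page -> machine page

def build_shadow(guest_pt, p2m):
    """Compose the two mappings into virtual -> machine entries."""
    return {vpn: p2m[ppn] for vpn, ppn in guest_pt.items()}

shadow = build_shadow(guest_page_table, phys_to_machine)
print(shadow)  # {0: 42, 1: 13, 2: 8}

# When the guest OS rewrites a page-table entry (an operation the VMM
# traps), the corresponding shadow entry must be refreshed to match.
guest_page_table[1] = 9
shadow[1] = phys_to_machine[guest_page_table[1]]
print(shadow[1])  # 8
```

The final trap-and-refresh step is the maintenance cost mentioned below: every guest page-table update forces the VMM to resynchronize the shadow entry.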
To support this, VMware ESX Server maintains a pmap data structure for each VM to map physical addresses to machine addresses. Instead of the guest OS's normal page table, the processor uses a shadow page table which contains the translation from virtual addresses directly to machine addresses. All guest OS instructions that manipulate the page tables and TLB are trapped by dynamic binary translation, and ESX Server is responsible for updating both the pmap and the shadow page table to keep them synchronized. The biggest advantage of the shadow page table is that normal memory accesses in the guest OS can then execute directly on the native processor, because the virtual-to-machine address translation is in the TLB. However, there is a performance penalty for maintaining the correctness of the shadow page table.
Ballooning for memory reclamation under overcommitment. Overcommitment is considered one of the important advantages of using virtual machines: the total size of memory configured for all virtual machines exceeds the total size of the physical memory. Memory pages may shift among the VMs based on configuration and workload. This gives more efficient use of the limited memory resource, because most of the time the different guest OSes have different levels of demand for memory, and overall performance improves when more memory is allocated to the guest OS with the higher demand. The problem, however, is how to find pages to reclaim. ESX Server lets the guest OS make this choice, based on the fact that the best information about the least valuable pages is known only to the guest OS. A ballooning device driver is installed in each guest OS. When ESX Server needs to squeeze memory out of a guest OS, it asks the ballooning device driver to inflate, i.e., to request more memory from the guest OS.
Based on its own replacement algorithm, the guest OS will page out its least valuable pages to the virtual disk, and the pages obtained by the ballooning device driver will be passed to ESX Server, which will then update the shadow page table to move them to another guest OS with higher memory demand.
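The inflate-and-reclaim flow can be sketched as follows. The class, page counts, and `rebalance` helper are all simplified inventions; a real balloon driver pins guest physical pages and hands their page numbers to the hypervisor rather than moving integer counters around.

```python
# Simplified sketch of balloon-driver memory reclamation.
# All numbers and structures are illustrative, not ESX internals.

class GuestOS:
    def __init__(self, name, free_pages):
        self.name = name
        self.free_pages = free_pages  # pages usable by applications
        self.balloon = 0              # pages pinned by the balloon driver

    def inflate(self, n):
        """The balloon driver requests n pages; the guest's own
        replacement policy pages out its least valuable pages to
        satisfy the request. The pinned pages go to the VMM."""
        self.free_pages -= n
        self.balloon += n
        return n

def rebalance(donor, recipient, n):
    """The VMM inflates the donor's balloon, takes the reclaimed
    pages, and remaps them (via the shadow page table) to the
    recipient, whose balloon deflates by the same amount."""
    reclaimed = donor.inflate(n)
    recipient.free_pages += reclaimed

idle = GuestOS("idle-guest", free_pages=100)
busy = GuestOS("busy-guest", free_pages=5)
rebalance(idle, busy, 40)
print(idle.free_pages, busy.free_pages)  # 60 45
```

Note that the replacement decision itself happens inside `inflate`, i.e., inside the guest: the VMM only chooses how many pages to move, which is exactly the division of labor the text describes.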
The ballooning device in the latter guest OS then performs a deflation operation, i.e., returns pages to its OS. That guest OS now has more free pages that can be allocated to applications. With ballooning, ESX Server avoids tracking page-usage history and coding complicated replacement algorithms; the decision is made at the best place for making it.
Content-based transparent memory sharing. The shadow page table makes it very easy to share a page among different VMs: multiple virtual page numbers can be mapped to a single machine page number. This reduces the overall memory footprint and, in some cases, lowers overhead by eliminating copies. ESX Server uses a content-based transparent memory sharing technique. Transparent means the VMs do not know the pages are shared; shared pages look the same as privately owned pages. Disco [5] can discover shared pages at page-creation time, but that requires a change to the guest OS, which is unacceptable for ESX Server. Content-based page sharing means the VMM will share all pages having the same content. Obviously, comparing every page with every other page has a complexity of O(n^2) page comparisons, so ESX Server uses hash functions to reduce the number of full page comparisons. A hash function is first applied to every read-only page to summarize its content, and the resulting hash value is used as a key in a global hash table. Every time a new read-only page is requested, the hash value of this page is computed and looked up in the global hash table. If the key is already in the hash table, a full page comparison is performed between the new page and the page set keyed by that hash value. If there is a match, the shared page has been found: the shadow page table is updated to map the shared page, and the shared page is marked COW (copy-on-write).
If a match is not found, the new page is added to the page set in the global hash table. Even though this content-based memory sharing is more expensive to maintain, it can find identical pages which cannot be detected by the traditional approach and thus potentially saves more memory space and data-copy operations.
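The hash-then-compare flow above can be sketched in a few lines. The choice of SHA-1, the single-page (rather than page-set) hash buckets, and the byte-string "pages" are simplifications for illustration, not ESX Server's actual data structures.

```python
import hashlib

# Sketch of content-based page sharing: hash each candidate page; on a
# hash hit, do a full byte comparison before sharing, and mark the
# shared page copy-on-write. Page contents here are illustrative.

hash_table = {}    # hash digest -> contents of the shared machine page
cow_pages = set()  # digests of pages marked copy-on-write

def share_page(page_bytes):
    """Return (machine_page_contents, was_shared)."""
    key = hashlib.sha1(page_bytes).digest()
    candidate = hash_table.get(key)
    if candidate is not None and candidate == page_bytes:  # full compare
        cow_pages.add(key)         # shared page becomes copy-on-write
        return candidate, True     # shadow PTE maps the shared copy
    hash_table[key] = page_bytes   # first page with this content
    return page_bytes, False

_, hit1 = share_page(b"\x00" * 4096)  # first zero page: inserted
_, hit2 = share_page(b"\x00" * 4096)  # second zero page: shared
print(hit1, hit2)  # False True
```

The full comparison after a hash hit is what keeps the scheme correct under hash collisions; the hash only narrows the O(n^2) search, it never decides sharing on its own.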
Application of Virtual Machines
Nowadays computer systems are getting cheaper and cheaper, so the original purpose of virtual machines, sharing expensive computer systems, is not that important anymore. But virtual machines have unique features that continue to draw interest from systems researchers. They provide flexibility for the software, more protection between applications, hardware independence, and a good environment for system development and debugging. Peter Chen and Brian Noble [12] propose that the current operating system and software structure should be replaced by a new three-layer structure of virtual machine, operating system, and software, and argue that this structure is very useful for certain systems research, such as secure logging, intrusion prevention and detection, and environment migration. A number of research projects have explored this idea independently [13, 14, 15, 16, 17]; here we pick some of them to illustrate applications of virtual machine technologies.
damage has been caused. ReVirt runs in UMLinux, a VMM that runs a guest OS as one process in a host OS. All guest OS system calls and interrupts are mapped to various signals in the host OS. The VMM intercepts these signals, logs the corresponding events and then passes them to the guest OS. When replaying, ReVirt starts the system from a known initial state and injects the logged non-deterministic events at the right points to make the system evolve exactly as it did on the previous run. ReVirt runs at the VMM level and is almost transparent to the guest OS, except that one non-deterministic instruction in the guest OS has to be replaced with a system call. We should note that this does not mean the VMM cannot be compromised. But considering the relatively narrower VMM interface and the simpler (thus less vulnerable and easier to verify) VMM software, breaking into a VMM should be much harder than breaking into a guest OS. So the VMM provides a quite secure platform for logging. Also, because the VMM forwards all system interactions between the guest OS and the host OS or hardware, logging in the VMM is an easy and natural way to record sufficient information for a full replay.
Migration
Because a VMM is a software abstraction of the hardware, the entire state of a running environment, including the guest operating system and all applications running on it, can easily be captured, packaged into a capsule, sent over the network and resumed on a remote host. The capsule contains all the information the target host needs to resume the running processes and the entire guest OS; that usually includes the state of the virtual disks, memory, CPU registers, and I/O devices. This environment migration is at the virtual machine level, which is larger in scale than process migration. It allows a user to move between computers at home and at work without having work interrupted.
Or it may allow a system administrator to make a live patch and deploy it to an entire server fleet so that all servers start from a clean, fresh state. However, considering the gigabytes of a virtual disk and the hundreds of megabytes of memory, the capsule can be so large that constructing it and sending it over the network is very expensive. In [14], the authors explain an effective way to construct a capsule so that it can be transported over a DSL connection within a reasonable time. The following techniques are used to reduce the capsule size. After the initial capsule is transmitted, future virtual disk changes are captured incrementally: using copy-on-write, only disk updates are written into the capsule. Before capturing memory into the capsule, a balloon process can zero out most of the memory by paging most memory pages back out to disk, so only a small amount of memory goes into the capsule. Instead of waiting for the entire capsule, the target host can start early with partial information; disk pages in particular can be fetched on demand while the capsule runs. Finally, a hash code of each data block is sent first, and if the target host has a data block that produces the same hash code, the local copy is used instead of sending the data block over the wire. The result is significant: the capsule can start running on the target host after only 20 minutes of transmission over a 384 kbps link.
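The hash-before-send optimization at the end of the list can be sketched as follows. The block size, SHA-1 hash, and receiver-side cache are illustrative assumptions; the cited work's protocol details differ, but the saving comes from the same idea.

```python
import hashlib

# Sketch of the hash-before-send optimization for capsule transfer:
# the sender ships a block's hash first; if the receiver already has
# a block with that hash, the block's bytes never cross the wire.

def transfer(blocks, receiver_cache):
    """Simulate sending blocks; return (bytes_sent, received_blocks)."""
    bytes_sent = 0
    received = []
    for block in blocks:
        digest = hashlib.sha1(block).digest()
        if digest in receiver_cache:        # receiver has it locally
            received.append(receiver_cache[digest])
        else:                               # must send the full block
            bytes_sent += len(block)
            receiver_cache[digest] = block
            received.append(block)
    return bytes_sent, received

zero = b"\x00" * 512
data = b"\x01" * 512
cache = {hashlib.sha1(zero).digest(): zero}  # receiver already has zeros
sent, result = transfer([zero, data, zero], cache)
print(sent)  # 512  (only the one unique block crossed the wire)
```

This composes nicely with the balloon-zeroing step above: zeroed memory pages are all identical, so after the first one, every subsequent zero page costs only a hash over the link.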
3. Conclusions
The concept of virtual machines is not new. In the 1960s, IBM first developed virtual machines to share machine resources among users. The virtual machine has always been an interesting research topic, and recently it has drawn more attention than ever. The essential part of a virtual machine is the virtual machine monitor (VMM). It abstracts the physical resources of the underlying bare hardware and provides a fully protected and isolated replica of the physical system, transparent to the operating system running above it, i.e., the guest operating system. While the above structure describes the original virtual machines, there are many different types of virtual machines in different research areas. Based on whether the VMM provides an abstracted ISA or ABI, we can distinguish system virtual machines from process virtual machines; together with the criterion of whether the guest and host systems use the same ISA, we can classify virtual machines into different types and establish an overall taxonomy.
References
[1] R. J. Creasy. The origin of the VM/370 time-sharing system. IBM Journal of Research and Development, vol. 25, no. 5, p. 483, 1981.
[2] J. Sugerman, G. Venkitachalam, and B. Lim. Virtualizing I/O devices on VMware Workstation's hosted virtual machine monitor. In Proceedings of the USENIX Annual Technical Conference, Boston, MA, USA, Jun. 2001.
[3] R. P. Goldberg. Survey of virtual machine research. IEEE Computer, pp. 34-45, Jun. 1974.
[4] J. E. Smith and R. Nair. An overview of virtual machine architectures. Elsevier Science, pp. 5-6, 2006.
[5] E. Bugnion, S. Devine, and M. Rosenblum. Disco: Running commodity operating systems on scalable multiprocessors. In Proceedings of the 16th ACM Symposium on Operating Systems Principles, ACM Operating Systems Review 31(5), pages 143-156, Oct. 1997.
[6] VMware white paper: Virtualization Overview. http://www.vmware.com/pdf/virtualization.pdf
[7] C. A. Waldspurger. Memory resource management in VMware ESX Server. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation, Boston, MA, USA, Dec. 2002.
[8] M. R. Ferre. VMware ESX Server: scale up or scale out? http://www.redbooks.ibm.com/abstracts/redp3953.html
[9] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles, Bolton Landing, NY, USA, Oct. 2003.
[10] Xen website: http://www.cl.cam.ac.uk/research/srg/netos/xen/
[11] A. Whitaker, M. Shaw, and S. D. Gribble. Scale and performance in the Denali isolation kernel. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation, pages 195-210, Boston, MA, USA, Dec. 2002.
[12] P. M. Chen and B. D. Noble. When virtual is better than real. In Proceedings of the 8th Workshop on Hot Topics in Operating Systems, 2001.
[13] G. W. Dunlap, S. T. King, S. Cinar, M. Basrai, and P. M. Chen. ReVirt: Enabling intrusion analysis through virtual-machine logging and replay. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation, pages 211-224, Boston, MA, USA, Dec. 2002.
[14] C. P. Sapuntzakis, R. Chandra, B. Pfaff, J. Chow, M. S. Lam, and M. Rosenblum. Optimizing the migration of virtual computers. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation, pages 377-390, Boston, MA, USA, Dec. 2002.
[15] P. Levis and D. Culler. Mate: A tiny virtual machine for sensor networks. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, San Jose, CA, USA, Oct. 2002.
[16] S. T. King and P. M. Chen. Backtracking intrusions. In Proceedings of the 19th ACM Symposium on Operating Systems Principles, Bolton Landing, NY, USA, Oct. 2003.
[17] T. Garfinkel, B. Pfaff, J. Chow, M. Rosenblum, and D. Boneh. Terra: A virtual machine-based platform for trusted computing. In Proceedings of the 19th ACM Symposium on Operating Systems Principles, Bolton Landing, NY, USA, Oct. 2003.
[18] PlanetLab website: http://www.planetlab.org/Software/roadmap.php#os