Design and Implementation of a Fast Inter Domain Communication Mechanism
Amitabha Roy
[email protected]
July 6, 2006

Abstract— This research proposal addresses the problem of applications running in different VM domains that need to communicate efficiently with each other. It proposes the construction of a new device class for high performance inter domain communication while preserving isolation between virtual machines.

I. PROBLEM STATEMENT

Virtualization is an old concept [1] that has recently made a comeback as a solution to the resource utilization problem in servers. All of the current virtual machine implementations, such as Xen [2], VMware [3], Denali [4] and others, support running more than one virtual machine on the same physical node. Most virtual machine hypervisors also provide one or more methods of inter domain communication (IDC). This is useful in cases where virtual machines explicitly move data and events among each other to form a data pipeline. For example, for a web interface to a transaction processing system it is natural to use virtual machines to isolate the webserver and the database management system for fault containment, especially when both run on the same physical node. Another area where explicit IDC is becoming important is grids, where virtual machines are used to instantiate preconfigured environments [5] that are then used to run parallel algorithms built on application level communication abstractions such as MPI [6].

This research proposal focuses on Xen as the hypervisor platform, with the aim of designing and implementing a high performance IDC mechanism. Xen supports running multiple virtual machines (called domains) on a physical node and provides virtual networks, shared pages and events as mechanisms for IDC. These mechanisms have both performance and flexibility problems. The performance problems have been pointed out in a number of studies: [7] shows that MPI applications using IDC on the same physical node perform poorly in comparison to Inter Process Communication (IPC), and [8] shows that for remote domain communication the number of data copies has a significant impact on performance. The available options for IDC also tend to severely limit the scheduling of domains through middleware such as Xenoservers [9]. Using shared memory pages for high performance IDC automatically constrains VMs to be scheduled on the same physical node. On the other hand, using virtual networks restores flexibility in scheduling but results in suboptimal performance when the VMs are running on the same node.

Again, using InfiniBand based RDMA solutions such as [8] for better network performance can be difficult in heterogeneous environments, since the actual hardware configuration of the target node is not known in advance and configuring the VM image for it becomes cumbersome. An interesting dimension to this problem is that the physical proximity of virtual machines sharing data and events can change dynamically due to virtual machine migration [10]. Thus, if IDC between VMs on a physical node is done with shared pages, those VMs must always be migrated together to the same physical node, constraining the available scheduling options. Conversely, it may be possible, based on observed data sharing patterns, to move VMs onto the same physical node for better IDC performance; however, if the VMs have already chosen VMnets for IDC then the entire purpose is defeated.

II. PROPOSED WORK

The goal of this research is to build an IDC mechanism for Xen that provides the best possible performance while retaining isolation and flexibility. The platform for experimentation will be virtual machines running on the Xen hypervisor.

The most efficient mechanism for IDC on the same node is clearly physical memory. Zero copy communication from the producer to the consumer domain can be achieved by switching pages from the former to the latter using a technique like device channels [11], already present in Xen. To maintain buffer availability, the consumer domain will synchronously return to the producer domain as many pages as it consumes. This implementation will allow IDC on the same node to approach IPC in terms of data bandwidth and transfer latency. Alternatively, the same effect can be obtained with a shared memory region mapped into both VMs; this will also be investigated. Memory is not an option for inter domain communication spanning physical nodes. However, as shown in [8], much of the inherent latency of moving data in this case can be removed by using RDMA techniques from userspace.

It is possible to implement maximally efficient IDC while retaining flexibility by separating the actual data copying decisions from the virtual machines themselves. This can be done by exposing a standard "VM communication device" to each of the VMs. A frontend driver will support read and write operations on this device. The target VM for these operations is chosen using an appropriately chosen VM naming and identification scheme for a distributed system of physical nodes running virtual machines.
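As a rough illustration, the frontend of such a device might expose an interface along the following lines. This is only a sketch under assumed names: the vmcomm_* identifiers, the vm_id_t naming scheme and the operation signatures are hypothetical and are not part of any existing Xen interface.

/*
 * Hypothetical frontend interface for the proposed "VM communication
 * device".  All names are illustrative only.
 */
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Cluster-wide VM identifier: (physical node, domain) at a point in time.
 * A real naming scheme would also have to survive live migration. */
typedef struct {
    uint32_t node_id;    /* physical node in the cluster             */
    uint32_t domain_id;  /* Xen domain id on that node (may change)  */
} vm_id_t;

/* Operations the frontend driver exposes to guest software.  The open
 * call lets a guest hint that a persistent channel is worthwhile. */
struct vmcomm_frontend_ops {
    int     (*open) (vm_id_t target, int flags);
    ssize_t (*write)(int chan, const void *buf, size_t len); /* send to target VM   */
    ssize_t (*read) (int chan, void *buf, size_t len);       /* receive from target */
    int     (*close)(int chan);
};

The read and write calls deliberately mirror ordinary file I/O, so that userspace communication libraries such as sockets or MPI would need only small changes to use the device.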

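In the same spirit, the following minimal user-space sketch models the zero-copy local path described above, with the consumer synchronously returning one free page for every page it consumes so that the producer's buffer pool never shrinks. Page "transfer" is modelled here as plain pointer hand-off inside one process; a real implementation would instead switch machine frames between domains using device channels [11].

/* Simplified model of the page-switching local path (illustrative only). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE  4096
#define POOL_PAGES 8

typedef struct { unsigned char data[PAGE_SIZE]; } page_t;

typedef struct {
    page_t *free_pages[POOL_PAGES];  /* pages currently owned by the producer */
    int     nfree;
} page_pool_t;

/* Producer: take a free page, fill it, hand it over ("flip" it) to the consumer. */
static page_t *produce(page_pool_t *pool, const void *payload, size_t len)
{
    if (pool->nfree == 0 || len > PAGE_SIZE)
        return NULL;
    page_t *pg = pool->free_pages[--pool->nfree];
    memcpy(pg->data, payload, len);   /* only copy: application buffer -> page */
    return pg;                        /* ownership passes to the consumer      */
}

/* Consumer: use the page, then return it synchronously to keep the pool full. */
static void consume(page_pool_t *pool, page_t *pg)
{
    printf("consumed: %s\n", (const char *)pg->data);
    pool->free_pages[pool->nfree++] = pg;
}

int main(void)
{
    page_pool_t pool = { .nfree = 0 };
    for (int i = 0; i < POOL_PAGES; i++)
        pool.free_pages[pool.nfree++] = malloc(sizeof(page_t));

    const char *msg = "hello from the producer domain";
    page_t *pg = produce(&pool, msg, strlen(msg) + 1);
    if (pg != NULL)
        consume(&pool, pg);

    for (int i = 0; i < pool.nfree; i++)
        free(pool.free_pages[i]);
    return 0;
}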
The backend driver chooses the most efficient way of transferring the data. If the target VM is on the same node, it may simply switch pages, resulting in zero copy transfer of data between the frontend drivers in the source and target VMs. If the frontend driver in the producer VM wishes to keep the pages mapped (perhaps to avoid copying from the application buffers in the send call), the backend driver can instead choose the most efficient form of memory to memory copy on the same system. It may be possible to do this using hardware assists such as memory to memory DMA (if the platform supports it, for example using [12]) or, in the worst case, plain old fashioned copying. If the transfer involves communication between different physical nodes, the backend driver can decide whether to use RDMA type solutions for quick transfer, depending on the platform configuration (such as whether it supports InfiniBand), or fall back to setting up a network connection to the remote node. Even in the latter case it may decide to open and maintain a persistent connection, based on the frequency of data transfer or on an explicit request from the source VM (perhaps through an open call on the device). It may even be possible to multiplex data from different VMs on the same network connection to avoid the overhead of opening and maintaining multiple connections to the same node. This solution also covers the problem of VM migration, because the backend can change its data transfer mechanism independently of the frontend. At the same time, flexibility is maintained because the backend driver, which runs in the control domain or in an isolated domain, can be statically configured on a per physical node basis.

The proposed solution is analogous to virtual channel processors [13], which aim to increase flexibility in implementing IO stacks by decoupling them into a separate virtual machine. It can be thought of as a virtual channel processor dedicated to inter domain communication.

The proposed research will design and implement this flexible high performance IDC mechanism using the Xen hypervisor. A select set of applications will be used to measure the performance gains from this technique. It is anticipated that these will consist of transaction processing systems, such as a DBMS and webserver responding to TPC-C style queries, to reflect server workloads, and scientific applications communicating using MPI, to reflect grid workloads. Any required changes to userspace communication libraries such as sockets or MPI will be made to support this new device (a move towards extreme paravirtualization as in [14]). To level the playing field in terms of flexibility, the implementation will be benchmarked against an equivalent implementation using only VMnets. The benchmarking will also subject the virtual machines to live migration.

It is expected that the typical deployment scenario for VMs using IDC is in server farms running middleware such as Xenoserver [9] with a central scheduling entity. It is thus possible to use feedback from the backend drivers described above about IDC patterns and costs.
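As a rough illustration of both the transport choice made by the backend driver and the kind of per VM pair accounting it could expose to such a scheduling entity, consider the sketch below. The names, the selection policy and the statistics format are hypothetical; they do not correspond to an existing Xen interface.

/* Illustrative backend transport selection and IDC accounting (sketch only). */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum transport {
    XFER_PAGE_FLIP,   /* same node, pages can be switched: zero copy          */
    XFER_LOCAL_COPY,  /* same node, memory to memory copy (DMA engine or CPU) */
    XFER_RDMA,        /* remote node with RDMA capable hardware, as in [8]    */
    XFER_TCP          /* remote node, plain network connection as fallback    */
};

/* Traffic counters for one (source VM, target VM) pair; a cluster scheduler
 * could poll these to decide which VMs to co-locate on one node. */
struct idc_stats {
    uint64_t bytes;
    uint64_t transfers;
};

static enum transport pick_transport(bool same_node, bool can_flip_pages,
                                     bool has_rdma)
{
    if (same_node)
        return can_flip_pages ? XFER_PAGE_FLIP : XFER_LOCAL_COPY;
    return has_rdma ? XFER_RDMA : XFER_TCP;
}

static void account(struct idc_stats *s, uint64_t nbytes)
{
    s->bytes += nbytes;
    s->transfers++;
}

int main(void)
{
    struct idc_stats stats = { 0, 0 };
    enum transport t = pick_transport(true, true, false);
    account(&stats, 4096);
    printf("transport=%d bytes=%llu transfers=%llu\n",
           t, (unsigned long long)stats.bytes,
           (unsigned long long)stats.transfers);
    return 0;
}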

Such feedback can in turn be used for better scheduling policies that dynamically move communicating VMs onto the same node as far as possible. This research will target an implementation that provides such feedback. If possible, such a scheduler will also be investigated.

Filesystem based communication is currently a non-goal for this research. Techniques for improving the efficiency of VMs that share file systems are already being investigated [14] [15] [16]. It is not anticipated that files will be a popular mode of IDC, primarily due to the difficulty of maintaining coherent views among VMs. However, the IDC device described above can easily coexist with these mechanisms. In addition, if the VMs require persistent copies of transferred data, it may be possible to interface the backend driver with the unified buffer cache of [14] or with an on disk FS cache as in [16].

III. BENEFITS OF RESEARCH

The current trend in computer architecture towards a larger number of smaller, simpler and more power efficient cores per microprocessor [17] means that the number of hardware threads available in server platforms is set to grow phenomenally, an early example being Sun's Niagara [18]. Coupled with server consolidation, this means that the number of virtual machines running on a single physical node will also grow. In such an environment it is inevitable that virtual machines will share data. This research will improve the performance of communication between such virtual machines while retaining scheduling flexibility and tolerance to failures.

REFERENCES

[1] Gerald J. Popek and Robert P. Goldberg. Formal requirements for virtualizable third generation architectures. Commun. ACM, 17(7):412–421, 1974.
[2] Paul T. Barham, Boris Dragovic, Keir Fraser, Steven Hand, Timothy L. Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the art of virtualization. In SOSP, pages 164–177, 2003.
[3] S. Devine, E. Bugnion, and M. Rosenblum. Virtualization system including a virtual machine monitor for a computer with a segmented architecture.
[4] A. Whitaker, M. Shaw, and S. Gribble. Scale and performance in the Denali isolation kernel, 2002.
[5] Xuehai Zhang, Katarzyna Keahey, Ian Foster, and Timothy Freeman. Virtual cluster workspaces for grid applications.
[6] Message Passing Interface Forum. MPI: A message-passing interface standard. Technical Report UT-CS-94-230, 1994.
[7] Anirban Saha and Gang Peng. How good is Xen for simulating distributed applications?
[8] Jiuxing Liu, Wei Huang, Bulent Abali, and Dhabaleswar K. Panda. High performance VMM-bypass I/O in virtual machines. In USENIX Annual Technical Conference, 2006.
[9] Evangelos Kotsovinos. Global Public Computing. Technical Report UCAM-CL-TR-615, Computer Laboratory, University of Cambridge, January 2005.
[10] C. Clark, K. Fraser, S. Hand, J. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield. Live migration of virtual machines, 2005.
[11] K. Fraser, S. Hand, R. Neugebauer, I. Pratt, A. Warfield, and M. Williamson. Safe hardware access with the Xen virtual machine monitor. In 1st Workshop on Operating System and Architectural Support for the on-demand IT InfraStructure (OASIS), 2004.
[12] Intel I/O Acceleration Technology. www.intel.com/go/ioat.
[13] D. McAuley and R. Neugebauer. A case for virtual channel processors, 2003.
[14] Mark Williamson. Extreme paravirtualisation: beyond arch/xen.
[15] Ben Pfaff, Tal Garfinkel, and Mendel Rosenblum. Virtualization aware file systems: Getting beyond the limitations of virtual disks. In 3rd Symposium on Networked Systems Design and Implementation (NSDI), May 2006.

[16] Ming Zhao, Jian Zhang, and Renato Figueiredo. Distributed file system support for virtual machines in grid computing. In HPDC-13, 2004.
[17] John D. Davis, James Laudon, and Kunle Olukotun. Maximizing CMP throughput with mediocre cores. In PACT '05: Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, pages 51–62, Washington, DC, USA, 2005. IEEE Computer Society.
[18] Poonacha Kongetira, Kathirgamar Aingaran, and Kunle Olukotun. Niagara: A 32-way multithreaded SPARC processor. IEEE Micro, 25(2):21–29, 2005.
