An Instrumented Data Center Infrastructure for Research on Cross-Layer Autonomics

Overview

This project's goal is to acquire and develop an instrumented datacenter testbed spanning the three sites of the NSF Center for Autonomic Computing (CAC): the University of Florida (UF), the University of Arizona (UA), and Rutgers, the State University of New Jersey (RU). The proposed infrastructure reflects the natural heterogeneity, dynamism and distribution of real-world datacenters, and includes embedded instrumentation at all levels: platform, virtualization, middleware and application. Its scale and geographical distribution enable studies of both the "scale-up" and "scale-out" challenges faced by datacenter applications, services, middleware and architectures, and will support fundamental, far-reaching research on cross-layer autonomics for managing and optimizing large-scale datacenters. This research is motivated by the growing complexity and cost of operating and managing enterprise datacenters. The participating sites contribute complementary expertise: UA at the resource level, UF at the virtualization layer, and RU in the area of services and applications. The industrial members of the CAC identified this end-to-end approach to autonomic management as critical for bringing coherence to ongoing, separate research efforts and for transforming how datacenter management problems, which have so far mostly been addressed one layer at a time, are modeled, formulated and solved.

The infrastructure will uniquely enable scientific understanding of cross-layer management of datacenters in many contexts, such as:

  • At the resource level, using information from the virtualization layer: the virtualization layer reports the virtual machine capacities required by the current workloads. The autonomic controller at the resource level uses this information to proactively detect over-provisioning and dynamically scale server-platform resources up or down, so that each platform is just large enough to handle the expected workloads while still providing the promised service levels. For example, power can be saved by transitioning over-provisioned resources to low-power states (a minimal sketch of this right-sizing flow follows this list).

  • At the virtualization level, using information from the resource and services layers: the robustness of virtualization-level autonomic controllers with respect to resource-level parameters will be characterized, and predictive models of service behavior will be developed to estimate demands on virtual machine performance and resources.

  • At the services and application level, using information from the virtualization layer: programming and runtime support will be developed to enable autonomic services capable of managing system and application dynamics and uncertainty in terms of scale, performance, energy consumption and cost. For example, mechanisms and policies will be developed that let applications and services react to adaptations at the platform and virtualization layers, e.g., by adapting service behavior and/or re-negotiating SLAs.
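
As a concrete illustration of the resource-level bullet above, the following minimal Python sketch shows a controller consuming per-VM demand forecasts published by the virtualization layer and right-sizing the number of active cores, parking the surplus in a low-power state. All names, data structures and thresholds here are our own illustrative assumptions, not part of the actual CAC testbed software.

    from dataclasses import dataclass

    @dataclass
    class VmForecast:
        vm_id: str
        cpu_cores_needed: float  # predicted cores required to meet the SLA

    def right_size_cores(forecasts, total_cores, headroom=0.2):
        """Return (active_cores, parked_cores) for the next control interval."""
        demand = sum(f.cpu_cores_needed for f in forecasts)
        # Keep a safety margin so a demand spike does not violate the SLA
        # before the next control decision.
        active = min(total_cores, max(1, round(demand * (1 + headroom))))
        return active, total_cores - active

    active, parked = right_size_cores(
        [VmForecast("web-1", 3.2), VmForecast("db-1", 2.1)], total_cores=16)
    print(f"keep {active} cores active, transition {parked} to low-power states")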

The IT infrastructures of companies such as Google, Amazon, eBay and E-Trade are powered by datacenters that contain tens to hundreds of thousands of computers and storage devices running complex software applications. Computer manufacturers, software developers and service providers spend significant time and resources ensuring that datacenters and applications are optimized for high performance and low operational cost. In particular, there is strong industry interest in autonomic management approaches to tuning performance and minimizing energy consumption in datacenters. This interest is reflected in the projects currently underway at the NSF Industry/University Cooperative Research Center for Autonomic Computing (CAC), which include research efforts on the following topics:

  1. Autonomic power, thermal and performance management of large-scale datacenters: the objectives include techniques and mechanisms that enable datacenters to (1) adaptively learn and automatically identify strategies that minimize power consumption and the thermal envelope while maintaining the required Quality of Service (QoS) for a wide range of workloads and applications; and (2) dynamically reconfigure computing, storage and network resources according to the selected optimization strategies while meeting the agreed-upon SLAs.

  2. Autonomic demand-driven service and power management in virtualized datacenters: the goals include the following: to devise mechanisms to monitor, model and predict workloads associated with individual services; to model and predict global resource demand; and to dynamically allocate virtual machines to, and de-allocate them from, physical machines.

  3. Autonomic services architecture: the research focuses on programming systems and proactive execution engines for autonomic services that leverage monitoring and adaptation capabilities at the platform and virtual-machine levels to manage system/application dynamics and uncertainty, and to enforce desired performance, power and cost constraints.

These three projects address related issues (performance, power and thermal management of datacenters) at different layers of the IT stack: the resource, virtualization and service levels.

Resource-level Multi-platform Autonomic Computing Techniques

The goal is to design innovative hierarchical autonomic architectures that can be integrated with traditional server platforms to transform them into intelligent, self-managing entities that optimize performance per watt. To achieve this goal, we plan to focus on power and performance at the resource level within the platform and at the platform level within an enclosure. The central ideas behind our research approach are: (i) to proactively detect and reduce resource over-provisioning in server platforms so that each platform is right-sized for the requirements of its applications; and (ii) to migrate virtual machines between physical servers in the enclosure to smooth thermal gradients. The first approach saves power by transitioning over-provisioned resources to low-power states while maintaining performance by satisfying the applications' resource requirements. The second approach reduces hot spots within the enclosure and maintains the thermal envelope, which in turn reduces cooling costs.

Autonomic power and performance management of server platforms

Our approach to developing autonomic power-, thermal- and performance-managed server platforms can be viewed as a special case of an autonomic computing system. We consider an enclosure with multiple server platforms, each consisting of multi-core processors, multi-rank memory subsystems and other resources. (Within the platform we will initially focus on the processor and memory, and then expand to other components as the work progresses.) The autonomic enclosure consists of three levels of management: the Enclosure Autonomic Manager (EAM) at the enclosure level; the Platform Autonomic Manager (PAM) at the platform level; and the Core Manager (CM) and Rank Manager (RM) at the lowest level, managing individual processor cores and memory ranks, respectively.

The objective of the EAM is to ensure that all platforms within the enclosure operate within the pre-determined thermal gradient and thermal/power envelope by migrating virtual machines running specific workloads from one platform to another. Similarly, the objective of the PAM is to ensure that the platform resources (processor/memory) are configured to meet the dynamic application resource requirements, so that any excess platform capacity can be transitioned to low-power states. In this manner both the EAM and the PAM save total power without hurting application performance.

The platform power and performance parameters together define the platform operating point, a point in an n-dimensional space that moves during the lifetime of the application. The PAM manages platform power and performance by keeping this operating point within a predetermined safe operating zone: it predicts the trajectory of the operating point as it responds to changes in the nature and arrival rate of the incoming workload, and triggers a platform reconfiguration whenever the operating point drifts outside the safe zone. For example, a sudden increase in the arrival rate of processor-intensive jobs would increase the average platform response time because the processor is not configured to handle the additional traffic (for instance, too few cores are in a high-power processing state), reducing the job-processing rate and increasing the average response time. The same technique will be used to keep thermal and power parameters in range.
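
A minimal sketch of the PAM control loop described above, assuming a two-dimensional operating point (average response time, platform power) and simple linear extrapolation of its trajectory; the safe-zone bounds and all names are invented for illustration:

    import numpy as np

    # Operating point: [average response time (ms), platform power (W)].
    SAFE_LOW = np.array([0.0, 0.0])
    SAFE_HIGH = np.array([50.0, 400.0])  # illustrative safe-zone bounds

    def predict_next(history):
        """Linearly extrapolate the operating-point trajectory."""
        if len(history) < 2:
            return history[-1]
        return history[-1] + (history[-1] - history[-2])

    def pam_step(history, reconfigure):
        predicted = predict_next(history)
        if np.any(predicted < SAFE_LOW) or np.any(predicted > SAFE_HIGH):
            # Predicted drift outside the safe zone: choose a new platform
            # state (e.g., wake additional cores for a CPU-bound burst).
            reconfigure(predicted)

    pam_step([np.array([30.0, 250.0]), np.array([45.0, 260.0])],
             reconfigure=lambda p: print("reconfigure; predicted point:", p))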

The EAM and PAM monitor the rate of change of the platform response time during each observation interval to predict the nature of the incoming workload and reconfigure the platform accordingly. Similarly, a sudden increase in the arrival rate of memory-intensive jobs may also increase the platform response time; the PAM then monitors additional parameters, such as the memory miss ratio, memory end-to-end delay and memory request loss, to determine the memory configuration that keeps the platform response time within the safe operating region.

The platform state is defined by the number of processor cores in the turbo state, the number of memory ranks in the active state, and the physical location of the memory ranks within the memory hierarchy. The power consumed in a given platform state is modeled as the sum of the power consumed by its constituent parts (cores and ranks), while the performance of a platform state depends on the physical configuration of the platform. A PAM reconfiguration decision is a platform state transition from the current state to a target state that maintains performance at the smallest power consumption; the search for this target state is formulated as an optimization problem.
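
In our own shorthand (the symbols below are not taken from the project documents), this target-state search can be written as:

    \begin{aligned}
    \min_{s \in S} \quad & P(s) \;=\; \sum_{c \in \mathrm{cores}(s)} P_{\mathrm{core}}(c)
                           \;+\; \sum_{r \in \mathrm{ranks}(s)} P_{\mathrm{rank}}(r) \\
    \text{subject to} \quad & T_{\mathrm{resp}}(s, w) \;\le\; T_{\mathrm{safe}},
    \end{aligned}

where S is the set of platform states reachable from the current state, w is the predicted workload, T_resp(s, w) is the modeled platform response time in state s, and T_safe is the response-time bound of the safe operating zone.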

The EAM will rely on thermal sensors placed at critical positions within the enclosure and platforms. Data from these sensors will be collected by the EAM, which will make decisions based on the thermal gradient within the enclosure, the set of resources required by the running workloads, and the decisions the PAM has been taking. One interesting research challenge is the sensitivity to data inaccuracies as one moves up the hierarchy from the component managers to the platform managers to the enclosure manager.
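
A deliberately simplified sketch of such an EAM migration policy, with sensor access and live migration stubbed out and all names and thresholds assumed for illustration:

    GRADIENT_LIMIT_C = 8.0  # assumed acceptable intra-enclosure gradient

    def eam_step(platform_temps, pick_vm, migrate):
        """platform_temps maps platform id -> sensor temperature (deg C)."""
        hottest = max(platform_temps, key=platform_temps.get)
        coolest = min(platform_temps, key=platform_temps.get)
        if platform_temps[hottest] - platform_temps[coolest] > GRADIENT_LIMIT_C:
            vm = pick_vm(hottest)          # e.g., the most CPU-intensive VM
            migrate(vm, hottest, coolest)  # live migration at the virtualization layer

    eam_step({"blade-1": 68.0, "blade-2": 55.0, "blade-3": 58.0},
             pick_vm=lambda platform: "vm-42",
             migrate=lambda vm, src, dst: print(f"migrate {vm}: {src} -> {dst}"))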

Virtualization-layer Autonomic Computing Techniques

Autonomic computing systems in datacenter environments benefit from virtualization techniques in a variety of ways. System virtual machine technologies such as VMware and Xen provide a flexible management platform that is useful both for encapsulating application execution environments and for aggregating and accounting for the resources consumed by an application. Research on autonomic computing techniques at the virtualization layer at the CAC has been based on approaches sharing the following characteristics:

  • Encapsulation and isolation: VM “containers” are used to encapsulate an application and its execution environment in images that hold the entire state associated with the virtual machine (including CPU, memory and I/O devices).
  • Dynamic provisioning: A container can be instantiated on any datacenter resource that provides sufficient capacity (CPU cycles, disk/memory space, network bandwidth) and can be migrated to other resources at run-time without significant service disruption using live-migration techniques.
  • Resource usage monitoring: Resource consumption of the container can be measured in ways that reflect its hardware utilization. Metrics available in existing commercial VM frameworks include the physical memory occupied by the container, the fraction of physical CPU cycles consumed, disk I/O block transfers, and network packets sent/received per unit of time.
  • Resource allocation control: Resources bound to a virtual container can be reserved by “slicing” the available physical hardware resources, including physical memory, processor time-slices, and disk and network bandwidth.

Because of these characteristics, virtual machines form a layer of the hardware/software stack that is well positioned to provide the fine-grain resource monitoring and control capabilities at the core of the MAPE (monitor, analyze, plan, execute) control loop of autonomic computing frameworks.
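
As one concrete example of the monitoring capability, the sketch below samples per-domain CPU time through the libvirt Python bindings and derives utilization; the connection URI and the five-second sampling interval are assumptions for illustration:

    import time
    import libvirt  # pip install libvirt-python

    conn = libvirt.open("qemu:///system")
    # dom.info() returns [state, maxMem(KiB), mem(KiB), nrVirtCpu, cpuTime(ns)].
    before = {d.name(): d.info()[4] for d in conn.listAllDomains()}
    time.sleep(5)
    for dom in conn.listAllDomains():
        if not dom.isActive():
            continue
        state, max_kib, mem_kib, ncpu, cpu_ns = dom.info()
        used_ns = cpu_ns - before.get(dom.name(), cpu_ns)
        util = used_ns / (5 * 1e9 * ncpu)  # fraction of the VM's CPU allotment
        print(f"{dom.name()}: {util:.1%} CPU, {mem_kib // 1024} MiB memory")
    conn.close()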


People


Salim Hariri
Youssif Al-Nashif
Ali Akoglu
Haoting Luo
Arjun Hary


Publications

  • Arjun Hary, Ali Akoglu, Youssif Al-Nashif, Salim Hariri, and Darrel Jenerette, “Design and Evaluation of a Self-healing Kepler for Scientific Workflows,” in Proceedings of the 19th International Symposium on High Performance Distributed Computing (HPDC), 2010.

  • Haoting Luo, Youssif Al-Nashif, and Salim Hariri, “Enclosure Autonomic Manager (EAM): Design and Evaluation,” technical report, to be submitted to the 2010 IEEE/ACM International Conference on Green Computing and Communications (GreenCom 2010).

 


Related Papers

  • Y. Jararweh, A. Hary, Y. B. Al-Nashif, S. Hariri, A. Akoglu, and D. Jenerette, “Accelerated Discovery Through Integration of Kepler with Data Turbine for Ecosystem Research,” in Proceedings of the IEEE/ACS International Conference on Computer Systems and Applications (AICCSA), 2009, pp. 1005-1012.

  • B. Khargharia, S. Hariri and M.S. Yousif, “An Adaptive Interleaving Technique for Memory Performance-per-Watt Management,” IEEE Transactions on Parallel and Distributed Systems, Vol. 20, Issue 7, July 2009, pp. 1011-1022.

 
