An Instrumented Data Center Infrastructure for Research on Cross-Layer Autonomics
This project's goal is to acquire and develop an instrumented datacenter testbed spanning the three sites of the NSF Center for Autonomic Computing (CAC): the University of Florida (UF), the University of Arizona (UA) and Rutgers, the State University of New Jersey (RU). The proposed infrastructure reflects the natural heterogeneity, dynamism and distribution of real-world datacenters, and includes embedded instrumentation at all levels: platform, virtualization, middleware and application. Its scale and geographical distribution enable studies of both "scale-up" and "scale-out" challenges faced by datacenter applications, services, middleware and architectures. It will enable fundamental and far-reaching research focused on cross-layer autonomics for managing and optimizing large-scale datacenters. This research is motivated by the growing complexity and cost of operating and managing enterprise datacenters. The participating sites will contribute complementary expertise: UA at the resource level, UF at the virtualization layer and RU in the area of services and applications. The industrial members of the CAC identified this end-to-end approach to autonomic management as critical for bringing coherence to ongoing separate research efforts and for having a transformative impact on the modeling, formulation and solution of datacenter management problems, which have so far mostly been addressed one layer at a time.
The infrastructure will uniquely enable scientific understanding of cross-layer management of datacenters in many contexts.
The IT infrastructures of companies such as Google, Amazon, eBay and E-Trade are powered by datacenters that contain tens to hundreds of thousands of computers and storage devices running complex software applications. Computer manufacturers, software developers and service providers spend significant time and resources ensuring that datacenters and applications are optimized for high performance and low operational cost. In particular, there is strong industry interest in devising autonomic management approaches to the problems of tuning performance and minimizing energy consumption in datacenters. This interest is reflected in the projects currently underway at the NSF Industry/University Cooperative Research Center for Autonomic Computing (CAC), which include research efforts on the following topics:
These three projects address related issues (performance, power and thermal management of datacenters) at different layers of the IT stack: the resource, virtualization and service levels.
Resource-level Multi-platform Autonomic Computing Techniques
The goal is to design innovative hierarchical autonomic architectures that can be integrated with traditional server platforms to transform them into intelligent self-managing entities that optimize performance per watt. To achieve this goal, we plan to focus on power and performance at the resource level within the platform and at the platform level within an enclosure. The central ideas behind our research approach are: (i) to proactively detect and reduce resource over-provisioning in server platforms, so that each platform is right-sized to the requirements of its applications; and (ii) to migrate virtual machines from one physical server in the enclosure to another in order to smooth thermal gradients within the enclosure. In the first approach, we save power by transitioning over-provisioned resources to low-power states while maintaining performance by satisfying the application resource requirements. In the second approach, we reduce hot spots within the enclosure and maintain a thermal envelope, which in turn reduces cooling costs.
Autonomic power and performance management of server platforms
Our approach to developing autonomic power-, thermal- and performance-managed server platforms can be viewed as a special case of an autonomic computing system. We consider an enclosure with multiple server platforms, each consisting of multi-core processors, multi-rank memory subsystems and other resources. (Within the platform, we will initially focus on the processor and memory, and expand to other components as the work progresses.) The autonomic enclosure has a three-level management hierarchy: the Enclosure Autonomic Manager (EAM) at the enclosure level; the Platform Autonomic Manager (PAM) at the platform level; and the Core Manager (CM) and Rank Manager (RM) at the lowest level, managing individual processor cores and memory ranks, respectively. The objective of the EAM is to ensure that all platforms within the enclosure operate within the predetermined thermal gradient and thermal/power envelope by migrating virtual machines running specific workloads from one platform to another. Similarly, the objective of the PAM is to ensure that the platform resources (processor/memory) are configured to meet the dynamic application resource requirements, so that any excess platform capacity can be transitioned to low-power states. In this manner, both the EAM and the PAM save total power without hurting application performance. Together, the platform power and performance parameters determine the platform operating point, a point in an n-dimensional space, at any instant during the lifetime of the application. The PAM manages platform power and performance by keeping the operating point within a predetermined safe operating zone: it predicts the trajectory of the operating point as it changes in response to the nature and arrival rate of the incoming workload, and triggers a platform reconfiguration whenever the operating point drifts outside the safe operating zone.
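The management hierarchy above can be sketched in code. This is a minimal illustration, not the project's actual interfaces: the class names, the two-dimensional operating point (power, response time) and all thresholds are assumptions chosen for clarity.

```python
# Sketch of the EAM/PAM hierarchy: a PAM checks its platform's operating
# point against a safe zone; an EAM checks the thermal gradient across
# platforms. All names and thresholds are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class OperatingPoint:
    power_w: float           # platform power draw (watts)
    response_time_ms: float  # average platform response time


@dataclass
class SafeZone:
    max_power_w: float
    max_response_time_ms: float

    def contains(self, op: OperatingPoint) -> bool:
        return (op.power_w <= self.max_power_w
                and op.response_time_ms <= self.max_response_time_ms)


class PlatformAutonomicManager:
    """PAM: flags a reconfiguration when the operating point leaves the zone."""

    def __init__(self, zone: SafeZone):
        self.zone = zone

    def needs_reconfiguration(self, op: OperatingPoint) -> bool:
        return not self.zone.contains(op)


class EnclosureAutonomicManager:
    """EAM: flags VM migration when the enclosure thermal gradient is too wide."""

    def __init__(self, max_gradient_c: float):
        self.max_gradient_c = max_gradient_c

    def should_migrate(self, platform_temps_c: list) -> bool:
        return max(platform_temps_c) - min(platform_temps_c) > self.max_gradient_c
```

In this sketch a platform drawing 180 W with a 60 ms response time would violate a (200 W, 50 ms) safe zone on the performance axis alone, so the PAM would trigger a reconfiguration even though power is within bounds.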
For example, a sudden increase in the arrival rate of processor-intensive jobs would increase the average platform response time, because the processor is not configured to handle the added traffic (for instance, too few cores are in a high-power processing state) and jobs are processed at a lower rate. The same technique will be used to optimize the thermal and power parameters.
The EAM and PAM monitor the rate of change of the response time during each observation interval to predict the nature of the incoming workload and reconfigure the platform to suit it. Similarly, a sudden increase in the arrival rate of memory-intensive jobs may also increase the platform response time. The PAM monitors additional parameters, such as the memory miss ratio, memory end-to-end delay and memory request loss, to determine the best memory configuration for keeping the platform response time within the safe operating region.
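One way to act on the rate of change of a monitored parameter is to extrapolate its trend and reconfigure before the safe zone is actually violated. The sketch below assumes a simple linear extrapolation over evenly spaced samples; the window length, horizon and thresholds are illustrative, not the project's actual prediction model.

```python
# Hypothetical proactive trigger: fit a mean slope to recent response-time
# samples and reconfigure when the extrapolated value would exceed the safe
# limit. Linear extrapolation and all parameters are assumptions.
def predict_next(samples, horizon=1):
    """Extrapolate the next value from evenly spaced samples via the mean slope."""
    if len(samples) < 2:
        return samples[-1]
    slope = (samples[-1] - samples[0]) / (len(samples) - 1)
    return samples[-1] + slope * horizon


def should_reconfigure(response_times_ms, safe_limit_ms, horizon=2):
    """Trigger proactively if the predicted response time leaves the safe zone."""
    return predict_next(response_times_ms, horizon) > safe_limit_ms
```

With samples of 30, 34, 38, 42 ms (a steady 4 ms/interval climb) and a 48 ms limit, the two-interval extrapolation reaches 50 ms, so a reconfiguration fires before the limit is actually crossed; a flat trace near 30 ms fires nothing.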
The platform state is defined by the number of processor cores in the turbo state, the number of memory ranks in the active state and the physical location of memory ranks within the memory hierarchy. The power consumed in a platform state is thus the sum of the power consumed by its constituent parts (cores, ranks), while the performance of the state depends on the physical configuration of the platform. A PAM reconfiguration decision is a platform state transition from the current state to a target state that maintains performance at the smallest power consumption. The search for this ideal target state is formulated as an optimization problem.
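In its simplest form, this optimization can be written as an exhaustive search over candidate states: model power as the sum of per-core and per-rank power, keep only states whose predicted capacity meets the workload's demand, and pick the cheapest. The power figures and the unit-capacity performance model below are illustrative assumptions, not measured values, and a real PAM would use a smarter search than brute-force enumeration.

```python
# Sketch of the target-state search: enumerate (turbo cores, active ranks)
# states, sum assumed per-component power, and take the lowest-power state
# that still meets demand. All numbers are illustrative assumptions.
from itertools import product

CORE_POWER_W = {"turbo": 15.0, "low": 4.0}
RANK_POWER_W = {"active": 3.0, "sleep": 0.5}


def state_power(turbo_cores, active_ranks, total_cores=8, total_ranks=8):
    """Platform power as the sum of its parts (cores + ranks)."""
    return (turbo_cores * CORE_POWER_W["turbo"]
            + (total_cores - turbo_cores) * CORE_POWER_W["low"]
            + active_ranks * RANK_POWER_W["active"]
            + (total_ranks - active_ranks) * RANK_POWER_W["sleep"])


def meets_demand(turbo_cores, active_ranks, cpu_demand, mem_demand):
    # Assumed capacity model: each turbo core / active rank supplies one unit.
    return turbo_cores >= cpu_demand and active_ranks >= mem_demand


def best_state(cpu_demand, mem_demand, total_cores=8, total_ranks=8):
    """Lowest-power feasible state (performance constraint, power objective)."""
    feasible = [
        (state_power(c, r, total_cores, total_ranks), c, r)
        for c, r in product(range(total_cores + 1), range(total_ranks + 1))
        if meets_demand(c, r, cpu_demand, mem_demand)
    ]
    power, cores, ranks = min(feasible)
    return {"turbo_cores": cores, "active_ranks": ranks, "power_w": power}
```

Because power is monotone in both knobs here, the optimum lands exactly at the demanded capacity; in practice the state space also includes rank placement and the search must cope with non-monotone performance models.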
The EAM will rely on thermal sensors placed at critical positions within the enclosure and platforms. Data from these sensors will be collected by the EAM, which will make decisions based on the thermal gradient within the enclosure, the set of resources required by the running workloads and the decisions the PAMs have been making. One interesting research challenge is the sensitivity to data inaccuracies as one moves up the hierarchy, from the component managers to the platform managers to the enclosure manager.
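A simple way to see the sensor-inaccuracy problem is to summarize each platform's sensors with a robust statistic before the EAM compares platforms. The sketch below uses a per-platform median to dampen a single faulty reading and proposes migrating load from the hottest to the coolest platform; the median filter and threshold are assumptions for illustration only.

```python
# Illustrative EAM migration decision over noisy sensor data: take the
# median of each platform's readings (robust to one bad sensor), then
# propose hottest -> coolest migration if the gradient is too wide.
from statistics import median


def plan_migration(sensor_readings_c, max_gradient_c):
    """sensor_readings_c: {platform_id: [temps]} -> (src, dst) or None."""
    temps = {p: median(readings) for p, readings in sensor_readings_c.items()}
    hottest = max(temps, key=temps.get)
    coolest = min(temps, key=temps.get)
    if temps[hottest] - temps[coolest] > max_gradient_c:
        return hottest, coolest
    return None
```

Note how a stuck sensor reading 90 °C on an otherwise 40 °C platform does not trigger a spurious migration, because the median discards the outlier; a mean-based summary would not.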
Virtualization-layer Autonomic Computing Techniques
Autonomic computing systems in datacenter environments benefit from virtualization techniques in a variety of ways. System virtual machine monitors such as VMware and Xen provide a flexible management platform that is useful both for encapsulating application execution environments and for aggregating and accounting for the resources consumed by an application. Research on autonomic computing techniques at the virtualization layer at the CAC has been based on approaches sharing the following characteristics:
Because of these characteristics, virtual machines provide a layer that is well positioned in the hardware/software stack of computer systems to supply the fine-grained resource monitoring and control capabilities at the core of the MAPE (monitor, analyze, plan, execute) control loop of autonomic computing frameworks.
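One pass of a MAPE loop at the virtualization layer can be sketched as below. The single CPU knob, the utilization thresholds and the fixed adjustment step are illustrative assumptions standing in for a real hypervisor's monitoring and actuation interfaces.

```python
# Sketch of one MAPE pass over a VM's CPU share: monitor utilization,
# analyze it against thresholds, plan a share adjustment, execute it.
# Thresholds, step size and the in-memory "VM" are assumptions.
vm = {"cpu_share": 1.0, "cpu_util": 0.95}


def monitor():
    """Monitor: read current VM metrics (here, a snapshot of the dict)."""
    return dict(vm)


def analyze(metrics, high=0.9, low=0.2):
    """Analyze: classify the symptom from observed utilization."""
    if metrics["cpu_util"] > high:
        return "overloaded"
    if metrics["cpu_util"] < low:
        return "underloaded"
    return "ok"


def plan(symptom, step=0.25):
    """Plan: map the symptom to a CPU-share adjustment."""
    return {"overloaded": +step, "underloaded": -step}.get(symptom, 0.0)


def execute(delta, floor=0.25):
    """Execute: apply the adjustment, never dropping below a floor share."""
    vm["cpu_share"] = max(floor, vm["cpu_share"] + delta)


def mape_step():
    """One full monitor -> analyze -> plan -> execute cycle."""
    action = plan(analyze(monitor()))
    execute(action)
    return action
```

With the VM at 95% utilization, one cycle raises its CPU share by 0.25; the same loop shape accommodates richer analysis (trend prediction) and planning (the state-space search above) without changing the control structure.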