Professor Dawn Woodard Aids the Diagnosis of Performance Problems in Large-Scale Distributed Computing Systems

Mark Eisner (ORIE); April 29, 2011

A recent lengthy outage crisis at an Amazon data center in Virginia that provides information processing services for a large number of websites has put a spotlight on the reliability of an approach to service known as "cloud computing."

Amazon, Microsoft, Google and other organizations have pioneered the business of establishing data centers with as many as hundreds of thousands of computer servers that run software for applications such as search, email processing, e-commerce, and data storage for many companies. Such large, distributed computing systems are at the heart of cloud computing, so-called because a cloud symbol is often used in diagrams illustrating the connection of individual computers to the internet. Cloud computing is forecast to become a multi-billion dollar business.

In a recently published paper, ORIE Assistant Professor Dawn Woodard and Moises Goldszmidt of Microsoft Research describe a new method of exploiting statistical techniques to rapidly, accurately and automatically match a currently occurring performance lapse, or "crisis," which in the extreme could lead to an outage, to previous crises whose causes may be known or unknown.

The method can be used as part of a process to chose the optimal intervention (taking into account the uncertainty of the match) thereby minimizing the expected cost of the crisis, a cost which may include payouts to clients for violation of service agreements as well as client dissatisfaction.

The new method is based on measurements ("metrics") that are routinely monitored for each server at a data center. For example, the level of utilization of the server's central processor and memory, the number of emails awaiting filtering for spam, and other key performance indicators are continually tracked and aggregated over short, fixed-length time intervals.

Three metrics, with crisis intervals of types A, B and C highlighted.

When graphed, the median time series of such metrics, taken across all servers, exhibit characteristic behavior in times of crisis, with the pattern of behavior of the collection of metrics differing with different causes. Crisis types may include configuration errors, overloads, spikes in workload, and even disruption in power to the data center, each with different patterns, or 'fingerprints.' Outages from a crisis are typically avoided by fail safe and redundant mechanisms, but even if they are, degraded performance is at best inconvenient and at worst, expensive.

Woodard and Goldszmidt's approach is designed to carry out computations during the time intervals between crises to automatically group past crises into clusters, fully employing an approach known as Bayesian inference. The computation can start with input from human experts. It then applies values of the observed metrics for each crisis in the past to update the estimate of the number of different types, or clusters, into which crises can be gathered and the best cluster to assign to each crisis.

When a new performance lapse is detected (sometimes even before it is technically identified as having started), the values of the metrics associated with it are incorporated into the input data and a new set of clusters is very rapidly and accurately generated, thereby identifying which previous crises the current crisis most resembles. This information can lead to early resolution of the crisis.

To test the value of their method, Woodard and Goldszmidt applied it to data from one of Microsoft's Hosting Services for the first four months of 2008. During that time, twenty-seven performance problems of varying severity occurred. Experts were able to diagnose the causes of most of them. More than 90% of the time the new clustering method determined the correct cluster membership, on average doing so based on the portion of the data available in some cases even before the technical start of the crisis. Moreover the technique identified each recurrent problem in record time.

The method "dramatically outperforms a state-of-the art...clustering method," according to the authors. Beyond speed and accuracy, it has several advantages, including the fact that it does not grow in complexity with the number of servers, as distributed data centers get larger and larger. It can be used to distinguish the system status metrics that are most strongly associated with crises of a particular type, it "learns" a mapping from the metrics to the best intervention, it can forecast the evolution of a crisis as it occurs, and potentially it can be used to model not just the evolution but how the evolution depends on the intervention that is taken.

Woodard was an intern with Goldszmidt at two different companies: a start up and HP Labs, before and during her Ph.D. studies. "As a statistician, my job is to facilitate applied work by developing and applying sound methods for learning from data," Woodard said. "I work closely with collaborators in applied disciplines in order to contribute to those disciplines, and to keep my work grounded in reality. In the case of this collaboration, Moises is an expert both on distributed computing systems and on statistical methods, which greatly facilitated our project. He had already done a huge amount of analysis of the data, both exploratory and statistical, before we began."

Goldszmidt said "the goals for this technique are to facilitate the job of the data center operators and maximize the availability and reliability of cloud services. Professor Woodard has a unique combination of expertise in both statistics and computer science that moved this research effort to the next level. I am excited to continue our collaboration on this and related topics."