Availability in Globally Distributed Storage Systems
Introduction
This paper, published by Google researchers in 2010, provides rare insight into how one of the world's largest tech companies keeps data available and accessible despite constant hardware and software failures. The study analyzed real failure data from Google's production storage infrastructure over an entire year, examining tens of thousands of machines across multiple data centers.
Why this matters: Understanding how large-scale storage systems handle failures is crucial for anyone building distributed systems. This paper bridges the gap between theoretical failure models and real-world production systems.
Abstract
Highly available cloud storage is often implemented with complex, multi-tiered distributed systems built on top of clusters of commodity servers and disk drives. Sophisticated management, load balancing and recovery techniques are needed to achieve high performance and availability amidst an abundance of failure sources that include software, hardware, network connectivity, and power issues. While there is a relative wealth of failure studies of individual components of storage systems, such as disk drives, relatively little has been reported so far on the overall availability behavior of large cloud-based storage services. We characterize the availability properties of cloud storage systems based on an extensive one year study of Google's main storage infrastructure and present statistical models that enable further insight into the impact of multiple design choices, such as data placement and replication strategies. With these models we compare data availability under a variety of system parameters given the real patterns of failures observed in our fleet.
In simpler terms: Cloud storage systems need to stay available even when individual components fail. While we know a lot about how individual parts (like hard drives) fail, we know much less about how entire storage systems behave. This paper analyzes a full year of real data from Google's infrastructure to understand what actually causes data to become unavailable and how different design choices affect availability.
Key Concepts and Terminology
Before diving into the findings, let's understand some important terms:
Storage Cell: A large collection of storage nodes (machines) along with their coordination processes. Think of it as a complete storage system instance. A typical cell contains thousands of machines housed in a single data center or co-located buildings.
Node Unavailability: When a storage machine stops responding to requests. The node stays unavailable until it either comes back online or the system reconstructs its data from other machines that have copies.
Replication vs. Erasure Encoding: Two different strategies for protecting data against failures:
- Replication (R=n): Make n identical copies of your data. With R=3, you can lose 2 copies and still have your data. Simple, but storing 3 copies costs 3x the storage space.
- Reed-Solomon Erasure Encoding (RS(n, m)): A more sophisticated approach that splits data into n data blocks and adds m parity blocks (error correction codes). You can reconstruct the original data from any n blocks out of the total (n+m). For example, RS(6, 3) creates 6 data blocks and 3 parity blocks, meaning you can lose any 3 blocks and still recover your data. This uses only 1.5x storage (vs 3x for triple replication) while providing similar protection.
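To make the trade-off concrete, here is a minimal Python sketch (illustrative parameters, not Google's production configurations) comparing the storage overhead and loss tolerance of the two schemes:

```python
# A rough sketch (not the paper's code) comparing storage overhead and loss
# tolerance of replication versus Reed-Solomon erasure coding.

def replication(copies):
    """R=n: n identical copies; survives the loss of copies-1 chunks."""
    return {"chunks": copies, "overhead": float(copies), "tolerates": copies - 1}

def reed_solomon(data, parity):
    """RS(n, m): n data + m parity chunks; any n of the n+m suffice."""
    return {"chunks": data + parity,
            "overhead": (data + parity) / data,
            "tolerates": parity}

for name, s in [("R=3", replication(3)), ("RS(6,3)", reed_solomon(6, 3))]:
    print(f"{name}: {s['chunks']} chunks, {s['overhead']:.1f}x storage, "
          f"tolerates {s['tolerates']} lost chunks")
# R=3: 3 chunks, 3.0x storage, tolerates 2 lost chunks
# RS(6,3): 9 chunks, 1.5x storage, tolerates 3 lost chunks
```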
Correlated Failures: When multiple machines fail at the same time or in quick succession, often due to a shared cause (like a power outage, network issue, or planned maintenance). These are much more dangerous than random individual failures because they can overwhelm your redundancy protection.
Key Findings
1. Accurate Models Need to Account for Correlated Failures
The most important insight: You cannot treat all failures as independent random events. Real systems experience waves of correlated failures that traditional models don't capture. The study measured component availability across machines, racks, and multi-rack groups in tens of Google storage clusters and found that modeling correlated failures is critical for accurate availability predictions.
2. The Study's Two-Part Approach
Part 1 - Component Failures: Analyzed how individual components fail (machines, racks, groups of racks) and discovered that correlated failures are frequent and significant.
Part 2 - Data Availability: Examined how these component failures affect actual data availability, testing different replication strategies and system parameters using real failure patterns.
The results show that cluster-wide failure events must be considered when choosing replication and recovery strategies. Ignoring correlated failures leads to overconfident availability estimates.
3. Why Nodes Fail
Nodes become unavailable for many reasons:
- Storage node or network switch gets overloaded
- Software crashes or restarts (operating system or storage program)
- Hardware errors in the machine
- Automated repair processes temporarily remove disks or machines
- Planned maintenance takes down whole clusters
Surprising finding: Most unavailability comes from planned reboots (like kernel upgrades), not hardware failures. While disk failures are permanent and node failures are usually temporary, node failures happen so much more frequently that they matter more for overall availability.
4. The 15-Minute Rule
The study focuses on unavailability events lasting 15 minutes or longer. Why?
- Short outages (under 15 minutes) are most common but have minimal impact on overall availability
- Starting recovery immediately after brief failures is inefficient and wastes resources
- Google's GFS (Google File System) waits 15 minutes before starting data recovery, giving nodes time to come back online naturally
5. Key Metrics Used
Average Availability: The fraction of time nodes are available, averaged across all N nodes - equivalently, total uptime divided by total observed (up plus down) time.
Mean Time To Failure (MTTF): The average time a component (or a stripe) runs before failing, computed as uptime divided by the number of failures. Higher is better.
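A minimal sketch of how these two metrics fall out of per-node uptime and failure counts; the numbers are invented for illustration:

```python
# A minimal sketch of the two metrics from per-node totals (hours); the
# numbers below are invented for illustration, not Google's data.

nodes = [
    {"uptime": 8700.0, "downtime": 60.0, "failures": 4},
    {"uptime": 8640.0, "downtime": 120.0, "failures": 7},
    {"uptime": 8755.0, "downtime": 5.0, "failures": 1},
]

# Average availability: total uptime over total observed (up + down) time.
total_up = sum(n["uptime"] for n in nodes)
total_time = sum(n["uptime"] + n["downtime"] for n in nodes)
availability = total_up / total_time

# MTTF: total uptime divided by the number of failures observed.
mttf = total_up / sum(n["failures"] for n in nodes)

print(f"average availability = {availability:.3%}")  # about 99.3%
print(f"MTTF = {mttf:.0f} hours")                    # about 2175 hours
```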
6. Understanding Data Protection
Data stripes: Data is divided into stripes, where each stripe contains a set of fixed-size data and code blocks called chunks. These chunks are spread across different nodes.
- With replication (R=3): Each stripe has 3 identical chunks. You can lose 2 and still have your data.
- With erasure encoding (RS(6,3)): Each stripe has 6 data chunks and 3 parity chunks. You can lose any 3 and still reconstruct everything.
What affects data availability:
- Individual node reliability
- The encoding scheme (replication vs erasure coding)
- How failures are distributed (random vs correlated)
- Where you place chunks (chunk placement strategy)
- How quickly you can recover lost data (recovery time)
Deep Dive: Understanding Failures
Disk Failures: The Obvious Culprit
Disks have received significant research attention because they permanently store data, making disk failure potentially catastrophic. Previous studies found that disks have an Annual Replacement Rate (ARR) of 2-4%, meaning in a data center with 10,000 disks, you'd replace 200-400 drives per year.
However, this paper looks at disk errors from the application's perspective, which includes:
- Physical disk issues (latent sector errors, corrupt sectors)
- Infrastructure problems (firmware bugs, device drivers, controllers, cables)
- Silent corruption (network and memory corruption that goes undetected)
- Software bugs in the storage stack
Node Failures: The Real Problem
The study focuses on three types of node unavailability:
- Node restarts: Software crashes and restarts of the storage program
- Planned machine reboots: Scheduled maintenance like kernel upgrades
- Unplanned machine reboots: System crashes and kernel panics
Key finding: The majority of unavailability comes from planned reboots, not catastrophic failures. While disk failures are permanent and most node failures are temporary, node failures happen so much more frequently that they dominate system availability.
All failure numbers include both software and hardware issues, giving a complete picture of real-world unavailability.
Understanding Correlated Failures
When many nodes fail together, they can overwhelm your redundancy protection. If you have 3 copies of data but all 3 machines fail at once, your data becomes unavailable despite the replication.
Failure Domains
A failure domain is a group of machines that share a common point of failure. Examples:
- Machines in the same rack (sharing a network switch)
- Machines on the same power circuit
- Machines in the same data center building
Physical racks turn out to be a critical failure domain in practice.
Failure Bursts
A failure burst is a sequence of node failures happening close together in time - specifically, a maximal sequence in which each failure occurs within a fixed time window (the paper uses 120 seconds) of the previous one.
Startling statistic: 37% of all failures are part of a burst affecting at least 2 nodes. This means more than one-third of failures don't happen in isolation - they come in waves.
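Here is a minimal sketch of grouping failures into bursts with such a window; the 120-second value follows the paper, everything else is invented for illustration:

```python
# A minimal sketch of grouping node failures into bursts: a failure joins the
# current burst if it starts within `window` seconds of the previous one.
# The 120-second window follows the paper; the timestamps are invented.

def group_into_bursts(failure_times, window=120.0):
    """failure_times: failure start times in seconds, sorted ascending."""
    bursts, current = [], []
    for t in failure_times:
        if current and t - current[-1] > window:
            bursts.append(current)
            current = []
        current.append(t)
    if current:
        bursts.append(current)
    return bursts

# Three failures within two minutes of each other form one burst of size 3;
# the isolated failure an hour later is a burst of size 1.
times = [0.0, 45.0, 150.0, 3600.0]
print([len(b) for b in group_into_bursts(times)])  # [3, 1]
```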
Two Types of Failure Bursts
The data reveals two distinct patterns:
- Sudden catastrophic bursts: Many nodes fail very quickly in succession. Think of a power outage hitting a data center - hundreds of machines go dark simultaneously. These show up as steep spikes in failure graphs.
- Rolling wave bursts: Smaller numbers of nodes fail at a steady, evenly-spaced rate. This happens during planned maintenance activities like rolling reboots or software upgrades, where operators systematically restart machines across a data center.
Rack Correlation
For large bursts (10+ nodes), only 3% have all nodes on different racks. This confirms that failures tend to cluster within failure domains like racks.
Measuring Rack Affinity
The paper introduces a metric called rack affinity to quantify how concentrated a burst is within racks:
- Close to 1: Strongly rack-correlated (failures concentrated in few racks)
- Close to 0.5: Random distribution across racks
- Close to 0: Anti-correlated (spread across many racks - rarely observed)
For example:
- A burst of 5 nodes with 4 on one rack and 1 on another (distribution (1, 4)) gets score 6 (high concentration)
- A burst of 5 nodes spread as (1, 1, 1, 2) across 4 racks gets score 1 (low concentration)
The score counts pairs of affected nodes that share a rack, and the affinity metric normalizes that score against random distributions of the same burst size.
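To make the score concrete, here is a minimal sketch using the per-rack failure counts from the examples above:

```python
# A minimal sketch of the burst score behind the rack-affinity metric: count
# the pairs of affected nodes that share a rack. Inputs are the per-rack
# failure counts used in the examples above.

from math import comb

def burst_score(nodes_per_rack):
    """Sum over racks of C(k, 2) - pairs of failed nodes on the same rack."""
    return sum(comb(k, 2) for k in nodes_per_rack)

print(burst_score([1, 4]))        # 6 - concentrated on one rack
print(burst_score([1, 1, 1, 2]))  # 1 - spread across four racks
```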
Finding: Larger failure bursts generally have higher rack affinity, meaning big failure events tend to be domain-related (affecting specific racks or infrastructure) rather than random.
Strategies for Handling Failures
Now that we understand how failures happen, how do we protect against them? The paper examines two main approaches:
1. Data Replication and Recovery
The basics: Replication and erasure encoding provide resilience to individual node failures. When a node fails, the system can recover data from surviving nodes.
The challenge: Distributed file systems use queues to manage recovery operations. After a failure, the system must:
- Detect the failure
- Queue up recovery tasks
- Read data from surviving nodes
- Write reconstructed data to new locations
This takes time, and if another failure happens during recovery, you're in trouble.
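As a rough illustration (the "most-degraded stripes first" priority and the concurrency cap are assumptions here, not the paper's exact recovery policy), a recovery queue might look like this:

```python
# A rough sketch of a recovery queue. The priority order and concurrency cap
# are illustrative assumptions, not the paper's exact recovery mechanism.

import heapq

class RecoveryQueue:
    def __init__(self, max_concurrent=2):
        self.max_concurrent = max_concurrent
        self._heap = []            # entries: (-missing_chunks, stripe_id)
        self.in_progress = set()

    def enqueue(self, stripe_id, missing_chunks):
        heapq.heappush(self._heap, (-missing_chunks, stripe_id))

    def start_next(self):
        # Start repairs (read survivors, rebuild chunks) up to the cap,
        # taking the stripes missing the most chunks first.
        while self._heap and len(self.in_progress) < self.max_concurrent:
            _, stripe_id = heapq.heappop(self._heap)
            self.in_progress.add(stripe_id)
        return self.in_progress

q = RecoveryQueue()
q.enqueue("stripe-a", missing_chunks=1)
q.enqueue("stripe-b", missing_chunks=3)  # most degraded, repaired first
q.enqueue("stripe-c", missing_chunks=2)
print(q.start_next())  # stripe-b and stripe-c start; stripe-a waits
```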
2. Smart Chunk Placement
The insight: If you know about failure domains (like racks), you can be strategic about where you place data chunks.
Rack-aware placement: Ensure that chunks from the same stripe (the set of chunks needed to reconstruct data) are spread across multiple racks. This way, a failure affecting one rack won't take out all your copies.
The impact is dramatic: For small and medium burst sizes with large encodings, using a rack-aware placement policy increases stripe MTTF by a factor of 3 compared to random placement.
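A minimal sketch of the placement idea, with an invented node/rack layout:

```python
# A minimal sketch of rack-aware placement: interleave candidate nodes rack
# by rack so the first chunks of a stripe all land on different racks. The
# node/rack layout is invented for illustration.

from itertools import zip_longest

def rack_aware_placement(nodes_by_rack, num_chunks):
    """Pick num_chunks nodes, visiting every rack once before reusing any."""
    interleaved = [node
                   for group in zip_longest(*nodes_by_rack.values())
                   for node in group if node is not None]
    if num_chunks > len(interleaved):
        raise ValueError("not enough nodes for the requested stripe width")
    return interleaved[:num_chunks]

racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5", "n6"]}
print(rack_aware_placement(racks, 3))  # ['n1', 'n3', 'n5'] - one per rack
```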
Modeling and Simulation Approaches
The paper uses two complementary methods to understand availability:
Trace-Based Simulation
How it works: Replay actual sequences of node failures observed in production (or generate synthetic sequences based on real patterns) and calculate the impact on stripe availability.
What it tracks: For each node, record three event types:
- Down: Node becomes unavailable
- Up: Node comes back online
- Recovery complete: Data reconstruction finished
The goal: Calculate the expected number of stripes unavailable for at least 15 minutes over time.
This approach is concrete and intuitive - you literally replay what happened and see the effects.
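A minimal sketch of the replay idea for a single stripe, with invented down intervals and stripe layout:

```python
# A minimal sketch of trace replay for one stripe: sweep over a node
# down/up trace and check whether more chunks than the encoding tolerates
# were down together for 15 minutes or more. The trace is invented.

FIFTEEN_MIN = 15 * 60

def stripe_unavailable(down_intervals, stripe_nodes, tolerated):
    """down_intervals: dict node -> list of (start, end) seconds."""
    events = []
    for node in stripe_nodes:
        for start, end in down_intervals.get(node, []):
            events += [(start, 1), (end, -1)]
    down, outage_start = 0, None
    for t, delta in sorted(events):
        down += delta
        if down > tolerated and outage_start is None:
            outage_start = t
        elif down <= tolerated and outage_start is not None:
            if t - outage_start >= FIFTEEN_MIN:
                return True
            outage_start = None
    return False

trace = {"n1": [(0, 4000)], "n2": [(100, 2500)], "n3": [(200, 3000)]}
# An R=3 stripe on n1..n3 tolerates 2 missing chunks, but all three nodes
# were down together from t=200 to t=2500 (over 15 minutes).
print(stripe_unavailable(trace, ["n1", "n2", "n3"], tolerated=2))  # True
```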
Markov Model
Why use it: The Markov model provides more analytical power and flexibility. It allows you to:
- Reason directly about each layer of the storage stack
- Evaluate system parameter trade-offs mathematically
- Handle rare events and extremely low unavailability rates efficiently
- Calculate MTTF as the mean time to reach the "unavailable state"
The assumption: Events occur independently with constant rates over time. This is a simplification (real failures show autocorrelation and long-range dependence), but it provides useful insights.
How it works: The model represents the system as states (like "all chunks available" or "one chunk missing") and transitions between states (like "chunk failure" or "recovery complete"). Standard Markov chain math then computes availability metrics.
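A minimal sketch of such a chain for a single R=3 stripe, with invented failure and recovery rates and the same independence assumption noted above:

```python
# A minimal sketch of a Markov chain for one R=3 stripe, with invented
# constant failure/recovery rates (per hour) and the independence
# assumption noted above. States count available chunks; the stripe is
# unavailable (absorbing state) when all three chunks are gone.

import numpy as np

fail = 1.0 / 1000.0   # per-chunk failure rate
recover = 1.0 / 2.0   # chunk re-replication rate (one repair at a time)

# Generator restricted to the transient states [3, 2, 1 chunks available];
# the missing rate in the last row flows to the absorbing "0 chunks" state.
Q = np.array([
    [-3 * fail,              3 * fail,               0.0],
    [  recover, -(recover + 2 * fail),          2 * fail],
    [      0.0,               recover, -(recover + fail)],
])

# Expected time to reach the unavailable state from each transient state:
# solve (-Q) t = 1; MTTF is the entry for the fully replicated state.
mttf = np.linalg.solve(-Q, np.ones(3))
print(f"stripe MTTF from 3 available chunks: {mttf[0]:.3g} hours")
```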
Multi-Cell (Multi-Data Center) Replication
Both models extend to handle replication across multiple data centers (cells). For example:
- RS(6, 3) × 3: Three data centers, each with RS(6,3) encoding
- Each cell has its own transition matrix
- The combined model uses tensor products of individual cell matrices (see the sketch after this list)
- Includes terms for whole-cell failures and cross-cell recovery
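A minimal sketch of the combination step for two independent cells, using the standard Kronecker-sum construction for independent Markov chains and omitting the paper's whole-cell-failure and cross-cell-recovery terms:

```python
# A minimal sketch of combining two independent cells' generators via the
# standard Kronecker sum; the paper's extra terms for whole-cell failures
# and cross-cell recovery are omitted here.

import numpy as np

def kronecker_sum(q1, q2):
    """Joint generator of two independent continuous-time Markov chains."""
    return (np.kron(q1, np.eye(q2.shape[0]))
            + np.kron(np.eye(q1.shape[0]), q2))

# Toy two-state generator per cell: [cell serving, cell unavailable].
fail, recover = 0.01, 1.0
Q_cell = np.array([[-fail, fail],
                   [recover, -recover]])

Q_joint = kronecker_sum(Q_cell, Q_cell)
print(Q_joint.shape)  # (4, 4): (cell 1 state) x (cell 2 state)
```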
Critical consideration: Recovery bandwidth between data centers. Transferring data between distant cells has significant cost (both money and time). This creates a trade-off between:
- Higher replication within a single cell (cheaper bandwidth, but vulnerable to cell-wide failures)
- Multi-cell replication (protects against entire data center failures, but requires expensive inter-cell bandwidth)
Model Validation
The models were validated against real production data with two key findings:
- Captures failure burst effects: The models accurately represent how failure bursts impact availability - the most important factor for real systems.
- Distinguishes rack-spanning failures: The models can tell the difference between failure bursts that span multiple racks (dangerous) and those confined to single racks (less dangerous with good chunk placement).
Practical Insights and Trade-offs
1. When Recovery Speed Matters
With few correlated failures: Reducing recovery time is very effective. For RS(6,3) encoding, a 10% reduction in recovery time yields a 19% reduction in unavailability.
With many correlated failures: The benefits of faster recovery diminish significantly. When multiple nodes fail together, recovery speed helps less because you're already in a precarious state.
2. Diminishing Returns of Replication
Adding more replicas helps, but with diminishing returns - especially when correlated failures are common. Going from 2 to 3 replicas provides good protection, but going from 3 to 4 provides much less additional benefit in the presence of failure bursts.
3. Where Improvements Don't Help
Surprising finding: Improvements below the node level (like better disks or more reliable hardware) don't significantly improve data availability. Why? Because most unavailability comes from node-level and cluster-level events (software restarts, planned maintenance, network issues), not individual hardware failures.
4. The Power of Multi-Cell Replication
Replicating data across multiple data centers (cells) greatly improves availability because it protects against correlated failures within a single data center. An entire cell can go down due to power, network, or maintenance issues, but if you have copies in other cells, your data stays available.
5. Cost-Effectiveness Analysis
The paper enables reasoning about cost trade-offs:
- Storage cost: More replicas or higher redundancy codes
- Bandwidth cost: Recovery within a cell vs. cross-cell recovery
- Availability goals: Different applications need different availability levels
By modeling these factors, you can choose the most cost-effective scheme for your specific availability requirements.
Related Work and Context
This paper builds on previous research in several ways:
Failure burst literature: Earlier work focused on discovering the statistical relationship between failure event size and probability. This paper goes further by connecting failure patterns to actual data availability.
Reliability models: Previous models fall into two categories:
- Non-Markov models: More limited in scope and harder to extend
- Markov models: More versatile and general, can model both replication and erasure encoding
This paper's contribution is using Markov models with real failure data to make practical availability predictions for production systems.
Conclusion and Impact
This paper provided Google with unprecedented insight into data availability patterns across their massive infrastructure. The analysis went beyond intuition and anecdotal evidence, providing concrete data to guide engineering decisions.
Real-World Impact
The framework from this paper enables Google's teams to:
- Make informed decisions about replication strategies and encoding schemes
- Identify root causes of data unavailability in production systems
- Evaluate trade-offs between different approaches using real failure patterns
- Predict availability for new system designs before deploying them
Key Takeaways for System Designers
If you're building distributed storage systems, remember:
- Correlated failures dominate - Your biggest threat isn't random node failures, it's groups of nodes failing together
- Failure domains matter - Rack-aware placement can triple stripe MTTF
- Node failures > disk failures - Despite all the focus on disk reliability, node-level events cause most unavailability
- Multi-cell replication is powerful - Spreading data across data centers provides the best protection against correlated failures
- Model with real data - Theoretical models that assume independent failures will mislead you
Why This Paper Matters
Before this paper, most availability research focused on component-level failures (especially disks) or used purely theoretical models. This paper:
- Provided real production data from a massive deployment
- Showed that correlated failures are the critical factor
- Demonstrated practical modeling techniques validated against production systems
- Enabled cost-benefit analysis for different redundancy strategies
It bridges the gap between academic research and production reality, making it essential reading for anyone building reliable distributed systems.
Resources
Paper Details
Authors: Daniel Ford, François Labelle, Florentina I. Popovici, Murray Stokely, Van-Anh Truong, and others from Google, Inc.
Published: OSDI (Operating Systems Design and Implementation), 2010
Study Period: One year of production data from tens of Google storage clusters
Over the next few Saturdays, I'll be going through some of the foundational papers in Computer Science, and publishing my notes here. This is #27 in this series.