Introduction
Every production system of meaningful size generates a continuous stream of metrics: request latencies, error rates, queue depths, CPU utilization, garbage collection pauses. At small scale, collecting and querying these signals is straightforward. At the scale of thousands of microservices emitting millions of time series, the problem becomes a distributed systems challenge in its own right. Two systems that have shaped modern approaches to this problem are Prometheus and M3, each embodying different design tradeoffs for metrics ingestion, storage, and query.
Prometheus: Pull-Based Collection and Local Storage
Prometheus, originally built at SoundCloud and now a CNCF graduated project, introduced a pull-based collection model that diverged from earlier push-based systems like Graphite (and aggregation tools like StatsD). In the pull model, the Prometheus server periodically scrapes HTTP endpoints exposed by instrumented applications. This inversion of control has several practical consequences.
First, the metrics server determines the scrape interval and target set, making it the single source of truth for what is being monitored. Second, if a target is down, the absence of data is itself a signal, because a failed scrape is distinguishable from a target that simply stopped pushing. Third, pull-based collection simplifies firewall and network configuration in many deployment topologies, since the monitoring server initiates all connections.
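The payload a Prometheus server scrapes is plain text in the Prometheus exposition format. A minimal sketch of parsing such a payload (not the official client library, and deliberately naive — it does not handle commas or escapes inside quoted label values) illustrates the shape of the data:

```python
# Naive sketch: parse a Prometheus text-format exposition payload into
# (name, labels, value) samples. Real scrapers use a full parser that
# handles escaping, histograms, and metadata.
def parse_exposition(text):
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        metric, value = line.rsplit(" ", 1)
        if "{" in metric:
            name, rest = metric.split("{", 1)
            labels = {}
            for pair in rest.rstrip("}").split(","):
                k, v = pair.split("=", 1)
                labels[k] = v.strip('"')
        else:
            name, labels = metric, {}
        samples.append((name, labels, float(value)))
    return samples

payload = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="get",code="200"} 1027
http_requests_total{method="post",code="200"} 3
"""
print(parse_exposition(payload))
```

Each sample is a metric name, a set of label pairs, and a float value; the scraper attaches the scrape timestamp on ingestion.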
Prometheus stores time series data locally on disk using a custom storage engine. The engine organizes data into blocks of compressed samples, with a write-ahead log (WAL) for durability of recent, not-yet-compacted data. Each block covers a fixed time range and is immutable once written. Compaction merges smaller blocks into larger ones, reducing the number of files and improving query performance over longer time ranges. Prometheus's chunk encoding applies double-delta compression to timestamps and XOR compression to floating-point values; the implementation is Prometheus's own, but the scheme follows the approach popularized by the Facebook Gorilla paper.
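The timestamp side of this encoding rests on a simple observation: regularly scraped series have nearly constant intervals, so the delta of consecutive deltas is almost always zero. A sketch of the delta-of-delta computation (real encoders then pack these values into variable-width bit fields, which this sketch omits):

```python
# Sketch of double-delta (delta-of-delta) timestamp encoding. Only the
# arithmetic is shown; a real chunk encoder bit-packs the results.
def delta_of_delta(timestamps):
    out = [timestamps[0]]                   # first timestamp stored in full
    prev_delta = None
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        if prev_delta is None:
            out.append(delta)               # second entry: plain delta
        else:
            out.append(delta - prev_delta)  # thereafter: delta-of-delta
        prev_delta = delta
    return out

# A series scraped every 15 seconds (one scrape arrives 1s late):
print(delta_of_delta([1000, 1015, 1030, 1045, 1061]))
# [1000, 15, 0, 0, 1] -- mostly zeros, which pack into very few bits
```

The runs of zeros and near-zeros are what make the encoding effective on scrape-driven workloads.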
The query language, PromQL, supports dimensional data selection, aggregation, and rate computation. Labels attached to each time series enable flexible grouping and filtering without requiring a predefined hierarchy.
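The selection and aggregation semantics can be illustrated with a hypothetical in-memory model of labeled series (this is an analogy in plain Python, not PromQL itself; the data and helper names are invented):

```python
# Toy dimensional model: each series is (labels, value).
series = [
    ({"__name__": "http_requests_total", "job": "api", "code": "200"}, 1027.0),
    ({"__name__": "http_requests_total", "job": "api", "code": "500"}, 3.0),
    ({"__name__": "http_requests_total", "job": "web", "code": "200"}, 512.0),
]

def select(series, **matchers):
    """Label matching, like the selector http_requests_total{job="api"}."""
    return [(lbls, v) for lbls, v in series
            if all(lbls.get(k) == val for k, val in matchers.items())]

def sum_by(series, key):
    """Aggregation, like: sum by (code) (http_requests_total)."""
    totals = {}
    for lbls, v in series:
        totals[lbls[key]] = totals.get(lbls[key], 0.0) + v
    return totals

print(sum_by(select(series, __name__="http_requests_total"), "code"))
# {'200': 1539.0, '500': 3.0}
```

Because grouping happens over arbitrary label keys at query time, no hierarchy has to be decided when the metric is instrumented.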
Limitations at Scale
The single-server architecture of Prometheus creates hard ceilings. Storage is bounded by local disk. Query performance degrades as cardinality (the number of distinct label combinations) grows. High availability requires running multiple independent Prometheus servers scraping the same targets, with no built-in mechanism for deduplicating data or merging query results across replicas.
Projects like Thanos and Cortex address these limitations by adding a global query layer and remote storage backends on top of Prometheus. Thanos, for example, uploads compacted blocks to object storage and provides a query component that merges results from multiple Prometheus instances and the object store, enabling long-term retention and a global view.
M3: Distributed Metrics at Uber Scale
M3, developed at Uber, was designed from the outset for horizontally scalable metrics collection and storage. Where Prometheus favors simplicity and a single-node model, M3 embraces a distributed architecture with several cooperating components.
M3DB is the storage layer, a distributed time series database that partitions data across a cluster using consistent hashing. Each time series is assigned to a set of replica nodes based on its series ID. M3DB uses a custom compressed columnar format optimized for time series data, achieving high compression ratios (typically around 1.5 bytes per sample or less under production workloads) through techniques like delta-of-delta encoding for timestamps and XOR-based compression for floating-point values, drawing directly from the approach described in the Facebook Gorilla paper.
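The XOR step at the heart of Gorilla-style float compression exploits the fact that consecutive samples of a metric are usually identical or very close, so XORing their bit patterns yields zero or a value with long runs of zero bits. A sketch of just that step (real encoders then emit variable-width fields for the leading/trailing zero counts, which this sketch omits):

```python
import struct

def float_bits(x):
    """Reinterpret a float64 as its 64-bit integer pattern."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def xor_stream(values):
    prev = float_bits(values[0])
    out = [prev]                  # first value stored in full
    for v in values[1:]:
        bits = float_bits(v)
        out.append(bits ^ prev)   # identical values XOR to exactly 0
        prev = bits
    return out

stream = xor_stream([42.0, 42.0, 42.0, 42.5])
print(stream[1], stream[2])  # 0 0 -- repeated values cost almost nothing
```

Slowly changing gauges and monotonic counters both produce XOR results with few significant bits, which is where the sub-2-byte-per-sample figures come from.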
M3 Coordinator provides a query and write interface compatible with Prometheus remote read/write protocols. This compatibility is significant because it allows M3 to serve as a drop-in long-term storage backend for existing Prometheus deployments.
M3 Aggregator handles streaming downsampling and rollup aggregation. Rather than storing all raw samples indefinitely, M3 can aggregate metrics at configurable resolutions (for example, keeping 10-second resolution for 48 hours, then 1-minute resolution for 30 days, then 10-minute resolution for a year). This tiered retention is essential for managing storage costs at scale while preserving the ability to query historical data.
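The rollup step itself is conceptually simple: bucket raw samples into fixed windows and apply an aggregation per bucket. A sketch, with the 60-second window and mean aggregation as illustrative choices:

```python
# Sketch of rollup downsampling: raw (timestamp, value) samples are
# bucketed into fixed windows and reduced with an aggregation function.
def downsample(samples, window_secs, agg=lambda vs: sum(vs) / len(vs)):
    buckets = {}
    for ts, value in samples:
        bucket_start = ts - (ts % window_secs)  # align to window boundary
        buckets.setdefault(bucket_start, []).append(value)
    return sorted((ts, agg(vs)) for ts, vs in buckets.items())

# 10-second raw samples rolled up to 1-minute averages:
raw = [(0, 1.0), (10, 2.0), (20, 3.0), (60, 10.0), (70, 20.0)]
print(downsample(raw, 60))
# [(0, 2.0), (60, 15.0)]
```

In a streaming aggregator the buckets are flushed as their windows close rather than held in memory, but the bucketing arithmetic is the same.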
Consistency and Replication
M3DB uses quorum-based replication. Writes succeed when a configurable number of replicas acknowledge the write, and reads merge results from multiple replicas. This design tolerates node failures without data loss, at the cost of increased operational complexity compared to Prometheus's single-server model.
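The guarantee behind quorum replication is arithmetic: with replication factor N, a write quorum W and a read quorum R are forced to overlap whenever W + R > N, so every read contacts at least one replica that acknowledged the latest successful write. A sketch of that check (the function names are illustrative):

```python
# Quorum arithmetic behind majority-quorum replication.
def majority(n):
    """Smallest integer strictly greater than n/2."""
    return n // 2 + 1

def quorums_overlap(n, w, r):
    """True when any write quorum and read quorum must share a replica."""
    return w + r > n

N = 3
W = R = majority(N)                  # the common majority-quorum setting
print(W, quorums_overlap(N, W, R))   # 2 True
# Writes stay available through N - W = 1 node failure.
```

The operational complexity the text mentions comes from everything around this invariant: bootstrapping replacement nodes, repairing divergent replicas, and tuning consistency levels per workload.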
The placement and topology of M3DB nodes are managed through an etcd-backed placement service, which handles shard assignment and rebalancing when nodes are added or removed.
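A toy version of shard-based placement shows the division of labor: series IDs hash to a fixed number of virtual shards, and a placement maps each shard to a replica set of nodes. The shard count, node names, and round-robin placement below are all illustrative; a real placement service balances shard ownership and migrates shards on topology changes.

```python
import hashlib

NUM_SHARDS = 64
NODES = ["node-a", "node-b", "node-c", "node-d"]
RF = 2  # replication factor

def shard_for(series_id):
    """Hash a series ID to one of NUM_SHARDS virtual shards."""
    digest = hashlib.md5(series_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

def replicas_for(shard):
    """Toy placement: consecutive nodes own each shard."""
    start = shard % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(RF)]

shard = shard_for('http_requests_total{job="api"}')
print(shard, replicas_for(shard))
```

Because writes and reads route by shard rather than by node, adding a node only requires reassigning some shards, not rehashing every series.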
Choosing Between Approaches
The choice between Prometheus (with or without a long-term storage extension) and M3 depends on the operational context. For organizations running tens to low hundreds of services, Prometheus with Thanos provides a well-understood, widely adopted solution. For organizations operating at the scale of hundreds of thousands of containers producing billions of data points per day, a purpose-built distributed system like M3 reduces the operational overhead of managing many independent Prometheus instances.
Both systems share core ideas: dimensional time series models, label-based indexing, and efficient compressed storage. The differences lie in how they distribute work across machines and how they handle the tension between operational simplicity and horizontal scalability.
Key Points
- Prometheus uses a pull-based scrape model where the server initiates collection, simplifying network configuration and making target failures explicitly observable.
- Prometheus stores data locally with a block-based storage engine, creating inherent scalability limits around disk capacity and query cardinality.
- Thanos and Cortex extend Prometheus with global querying, deduplication, and object-storage-backed long-term retention.
- M3DB partitions time series across a cluster using consistent hashing and quorum-based replication, supporting horizontal scaling from the ground up.
- M3DB implements Gorilla-style compression (delta-of-delta timestamps, XOR-encoded floating-point values) to achieve sub-2-byte-per-sample storage efficiency under typical production workloads. Prometheus's chunk encoding follows the same Gorilla-style scheme in its own independent implementation.
- M3's tiered aggregation and rollup system manages storage costs by downsampling older data to coarser resolutions while retaining high-resolution data for recent time ranges.
- The fundamental architectural tradeoff is single-node operational simplicity and ecosystem maturity (Prometheus) versus native distributed scalability and operational complexity (M3).
References
Pelkonen, T., Franklin, S., Teller, J., Cavallaro, P., Huang, Q., Meza, J., and Veeraraghavan, K. "Gorilla: A Fast, Scalable, In-Memory Time Series Database." Proceedings of the VLDB Endowment, Vol. 8, No. 12, 2015.
Prometheus Authors. "Prometheus: Monitoring System and Time Series Database." CNCF Documentation, 2023.
Thanos Authors. "Thanos: Highly Available Prometheus Setup with Long Term Storage Capabilities." GitHub Repository and Design Documents, 2023.