The observability platform decision record: why the telemetry pipeline you chose determines your cardinality ceiling and your incident investigation latency
Observability platform selection looks like a tooling detail until a $47k Datadog invoice arrives in month 38 and the team discovers that APM was enabled on every service, custom metrics grew to 11,400 unique time series past the 10k default limit, and the per-host charges had been quietly scaling with every new microservice extraction over three years. The platform you chose in month two determines your cardinality ceiling, your trace sampling model, your cost scaling formula, and whether the engineer who gets paged at 2am can find the root cause before the SLA expires.
An 18-person Series A startup adopted Datadog in month two. The CTO had used it at his previous company and the onboarding was fast — install the agent, wire the API key, and within an hour the infrastructure dashboard was green. Eight hosts, three engineers with Datadog access. The per-host cost was unremarkable. Nobody asked about the pricing model past the first invoice. The ChatGPT session that produced the adoption decision — "what observability tool should we use for a Node.js backend on AWS?" — ended with a working install command. Datadog was the first result. It was easy. The session closed.
Thirty-eight months later, the company had 140 hosts across twelve microservices, two Lambda-based async processors, and a staging environment that mirrored production. The monthly Datadog invoice was $47,200. The finance team flagged it. The CTO, now managing an engineering team of 31, asked the infrastructure lead to explain the bill. The infrastructure lead spent four hours pulling apart the Datadog account configuration to reconstruct how the costs had accumulated. What she found was three compounding decisions, none of which had ever been documented anywhere:
First: APM (Application Performance Monitoring — distributed tracing) had been enabled on every service as each microservice was extracted from the monolith. The person who extracted each service had enabled APM because APM was enabled on the monolith and it seemed like the right thing to do. At 140 APM hosts, the APM charge alone was $16,800 per month. Three of the twelve microservices had no consumers of their APM traces — the dashboards had never been opened. APM was running, generating traces, ingesting data, and billing at full rate on services that the team had never actually used the tracing for.
Second: the two Lambda-based async processors had been instrumented with the Datadog Lambda Extension in month 24, because a "how do I observe my Lambda functions?" ChatGPT session had recommended it. Lambda billing in Datadog is per-invocation, not per-host. The two processors together handled 4.2 million invocations per month. The Lambda Extension was collecting a custom metric on every invocation — a processing latency histogram — which generated 4.2 million custom metric data points per month, contributing to 11,400 unique custom metrics across the account. The 10,000 custom metric default limit had been crossed in month 31. The overage charge was $70 per month per thousand over the limit — an invisible line item that had been on the invoice for seven months without triggering any alert.
Third: the staging environment had full observability infrastructure parity with production. Every host in staging ran the full Datadog agent stack — infrastructure, APM, logs. The staging environment had expanded from 4 hosts to 22 hosts over three years as new services were added. Nobody had made a deliberate decision to run full observability in staging; it was the default behavior of the Terraform module that provisioned new services, which included the Datadog agent installation. Twenty-two staging hosts at the full per-host rate contributed $4,400 per month.
The infrastructure lead's investigation revealed that no single decision had been wrong. Each individual choice — enable APM on this service, instrument this Lambda, run the agent on staging — was locally reasonable. The problem was that the cumulative cost scaling of those individually reasonable decisions had never been modeled against the pricing formula. The pricing formula had never been documented. The decisions had been made across seventeen separate sessions over thirty-eight months, each session solving an immediate instrumentation problem, none of them considering the bill that would arrive when all seventeen decisions were compounded.
The second incident happened independently, at 2:14am, six months before the invoice audit. A P0 alert fired: the checkout service latency had spiked to 14 seconds on average, with a 5% error rate. The on-call engineer pulled up the Datadog APM service map and found the checkout service highlighted in red. She clicked through to the trace list and found a small number of slow traces. She needed to understand which specific operation within the checkout flow was slow for the requests that were timing out. She found four traces over 10 seconds in the last 30 minutes. All four showed the same structure: a slow database query in the payment validation step. She thought she had the root cause.
She escalated the investigation to the database team. Twenty minutes of analysis showed no anomaly in the PostgreSQL query execution times — the slow queries were not visible in the database metrics. The on-call engineer went back to Datadog and looked at the trace sampling rate. It was 1%. The service was processing 4,000 requests per minute at peak load. At 1% sampling, Datadog was storing 40 traces per minute. In the 30 minutes of the incident, 120,000 requests had passed through the checkout service. 1,200 of them were in the sampled traces. The 118,800 unsampled requests might have shown a completely different slow operation — one that the database query pattern did not explain.
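The sampling arithmetic in this incident can be reproduced in a few lines. This is a minimal sketch: the function name `sampling_coverage` and its parameters are illustrative and not part of any Datadog API, and the inputs are the figures from the narrative.

```python
def sampling_coverage(requests_per_minute: int, sample_percent: int, window_minutes: int) -> dict:
    """Return how many requests a percentage-sampled tracer stores and drops in a window."""
    total = requests_per_minute * window_minutes
    stored = total * sample_percent // 100  # integer math avoids float rounding
    return {
        "total_requests": total,
        "stored_traces": stored,
        "unsampled_requests": total - stored,
    }


# Figures from the incident: 4,000 requests/minute, 1% sampling, 30 minutes.
incident = sampling_coverage(requests_per_minute=4_000, sample_percent=1, window_minutes=30)
print(incident)
# {'total_requests': 120000, 'stored_traces': 1200, 'unsampled_requests': 118800}
```

Running the same function with the sampling rate an ADR records, against current peak traffic, is one way to keep the size of the blind spot visible before an incident rather than during one.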
The 1% sampling rate had been set during the initial APM configuration in month two, in the same session that enabled APM: "how do I set up distributed tracing with Datadog in Node.js?" The default recommended in the tutorial was 1%. At eight requests per minute, the 1% rate stored roughly one trace every twelve minutes, which seemed inconsequential with so little traffic. At 4,000 requests per minute in production during the incident, 1% sampling was hiding 99% of the causal evidence. The engineer had not known the sampling rate was 1%. It was not documented anywhere except in the Datadog agent configuration file, which she had to find and grep for the setting at 2:48am.
The actual root cause was a third-party payment gateway that had begun returning 502 errors for requests from a specific AWS availability zone after a BGP routing change. The payment validation step was retrying those 502 responses three times before failing, which is why the slow requests all looked structurally similar to a slow database query — the latency was coming from retry backoff, not from the database. The connection was only visible in 16 traces out of the 1,200 sampled, and she had happened to look at the 4 slowest ones, all of which showed a different pattern. The root cause took 3.5 hours to find. With full trace storage, a single query — GROUP BY destination_host WHERE status_code = 502 OVER last 30 minutes — would have surfaced the payment gateway as the anomaly in under two minutes.
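To make the shape of that query concrete, here is a rough Python equivalent that runs over an in-memory list of spans. It is a sketch under stated assumptions: the `Span` type, its field names, and the host names are hypothetical, and a real platform would run this against stored trace data rather than a list.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    destination_host: str  # hypothetical field names, for illustration only
    status_code: int
    minutes_ago: float


def count_by_destination(spans, status_code=502, window_minutes=30):
    """Rough equivalent of: GROUP BY destination_host WHERE status_code = 502 OVER last 30 minutes."""
    counts = Counter(
        s.destination_host
        for s in spans
        if s.status_code == status_code and s.minutes_ago <= window_minutes
    )
    return counts.most_common()  # highest count first


# Invented sample data; none of these hosts come from the incident itself.
spans = [
    Span("payments.gateway.example", 502, 3.0),
    Span("payments.gateway.example", 502, 12.5),
    Span("payments.gateway.example", 200, 1.0),
    Span("db.internal.example", 200, 2.0),
    Span("payments.gateway.example", 502, 45.0),  # outside the 30-minute window
    Span("inventory.internal.example", 502, 8.0),
]

print(count_by_destination(spans))
# [('payments.gateway.example', 2), ('inventory.internal.example', 1)]
```

The query only answers the question if the failing requests are in the dataset, which is why the result depends on storage completeness rather than on query syntax.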
Both incidents were downstream consequences of the same upstream omission: the observability platform was selected and configured across a series of informal sessions, and the structural decisions made during those sessions — the pricing model, the cardinality ceiling, the sampling rate — were never recorded. An observability platform decision record that documented the pricing formula, the cardinality model, and the trace sampling rate would not have prevented Datadog from being chosen. It would have prevented the compounding cost decisions and the 3.5-hour incident investigation.
The three structural properties that platform selection determines
When teams evaluate observability platforms, the discussion centers on feature completeness, existing team familiarity, integration quality with the cloud provider, and time-to-first-dashboard. These are real factors. The structural properties that determine whether the platform choice ages well are different — and they're set at selection time, when the team has no production traffic, no sense of their actual metric cardinality, and no experience of what an incident investigation actually requires.
Pricing model and cost scaling formula
Every observability platform pricing model encodes a specific assumption about what should drive costs: infrastructure size (per-host), team size (per-seat), data volume (per-GB ingested or per-event stored), or a combination. The model is not just a billing detail — it is a structural constraint on which growth vectors are expensive and which are not.
Per-host pricing — Datadog's primary model — charges for each host running the observability agent. This model scales linearly with infrastructure expansion. A team that extracts a monolith into twelve microservices may triple or quadruple their host count if each service runs on dedicated instances. If APM is also enabled, the APM per-host charge compounds the infrastructure per-host charge. The compounding is not visible in a single monthly invoice comparison; it reveals itself only when the infrastructure-to-cost ratio is modeled forward three years, which is exactly what no team does in an early-stage ChatGPT-assisted tooling selection session.
Per-seat pricing, New Relic's current model, largely decouples infrastructure growth from cost growth. A team of 8 engineers pays the same per-seat charge whether they operate 8 hosts or 80 hosts, as long as the number of Full Platform Users doesn't change. (New Relic also bills for data ingested beyond a free monthly allowance, so the total is not completely independent of infrastructure size.) This is structurally favorable for infrastructure-heavy workloads with small engineering teams: a solo founder running a batch processing pipeline on 40 compute nodes pays for one seat, not forty nodes. The risk profile reverses: headcount growth drives costs, infrastructure growth does not, and the per-seat rate for Full Platform Users with access to all data is high enough that a 20-engineer team faces meaningful per-seat charges.
Per-GB ingested pricing, used by Grafana Cloud for logs and traces, scales with data volume, which is driven by application verbosity, log level configuration, and trace retention decisions. This model gives teams direct control over cost by controlling what they ship to the platform. The risk is that verbosity decisions (logging at DEBUG level, tracing every request) are made at the application level by individual engineers without a view of the per-GB cost implications, so the cost stays invisible until a verbose service reaches production and the ingestion bill jumps.
Per-event pricing — Honeycomb's model — charges per event stored, where an event is a structured row of data (a trace span, a log line, a custom event). This model is predictable in that cost scales with request volume and instrumentation depth. It is the most direct correlation between product usage (requests served) and observability cost, which makes it the easiest to model forward. The caveat is that high-traffic services with deep instrumentation (many spans per request, many attributes per span) generate high event volume, and per-event pricing at high volume is not always cheaper than per-host pricing at high host count.
The practical consequence of not documenting the pricing model in the ADR is that each individual instrumentation decision — "should I enable APM on this service?" "should I add a per-customer tag to this metric?" "should I run full observability in staging?" — is made without the cost formula that would let the engineer evaluate whether the observability value justifies the cost. Those decisions compound silently over three years into the $47k invoice.
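A cost formula written into the ADR can be as small as the sketch below. The function names are illustrative, and the rates are assumptions back-calculated from the totals in the opening narrative ($16,800 across 140 APM hosts, $4,400 across 22 staging hosts, $70 per thousand custom metrics over the limit); they are not vendor list prices.

```python
def per_host_monthly(hosts: int, rate_per_host: float) -> float:
    """Per-host charge: grows linearly with host count."""
    return hosts * rate_per_host


def overage_monthly(custom_metrics: int, included: int, rate_per_thousand: float) -> float:
    """Charge for custom metrics above the included allotment."""
    over = max(0, custom_metrics - included)
    return over * rate_per_thousand / 1000


# Assumed rates, derived from the narrative's totals rather than a price sheet:
# $16,800 / 140 APM hosts = $120, and $4,400 / 22 staging hosts = $200.
apm = per_host_monthly(hosts=140, rate_per_host=120)
staging = per_host_monthly(hosts=22, rate_per_host=200)
overage = overage_monthly(custom_metrics=11_400, included=10_000, rate_per_thousand=70)

print(f"APM: ${apm:,.0f}  staging: ${staging:,.0f}  overage: ${overage:,.0f}")
# APM: $16,800  staging: $4,400  overage: $98

# The same assumed APM rate at the month-two host count of 8.
print(f"APM at 8 hosts: ${per_host_monthly(8, 120):,.0f}")
# APM at 8 hosts: $960
```

With a formula like this next to the decision, "should I enable APM on this service?" becomes a question with a number attached instead of a default.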
Cardinality model and the custom metrics ceiling
In metrics observability, cardinality refers to the number of unique time series a platform must track. Each unique combination of metric name and tag values is a separate time series. A metric api.request.duration with a customer_id tag on a platform serving 50,000 customers creates 50,000 separate time series for that single metric — one per customer. High cardinality is the technical description of a situation where the number of unique label combinations across all metrics is large.
Cardinality matters for platform selection because different platforms handle high-cardinality metrics in fundamentally different ways, and the handling determines what instrumentation patterns are affordable. Prometheus keeps every active time series in memory in its TSDB. A high-cardinality metric (per-user, per-request-ID, per-session) can exhaust available memory and crash Prometheus or degrade query performance. The standard guidance for Prometheus-based stacks is to avoid high-cardinality labels; use aggregate metrics (count, histogram) and push per-user analytics to a separate data store. This is a constraint on instrumentation design, not a cost constraint.
Datadog's model imposes a 10,000 custom metric ceiling on the default plan, where each unique time series counts as one custom metric. Crossing the ceiling generates per-metric overage charges. The ceiling applies to custom metrics specifically — metrics sent via the Datadog API or agent from application code — not to infrastructure metrics collected by the agent from the host OS or cloud provider. A team that adds a tenant_id tag to a billing-related metric on a 500-tenant platform creates 500 custom metrics from that one tag. On a 15,000-tenant platform, the same decision creates 15,000 custom metrics — already well past the 10k ceiling. The team that encounters this ceiling typically discovers it in the invoice, not in a cardinality warning.
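The way tags multiply into time series can be sketched as a product over the number of distinct values each tag takes. This is an upper bound, since only combinations that actually occur are counted as series; the function name `max_series` and the `endpoint` tag are hypothetical.

```python
from math import prod


def max_series(tag_cardinalities: dict) -> int:
    """Upper bound on time series for one metric: product of distinct values per tag."""
    return prod(tag_cardinalities.values())


print(max_series({"tenant_id": 500}))     # 500
print(max_series({"tenant_id": 15_000}))  # 15000

# A hypothetical second tag with 4 distinct values multiplies the bound.
print(max_series({"tenant_id": 15_000, "endpoint": 4}))  # 60000
```

A metric with no tags yields 1, which matches the definition: one metric name with no tag values is a single time series.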
Honeycomb's column-store architecture is designed for high cardinality. Events (traces, log lines, custom events) are stored as structured rows with arbitrary columns. Queries aggregate, filter, and group over the raw rows at query time. There are no pre-aggregated time series, no cardinality ceilings, and no per-metric charges. The tradeoff is that metrics in Honeycomb are derived from event data at query time — Honeycomb does not have a first-class infrastructure metrics product, and teams that need traditional infrastructure metrics (CPU, memory, disk) alongside Honeycomb traces typically run a separate metrics product for the infrastructure layer. This architectural constraint is part of the ADR, not an implementation detail.
Query model and incident investigation latency
The query model determines how quickly an engineer can answer the question "what is causing this incident?" during a P0. The model has three components: trace storage completeness (what fraction of requests are traced at what resolution), query expressiveness (what filters and aggregations can be applied to the stored data), and query latency (how long the query takes to return results).
Most observability platforms default to sampled trace storage. Sampling means that only a fraction of requests — 1%, 10%, or a configurable percentage — are fully traced and stored. The remaining requests are not stored. This is a cost tradeoff: storing traces for 100% of requests at high traffic volumes is expensive, and most requests follow the happy path and add no investigative value. The consequence during an incident is that the specific failing requests may be in the unsampled fraction. The on-call engineer sees traces for 1% of requests, all of which may happen to not exhibit the failure pattern — either because the failure is rare (a specific payment gateway at a specific availability zone) or because the failure is invisible in aggregate traces but visible in the raw request data.
Head-based sampling — where the sampling decision is made at the beginning of a trace — is the default for most platforms. It is operationally simple but strategically blind: the sampling decision is made before the outcome is known. Tail-based sampling — where the full trace is buffered and the sampling decision is made after the trace is complete, using the trace outcome (error, slow response, unusual pattern) to bias toward storing interesting traces — provides better incident coverage but requires a trace buffer that adds latency and complexity. Datadog's APM supports tail-based sampling via its intelligent sampling feature, but the configuration is not obvious and is rarely set up during initial APM onboarding.
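The difference between the two strategies can be shown with a small simulation. This is a simplified sketch, not how any particular vendor implements sampling: the `Trace` type, the thresholds, and the synthetic workload are all assumptions, and a real tail sampler also has to buffer spans until the trace completes.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    trace_id: int
    duration_ms: float
    is_error: bool


def head_sample(traces, rate, rng):
    """Keep or drop each trace by coin flip at its start; the outcome is never consulted."""
    return [t for t in traces if rng.random() < rate]


def tail_sample(traces, baseline_rate, slow_ms, rng):
    """Decide after completion: always keep errors and slow traces, sample the rest."""
    return [
        t for t in traces
        if t.is_error or t.duration_ms >= slow_ms or rng.random() < baseline_rate
    ]


rng = random.Random(7)
# Hypothetical workload: 1,000 traces, every 100th is a slow error.
traces = [
    Trace(i, duration_ms=12_000.0 if i % 100 == 0 else 80.0, is_error=(i % 100 == 0))
    for i in range(1_000)
]
head_kept = head_sample(traces, rate=0.01, rng=rng)
tail_kept = tail_sample(traces, baseline_rate=0.01, slow_ms=10_000.0, rng=rng)

print("errors in workload:", sum(t.is_error for t in traces))
print("errors kept by head sampling:", sum(t.is_error for t in head_kept))
print("errors kept by tail sampling:", sum(t.is_error for t in tail_kept))
```

Tail sampling keeps every error trace by construction, while the number kept by head sampling depends on the random draw and is usually close to the sampling rate times the error count.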
Honeycomb's full-resolution storage eliminates the sampling tradeoff by storing every event. The query model operates on the full dataset: an engineer can run GROUP BY destination_host WHERE http.status_code = 502 OVER last 30 minutes and receive an answer covering every request in that window, not a 1% sample of requests. This is the capability that reduces the 3.5-hour incident investigation from the opening narrative to a two-minute query. The per-event cost model is the mechanism that makes full-resolution storage economically viable — events are charged individually, and the team pays for what they store, not for infrastructure that might generate events.
The options and their structural tradeoffs
Datadog
Datadog is the dominant commercial observability platform for infrastructure-heavy teams. Its breadth — infrastructure metrics, APM, logs, synthetics, RUM, security, database monitoring, network performance — means a team can consolidate observability into a single product rather than integrating multiple point solutions. The integration catalog covers virtually every cloud service, database, and framework with pre-built dashboards and alerting rules. For a team that wants to move fast and have working observability out of the box without spending engineering time on integration work, Datadog's time-to-first-useful-dashboard is the shortest of any platform in this comparison.
The pricing model's cost scaling is the primary structural risk for growing teams. Per-host infrastructure monitoring, per-host APM, per-GB log ingestion, and custom metric overage charges combine in ways that are non-obvious at selection time. The Lambda and container billing models add further complexity — per-invocation charges for Lambda, and container host normalization (containers sharing a host may be counted as fractional hosts or as a separate host depending on the container runtime and Datadog agent version). Teams that do not model the three-year cost trajectory at selection time frequently encounter invoices that reflect a formula they did not understand they had agreed to.
Datadog's OpenTelemetry support has improved significantly since 2022. The Datadog Agent can ingest OTLP (OpenTelemetry Protocol) traces, metrics, and logs directly, which means teams can instrument application code using the OpenTelemetry SDKs and vendor-neutral exporters and route to Datadog without Datadog-specific instrumentation libraries. This matters for the exit ramp assessment: teams that instrument with OTel can migrate to a different backend by reconfiguring the exporter rather than rewriting instrumentation in every service. Teams that instrument with Datadog's proprietary tracing libraries (dd-trace) have migration cost proportional to the number of services instrumented. The ADR should document which instrumentation path is used and what the migration cost implication is. See the broader context of build-vs-buy decisions — vendor lock-in is the canonical consequence of choosing proprietary over open-standard instrumentation.
The 15-day default retention for APM traces and logs is a structural investigation constraint. An incident discovered in its aftermath — a billing anomaly from six weeks ago, a slow-building data corruption from three weeks ago — cannot be investigated in Datadog with default retention. Extended retention is available and priced per GB-day. Teams with compliance requirements (SOC 2, HIPAA) that require long-term audit trail retention need to plan retention costs as part of the ADR, not as a post-compliance-audit discovery.
Grafana OSS stack (Prometheus + Grafana + Loki + Tempo)
The Grafana OSS stack — Prometheus for metrics, Loki for logs, Tempo for traces, and Grafana as the unified query and visualization frontend — is the primary open-source alternative to commercial platforms. The stack is free to run self-hosted, which removes the per-host or per-seat licensing cost. The operational cost of running the stack on your own infrastructure (compute, storage, maintenance engineering time) is the substitute cost. For teams with existing Kubernetes expertise, the Prometheus Operator and the kube-prometheus-stack Helm chart automate the deployment and configuration of the metrics layer. For teams without Kubernetes experience, self-hosting the full stack is a meaningful operational investment.
Prometheus's cardinality model — in-memory TSDB with a hard ceiling at available memory — requires active management for high-cardinality metrics. The standard guidance is to use recording rules to pre-aggregate high-cardinality metrics into lower-cardinality summaries, and to use Thanos or Cortex for horizontal scaling and long-term storage. Both are additional operational dependencies. Teams that want the simplicity of Prometheus for standard infrastructure and application metrics, without the engineering investment of building a horizontally-scaled long-term storage layer, typically run Prometheus with a 15-day local retention and use Grafana Cloud's remote_write endpoint to push metrics to managed long-term storage — combining self-hosted collection with managed retention. This hybrid model bridges the operational gap without fully committing to either self-hosted or fully managed.
Loki's log architecture differs from Elasticsearch and Datadog's full-text log indexing in a fundamental way. Loki indexes only log stream labels (key-value pairs like app=checkout-service and environment=production) and stores log content as compressed chunks. Full-text search over log content requires scanning the compressed chunks rather than querying an index. This means Loki query performance degrades for queries that require scanning large log volumes without a label filter to narrow the chunk set. For queries that are always scoped by service and time range — the common case for application debugging — Loki's performance is acceptable. For security investigations that require arbitrary full-text search across all services — "show me every log line containing this specific user ID across all services in the last 30 days" — Loki's scan-based model is slower than an indexed model. This performance characteristic belongs in the ADR as a constraint on log query patterns.
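The label-index-then-scan behavior can be illustrated with a toy model. This is only a sketch of the idea and not Loki's implementation: the `Stream` type, the `search` function, and the sample log lines are hypothetical, and real chunks are compressed and stored in object storage.

```python
from dataclasses import dataclass, field


@dataclass
class Stream:
    labels: dict                                # indexed: cheap to filter on
    lines: list = field(default_factory=list)   # not indexed: must be scanned


def search(streams, needle, label_filter=None):
    """Return (matching lines, number of lines scanned)."""
    label_filter = label_filter or {}
    matches, scanned = [], 0
    for s in streams:
        # Streams rejected by the label filter are skipped without reading content.
        if any(s.labels.get(k) != v for k, v in label_filter.items()):
            continue
        for line in s.lines:
            scanned += 1
            if needle in line:
                matches.append(line)
    return matches, scanned


streams = [
    Stream({"app": "checkout-service", "environment": "production"},
           ["user=42 paid", "user=7 paid", "user=42 refund"]),
    Stream({"app": "search-service", "environment": "production"},
           ["user=42 query", "user=9 query"]),
]

print(search(streams, "user=42", {"app": "checkout-service"}))
# (['user=42 paid', 'user=42 refund'], 3)
print(search(streams, "user=42"))
# (['user=42 paid', 'user=42 refund', 'user=42 query'], 5)
```

The second call finds more matches but has to read every line in every stream, which is the cost profile the ADR should record for cross-service searches.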
Grafana Cloud is the managed version of the OSS stack, offering Prometheus-compatible metrics, Loki-compatible logs, and Tempo-compatible traces as a fully managed service with per-GB ingestion pricing. For teams that want the vendor-neutral data model and OpenTelemetry compatibility of the OSS stack without the self-hosting operational burden, Grafana Cloud is the intermediate option. The pricing model (per GB ingested for logs and traces, per active series per month for metrics) is more transparent and has a lower floor than Datadog for most starting workloads, with the cardinality caveat that high-series-count metrics still require active management. The infrastructure-as-code strategy for managing Grafana dashboards and alert rules as code (Grafonnet, Terraform Grafana provider) is an additional maturity investment that the ADR should note as a requirement for production-grade observability on the OSS stack.
New Relic
New Relic's transition to user-based pricing in 2020 (Full Platform Users charged per seat, Basic Users free) restructured the cost model to scale with team size rather than infrastructure size. For infrastructure-heavy systems run by small engineering teams, this model can be significantly cheaper than per-host pricing. A solo founder or two-person team operating a complex multi-service system pays for at most two Full Platform User seats regardless of how many hosts the system runs on. As engineering team size grows, the per-seat cost scales, and at 20+ Full Platform Users, the cost converges with or exceeds Datadog's per-host costs depending on infrastructure density.
New Relic's APM agent model — proprietary language agents (Java, Node.js, Python, Ruby, Go, PHP, .NET) that provide automatic instrumentation — is among the deepest in the industry for supported languages. The agents instrument frameworks, database clients, HTTP clients, and message queue clients automatically without code changes, which produces rich distributed traces with minimal developer effort. The tradeoff is the same as Datadog's proprietary agents: the instrumentation is vendor-specific, and migration to another platform requires replacing or reconfiguring agents in every service. New Relic's OpenTelemetry support (OTLP ingest) allows teams to migrate instrumentation to OTel SDKs and route to New Relic, but the automatic instrumentation depth of the proprietary agents is generally richer than what OTel auto-instrumentation provides for some frameworks.
New Relic's data retention model differs from Datadog's. By default, metrics are retained at full resolution for 8 days and at one-minute rollup for 13 months. Traces are retained at full resolution for 8 days. Log retention is configurable per account (30 days by default for the paid tier). The 8-day full-resolution retention for traces is about half of Datadog's 15-day default, which matters for post-incident investigations that happen more than a week after the incident. The 13-month metric rollup is in the same range as Datadog's 15-month metric retention, so both platforms cover year-over-year capacity planning and trend analysis by default.
Honeycomb
Honeycomb's architecture is designed around a single premise: production systems are too complex for pre-aggregated metrics to reliably surface root causes, and the right model is to store every event at full resolution and query the raw data arbitrarily at investigation time. The purpose-built column-store backend executes GROUP BY and WHERE queries over billions of rows in seconds. The query tooling (BubbleUp, heatmaps, GROUP BY breakdowns) is built for answering "what is different about the slow or failing requests compared to the fast or succeeding requests?"
This query model produces the investigation latency advantage described in the opening narrative's P0 incident. For distributed systems with high-cardinality dimensions — customer ID, tenant ID, request path variant, feature flag assignment — Honeycomb's ability to query those dimensions without cardinality ceilings or sampling gaps is a structural advantage during incidents. The engineer does not need to know in advance which dimensions will be relevant during an investigation; all dimensions on every event are queryable after the fact.
The structural constraints of Honeycomb's model are important to document. Honeycomb does not have a native infrastructure metrics product. CPU, memory, disk, and network time series are not first-class in Honeycomb's event-based data model. Teams that adopt Honeycomb for application tracing typically run a separate metrics platform — Prometheus, Grafana, or cloud-native metrics (AWS CloudWatch, Google Cloud Monitoring) — for infrastructure layer observability. This is a two-platform operational model, which means maintaining integrations, alert routing, and dashboards across two products. The ADR must document whether infrastructure metrics and application tracing will be handled by one platform or two, and if two, how alert routing spans both. The logging infrastructure decision record has the same two-platform consideration when the team's log platform and trace platform are separate products — the integration surface between them is where incidents are most expensive to investigate.
Honeycomb's per-event pricing model is the most predictable of the four platforms at steady traffic volume, but it is the most sensitive to instrumentation depth. Enabling detailed tracing (many spans per request, many attributes per span) at high traffic directly increases per-event cost. A team that instruments 20 spans per request at 1,000 requests per second generates 20,000 events per second — 1.7 billion events per day. At Honeycomb's per-event pricing, this is a cost that must be modeled forward against the traffic growth trajectory, not evaluated only against current traffic.
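The event-volume arithmetic above is worth keeping as a reusable calculation so it can be rerun against projected traffic. In this sketch the function name is illustrative and the doubled-traffic scenario is a hypothetical; no per-event price is assumed.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def events_per_day(spans_per_request: int, requests_per_second: int) -> int:
    """Events generated per day when every span is stored as one event."""
    return spans_per_request * requests_per_second * SECONDS_PER_DAY


# Figures from the text: 20 spans per request at 1,000 requests per second.
print(f"{events_per_day(20, 1_000):,}")  # 1,728,000,000

# Hypothetical projection: the same instrumentation depth at doubled traffic.
print(f"{events_per_day(20, 2_000):,}")  # 3,456,000,000
```

Because the result is a product, halving the spans per request has the same effect on event volume as halving the traffic, which makes instrumentation depth the lever the team actually controls.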
The AI chat sessions that produced undocumented decisions
Observability platform configuration produces a specific pattern of undocumented decisions: the sessions are focused on getting visibility running, not on documenting the structural constraints of the visibility choices. Engineers are instrumenting code, not writing architecture records. The decisions feel like configuration. They reveal themselves as architecture only when the cost arrives or the incident exposes the investigation gap.
The initial platform selection session — "what observability tool should we use for a Node.js backend on AWS?" — produces the vendor choice without any analysis of the pricing model's cost scaling formula. The session ends when the first dashboard is green. The ChatGPT response lists Datadog, New Relic, and Grafana; notes that Datadog has good AWS integration; and ends with an install command. The session did not cover what happens to the Datadog bill when the team has 140 hosts in three years. The session did not cover cardinality ceilings or trace sampling rates. It was not the wrong session to have — it solved the immediate problem. The omission is that nothing from that session was written into a decision record. See the structural pattern described in decisions never written down — the session closes when the immediate problem is solved, and the constraints that were implicit in the solution are never externalized.
The APM enablement session — "how do I set up distributed tracing with Datadog in Node.js?" — produces the trace sampling rate. The tutorial recommends 1% sampling because the tutorial was written for a high-traffic production environment where 100% sampling is prohibitively expensive. For a service processing 8 requests per minute, 1% sampling is inconsequential; for a service processing 4,000 requests per minute during a production incident, 1% sampling is a 99% blind spot. The sampling rate appears in the agent configuration file as a number, not as a constraint with implications. The session did not ask: what are the incident investigation implications of this sampling rate? The engineer got distributed tracing working, which was the goal. The ADR section that would have contained "trace sampling rate: 1%; investigation implication: unsampled requests are not available for post-hoc analysis; procedure for increasing sampling during incidents: [link to runbook]" was never written. The CI/CD pipeline decision record has the same pattern: the rollback capability is discovered during an incident rather than designed and documented in advance.
The "reduce Datadog bill" session ("our Datadog bill is $47k per month, how do we reduce it?") produces a retroactive cost audit. The session is valuable; it surfaces the APM-on-every-service problem, the custom metric overage, the staging-host waste. What it does not produce is documentation of the original decisions that accumulated the cost. The optimization actions are logged in a Jira ticket, resolved, and closed. The ticket title is "reduce Datadog spend Q2." The ADR that would have documented "rationale for instrumenting the Lambda processors: [original reasoning]" still does not exist. The next engineer who decides whether to instrument a new Lambda function has no document to check. They ask ChatGPT. The session closes with an install command. This is the same compounding pattern described in the build-vs-buy decision record: the first decision is made in one session, the consequences accumulate across many sessions, and the connection between the first decision and the consequences is invisible because no ADR links them.
The custom instrumentation session ("how do I add custom metrics to track business events in Datadog?") produces the tag dimensions for the custom metric. The session recommends adding a customer_id tag to make the metric queryable per customer. On a 200-customer platform, this is reasonable: 200 time series per metric. On a 15,000-customer platform, the same tag creates 15,000 time series per metric, which is 150% of the 10k default limit from a single metric's tag. The session did not model the cardinality ceiling. The platform had 200 customers when the session happened. The ADR section on cardinality ("custom metric naming convention, approved bounded labels, labels that should NOT be used as tag dimensions, current custom metric count vs ceiling") would have constrained the tag choice before it crossed the overage threshold. The ADR format for this type of constraint is a Consequences section: positive consequences (queryable per customer), negative consequences (200 → 15,000 time series as platform grows, crossing the 10k ceiling at ~10,000 customers). Writing the negative consequence forces the explicit acknowledgment that the tag dimension is bounded by the cardinality ceiling.
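The growth arithmetic in that Consequences entry can be sketched in a few lines. This is a simplified illustration, not a billing calculator: it assumes customer_id is the only tag on the metric and that the account has no other custom metrics, and the function names are made up for this example.

```python
# Sketch: how a per-customer tag consumes the custom metric ceiling as the
# customer base grows. Assumes one time series per distinct tag value and no
# other tags on the metric (a deliberate simplification).

CEILING = 10_000  # the 10k default limit used throughout this article


def series_for_tag(distinct_values: int, metrics_using_tag: int = 1) -> int:
    """One time series per distinct tag value, per metric carrying the tag."""
    return distinct_values * metrics_using_tag


def ceiling_usage(series: int, ceiling: int = CEILING) -> float:
    """Fraction of the ceiling consumed (1.0 means the ceiling is reached)."""
    return series / ceiling


for customers in (200, 10_000, 15_000):
    series = series_for_tag(customers)
    print(f"{customers:>6} customers -> {series:>6} series "
          f"({ceiling_usage(series):.0%} of ceiling)")
```

Running the same function with metrics_using_tag set to the number of metrics that carry the tag shows how quickly a convention ("tag everything with customer_id") multiplies the problem.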
The staging observability session — "how do I set up observability for my staging environment?" — produces full-parity observability because parity is the easiest answer. The Terraform module that installs the Datadog agent in production is reused for staging. Nobody questions whether staging needs APM tracing at the same sampling rate as production, or whether staging logs need 15-day retention, or whether staging hosts need to be on the same billing tier as production. These are cost-architecture decisions. They were never made — they were inherited by default. The infrastructure-as-code strategy decision record pattern applies here: the Terraform module is the infrastructure decision, and the Datadog agent block in that module is the observability decision. When both decisions are made in separate sessions at separate times, the cost implications of the IaC configuration are never analyzed against the observability pricing model.
The open-source WhyChose extractor surfaces these decisions from existing AI chat history — the platform selection session, the APM setup session, the custom metric session, the staging setup session — where the decisions were made. Running the extractor on observability-related chat threads frequently surfaces the original sampling rate rationale ("1% is fine for our current load"), the original tag dimension discussion ("customer_id as a tag would let us query per-customer"), and the staging parity decision ("use the same Terraform module for staging"). The extraction output gives teams a starting point for writing the platform ADR retroactively, before the next invoice audit or P0 incident makes the gaps visible. The guidance on documenting architecture decisions consistently identifies these infrastructure setup sessions as the highest-value source for retroactive ADR reconstruction: the decisions that were made fastest, in sessions focused on getting things running, are the decisions with the most undocumented consequences.
What to actually document in the observability platform ADR
An observability platform ADR that prevents the $47k invoice and the 3.5-hour P0 investigation is not the same document as the observability runbook. The runbook documents how to use the platform; the ADR documents why this platform was chosen over the alternatives, what constraints that choice imposes on all future instrumentation decisions, and what decisions were made during configuration that a future engineer would not be able to infer from the codebase or dashboard configuration alone.
The pricing model and cost scaling formula is the most important section that teams skip. Document the actual pricing formula for your account tier: per-host infrastructure rate, per-host APM rate, custom metric limit and overage rate, per-GB log ingestion rate, Lambda per-invocation rate if applicable. Then model the formula forward: "At current infrastructure (28 hosts, 12 APM hosts, 4,200 custom metrics), the monthly bill is approximately $X. If the team scales to 80 hosts and enables APM on all services, the projected bill is approximately $Y. The primary cost drivers in order are: [1] APM hosts, [2] custom metric overage, [3] infrastructure hosts. Before enabling APM on a new service, evaluate whether the trace data has a named consumer on the team who will actually use it." This forward model prevents the pattern where APM is enabled as a default rather than as a deliberate cost-justified decision.
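A forward model of this kind fits in a short script that lives next to the ADR. The sketch below is illustrative only: the rates are placeholders, not Datadog's published prices, the 12,000-metric projection is borrowed from the invoice story rather than from any forecast, and log ingestion and Lambda charges are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Per-month rates. Replace the values with those on your own contract."""
    infra_per_host: float
    apm_per_host: float
    custom_metric_allowance: int
    overage_per_metric: float


@dataclass(frozen=True)
class Footprint:
    hosts: int
    apm_hosts: int
    custom_metrics: int


def monthly_cost(rates: Rates, fp: Footprint) -> dict:
    """Return the bill broken down by cost driver."""
    overage = max(0, fp.custom_metrics - rates.custom_metric_allowance)
    return {
        "infrastructure hosts": fp.hosts * rates.infra_per_host,
        "APM hosts": fp.apm_hosts * rates.apm_per_host,
        "custom metric overage": overage * rates.overage_per_metric,
    }


# Hypothetical placeholder rates, chosen only to make the arithmetic visible.
rates = Rates(infra_per_host=20.0, apm_per_host=40.0,
              custom_metric_allowance=10_000, overage_per_metric=0.05)

current = Footprint(hosts=28, apm_hosts=12, custom_metrics=4_200)
projected = Footprint(hosts=80, apm_hosts=80, custom_metrics=12_000)

for label, fp in (("current", current), ("projected", projected)):
    lines = monthly_cost(rates, fp)
    drivers = sorted(lines.items(), key=lambda kv: kv[1], reverse=True)
    print(label, f"total=${sum(lines.values()):,.0f}", drivers)
```

Sorting the breakdown by amount gives the "primary cost drivers in order" list directly, so the ADR text and the model cannot drift apart.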
The cardinality ceiling section must specify which tag dimensions are safe to use and which are not. "Safe dimensions are those with a bounded value set: environment (production/staging/development), region (us-east-1/us-west-2/eu-west-1), service name, deployment version. Unsafe dimensions are those with unbounded or very high value sets: customer_id, user_id, request_id, session_id, order_id. Using an unsafe dimension as a metric tag will create one time series per unique value, multiplied by the number of metrics that use that tag. At 15,000 customers, any metric with a customer_id tag will consume 15,000 custom metric slots. Use Datadog's DogStatsD histogram and distribution metric types for per-entity aggregations rather than per-entity time series." The cardinality policy is an instrumentation convention. It must be documented to be enforceable — engineers making individual instrumentation decisions do not have visibility into the account-wide custom metric count without going to the Datadog UI explicitly, and most don't.
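One way to make the policy enforceable is a small check that runs in code review or CI against the tag keys a new metric proposes. The sketch below is an assumption about how such a check could look; it is not part of any Datadog tooling, and the function name and the "needs_review" bucket are invented for illustration.

```python
# Sketch of a tag-policy check. The two sets mirror the safe and unsafe
# dimensions listed in the ADR text above.

APPROVED_TAGS = frozenset({"environment", "region", "service", "version"})
PROHIBITED_TAGS = frozenset(
    {"customer_id", "user_id", "request_id", "session_id", "order_id"}
)


def review_tags(tag_keys):
    """Sort proposed tag keys into approved, prohibited, and needs_review."""
    result = {"approved": [], "prohibited": [], "needs_review": []}
    for key in tag_keys:
        normalized = key.strip().lower()
        if normalized in PROHIBITED_TAGS:
            result["prohibited"].append(normalized)
        elif normalized in APPROVED_TAGS:
            result["approved"].append(normalized)
        else:
            # Unknown keys are neither allowed nor rejected automatically;
            # someone has to decide whether the value set is bounded.
            result["needs_review"].append(normalized)
    return result


print(review_tags(["region", "customer_id", "plan_tier"]))
```

Routing unknown keys to a human rather than rejecting them keeps the list short while still forcing the cardinality question to be asked once per new dimension.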
The trace sampling rate and its incident investigation implications must be stated explicitly. "APM trace sampling rate: 1% (default). At current peak traffic (4,200 requests/minute), this stores approximately 42 traces per minute or 2,520 traces per hour. The 99% of requests that are not sampled are not available for post-hoc investigation. During a P0 incident, if the root cause is visible in fewer than 1% of requests (a specific customer, a specific availability zone, a specific request path), the standard 1% sample may not capture the causal traces. Procedure for increasing sampling during incidents: [link to runbook section]. Temporary increase to 10% can be applied via the Datadog UI → APM → Service → Ingestion Controls without code deployment. Revert to 1% after incident resolution to avoid trace ingestion cost overage." Documenting the procedure for increasing sampling during incidents converts a 3.5-hour investigation into a two-minute configuration change followed by a two-minute query. The procedure was always possible; it was not discoverable at 2:14am without the ADR.
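The sampling arithmetic in that ADR entry, and the reason a rare failure can slip through a 1% sample, can be checked with a short calculation. The sketch assumes uniform, independent per-request sampling, which is a simplification of how any real sampler behaves; the 0.5% affected fraction and the 30-minute window are hypothetical inputs.

```python
# Sketch: what a sampling rate keeps, and how likely it is to keep at least
# one trace from the small slice of requests that shows the root cause.


def sampled_per_minute(requests_per_minute: float, sample_rate: float) -> float:
    """Traces stored per minute at a given sampling rate."""
    return requests_per_minute * sample_rate


def expected_causal_traces(requests_per_minute: float, sample_rate: float,
                           affected_fraction: float, window_minutes: float) -> float:
    """Expected number of stored traces that come from affected requests."""
    affected = requests_per_minute * window_minutes * affected_fraction
    return affected * sample_rate


def chance_of_zero_causal_traces(requests_per_minute: float, sample_rate: float,
                                 affected_fraction: float, window_minutes: float) -> float:
    """Probability that none of the affected requests were sampled."""
    affected = requests_per_minute * window_minutes * affected_fraction
    return (1.0 - sample_rate) ** affected


PEAK_RPM = 4_200
per_minute = sampled_per_minute(PEAK_RPM, 0.01)
print(f"stored at 1%: {per_minute:.0f}/min, {per_minute * 60:.0f}/hour")

for rate in (0.01, 0.10):
    kept = expected_causal_traces(PEAK_RPM, rate, 0.005, 30)
    print(f"rate {rate:.0%}: about {kept:.1f} causal traces in 30 minutes")
```

The narrower the affected slice (one customer, one availability zone), the smaller the affected fraction becomes, which is the case where raising the rate during the incident matters most.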
The OpenTelemetry coverage section is the exit ramp documentation. "Instrumentation approach: [proprietary dd-trace agents / OpenTelemetry SDKs with OTLP export to Datadog]. If proprietary: migration to a different observability backend requires replacing dd-trace with the target platform's agent or an OTel SDK in each instrumented service. Current instrumented services: [list]. Estimated migration effort: [N service-days]. If OTel: backend migration requires only reconfiguring the OTLP exporter endpoint; application code is unchanged." The exit ramp assessment does not predict that the team will migrate — it makes the migration cost visible so that future platform re-evaluation sessions have an accurate cost baseline. The infrastructure-as-code strategy analogy applies: choosing vendor-specific Terraform modules over Pulumi or CDK creates IaC migration cost that accumulates in every new module. Choosing vendor-specific observability agents over OTel SDKs creates instrumentation migration cost that accumulates in every new service.
The data retention policy per signal type must be documented with its investigation implications. "Metrics: 13 months at 1-minute rollup. APM traces: 15 days at full resolution. Logs: 15 days at full resolution (customizable in account settings). Investigation scope: the maximum lookback window for trace-based investigations is 15 days. Incidents discovered more than 15 days after occurrence cannot be investigated in Datadog traces; use the metrics at 1-minute rollup for trend reconstruction. Long-term compliance-relevant events are archived to S3 via Datadog's Log Archive feature; index name [log_archive_prod], S3 bucket [observability-archive-prod], rehydration procedure in the compliance runbook." Retention limits that are only discovered during an investigation are retention limits that delayed the investigation. Documenting them in advance converts "I don't know how far back I can look" into "I know I need to check whether this incident started more than 15 days ago and if so use the metric rollup instead."
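The lookback rule can also be written down as a tiny helper that an on-call engineer (or a runbook script) can run at the start of an investigation. The retention values below are taken from the example ADR text above, with 13 months approximated as 390 days; the helper name is invented for this sketch.

```python
from datetime import date

# Retention windows from the example ADR entry. "13 months" is approximated
# as 13 * 30 days, which is an assumption made for this sketch.
RETENTION_DAYS = {"traces": 15, "logs": 15, "metrics": 13 * 30}


def available_signals(incident_start: date, today: date) -> list[str]:
    """Signals whose retention window still covers the incident start."""
    age_days = (today - incident_start).days
    return sorted(name for name, days in RETENTION_DAYS.items()
                  if age_days <= days)


today = date(2024, 6, 30)
print(available_signals(date(2024, 6, 20), today))  # incident 10 days old
print(available_signals(date(2024, 6, 1), today))   # incident 29 days old
```

An empty or metrics-only result is the cue to go to the archive and its rehydration procedure instead of the live indexes.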
The alert routing and escalation path is the section most frequently configured interactively and least frequently documented. "P0 alerts (checkout service error rate > 2%, all payment processing service errors): PagerDuty Engineering On-Call rotation. P1 alerts (any service error rate > 5%, latency p99 > 2s): Slack #incidents channel + PagerDuty Low-Urgency. P2 alerts (infrastructure health, disk > 80%, memory > 90%): Slack #infra-alerts. Alert configuration source of truth: Terraform module observability/alerts (Datadog Terraform provider). Monitor IDs are in the Terraform state; do NOT edit monitors directly in the Datadog UI, because changes will be overwritten on the next Terraform apply." The "do NOT edit in the UI" rule is the operational contract that prevents configuration drift between the Terraform state and the actual Datadog configuration. That drift is a common source of alert routing confusion: the on-call engineer edits a monitor in the UI during an incident, and the next Terraform apply silently reverts the change.
The ADR template for observability platform selection
The template below follows the Nygard format extended with observability-specific sections. It captures the sections whose absence produced the incidents described above. Adapt field values to the selected platform.
# ADR-NNN: Observability platform selection
## Status
Accepted / Proposed / Superseded by ADR-NNN
## Context
[What are the observability requirements? Infrastructure monitoring,
application APM, distributed tracing, log aggregation, synthetics,
RUM? What is the current infrastructure size and projected three-year
growth trajectory? What are the cardinality requirements — do any
metrics need per-user, per-tenant, or per-request tag dimensions?
What are the incident investigation SLOs?]
## Decision
We will use [Datadog / Grafana OSS stack / New Relic / Honeycomb /
other] for [scope: all signals / metrics only / traces only / etc].
## Pricing model and cost scaling formula
Platform: [vendor]
Pricing model: [per-host / per-seat / per-GB / per-event]
Current monthly cost: $[amount]
Cost formula at current scale: [formula with current values]
Projected cost at 3-year scale: [formula with projected values]
Primary cost drivers in order: [list]
Rules for cost-gated instrumentation decisions:
- APM enablement: [approval required / self-service / default-on]
- Custom metric addition: [cardinality review required / self-service]
- Log verbosity in production: [INFO default / DEBUG requires approval]
## Cardinality model and ceiling
Cardinality model: [pre-aggregated time series / raw event column store]
Cardinality ceiling: [10k custom metrics / no ceiling / memory-bound]
Overage behavior: [$X per metric per month / OOM risk / query slowdown]
Approved labels (bounded value sets):
environment, region, service, version
Prohibited labels (unbounded value sets — DO NOT use as metric tags):
customer_id, user_id, request_id, order_id, session_id
Current custom metric count: [N] / ceiling: [M]
Safe margin: [M - N remaining before overage]
## Trace sampling rate and investigation model
Sampling approach: [head-based 1% / tail-based / full resolution]
Implication: [N% of requests are not available for post-hoc investigation]
Procedure for increasing sampling during P0 incidents:
[UI path or config change, estimated time to take effect]
Full-resolution query capability: [yes / no — use separate platform for P0]
## OpenTelemetry coverage
Instrumentation approach: [proprietary agents / OTel SDKs / mixed]
Services using proprietary instrumentation: [list]
Services using OTel instrumentation: [list]
Migration cost estimate if platform changes: [N service-days]
## Data retention per signal type
Metrics: [duration at full resolution, duration at rollup]
APM traces: [duration]
Logs: [duration, configurable? yes/no]
Investigation lookback limit: [N days for trace-based investigations]
Long-term archive: [S3 / GCS / none — location, rehydration procedure]
## Alert routing and escalation
P0 definition and routing: [condition → PagerDuty rotation name]
P1 definition and routing: [condition → Slack channel + PD low urgency]
P2 definition and routing: [condition → Slack channel]
Alert configuration source of truth: [Terraform / UI / both]
Edit policy: [Terraform-only / UI-editable / mixed — drift risk noted]
## Consequences
Positive: [capabilities this platform provides for the team's use cases]
Negative: [cost scaling risks, cardinality constraints, sampling gaps,
vendor lock-in from proprietary instrumentation, operational burden
of self-hosted stack if applicable]
Risks: [what to monitor as infrastructure and team size grow]
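A template only helps if its sections are actually filled in. As a rough sketch, a repository check could verify that an ADR file contains every section heading from the template above. The function name and the idea of running it in CI are assumptions for illustration; the heading list is copied from the template.

```python
import re

# Section headings from the template above, in template order.
REQUIRED_SECTIONS = (
    "Status",
    "Context",
    "Decision",
    "Pricing model and cost scaling formula",
    "Cardinality model and ceiling",
    "Trace sampling rate and investigation model",
    "OpenTelemetry coverage",
    "Data retention per signal type",
    "Alert routing and escalation",
    "Consequences",
)

# Matches level-two markdown headings only ("## Title"), not "#" or "###".
HEADING = re.compile(r"^##\s+(.+?)\s*$", re.MULTILINE)


def missing_sections(adr_text: str) -> list[str]:
    """Return required section titles that do not appear as '## ' headings."""
    found = {m.group(1).strip().lower() for m in HEADING.finditer(adr_text)}
    return [title for title in REQUIRED_SECTIONS if title.lower() not in found]


draft = """# ADR-012: Observability platform selection
## Status
Accepted
## Context
Node.js backend on AWS.
## Decision
We will use Datadog for all signals.
"""
print(missing_sections(draft))
```

The check only confirms that a heading exists, not that the section says anything useful; it catches the "platform name and install command" ADR, and the rest still needs a reviewer.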
The sections that teams consistently skip are the cost scaling formula (the number is known; the formula with future values is not), the cardinality ceiling and prohibited labels (the ceiling is in the pricing documentation; the application to specific tag decisions is not), and the trace sampling rate with its investigation procedure (the rate is in the config file; the incident procedure is not). Those three sections are the ones whose absence produces the $47k invoice and the 3.5-hour P0 investigation. Write them before the second service is instrumented, not after the second invoice is audited.
Observability platform decisions share the same structural characteristic as observability strategy decisions: the initial choice looks low-stakes at a small scale ("it's just metrics and logs") and reveals its true migration and cost structure only after the platform is wired into every service across three years of growth. The ADR is not an accounting exercise. It is the document that makes the next instrumentation decision — "should I enable APM on this new service?" — a two-minute check against a documented cost model rather than a five-second default that will cost $1,400 per month for the next thirty-eight months on a service whose traces nobody reads.
Frequently asked questions
What is cardinality in observability and why does it matter for platform selection?
Cardinality in observability refers to the number of unique time series a platform must track — each unique combination of a metric name and its tag or label values is a separate time series. A metric called api.request.duration with a customer_id tag on a platform serving 50,000 customers creates 50,000 separate time series for that one metric. Cardinality matters for platform selection because platforms handle high-cardinality data differently: Prometheus stores every unique time series in memory and can crash on high-cardinality metrics; Datadog imposes a 10,000 custom metric ceiling with per-metric overage charges above it; Honeycomb's column-store architecture has no cardinality ceiling because it stores raw events rather than pre-aggregated time series. A team that adopts Datadog for a small customer base and later adds per-tenant metrics for a multi-tenant product may cross the cardinality ceiling without expecting to, generating invoices that scale with customer growth rather than infrastructure growth. The cardinality model determines which instrumentation patterns are affordable and must be documented in the platform ADR as a constraint on tag dimension choices.
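Because each unique combination of tag values is its own time series, the worst case for a metric is the product of the distinct value counts of its tags. The sketch below computes that upper bound; the tag names and counts are illustrative, and real series counts are usually lower because not every combination occurs in practice.

```python
from math import prod


def max_series(tag_value_counts: dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of tag value counts."""
    return prod(tag_value_counts.values())


# Hypothetical tag sets for a single metric such as api.request.duration.
bounded = {"environment": 3, "region": 3, "service": 12}
with_customer = {**bounded, "customer_id": 50_000}

print("bounded tags only:", max_series(bounded))
print("plus customer_id: ", max_series(with_customer))
```

The bounded tags stay in the low hundreds regardless of business growth, while one unbounded tag multiplies every other dimension it is combined with.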
When should a team use Honeycomb instead of Datadog?
Honeycomb is the appropriate choice when the team's primary observability use case is incident investigation and root cause analysis on production traffic with high-cardinality dimensions, and when per-event pricing is preferable to per-host pricing. Honeycomb's column-store architecture stores every trace and event at full resolution and supports arbitrary GROUP BY, WHERE, and HAVING queries over the full dataset without sampling. An engineer investigating a P0 incident can query "show me all requests where customer_id = 12345 AND duration_ms > 5000 AND region = us-east-1 in the last 30 minutes" and receive an answer based on actual production traffic rather than the 1% sample that Datadog APM provides by default. This full-resolution query capability is Honeycomb's primary differentiator. Honeycomb is not the appropriate choice when the primary use case is infrastructure monitoring with pre-built dashboards — Honeycomb has no native infrastructure metrics product, so teams need a separate metrics product (Prometheus, CloudWatch) for the infrastructure layer. Cost at scale favors Honeycomb when trace volume is high and cardinality is high, because per-event pricing scales with event count rather than host count. The crossover depends on average trace depth, events per request, and custom metric usage.
What is the difference between per-host and per-seat observability pricing?
Per-host pricing charges based on the number of infrastructure hosts running the observability agent. Datadog's infrastructure monitoring charges per host per month, so as infrastructure grows (more services, horizontal scaling, containerization), host count grows and cost scales proportionally. A monolith at 8 hosts that becomes twelve microservices at 140 hosts sees a 17.5x increase in base monitoring cost before any APM, logging, or custom metric charges. Per-seat pricing charges based on the number of engineers with access to the observability platform. New Relic's user-based model charges per Full Platform User per month regardless of how many hosts the system runs. A two-person team operating 140 hosts pays for two seats, not 140 hosts. This model aligns cost with team size rather than infrastructure size: infrastructure growth does not drive costs, headcount growth does. The right choice depends on which growth vector is expected to be faster. If infrastructure grows faster than the team (infrastructure-heavy, small-team operations), per-seat is typically cheaper. If the team grows faster than infrastructure (consulting firms, SaaS with efficient infrastructure), per-host may be cheaper at comparable scale.
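The comparison reduces to two multiplications and a ratio. The sketch below uses made-up rates (they are not Datadog's or New Relic's prices) and ignores everything except the base charge, so it shows the shape of the trade-off rather than a real quote.

```python
# Sketch: base monthly cost under per-host versus per-seat pricing.
# Both rates are hypothetical placeholders.

HOST_RATE = 20.0   # assumed cost per host per month
SEAT_RATE = 100.0  # assumed cost per full-access user per month


def per_host_cost(hosts: int, rate: float = HOST_RATE) -> float:
    return hosts * rate


def per_seat_cost(seats: int, rate: float = SEAT_RATE) -> float:
    return seats * rate


def cheaper_model(hosts: int, seats: int) -> str:
    """Name the cheaper model for a given footprint ('tie' if equal)."""
    host_total, seat_total = per_host_cost(hosts), per_seat_cost(seats)
    if host_total == seat_total:
        return "tie"
    return "per-host" if host_total < seat_total else "per-seat"


def break_even_hosts_per_seat(host_rate: float = HOST_RATE,
                              seat_rate: float = SEAT_RATE) -> float:
    """Hosts per engineer above which per-seat pricing becomes cheaper."""
    return seat_rate / host_rate


print(cheaper_model(hosts=140, seats=2))   # small team, many hosts
print(cheaper_model(hosts=8, seats=31))    # large team, few hosts
print(break_even_hosts_per_seat())
```

Tracking the hosts-per-engineer ratio over time against the break-even value is a cheap way to notice when the growth vector has shifted and the pricing model deserves a second look.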
What should an observability platform ADR document that teams typically skip?
Teams typically document the platform name and the install command: "We use Datadog for observability." The ADR sections that prevent surprise invoices and incident investigation gaps are: (1) the pricing model and cost scaling formula: document the actual formula with current values and project it forward to three-year scale so future instrumentation decisions can be evaluated against the cost model before being made; (2) the cardinality ceiling and prohibited tag dimensions: not all label values are safe as metric tags, and documenting which dimensions are high-cardinality and what the platform's response is (overage charges, memory explosion, query slowdown) prevents instrumentation decisions that silently grow the bill; (3) the trace sampling rate and its incident investigation implications: document the rate, what fraction of requests it leaves unsampled, and the procedure for temporarily increasing sampling during P0 incidents; (4) the OpenTelemetry coverage: which services use vendor-specific agents versus OTel SDKs; this is the exit ramp assessment, because vendor-specific instrumentation is migration cost that accumulates in every service; (5) the data retention policy per signal type: retention limits that are only discovered during an investigation are retention limits that delayed the investigation; (6) the alert routing and the configuration source of truth: alert configurations edited interactively in the vendor UI drift from the Terraform-managed state, are silently overwritten on the next Terraform apply, and produce routing surprises during incidents.