Blog

Long-form essays on engineering decision records, ADR conventions, AI chat as a thinking tool, and how teams keep their reasoning findable. New posts ship roughly weekly. The narrative arc starts with the new-CTO onboarding problem and works outward.

2026-06-22 · ~21 min read

The search infrastructure decision record: why the search engine you chose determines your query relevance ceiling and your schema evolution cost

Search engine selection looks like a configuration detail until a relevance improvement triggers a 4.5-hour index rebuild that freezes document creation, or a managed SaaS bill grows from $420 to $1,240 a month — because the relevance model ceiling, the schema evolution constraint, and the per-operation cost at scale were never modeled when the session closed.

2026-06-22 · ~21 min read

The CDN decision record: why the CDN you chose determines your cache invalidation latency and your origin shield cost at global scale

CDN selection looks like a routing detail until a pricing bug is patched in four minutes of deploy time but cached API responses keep serving the wrong entitlements for an hour — because the team cached API responses without documenting the invalidation procedure, the propagation latency, or the maximum accepted staleness for pricing endpoints.

2026-06-22 · ~22 min read

The API gateway decision record: why the gateway you chose determines your authentication surface and your rate limiting model

API gateway selection looks like an infrastructure detail until a security researcher finds four unprotected admin routes — because the authentication model was per-route, the default was open, and the policy that said "all state-modifying routes must have the JWT authorizer" had never been written down.

2026-06-21 · ~23 min read

The observability platform decision record: why the telemetry pipeline you chose determines your cardinality ceiling and your incident investigation latency

Observability platform selection looks like a tooling detail until a $47k Datadog invoice arrives in month 38 and the team discovers APM was enabled on every Lambda function, custom metrics grew to 12,000 unique time series past the 10k default limit, and per-host charges had been scaling with every microservice extraction for three years. The platform you chose — Datadog, Grafana OSS stack, New Relic, or Honeycomb — determines your cardinality ceiling, your trace sampling model, your cost scaling formula, and whether the engineer paged at 2am can find the root cause before the SLA expires.

2026-06-21 · ~22 min read

The message broker decision record: why the broker you chose determines your delivery guarantee and your consumer group isolation capability

Message broker selection looks like an infrastructure detail until a payment reconciliation service needs 30 days of event replay and you discover the topic retention policy was set to 24 hours at initial setup. The broker you chose — Kafka, RabbitMQ, SQS, or Redis Streams — determines your delivery guarantee, your replay window, your consumer group isolation model, and whether adding the sixth downstream consumer is a configuration change or a six-week engineering project.

2026-06-21 · ~22 min read

The flag service infrastructure decision record: why the evaluation model and SDK you chose determine your rollback latency and your A/B testing capability

Feature flag service selection looks like a tooling choice until a bad rollout burns through Black Friday traffic and you discover the homegrown flag system uses per-request random bucketing — so 10% of users is actually 10% of requests, and the same user sees both versions on alternate page loads. The flag service you chose — LaunchDarkly, Unleash, Flipt, or a homegrown Redis-backed system — determines your rollback propagation latency, your user-consistency guarantee, your A/B testing trustworthiness, and how expensive a future migration will be.

2026-06-21 · ~20 min read

The secrets management decision record: why the secrets store you chose determines your rotation automation capability and your audit trail at every production access

Secrets management looks like a security detail until a leaked credential triggers an incident response at 3am and your team discovers the rotation procedure in the runbook assumes the secrets store you replaced six months ago. The store you chose — environment variables, AWS Secrets Manager, HashiCorp Vault, or a managed platform — determines your rotation automation model, your audit trail granularity, and how many minutes elapse between detecting a compromised credential and denying access to every consumer of it.

2026-06-21 · ~20 min read

The database connection pooling decision record: why the pooler you chose determines your connection exhaustion ceiling and your failover behavior during database restarts

Database connection pooling looks like a performance tuning detail until your application hits FATAL: sorry, too many clients already at 2am and your on-call engineer doesn’t know whether the pooler is PgBouncer in transaction mode or the ORM’s built-in pool. The pooler you chose — PgBouncer, RDS Proxy, or application-level pooling — determines your connection exhaustion ceiling, your server RAM commitment per idle connection, and how fast your application recovers when the database restarts.

2026-06-21 · ~22 min read

The container orchestration decision record: why the Kubernetes vs ECS vs serverless choice you made determines your operational complexity floor and your autoscaling behavior under load

Container orchestration looks like an infrastructure choice until a traffic spike reveals that your autoscaling model adds four minutes of latency to scale-out and your on-call engineer doesn’t know why Kubernetes was chosen over ECS in the first place. The orchestration platform you chose in year one sets the operational complexity floor — the minimum ongoing maintenance work your team carries — and determines the autoscaling behavior your users experience when demand changes suddenly.

2026-06-20 · ~20 min read

The CI/CD pipeline decision record: why the deployment pipeline you chose determines your rollback capability and your mean time to deploy under incident pressure

CI/CD pipelines look like plumbing until a production incident requires an emergency rollback and the pipeline that ships code in 12 minutes takes 47 minutes to roll back because nobody designed the rollback path. The deployment pipeline decisions made in year one — build tool, artifact model, deployment strategy, rollback mechanism — determine whether your team can respond to production incidents with a single command or with a multi-step manual procedure under pressure.

2026-06-20 · ~20 min read

The infrastructure-as-code strategy decision record: why the IaC approach you chose in year one determines your drift detection capability and your compliance audit surface

Infrastructure-as-code looks like a solved problem until a manually-applied security group fix from a late-night incident sits in production for eight months, invisible to your Terraform state, and a junior engineer's next plan would delete it. The IaC tool and module structure you chose — Terraform, Pulumi, CloudFormation, or direct CLI — determines whether drift is detectable, whether your compliance audit has a change trail, and whether the engineer running apply has enough context to understand what they're about to destroy.

2026-06-20 · ~20 min read

The API versioning strategy decision record: why the versioning approach you chose in year one constrains the migration path for every breaking change in year three

API versioning looks like a URL prefix until a mobile client running version 1.2 refuses to update and your paying enterprise customer's integration breaks when you remove a deprecated field. The versioning strategy embedded in your API — URL path versioning, header-based versioning, date-based versioning, or no versioning at all — determines how many simultaneous API versions you can maintain, what migration friction your customers experience, and how long deprecated endpoints consume your engineering capacity.

2026-06-20 · ~20 min read

The background job infrastructure decision record: why the job queue you chose determines your retry semantics and your dead-letter visibility

Background jobs look like a simple work-queue pattern until a critical billing job silently exhausts its retries, disappears into a dead-letter queue no one monitors, and the revenue leak runs for three days before anyone notices the accounts receivable balance is wrong. The retry policy, backoff strategy, dead-letter handling procedure, deduplication approach, and concurrency model embedded in the job infrastructure determine what happens to failed work and whether the team can diagnose the failure without reading application logs.
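The retry policy, backoff strategy, and dead-letter handling named above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any particular job library's API: `RetryPolicy`, `DeadLetterQueue`, and `run_job` are hypothetical names, and the in-memory list stands in for whatever durable dead-letter store a real system would use.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RetryPolicy:
    """Hypothetical retry policy: attempt budget plus capped exponential backoff."""
    max_attempts: int = 5
    base_delay_s: float = 1.0
    max_delay_s: float = 60.0

    def delay_for(self, attempt: int) -> float:
        # Delay after the Nth failed attempt: base * 2^(N-1), capped at max_delay_s.
        return min(self.base_delay_s * 2 ** (attempt - 1), self.max_delay_s)


@dataclass
class DeadLetterQueue:
    """In-memory stand-in for a durable dead-letter store (assumption)."""
    entries: list = field(default_factory=list)

    def put(self, job_id: str, error: str, attempts: int) -> None:
        self.entries.append({"job_id": job_id, "error": error, "attempts": attempts})


def run_job(
    job_id: str,
    handler: Callable[[], None],
    policy: RetryPolicy,
    dlq: DeadLetterQueue,
    sleep: Callable[[float], None] = lambda seconds: None,
) -> bool:
    """Run handler until it succeeds or the attempt budget is exhausted.

    Returns True on success. On exhaustion the job is recorded in the
    dead-letter queue with its last error, so the failure stays visible.
    """
    last_error = ""
    for attempt in range(1, policy.max_attempts + 1):
        try:
            handler()
            return True
        except Exception as exc:  # a real system would narrow this to retryable errors
            last_error = repr(exc)
            if attempt < policy.max_attempts:
                sleep(policy.delay_for(attempt))
    dlq.put(job_id, last_error, policy.max_attempts)
    return False
```

The point of the sketch is the last two lines of `run_job`: exhaustion produces a record that someone can alert on, rather than a job that silently disappears.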

2026-06-20 · ~20 min read

The multi-region deployment decision record: why the region topology you chose determines your latency floor and your data residency compliance posture

Multi-region deployment looks like infrastructure configuration — pick a cloud provider, select additional regions, enable database replication, deploy identical application stacks. This framing conceals the failover model that determines RPO and RTO during regional outages; the data residency policy that determines which data can leave which jurisdiction; the cross-region consistency model that determines the tradeoff between write latency and data loss risk; and the traffic routing approach that determines the actual latency floor users experience.

2026-06-19 · ~20 min read

The search architecture decision record: why the search approach you chose determines your relevance tuning capability and your query latency under index growth

Search feels like a database query — a WHERE clause, a LIKE operator, maybe a full-text index. This framing obscures the architectural decisions embedded in the implementation: the indexing strategy that determines the performance ceiling under corpus growth; the relevance model that determines whether the first result is actually the most relevant result; the synchronization approach that determines how stale the index can be; the schema evolution policy that determines whether adding a new field requires downtime.

2026-06-19 · ~20 min read

The queue and messaging decision record: why the message queue you chose determines your delivery guarantees and your dead-letter handling posture

Message queues look like infrastructure configuration — choose a broker, publish a message, consume it in a worker. This framing hides the architectural decisions embedded in the choice: the delivery semantic that determines what your consumer must do when a message arrives twice; the dead-letter strategy that determines whether a malformed message is isolated for inspection or loops forever; the schema evolution policy that determines whether producer and consumer can be deployed independently.
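Under an at-least-once delivery semantic, "what your consumer must do when a message arrives twice" usually comes down to deduplicating on a message identifier. The following is a minimal sketch, assuming each message carries a stable `message_id`; the `IdempotentConsumer` class is hypothetical, and the in-memory set stands in for a durable deduplication store.

```python
from typing import Callable


class IdempotentConsumer:
    """Wraps a handler so that a redelivered message is applied only once."""

    def __init__(self, handler: Callable[[dict], None]) -> None:
        self._handler = handler
        # Assumption: a real deployment would keep processed IDs in a durable
        # store (database table, Redis set) rather than in process memory.
        self._processed_ids: set[str] = set()

    def consume(self, message_id: str, payload: dict) -> bool:
        """Return True if the message was applied, False if it was a duplicate."""
        if message_id in self._processed_ids:
            return False
        self._handler(payload)
        # The ID is recorded only after the handler succeeds, so a handler
        # failure leaves the message eligible for redelivery.
        self._processed_ids.add(message_id)
        return True
```

Whether the deduplication window is hours or forever, and where the processed IDs live, are exactly the kinds of details a messaging ADR would need to state.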

2026-06-19 · ~20 min read

The API rate limiting decision record: why the rate limiting approach you chose determines your abuse surface and your SLO degradation behavior under traffic spikes

Rate limiting is added reactively — after the first abuse incident or the first traffic spike that takes the service down. The approach chosen at that moment, under pressure, determines whether legitimate traffic degrades gracefully when spikes arrive, whether the abuse surface can be narrowed without breaking existing integrations, and whether the on-call engineer can change a limit value at 2am without triggering a deployment.

2026-06-19 · ~20 min read

The authentication strategy decision record: why the session management approach you chose in year one determines your SSO migration cost and your compliance posture in year three

Authentication libraries are chosen from quickstart tutorials and rarely revisited as decisions. Two years later, the session management approach you chose determines what an SSO integration costs, whether a compromised token can be revoked before it expires, and whether your SOC 2 audit finds undocumented session data retention policies.

2026-06-19 · ~19 min read

The database migration strategy decision record: why the schema migration approach you chose determines how safely you can evolve your data model at production traffic

Database migration tooling is chosen when the project starts and rarely revisited. Two years later, the migration approach you chose determines whether adding a column to a 200-million-row table takes 3 seconds or 40 minutes, whether a failed deploy can be rolled back, and whether the team can apply schema changes during a rolling deploy without breaking the running version.

2026-06-18 · ~18 min read

The observability strategy decision record: why the metrics, traces, and logs platform you chose determines what questions you can answer when incidents happen

Observability platform adoption is treated as a DevOps improvement, not an architecture decision. The metrics backend is chosen when the first dashboard is needed. The distributed tracing library is chosen when the first latency mystery appears. Two years later, the cardinality limits of the metrics backend determine whether you can answer "which customer is experiencing the slowest API response?" during an incident. The tracing format you chose determines whether you can switch backends without re-instrumenting every service. Three backend families: self-hosted Prometheus (active series in memory, ~30M series per 32 GB instance — high-cardinality labels like customer_id produce memory exhaustion rather than cost spikes, cardinality explosions are the most common Prometheus failure mode); managed SaaS (Datadog, Grafana Cloud — no operational burden, cost scales with active series; a customer_id label on a metric with 50,000 active customers multiplies the metric's cost by 50,000); high-cardinality backends (Honeycomb, ClickHouse — column-store, per-event pricing, filter-by-any-dimension without cardinality limits). The prohibited label set: customer_id, user_id, request_id, session_id, tenant_id are labels that produce cardinality explosions on high-traffic metrics — the most consequential undocumented constraint of any Prometheus deployment. 
The distributed tracing decision has three independently load-bearing components: (1) instrumentation format: OpenTelemetry SDK (vendor-neutral, backend migration requires only OpenTelemetry Collector exporter reconfiguration) vs Jaeger native or Zipkin B3 (backend-specific, migration requires re-instrumenting every service); (2) sampling strategy: head-based probabilistic (independent draw per request; a 1% sample of traffic with a 0.1% error rate keeps about one error trace per 100,000 requests, so at modest request rates a 5-minute incident window often contains none) vs tail-based (sampling decision made after the full trace is assembled, every error trace retained, memory cost for buffering); (3) trace context propagation: W3C traceparent headers for HTTP, Kafka message header serialization for async boundaries; missing propagation at async boundaries produces orphaned root spans disconnected from the originating request trace. The observability contract: explicit statements of which incident questions the platform can answer ("which service is responsible for elevated latency?") and which it cannot ("which specific customers are affected?" cannot be answered because the customer_id label is not permitted due to cardinality constraints; a manual database query is required). Writing the observability strategy ADR: the observability contract with explicit capability and incapability statements; the metrics backend decision with cardinality limit and prohibited labels; the distributed tracing decision with format, sampling strategy, and propagation policy; the log aggregation integration with cross-pillar correlation policy; the revisitation conditions naming cost thresholds, cardinality thresholds, compliance retention requirements.
Finding observability decisions in AI chat: four session types — initial instrumentation sessions (backend selection, tracing library choice, sampling strategy); incident response sessions (capability gaps discovered under pressure: "how do I filter Prometheus metrics by customer ID?"); cardinality incident sessions (the event that surfaces the cardinality limit); platform migration sessions (format lock-in consequences discovered at migration time).
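The head-based sampling trade-off mentioned above can be made concrete with a short calculation. This is a sketch under simplifying assumptions: errors and sampling draws are treated as independent per request, and the request rates in the usage note are illustrative values, not figures from the post.

```python
def p_no_error_trace_sampled(requests: int, error_rate: float, sample_rate: float) -> float:
    """Probability that head-based sampling retains zero error traces.

    Each request is an error with probability error_rate and is sampled
    with probability sample_rate, independently, so a given request yields
    a retained error trace with probability error_rate * sample_rate.
    """
    per_request_hit = error_rate * sample_rate
    return (1.0 - per_request_hit) ** requests


def requests_in_window(requests_per_second: float, window_seconds: int = 300) -> int:
    """Number of requests in an incident window (default: 5 minutes)."""
    return int(requests_per_second * window_seconds)
```

With a 0.1% error rate and 1% sampling, `p_no_error_trace_sampled(requests_in_window(10), 0.001, 0.01)` comes out at about 0.97, while the same call at 1,000 requests per second comes out at about 0.05. How often head-based sampling misses the incident therefore depends heavily on traffic volume, which is a reason to record expected request rates next to the sampling decision.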

2026-06-18 · ~18 min read

The monorepo vs. polyrepo decision record: why the repository structure you chose in year one shapes your CI costs, code ownership, and dependency management in year four

Repository structure is chosen once — at project inception — and rarely revisited. Four years later, the build graph determines whether a one-line change to a shared utility takes 4 minutes or 40 minutes in CI. The CODEOWNERS model determines whether a cross-team PR review takes one day or one week. The dependency sharing model determines whether a breaking change in a shared package can be deployed atomically or requires coordinating five separate service teams. Four structural patterns with distinct architectural consequences: flat monorepo (single CI pipeline, no package boundaries, CI time grows linearly with codebase size, no ownership model — the starting point most teams outgrow without documenting the transition); modular monorepo with workspace tooling (Turborepo, Nx, Bazel — affected-package computation drives CI efficiency, accuracy depends entirely on the correctness of dependency declarations in package.json, phantom dependencies produce CI green results for artifacts built against outdated transitive dependencies); polyrepo (one repo per service, independent release cadence, shared code versioned via an internal registry, version skew accumulates without a maximum skew policy, cross-service refactoring requires coordinated multi-repo PRs); hybrid structure (monorepo per domain, polyrepo across domains — inherits both affected computation complexity and cross-repository coordination cost). 
The affected computation accuracy invariant: a phantom dependency — a package that a module imports directly without declaring it in package.json — is resolved at runtime because the transitive dependency is installed in the workspace, but the build tool has no record of the relationship; when the phantom dependency's package changes, the importing package's build cache is not invalidated, CI shows green, and the deployed artifact is built against the outdated version; phantom dependency prevention (pnpm strict hoisting mode, dependency-cruiser pre-merge lint) converts the phantom dependency from a silent production failure mode to a development-time error. The CODEOWNERS ownership model: in a monorepo, ownership is a CODEOWNERS file that must be maintained as the team structure changes; stale CODEOWNERS entries gate PRs on reviewers who left the team, producing review bottlenecks; ungated packages (no CODEOWNERS entry) must be documented choices, not accidental gaps. The dependency sharing model: workspace:* references mean all packages always use the current version of all internal dependencies — a breaking change to a shared package must update all consumers in the same PR; in a polyrepo, consumers upgrade on independent schedules, enabling staged rollouts but accumulating version skew without a documented maximum skew policy. Writing the repository structure ADR: structure decision with alternatives evaluated and rejection reasons; CI build model and caching policy naming how affected computation works and what phantom dependency prevention mechanism is in place; code ownership model with CODEOWNERS maintenance policy and cross-team contribution process; dependency sharing model with breaking change deployment policy; revisitation conditions naming checkable thresholds for CI time, review turnaround, and team count. 
Finding repository structure decisions in AI chat: four session types — initial setup sessions (structure choice, build tool selection); CI speed sessions (affected computation configuration, remote cache backend); phantom dependency sessions (accuracy policy discovery through incident); cross-team contribution sessions (CODEOWNERS design and review process).
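The way a phantom dependency defeats affected-package computation can be shown with a toy dependency graph. This is a minimal sketch, not how Turborepo, Nx, or Bazel implement it; the package names and both helper functions are hypothetical.

```python
from collections import defaultdict, deque


def affected_packages(declared_deps: dict[str, set[str]], changed: str) -> set[str]:
    """Return the changed package plus everything that depends on it
    transitively, using only the declared dependency graph."""
    dependents: dict[str, set[str]] = defaultdict(set)
    for package, deps in declared_deps.items():
        for dep in deps:
            dependents[dep].add(package)

    affected = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for parent in dependents[current]:
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected


def phantom_dependencies(
    declared_deps: dict[str, set[str]],
    actual_imports: dict[str, set[str]],
) -> dict[str, set[str]]:
    """Return, per package, the imports that are not declared dependencies."""
    result = {}
    for package, imports in actual_imports.items():
        undeclared = imports - declared_deps.get(package, set())
        if undeclared:
            result[package] = undeclared
    return result


# Hypothetical workspace: "billing" imports "utils" without declaring it.
DECLARED = {"web": {"ui"}, "ui": {"utils"}, "billing": set(), "utils": set()}
IMPORTS = {"web": {"ui"}, "ui": {"utils"}, "billing": {"utils"}, "utils": set()}
```

Here `affected_packages(DECLARED, "utils")` contains `utils`, `ui`, and `web` but not `billing`, so a change to `utils` would leave `billing` untested and its cache untouched. `phantom_dependencies(DECLARED, IMPORTS)` reports the gap, which is the role a pre-merge lint plays.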

2026-06-18 · ~18 min read

The caching strategy decision record: why the cache invalidation approach you chose shapes your consistency guarantees and the classes of bugs your users experience in production

Caching adoption is treated as a performance optimization, not an architecture decision. The invalidation mechanism is chosen quickly when the first slow endpoint appears and rarely documented. Two years later, the TTL policy determines the staleness window for every cached resource, the cache stampede behavior determines what users experience when the cache is flushed, and the failure mode determines whether the application degrades gracefully or breaks entirely when the cache node restarts. Five caching patterns: cache-aside (lazy loading — application checks cache first, on miss reads from source and populates cache; write path decoupled from cache, staleness bugs emerge when write paths lack invalidation calls); write-through (updates cache synchronously on every write, always current, write latency cost, cold start problem after flush); write-behind / write-back (acknowledges writes to cache immediately, DB write async — minimizes write latency, durability risk if cache node fails before DB write completes); read-through (cache layer handles cache miss automatically by calling a configured data loader, consistent population logic); CDN/edge caching (HTTP response caching via Cache-Control headers, tag-based surrogate key invalidation for purge-on-update). The invalidation mechanism decision: TTL-based invalidation is the default but is simultaneously the staleness window that determines how long a stale billing address, expired permission, or outdated price persists; event-driven invalidation supplements TTL by explicitly deleting keys on write, but every new write path that lacks the corresponding invalidation call is a staleness bug; tag-based invalidation groups cached responses under named tags and purges by tag, requiring a maintained tag design that determines the granularity of invalidation. 
The cache stampede problem: when a high-traffic cached value expires, multiple concurrent requests simultaneously execute the cache miss handler, saturating the database; single-flight / mutex pattern deduplicates concurrent miss executions; probabilistic early expiration spreads regeneration load across the TTL window before expiry. The consistency guarantee: cache-aside with TTL gives eventual consistency at the TTL boundary — account tier, feature entitlements, and prices silently serve stale data for the entire TTL window when the write path lacks invalidation. Writing the caching ADR: caching mechanism and cache provider decision with alternatives evaluated; TTL policy and invalidation mechanism by data class; consistency guarantee naming the staleness window for each data class; cache key design and namespace policy; failure behavior and cache stampede mitigation strategy. Finding caching decisions in AI chat: four session types — initial adoption sessions (mechanism choice, provider selection, TTL reasoning); staleness incident sessions (invalidation gap discovery); stampede sessions (thundering herd discovery and mitigation); and cache failure sessions (failure behavior and fallback policy).
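The single-flight pattern described above can be sketched with a lock and one event per in-flight key. This is a simplified in-process illustration: `SingleFlightCache` is a hypothetical class, TTL handling is omitted, and a multi-node deployment would need a distributed lock instead.

```python
import threading
from typing import Any, Callable


class SingleFlightCache:
    """Cache where concurrent misses on the same key run the loader once."""

    def __init__(self) -> None:
        self._values: dict[str, Any] = {}
        self._inflight: dict[str, threading.Event] = {}
        self._lock = threading.Lock()

    def get(self, key: str, loader: Callable[[], Any]) -> Any:
        while True:
            with self._lock:
                if key in self._values:
                    return self._values[key]
                event = self._inflight.get(key)
                is_leader = event is None
                if is_leader:
                    # First caller to miss becomes the leader for this key.
                    event = threading.Event()
                    self._inflight[key] = event

            if is_leader:
                try:
                    value = loader()  # runs outside the lock
                    with self._lock:
                        self._values[key] = value
                    return value
                finally:
                    # Wake waiters whether the load succeeded or failed.
                    with self._lock:
                        del self._inflight[key]
                    event.set()
            else:
                # Followers wait for the leader, then re-check the cache.
                # If the leader failed, one follower becomes the next leader.
                event.wait()
```

If eight requests miss on the same key at once, the database sees one query instead of eight. The loader's failure behavior (retry by the next waiter here) is itself a design choice worth writing down.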

2026-06-18 · ~18 min read

The feature flag decision record: why the flag evaluation mechanism you chose in year one constrains how you do gradual rollouts and A/B testing in year three

Feature flag adoption is treated as a developer experience enhancement, not an architecture decision. The evaluation mechanism is chosen during the first dark launch emergency and rarely documented. Two years later, the flag evaluation model determines whether gradual rollouts are safe under concurrent deployment versions, the flag store design determines whether A/B testing produces reliable impression data, and the absence of a lifecycle policy produces a codebase with 200 flags where 30 are actively used and none can be safely deleted. Five evaluation pattern categories: boolean environment variable flags (requires restart/redeploy to change; the zero-infrastructure entry point that many teams outgrow without documenting the transition); database-backed server-side evaluation (DB as flag store, synchronous request evaluation, caching policy determines propagation delay); SDK with local evaluation (LaunchDarkly, Unleash, Flipt: in-memory flag config synced via streaming, sub-millisecond evaluation latency, bootstrap state behavior is the undocumented constraint on Lambda cold starts); remote evaluation via vendor API call (flag vendor evaluates per-request, each evaluation adds a network round-trip on the critical path, vendor outage blocks flag evaluation); client-side evaluation (flag config in browser, PII in evaluation context exposed in developer tools, personalization without server round-trip). The gradual rollout safety constraint: per-request random selection vs user-ID-consistent hashing. A checkout flow that writes state in one format under treatment and reads in another format under control requires consistent hashing to be safe; with per-request random assignment at a 10% rollout, a user who started on the control path has a 10% chance of being served the treatment path, and so of encountering a format mismatch, on every subsequent request.
The rolling deployment window interaction: two application versions running simultaneously during a Kubernetes rolling update may evaluate the same flag differently if the flag was added after the old version was built. The A/B testing impression recording gap: SDK-based local evaluation tools do not automatically generate impression events — the application must explicitly call the SDK impression API; teams that adopt a flag tool for dark launches and repurpose it for experiments discover the missing impression data at first analysis. Assignment persistence across sessions: user-ID-consistent hashing provides cross-session consistency for authenticated users; cookie-based session UUIDs break when users clear cookies or switch devices. Writing the feature flag ADR: evaluation mechanism with alternatives evaluated and bootstrap behavior documented; user targeting and assignment model with consistent hashing policy and PII restrictions; A/B testing and impression recording policy naming how impressions are generated and where they land; flag lifecycle policy with type definitions, expected lifetimes, ownership assignment, and removal sequence; critical path policy naming which flag types are permitted in the synchronous user-request path. Finding decisions in AI chat: initial adoption sessions (mechanism choice), gradual rollout incident sessions (assignment model constraints), A/B testing sessions (impression recording gap discovery), and flag removal sessions (lifecycle policy constructed reactively).
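The difference between per-request random selection and user-ID-consistent hashing fits in a few lines. This is a minimal sketch under stated assumptions: the function names are hypothetical, SHA-256 is one reasonable hash choice rather than what any particular vendor SDK uses, and buckets are 0.01% wide.

```python
import hashlib
import random


def in_rollout_consistent(flag_key: str, user_id: str, percent: float) -> bool:
    """Deterministic assignment: the same user always gets the same answer.

    The flag key is mixed into the hash so that different flags bucket
    users independently of each other.
    """
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


def in_rollout_per_request(percent: float, rng: random.Random) -> bool:
    """Per-request random assignment: a fresh draw on every call, so
    "10% of users" is really 10% of requests."""
    return rng.random() * 100 < percent
```

Calling `in_rollout_consistent("new-checkout", "user-123", 10)` repeatedly returns one fixed value for that user, while `in_rollout_per_request(10, rng)` can flip between page loads, which is the behavior that breaks flows writing state in one format and reading it in another.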

2026-06-18 · ~17 min read

The logging infrastructure decision record: why the log aggregation tool you chose shapes what questions you can answer when something breaks at 3am

Log aggregation tool selection is treated as infrastructure configuration, not architecture. The tool is chosen once during initial cluster setup and rarely revisited. Two years later, it determines which incident response questions the on-call engineer can answer without a multi-minute scan: field-indexed tools (Elasticsearch, Datadog Logs) allow filter-by-any-structured-field regardless of cardinality; label-indexed tools (Grafana Loki, CloudWatch Logs) index only a fixed set of low-cardinality deployment labels and require full-text content scans for application-level fields like customerId or tenantId. The 3am constraint is discovered in production, not at evaluation time. Five pattern categories: print-to-stdout with external aggregation (Kubernetes-native default; the aggregation tool determines what happens downstream); field-indexed self-hosted aggregation (EFK/ELK: rich query capability, high operational overhead); label-indexed self-hosted aggregation (Loki: low ingestion cost, label schema determines the query surface); SaaS aggregation (Datadog, New Relic: low operational cost, super-linear per-GB cost growth at traffic scale); no centralized aggregation as a deliberate documented choice. The retention cost constraint: log volume grows with traffic; self-hosted Elasticsearch at 100 GB/day with 30-day hot retention costs ~$390/month in NVMe storage alone; Loki at the same volume costs ~$69/month in S3; SaaS tools have a cost crossover point that most teams hit three years after adoption. Structured logging as the aggregation tool's access key: a field-indexed tool paired with unstructured log output is equivalent to full-text search; field naming divergence across services (user_id vs userId vs customerId) produces incident query errors without a documented convention.
The log level contract as the missing policy: WARN misuse (logging routine high-frequency events at WARN to make services visible in production filters) produces alert signal that cannot be trusted; the on-call engineer cannot distinguish a genuine WARN spike from a busy cache layer without a written contract. Writing the logging ADR: aggregation mechanism with alternatives evaluated; structured logging policy with required fields and field naming convention; log level contract with per-level definitions and on-call trigger; retention policy with cost projection at current volume; revisitation conditions naming cost and query latency thresholds. Finding logging decisions in AI chat: four session types — aggregation tool selection (mechanism choice), structured logging adoption (field decisions), incident debugging (revealed query constraints), and cost investigation (retention policy decision under budget pressure).
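The retention cost figures above follow from simple arithmetic once a per-GB price is fixed. The sketch below reproduces them under loud assumptions: the unit prices ($0.13 per GB-month for NVMe-backed storage and $0.023 per GB-month for S3) are back-derived from the quoted totals rather than taken from the post, and replicas, compression, index overhead, and request charges are all ignored.

```python
def retained_gb(gb_per_day: float, retention_days: int) -> float:
    """Steady-state volume held in storage once retention is full."""
    return gb_per_day * retention_days


def monthly_storage_cost(gb_per_day: float, retention_days: int, usd_per_gb_month: float) -> float:
    """Monthly storage cost at steady state for a given unit price."""
    return retained_gb(gb_per_day, retention_days) * usd_per_gb_month


# Assumed unit prices (not from the post); adjust to your provider and region.
NVME_USD_PER_GB_MONTH = 0.13
S3_USD_PER_GB_MONTH = 0.023

elasticsearch_cost = monthly_storage_cost(100, 30, NVME_USD_PER_GB_MONTH)
loki_cost = monthly_storage_cost(100, 30, S3_USD_PER_GB_MONTH)
```

At 100 GB/day and 30-day retention the steady-state volume is 3,000 GB, which gives roughly $390 and $69 per month under the assumed prices. A cost projection in the logging ADR can rerun the same calculation at projected traffic.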

2026-06-18 · ~16 min read

The service mesh decision record: why the inter-service communication infrastructure you chose in year two constrains your observability and zero-trust security posture in year four

Service mesh adoption is rarely treated as an architecture decision; it is treated as a platform upgrade. Two years later, the mesh is load-bearing for observability, security posture, and Kubernetes upgrade compatibility, and no one can reconstruct what is intentional configuration and what is accumulated default. Four pattern categories to document: sidecar-proxy meshes (Istio, Linkerd: every meshed pod carries a sidecar proxy, Envoy in Istio and linkerd2-proxy in Linkerd, that intercepts traffic via iptables); sidecar-less eBPF-based meshes (Cilium, Istio ambient mode: node-level eBPF programs eliminate per-pod sidecar overhead but require kernel ≥5.10); application-layer meshes (mTLS and circuit breaking embedded in each service's SDK); and the "no mesh" posture as a deliberate documented choice. The observability constraint: sidecar proxies emit distributed tracing spans at the network layer but cannot propagate trace context through the application. Trace header propagation remains an application responsibility that the mesh does not eliminate; teams that believe the mesh "handles tracing" discover disconnected root spans rather than end-to-end traces. The zero-trust constraint: permissive mTLS (Istio's default) accepts both mTLS and plaintext, which is authentication theater for any SOC 2 requirement that requires enforcement; strict mTLS migration requires auditing the sidecar exclusion list before flipping the mode or it breaks batch jobs and CronJobs that were intentionally excluded. The sidecar exclusion policy as the most commonly missing section: every production mesh has workloads outside the mTLS perimeter (CronJob pods whose sidecar proxy prevents termination; DaemonSets with host networking); undocumented exclusions are indistinguishable from accidental gaps, producing audit findings regardless of actual coverage.
The Kubernetes version coupling: Istio is explicitly versioned against Kubernetes minor versions; an undocumented version coupling converts a cluster upgrade into an unplanned mesh version assessment. Writing the service mesh ADR: mesh selection with alternatives evaluated and rejected; sidecar exclusion policy naming each excluded workload type with the technical reason and security implication; mTLS mode and authorization policy model with the migration plan from permissive to strict; Kubernetes version compatibility with the upgrade trigger. Finding service mesh decisions in AI chat: four session types — initial evaluation (mechanism selection), CronJob incident sessions (sidecar exclusion policy), mTLS migration sessions (authorization policy model), and upgrade planning sessions (version coupling).

2026-06-15 · ~15 min read

The dependency injection decision record: why the DI pattern you chose shapes the testability, startup cost, and cognitive overhead of every feature you add

Every codebase has a dependency injection pattern. Most teams didn't choose it: it came with the framework, or emerged from whoever wrote the first service class. Once 200 classes are structured around it, the DI mechanism is no longer a preference. It determines which things the test suite can substitute (the testability consequence: constructor injection makes every dependency an explicit, mockable parameter; IoC containers make dependencies implicit and resolution rules load-bearing); how fast the application starts (the startup cost consequence: eager container initialization that is invisible in long-lived servers becomes a Lambda cold-start penalty that can exceed 6 seconds); and how much context a new engineer needs before they can make a confident change (the cognitive overhead consequence: a constructor signature tells you exactly what a class depends on; a container annotation tells you the type and leaves the resolution rules to be discovered). The scope and lifetime decision as the most common source of concurrency bugs: singleton services holding per-request state (current user ID, transaction context) produce intermittent cross-request data leaks under concurrent load, a bug caused by a DI lifetime mismatch rather than by application logic. The injectable boundary decision: making everything injectable produces a container that knows about hundreds of classes that are never substituted, while drawing the boundary tightly around layer-crossing interfaces (database repositories, HTTP clients, message producers) produces a test suite with a clear substitution contract. How the DI mechanism decision becomes irrecoverable: it is embedded in every class constructor or annotation in the codebase, so changing it requires a full codebase migration, which means the initial choice needs a revisitation condition before the accumulation makes it permanent.
Writing the DI ADR: mechanism decision with specific alternatives evaluated and rejected; lifetime policy naming the rule for each service category; injectable boundary naming what crosses a layer and what doesn't; composition root policy naming where registrations live. Finding DI decisions in AI chat: framework evaluation sessions (the mechanism choice); testability problem sessions (the first time the mechanism's test consequences become visible); startup time sessions (the Lambda cold-start discovery); and onboarding sessions where a new engineer's questions reveal the cognitive overhead the mechanism creates.
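
The lifetime mismatch described above can be sketched in a few lines. All class names here (`LeakyAuditService`, `RequestContext`, `AuditService`) are hypothetical, and the two-request interleaving is simulated sequentially rather than with real threads, so treat it as an illustration of the failure shape rather than a reproduction of a production bug.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


class LeakyAuditService:
    """Singleton-lifetime service that wrongly stores per-request state."""

    def __init__(self) -> None:
        self.current_user_id: Optional[str] = None  # shared by every request

    def record(self, action: str) -> str:
        return f"{self.current_user_id}:{action}"


@dataclass(frozen=True)
class RequestContext:
    """Request-scoped value, created once per request."""

    user_id: str


class AuditService:
    """Request-lifetime service; its dependency is an explicit constructor parameter."""

    def __init__(self, context: RequestContext) -> None:
        self._context = context

    def record(self, action: str) -> str:
        return f"{self._context.user_id}:{action}"


def demo() -> Tuple[str, str]:
    # Two requests share one singleton instance and interleave.
    shared = LeakyAuditService()
    shared.current_user_id = "alice"  # request A starts
    shared.current_user_id = "bob"    # request B starts before A finishes
    leaked = shared.record("export")  # request A's action is attributed to bob

    # Same interleaving with one service instance per request.
    service_a = AuditService(RequestContext(user_id="alice"))
    AuditService(RequestContext(user_id="bob"))  # request B gets its own instance
    correct = service_a.record("export")
    return leaked, correct
```

The constructor-injected version also shows the testability consequence: a test builds `AuditService` with whatever `RequestContext` it needs, with no container configuration involved.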

2026-06-15 · ~16 min read

The error handling strategy decision record: why "we'll handle errors properly later" becomes the policy your users experience in production

Every application has an error handling strategy. Most teams chose theirs by not choosing it — "we'll add proper error handling later" became the policy. The implicit default (what happens when you don't handle an error: framework returns 500, exceptions are swallowed, empty arrays are returned on database failure) is the de facto strategy. Four decisions that accumulate into policy: the error surface decision (what users see vs. what engineers see from the same error — and why the boundary is also a security boundary that prevents stack traces from leaking internal paths); the failure mode decision (which operations fail-fast vs. degrade gracefully vs. fail silently — and why "silently" is almost always wrong); the retry and idempotency policy (which operations are safe to retry, what backoff strategy, how idempotency keys prevent duplicate payments and duplicate emails); and the observability contract (what gets logged at what level, what triggers an alert, what is explicitly ignored — and why alert fatigue is a consequence of the undocumented observability contract, not a tooling problem). The error taxonomy problem: without a named taxonomy, teams invent local HTTP status code conventions that conflict — one endpoint returns 400, another 422, a third 200 with an error field — producing inconsistency that frontend engineers must special-case per endpoint. The user-facing error as a product decision: what users see when something fails determines whether they retry, fix their input, or abandon; this copy is almost always written by engineers at implementation time rather than by product at design time. Writing the error handling strategy ADR: error taxonomy with consistent handling per category, failure mode policy with the list of ancillary features exempt from fail-fast, retry and idempotency policy with backoff parameters and retry budget, observability contract with log level trigger conditions and alert thresholds. 
Finding error handling decisions in AI chat: four session shapes — design question sessions where individual choices are first made, production incident sessions where error handling costs become visible, retry and idempotency sessions where the duplicate-action problem surfaces, and observability frustration sessions where alert fatigue is explicitly named.
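
The retry and idempotency policy can be made concrete with a small sketch. Everything named here is an assumption for illustration: `TransientError` stands in for whatever the error taxonomy marks as retryable, `FakePaymentServer` is an in-memory stand-in for a server that deduplicates on an idempotency key, and the backoff parameters are placeholders rather than recommended values.

```python
import time
import uuid
from typing import Callable, Dict


class TransientError(Exception):
    """Hypothetical error category that the policy marks as safe to retry."""


def retry_with_idempotency(
    operation: Callable[[str], str],
    max_attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call operation, reusing one idempotency key across all attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    key = str(uuid.uuid4())  # generated once so the server can deduplicate
    for attempt in range(max_attempts):
        try:
            return operation(key)
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the error
            sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise AssertionError("unreachable")


class FakePaymentServer:
    """In-memory stand-in that charges at most once per idempotency key."""

    def __init__(self) -> None:
        self.processed: Dict[str, str] = {}
        self.charge_count = 0
        self._drop_next_response = True

    def charge(self, key: str) -> str:
        if key not in self.processed:
            self.charge_count += 1
            self.processed[key] = f"receipt-{self.charge_count}"
        if self._drop_next_response:
            # The charge was applied, but the client never sees the response.
            self._drop_next_response = False
            raise TransientError("response lost")
        return self.processed[key]
```

In this sketch the first attempt applies the charge and then loses the response; the retry carries the same key, so the server returns the existing receipt instead of charging twice. Generating a fresh key per attempt would produce the duplicate payment the policy exists to prevent.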

2026-06-15 · ~15 min read

The test strategy decision record: why the testing pyramid your team adopted looks like a preference but acts like a constraint

Every team has a testing pyramid. Most teams didn't choose it — it emerged from whoever set up the CI pipeline first, from the framework the team happened to adopt, from the testing philosophy of the engineer who wrote the first test. The ratio of unit to integration to end-to-end tests then constrains refactoring safety, deployment speed, CI cost, and what the team can honestly promise about production behavior. Three test strategy archetypes worth documenting: the unit-test-dominant pyramid (fast CI, strong protection for internal logic, but structurally incapable of catching mock drift bugs — the real implementation changed, the mock wasn't updated, the tests went green, the production deployment failed); the integration-test-dominant strategy (hits a real database or real service infrastructure, catches the class of bugs that mock-based tests cannot — query plan regressions, constraint violations, transaction isolation failures — but requires container provisioning and longer CI runs); and the E2E-primary strategy (highest-fidelity deployment signal, slowest CI, highest flakiness tax). The mock boundary decision as the most consequential undocumented testing decision: where in the stack do tests switch from real implementations to fakes? The mock boundary determines what the test suite can and cannot tell you about production behavior — and it cascades into CI infrastructure, test data management, and what refactoring is safe without updating tests. The test strategy as a hiring constraint: engineers arriving from TDD backgrounds extend mock-heavy suites further; engineers from integration-test backgrounds add real-database tests; engineers from E2E backgrounds add browser tests; without a documented strategy, three overlapping strategies accumulate in one suite rather than one coherent approach. 
Writing the test strategy ADR: pyramid ratio decision with reasoning, mock boundary decision naming exactly where real implementations give way to fakes, deployment confidence threshold, refactoring safety guarantee, and revisitation conditions. Finding test strategy decisions in AI chat: four session shapes — setup sessions in project week one where framework choices locked in the strategy before it was consciously evaluated; mock boundary debates where the team argued about database mocking; CI speed sessions where the strategy's cost first became explicit; and production escape sessions where the test suite's coverage gap was named directly.

2026-06-15 · ~14 min read

The rejected pattern: why the programming patterns your team didn't adopt deserve a decision record

A library that was rejected still leaves a checkable absence: it isn't in package.json. A pattern that was rejected leaves nothing to check. The codebase that doesn't use decorators for dependency injection looks identical whether the team spent two weeks evaluating and rejecting them or whether decorators were never considered. Three pattern rejection categories worth documenting: the evaluated-and-explicitly-rejected pattern (the team looked at decorator-based DI, ORMs, or event-driven state management and chose not to adopt them, with specific reasoning that disappears when the evaluating engineers leave); the deliberately excluded pattern as a categorical standards decision ("no ORM," "no global state," "no magic": a documented constraint that can be cited in code review, versus an undocumented preference that erodes as each new engineer adds one exception that seems obviously justified); and the adopted-and-then-de-standardized pattern (a partial migration where old and new patterns coexist without documentation explaining the direction of travel, so it looks like inconsistency rather than an in-progress architectural shift). The consistency compulsion: new engineers import patterns from prior codebases, and absent a record, the current codebase's pattern absence looks like a gap to close rather than a decision to respect. The technical debt misread: a new technical leader who sees raw SQL everywhere in 2026 reads it as a legacy choice if there is no record, and reads it as a deliberate stance if there is. The anti-pattern ADR: categorical rejections produce a document type distinct from the standard ADR, one that documents not what was chosen but what is not acceptable and what the team uses instead. The de-standardization migration record: the most urgent pattern documentation because the mixed state actively misleads new engineers about which pattern is acceptable for new code.
Writing the rejected pattern ADR: Context names the pattern and why it was under consideration, Alternatives Considered names the evaluated pattern with specific concerns, Decision names what was chosen instead with the constraint that drove it, Revisitation Condition names concrete triggers for re-evaluation. Finding pattern rejections in AI chat: evaluation sessions that conclude negatively are harder to anchor than dependency rejections because the pattern has no canonical search term — the quarterly review pass is the most reliable extraction mechanism.

2026-06-14 · ~16 min read

The multi-tenancy decision record: why the isolation model you chose in year one constrains every enterprise deal in year three

Every SaaS product has a multi-tenancy isolation model. Most chose it implicitly: a tenant_id column added because it was the simplest way to separate customer data when there was only one paying customer and the question of tenant isolation felt theoretical. Three categories of multi-tenancy decisions worth documenting: the isolation model (row-level security vs. schema-per-tenant vs. database-per-tenant — the mechanism determines the blast radius of a cross-tenant leak, the per-tenant backup and restore capability, and what the enterprise sales team can promise a prospect); the data residency decision (which regions tenant data can live in — single-region deployment is a data residency policy with GDPR Article 44 consequences for European customers; the isolation model constrains whether per-tenant regional routing is achievable without a migration); and the compliance certification scope (what certifications the isolation model supports — SOC 2 with RLS, HIPAA BAA, FedRAMP — and what the dedicated-instance path looks like for certifications the current model cannot satisfy). The "default" pathology: the team that added tenant_id without documenting the trigger for reconsidering, added row-level security under audit pressure without evaluating database-per-tenant, then lost an enterprise deal when the security questionnaire asked "do you offer dedicated database instances?" Finding multi-tenancy decisions in AI chat: three session types — early database design ("how do I separate customer data in Postgres?"), security review sessions ("is there any risk of cross-tenant data leaks?"), and enterprise deal sessions ("the customer wants their own database — how hard would that migration be?") — the enterprise deal sessions are the highest-value extraction targets because they contain the first honest gap analysis between the current isolation model and enterprise requirements.

2026-06-14 · ~14 min read

The data retention decision record: why "we keep everything forever" is an undocumented retention strategy

Most engineering teams discover their data retention policy during a GDPR audit, a legal hold request, or the quarter when infrastructure costs spike 40% and someone asks why the user_events table is 800 gigabytes. Three categories of retention decisions worth documenting: the deletion policy (what gets deleted, when, by what mechanism (hard delete, soft delete, or anonymization), and who triggers it; the mechanism choice determines whether a GDPR right-to-erasure request is a 20-minute operation or a multi-day archaeological project); the backup and archival retention period (how long backups live and whether archives are subject to the same deletion policy as the primary store); and the compliance scope decision (which tables contain personal data, which regulation applies, what the minimum and maximum retention periods are). The "default" pathology: the audit log created to meet a 90-day retention minimum for SOC 2 that becomes permanent by default; the user activity table kept because "storage is cheap" that grows 5 GB per month and becomes a GDPR compliance surface nobody classified as personal data; the chat transcript archive retained "for the life of the account" when nobody defined what that means. Writing the retention ADR: data classification inventory as prerequisite, compliance scope statement naming which regulations apply to which tables, legal hold exception policy, and a revisitation condition tied to infrastructure cost thresholds or GDPR territory expansion. Finding retention decisions in AI chat: three session types: early database design ("should we soft-delete or hard-delete users?"), GDPR compliance preparation ("what personal data do we store?"), and infrastructure cost review ("our S3 costs are growing 20% per month, can we archive older data?").

2026-06-14 · ~15 min read

The API versioning decision record: why "we'll cross that bridge when we come to it" is an undocumented versioning strategy

Every API has a versioning strategy — even "no versioning" is a strategy. The decision that determines the cost of every future breaking change is usually made by default, undocumented, and invisible until the first consumer breaks. Three categories of API versioning decisions worth documenting: the strategy choice (URL versioning vs. header versioning vs. no versioning — each encodes a different set of consumer commitments at a different level of visibility); the breaking change definition (what counts as breaking versus backwards-compatible — the missing definition that produces inconsistent API evolution and surprised consumers when different engineers apply different standards to the same type of change); and the deprecation and migration timeline (how long old versions live, how consumers learn about deprecation, what migration support the team provides). The "just in case v1" problem: most teams add /v1/ because it seems like good practice without documenting what condition would trigger a /v2/ — the trigger condition is the entire decision. The breaking change definition gap: a field type change from string to uppercase enum is breaking or not depending on the consumer's comparison logic, and "probably backwards-compatible" is not a versioning policy. The consumer landscape as context: the same versioning strategy that is appropriate for an internal API serving two frontend applications is inappropriate for a public API serving hundreds of third-party developers — and the ADR must name the consumer landscape so a future team can evaluate whether the strategy still fits. Writing the API versioning ADR: Context must include the consumer landscape; Alternatives Considered names each versioning mechanism evaluated; the breaking change definition section is the section most commonly omitted; the deprecation policy sets the terms of the contract. 
Finding API versioning deliberations in AI chat: they appear at two predictable points — the API design session that precedes the first external endpoint (the alternatives evaluation) and the first breaking change deliberation (the implicit breaking change definition produced through deliberation).
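
The breaking change definition gap is easy to show with the string-to-uppercase-enum example. The two consumer functions and the `status` field are hypothetical; the sketch only illustrates that the same producer change is breaking for one consumer and harmless for another.

```python
def consumer_strict(payload: dict) -> bool:
    """Consumer that compares the raw string exactly."""
    return payload["status"] == "active"


def consumer_tolerant(payload: dict) -> bool:
    """Consumer that normalizes case before comparing."""
    return payload["status"].lower() == "active"


before = {"status": "active"}  # current shape: lowercase string
after = {"status": "ACTIVE"}   # proposed shape: uppercase enum value

# Which consumers keep working after the change?
results = {
    "strict": (consumer_strict(before), consumer_strict(after)),
    "tolerant": (consumer_tolerant(before), consumer_tolerant(after)),
}
```

Here the strict consumer stops recognizing active records after the change while the tolerant one is unaffected, which is why a written definition has to say whose comparison logic counts.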

2026-06-14 · ~14 min read

The performance optimization decision record: why the "we added a cache" decision is not self-documenting

The Redis cache your team added three years ago has no ADR. The commit says "add caching for user profile queries." What it doesn't say: what the query latency was before the cache, which load condition made it necessary, what TTL was chosen and why, or whether the cache is still needed now that the query has been rewritten. Three categories of performance decisions worth documenting: caching decisions (Redis, Memcached, in-process — the cache is visible in the stack but the decision that put it there is not); index decisions (accumulated reactively for queries that may have since changed; each unused index adds write overhead on every insert); and query optimization decisions (the N+1 fix, the denormalized column, the materialized view — each looks like ordinary implementation but represents a performance trade-off with ongoing maintenance obligations). The baseline problem: "we added a cache because the query was slow" — slow at what traffic, with what data volume? The "is this still needed?" question: the cache added for a query pattern that has since been rewritten, the index added for a report that was deprecated, the denormalized column added for a product feature that no longer exists — each accumulates maintenance cost without a record that would allow it to be reconsidered. Writing the performance optimization ADR: Context must include the baseline measurement and load condition; Alternatives Considered must name the alternatives and the specific reason each was not chosen; Revisitation Condition must name checkable triggers for when the optimization should be removed. Finding performance deliberations in AI chat: performance debugging sessions are among the most structured patterns — symptom, query plan, root cause, alternatives, recommendation, verification — and the verification session contains the post-optimization measurement that completes the decision record.

2026-06-14 · ~14 min read

The build vs. buy decision record: why the make-or-buy choice is the hardest to document honestly

The build vs. buy decision is the one engineering teams most reliably document with the wrong reason. The ADR says "vendor lock-in concerns and need for customizability." What it doesn't say: Auth0's multi-tenant tier was $1,800 a month at projected MAU; the lead engineer had built JWT systems before; the CEO didn't want to pay a vendor for something two engineers could build. Three categories of build-vs-buy decisions worth documenting: feature-level (authentication, search, payment processing — made early, become load-bearing, and inherited by every new CTO who asks "why aren't we using Auth0?"); capability-level ("we'll use Datadog when we're bigger" — the "not yet" condition disappears from the ADR even though it was the entire decision); and platform-level (the vendor market that moves fastest while the internal implementation accumulates switching cost). The vendor evaluation spreadsheet problem: three AI chat sessions, a four-tab comparison matrix, and an engineer who left eighteen months later — the spreadsheet is gone but the build decision it produced is load-bearing infrastructure. The build decision cascade: the first ADR opens fifty implicit downstream decisions about token format, session storage, and multi-tenant isolation, each of which accumulates switching cost independently. The lock-in rationalization in detail: the test for whether vendor lock-in was the actual constraint, why pricing and team familiarity are the real drivers in most early-stage builds, and why the honest record is harder to write but more useful for the re-evaluation in year three. The re-evaluation asymmetry: the original decision was made at zero switching cost; the re-evaluation must include the switching cost accumulated by three years of downstream implementation decisions — and that calculation is only possible if the build ADR was written honestly enough to name the cascade it opened. 
Writing the build-vs-buy ADR: specific Alternatives Considered (named vendors, actual pricing, specific capability gaps) and a Revisitation Condition that is checkable rather than general. Finding build-vs-buy deliberations in AI chat: they are multi-session, multi-round discussions (the final "we'll build it" is rarely in one session), and the extractor identifies them through the vendor-name comparison pattern combined with elimination markers across the temporal cluster.

2026-06-13 · ~14 min read

The interface decision: why the contracts between your components deserve their own records

Every public interface is a promise. A REST endpoint promises a JSON response shape. A gRPC service promises a Protobuf contract. A message queue event schema promises a payload format. The promise constrains every consumer, and the choice of what to promise — which protocol, which fields, which versioning strategy — is a decision made against alternatives that almost never gets documented. Three types of interface decisions worth documenting: transport and protocol decisions (REST vs. gRPC vs. GraphQL — the consumer profile is the decisive constraint and is completely invisible from the interface itself); schema and shape decisions (envelope vs. root-level, cursor vs. offset pagination, timestamp format — incremental extensions accumulate as a palimpsest of individual choices nobody made explicitly); versioning strategy decisions (URL versioning vs. header versioning vs. no strategy — the decision that determines the cost of every future breaking change, often made by default). The consumer-producer asymmetry: the producer knows which behaviors are intentional promises and which are implementation details; consumers infer intent from observed behavior; without a record distinguishing them, producers cannot change implementation details without breaking consumers who treated them as promises. The incremental extension problem: each backward-compatible addition felt like a non-decision, but collectively they represent structural choices nobody made explicitly. The REST/gRPC/GraphQL decision: the most common technical architecture discussion in AI chat and the most consistently undocumented — the consumer profile, tooling constraints, and capability requirements that drove the selection are fully present in the session and exist nowhere else.

2026-06-13 · ~13 min read

The rejected dependency: why the libraries you didn't install deserve a decision record

The evaluated-but-rejected library is the most invisible entry in your dependency graph. It isn't in package.json, it doesn't appear in any audit, and the deliberation that produced the rejection exists only in someone's AI chat history from fourteen months ago — until a new engineer joins and proposes the same library again. The dependency re-proposal cycle: proposers have done fresh research; current team members reconstruct reasoning from memory; the asymmetry consistently favors the proposal. Five rejection reasons worth documenting: the security-surface reason (what the library brings in — a specific transitive dependency concern or supply chain risk that may change when the library drops it); the maintenance-burden reason (release cadence, maintainer responsiveness, bus factor, evaluated and found to be a risk at a specific point in time); the native capability sufficiency reason (the native API covered the use cases — the absence of the popular library looks like an oversight to a new engineer); the deliberate dependency diet (explicit architectural policy of minimizing dependency count for auditing, deployment constraints, or supply chain surface); and the internal implementation decision (library rejected because the team built the same capability — without a record, the custom implementation looks like accidental technical debt and a new engineer proposes replacing it with the library that was originally rejected). The rejection record format: title in decision-statement form ("Chose native Temporal API over date-fns for date handling"), Context naming the problem and the evaluated library, Alternatives Considered naming the library and what was chosen instead with the specific rejection constraint, Consequences naming the trade-offs accepted, Revisitation Condition naming the specific trigger for re-evaluation. 
How to find rejected dependency decisions in AI chat history: sessions that start as installation or comparison questions and conclude with a non-installation have a characteristic shape detectable through negative decision markers; the quarterly extraction pass surfaces them systematically.

2026-06-13 · ~13 min read

The dependency upgrade decision record: documenting the 'why now?' of a breaking migration

Dependency upgrades are almost universally treated as technical tasks, not decisions. What gets committed is "upgrade react to 18.3.1." What disappears is the reasoning — why this quarter rather than last or next, whether pinning and patching was considered, what the gap cost calculation showed, which adjacent decisions the migration forced about rendering patterns or module system boundaries. Three distinct decisions are embedded in every breaking upgrade: the timing decision (the trigger that made this quarter right), the alternatives decision (pin-and-patch, fork, replace, or wrap-and-decouple), and the forced adjacent decisions made inline during migration. The 'why not defer?' question is the dependency management equivalent of the "not building this" record — a deferral without a named revisitation condition becomes permanent by default. The upgrade-vs-replace threshold: the comparison becomes live when migration cost exceeds replacement cost for the same capability, and this evaluation almost never gets written down. How AI chat migration deliberations are among the highest-confidence extraction targets: engineers frame upgrade evaluations as explicit comparison questions from the first message, producing sessions that contain the timing trigger, alternatives, and adjacent decisions in a single recoverable artifact.

2026-06-13 · ~13 min read

ADRs for platform teams: how infrastructure decisions become constraints for product teams

Platform teams make decisions for dozens of product teams who weren't in the room. Standard ADR templates assume the reader was present for the decision. Platform ADRs need four additional sections that product ADRs rarely require: a blast radius (which teams and services does this constrain, and what does it explicitly not constrain), an interface contract (what does compliance actually look like from the outside — not "we chose Kafka" but the bootstrap address, the topic provisioning process, the consumer group naming convention, and the responsibility split), a consulted vs. informed record (the political legitimacy signal — which teams had input before the decision, which were notified after), and a revisitation condition (named triggers under which the platform team will re-evaluate the constraint). The constraint propagation problem: an undocumented naming convention propagates through inference across fourteen services over two years; a blast radius section at decision time converts the implicit constraint into an explicit one before the inference chain forms. The temporal mismatch: platform ADRs are written at maximum context and read at minimum context, months later by engineers encountering a constraint they didn't author. Why platform deliberation sessions in AI chat are particularly valuable for extraction: the blast radius section comes from which teams were explicitly named in the original deliberation, and that organizational context is the first thing lost from memory.

2026-06-13 · ~12 min read

The ADR title convention: why the title should be a decision statement, not a topic

Most ADR titles are topics, not decision statements. "PostgreSQL as the primary database" is a topic. "Chose PostgreSQL over MongoDB for primary persistence" is a decision statement. The difference determines whether the record is findable at the list level: whether a new engineer scanning the decisions directory can answer "has this comparison been made?" without opening each file. Covers the anatomy of a decision statement title (verb + subject + rejected alternative + context), how the verb encodes the decision type (chose / rejected / deferred / kept / adopted), why the rejected alternative belongs in the title and is the part most frequently omitted, the five anti-patterns that produce topic titles, the filename constraint, the MADR convention trade-off (noun-phrase titles for supersession stability vs. decision statement titles for retrieval), and why the title is the right diagnostic for whether the body is done.
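
The title anatomy lends itself to a mechanical check. The sketch below is one possible lint rule, not a convention the post prescribes: the function name `is_decision_statement` is hypothetical, and requiring the word "over" to introduce the rejected alternative is a simplifying assumption that will miss valid titles phrased differently.

```python
import re

# Decision verbs taken from the list above.
DECISION_VERBS = ("chose", "rejected", "deferred", "kept", "adopted")

# Assumed rule: a decision verb first, then a subject, then " over "
# followed by the rejected alternative (and optional context).
TITLE_PATTERN = re.compile(
    r"^(?:" + "|".join(DECISION_VERBS) + r")\s+.+?\s+over\s+.+$",
    re.IGNORECASE,
)


def is_decision_statement(title: str) -> bool:
    """Return True if the title has the verb + subject + 'over' + alternative shape."""
    return TITLE_PATTERN.match(title.strip()) is not None


titles = [
    "PostgreSQL as the primary database",
    "Chose PostgreSQL over MongoDB for primary persistence",
]
flags = [is_decision_statement(title) for title in titles]
```

Run against a decisions directory listing, a check like this flags topic titles at review time, before they become unfindable at the list level.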

2026-06-12 · ~12 min read

How post-mortems and ADRs work together: using incident history to fill the decision log

Post-mortems and ADRs are two separate processes that belong together. The constraint that broke in production was almost always implicit before the incident. The post-mortem surfaces it at peak clarity, but most teams file the learning in an incident ticket where future architectural decisions can't cite it. Three ADR types that post-mortems produce: the original decision ADR (documenting the architectural choice that created the failure mode), the deferral ADR (making explicit the implicit "not yet" decision that allowed the vulnerability to accumulate), and the incident response ADR (the rollback vs. hotfix decision with the actual constraint stated). The 48-hour window: how quickly post-mortem clarity degrades from precise constraint understanding to abstracted lesson. How to use the post-mortem timeline as a date range for the WhyChose extractor: running it on AI chat history from when the original decision was made surfaces the deliberation that was live at the time, which is more accurate than memory. How to structure the post-mortem template to make ADR production a named deliverable, not an oversight.

2026-06-12 · ~13 min read

The ADR as a forcing function: how writing the record changes the decision

Writing an ADR doesn't just document a decision — it changes the quality of the decision itself. Three mechanisms: the Alternatives Considered section forces you to name options you dismissed without full evaluation; the constraint statement forces you to distinguish the constraint that was actually decisive from the ones that felt decisive; the Consequences section forces you to name what you gave up. The blank template is the most useful thinking tool in the room, and most teams never use it that way because they open it only after the decision is made. Covers the blank ADR as a pre-decision agenda, the ADR-first meeting pattern, what retrospective ADRs reveal as a decision quality audit, how AI-chat-based decisions change the forcing-function dynamic, and the second-order effect: teams that write ADRs consistently report making better decisions, not just better-documented ones.

2026-06-12 · ~14 min read

The startup founder's ADR starter pack: 12 decisions every early-stage company should document

Year one is when the most consequential decisions get made and the least documentation happens. Twelve categories ordered by the cost of not having them: tech stack foundational choices, ICP decision (including who you decided NOT to build for), first pricing structure, database and persistence, auth/identity approach, founding-team authority structure, first hire prioritization, infrastructure and deployment, open source vs. closed, monetization model, key deferral decisions (deliberate "not yet" choices with implicit trigger conditions), and the first "no" to a paying customer. For each: what the record looks like, why the absence is expensive, and what AI chat history from year one is likely to contain. Plus the three-sentence minimum viable ADR format, the extraction pass for recovering decisions that were never written down, and the quarterly founding-team review ritual that keeps the starter pack current.

2026-06-12 · ~13 min read

The cost of a wrong constraint: when an ADR documents a constraint that was never real

The constraint field is the most load-bearing element of any ADR — every downstream decision that references it is subtly wrong when the original constraint was misidentified. Three patterns that produce false constraints: the misremembered constraint (clean precision that came from a different project), the rationalized constraint (team familiarity reframed as an operational complexity argument), and the absent constraint (the real driver left implicit because it felt too informal to document). A worked case — the 5ms latency requirement that was actually a team familiarity constraint — and how it propagated through three subsequent decisions over eighteen months. The detection test: run the WhyChose extractor on the original AI chat history and check whether the documented constraint appears in the actual deliberation. The correction protocol: a dated Notes amendment, not an in-place edit, so the historical record stays intact while the informational status of the constraint becomes accurate.

2026-06-12 · ~14 min read

ADR tooling for 2026: Log4Brains, adr-tools, and how to pick the right approach

Most teams that try to adopt architecture decision records stall not on the format question but on the tooling question. A practical comparison of the three main approaches — adr-tools CLI (fast write, no browser view, bash-only), Log4Brains documentation site (searchable static site, non-engineer-readable, Next.js maintenance burden), and the MADR directory approach (zero dependencies, survives any build-chain change, findability caps at ~80 records). Where each breaks, the choice framework by team type (small/single-repo vs. non-engineer readership vs. multi-repo vs. wiki-first), and the shared gap: all three tools assume the reasoning is already available when the record is written — but the deliberation that produced most decisions happened in AI chat sessions that none of these tools can read.

2026-06-12 · ~16 min read

The last 90 days: why recent AI chat history is your team's highest-value undocumented asset

The decisions most worth documenting are the ones your team made last month, not the ones from two years ago. Recent AI chat history degrades fastest because the three types of context that make a decision record interpretable (explicit reasoning, implicit background constraints, interpersonal context) are all still recoverable inside the 90-day window and become rapidly inaccessible after it. The team knowledge gap isn't about personnel turnover; it's about divergent recollection. Each team member participated in a different part of the AI-assisted deliberation (the engineer evaluated technical feasibility, the PM evaluated user impact, the EM evaluated cross-team dependencies), and none has the complete picture. Four decision types concentrate in the 90-day window: scope decisions made under time pressure, technical approach decisions made during implementation, constraint decisions made when requirements changed, and deferral decisions with implicit conditions. How to run the 90-day extraction pass for a 5-person team (2–3 hours, 15–25 records), why to check for divergent recollections before writing records, and why the quarterly rhythm, not a one-time project, is what closes the knowledge gap permanently.

2026-06-11 · ~15 min read

The startup decision log: what the first year looks like when you build with AI

Solo founders make 3–5 consequential decisions a week in AI chat sessions. After one year, those decisions are invisible, scattered across hundreds of tabs that were closed the moment the thinking was done. A WhyChose extraction pass on twelve months of exports reveals which categories dominate (stack decisions 30–40%, ICP definition 20–25%, pricing iterations 15–20%, scope deferrals 15–20%), which decisions you thought were temporary but are now load-bearing, and why the first-hire onboarding problem starts the day you open your first ChatGPT tab. Includes a practical guide to running the extraction pass: how to export from both platforms (ChatGPT and Claude), how to anchor dates from git history, how to triage by category and permanence, and why the ICP records should be written before anything else.

2026-06-11 · ~14 min read

How engineering managers use decision logs: the 1:1 documentation use case

Engineering managers inherit decisions they didn't make and answer for them as if they did. A decision log changes how the "why did we build it this way?" question gets answered in 1:1s. Four EM use cases: the pre-1:1 brief (pull the two or three decisions most relevant to what each engineer is currently working on before the meeting — turns reactive hedging into prepared context), the onboarding fast-path (a curated reading list of the 8–12 most consequential decisions in a new engineer's domain, shared before their first day), the performance conversation (decision records that give specific feedback grounded in artifacts rather than impressions — essential for engineers doing infrastructure or platform work whose output isn't user-facing features), and the promotion case (every claim in the brief maps to a specific record — the difference between "Alice showed good architectural judgment" and a structured argument with citations). Why EMs should also run the WhyChose extractor on their own AI chat exports, which typically surface a different category of decisions than IC exports: technical debt triage, process adoption, resource allocation reasoning. The private-vs-shared log boundary, and the three curator responsibilities that make ADRs a living practice rather than a documentation backlog.

2026-06-11 · ~13 min read

The hiring ADR: what to capture when you're choosing between candidates — and what to leave out

Hiring calls are the most politically sensitive category of decision record. The candidate comparison doesn't go in git. The role design, capability gap, and team structure decision do. Three categories worth documenting: role prioritization decisions (why this hire now instead of alternatives — these encode product thesis and execution priorities), role design decisions (full-time vs. contractor, senior vs. two mid-levels, specialist vs. generalist — these expire as the team grows and need review triggers tied to team size thresholds), and team structure decisions (org topology, reporting lines, decision authority distribution — organizational architecture that outlasts individuals). The candidate-anonymous format that makes hiring ADRs safe for a technical git repository, what goes in the record vs. what stays private, review triggers for hiring decisions, and how to recover hiring deliberation from AI chat history using offer letter dates to narrow the export window.

2026-06-11 · ~13 min read

Decision records for infrastructure-as-code: ADRs alongside Terraform and Kubernetes config

IaC config tells you what the infrastructure looks like. It doesn't tell you why prevent_destroy is set, why workspaces were chosen over folder-per-environment separation, or why Kyverno won the admission controller evaluation. Five categories that consistently need documentation: state management decisions, module boundary decisions, security and lifecycle decisions, provider and service decisions, and naming conventions. Where ADRs live relative to .tf files and Helm charts, Terraform-specific patterns (prevent_destroy audits, provider version pinning, module DRY vs. explicit trade-offs), Kubernetes-specific patterns (CRD adoption, namespace architecture, admission controller decisions, Helm vs. Kustomize), how to extract the deliberation-before-migration from AI chat history using git blame to narrow the export window, and review triggers for IaC decisions that expire with provider pricing changes, API deprecations, and team growth thresholds.

2026-06-11 · ~11 min read

The ADR review checklist: what to verify before merging

Most ADR reviews are format reviews — checking that sections exist and the title is descriptive. Format reviews produce correctly structured records with thin reasoning. A decision review checks whether the ADR captures the three things that make it useful in 18 months: named alternatives with concrete rejection reasons, the constraint that drove the choice, and honest consequences that name a real trade-off. Five checks an ADR author runs before opening the PR, three checks a reviewer runs, and why Alternatives Considered is the most consistently under-populated field in any decision log — plus how running the WhyChose extractor on the AI chat session that preceded the ADR produces better alternatives sections with less reconstruction effort.

2026-06-10 · ~12 min read

The monorepo decision log: how teams with shared infrastructure document decisions that span multiple services

A monorepo collapses the team-boundary signal that polyrepos provide for free. In a polyrepo, the repo boundary tells you whose decision it was. In a monorepo, three services' code coexists in the same commit history, and the decision about which service owns a shared concern may predate any of the affected code. Three categories of monorepo decision with different scoping, governance, and folder placement: service-local (one team, one service folder), cross-service (peer teams, downstream stakeholders field), and platform-wide (platform team, root decisions/ folder, all service teams downstream). How the unified commit history narrows the extraction window for AI chat recovery, why shared library decisions are the highest-priority documentation gap in any monorepo, and the ADR folder structure that makes decision scope visible from the file path.

2026-06-10 · ~12 min read

The distributed team ADR: how async-first organizations document decisions without a shared whiteboard

Distributed teams write more than co-located teams and document decisions worse. The artifacts — Slack threads, PR comments, RFC docs, AI chat sessions — are process artifacts that capture conversation, not decision artifacts that capture named alternatives, the constraint that drove the selection, and the reasoning a new team member needs. Three conventions that fix this for async-first teams: the async RFC pattern (RFC document with 48–72 hour comment window, decision summary appended at close — RFC and ADR live in the same document), the decision channel (a dedicated registry where every closed decision gets a one-paragraph announcement with a link to the full record), and the multi-engineer quarterly extraction pass (every team member exports AI chat, output is pooled and triaged async).

2026-06-10 · ~12 min read

The product decision ADR: why product, hiring, and process decisions belong in the same log

Architecture decision records got their name from software architecture, but the decisions that cause the most expensive confusion at early-stage companies are product bets, pricing calls, hiring decisions, and process choices. They share the same three properties that make technical decisions worth documenting. Why the standard ADR format works without modification, the five categories worth logging (product scope bets, pricing and packaging, hiring and team structure, process adoption, market and ICP), and why AI chat exports make recovery more tractable for non-technical decisions than for technical ones: founders frame product deliberations as decisions from the first message.

2026-06-10 · ~11 min read

Cross-team decisions: when one team's ADR creates another team's constraint

The most expensive class of undocumented decision is the cross-cutting one — API contracts, shared schemas, event formats — where one team's architecture choice silently creates obligations for a different team. What makes a decision cross-team rather than service-local, the three fields that standard ADR templates omit (downstream stakeholders, notification record, migration obligation), the governance ceremony that prevents constraint collisions (RFC before ADR, not ADR then notification), and how to recover cross-team decision deliberation from AI chat history when the original design review happened in a Zoom call that wasn't recorded.

2026-06-09 · ~11 min read

The onboarding use case: how new engineers use a decision log on day one

The new-CTO onboarding problem is the most vivid version of the gap, but the everyday version affects every engineer who joins a team: the "why is it built this way?" questions that nobody can answer confidently because the reasoning happened in AI chat. What the first week looks like when a decision log exists vs. when it doesn't, the four categories of questions new engineers always ask that wikis and READMEs never answer, how to structure the log so it's navigable without prior codebase knowledge, and the onboarding reading list — a curated ten-record subset that turns the first week from inference to reference.

2026-06-06 · ~11 min read

How many decisions should an engineering team make per quarter — and what does "too few" look like?

Most teams have either an overcrowded decision log or an empty one. The benchmarks: 3–6 per quarter for very early teams, 5–10 for small teams in active development, 8–15 per squad at scaling stage. "Too few" is a diagnostic signal — almost always a dormant log, not a quiet team. The difference between a quiet quarter (low activity, low decisions) and a dormant log (high activity, near-empty log), what decision debt costs and how it compounds, and how to use extraction data to measure the gap between decisions made and decisions documented.

2026-06-06 · ~10 min read

When to supersede vs. deprecate an ADR: the decision record lifecycle

An ADR written in year one doesn't stay accurate forever. The library gets replaced. The team structure changes. The product pivots. The lifecycle states (Proposed, Accepted, Deprecated, Superseded), when to use each, how to write the links correctly so both records stay navigable, why you should never delete, and what AI chat extraction reveals about how fast decisions actually go stale — technology choices last about eight months before engineers start questioning them; architectural invariants last eighteen months or more.
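
The supersede link described above can be sketched in a few lines. This is a minimal illustration, not the post's own tooling: the `Record` fields (`supersedes`, `superseded_by`) and the `supersede` helper are hypothetical names chosen for the example. Only the four lifecycle states come from the summary.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    PROPOSED = "Proposed"
    ACCEPTED = "Accepted"
    DEPRECATED = "Deprecated"
    SUPERSEDED = "Superseded"


@dataclass
class Record:
    # Field names are assumptions made for this sketch.
    number: int
    title: str
    status: Status = Status.PROPOSED
    supersedes: Optional[int] = None
    superseded_by: Optional[int] = None


def supersede(old: Record, new: Record) -> None:
    """Link both directions so either record leads to the other.

    The old record is kept and relabeled; nothing is deleted.
    """
    old.status = Status.SUPERSEDED
    old.superseded_by = new.number
    new.supersedes = old.number
    new.status = Status.ACCEPTED


old = Record(7, "Use library A for queueing", Status.ACCEPTED)
new = Record(19, "Use library B for queueing")
supersede(old, new)
print(old.status.value, "->", old.superseded_by)
print(new.status.value, "<-", new.supersedes)
```

Writing the link on both records is the point of the sketch: a reader who lands on either one can reach the other without searching.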

2026-06-06 · ~10 min read

Decisions that never get written down: the "not building this" record type

The most valuable and most commonly missing entry in a decision log is the deliberate choice not to build something. "Yes" decisions leave artifacts — code, PRs, deploys. "No" decisions leave nothing except the reasoning, which lives in a chat session that's impossible to search eight months later. Why AI chat captures rejection reasoning better than ADR tooling, what a "not building this" record looks like, the deferred-vs-permanent distinction that determines whether to write a revisit condition, and how to find these records in the quarterly triage pass.

2026-06-06 · ~10 min read

The quarterly decision review: a 30-minute ritual for engineering teams

You extracted your first batch of decisions from AI chat. Now what? The exact 30-minute workflow: request your export the day before, run the extractor, triage the output into four buckets (Promote / Link / Park / Dismiss), write up the 2–5 records that rise to ADR level, archive and set the next quarter's reminder. Plus: which records actually warrant the full ADR treatment, three anti-patterns that kill decision logs, and what the second quarterly review looks like once you have an existing log to check against.

2026-06-05 · ~10 min read

I built an open-source tool to extract decisions from ChatGPT/Claude. Here's every regex I used — and every one I had to throw out.

The engineering story behind the WhyChose extractor: five heuristics that went in the bin (sentence-length thresholds, named-entity recognition, first-person verbs in isolation, message-count filtering, Q&A adjacency), four patterns that survived (question shapes, user commit phrases, trade-off markers, reversal markers), and what each failure mode revealed about how engineers actually think with AI. The two-pass architecture that makes it work — and why 3.1% is the right hit rate.
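
To make the four surviving pattern families concrete, here is a rough sketch of what a two-pass match might look like. Every regex below is an illustrative guess, not one of the extractor's actual patterns. The split into "tag each message, then check the conversation shape" is also an assumption about what "two-pass" means.

```python
import re

# Hypothetical patterns, one per pattern family named in the summary.
# These are NOT the regexes the WhyChose extractor ships.
PATTERNS = {
    "question_shape": re.compile(r"\bshould (?:i|we)\b.*\bor\b", re.I),
    "commit_phrase": re.compile(r"\b(?:let's go with|we will use|i will go with)\b", re.I),
    "tradeoff_marker": re.compile(r"\b(?:trade-?off|downside|at the cost of)\b", re.I),
    "reversal_marker": re.compile(r"\b(?:on second thought|scratch that)\b", re.I),
}


def tag_message(text: str) -> set[str]:
    """Pass 1: return the pattern families that match one message."""
    return {name for name, rx in PATTERNS.items() if rx.search(text)}


def is_candidate(messages: list[str]) -> bool:
    """Pass 2 (assumed): a question shape followed later by a commit phrase."""
    asked = False
    for text in messages:
        tags = tag_message(text)
        if asked and "commit_phrase" in tags:
            return True
        if "question_shape" in tags:
            asked = True
    return False


chat = [
    "Should we use Postgres or DynamoDB for the event store?",
    "Postgres gives you joins; the downside is manual sharding later.",
    "OK, let's go with Postgres.",
]
print([sorted(tag_message(m)) for m in chat])
print(is_candidate(chat))
```

The second pass is what keeps a stray "let's go with it" from counting on its own: a commit phrase only matters when a choice was actually posed earlier in the same conversation.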

2026-06-05 · ~7 min read

The MADR 4.0 spec in 15 minutes

MADR (Markdown Any Decision Records) is the ADR format the ecosystem has converged on: YAML frontmatter, a formal Considered Options section, and a structured Decision Outcome with an explicit "because" clause. What it adds over Nygard's original five-section format, what changed in 4.0 vs 3.x, when to use MADR vs Nygard, how AI chat output maps naturally onto MADR structure, and the tooling stack (Log4Brains, GitHub Action CI validation) that makes the format pay off at scale.
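
For a feel of the shape, the sketch below renders a stripped-down record with the three elements named above. The heading text and frontmatter keys follow the MADR template as best recalled, so treat them as assumptions and check them against the spec. `render_madr` and its parameters are hypothetical names for this example.

```python
def render_madr(title, status, date, context, options, chosen, because):
    """Render a minimal MADR-style record as a Markdown string.

    Covers only frontmatter, Considered Options, and Decision Outcome;
    the real template has more (optional) sections.
    """
    if chosen not in options:
        raise ValueError("the chosen option must be one of the considered options")
    lines = [
        "---",
        f"status: {status}",
        f"date: {date}",
        "---",
        "",
        f"# {title}",
        "",
        "## Context and Problem Statement",
        "",
        context,
        "",
        "## Considered Options",
        "",
    ]
    lines += [f"* {option}" for option in options]
    lines += [
        "",
        "## Decision Outcome",
        "",
        f'Chosen option: "{chosen}", because {because}.',
    ]
    return "\n".join(lines) + "\n"


record = render_madr(
    title="Pick a primary datastore",
    status="accepted",
    date="2026-06-05",
    context="We need one datastore for the first product milestone.",
    options=["PostgreSQL", "DynamoDB"],
    chosen="PostgreSQL",
    because="the team already operates it",
)
print(record)
```

Forcing the chosen option to appear in the considered list is a small design choice in the sketch: it stops a record from naming a winner that was never compared against anything.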

2026-06-05 · ~9 min read

From 1,200 ChatGPT chats to 38 durable decisions: a real export walkthrough

18 months of ChatGPT history, 1,214 conversations, 47MB of export. The extractor surfaced 154 candidates and 38 durable decisions after the durability filter — a 3.1% hit rate. Here's the breakdown by category, the three findings that changed how we work, and what the extractor missed. Pure proof, no pitch.
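
The funnel arithmetic is easy to check from the three counts quoted above. The variable names are made up for the example. The intermediate ratios are derived here, not quoted from the post.

```python
# Counts quoted in the summary above.
conversations = 1214
candidates = 154
durable = 38

candidates_per_conversation = candidates / conversations
durable_per_candidate = durable / candidates
hit_rate = durable / conversations  # end-to-end: durable decisions per conversation

print(f"candidates per conversation: {candidates_per_conversation:.1%}")  # 12.7%
print(f"durable per candidate:       {durable_per_candidate:.1%}")        # 24.7%
print(f"overall hit rate:            {hit_rate:.1%}")                     # 3.1%
```

Roughly one candidate in four survives the durability filter, which is where most of the narrowing from 154 to 38 happens.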

2026-06-05 · ~8 min read

ADR vs Decision Log vs RFC: when to use each

Three terms for capturing decisions, three different jobs. RFC is the pre-decision proposal (you want input before committing); ADR is the post-decision permanent record (immutable, captures the rejected options, answers "why" at staff turnover); decision log is the broader collection (ADRs plus the lighter product and operations calls). The practical disambiguation — which artifact for which situation — and what to do about the 90% of decisions that happened in AI chat without any of the three.

2026-04-29 · ~9 min read

Why your team's ADRs go stale after 60 days (and what to do about it)

Almost every team that adopts an ADR practice abandons it within two months. Not because the team is lazy — because the ceremony costs more per decision than people are willing to pay, and the cost compounds every time a record is skipped. The round-trip math, the broken-window effect, and the two-tier shape that actually sticks: extracted records for the 80%, hand-written ADRs only for the load-bearing 20%.

2026-04-25 · ~8 min read

The new-CTO onboarding problem: when nobody can tell you why

You join a Series A as the new CTO. Six weeks in, someone asks why the stack was picked. The answer isn't in the wiki, isn't in the repo, isn't on the calendar. The reasoning happened — but it happened inside someone else's ChatGPT history, and now they're gone. Why this is the modal new-CTO experience in 2026, why ADR ceremony and Notion wikis don't fix it, and the three-line first-90-days plan that actually does.

2026-06-10 · ~12 min read

The security ADR: how to document decisions that affect threat model, data handling, or compliance scope

Security decisions are the category where undocumented trade-offs are most expensive. SOC 2 auditors ask not just "what are your controls?" but "what was the decision process for each control, and who reviewed it?" The five fields a security ADR adds beyond the standard template (threat model scope, data classification, compliance scope, security reviewers, review triggers), why the standard Consequences section fails for security decisions — it conflates accepted risks with desired outcomes and has no reviewer field or review cadence — the classification problem and how to handle security ADR content that reveals attack surface, how compliance documentation makes security ADRs uniquely valuable at audit time, and why security decisions need review triggers when most ADRs don't require them.