<?xml version="1.0" encoding="utf-8"?><?xml-stylesheet type="text/xsl" href="atom.xsl"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <id>https://blog.base14.io/</id>
    <title>base14 Blog</title>
    <updated>2026-01-28T00:00:00.000Z</updated>
    <generator>https://github.com/jpmonette/feed</generator>
    <link rel="alternate" href="https://blog.base14.io/"/>
    <subtitle>base14 Blog</subtitle>
    <icon>https://blog.base14.io/img/favicon.ico</icon>
    <entry>
        <title type="html"><![CDATA[The Multi-Cloud Design: Engineering your code for Portability]]></title>
        <id>https://blog.base14.io/multi-cloud-design</id>
        <link href="https://blog.base14.io/multi-cloud-design"/>
        <updated>2026-01-28T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Avoid cloud vendor lock-in by decoupling business logic from proprietary SDKs. Learn how the adapter pattern (hexagonal architecture) and GitOps enable architectural portability across AWS, GCP, and Azure.]]></summary>
        <content type="html"><![CDATA[<p>In our <a class="" href="https://blog.base14.io/cloud-native-foundation-layer">previous post on Cloud-Native foundations</a>,
we explored why running on one cloud isn't lock-in—but designing for one cloud
is. Now let's look at how to implement that portability.</p>
<p>Portability does not mean the ability to run everywhere simultaneously; that
is usually a path to over-engineering. It is, more accurately, a function of
reversibility: the technical confidence that if a migration ever becomes
necessary, the system can support it. This quality comes not from any specific
cloud provider but from the deliberate layering of code and environment. Many
teams focus on the destination of their deployment; true portability is found
in the methodology of the build.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-two-fronts-application-layer-vs-environment-layer">The Two Fronts: Application Layer vs Environment Layer<a href="https://blog.base14.io/multi-cloud-design#the-two-fronts-application-layer-vs-environment-layer" class="hash-link" aria-label="Direct link to The Two Fronts: Application Layer vs Environment Layer" title="Direct link to The Two Fronts: Application Layer vs Environment Layer" translate="no">​</a></h2>
<p>To keep your options open, you have to work on two fronts.</p>
<p>First, the <strong>Application Layer.</strong> This is your domain logic. It should be
blissfully unaware of whether it's talking to a proprietary cloud queue or a
local database. Second, the <strong>Environment Layer.</strong> This is your config, the
"container" your code lives in. It needs to be reproducible and declarative.
If an environment cannot be recreated with a single command, the system relies
on luck rather than automation.</p>
<p>Most systems don't fail at the deployment stage. They fail in the code. When
your business logic starts calling proprietary SDKs directly, you've stopped
building a product and started building a feature for your cloud provider. You
might be "on Kubernetes," but if your code is married to a specific vendor's
identity service or database quirks, you're stuck.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="designing-for-a-cloud-is-the-trap-the-cost-of-vendor-lock-in">Designing for a Cloud is the Trap: The Cost of Vendor Lock-In<a href="https://blog.base14.io/multi-cloud-design#designing-for-a-cloud-is-the-trap-the-cost-of-vendor-lock-in" class="hash-link" aria-label="Direct link to Designing for a Cloud is the Trap: The Cost of Vendor Lock-In" title="Direct link to Designing for a Cloud is the Trap: The Cost of Vendor Lock-In" translate="no">​</a></h2>
<p>There is no harm in running on one cloud. The risk is making <strong>irreversible
design decisions.</strong> If you build on open interfaces, you can happily stay on
one provider for years while still keeping the power to:</p>
<ul>
<li class="">Spin up a secondary site during a regional meltdown.</li>
<li class="">Shift workloads when the "committed spend" math stops adding up.</li>
<li class="">Actually negotiate your contract because you have a credible exit.</li>
</ul>
<p>Portability provides strategic <strong>leverage</strong>.</p>
<p><img decoding="async" loading="lazy" alt="Diagram showing how architectural portability provides strategic leverage in cloud vendor negotiations" src="https://blog.base14.io/assets/images/leverage-70eacf4a64a7f61adb51c5e71ad31470.png" width="2143" height="2143" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="kafka-as-a-library-the-interface-first-mindset">Kafka as a Library: The Interface-First Mindset<a href="https://blog.base14.io/multi-cloud-design#kafka-as-a-library-the-interface-first-mindset" class="hash-link" aria-label="Direct link to Kafka as a Library: The Interface-First Mindset" title="Direct link to Kafka as a Library: The Interface-First Mindset" translate="no">​</a></h2>
<p>Take Kafka. It's a great example because it has evolved from a tool into a
protocol. If your app depends on the Kafka <strong>API</strong> rather than a specific
vendor's implementation, Kafka effectively becomes a library. Whether you're
using self-hosted Apache Kafka, a managed service, or something like Redpanda,
your producers and consumers don't care. Only the plumbing changes.</p>
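<p>As a minimal illustration (the names and types here are hypothetical, not from any particular client library), "Kafka as a library" boils down to the domain depending on a small publishing interface rather than a vendor SDK:</p>

```go
package main

import "fmt"

// EventPublisher is the port the business logic depends on.
// Nothing here mentions Kafka, brokers, or a vendor SDK.
type EventPublisher interface {
	Publish(topic string, payload []byte) error
}

// OrderService holds only the interface, so it compiles and runs
// against any implementation: self-hosted Kafka, a managed service,
// or an in-memory fake for tests.
type OrderService struct {
	events EventPublisher
}

func (s *OrderService) PlaceOrder(id string) error {
	// ... domain logic ...
	return s.events.Publish("orders", []byte(id))
}

// memoryPublisher is a stand-in adapter; a real one would wrap a
// Kafka client library behind the same interface.
type memoryPublisher struct {
	sent map[string][][]byte
}

func (m *memoryPublisher) Publish(topic string, payload []byte) error {
	m.sent[topic] = append(m.sent[topic], payload)
	return nil
}

func main() {
	pub := &memoryPublisher{sent: map[string][][]byte{}}
	svc := &OrderService{events: pub}
	_ = svc.PlaceOrder("order-42")
	fmt.Println(len(pub.sent["orders"]))
}
```

<p>Swapping self-hosted Kafka for a managed service then means writing one new adapter; <code>OrderService</code> never changes.</p>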
<p>This pattern is everywhere if you look for it:</p>
<ul>
<li class="">
<p><strong>Databases:</strong> Postgres and MySQL protocols.</p>
</li>
<li class="">
<p><strong>Identity:</strong> OAuth and OIDC.</p>
</li>
<li class="">
<p><strong>Observability:</strong> OpenTelemetry.</p>
</li>
<li class="">
<p><strong>Storage:</strong> The S3 API.</p>
</li>
</ul>
<p><img decoding="async" loading="lazy" alt="S3 as a protocol API working across AWS S3, MinIO, and other S3-compatible storage providers" src="https://blog.base14.io/assets/images/s3-as-api-05d57a1a790344c4118910e85a255ace.png" width="1422" height="650" class="img_ev3q"></p>
<p>The CNCF landscape is more than a list of tools; it's a map of the interfaces
that won. When you see multiple mature implementations of the same protocol,
that's your green light to build on it: the interface has become the language
the ecosystem speaks, and vendors have to keep speaking it.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="business-logic-should-be-ignorant-the-adapter-pattern">Business Logic Should Be Ignorant: The Adapter Pattern<a href="https://blog.base14.io/multi-cloud-design#business-logic-should-be-ignorant-the-adapter-pattern" class="hash-link" aria-label="Direct link to Business Logic Should Be Ignorant: The Adapter Pattern" title="Direct link to Business Logic Should Be Ignorant: The Adapter Pattern" translate="no">​</a></h2>
<p>Portability fails when your code knows too much. The rule is simple: <strong>Your
business logic should not care where it runs.</strong></p>
<p>This is where "ports and adapters" (hexagonal architecture) moves from theory
to practical survival. Your domain talks to an interface; your infrastructure
lives behind an adapter.</p>
<p>Yes, this costs something. You pay in abstraction. But 'abstract' shouldn't
mean 'complex.' You aren't introducing a heavy new component or a fragile
moving part; you're just building a wrapper: the 'adapter' in a <strong>ports and
adapters</strong> architecture, which is the <strong>adapter pattern</strong> in its most
practical form. It's the difference between hard-wiring your logic into a
vendor's proprietary API and simply translating their contract into your own
domain schema. This minor friction today prevents a painful, high-cost
migration later.</p>
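<p>A sketch of what that wrapper looks like in practice (the vendor type and field names below are invented for illustration): the adapter's only job is to translate the vendor's contract into a schema the domain owns.</p>

```go
package main

import "fmt"

// StoredObject is the domain schema: what the business logic understands.
type StoredObject struct {
	Key  string
	Size int64
}

// ObjectStore is the port. The domain never sees a vendor type.
type ObjectStore interface {
	Head(key string) (StoredObject, error)
}

// vendorBlobInfo stands in for a proprietary SDK response type.
type vendorBlobInfo struct {
	BlobID        string
	ContentLength int64
}

// vendorAdapter is the whole "abstraction cost": a thin wrapper that
// translates the vendor contract into the domain schema.
type vendorAdapter struct{}

func (vendorAdapter) Head(key string) (StoredObject, error) {
	// A real adapter would call the vendor SDK here; this fakes the response.
	info := vendorBlobInfo{BlobID: key, ContentLength: 1024}
	return StoredObject{Key: info.BlobID, Size: info.ContentLength}, nil
}

func main() {
	var store ObjectStore = vendorAdapter{}
	obj, _ := store.Head("reports/q3.csv")
	fmt.Println(obj.Key, obj.Size)
}
```

<p>Migrating providers then touches one file, the adapter, instead of every call site.</p>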
<p><strong>While portability requires consistent daily investment, it mitigates the significant, sudden costs of vendor lock-in.</strong>
<img decoding="async" loading="lazy" alt="Hexagonal architecture diagram showing adapter pattern with ports connecting business logic to cloud services" src="https://blog.base14.io/assets/images/hexagonal-65c2fa90c6c2347298548974d86aac82.png" width="2622" height="934" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="intent-vs-instructions-gitops-for-reproducibility">Intent vs. Instructions: GitOps for Reproducibility<a href="https://blog.base14.io/multi-cloud-design#intent-vs-instructions-gitops-for-reproducibility" class="hash-link" aria-label="Direct link to Intent vs. Instructions: GitOps for Reproducibility" title="Direct link to Intent vs. Instructions: GitOps for Reproducibility" translate="no">​</a></h2>
<p>Layered code is insufficient if standing up a new environment requires
tribal knowledge. Infrastructure must be reproducible, which is the core of
GitOps: storing intent is superior to storing instructions, because describing
"what" must exist is more durable than the "how" of a dashboard.</p>
<p><img decoding="async" loading="lazy" alt="GitOps workflow showing infrastructure intent stored in Git and synced to cloud environments" src="https://blog.base14.io/assets/images/gitops-0f9c2750db7522df40e71eb7dc3483b9.png" width="2616" height="444" class="img_ev3q"></p>
<p>GitOps makes this real by storing your <strong>intent</strong> in Git. Now, let's be clear:
this isn't magic. You still have to do the one-time work of mapping that intent
to a specific cloud's APIs, whether that's configuring a Crossplane provider, a
Terraform module, or a specific Ingress controller. Think of it as installing a
driver. You do the plumbing once so that your application logic doesn't have to
care about it. Once those mappings are in place, the workflow is identical:
commit, push, sync. You've successfully moved the cloud-specific friction out
of your architecture and into a manageable configuration layer. This is what
makes true multi-cloud and hybrid deployments practical.</p>
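<p>For a concrete, if simplified, picture of "intent": a snippet like the following, committed to Git, declares what must exist (the names and image here are placeholders). The sync tool's job is to make the cluster match it, on whichever cloud it runs.</p>

```yaml
# Intent, not instructions: this declares *what* should exist.
# A GitOps controller (Argo CD, Flux, ...) reconciles the cluster toward it.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
        - name: checkout
          image: registry.example.com/checkout:1.4.2
```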
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="you-dont-need-to-move-you-just-need-to-know-you-can">You Don't Need to Move, You Just Need to Know You Can<a href="https://blog.base14.io/multi-cloud-design#you-dont-need-to-move-you-just-need-to-know-you-can" class="hash-link" aria-label="Direct link to You Don't Need to Move, You Just Need to Know You Can" title="Direct link to You Don't Need to Move, You Just Need to Know You Can" translate="no">​</a></h2>
<p>At the end of the day, some decisions are hard to undo. Choosing open
interfaces and declarative configs makes them easier.</p>
<p>It gives you the room to respond to outages, control your costs, and meet new
compliance hurdles without breaking the company.</p>
<p>You don't need to move often, or even at all. You just need to know that the
door isn't locked from the outside. That's the real value of avoiding vendor
lock-in.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="whats-next">What's Next?<a href="https://blog.base14.io/multi-cloud-design#whats-next" class="hash-link" aria-label="Direct link to What's Next?" title="Direct link to What's Next?" translate="no">​</a></h2>
<p>In the next post, we'll dig into the stack itself: which protocols actually
preserve your freedom, and which ones are "open" in name only.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="related-reading">Related Reading<a href="https://blog.base14.io/multi-cloud-design#related-reading" class="hash-link" aria-label="Direct link to Related Reading" title="Direct link to Related Reading" translate="no">​</a></h2>
<ul>
<li class=""><a class="" href="https://blog.base14.io/cloud-native-foundation-layer">The Cloud-Native Foundation Layer</a> —
Part 1: why running on one cloud isn't lock-in, but designing for one is</li>
<li class=""><a class="" href="https://blog.base14.io/unified-observability">Why Unified Observability Matters</a> —
Applying vendor-neutral principles to your monitoring stack</li>
<li class=""><a class="" href="https://blog.base14.io/observability-theatre">Observability Theatre</a> —
The cost of fragmented tooling and how to escape it</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="references">References<a href="https://blog.base14.io/multi-cloud-design#references" class="hash-link" aria-label="Direct link to References" title="Direct link to References" translate="no">​</a></h2>
<ul>
<li class=""><a href="https://medium.com/the-software-architecture-chronicles/ports-adapters-architecture-d19f2d476eca" target="_blank" rel="noopener noreferrer" class="">Ports and Adapters Architecture</a></li>
<li class=""><a href="https://dzone.com/articles/hexagonal-architecture-is-powerful" target="_blank" rel="noopener noreferrer" class="">Hexagonal Architecture</a></li>
<li class=""><a href="https://developer.hashicorp.com/terraform/tutorials/networking/multicloud-kubernetes" target="_blank" rel="noopener noreferrer" class="">Multi-Cloud K8s</a></li>
</ul>]]></content>
        <author>
            <name>Irfan Shah</name>
            <uri>https://base14.io/about</uri>
        </author>
        <category label="cloud-native" term="cloud-native"/>
        <category label="portability" term="portability"/>
        <category label="vendor-neutral" term="vendor-neutral"/>
        <category label="architecture" term="architecture"/>
        <category label="multi-cloud" term="multi-cloud"/>
        <category label="kubernetes" term="kubernetes"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Live Metric Registry: find and understand observability metrics across your stack]]></title>
        <id>https://blog.base14.io/metric-registry</id>
        <link href="https://blog.base14.io/metric-registry"/>
        <updated>2026-01-19T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Search 3,700+ observability metrics from OpenTelemetry, Prometheus, and Kubernetes. Live registry extracted from source code, updated nightly.]]></summary>
        <content type="html"><![CDATA[<p>Introducing <a href="https://metric-registry.base14.io/" target="_blank" rel="noopener noreferrer" class="">Metric Registry</a>: a live,
searchable catalog of 3,700+ (and rapidly growing) observability metrics
extracted directly from source repositories across the OpenTelemetry,
Prometheus, and Kubernetes ecosystems, including cloud provider metrics.
Metric Registry is open source and built to stay current automatically as
projects evolve.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-you-can-do-today-with-metric-registry">What you can do today with Metric Registry<a href="https://blog.base14.io/metric-registry#what-you-can-do-today-with-metric-registry" class="hash-link" aria-label="Direct link to What you can do today with Metric Registry" title="Direct link to What you can do today with Metric Registry" translate="no">​</a></h2>
<p><strong>Search across your entire observability stack.</strong> Find metrics by name,
description, or component, whether you're looking for HTTP-related histograms
or database connection metrics.</p>
<p><strong>Understand what metrics actually exist.</strong> The registry covers 15 sources
including OpenTelemetry Collector receivers, Prometheus exporters (PostgreSQL,
Redis, MySQL, MongoDB, Kafka), Kubernetes metrics (kube-state-metrics,
cAdvisor), and LLM observability libraries.</p>
<p><strong>See which metrics follow standards.</strong> Each metric shows whether it complies
with OpenTelemetry Semantic Conventions, helping you understand what's
standardized versus custom.</p>
<p><strong>Trace back to the source.</strong> Every metric links to its origin: the repository,
file path, and commit hash. When you need to understand a metric's exact
definition, you can go straight to the source.</p>
<p><strong>Trust the data.</strong> Metrics are extracted automatically from source code and
official metadata files, and the registry refreshes nightly to stay current as
projects evolve.</p>
<p><strong>Can't find what you're looking for?</strong> Open an issue or better yet, submit a
PR to add new sources or improve existing extractors.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="sources-already-indexed">Sources already indexed<a href="https://blog.base14.io/metric-registry#sources-already-indexed" class="hash-link" aria-label="Direct link to Sources already indexed" title="Direct link to Sources already indexed" translate="no">​</a></h3>
<table><thead><tr><th>Category</th><th>Sources</th></tr></thead><tbody><tr><td>OpenTelemetry</td><td>Collector Contrib, Semantic Conventions, Python, Java, JavaScript</td></tr><tr><td>Prometheus</td><td>node_exporter, postgres_exporter, redis_exporter, mysql_exporter, mongodb_exporter, kafka_exporter</td></tr><tr><td>Kubernetes</td><td>kube-state-metrics, cAdvisor</td></tr><tr><td>LLM Observability</td><td>OpenLLMetry, OpenLIT</td></tr><tr><td>CloudWatch</td><td>RDS, ALB, DynamoDB, Lambda, EC2, S3, SQS, API Gateway</td></tr></tbody></table>
<iframe width="100%" height="400" src="https://www.youtube.com/embed/A7GNbDjTL2s?rel=0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media;
gyroscope; picture-in-picture; web-share; fullscreen"></iframe>
<p><em>Watch: Introduction to the Live Metric Registry.</em></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="whats-the-need-for-a-metric-registry">What's the need for a Metric Registry?<a href="https://blog.base14.io/metric-registry#whats-the-need-for-a-metric-registry" class="hash-link" aria-label="Direct link to What's the need for a Metric Registry?" title="Direct link to What's the need for a Metric Registry?" translate="no">​</a></h2>
<p>If you've ever tried to answer "what metrics does my stack actually emit?", you
know the pain. Observability metrics are scattered across hundreds of
repositories, exporters, and instrumentation libraries. The OpenTelemetry
Collector Contrib repo alone has over 100 receivers, each emitting dozens of
metrics. Add Prometheus exporters for PostgreSQL, Redis, MySQL, Kafka. Then
Kubernetes metrics from kube-state-metrics and cAdvisor. Then your application
instrumentation across Go, Java, Python, and JavaScript.</p>
<p>Each source uses different formats:</p>
<ul>
<li class="">OpenTelemetry Collector uses <code>metadata.yaml</code> files</li>
<li class="">Prometheus exporters define metrics in Go code via <code>prometheus.NewDesc()</code></li>
<li class="">Python instrumentation uses decorators and meter APIs</li>
<li class="">Some sources just have documentation (if you're lucky)</li>
</ul>
<p>Different naming conventions compound the problem. Is it
<code>http_server_request_duration</code> or <code>http.server.request.duration</code>? Underscores
or dots? <code>_total</code> suffix or not?</p>
<p>There's no central registry, no single place to search "show me all histogram
metrics related to HTTP requests across my entire observability stack."</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-not-just-a-static-list-">Why not just a static list?<a href="https://blog.base14.io/metric-registry#why-not-just-a-static-list-" class="hash-link" aria-label="Direct link to Why not just a static list?" title="Direct link to Why not just a static list?" translate="no">​</a></h2>
<p>The obvious solution is to create a curated list. Document all the metrics, put
them in a spreadsheet or wiki, and call it a day.</p>
<p>This fails for several reasons:</p>
<p><strong>Metrics change constantly.</strong> Every release of every exporter can add, modify,
or deprecate metrics. The OpenTelemetry Collector Contrib repo has hundreds of
commits per month, and a static list becomes outdated quickly.</p>
<p><strong>Manual curation doesn't scale.</strong> The registry indexes over 3,700 metrics from
just 15 sources. The full observability ecosystem has thousands of exporters
and instrumentation libraries. No team can manually track all of this.</p>
<p><strong>No provenance.</strong> A static list tells you a metric exists, but not where it's
defined, what version introduced it, or whether the definition you're looking
at is current. When debugging why a metric isn't appearing as expected, you
need to trace back to the source.</p>
<p><strong>No trust levels.</strong> Some metric definitions come from official metadata files
maintained by the project. Others are inferred from code analysis. A static
list treats them the same, but they're not equally reliable.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="its-not-trivial-to-build-a-live-metric-registry---why-is-that">It's not trivial to build a live Metric Registry. Why is that?<a href="https://blog.base14.io/metric-registry#its-not-trivial-to-build-a-live-metric-registry---why-is-that" class="hash-link" aria-label="Direct link to It's not trivial to build a live Metric Registry. Why is that?" title="Direct link to It's not trivial to build a live Metric Registry. Why is that?" translate="no">​</a></h2>
<p>Building a system that automatically extracts and catalogs metrics from source
repositories sounds straightforward. Clone the repos, parse the files, store
the results. In practice, it's surprisingly complicated.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="multi-language-extraction">Multi-Language Extraction<a href="https://blog.base14.io/metric-registry#multi-language-extraction" class="hash-link" aria-label="Direct link to Multi-Language Extraction" title="Direct link to Multi-Language Extraction" translate="no">​</a></h3>
<p>Metrics are defined in Go, Python, Java, TypeScript, YAML, and more. Each
requires different parsing strategies:</p>
<ul>
<li class=""><strong>Go</strong>: AST parsing to find <code>prometheus.NewDesc()</code> calls,
<code>prometheus.NewGauge()</code>, and similar patterns</li>
<li class=""><strong>Python</strong>: AST walking to find <code>meter.create_counter()</code> and instrument
decorators</li>
<li class=""><strong>TypeScript</strong>: Parsing to extract metric definitions from OpenTelemetry JS
instrumentation</li>
<li class=""><strong>YAML</strong>: Structured parsing for OpenTelemetry metadata files</li>
<li class=""><strong>Regex</strong>: Sometimes the cleanest option for semi-structured definitions</li>
</ul>
<p>A single "parser" doesn't work, since each language and each project has its
own patterns.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="multiple-definition-patterns">Multiple Definition Patterns<a href="https://blog.base14.io/metric-registry#multiple-definition-patterns" class="hash-link" aria-label="Direct link to Multiple Definition Patterns" title="Direct link to Multiple Definition Patterns" translate="no">​</a></h3>
<p>Even within a single language, metrics are defined differently across projects.</p>
<p>In Go alone, the patterns include:</p>
<ul>
<li class=""><code>prometheus.NewDesc()</code> with <code>BuildFQName()</code> for namespaced metrics</li>
<li class="">Direct string literals for metric names</li>
<li class="">Map-based definitions where metric metadata is stored in data structures</li>
<li class="">Constants defined separately from the metric registration</li>
</ul>
<p>The redis_exporter defines metrics in maps. The postgres_exporter uses the
standard <code>NewDesc</code> pattern. kube-state-metrics generates metrics dynamically
based on Kubernetes resource types. Each required a different extraction
approach.</p>
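<p>To make the Go extraction concrete, here is a toy sketch (not the registry's actual extractor) that uses the standard <code>go/ast</code> package to pull metric names out of <code>prometheus.NewDesc()</code> calls with string-literal names:</p>

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strings"
)

// src stands in for exporter source code; real extraction walks whole repos.
const src = `package collector

import "github.com/prometheus/client_golang/prometheus"

var upDesc = prometheus.NewDesc("pg_up", "Whether the server is up.", nil, nil)
var sizeDesc = prometheus.NewDesc("pg_database_size_bytes", "Database size.", nil, nil)
`

// extractNewDescNames finds prometheus.NewDesc(...) calls and returns
// their first argument when it is a plain string literal.
func extractNewDescNames(source string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "src.go", source, 0)
	if err != nil {
		return nil, err
	}
	var names []string
	ast.Inspect(file, func(n ast.Node) bool {
		call, ok := n.(*ast.CallExpr)
		if !ok {
			return true
		}
		sel, ok := call.Fun.(*ast.SelectorExpr)
		if !ok || sel.Sel.Name != "NewDesc" || len(call.Args) == 0 {
			return true
		}
		if ident, ok := sel.X.(*ast.Ident); !ok || ident.Name != "prometheus" {
			return true
		}
		if lit, ok := call.Args[0].(*ast.BasicLit); ok && lit.Kind == token.STRING {
			names = append(names, strings.Trim(lit.Value, `"`))
		}
		return true
	})
	return names, nil
}

func main() {
	names, err := extractNewDescNames(src)
	if err != nil {
		panic(err)
	}
	fmt.Println(names)
}
```

<p>Note what this toy version misses: names built with <code>BuildFQName()</code>, names stored in maps or constants, and dynamically generated metrics all need their own handling, which is exactly why a single parser doesn't cover the ecosystem.</p>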
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="normalization-challenge">Normalization Challenge<a href="https://blog.base14.io/metric-registry#normalization-challenge" class="hash-link" aria-label="Direct link to Normalization Challenge" title="Direct link to Normalization Challenge" translate="no">​</a></h3>
<p>Once extracted, metrics need normalization into a canonical schema. This means:</p>
<ul>
<li class="">Consistent naming: converting between <code>http_server_duration</code> and
<code>http.server.duration</code></li>
<li class="">Unified types: mapping Prometheus's counter/gauge/histogram/summary to
OpenTelemetry's instrument types</li>
<li class="">Attribute standardization: labels, dimensions, and tags are all the same
concept with different names</li>
</ul>
<p>Without normalization, searching across sources becomes difficult.</p>
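<p>A toy version of that normalization step; the exact rules are the registry's own, so treat the name and type mappings below as illustrative assumptions:</p>

```go
package main

import (
	"fmt"
	"strings"
)

// canonicalName converts Prometheus-style underscore names into the
// dotted form used here as the canonical key. A real normalizer would
// also handle suffixes like _total and unit conventions.
func canonicalName(name string) string {
	return strings.ReplaceAll(name, "_", ".")
}

// canonicalType maps Prometheus metric types onto OpenTelemetry
// instrument kinds (an assumed mapping, for illustration only).
func canonicalType(promType string) string {
	switch promType {
	case "counter":
		return "Counter"
	case "gauge":
		return "Gauge"
	case "histogram", "summary":
		return "Histogram"
	default:
		return "Unknown"
	}
}

func main() {
	fmt.Println(canonicalName("http_server_duration")) // http.server.duration
	fmt.Println(canonicalType("counter"))              // Counter
}
```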
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="provenance-tracking">Provenance Tracking<a href="https://blog.base14.io/metric-registry#provenance-tracking" class="hash-link" aria-label="Direct link to Provenance Tracking" title="Direct link to Provenance Tracking" translate="no">​</a></h3>
<p>Every metric in the registry must link back to:</p>
<ul>
<li class="">The source repository</li>
<li class="">The exact file path</li>
<li class="">The git commit hash</li>
<li class="">The extraction timestamp</li>
</ul>
<p>This information is essential for debugging and trust. When a user questions
why a metric has a certain description, they need to see the source.</p>
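<p>In code, provenance is just a few fields carried alongside every extracted metric. This struct is illustrative, not the registry's actual schema:</p>

```go
package main

import "fmt"

// Provenance records where a metric definition came from.
// Field names are hypothetical, for illustration only.
type Provenance struct {
	Repo        string // source repository URL
	FilePath    string // exact file within the repo
	CommitHash  string // git commit the definition was read at
	ExtractedAt string // extraction timestamp
}

// SourceURL builds a GitHub-style permalink that pins the definition
// to an exact commit, so the link stays valid as the repo evolves.
func (p Provenance) SourceURL() string {
	return fmt.Sprintf("%s/blob/%s/%s", p.Repo, p.CommitHash, p.FilePath)
}

func main() {
	p := Provenance{
		Repo:        "https://github.com/prometheus-community/postgres_exporter",
		FilePath:    "collector/pg_database.go",
		CommitHash:  "0a1b2c3",
		ExtractedAt: "2026-01-19T02:00:00Z",
	}
	fmt.Println(p.SourceURL())
}
```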
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="trust-levels">Trust Levels<a href="https://blog.base14.io/metric-registry#trust-levels" class="hash-link" aria-label="Direct link to Trust Levels" title="Direct link to Trust Levels" translate="no">​</a></h3>
<p>Not all metric definitions are equally reliable:</p>
<ul>
<li class=""><strong>Authoritative</strong>: From official metadata files maintained by the project
(like OTel Collector's <code>metadata.yaml</code>)</li>
<li class=""><strong>Derived</strong>: Extracted from source code via AST analysis</li>
<li class=""><strong>Documented</strong>: Scraped from documentation</li>
<li class=""><strong>Vendor-claimed</strong>: From vendor docs without source verification</li>
</ul>
<p>A registry that doesn't distinguish between these levels can mislead users
about the reliability of metric definitions.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="semantic-convention-compliance">Semantic Convention Compliance<a href="https://blog.base14.io/metric-registry#semantic-convention-compliance" class="hash-link" aria-label="Direct link to Semantic Convention Compliance" title="Direct link to Semantic Convention Compliance" translate="no">​</a></h3>
<p>OpenTelemetry defines semantic conventions, which are standardized metric names
and attributes. A useful registry should indicate which metrics comply with
these conventions:</p>
<ul>
<li class=""><strong>Exact match</strong>: <code>http.server.request.duration</code> matches the semantic
convention exactly</li>
<li class=""><strong>Prefix match</strong>: <code>http.server.request.duration.bucket</code> starts with a
convention metric</li>
<li class=""><strong>No match</strong>: Custom metric not covered by conventions</li>
</ul>
<p>This helps teams understand which metrics are "standard" versus custom.</p>
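<p>Once names are normalized, the matching logic itself is simple. A sketch, with a stand-in convention list in place of the full parsed semantic-conventions registry:</p>

```go
package main

import (
	"fmt"
	"strings"
)

// conventions stands in for the list parsed from the OTel semantic
// conventions repo; only a couple of entries are shown here.
var conventions = []string{
	"http.server.request.duration",
	"db.client.operation.duration",
}

// matchConvention classifies a dot-normalized metric name as an
// exact match, a prefix match, or no match.
func matchConvention(name string) string {
	for _, c := range conventions {
		if name == c {
			return "exact"
		}
		if strings.HasPrefix(name, c+".") {
			return "prefix"
		}
	}
	return "none"
}

func main() {
	fmt.Println(matchConvention("http.server.request.duration"))        // exact
	fmt.Println(matchConvention("http.server.request.duration.bucket")) // prefix
	fmt.Println(matchConvention("myapp.orders.created"))                // none
}
```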
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="and-so---source-first-metric-extraction">And so - source-first metric extraction<a href="https://blog.base14.io/metric-registry#and-so---source-first-metric-extraction" class="hash-link" aria-label="Direct link to And so - source-first metric extraction" title="Direct link to And so - source-first metric extraction" translate="no">​</a></h2>
<p>The Metric Registry extracts metrics directly from source repositories,
normalizes them into a canonical schema, and exposes them via search.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="design-principles">Design Principles<a href="https://blog.base14.io/metric-registry#design-principles" class="hash-link" aria-label="Direct link to Design Principles" title="Direct link to Design Principles" translate="no">​</a></h3>
<p><strong>Source-first</strong>: Derive metrics from repos. The source code is the ground
truth.</p>
<p><strong>Pluggable adapters</strong>: Each source gets its own adapter that knows how to
fetch and extract. Adding a new source doesn't require changing core logic.</p>
<p><strong>Provenance-aware</strong>: Every metric links to its origin. Always know where a
metric came from and how trustworthy it is.</p>
<p><strong>Search-oriented</strong>: Optimize for discovery. Full-text search, faceted
filtering, semantic convention badges.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="architecture-deep-dive">Architecture Deep Dive<a href="https://blog.base14.io/metric-registry#architecture-deep-dive" class="hash-link" aria-label="Direct link to Architecture Deep Dive" title="Direct link to Architecture Deep Dive" translate="no">​</a></h2>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(230, 1%, 98%);--prism-color:hsl(230, 8%, 24%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="background-color:hsl(230, 1%, 98%);color:hsl(230, 8%, 24%)"><code class="codeBlockLines_e6Vv">┌──────────────────────────────────────────────────────────────────────┐
│                            Sources                                   │
│  otel-contrib │ postgres │ redis │ ksm │ cadvisor │ otel-python │ ...│
└───────┬───────────┬─────────┬───────┬───────┬───────────┬────────────┘
        │           │         │       │       │           │
        ▼           ▼         ▼       ▼       ▼           ▼
┌──────────────────────────────────────────────────────────────────────┐
│                           Adapters                                   │
│     Each adapter: Fetch (git clone) → Extract (parse) → RawMetric    │
└─────────────────────────────────┬────────────────────────────────────┘
                                  │
                                  ▼
┌──────────────────────────────────────────────────────────────────────┐
│                          Orchestrator                                │
│         RawMetric → CanonicalMetric → Store (SQLite + FTS5)          │
└─────────────────────────────────┬────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────┐
│                           Enricher                                  │
│      Cross-reference with OTel Semantic Conventions                 │
│      Match types: exact, prefix, none                               │
└─────────────────────────────────┬───────────────────────────────────┘
                                  │
                                  ▼
┌───────────────────────────────────────────────────────────────────────┐
│                       REST API + Next.js UI                           │
│   Search, filter by type/source/component, semantic convention badges │
└───────────────────────────────────────────────────────────────────────┘</code></pre></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="adapters">Adapters<a href="https://blog.base14.io/metric-registry#adapters" class="hash-link" aria-label="Direct link to Adapters" title="Direct link to Adapters" translate="no">​</a></h3>
<p>Each adapter implements a common interface:</p>
<div class="language-go codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(230, 1%, 98%);--prism-color:hsl(230, 8%, 24%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-go codeBlock_bY9V thin-scrollbar" style="background-color:hsl(230, 1%, 98%);color:hsl(230, 8%, 24%)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token keyword" style="color:hsl(301, 63%, 40%)">type</span><span class="token plain"> Adapter </span><span class="token keyword" style="color:hsl(301, 63%, 40%)">interface</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">    </span><span class="token function" style="color:hsl(221, 87%, 60%)">Name</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"> </span><span class="token builtin" style="color:hsl(119, 34%, 47%)">string</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">    </span><span class="token function" style="color:hsl(221, 87%, 60%)">SourceCategory</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"> domain</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token plain">SourceCategory</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">    </span><span class="token function" style="color:hsl(221, 87%, 60%)">Confidence</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token punctuation" 
style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"> domain</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token plain">ConfidenceLevel</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">    </span><span class="token function" style="color:hsl(221, 87%, 60%)">ExtractionMethod</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"> domain</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token plain">ExtractionMethod</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">    </span><span class="token function" style="color:hsl(221, 87%, 60%)">RepoURL</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"> </span><span class="token builtin" style="color:hsl(119, 34%, 47%)">string</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">    </span><span class="token function" style="color:hsl(221, 87%, 60%)">Fetch</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">ctx context</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token plain">Context</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"> opts FetchOptions</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token operator" style="color:hsl(221, 87%, 60%)">*</span><span class="token 
plain">FetchResult</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"> </span><span class="token builtin" style="color:hsl(119, 34%, 47%)">error</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">    </span><span class="token function" style="color:hsl(221, 87%, 60%)">Extract</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">ctx context</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token plain">Context</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"> result </span><span class="token operator" style="color:hsl(221, 87%, 60%)">*</span><span class="token plain">FetchResult</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"> </span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">[</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">]</span><span class="token operator" style="color:hsl(221, 87%, 60%)">*</span><span class="token plain">RawMetric</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"> </span><span class="token builtin" style="color:hsl(119, 34%, 47%)">error</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">}</span><br></span></code></pre></div></div>
<p>The adapter handles everything source-specific: cloning the repo, finding
metric definitions, parsing them. The orchestrator doesn't need to know whether
it's parsing YAML or walking a Go AST.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="extraction-methods">Extraction Methods<a href="https://blog.base14.io/metric-registry#extraction-methods" class="hash-link" aria-label="Direct link to Extraction Methods" title="Direct link to Extraction Methods" translate="no">​</a></h3>
<p><strong>YAML Parsing</strong> (OpenTelemetry Collector Contrib)</p>
<p>The cleanest case. OTel Collector receivers include <code>metadata.yaml</code> files with
structured metric definitions:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(230, 1%, 98%);--prism-color:hsl(230, 8%, 24%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="background-color:hsl(230, 1%, 98%);color:hsl(230, 8%, 24%)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token key atrule" style="color:hsl(35, 99%, 36%)">metrics</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">  </span><span class="token key atrule" style="color:hsl(35, 99%, 36%)">redis.clients.connected</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">    </span><span class="token key atrule" style="color:hsl(35, 99%, 36%)">description</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">:</span><span class="token plain"> Number of client connections</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">    </span><span class="token key atrule" style="color:hsl(35, 99%, 36%)">unit</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">:</span><span class="token plain"> </span><span class="token string" style="color:hsl(119, 34%, 47%)">"{connection}"</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">    </span><span class="token key atrule" style="color:hsl(35, 99%, 36%)">gauge</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span 
class="token plain">      </span><span class="token key atrule" style="color:hsl(35, 99%, 36%)">value_type</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">:</span><span class="token plain"> int</span><br></span></code></pre></div></div>
<p><strong>Go AST Parsing</strong> (Prometheus Exporters)</p>
<p>Most Prometheus exporters define metrics using <code>prometheus.NewDesc()</code>:</p>
<div class="language-go codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(230, 1%, 98%);--prism-color:hsl(230, 8%, 24%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-go codeBlock_bY9V thin-scrollbar" style="background-color:hsl(230, 1%, 98%);color:hsl(230, 8%, 24%)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">prometheus</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token function" style="color:hsl(221, 87%, 60%)">NewDesc</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">    prometheus</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token function" style="color:hsl(221, 87%, 60%)">BuildFQName</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain">namespace</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"> subsystem</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"> </span><span class="token string" style="color:hsl(119, 34%, 47%)">"connections"</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">    </span><span class="token string" style="color:hsl(119, 34%, 47%)">"Number of active connections"</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">    
</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">[</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">]</span><span class="token builtin" style="color:hsl(119, 34%, 47%)">string</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">{</span><span class="token string" style="color:hsl(119, 34%, 47%)">"database"</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">}</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">    </span><span class="token boolean" style="color:hsl(35, 99%, 36%)">nil</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><br></span></code></pre></div></div>
<p>The extractor walks the AST to find these calls, resolves the string arguments
(including <code>BuildFQName</code> concatenation), and extracts metric name, description,
and labels.</p>
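<p>A minimal version of that walk needs only the standard <code>go/ast</code> and <code>go/parser</code> packages. The exporter snippet and helper names below are hypothetical, not taken from the registry's code:</p>

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strings"
)

// extractDescs parses Go source and returns "name: help" for every
// prometheus.NewDesc call it finds, resolving BuildFQName concatenation.
func extractDescs(src string) []string {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "src.go", src, 0)
	if err != nil {
		return nil
	}
	var out []string
	ast.Inspect(f, func(n ast.Node) bool {
		call, ok := n.(*ast.CallExpr)
		if !ok {
			return true
		}
		// Match selector calls like prometheus.NewDesc(...).
		sel, ok := call.Fun.(*ast.SelectorExpr)
		if !ok || sel.Sel.Name != "NewDesc" || len(call.Args) < 2 {
			return true
		}
		name := resolveName(call.Args[0])
		help := ""
		if lit, ok := call.Args[1].(*ast.BasicLit); ok {
			help = strings.Trim(lit.Value, `"`)
		}
		out = append(out, name+": "+help)
		return true
	})
	return out
}

// resolveName handles a plain string literal or a
// prometheus.BuildFQName(ns, subsystem, name) call, which joins with "_".
func resolveName(e ast.Expr) string {
	switch v := e.(type) {
	case *ast.BasicLit:
		return strings.Trim(v.Value, `"`)
	case *ast.CallExpr:
		var parts []string
		for _, a := range v.Args {
			if lit, ok := a.(*ast.BasicLit); ok && lit.Value != `""` {
				parts = append(parts, strings.Trim(lit.Value, `"`))
			}
		}
		return strings.Join(parts, "_")
	}
	return ""
}

func main() {
	src := `package exporter
var d = prometheus.NewDesc(
	prometheus.BuildFQName("redis", "clients", "connected"),
	"Number of client connections",
	[]string{"database"}, nil)`
	fmt.Println(extractDescs(src))
	// prints: [redis_clients_connected: Number of client connections]
}
```

<p>The real extractor also resolves <code>namespace</code> and <code>subsystem</code> variables rather than only literals, but the shape of the walk is the same.</p>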
<p><strong>Python AST</strong> (OpenTelemetry Python, OpenLLMetry)</p>
<p>Python instrumentation uses the meter API:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(230, 1%, 98%);--prism-color:hsl(230, 8%, 24%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="background-color:hsl(230, 1%, 98%);color:hsl(230, 8%, 24%)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">meter</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">.</span><span class="token plain">create_histogram</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">(</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">    name</span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token string" style="color:hsl(119, 34%, 47%)">"http.client.duration"</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">    description</span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token string" style="color:hsl(119, 34%, 47%)">"Duration of HTTP client requests"</span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">    unit</span><span class="token operator" style="color:hsl(221, 87%, 60%)">=</span><span class="token string" style="color:hsl(119, 34%, 47%)">"ms"</span><span class="token plain"></span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain"></span><span class="token punctuation" style="color:hsl(119, 34%, 47%)">)</span><br></span></code></pre></div></div>
<p>AST walking finds these calls and extracts the arguments.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="custom-patterns">Custom Patterns<a href="https://blog.base14.io/metric-registry#custom-patterns" class="hash-link" aria-label="Direct link to Custom Patterns" title="Direct link to Custom Patterns" translate="no">​</a></h3>
<p>Some sources required custom approaches:</p>
<ul>
<li class="">redis_exporter stores metrics in Go maps, so the extractor parses map
literals</li>
<li class="">OpenTelemetry Java uses a mix of constants and method calls, so regex
extraction worked best</li>
<li class="">kube-state-metrics generates metrics dynamically from Kubernetes types</li>
</ul>
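<p>The map-literal case can be handled with the same standard-library AST tooling. This is a simplified sketch; the variable name and matching heuristics are invented for illustration, not taken from redis_exporter or the registry:</p>

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strings"
)

// extractMapMetrics pulls key/value string pairs out of map literals,
// the shape redis_exporter uses for metric-name → help-text tables.
func extractMapMetrics(src string) map[string]string {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "src.go", src, 0)
	if err != nil {
		return nil
	}
	out := map[string]string{}
	ast.Inspect(f, func(n ast.Node) bool {
		lit, ok := n.(*ast.CompositeLit)
		if !ok {
			return true
		}
		// Only look inside map[...]... literals.
		if _, isMap := lit.Type.(*ast.MapType); !isMap {
			return true
		}
		for _, elt := range lit.Elts {
			kv, ok := elt.(*ast.KeyValueExpr)
			if !ok {
				continue
			}
			k, kOK := kv.Key.(*ast.BasicLit)
			v, vOK := kv.Value.(*ast.BasicLit)
			if kOK && vOK {
				out[strings.Trim(k.Value, `"`)] = strings.Trim(v.Value, `"`)
			}
		}
		return true
	})
	return out
}

func main() {
	src := `package exporter
var metricHelp = map[string]string{
	"connected_clients": "Number of client connections",
}`
	fmt.Println(extractMapMetrics(src))
}
```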
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="storage-and-search">Storage and Search<a href="https://blog.base14.io/metric-registry#storage-and-search" class="hash-link" aria-label="Direct link to Storage and Search" title="Direct link to Storage and Search" translate="no">​</a></h3>
<p>SQLite with FTS5 (full-text search) provides:</p>
<ul>
<li class="">Fast text search across metric names, descriptions, components</li>
<li class="">Faceted filtering by instrument type, source category, component</li>
<li class="">Efficient pagination for browsing</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="enrichment">Enrichment<a href="https://blog.base14.io/metric-registry#enrichment" class="hash-link" aria-label="Direct link to Enrichment" title="Direct link to Enrichment" translate="no">​</a></h3>
<p>After extraction, the enricher cross-references each metric against
OpenTelemetry Semantic Conventions:</p>
<ul>
<li class=""><strong>349 semantic convention metrics</strong> parsed from the official repo</li>
<li class="">Name normalization (underscores → dots) before matching</li>
<li class="">Three match types: exact, prefix, none</li>
<li class="">Results stored alongside the metric for filtering and display</li>
</ul>
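<p>A hedged sketch of what the normalize-and-match step could look like in Go; the function and data shapes here are invented for illustration, since the post doesn't publish the enricher's actual code:</p>

```go
package main

import (
	"fmt"
	"strings"
)

// matchConvention normalizes a metric name (underscores → dots) and
// classifies it against a set of semantic-convention metric names as
// "exact", "prefix", or "none".
func matchConvention(name string, conventions map[string]bool) (string, string) {
	norm := strings.ReplaceAll(name, "_", ".")
	if conventions[norm] {
		return "exact", norm
	}
	for conv := range conventions {
		// A prefix match: the metric extends a convention namespace.
		if strings.HasPrefix(norm, conv+".") {
			return "prefix", conv
		}
	}
	return "none", ""
}

func main() {
	conventions := map[string]bool{
		"http.client.request.duration": true,
		"db.client.connection":         true,
	}
	fmt.Println(matchConvention("http_client_request_duration", conventions)) // exact
	fmt.Println(matchConvention("db.client.connection.pending", conventions)) // prefix
	fmt.Println(matchConvention("redis.keyspace.hits", conventions))          // none
}
```

<p>The normalization step is what lets a Prometheus-style <code>http_client_request_duration</code> line up with the dot-separated OTel convention name.</p>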
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="whats-next">What's next<a href="https://blog.base14.io/metric-registry#whats-next" class="hash-link" aria-label="Direct link to What's next" title="Direct link to What's next" translate="no">​</a></h2>
<p><strong>More sources</strong>: Cloud provider metrics (AWS CloudWatch, GCP Monitoring), more
language instrumentations (.NET), additional Prometheus exporters.</p>
<p><strong>Deeper enrichment</strong>: Attribute validation against semantic conventions,
stability level tracking, deprecation warnings.</p>
<p><strong>Cross-ecosystem mapping</strong>: Identifying equivalent metrics across OpenTelemetry
and Prometheus ecosystems.</p>
<hr>
<p>The observability ecosystem is vast and fragmented. A live metric registry
makes "what metrics exist?" an answerable question, and it stays current
automatically through nightly extraction from source repositories.</p>
<p>The source code is the source of truth, and the Metric Registry makes it searchable.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="contribute">Contribute<a href="https://blog.base14.io/metric-registry#contribute" class="hash-link" aria-label="Direct link to Contribute" title="Direct link to Contribute" translate="no">​</a></h2>
<p>Metric Registry is open source. We welcome contributions—whether it's adding
new metric sources, improving extraction accuracy, or fixing bugs. Check out
the repo at <a href="https://github.com/base-14/metric-library" target="_blank" rel="noopener noreferrer" class="">github.com/base-14/metric-library</a>
and join us in building a comprehensive catalog of observability metrics.</p>
<hr>
<p><strong>Put these metrics to work.</strong> base14 Scout ingests metrics from all the sources
indexed in Metric Registry—OpenTelemetry, Prometheus exporters, Kubernetes, and
more—into a unified observability platform.
<a href="https://docs.base14.io/guides/quick-start" target="_blank" rel="noopener noreferrer" class="">Get started with Scout →</a></p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="related-reading">Related Reading<a href="https://blog.base14.io/metric-registry#related-reading" class="hash-link" aria-label="Direct link to Related Reading" title="Direct link to Related Reading" translate="no">​</a></h2>
<ul>
<li class=""><a class="" href="https://blog.base14.io/cloud-native-foundation-layer">The Cloud-Native Foundation Layer</a> —
Building observability infrastructure that scales with your stack</li>
<li class=""><a class="" href="https://blog.base14.io/reducing-bus-factor-in-observability">Reducing Bus Factor in Observability Using AI</a> —
Making metric knowledge accessible across your team</li>
<li class=""><a class="" href="https://blog.base14.io/unified-observability">Why Unified Observability Matters for Growing Engineering Teams</a> —
The case for consolidating your monitoring stack</li>
</ul>]]></content>
        <author>
            <name>Ranjan Sakalley</name>
            <uri>https://base14.io/about</uri>
        </author>
        <category label="observability" term="observability"/>
        <category label="metrics" term="metrics"/>
        <category label="opentelemetry" term="opentelemetry"/>
        <category label="prometheus" term="prometheus"/>
        <category label="open-source" term="open-source"/>
        <category label="kubernetes" term="kubernetes"/>
        <category label="metric-discovery" term="metric-discovery"/>
        <category label="otel-collector" term="otel-collector"/>
        <category label="semantic-conventions" term="semantic-conventions"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Evaluating Database Monitoring Solutions: A Framework for Engineering Leaders]]></title>
        <id>https://blog.base14.io/evaluating-database-monitoring-solutions</id>
        <link href="https://blog.base14.io/evaluating-database-monitoring-solutions"/>
        <updated>2026-01-18T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Fragmented database monitoring costs more than invoices show. A framework for evaluating PostgreSQL monitoring based on data unification.]]></summary>
        <content type="html"><![CDATA[<p>It was 5:30 AM when Riya (name changed), VP of Engineering at a Series C
e-commerce company, got the page. Morning traffic was climbing into triple
digits and catalog latency had spiked to twelve seconds. Within minutes, Slack
was flooded with alerts from three different monitoring tools, each painting a
partial picture. The APM showed slow API calls. The infrastructure dashboard
showed normal CPU and memory. The dedicated PostgreSQL monitoring tool showed
elevated query times, but offered no correlation to what changed upstream. Riya
watched as her on-call engineers spent the first forty minutes of the incident
jumping between dashboards, arguing over whether this was a database problem or
an application problem. By the time they traced the issue to a query introduced
in the previous night's deployment, the checkout flow had been degraded for
nearly ninety minutes. The postmortem would later reveal that all the data
needed to diagnose the issue existed within five minutes of the alert firing.
It was scattered across three tools, owned by two teams, and required manual
timeline alignment to interpret. Riya realized the problem was not
instrumentation. It was fragmentation.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-hidden-cost-model-of-fragmented-observability">The Hidden Cost Model of Fragmented Observability<a href="https://blog.base14.io/evaluating-database-monitoring-solutions#the-hidden-cost-model-of-fragmented-observability" class="hash-link" aria-label="Direct link to The Hidden Cost Model of Fragmented Observability" title="Direct link to The Hidden Cost Model of Fragmented Observability" translate="no">​</a></h2>
<p>Engineering leaders evaluating PostgreSQL monitoring solutions typically focus
on feature checklists: which metrics are collected, how dashboards look, what
alerting options exist. These are reasonable starting points, but they obscure
a more significant cost driver that compounds over time.</p>
<p>Fragmented observability, the practice of monitoring databases separately from
applications and infrastructure, introduces costs that do not appear on any
vendor invoice. These costs manifest as slower incident resolution, reduced
velocity in shipping software, erosion of operational culture, and the gradual
accumulation of knowledge silos.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="impact-on-incident-resolution">Impact on Incident Resolution<a href="https://blog.base14.io/evaluating-database-monitoring-solutions#impact-on-incident-resolution" class="hash-link" aria-label="Direct link to Impact on Incident Resolution" title="Direct link to Impact on Incident Resolution" translate="no">​</a></h2>
<p>The most immediately measurable cost of fragmented observability is extended
mean time to resolution. When database metrics live in one tool, application
traces in another, and infrastructure signals in a third, engineers must
perform manual correlation before diagnosis can begin.</p>
<p>This correlation tax applies to every incident where the root cause is not
immediately obvious. Engineers must align timelines across tools by eyeballing
timestamps. They must mentally map application identifiers to database
identifiers, since different tools use different labeling conventions. They
must context-switch between interfaces, each with its own query language and
navigation model.</p>
<p>For straightforward issues, this overhead might add ten or fifteen minutes. For
complex incidents involving interaction between application behavior and
database state, the overhead can dominate the entire investigation. Riya's team
spent forty minutes establishing that the database was the victim rather than
the cause, before they could begin examining what the previous night's
deployment had changed.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="impact-on-software-delivery-velocity">Impact on Software Delivery Velocity<a href="https://blog.base14.io/evaluating-database-monitoring-solutions#impact-on-software-delivery-velocity" class="hash-link" aria-label="Direct link to Impact on Software Delivery Velocity" title="Direct link to Impact on Software Delivery Velocity" translate="no">​</a></h2>
<p>The effects extend beyond incident response into day-to-day development. Teams
that cannot quickly understand how their changes affect database behavior tend
to ship more conservatively, or worse, ship without understanding the database
implications at all.</p>
<p>Consider a team deploying a new feature that introduces a new query pattern.
With <a class="" href="https://blog.base14.io/unified-observability">unified observability</a>, they can watch
application latency and database behavior on the same timeline, verify that the
new queries perform as expected, and catch regressions before users notice them.
With fragmented observability,
this verification requires opening multiple tools, manually correlating
deployment timestamps, and hoping that the metric granularity aligns closely
enough to draw conclusions. Often the application team doesn't even have access
to the database monitoring tool, which is owned by a separate team.</p>
<p>Most teams, facing this friction, skip the verification. They deploy and rely
on alerts to catch problems. This shifts the feedback loop from proactive to
reactive, from minutes to hours. Over time, teams develop less intuition about
how their code interacts with the database. Performance regressions creep in
gradually rather than being caught immediately.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="impact-on-operational-culture">Impact on Operational Culture<a href="https://blog.base14.io/evaluating-database-monitoring-solutions#impact-on-operational-culture" class="hash-link" aria-label="Direct link to Impact on Operational Culture" title="Direct link to Impact on Operational Culture" translate="no">​</a></h2>
<p>Fragmented observability shapes organizational behavior in ways that extend
beyond tooling. When database monitoring is separated from application
monitoring, ownership boundaries tend to follow the same split.</p>
<p>This creates a predictable dynamic during incidents. Application teams point to
normal application metrics and suggest the database is at fault. Database teams
point to normal database metrics and suggest the application is at fault. The
first phase of incident response becomes political rather than technical.</p>
<p>Even outside of incidents, the cultural effects are significant. Application
developers, lacking integrated visibility into database behavior, treat the
database as a black box. Database expertise becomes concentrated in a small
number of individuals who become bottlenecks for any work that touches
performance.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-knowledge-silo-problem">The Knowledge Silo Problem<a href="https://blog.base14.io/evaluating-database-monitoring-solutions#the-knowledge-silo-problem" class="hash-link" aria-label="Direct link to The Knowledge Silo Problem" title="Direct link to The Knowledge Silo Problem" translate="no">​</a></h2>
<p>Perhaps the most insidious cost of fragmented observability is the creation of
knowledge silos. When PostgreSQL monitoring lives in a separate tool,
understanding that tool becomes a specialized skill. A small number of
engineers develop expertise in the interface, learn which metrics matter, build
mental models of how to interpret the data.</p>
<p>This expertise does not transfer. When those engineers leave or are unavailable
during an incident, the organization's ability to diagnose database issues
degrades significantly. The tools are still there, the metrics are still being
collected, but the interpretive knowledge required to use them effectively has
walked out the door.</p>
<p><a class="" href="https://blog.base14.io/unified-observability">Unified observability</a> does not eliminate the need
for database expertise, but it makes that expertise more accessible. When
database metrics appear alongside
application traces in the same interface, using the same query patterns and
visualization conventions, engineers can learn by exposure rather than
requiring dedicated study of a separate tooling ecosystem.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="a-framework-for-evaluation">A Framework for Evaluation<a href="https://blog.base14.io/evaluating-database-monitoring-solutions#a-framework-for-evaluation" class="hash-link" aria-label="Direct link to A Framework for Evaluation" title="Direct link to A Framework for Evaluation" translate="no">​</a></h2>
<p>Given these costs, how should engineering leaders approach PostgreSQL
monitoring evaluation? Feature comparisons remain necessary, but they should be
secondary to a more fundamental question: does this solution reduce or increase
fragmentation?</p>
<table><thead><tr><th>Criterion</th><th>What to Evaluate</th></tr></thead><tbody><tr><td>Data Unification</td><td>Do database metrics, application traces, and infrastructure signals end up in the same analytical backend? Can they be queried together, correlated programmatically, and visualized on shared timelines?</td></tr><tr><td>Identifier Consistency</td><td>When a slow application request touches the database, can you trace from the request to the specific queries it executed? Are there shared identifiers for services, hosts, databases, and requests?</td></tr><tr><td>Workflow Integration</td><td>During an incident, can engineers move from symptom to diagnosis to root cause within a single interface? Or must they export data, switch tools, and maintain mental state across context switches?</td></tr><tr><td>Knowledge Distribution</td><td>Does the solution concentrate expertise or distribute it? Do interfaces follow familiar patterns? Do they surface relevant context without requiring specialized query construction?</td></tr></tbody></table>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-strategic-choice">The Strategic Choice<a href="https://blog.base14.io/evaluating-database-monitoring-solutions#the-strategic-choice" class="hash-link" aria-label="Direct link to The Strategic Choice" title="Direct link to The Strategic Choice" translate="no">​</a></h2>
<p>Engineering leaders face a choice that will shape their organization's
operational capability for years. They can continue adding specialized tools,
each excellent in its domain, and accept the ongoing cost of manual
correlation, knowledge silos, and fragmented ownership. Or they can prioritize
integration, accepting that the best PostgreSQL metrics are worthless if they
cannot be understood in context.</p>
<p>The organizations that resolve incidents quickly, ship with confidence, and
maintain distributed operational expertise are those where the data needed to
understand system behavior is accessible to the engineers who need it, when
they need it, without tool-switching or tribal knowledge.</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-background-color:hsl(230, 1%, 98%);--prism-color:hsl(230, 8%, 24%)"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="background-color:hsl(230, 1%, 98%);color:hsl(230, 8%, 24%)"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">┌───────────────────────────────────────────────────────────┐</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">│               Fragmented Observability                    │</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">├───────────────────────────────────────────────────────────┤</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">│                                                           │</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">│  ┌───────────┐   ┌───────────┐   ┌───────────┐            │</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">│  │ APM Tool  │   │ DB Monitor│   │Infra Tool │            │</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">│  │           │   │           │   │           │            │</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">│  │App Traces │   │  Queries  │   │CPU/Memory │            │</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">│  │ Latency   │   │   Locks   │   │  Disk I/O │            │</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">│  └─────┬─────┘   └─────┬─────┘   └─────┬─────┘            │</span><br></span><span 
class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">│        │               │               │                  │</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">│        ▼               ▼               ▼                  │</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">│  ┌────────────────────────────────────────────────────┐   │</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">│  │           Manual Correlation Required              │   │</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">│  │    • Different timestamps  • Different labels      │   │</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">│  │    • Context switching     • Knowledge silos       │   │</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">│  └────────────────────────────────────────────────────┘   │</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">│                                                           │</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">└───────────────────────────────────────────────────────────┘</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">                            vs.</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token 
plain">┌───────────────────────────────────────────────────────────┐</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">│                Unified Observability                      │</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">├───────────────────────────────────────────────────────────┤</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">│                                                           │</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">│  ┌───────────┐   ┌───────────┐   ┌───────────┐            │</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">│  │App Traces │   │ DB Metrics│   │Infra Logs │            │</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">│  └─────┬─────┘   └─────┬─────┘   └─────┬─────┘            │</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">│        │               │               │                  │</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">│        └───────────────┼───────────────┘                  │</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">│                        ▼                                  │</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">│  ┌────────────────────────────────────────────────────┐   │</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">│  │          Single Analytical Backend                 │   │</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">│  │    • Unified timeline   • Correlated 
identifiers   │   │</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">│  │    • One query language • Shared dashboards        │   │</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">│  └────────────────────────────────────────────────────┘   │</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">│                        │                                  │</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">│                        ▼                                  │</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">│         Faster diagnosis, less context switching          │</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">│                                                           │</span><br></span><span class="token-line" style="color:hsl(230, 8%, 24%)"><span class="token plain">└───────────────────────────────────────────────────────────┘</span><br></span></code></pre></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="conclusion">Conclusion<a href="https://blog.base14.io/evaluating-database-monitoring-solutions#conclusion" class="hash-link" aria-label="Direct link to Conclusion" title="Direct link to Conclusion" translate="no">​</a></h2>
<p>The change that brought down Riya's checkout flow was a single line
modification to a product listing query. A developer had added a filter to
support a new search feature. The change worked correctly in staging, where the
product catalog had a few hundred items. In production, with tens of thousands
of products and no index on the new filter column, the query went from
milliseconds to seconds. The deployment had gone out at 11 PM with no load
testing, no database review, and no way for the on-call engineer to quickly
connect the new code path to the degraded query.</p>
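<p>To make the failure mode concrete, here is a sketch using hypothetical table
and column names (not taken from the actual incident). Without an index on the
new filter column, the planner has no choice but a sequential scan, so query
time grows with table size:</p>
<pre><code class="language-sql">-- Hypothetical reproduction of the regression. With no index on the
-- new filter column, the plan degrades to a sequential scan:
EXPLAIN ANALYZE
SELECT id, name, price
FROM products
WHERE is_featured = true;
-- Expect a "Seq Scan on products" node when no suitable index exists;
-- fast on a few hundred staging rows, slow on tens of thousands.

-- The five-minute fix once diagnosed: add a matching (partial) index,
-- built without blocking writes.
CREATE INDEX CONCURRENTLY idx_products_is_featured
    ON products (is_featured)
    WHERE is_featured = true;
</code></pre>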
<p>The fix took five minutes once identified. The diagnosis took eighty-five. With
unified observability, the deployment marker would have appeared on the same
timeline as the latency spike, the slow query would have been traceable to the
specific application endpoint, and the missing index would have been visible in
the same interface. Riya's team would have been back in bed by 6 AM. Instead,
they spent the morning writing a postmortem about tooling fragmentation.</p>
<hr>
<p><strong>This is exactly what we built pgX for.</strong>
<a href="https://docs.base14.io/operate/pgx/overview" target="_blank" rel="noopener noreferrer" class="">pgX</a> unifies PostgreSQL monitoring with application
traces and infrastructure metrics in a single platform. When a deployment causes
query degradation, you see the deployment marker, the latency spike, and the
slow query on the same timeline—no tool-switching required.
<a href="https://docs.base14.io/operate/pgx/overview" target="_blank" rel="noopener noreferrer" class="">See how pgX works →</a></p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="related-reading">Related Reading<a href="https://blog.base14.io/evaluating-database-monitoring-solutions#related-reading" class="hash-link" aria-label="Direct link to Related Reading" title="Direct link to Related Reading" translate="no">​</a></h2>
<ul>
<li class=""><a class="" href="https://blog.base14.io/unified-observability">Why Unified Observability Matters for Growing Engineering Teams</a> —
The case for consolidating your monitoring stack</li>
<li class=""><a class="" href="https://blog.base14.io/introducing-pgx">Introducing pgX: Unified Database and Application Monitoring</a> —
How pgX bridges the gap between database and application observability</li>
<li class=""><a class="" href="https://blog.base14.io/factors-influencing-mttr">Understanding What Increases and Reduces MTTR</a> —
Actionable strategies to cut incident resolution time</li>
</ul>]]></content>
        <author>
            <name>Ranjan Sakalley</name>
            <uri>https://base14.io/about</uri>
        </author>
        <category label="devops" term="devops"/>
        <category label="sre" term="sre"/>
        <category label="database-monitoring" term="database-monitoring"/>
        <category label="postgresql" term="postgresql"/>
        <category label="observability" term="observability"/>
        <category label="unified-observability" term="unified-observability"/>
        <category label="pgx" term="pgx"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Effective War Room Management: A Guide to Incident Response]]></title>
        <id>https://blog.base14.io/effective-warroom-management</id>
        <link href="https://blog.base14.io/effective-warroom-management"/>
        <updated>2026-01-16T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Battle-tested incident war room practices: clear roles, shared visibility, engineering pairing, and post-incident processes.]]></summary>
        <content type="html"><![CDATA[<p><img decoding="async" loading="lazy" alt="Warroom Management" src="https://blog.base14.io/assets/images/warroom-24adffb6d946fdd698eaa186475a0d18.webp" width="1792" height="1024" class="img_ev3q"></p>
<p>Incidents are inevitable. What separates resilient organizations from the rest
is not whether they experience incidents, but how effectively they respond when
problems arise. A well-structured war room process can mean the difference
between a minor disruption and a major crisis.</p>
<p>After managing hundreds of critical incidents across my career, I've distilled
my key learnings into this guide. These battle-tested practices have repeatedly
proven their value in high-pressure situations.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="initialization">Initialization<a href="https://blog.base14.io/effective-warroom-management#initialization" class="hash-link" aria-label="Direct link to Initialization" title="Direct link to Initialization" translate="no">​</a></h2>
<p>The first minutes of an incident response are critical. Having clear, consistent
procedures for war room initialization ensures a swift and organized start to
your incident management process.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="key-elements-of-initialization">Key Elements of Initialization<a href="https://blog.base14.io/effective-warroom-management#key-elements-of-initialization" class="hash-link" aria-label="Direct link to Key Elements of Initialization" title="Direct link to Key Elements of Initialization" translate="no">​</a></h3>
<ul>
<li class="">Single-access point: Always have one consistent link for all war rooms that
everyone can access quickly. This eliminates confusion about where to go when
an incident occurs.</li>
<li class="">Universal access: Everyone in the organization should have access to this
link, even if they don't typically participate in incident response. This
allows subject matter experts to join immediately when needed.</li>
<li class="">Pre-configured environment: Set up standard tools and dashboards in advance,
so they're ready when an incident occurs.</li>
<li class="">Automated notifications: Implement automated alerting to notify the
appropriate teams when a war room is initiated.</li>
<li class="">Initialization checklist: Create a standardized procedure for declaring an
incident and starting the war room process.</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="clear-role-definition">Clear Role Definition<a href="https://blog.base14.io/effective-warroom-management#clear-role-definition" class="hash-link" aria-label="Direct link to Clear Role Definition" title="Direct link to Clear Role Definition" translate="no">​</a></h3>
<p>Effective war rooms require clear responsibilities. Each participant should
understand their specific role and boundaries of authority.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="core-roles">Core Roles<a href="https://blog.base14.io/effective-warroom-management#core-roles" class="hash-link" aria-label="Direct link to Core Roles" title="Direct link to Core Roles" translate="no">​</a></h4>
<h5 class="anchor anchorTargetStickyNavbar_Vzrq" id="incident-manager">Incident Manager<a href="https://blog.base14.io/effective-warroom-management#incident-manager" class="hash-link" aria-label="Direct link to Incident Manager" title="Direct link to Incident Manager" translate="no">​</a></h5>
<ul>
<li class="">Leads the overall response</li>
<li class="">Makes final decisions when consensus can't be reached</li>
<li class="">Ensures the response follows established processes</li>
<li class="">Manages escalations when needed</li>
<li class="">Declares when the incident is resolved</li>
</ul>
<h5 class="anchor anchorTargetStickyNavbar_Vzrq" id="scribe">Scribe<a href="https://blog.base14.io/effective-warroom-management#scribe" class="hash-link" aria-label="Direct link to Scribe" title="Direct link to Scribe" translate="no">​</a></h5>
<ul>
<li class="">Documents all significant events, decisions, and actions in real-time</li>
<li class="">Maintains a timeline of the incident</li>
<li class="">Captures action items for follow-up</li>
<li class="">Ensures all key information is accessible to war room participants</li>
</ul>
<h5 class="anchor anchorTargetStickyNavbar_Vzrq" id="communications-person">Communications Person<a href="https://blog.base14.io/effective-warroom-management#communications-person" class="hash-link" aria-label="Direct link to Communications Person" title="Direct link to Communications Person" translate="no">​</a></h5>
<ul>
<li class="">Manages external and internal communications</li>
<li class="">Drafts and sends updates to stakeholders at regular intervals</li>
<li class="">Fields inquiries from other parts of the organization</li>
<li class="">Ensures consistent messaging about the incident</li>
</ul>
<h5 class="anchor anchorTargetStickyNavbar_Vzrq" id="actors">Actors<a href="https://blog.base14.io/effective-warroom-management#actors" class="hash-link" aria-label="Direct link to Actors" title="Direct link to Actors" translate="no">​</a></h5>
<ul>
<li class="">Technical resources performing the actual investigation and remediation</li>
<li class="">Provide expertise in specific systems or technologies</li>
<li class="">Execute changes and verify results</li>
<li class="">Report findings back to the war room</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="effective-practices">Effective Practices<a href="https://blog.base14.io/effective-warroom-management#effective-practices" class="hash-link" aria-label="Direct link to Effective Practices" title="Direct link to Effective Practices" translate="no">​</a></h2>
<p>The structure and approach of your war room significantly impact its
effectiveness. Well-designed practices help maintain focus and productivity
during high-stress situations.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="recommended-practices">Recommended Practices<a href="https://blog.base14.io/effective-warroom-management#recommended-practices" class="hash-link" aria-label="Direct link to Recommended Practices" title="Direct link to Recommended Practices" translate="no">​</a></h3>
<ul>
<li class=""><strong>Shared visibility</strong>: Maintain one shared screen that everyone can see,
showing the primary investigation or discussion. All key actions should be
performed visibly to the entire team.</li>
<li class=""><strong>Sub-team breakouts</strong>: When a specific line of inquiry requires focused
attention, create separate rooms with the same role structure. These breakout
teams should report findings back to the main war room regularly.</li>
<li class=""><strong>Regular status updates</strong>: Schedule brief status updates at consistent
intervals to ensure everyone has the same understanding of the current
situation.</li>
<li class=""><strong>Engineering pairing</strong>: All changes should be made by a pair of engineers,
never a single person. Pairing provides instant review and is critical to
getting the fix right. It reduces errors and builds redundancy of knowledge
during critical moments.</li>
<li class=""><strong>Clear decision-making framework</strong>: Establish in advance how decisions will
be made during an incident (consensus, incident manager decision, etc.).</li>
<li class=""><strong>Time-boxing</strong>: Set time limits for investigation paths to avoid rabbit
holes. Re-evaluate progress regularly.</li>
<li class=""><strong>Documentation first</strong>: Ensure all hypotheses, findings, and actions are
documented before they're acted upon.</li>
<li class=""><strong>Standardized RCA template</strong>: Maintain a consistent RCA template that
captures all necessary information: incident timeline, impact assessment, root
cause identification, contributing factors, and action items. Standardization
ensures comprehensive analysis and makes RCAs easier to compare and learn from
over time.</li>
<li class=""><strong>Centralized knowledge repository</strong>: Establish a shared Google Drive,
SharePoint, or similar solution where all RCAs are stored and accessible to
everyone in the organization. This transparency builds institutional knowledge
and allows teams to learn from past incidents regardless of their direct
involvement.</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="war-room-etiquette">War Room Etiquette<a href="https://blog.base14.io/effective-warroom-management#war-room-etiquette" class="hash-link" aria-label="Direct link to War Room Etiquette" title="Direct link to War Room Etiquette" translate="no">​</a></h3>
<p>The discipline and focus of war room participants can make or break your
incident response. Clear expectations for behavior help maintain an effective
environment.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="etiquette-guidelines">Etiquette Guidelines<a href="https://blog.base14.io/effective-warroom-management#etiquette-guidelines" class="hash-link" aria-label="Direct link to Etiquette Guidelines" title="Direct link to Etiquette Guidelines" translate="no">​</a></h4>
<ul>
<li class=""><strong>Speak purposefully</strong>: Don't talk unless you have something meaningful to
contribute. Background chatter makes it difficult to focus on critical
information.</li>
<li class=""><strong>Respect role boundaries</strong>: Trust people in their designated roles to perform
their functions without interference.</li>
<li class=""><strong>Minimize distractions</strong>: Turn off notifications and avoid multitasking
during active incident response.</li>
<li class=""><strong>Stay focused on resolution</strong>: Keep discussions centered on understanding and
resolving the current incident. Save process improvement discussions for after
the incident.</li>
<li class=""><strong>Use clear, direct communication</strong>: Avoid ambiguous language. Be specific
about what you're seeing, what you believe is happening, and what you're
doing.</li>
<li class=""><strong>Mind cognitive load</strong>: Recognize that everyone's mental capacity is limited
during high-stress situations, and communicate accordingly.</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="post-incident-activities">Post-Incident Activities<a href="https://blog.base14.io/effective-warroom-management#post-incident-activities" class="hash-link" aria-label="Direct link to Post-Incident Activities" title="Direct link to Post-Incident Activities" translate="no">​</a></h3>
<p>How you handle the aftermath of an incident is just as important as the initial
response. Effective post-incident processes turn experiences into organizational
learning.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="post-incident-process">Post-Incident Process<a href="https://blog.base14.io/effective-warroom-management#post-incident-process" class="hash-link" aria-label="Direct link to Post-Incident Process" title="Direct link to Post-Incident Process" translate="no">​</a></h4>
<ul>
<li class=""><strong>RCA assignment</strong>: The Incident Manager assigns root cause analysis
responsibilities to a smaller group with relevant expertise.</li>
<li class=""><strong>Blameless postmortem</strong>: Conduct a thorough review focused on systems and
processes, not individual mistakes.</li>
<li class=""><strong>Action item tracking</strong>: Document and assign follow-up items with clear
ownership and timelines.</li>
<li class=""><strong>Knowledge sharing</strong>: Distribute learnings from the incident throughout the
organization.</li>
<li class=""><strong>Process refinement</strong>: Update war room procedures based on lessons learned
from each incident.</li>
<li class=""><strong>Recognition</strong>: Acknowledge the contributions of all participants in the
incident response.</li>
</ul>
<hr>
<p><strong>Shared visibility starts with unified observability.</strong> When your war room has
a single pane of glass showing application traces, database queries, and
infrastructure metrics on the same timeline, engineers spend less time
correlating data and more time solving problems.
<a href="https://docs.base14.io/guides/quick-start" target="_blank" rel="noopener noreferrer" class="">See how base14 Scout enables faster incident resolution →</a></p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="related-reading">Related Reading<a href="https://blog.base14.io/effective-warroom-management#related-reading" class="hash-link" aria-label="Direct link to Related Reading" title="Direct link to Related Reading" translate="no">​</a></h2>
<ul>
<li class=""><a class="" href="https://blog.base14.io/factors-influencing-mttr">Understanding What Increases and Reduces MTTR</a> —
Data-driven strategies for cutting incident resolution time</li>
<li class=""><a class="" href="https://blog.base14.io/reducing-bus-factor-in-observability">Reducing Bus Factor in Observability Using AI</a> —
How to distribute operational knowledge across your team</li>
<li class=""><a class="" href="https://blog.base14.io/unified-observability">Why Unified Observability Matters for Growing Engineering Teams</a> —
The case for consolidating your monitoring stack</li>
</ul>]]></content>
        <author>
            <name>Ranjan Sakalley</name>
            <uri>https://base14.io/about</uri>
        </author>
        <category label="incident-management" term="incident-management"/>
        <category label="warroom" term="warroom"/>
        <category label="devops" term="devops"/>
        <category label="sre" term="sre"/>
        <category label="on-call" term="on-call"/>
        <category label="postmortem" term="postmortem"/>
        <category label="incident-response" term="incident-response"/>
        <category label="incident-commander" term="incident-commander"/>
        <category label="blameless-postmortem" term="blameless-postmortem"/>
        <category label="mttr" term="mttr"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[pgX: Comprehensive PostgreSQL Monitoring at Scale]]></title>
        <id>https://blog.base14.io/pgx-details</id>
        <link href="https://blog.base14.io/pgx-details"/>
        <updated>2026-01-06T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Go beyond pg_stat_statements. Monitor nine PostgreSQL observability domains, from connections, replication, and locks to tables, indexes, vacuum, performance, and topology.]]></summary>
        <content type="html"><![CDATA[<iframe width="100%" height="400" src="https://www.youtube.com/embed/ipZdwMLO94s?rel=0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media;
gyroscope; picture-in-picture; web-share; fullscreen"></iframe>
<p><em>Watch: Tracing a slow query from application latency to PostgreSQL stats
with pgX.</em></p>
<p>For many teams, PostgreSQL monitoring begins and often ends with
<code>pg_stat_statements</code> and basic postgres_exporter metrics. That choice is
understandable: the extension provides normalized query statistics, execution
counts, timing data, and enough signal to identify slow queries and obvious
inefficiencies. For a long time, that is sufficient.</p>
<p>But as PostgreSQL clusters grow in size and importance, the questions engineers
need to answer change. Instead of <em>"Which query is slow?"</em>, the questions
become harder and more operational:</p>
<ul>
<li class="">Why is replication lagging right now?</li>
<li class="">Which application is exhausting the connection pool?</li>
<li class="">What is blocking this transaction?</li>
<li class="">Is autovacuum keeping up with write volume?</li>
<li class="">Did performance degrade because of query shape, data growth, or resource pressure?</li>
</ul>
<p>These are not questions <code>pg_stat_statements</code> is designed to answer.</p>
<p>Most teams eventually respond by stitching together ad-hoc queries against
<code>pg_stat_activity</code>, <code>pg_locks</code>, <code>pg_stat_replication</code>, <code>pg_stat_user_tables</code>,
and related system views. This works until an incident demands answers in
minutes, not hours.</p>
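<p>A representative example of that stitching, using only standard catalog
views (<code>pg_blocking_pids</code> requires PostgreSQL 9.6 or later), is the
hand-rolled blocking-lock query most teams end up writing during an incident:</p>
<pre><code class="language-sql">-- Ad-hoc incident query: who is blocked, and by whom?
SELECT blocked.pid    AS blocked_pid,
       blocked.query  AS blocked_query,
       blocking.pid   AS blocking_pid,
       blocking.query AS blocking_query
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY (pg_blocking_pids(blocked.pid))
WHERE cardinality(pg_blocking_pids(blocked.pid)) > 0;
</code></pre>
<p>Useful, but it captures a single point in time, lives in someone's notes
rather than a dashboard, and carries no link back to the application request
that took the lock.</p>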
<p>As we discussed in <a class="" href="https://blog.base14.io/introducing-pgx">our introduction to pgX</a>, PostgreSQL
monitoring in isolation creates blind spots. This post lays out what
<em>comprehensive PostgreSQL monitoring</em> actually looks like at scale: the <strong>nine
observability domains that matter</strong>, the kinds of metrics each domain requires,
and why moving beyond query-only monitoring is unavoidable for serious
production systems.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-pg_stat_statements-does-well">What pg_stat_statements Does Well<a href="https://blog.base14.io/pgx-details#what-pg_stat_statements-does-well" class="hash-link" aria-label="Direct link to What pg_stat_statements Does Well" title="Direct link to What pg_stat_statements Does Well" translate="no">​</a></h2>
<p>Before discussing its limits, it is worth acknowledging what
<code>pg_stat_statements</code> does exceptionally well.</p>
<p>It provides:</p>
<ul>
<li class="">Normalized, per-query execution statistics</li>
<li class="">Call counts and total execution time</li>
<li class="">Min, max, mean, and standard deviation of execution time</li>
<li class="">Buffer hits vs reads</li>
<li class="">Temporary file usage</li>
<li class="">Planning time (PostgreSQL 13+)</li>
<li class="">WAL byte generation (PostgreSQL 13+)</li>
</ul>
<p>These metrics enable teams to:</p>
<ul>
<li class="">Identify slow or expensive queries</li>
<li class="">Detect N+1 query patterns</li>
<li class="">Track query regressions after deployments</li>
<li class="">Find cache-inefficient query shapes</li>
<li class="">Understand which queries dominate workload</li>
</ul>
<p>For early-stage systems, or for focused query optimization work, this is
invaluable. It answers the first generation of performance questions clearly
and efficiently.</p>
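<p>To ground this, a typical first-pass triage query against
<code>pg_stat_statements</code> might look like the sketch below. Column names
shown are the PostgreSQL 13+ variants; on older versions, substitute
<code>total_time</code> and <code>mean_time</code>:</p>
<pre><code class="language-sql">-- Top 10 queries by total execution time (PostgreSQL 13+ column names)
SELECT queryid,
       calls,
       round(total_exec_time::numeric, 2) AS total_ms,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       shared_blks_hit,
       shared_blks_read,
       left(query, 80) AS query_sample
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
</code></pre>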
<p>However, several limitations become significant at scale:</p>
<ul>
<li class="">Statistics reset on restart unless persisted externally</li>
<li class="">No visibility into query plans</li>
<li class="">No real-time view of current contention</li>
<li class="">Limited to top-level statements</li>
<li class="">Storage overhead grows with high query diversity</li>
<li class="">No context about <em>why</em> queries are slow at a given moment</li>
</ul>
<p>These limitations are not flaws. They reflect the narrow scope
<code>pg_stat_statements</code> was designed for. The problem arises when teams expect it
to explain behaviors that live outside that scope.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-9-observability-domains-every-engineer-should-know">The 9 Observability Domains Every Engineer Should Know<a href="https://blog.base14.io/pgx-details#the-9-observability-domains-every-engineer-should-know" class="hash-link" aria-label="Direct link to The 9 Observability Domains Every Engineer Should Know" title="Direct link to The 9 Observability Domains Every Engineer Should Know" translate="no">​</a></h2>
<p>At scale, PostgreSQL behavior is shaped by far more than query execution time.
Comprehensive monitoring requires visibility across nine distinct domains, each
answering a different class of operational question.</p>
<p><img decoding="async" loading="lazy" alt="The 9 PostgreSQL observability domains" src="https://blog.base14.io/assets/images/pg-2-domains-6c0bb8e8ca2f248f9785ac89b05d533d.svg" width="2304" height="1704" class="img_ev3q"></p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="domain-1-connections">Domain 1: Connections<a href="https://blog.base14.io/pgx-details#domain-1-connections" class="hash-link" aria-label="Direct link to Domain 1: Connections" title="Direct link to Domain 1: Connections" translate="no">​</a></h3>
<p>Connection behavior often explains system instability long before queries look
slow. This is essential for PostgreSQL connection pool monitoring and capacity
planning. pgX tracks connection state, ownership, and duration patterns.
See the <a href="https://docs.base14.io/operate/pgx/connections" target="_blank" rel="noopener noreferrer" class="">pgX Connections documentation</a> for detailed
visualizations.</p>
<table><thead><tr><th>Signal</th><th>Why It Matters</th></tr></thead><tbody><tr><td>Total connections vs <code>max_connections</code></td><td>Headroom before exhaustion</td></tr><tr><td>State breakdown (active, idle, idle in transaction)</td><td>Identifies connection leaks</td></tr><tr><td>Connections by <code>application_name</code></td><td>Pinpoints responsible service</td></tr><tr><td>Connection duration heatmaps</td><td>Reveals long-lived connection patterns</td></tr></tbody></table>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>Key pgX Metrics</div><div class="admonitionContent_BuS1"><p><code>pg_connections</code>, <code>pg_backend_type_count</code>, <code>pg_backend_age_seconds</code>,
<code>pg_backend_wait_events</code> <sup><a href="https://blog.base14.io/pgx-details#footnotes" class="">[1]</a></sup></p></div></div>
<p><strong>Key question:</strong> <em>"We're hitting max_connections. Which service is responsible?"</em></p>
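<p>Even without pgX, a first-pass answer to that question can come straight from <code>pg_stat_activity</code>. The sketch below (PostgreSQL 10+, where <code>backend_type</code> is available) is illustrative only; pgX's own collection queries may differ:</p>

```sql
-- Who holds connections, and in what state?
SELECT application_name,
       state,
       count(*)                   AS conns,
       max(now() - backend_start) AS oldest_connection
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY application_name, state
ORDER BY conns DESC;

-- Headroom against the configured limit
SELECT count(*)                                 AS used,
       current_setting('max_connections')::int  AS max_allowed
FROM pg_stat_activity;
```

A large count of <code>idle in transaction</code> sessions under one <code>application_name</code> usually points at a connection leak in that service.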
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="domain-2-replication">Domain 2: Replication<a href="https://blog.base14.io/pgx-details#domain-2-replication" class="hash-link" aria-label="Direct link to Domain 2: Replication" title="Direct link to Domain 2: Replication" translate="no">​</a></h3>
<p>Replication health determines both performance and reliability. Effective
PostgreSQL replication lag monitoring requires visibility into multiple layers.
pgX monitors lag, WAL flow, and standby conflicts across your entire topology.
Explore the <a href="https://docs.base14.io/operate/pgx/replication" target="_blank" rel="noopener noreferrer" class="">pgX Replication tab</a> for standby
monitoring.</p>
<table><thead><tr><th>Signal</th><th>Why It Matters</th></tr></thead><tbody><tr><td>Write, flush, and replay lag per standby</td><td>Pinpoints where lag occurs</td></tr><tr><td>WAL generation rate</td><td>Baseline for capacity planning</td></tr><tr><td>Replication slot state</td><td>WAL retention risk</td></tr><tr><td>Standby conflicts (snapshot, lock, buffer pin)</td><td>Explains unexpected lag spikes</td></tr></tbody></table>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>Key pgX Metrics</div><div class="admonitionContent_BuS1"><p><code>pg_replication_lag_milliseconds</code>, <code>pg_replication_outgoing</code>,
<code>pg_replication_slot_lag_bytes</code>, <code>pg_replication_incoming</code> <sup><a href="https://blog.base14.io/pgx-details#footnotes" class="">[1]</a></sup></p></div></div>
<p><strong>Key question:</strong> <em>"The replica is 30 seconds behind - I/O, network, or query conflicts?"</em></p>
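<p>A rough way to split that question by layer, run on the primary (PostgreSQL 10+; shown as a sketch of the underlying views, not pgX's exact queries):</p>

```sql
-- Per-standby lag, broken down by stage: a large write_lag suggests
-- network; flush_lag suggests standby I/O; replay_lag suggests
-- query conflicts or replay CPU on the standby
SELECT application_name,
       client_addr,
       state,
       write_lag,
       flush_lag,
       replay_lag,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;

-- Slots retaining WAL (disk-exhaustion risk if a consumer stalls)
SELECT slot_name,
       active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_wal_bytes
FROM pg_replication_slots;
```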
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="domain-3-locks--waits">Domain 3: Locks &amp; Waits<a href="https://blog.base14.io/pgx-details#domain-3-locks--waits" class="hash-link" aria-label="Direct link to Domain 3: Locks &amp; Waits" title="Direct link to Domain 3: Locks &amp; Waits" translate="no">​</a></h3>
<p>Locking behavior is emergent: it arises from concurrency patterns, transaction
duration, and workload shape. Debugging PostgreSQL lock contention therefore
requires real-time visibility. The <a href="https://docs.base14.io/operate/pgx/locks-waits" target="_blank" rel="noopener noreferrer" class="">pgX Locks &amp; Waits view</a>
surfaces blocking chains and wait events as they happen.</p>
<table><thead><tr><th>Signal</th><th>Why It Matters</th></tr></thead><tbody><tr><td>Lock counts by type (relation, tuple, txid, advisory)</td><td>Categorizes contention</td></tr><tr><td>Lock wait queue depth</td><td>Shows contention severity</td></tr><tr><td>Blocking session chains</td><td>Identifies who blocks whom</td></tr><tr><td>Wait event distribution (Lock, LWLock, IO, BufferPin)</td><td>Classifies wait types</td></tr><tr><td>Deadlock frequency</td><td>Detects design issues</td></tr></tbody></table>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>Key pgX Metrics</div><div class="admonitionContent_BuS1"><p><code>pg_locks_count</code>, <code>pg_lock_detail</code>, <code>pg_blocking_pids</code>,
<code>pg_backend_wait_events</code> <sup><a href="https://blog.base14.io/pgx-details#footnotes" class="">[1]</a></sup></p></div></div>
<p><strong>Key question:</strong> <em>"Transactions are timing out. What's the blocking chain?"</em></p>
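<p>The building block for that answer is <code>pg_blocking_pids()</code>. A minimal sketch, run on the affected database (pgX assembles the full chain for you):</p>

```sql
-- Who is blocked, and by which backend(s)?
SELECT a.pid,
       pg_blocking_pids(a.pid) AS blocked_by,
       a.wait_event_type,
       a.wait_event,
       now() - a.xact_start    AS txn_age,
       left(a.query, 60)       AS query
FROM pg_stat_activity a
WHERE cardinality(pg_blocking_pids(a.pid)) > 0
ORDER BY txn_age DESC;
```

Following <code>blocked_by</code> recursively reveals the root blocker, which is often a long-lived <code>idle in transaction</code> session rather than a slow query.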
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="domain-4-tables">Domain 4: Tables<a href="https://blog.base14.io/pgx-details#domain-4-tables" class="hash-link" aria-label="Direct link to Domain 4: Tables" title="Direct link to Domain 4: Tables" translate="no">​</a></h3>
<p>Table-level health directly impacts performance and predictability. PostgreSQL
bloat detection and table health monitoring are essential for performance
tuning. pgX tracks bloat, cache efficiency, scan patterns, and freeze age per
table. See the <a href="https://docs.base14.io/operate/pgx/tables-indexes" target="_blank" rel="noopener noreferrer" class="">pgX Tables &amp; Indexes view</a> for
detailed table health metrics.</p>
<table><thead><tr><th>Signal</th><th>Why It Matters</th></tr></thead><tbody><tr><td>Live vs dead tuple counts</td><td>Bloat indicator</td></tr><tr><td>Estimated bloat percentage</td><td>Maintenance urgency</td></tr><tr><td>Cache hit ratio per table</td><td>Hot vs cold data</td></tr><tr><td>Sequential vs index scan counts</td><td>Query plan efficiency</td></tr><tr><td>Row activity (inserts, updates, deletes, HOT)</td><td>Write pattern visibility</td></tr><tr><td>Freeze age</td><td>Wraparound risk</td></tr></tbody></table>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>Key pgX Metrics</div><div class="admonitionContent_BuS1"><p><code>pg_table_stats</code>: <code>n_live_tup</code>, <code>n_dead_tup</code>, <code>bloat_bytes</code>, <code>seq_scan</code>,
<code>idx_scan</code>, <code>heap_blks_hit</code>, <code>age_relfrozenxid</code> <sup><a href="https://blog.base14.io/pgx-details#footnotes" class="">[1]</a></sup></p></div></div>
<p><strong>Key question:</strong> <em>"Performance degraded - bloat or scan regressions?"</em></p>
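<p>Both halves of that question are visible in the statistics views. A sketch of the kind of query involved (pgX also estimates bloat in bytes, which these views alone do not provide):</p>

```sql
-- Dead-tuple ratio and scan mix per table
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup
             / nullif(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
       seq_scan,
       idx_scan
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

-- Wraparound risk: tables with the oldest frozen transaction ID
SELECT relname, age(relfrozenxid) AS xid_age
FROM pg_class
WHERE relkind = 'r'
ORDER BY xid_age DESC
LIMIT 10;
```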
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="domain-5-indexes">Domain 5: Indexes<a href="https://blog.base14.io/pgx-details#domain-5-indexes" class="hash-link" aria-label="Direct link to Domain 5: Indexes" title="Direct link to Domain 5: Indexes" translate="no">​</a></h3>
<p>Indexes improve read performance but impose write overhead and maintenance
cost. pgX measures index usage, efficiency, and bloat to identify optimization
opportunities.</p>
<table><thead><tr><th>Signal</th><th>Why It Matters</th></tr></thead><tbody><tr><td>Index scan counts</td><td>Usage frequency</td></tr><tr><td>Tuples read vs fetched</td><td>Selectivity efficiency</td></tr><tr><td>Index cache hit ratios</td><td>Memory effectiveness</td></tr><tr><td>Index bloat estimates</td><td>Maintenance needs</td></tr><tr><td>Unused/rarely used indexes</td><td>Candidates for removal</td></tr></tbody></table>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>Key pgX Metrics</div><div class="admonitionContent_BuS1"><p><code>pg_index_stats</code>: <code>idx_scan</code>, <code>idx_tup_read</code>, <code>idx_tup_fetch</code>, <code>idx_blks_hit</code>,
<code>bloat_bytes</code>, <code>size_bytes</code> <sup><a href="https://blog.base14.io/pgx-details#footnotes" class="">[1]</a></sup></p></div></div>
<p><strong>Key question:</strong> <em>"Which indexes are helping, and which are hurting?"</em></p>
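<p>The "hurting" half of that question can be approximated from <code>pg_stat_user_indexes</code>. A sketch only — note these counters accumulate since the last stats reset, and an index unused on the primary may still serve reads on a replica:</p>

```sql
-- Indexes that are never scanned (removal candidates),
-- excluding those that enforce uniqueness
SELECT s.schemaname,
       s.relname                                      AS table_name,
       s.indexrelname                                 AS index_name,
       pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size
FROM pg_stat_user_indexes s
JOIN pg_index i ON i.indexrelid = s.indexrelid
WHERE s.idx_scan = 0
  AND NOT i.indisunique
ORDER BY pg_relation_size(s.indexrelid) DESC;
```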
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="domain-6-maintenance-vacuum--analyze">Domain 6: Maintenance (Vacuum &amp; Analyze)<a href="https://blog.base14.io/pgx-details#domain-6-maintenance-vacuum--analyze" class="hash-link" aria-label="Direct link to Domain 6: Maintenance (Vacuum &amp; Analyze)" title="Direct link to Domain 6: Maintenance (Vacuum &amp; Analyze)" translate="no">​</a></h3>
<p>Maintenance debt accumulates quietly and surfaces as sudden performance
regressions. Effective PostgreSQL vacuum monitoring and autovacuum monitoring
prevent the silent accumulation of bloat. pgX tracks vacuum and analyze
activity, dead tuple growth, and autovacuum effectiveness. Track maintenance
health in the <a href="https://docs.base14.io/operate/pgx/maintenance" target="_blank" rel="noopener noreferrer" class="">pgX Maintenance dashboard</a>.</p>
<table><thead><tr><th>Signal</th><th>Why It Matters</th></tr></thead><tbody><tr><td>Last vacuum/autovacuum per table</td><td>Maintenance recency</td></tr><tr><td>Last analyze/autoanalyze per table</td><td>Statistics freshness</td></tr><tr><td>Dead tuple accumulation rate</td><td>Bloat velocity</td></tr><tr><td>Autovacuum worker activity</td><td>Worker saturation</td></tr><tr><td>Rows modified since last analyze</td><td>Stale statistics risk</td></tr></tbody></table>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>Key pgX Metrics</div><div class="admonitionContent_BuS1"><p><code>pg_table_stats</code>: <code>last_vacuum</code>, <code>last_autovacuum</code>, <code>vacuum_count</code>,
<code>autovacuum_count</code>, <code>n_mod_since_analyze</code> / <code>pg_vacuum_progress</code> <sup><a href="https://blog.base14.io/pgx-details#footnotes" class="">[1]</a></sup></p></div></div>
<p><strong>Key question:</strong> <em>"Autovacuum is running - why is bloat still growing?"</em></p>
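<p>Answering that usually means comparing maintenance recency against accumulated debt per table. A starting-point sketch (the time series pgX keeps is what turns this snapshot into a rate):</p>

```sql
-- Is autovacuum keeping up with the tables that need it most?
SELECT relname,
       n_dead_tup,           -- accumulated debt
       n_mod_since_analyze,  -- stale-statistics risk
       last_autovacuum,
       last_autoanalyze,
       autovacuum_count
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```

A table with high <code>n_dead_tup</code> but a recent <code>last_autovacuum</code> often indicates long-running transactions preventing vacuum from reclaiming tuples, not a missing vacuum.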
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="domain-7-performance-beyond-aggregates">Domain 7: Performance (Beyond Aggregates)<a href="https://blog.base14.io/pgx-details#domain-7-performance-beyond-aggregates" class="hash-link" aria-label="Direct link to Domain 7: Performance (Beyond Aggregates)" title="Direct link to Domain 7: Performance (Beyond Aggregates)" translate="no">​</a></h3>
<p>Performance monitoring at scale requires <em>distributional</em> insight, not just
averages. PostgreSQL performance tuning needs more than mean latency. pgX
provides percentile breakdowns, query heatmaps, and per-query drill-downs over
time. Drill into query performance in the <a href="https://docs.base14.io/operate/pgx/queries" target="_blank" rel="noopener noreferrer" class="">pgX Queries view</a>
and explore the <a href="https://docs.base14.io/operate/pgx/performance" target="_blank" rel="noopener noreferrer" class="">pgX Performance tab</a>.</p>
<table><thead><tr><th>Signal</th><th>Why It Matters</th></tr></thead><tbody><tr><td>Response time percentiles (p50, p90, p95, p99)</td><td>Tail latency visibility</td></tr><tr><td>Query heatmaps over time</td><td>Temporal patterns</td></tr><tr><td>Query type distribution (SELECT, INSERT, etc.)</td><td>Workload characterization</td></tr><tr><td>Per-query drill-downs (cache, I/O, planning, temp, WAL)</td><td>Root cause detail</td></tr></tbody></table>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>Key pgX Metrics</div><div class="admonitionContent_BuS1"><p><code>pg_statement_stats</code>: <code>calls</code>, <code>total_time_ms</code>, <code>avg_time_ms</code>, <code>rows</code>,
<code>shared_blks_hit</code>, <code>shared_blks_read</code>, <code>temp_blks_written</code> <sup><a href="https://blog.base14.io/pgx-details#footnotes" class="">[1]</a></sup></p></div></div>
<p><strong>Key question:</strong> <em>"Which queries degraded, when, and under what conditions?"</em></p>
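<p>The raw per-query data behind those drill-downs comes from the <code>pg_stat_statements</code> extension (which must be installed and preloaded). A sketch using PostgreSQL 13+ column names (older versions use <code>total_time</code> / <code>mean_time</code>):</p>

```sql
-- Top statements by total time, with cache and temp-file context
SELECT left(query, 60)                    AS query,
       calls,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       round(total_exec_time::numeric, 1) AS total_ms,
       shared_blks_hit,
       shared_blks_read,
       temp_blks_written   -- spills to disk: a work_mem signal
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```

Note that <code>pg_stat_statements</code> only reports averages per normalized query; the percentile and heatmap views described above require sampling over time, which is what pgX adds on top.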
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="domain-8-resources">Domain 8: Resources<a href="https://blog.base14.io/pgx-details#domain-8-resources" class="hash-link" aria-label="Direct link to Domain 8: Resources" title="Direct link to Domain 8: Resources" translate="no">​</a></h3>
<p>Database performance is inseparable from the resources underneath it. pgX
correlates database behavior with CPU, memory, disk, and network metrics.</p>
<table><thead><tr><th>Signal</th><th>Why It Matters</th></tr></thead><tbody><tr><td>CPU utilization</td><td>Compute saturation</td></tr><tr><td>Memory pressure</td><td>Buffer/cache effectiveness</td></tr><tr><td>Disk I/O throughput and latency</td><td>Storage bottlenecks</td></tr><tr><td>Network throughput</td><td>Replication/client bandwidth</td></tr></tbody></table>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>Key pgX Metrics</div><div class="admonitionContent_BuS1"><p><code>pg_system_load_avg</code>, <code>pg_system_memory_bytes</code>, <code>pg_system_swap_bytes</code>,
<code>pg_system_info</code> <sup><a href="https://blog.base14.io/pgx-details#footnotes" class="">[1]</a></sup></p></div></div>
<p><strong>Key question:</strong> <em>"Queries look slow, but Postgres looks normal - is the
instance saturated?"</em></p>
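<p>Fully answering that requires the host metrics pgX correlates; from inside PostgreSQL alone, a cache-hit check against <code>pg_stat_database</code> gives one partial signal. A sketch:</p>

```sql
-- Buffer cache effectiveness per database: a hit ratio that drops
-- under steady load suggests the working set no longer fits in memory
SELECT datname,
       blks_read,
       blks_hit,
       round(100.0 * blks_hit
             / nullif(blks_hit + blks_read, 0), 1) AS cache_hit_pct
FROM pg_stat_database
WHERE datname NOT IN ('template0', 'template1');
```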
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="domain-9-topology--health">Domain 9: Topology &amp; Health<a href="https://blog.base14.io/pgx-details#domain-9-topology--health" class="hash-link" aria-label="Direct link to Domain 9: Topology &amp; Health" title="Direct link to Domain 9: Topology &amp; Health" translate="no">​</a></h3>
<p>Operational awareness requires a coherent view of the cluster. This is
foundational context for managing Postgres at scale.</p>
<table><thead><tr><th>Signal</th><th>Why It Matters</th></tr></thead><tbody><tr><td>Application-to-database topology</td><td>Connection flow visibility</td></tr><tr><td>Primary/replica layout</td><td>Replication architecture</td></tr><tr><td>Cluster health checks</td><td>Availability status</td></tr><tr><td>Error rates and database size</td><td>Growth and stability trends</td></tr></tbody></table>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>Key pgX Metrics</div><div class="admonitionContent_BuS1"><p><code>pg_up</code>, <code>pg_database_size_bytes</code>, <code>pg_server_version</code>, <code>pg_settings</code>,
<code>pg_database_stats</code> <sup><a href="https://blog.base14.io/pgx-details#footnotes" class="">[1]</a></sup></p></div></div>
<p><strong>Key question:</strong> <em>"What is the current health and shape of the cluster?"</em></p>
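<p>For a single node, the basics behind that question are one query away (cluster-wide topology, which pgX assembles automatically, requires repeating this per host):</p>

```sql
-- Identity and health of the node you are connected to
SELECT current_setting('server_version')  AS version,
       pg_is_in_recovery()                AS is_replica,
       pg_postmaster_start_time()         AS started_at,
       pg_size_pretty(pg_database_size(current_database())) AS db_size;
```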
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-operational-gap">The Operational Gap<a href="https://blog.base14.io/pgx-details#the-operational-gap" class="hash-link" aria-label="Direct link to The Operational Gap" title="Direct link to The Operational Gap" translate="no">​</a></h2>
<p>All of this data already exists inside PostgreSQL. It lives in system catalogs
and views such as:</p>
<ul>
<li class=""><code>pg_stat_activity</code></li>
<li class=""><code>pg_stat_replication</code></li>
<li class=""><code>pg_locks</code></li>
<li class=""><code>pg_stat_user_tables</code></li>
<li class=""><code>pg_stat_user_indexes</code></li>
<li class=""><code>pg_stat_bgwriter</code></li>
<li class=""><code>pg_stat_wal</code></li>
</ul>
<p>The challenge is <strong>operationalizing</strong> it.</p>
<p>Teams must:</p>
<ul>
<li class="">Collect metrics at appropriate intervals</li>
<li class="">Store them as time-series</li>
<li class="">Build dashboards per domain</li>
<li class="">Define meaningful alerts</li>
<li class="">Maintain and evolve the stack as Postgres versions change</li>
</ul>
<p>A common DIY approach looks like:</p>
<ul>
<li class="">Prometheus + postgres_exporter</li>
<li class="">Custom SQL queries for gaps</li>
<li class="">Grafana dashboards</li>
<li class="">Alertmanager for notifications</li>
</ul>
<p>This works, but comes with hidden costs:</p>
<ul>
<li class="">Partial coverage (bloat, per-query drill-downs, and maintenance are often
skipped)</li>
<li class="">Configuration drift across environments</li>
<li class="">Tribal knowledge about which queries matter</li>
<li class="">No prebuilt investigation workflows</li>
<li class="">High cognitive load during incidents</li>
</ul>
<p>This maintenance burden is a form of
<a class="" href="https://blog.base14.io/unified-observability">observability tax</a> that compounds over time. Teams
spend more effort maintaining observability than using it.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-comprehensive-actually-looks-like">What “Comprehensive” Actually Looks Like<a href="https://blog.base14.io/pgx-details#what-comprehensive-actually-looks-like" class="hash-link" aria-label="Direct link to What “Comprehensive” Actually Looks Like" title="Direct link to What “Comprehensive” Actually Looks Like" translate="no">​</a></h2>
<p>Comprehensive monitoring is about <strong>structured coverage, sufficient depth, and
usable workflows</strong>.</p>
<p>In practice, this means:</p>
<ul>
<li class="">Coverage across all nine domains</li>
<li class="">Hundreds of metrics, each with meaningful dimensions (database, table, index,
user, application)</li>
<li class="">Time-series retention that preserves behavioral trends</li>
<li class="">Dashboards organized by operational concern, not metric type</li>
</ul>
<p>pgX follows this model:</p>
<ul>
<li class="">Metrics are grouped into logical categories</li>
<li class="">Each category exposes deep sub-metrics</li>
<li class="">Dashboards are prebuilt and aligned to real investigative workflows</li>
</ul>
<p>For example:</p>
<ul>
<li class="">Query metrics expose timing percentiles, buffer behavior, temp file usage,
planning time, and WAL impact</li>
<li class="">Table metrics include bloat, cache efficiency, scan patterns, maintenance
history, and freeze age</li>
<li class="">Index metrics surface usage effectiveness, bloat, and cache behavior</li>
</ul>
<p>Crucially, these views are interconnected. An engineer can start from a
high-level performance regression and drill down into the exact structural or
operational cause without switching tools.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="from-metrics-to-answers">From Metrics to Answers<a href="https://blog.base14.io/pgx-details#from-metrics-to-answers" class="hash-link" aria-label="Direct link to From Metrics to Answers" title="Direct link to From Metrics to Answers" translate="no">​</a></h2>
<p>For us, the goal is to help engineers achieve faster, more confident resolution.</p>
<p>A representative workflow looks like this:</p>
<ol>
<li class="">Users report slow checkout requests</li>
<li class="">Performance view shows a p95 response time spike</li>
<li class="">Drill into Queries, filter to high-percentile latency</li>
<li class="">Identify a degraded query</li>
<li class="">Inspect cache hit ratio and I/O patterns for that query</li>
<li class="">Navigate to Tables &amp; Indexes</li>
<li class="">Discover 40% bloat on the primary table</li>
<li class="">Check Maintenance and see autovacuum hasn’t run recently</li>
<li class="">Root cause identified in minutes, not hours</li>
</ol>
<p>Alerting also becomes more meaningful:</p>
<ul>
<li class="">Compound conditions such as <em>replication lag + read traffic</em></li>
<li class="">Trend-based alerts on connection exhaustion</li>
<li class="">Early warnings on maintenance debt</li>
</ul>
<p>When PostgreSQL metrics share the same data lake as application telemetry,
teams can move seamlessly from slow endpoints to slow queries to underlying
data health.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="conclusion">Conclusion<a href="https://blog.base14.io/pgx-details#conclusion" class="hash-link" aria-label="Direct link to Conclusion" title="Direct link to Conclusion" translate="no">​</a></h2>
<p>Comprehensive PostgreSQL monitoring requires visibility across queries,
connections, replication, locks, tables, indexes, maintenance, resources, and
topology.</p>
<p>Teams face a choice:</p>
<ul>
<li class="">Build and maintain this visibility themselves, or</li>
<li class="">Use tooling designed to provide it out of the box</li>
</ul>
<p>pgX delivers structured coverage across all nine domains, with deep metrics,
prebuilt dashboards, and workflows integrated into the same observability
surface as application telemetry. For teams experiencing long incident
resolution times, these capabilities directly help
<a class="" href="https://blog.base14.io/factors-influencing-mttr">reduce MTTR</a>. PostgreSQL does not operate in
isolation. Its behavior is shaped by application code, request patterns,
background jobs, deployments, and infrastructure constraints. To reliably debug
production issues, engineers also need <strong>application traces, logs, and
infrastructure signals in the same place</strong>, sharing the same time axis and
context.</p>
<p>This is where unified observability matters. When PostgreSQL metrics live
alongside application and infrastructure telemetry, stored in the same data
lake and explored through the same interface, teams can move from symptoms to
causes
without stitching data across tools. Slow endpoints can be traced to slow
queries, degraded queries to table bloat or lock contention, and database
pressure back to application behavior or infrastructure limits.</p>
<p>That ability to reason about the system end-to-end is what ultimately separates
surface-level monitoring from true operational understanding. You can find the
technical setup in our <a href="https://docs.base14.io/operate/pgx/overview" target="_blank" rel="noopener noreferrer" class="">pgX documentation</a>, including the
<a href="https://docs.base14.io/operate/pgx/quickstart" target="_blank" rel="noopener noreferrer" class="">quickstart guide</a> and the complete
<a href="https://docs.base14.io/operate/pgx/metrics" target="_blank" rel="noopener noreferrer" class="">metrics reference</a>. And if you're navigating this exact
problem—figuring out how to unify database observability with the rest of your
stack—we'd be interested to hear how you're approaching it.</p>
<hr>
<div id="footnotes"><h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="footnotes">Footnotes<a href="https://blog.base14.io/pgx-details#footnotes" class="hash-link" aria-label="Direct link to Footnotes" title="Direct link to Footnotes" translate="no">​</a></h3><p><sup>1</sup> For the complete list of pgX metrics, see the
<a href="https://docs.base14.io/operate/pgx/metrics" target="_blank" rel="noopener noreferrer" class="">Metrics Reference</a>.</p></div>]]></content>
        <author>
            <name>base14 Team</name>
            <uri>https://base14.io/about</uri>
        </author>
        <category label="postgresql" term="postgresql"/>
        <category label="observability" term="observability"/>
        <category label="monitoring" term="monitoring"/>
        <category label="database" term="database"/>
        <category label="pgx" term="pgx"/>
        <category label="pg-stat-statements" term="pg-stat-statements"/>
        <category label="replication" term="replication"/>
        <category label="vacuum" term="vacuum"/>
        <category label="performance-tuning" term="performance-tuning"/>
        <category label="database-operations" term="database-operations"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Introducing pgX: Bridging the Gap Between Database and Application Monitoring for PostgreSQL]]></title>
        <id>https://blog.base14.io/introducing-pgx</id>
        <link href="https://blog.base14.io/introducing-pgx"/>
        <updated>2026-01-05T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[pgX unifies PostgreSQL monitoring with application observability. Correlate database metrics with traces, logs, and infrastructure in one platform.]]></summary>
        <content type="html"><![CDATA[<iframe width="100%" height="400" src="https://www.youtube.com/embed/ipZdwMLO94s?rel=0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media;
gyroscope; picture-in-picture; web-share; fullscreen"></iframe>
<p><em>Watch: Tracing a slow query from application latency to PostgreSQL stats
with pgX.</em></p>
<p>Modern software systems do not fail along clean architectural boundaries.
Application latency, database contention, infrastructure saturation, and user
behavior are tightly coupled, yet most observability setups continue to treat
them as separate concerns, creating silos between database monitoring and APM
tools. PostgreSQL, despite being a core component in most production systems,
is often monitored in isolation—through a separate tool, separate dashboards,
and separate mental models.</p>
<p>This separation works when systems are small and traffic patterns are simple.
As systems scale, however, PostgreSQL behavior becomes a direct function of
application usage: query patterns change with features, load fluctuates with
users, and database pressure reflects upstream design decisions. At this stage,
isolating database monitoring from application and infrastructure observability
actively slows down diagnosis and leads teams to optimize the wrong layer.</p>
<p>In-depth PostgreSQL monitoring is necessary—but depth alone is not sufficient.
Metrics without context force engineers to manually correlate symptoms across
tools, timelines, and data models. What is required instead is component-level
observability—a unified database observability platform where PostgreSQL metrics
live alongside application traces, infrastructure signals, and deployment
events, sharing the same time axis and the same analytical surface.</p>
<p>This is why PostgreSQL observability belongs in the same place as application
and infrastructure observability. When database behavior is observed as part of
the system rather than as a standalone dependency, engineers can reason about
causality instead of coincidence, and leaders gain confidence that performance
issues are being addressed at their source, not just mitigated downstream.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-postgresql-is-commonly-observed-in-isolation">Why Is PostgreSQL Commonly Observed in Isolation?<a href="https://blog.base14.io/introducing-pgx#why-postgresql-is-commonly-observed-in-isolation" class="hash-link" aria-label="Direct link to Why Is PostgreSQL Commonly Observed in Isolation?" title="Direct link to Why Is PostgreSQL Commonly Observed in Isolation?" translate="no">​</a></h2>
<p>PostgreSQL's popularity is not accidental. Its defaults are sensible, its
abstractions are strong, and it shields teams from operational complexity early
in a system's life. Standard views such as <code>pg_stat_activity</code>,
<code>pg_stat_statements</code>, and the replication statistics views provide enough
visibility to operate comfortably at a modest scale.
<p>As a result, many teams adopt a mental model where:</p>
<ul>
<li class="">The application is monitored via APM and logs</li>
<li class="">Infrastructure is monitored via host or container metrics</li>
<li class="">The database is monitored "over there," often with a specialized tool</li>
</ul>
<p>This division is rarely intentional. It emerges organically from tooling
ecosystems and organizational boundaries. Database monitoring tools evolved
separately, application observability evolved separately, and teams adapted
around the seams. This is a form of the observability sprawl we discussed in
<a class="" href="https://blog.base14.io/unified-observability">why unified observability matters</a>.</p>
<p>The problem is that the system itself does not respect these seams.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-inflection-point-when-isolation-stops-working">The Inflection Point: When Isolation Stops Working<a href="https://blog.base14.io/introducing-pgx#the-inflection-point-when-isolation-stops-working" class="hash-link" aria-label="Direct link to The Inflection Point: When Isolation Stops Working" title="Direct link to The Inflection Point: When Isolation Stops Working" translate="no">​</a></h2>
<p>There is a predictable point where this model begins to fail. It typically
coincides with one or more of the following:</p>
<ul>
<li class="">Increased concurrency and mixed workloads</li>
<li class="">Features that introduce new query shapes or access patterns</li>
<li class="">Multi-tenant or user-driven traffic variability</li>
<li class="">Latency budgets that tighten as the product matures</li>
</ul>
<p>At this stage, PostgreSQL metrics start reflecting <em>effects</em>, not <em>causes</em>. This
is where pg_stat_statements alone stops being sufficient for PostgreSQL
performance troubleshooting.</p>
<p>Engineers see:</p>
<ul>
<li class="">Rising query latency without obvious query changes</li>
<li class="">Lock contention that appears sporadic</li>
<li class="">CPU or IO pressure that correlates weakly with query volume</li>
<li class="">Replication lag that spikes during "normal" traffic</li>
</ul>
<p>Each tool shows part of the picture, but none show the system.</p>
<p><img decoding="async" loading="lazy" alt="Jumping through dashboards to correlate" src="https://blog.base14.io/assets/images/pgx-1-unified-b88c7bf32e579189f1b1d484a1bab3a5.svg" width="2345" height="1643" class="img_ev3q"></p>
<p><em>Jumping through dashboards to correlate, step after step, leads to erroneous
attribution and higher MTTR</em></p>
<p>The engineer is forced into manual correlation:</p>
<ul>
<li class="">Jumping between dashboards</li>
<li class="">Aligning timelines by eye</li>
<li class="">Inferring causality from coincidence</li>
</ul>
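<p>The alignment step is mechanical enough to express in a few lines, which is exactly why it should not be a human job. A hedged sketch of timestamp-window correlation between two signal streams (the event shapes and names are hypothetical):</p>

```python
from datetime import datetime, timedelta

def correlate(db_events, app_events, window_s=30):
    """Pair each database anomaly with application anomalies occurring
    within +/- window_s seconds -- the by-eye timeline alignment
    engineers otherwise perform across two dashboards."""
    window = timedelta(seconds=window_s)
    pairs = []
    for db_ev in db_events:
        for app_ev in app_events:
            if abs(db_ev["ts"] - app_ev["ts"]) <= window:
                pairs.append((db_ev["name"], app_ev["name"]))
    return pairs

db = [{"ts": datetime(2025, 1, 1, 12, 0, 10), "name": "lock_contention_spike"}]
app = [{"ts": datetime(2025, 1, 1, 12, 0, 25), "name": "p99_latency_alert"},
       {"ts": datetime(2025, 1, 1, 13, 0, 0), "name": "deploy_started"}]
print(correlate(db, app))  # only the events 15 seconds apart pair up
```

Note that even done perfectly, this yields coincidence in time, not causality — the core limitation of correlating across disconnected tools.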
<p>This is not an engineer skill problem. It is a tooling model problem. As
dedicated DBA roles continue to vanish, we must put expert-level tooling
directly into the hands of every developer. pgX doesn't just show data; it
empowers every engineer to perform the deep-dive analysis traditionally
reserved for database specialists.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-cost-of-split-observability">The Cost of Split Observability<a href="https://blog.base14.io/introducing-pgx#the-cost-of-split-observability" class="hash-link" aria-label="Direct link to The Cost of Split Observability" title="Direct link to The Cost of Split Observability" translate="no">​</a></h2>
<p>When database observability is isolated, several failure modes become common:</p>
<table><thead><tr><th></th><th><strong>Technical Impact</strong></th><th><strong>Organizational Impact</strong></th></tr></thead><tbody><tr><td><strong>During Incidents</strong></td><td><strong>Slower Response</strong> - Engineers spend time <em>proving</em> whether the database is the cause or the victim. Valuable minutes are lost ruling things out instead of addressing the root cause, directly increasing <a class="" href="https://blog.base14.io/factors-influencing-mttr">Mean Time to Recovery</a>.</td><td><strong>Blurred Ownership</strong> - "Database issue" and "application issue" become political labels rather than technical diagnoses. Accountability diffuses.</td></tr><tr><td><strong>After Incidents</strong></td><td><strong>Incorrect Optimization</strong> - Teams tune queries when the real issue is connection churn, or scale infrastructure when the bottleneck is lock contention driven by application behavior.</td><td><strong>Leadership Mistrust</strong> - When explanations rely on inferred correlation rather than observed causality, confidence erodes—both in the tools and in the process.</td></tr></tbody></table>
<p>These are organizational costs, not just technical ones.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="databases-are-not-dependencies---they-are-components">Databases Are Not Dependencies - They Are Components<a href="https://blog.base14.io/introducing-pgx#databases-are-not-dependencies---they-are-components" class="hash-link" aria-label="Direct link to Databases Are Not Dependencies - They Are Components" title="Direct link to Databases Are Not Dependencies - They Are Components" translate="no">​</a></h2>
<p>A critical mental shift is required: PostgreSQL is not just an external
dependency that occasionally misbehaves. It is a stateful component whose
behavior is continuously shaped by the application.</p>
<p>Queries do not exist in isolation. They are the result of:</p>
<ul>
<li class="">User behavior</li>
<li class="">Feature flags</li>
<li class="">Request fan-out</li>
<li class="">ORM behavior</li>
<li class="">Deployment changes</li>
<li class="">Background jobs and scheduled work</li>
</ul>
<p>Observing PostgreSQL without this context is akin to observing CPU usage
without knowing which process is running.</p>
<p>True observability requires that all major components of a system be observed
together, not just deeply.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-bridging-the-gap-actually-means">What "Bridging the Gap" Actually Means<a href="https://blog.base14.io/introducing-pgx#what-bridging-the-gap-actually-means" class="hash-link" aria-label="Direct link to What &quot;Bridging the Gap&quot; Actually Means" title="Direct link to What &quot;Bridging the Gap&quot; Actually Means" translate="no">​</a></h2>
<p>Bridging database and application monitoring requires structural alignment:</p>
<table><thead><tr><th>Requirement</th><th>Description</th></tr></thead><tbody><tr><td><strong>Shared Time Axis</strong></td><td>PostgreSQL metrics, application traces, and infrastructure signals must be observable on the same timeline, dashboards and logs, without manual alignment.</td></tr><tr><td><strong>Shared Identifiers</strong></td><td>Queries, requests, services, and hosts should be linkable through consistent labels and metadata.</td></tr><tr><td><strong>Unified Storage</strong></td><td>Data should live in the same analytical backend, enabling cross-signal analysis rather than stitched views.</td></tr><tr><td><strong>One Alerting Surface</strong></td><td>Alerts should trigger based on system behavior, not tool-specific thresholds, and remediation should not require jumping between platforms.</td></tr><tr><td><strong>Integrated Workflows</strong></td><td>Investigation workflows should flow seamlessly from application symptoms to database causes, without context switching.</td></tr></tbody></table>
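<p>The shared-identifiers requirement is the easiest of these to sketch: when every signal carries the same labels, joining them is a dictionary lookup rather than guesswork. A hedged illustration, with label names loosely modeled on OpenTelemetry resource attributes (the record shapes are hypothetical):</p>

```python
def index_by(signals, *keys):
    """Group telemetry records by a shared label tuple."""
    out = {}
    for s in signals:
        out.setdefault(tuple(s[k] for k in keys), []).append(s)
    return out

# A trace record and a PostgreSQL metric carrying the same labels:
traces = [{"service": "checkout", "host": "node-7", "kind": "trace", "p99_ms": 840}]
pg =     [{"service": "checkout", "host": "node-7", "kind": "pg_metric", "locks": 12}]

joined = index_by(traces + pg, "service", "host")
# One key, both signals: cross-signal analysis instead of stitched views.
assert len(joined[("checkout", "node-7")]) == 2
```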
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="depth-alone-is-not-enough">Depth Alone Is Not Enough<a href="https://blog.base14.io/introducing-pgx#depth-alone-is-not-enough" class="hash-link" aria-label="Direct link to Depth Alone Is Not Enough" title="Direct link to Depth Alone Is Not Enough" translate="no">​</a></h2>
<p>Many teams respond to observability gaps by adding more detailed database
monitoring. While depth is necessary, it introduces new challenges when
implemented in isolation:</p>
<ul>
<li class="">High-cardinality metrics become expensive and noisy</li>
<li class="">Engineers struggle to determine which signals matter</li>
<li class="">Data volume grows without improving understanding</li>
</ul>
<p>Depth without context increases cognitive load. Depth with context reduces it.
To truly reduce cognitive load, a tool needs to act as a guide. It should
enable engineers to understand the 'why' behind Postgres behaviors like
vacuuming issues or index bloat, providing the guardrails and insights needed
to master the database layer without a steep learning curve.</p>
<p>Rather than collecting every possible PostgreSQL metric and analyzing it in
isolation, the right approach is to observe the database as it participates in
the system.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="postgresql-observed-as-part-of-the-system">PostgreSQL Observed as Part of the System<a href="https://blog.base14.io/introducing-pgx#postgresql-observed-as-part-of-the-system" class="hash-link" aria-label="Direct link to PostgreSQL Observed as Part of the System" title="Direct link to PostgreSQL Observed as Part of the System" translate="no">​</a></h2>
<p>When PostgreSQL observability is unified with application and infrastructure
observability, several things change:</p>
<ul>
<li class="">Query latency is evaluated against request latency, not in isolation</li>
<li class="">Lock contention is correlated with deployment or traffic patterns</li>
<li class="">Resource pressure is interpreted in light of workload mix</li>
<li class="">Performance regressions are traced to code paths, not just queries</li>
</ul>
<p>Instead of asking "What is the database doing?" engineers start asking "Why is
the system behaving this way?"</p>
<p>That distinction marks a fundamental cultural shift.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-strategic-implication-for-engineering-leaders">The Strategic Implication for Engineering Leaders<a href="https://blog.base14.io/introducing-pgx#the-strategic-implication-for-engineering-leaders" class="hash-link" aria-label="Direct link to The Strategic Implication for Engineering Leaders" title="Direct link to The Strategic Implication for Engineering Leaders" translate="no">​</a></h2>
<p>For engineering leaders, this shift is not merely technical. It affects:</p>
<ul>
<li class="">Mean time to resolution</li>
<li class="">Reliability perception across teams</li>
<li class="">Cost efficiency of scaling decisions</li>
<li class="">Confidence in operational readiness</li>
</ul>
<p>Fragmented observability systems scale poorly, not just in cost but in
organizational trust.</p>
<p>Choosing to observe PostgreSQL alongside application and infrastructure signals
is a statement about how seriously an organization treats system understanding.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="introducing-pgx">Introducing pgX<a href="https://blog.base14.io/introducing-pgx#introducing-pgx" class="hash-link" aria-label="Direct link to Introducing pgX" title="Direct link to Introducing pgX" translate="no">​</a></h2>
<p>To address these challenges, we are excited to introduce <strong>pgX</strong>, Base14's
PostgreSQL observability integration designed to unify database monitoring with
application and infrastructure observability.</p>
<p>pgX captures PostgreSQL diagnostic and monitoring data at a depth no other
observability platform offers—and integrates it directly alongside your
application traces, logs, and infrastructure metrics. This allows engineers to
analyze database behavior in the context of application performance and
infrastructure health, enabling faster slow query troubleshooting and more
effective optimization. In our companion post, we detail the
<a class="" href="https://blog.base14.io/pgx-details">nine PostgreSQL observability domains</a> that pgX covers
comprehensively.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="getting-started">Getting Started<a href="https://blog.base14.io/introducing-pgx#getting-started" class="hash-link" aria-label="Direct link to Getting Started" title="Direct link to Getting Started" translate="no">​</a></h2>
<p>PostgreSQL remains the default database for a reason: it is robust, flexible,
and capable of supporting complex workloads. But as systems grow, the way
PostgreSQL is observed must evolve.</p>
<p>In-depth monitoring is table stakes. What differentiates effective teams is
whether that depth exists in context. With pgX, you get comprehensive Postgres
metrics flowing into the same data lake as your application and infrastructure
telemetry, designed for correlation, not just collection.</p>
<p>You can find the technical setup in our <a href="https://docs.base14.io/operate/pgx/overview" target="_blank" rel="noopener noreferrer" class="">pgX documentation</a>,
including the <a href="https://docs.base14.io/operate/pgx/quickstart" target="_blank" rel="noopener noreferrer" class="">quickstart guide</a> to get started. And if
you're navigating this exact problem—figuring out how to unify database
observability with the rest of your stack—we'd be interested to hear how you're
approaching it.</p>
<p><em>In our next post, we'll dive deeper into <a class="" href="https://blog.base14.io/pgx-details">what pgX collects</a>
and the visualizations it provides to help you understand your PostgreSQL
performance at a glance.</em></p>]]></content>
        <author>
            <name>base14 Team</name>
            <uri>https://base14.io/about</uri>
        </author>
        <category label="postgresql" term="postgresql"/>
        <category label="observability" term="observability"/>
        <category label="monitoring" term="monitoring"/>
        <category label="database" term="database"/>
        <category label="apm" term="apm"/>
        <category label="unified-observability" term="unified-observability"/>
        <category label="pgx" term="pgx"/>
        <category label="database-performance" term="database-performance"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Reducing Bus Factor in Observability Using AI]]></title>
        <id>https://blog.base14.io/reducing-bus-factor-in-observability</id>
        <link href="https://blog.base14.io/reducing-bus-factor-in-observability"/>
        <updated>2025-12-10T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Stop relying on tribal knowledge for incident diagnosis. Build a Living Knowledge Base with graph databases and LLMs to democratize root cause analysis across your team.]]></summary>
        <content type="html"><![CDATA[<div class="blog-cover"><img src="https://blog.base14.io/assets/images/cover-592ad2ea4c265d3cc9ac3089c165fec7.png" alt="Service map graph"></div>
<p>We’ve gotten pretty good at collecting observability data,
but we’re terrible at making sense of it. Most teams—especially
those running complex microservices—still rely on a handful of
senior engineers who just know how everything fits together.
They’re the rockstars who can look at alerts, mentally trace
the dependency graph, and figure out what's actually broken.</p>
<p>When they leave, that knowledge walks out the door with them.
That is the observability Bus Factor.</p>
<p>The problem isn't a lack of data; we have petabytes of it.
The problem is a lack of context. We need systems that can
actually explain what's happening, not just tell us that
something is wrong.</p>
<p>This post explores the concept of a "Living Knowledge Base", where context is
built from the telemetry the application is already emitting, not from
documentation sites or Confluence pages. Maintaining docs is a nightmare, and
we can never quite keep up. Why not build a system that does it for us?</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-current-situation-telemetry-overload-and-alert-fatigue">The Current Situation: Telemetry Overload and Alert Fatigue<a href="https://blog.base14.io/reducing-bus-factor-in-observability#the-current-situation-telemetry-overload-and-alert-fatigue" class="hash-link" aria-label="Direct link to The Current Situation: Telemetry Overload and Alert Fatigue" title="Direct link to The Current Situation: Telemetry Overload and Alert Fatigue" translate="no">​</a></h2>
<p>We live in an age of "complete observability." We send logs,
metrics, and traces to powerful platforms, giving us beautiful
dashboards, rich history, and deep APM insights. Yet, when an
incident hits, we often still feel blind.</p>
<p><strong>The Microservices Dilemma</strong>
In a microservices world, one problem can trigger ten seemingly
unrelated alerts:</p>
<ul>
<li class="">Service A throws a 500 error alert.</li>
<li class="">The downstream Kafka topic latency spikes (metric alert).</li>
<li class="">The Kubernetes Node running Service A reports high memory usage (infra alert).</li>
</ul>
<p>A junior engineer sees the 500 alert and stares at Service A's code.
A senior engineer glances at the high memory usage on the node, remembers
Service B was deployed an hour ago, and knows that Service A holds data in
memory for retries when Service B is slow. The entire diagnosis takes 15
minutes, mostly because it takes 14 minutes to track down the engineer with
the tribal knowledge, and just 1 minute for them to pinpoint the actual issue.</p>
<blockquote>
<p>This is the Human-in-the-Loop Dependency</p>
</blockquote>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="making-it-better-the-living-knowledge-base-lkb">Making it Better: The Living Knowledge Base (LKB)<a href="https://blog.base14.io/reducing-bus-factor-in-observability#making-it-better-the-living-knowledge-base-lkb" class="hash-link" aria-label="Direct link to Making it Better: The Living Knowledge Base (LKB)" title="Direct link to Making it Better: The Living Knowledge Base (LKB)" translate="no">​</a></h2>
<p>The solution is to codify system knowledge using the system's own data.
A real knowledge base isn’t just a dependency diagram—it’s the combination
of relationships and the metadata around them. Instead of relying on static
configs or runbooks that go stale, we let telemetry update those relationships
continuously.</p>
<p>We call this a <strong>Living Knowledge Base (LKB)</strong>.</p>
<p><strong>Building the LKB with a Graph Database</strong>
The foundation of the LKB is a Graph Database (like Neo4j, Memgraph, or others).
A graph database excels at storing relationships between data points, which is
exactly what a distributed system is.</p>
<p>Instead of just sending telemetry to the standard observability backend,
we also route a stream of high-volume telemetry (spans, metrics, pod metadata)
to a processing agent.</p>
<p>This agent builds the graph in real-time:</p>
<table><thead><tr><th style="text-align:left">Node (Entity)</th><th style="text-align:left">Edge (Relationship)</th></tr></thead><tbody><tr><td style="text-align:left">Service A</td><td style="text-align:left">CALLS</td></tr><tr><td style="text-align:left">Service A Pod 1</td><td style="text-align:left">RUNS_ON</td></tr><tr><td style="text-align:left">K8s Node X</td><td style="text-align:left">REPORTS</td></tr><tr><td style="text-align:left">Service B</td><td style="text-align:left">DEPENDS_ON</td></tr></tbody></table>
<blockquote>
<p>As the application scales, deploys, and changes its dependencies,
the graph <strong>adapts</strong> automatically</p>
</blockquote>
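<p>The agent's job can be sketched with a plain adjacency structure; a graph database adds persistence, indexing, and a query language on top, but the model is the same. Edge names follow the table above, while the span and pod record shapes are hypothetical:</p>

```python
from collections import defaultdict

class LivingGraph:
    """Tiny in-memory stand-in for the LKB's graph store."""
    def __init__(self):
        self.edges = defaultdict(set)   # (source, relation) -> {targets}

    def observe_span(self, span):
        # A client span from one service to another implies CALLS.
        self.edges[(span["service"], "CALLS")].add(span["peer_service"])

    def observe_pod(self, pod):
        # Pod metadata implies the pod RUNS_ON a node.
        self.edges[(pod["name"], "RUNS_ON")].add(pod["node"])

g = LivingGraph()
g.observe_span({"service": "service-a", "peer_service": "service-b"})
g.observe_pod({"name": "service-a-pod-1", "node": "k8s-node-x"})
# The graph updates as new telemetry arrives -- no manual diagram upkeep.
```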
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="adding-an-intelligent-layer-over-knowledge-base">Adding an Intelligent layer over Knowledge base<a href="https://blog.base14.io/reducing-bus-factor-in-observability#adding-an-intelligent-layer-over-knowledge-base" class="hash-link" aria-label="Direct link to Adding an Intelligent layer over Knowledge base" title="Direct link to Adding an Intelligent layer over Knowledge base" translate="no">​</a></h2>
<p>The Knowledge base gives us the dynamic map; the LLM gives us
the intelligence to interpret it.</p>
<p>We put an LLM in front of the LKB, making the graph accessible
via a controlled interface (sometimes called a Model Context Protocol).
This creates an Observability Agent.</p>
<p><strong>From "Alert Fatigue" to "Ask the Expert"</strong>
When the triple alert hits (500, Kafka spike, Node memory),
we don't have to manually click through dashboards. We simply
prompt the Observability Agent:</p>
<p>Prompt: "Why did payment service latency spike?"</p>
<p>The agent does not guess; it walks the graph:</p>
<ul>
<li class="">Find Node: Find the Service A 500 Error node.</li>
<li class="">Walk Upstream: Follow the CAUSED_BY edge (derived from trace data)
to find the dependency on Service B.</li>
<li class="">Correlate: Find the Service B node. Walk the RUNS_ON edge to the
K8s Node Y node.</li>
<li class="">Contextualize: Query the time-series data related to K8s Node Y and
discover a memory leak or a recent deployment event.</li>
</ul>
<p>Synthesize: The LLM translates the complex graph traversal into a simple,
natural language root cause: “Payment service latency spiked because
Service B, which runs on Node Y, suffered a memory leak after a recent
deployment, causing high memory pressure. Service A's resulting connection
timeouts triggered its internal retry loop, leading to high CPU and the 500
errors.”</p>
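<p>The traversal itself is an ordinary upstream walk over the graph; the LLM's contribution is translating the resulting path into prose. A hedged sketch of the walk (the edge data is hypothetical, mirroring the incident above):</p>

```python
def walk_upstream(edges, start, max_hops=5):
    """Follow cause edges from a symptom node toward a root cause,
    recording the path the LLM will later narrate."""
    path, node = [start], start
    for _ in range(max_hops):
        nxt = edges.get(node)
        if nxt is None:
            break
        path.append(nxt)
        node = nxt
    return path

edges = {
    "service-a-500s": "service-b-timeouts",   # CAUSED_BY, from trace data
    "service-b-timeouts": "node-y-memory",    # RUNS_ON + metric correlation
    "node-y-memory": "service-b-deploy",      # recent deployment event
}
print(walk_upstream(edges, "service-a-500s"))
# ['service-a-500s', 'service-b-timeouts', 'node-y-memory', 'service-b-deploy']
```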
<p>The result is a nearly instant, accurate root cause analysis that democratizes
the knowledge of your most senior engineers. It cuts a 30-minute debugging
session down to 30 seconds.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="beyond-observability-real-time-insights">Beyond Observability: Real-Time Insights<a href="https://blog.base14.io/reducing-bus-factor-in-observability#beyond-observability-real-time-insights" class="hash-link" aria-label="Direct link to Beyond Observability: Real-Time Insights" title="Direct link to Beyond Observability: Real-Time Insights" translate="no">​</a></h2>
<p>This Living Knowledge Base has applications far beyond just incident response.</p>
<ol>
<li class="">
<p>Preventative Insight: The LKB can be continuously queried by an algorithm
or Agent to find odd patterns—not just broken things. For instance, it might
discover a service that has always called four other services, but for the
last three days, it has only been calling three. This is a drift in behavior
that can be flagged as a high-risk anomaly, allowing you to fix a bug before
it impacts users.</p>
</li>
<li class="">
<p>Automated Runbook Generation: Since the LKB understands the system's current
state, the LLM can generate live, current runbooks for a specific incident—not
generic, outdated documents. It knows the exact steps to restart the specific
dependency that's currently failing.</p>
</li>
</ol>
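<p>The "four callees became three" drift check in point 1 reduces to a set difference over the graph's edges. A hedged sketch (the service names are hypothetical):</p>

```python
def dependency_drift(baseline_callees, current_callees):
    """Flag callees a service has silently stopped (or started) calling --
    behavioral drift worth surfacing before users notice."""
    return {
        "dropped": sorted(baseline_callees - current_callees),
        "new": sorted(current_callees - baseline_callees),
    }

baseline = {"auth", "billing", "inventory", "email"}
current = {"auth", "billing", "inventory"}
print(dependency_drift(baseline, current))
# {'dropped': ['email'], 'new': []} -- a high-risk anomaly to flag
```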
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="conclusion">Conclusion<a href="https://blog.base14.io/reducing-bus-factor-in-observability#conclusion" class="hash-link" aria-label="Direct link to Conclusion" title="Direct link to Conclusion" translate="no">​</a></h2>
<p>By using the structure of a Graph Database to give our telemetry data context
and an LLM to give it intelligence, we finally move beyond simply collecting data.
We create a system that understands itself, drastically reducing the Bus Factor
and making every engineer capable of instant, deep root cause analysis.</p>]]></content>
        <author>
            <name>Nimisha G J</name>
            <uri>https://www.linkedin.com/in/nimishgj/</uri>
        </author>
        <category label="observability" term="observability"/>
        <category label="engineering" term="engineering"/>
        <category label="best-practices" term="best-practices"/>
        <category label="ai" term="ai"/>
        <category label="llm" term="llm"/>
        <category label="knowledge-graph" term="knowledge-graph"/>
        <category label="incident-response" term="incident-response"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[The Cloud-Native Foundation Layer: A Portable, Vendor-Neutral Base for Modern Systems]]></title>
        <id>https://blog.base14.io/cloud-native-foundation-layer</id>
        <link href="https://blog.base14.io/cloud-native-foundation-layer"/>
        <updated>2025-11-25T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Avoid cloud lock-in with a portable foundation layer. Use composable infrastructure, open protocols, and unified observability to stay free across AWS, GCP, and beyond.]]></summary>
        <content type="html"><![CDATA[<div class="blog-cover"><img src="https://blog.base14.io/assets/images/cover-074083bf157c95f6da9cf89a67affae5.png" alt="Cloud-Native Foundation Layer"></div>
<p>Cloud-native began with containers and Kubernetes. Since then, it has become a
set of open standards and protocols that let systems run anywhere with minimal
friction.</p>
<p>Today's engineering landscape spans public clouds, private clouds, on-prem
clusters, and edge environments - far beyond the old single-cloud model. Teams
work this way because it's the only practical response to cost, regulation,
latency, hardware availability, and outages.</p>
<p>If you expect change, you need an architecture that can handle it.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="deploying-on-one-cloud-isnt-lock-in-designing-for-one-cloud-is">Deploying on One Cloud Isn't Lock-In. Designing for One Cloud Is<a href="https://blog.base14.io/cloud-native-foundation-layer#deploying-on-one-cloud-isnt-lock-in-designing-for-one-cloud-is" class="hash-link" aria-label="Direct link to Deploying on One Cloud Isn't Lock-In. Designing for One Cloud Is" title="Direct link to Deploying on One Cloud Isn't Lock-In. Designing for One Cloud Is" translate="no">​</a></h2>
<p>Two recent outages show how risky this is:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="cloudflare--18-nov-2025">Cloudflare — 18 Nov 2025<a href="https://blog.base14.io/cloud-native-foundation-layer#cloudflare--18-nov-2025" class="hash-link" aria-label="Direct link to Cloudflare — 18 Nov 2025" title="Direct link to Cloudflare — 18 Nov 2025" translate="no">​</a></h3>
<p>A routing bug took down large parts of the internet for hours. Many companies
broke even if they weren't Cloudflare customers. Their DNS, CDN, or WAF traffic
still flowed through Cloudflare somewhere.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="aws-us-east-1--20-oct-2025">AWS us-east-1 — 20 Oct 2025<a href="https://blog.base14.io/cloud-native-foundation-layer#aws-us-east-1--20-oct-2025" class="hash-link" aria-label="Direct link to AWS us-east-1 — 20 Oct 2025" title="Direct link to AWS us-east-1 — 20 Oct 2025" translate="no">​</a></h3>
<p>Cascading control-plane failures halted services across the industry. Anyone
tied to us-east-1 had no alternatives.</p>
<p>These failures weren't unusual. They were predictable outcomes of stacking
critical workloads in one place.</p>
<p><strong>If your whole system sits on one provider, their failures become your
failures.</strong></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="cloud-costs-make-lock-in-expensive">Cloud Costs Make Lock-In Expensive<a href="https://blog.base14.io/cloud-native-foundation-layer#cloud-costs-make-lock-in-expensive" class="hash-link" aria-label="Direct link to Cloud Costs Make Lock-In Expensive" title="Direct link to Cloud Costs Make Lock-In Expensive" translate="no">​</a></h2>
<p>DHH's <em>"We Have Left the Cloud"</em> is a clear example. Basecamp/HEY left AWS after
realizing the cost no longer made sense. Doing so saved them millions.</p>
<p>Their situation was unusual, but the point is general:</p>
<p><strong>You cannot control cost if you cannot move.</strong></p>
<p>If all your workloads sit on one cloud, you lose the ability to:</p>
<ul>
<li class="">Shift workloads to cheaper regions</li>
<li class="">Compare GPU pricing across clouds</li>
<li class="">Escape sudden egress spikes</li>
<li class="">Negotiate pricing at all</li>
</ul>
<p>The problem isn't being on one cloud. It's <strong>losing the option to leave</strong>. With
portable designs, you can sidestep outages like Cloudflare's or AWS's by running
elsewhere, and you regain leverage on price. Freedom comes from reversibility.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="most-lock-in-doesnt-come-from-vendors-it-comes-from-your-code">Most Lock-In Doesn't Come From Vendors. It Comes From Your Code<a href="https://blog.base14.io/cloud-native-foundation-layer#most-lock-in-doesnt-come-from-vendors-it-comes-from-your-code" class="hash-link" aria-label="Direct link to Most Lock-In Doesn't Come From Vendors. It Comes From Your Code" title="Direct link to Most Lock-In Doesn't Come From Vendors. It Comes From Your Code" translate="no">​</a></h2>
<p>The trap usually starts small:</p>
<ul>
<li class="">An SDK call deep in your business logic</li>
<li class="">A dependency on a proprietary database</li>
<li class="">A CI pipeline that only works in one cloud</li>
<li class="">An IAM model you can't reproduce anywhere else</li>
<li class="">A networking or eventing pattern that has no equivalent outside your vendor</li>
</ul>
<p>None of these feel like lock-in at the time. They become lock-in when you try to
change something and can't.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-the-foundation-layer-really-is">What the Foundation Layer Really Is<a href="https://blog.base14.io/cloud-native-foundation-layer#what-the-foundation-layer-really-is" class="hash-link" aria-label="Direct link to What the Foundation Layer Really Is" title="Direct link to What the Foundation Layer Really Is" translate="no">​</a></h2>
<p>A <em>Cloud-Native Foundation Layer</em> isn't extra architecture. It's the minimum
structure you need to stay free:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-composable-infrastructure">1. Composable Infrastructure<a href="https://blog.base14.io/cloud-native-foundation-layer#1-composable-infrastructure" class="hash-link" aria-label="Direct link to 1. Composable Infrastructure" title="Direct link to 1. Composable Infrastructure" translate="no">​</a></h3>
<p>Use components that behave the same everywhere: containers, GitOps, Terraform.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-open-interfaces-and-protocols">2. Open Interfaces and Protocols<a href="https://blog.base14.io/cloud-native-foundation-layer#2-open-interfaces-and-protocols" class="hash-link" aria-label="Direct link to 2. Open Interfaces and Protocols" title="Direct link to 2. Open Interfaces and Protocols" translate="no">​</a></h3>
<p>Choose interfaces that don't care where they run: HTTP/JSON, gRPC, SQL, OTel,
S3-compatible storage.</p>
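<p>One practical way to hold this line in application code is to program against a small interface and keep vendor SDK calls behind it. A minimal Python sketch, assuming nothing from any provider (the BlobStore protocol, InMemoryStore, and archive_invoice names are illustrative; an S3-compatible adapter would implement the same two methods):</p>

```python
from typing import Protocol

class BlobStore(Protocol):
    """Provider-agnostic storage seam: business logic sees only this."""
    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...

class InMemoryStore:
    """Test double; an S3-, GCS-, or MinIO-backed adapter swaps in
    without touching any caller."""
    def __init__(self):
        self._blobs: dict[str, bytes] = {}
    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data
    def get(self, key: str) -> bytes:
        return self._blobs[key]

def archive_invoice(store: BlobStore, invoice_id: str, pdf: bytes) -> None:
    # Domain logic depends on the protocol, never on a cloud SDK.
    store.put(f"invoices/{invoice_id}.pdf", pdf)

store = InMemoryStore()
archive_invoice(store, "42", b"%PDF-...")
```

The point is reversibility: swapping providers means writing one new adapter, not rewriting business logic.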
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-unified-observability">3. Unified Observability<a href="https://blog.base14.io/cloud-native-foundation-layer#3-unified-observability" class="hash-link" aria-label="Direct link to 3. Unified Observability" title="Direct link to 3. Unified Observability" translate="no">​</a></h3>
<p>Instrument with OpenTelemetry so your telemetry can go to any backend without
changes.</p>
<p>If you do these three things, you get:</p>
<ul>
<li class="">Portability</li>
<li class="">Better uptime</li>
<li class="">Lower cost volatility</li>
<li class="">Easier compliance</li>
<li class="">Freedom to adopt new technology</li>
</ul>
<p>None of this is abstraction for its own sake. It's the cheapest way to avoid
expensive mistakes later.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="a-foundation-layer-the-ability-to-change-your-mind">A Foundation Layer: The Ability to Change Your Mind<a href="https://blog.base14.io/cloud-native-foundation-layer#a-foundation-layer-the-ability-to-change-your-mind" class="hash-link" aria-label="Direct link to A Foundation Layer: The Ability to Change Your Mind" title="Direct link to A Foundation Layer: The Ability to Change Your Mind" translate="no">​</a></h2>
<p>Outages will happen. Pricing will change. AI hardware will appear in one cloud
before another. Data residency rules will tighten.</p>
<p>A foundation layer gives you space to respond. Without it, every change is
painful.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="whats-next">What's Next<a href="https://blog.base14.io/cloud-native-foundation-layer#whats-next" class="hash-link" aria-label="Direct link to What's Next" title="Direct link to What's Next" translate="no">​</a></h2>
<p>In <strong>Post 2</strong>, we'll cover how to structure your code so your domain logic
doesn't depend on any one cloud — the core of true portability.</p>
<p>Meanwhile, you can read <a href="https://www.linkedin.com/pulse/my-learnings-from-cloudflare-nov-18-incident-ranjan-sakalley-bxwbc" target="_blank" rel="noopener noreferrer" class="">our learnings</a>
from the recent Cloudflare outage.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="references">References<a href="https://blog.base14.io/cloud-native-foundation-layer#references" class="hash-link" aria-label="Direct link to References" title="Direct link to References" translate="no">​</a></h2>
<ul>
<li class="">Cloudflare Outage (18 Nov 2025):
<a href="https://blog.cloudflare.com/18-november-2025-outage/" target="_blank" rel="noopener noreferrer" class="">https://blog.cloudflare.com/18-november-2025-outage/</a></li>
<li class="">Learnings from Cloudflare Outage:
<a href="https://www.linkedin.com/pulse/my-learnings-from-cloudflare-nov-18-incident-ranjan-sakalley-bxwbc" target="_blank" rel="noopener noreferrer" class="">https://www.linkedin.com/pulse/my-learnings-from-cloudflare-nov-18-incident-ranjan-sakalley-bxwbc</a></li>
<li class="">AWS us-east-1 Outage (20 Oct 2025):
<a href="https://www.thousandeyes.com/blog/aws-outage-analysis-october-20-2025" target="_blank" rel="noopener noreferrer" class="">https://www.thousandeyes.com/blog/aws-outage-analysis-october-20-2025</a></li>
<li class="">DHH — <em>We Have Left the Cloud</em>:
<a href="https://world.hey.com/dhh/we-have-left-the-cloud-251760fb" target="_blank" rel="noopener noreferrer" class="">https://world.hey.com/dhh/we-have-left-the-cloud-251760fb</a></li>
</ul>]]></content>
        <author>
            <name>Irfan Shah</name>
            <uri>https://base14.io/about</uri>
        </author>
        <category label="cloud-native" term="cloud-native"/>
        <category label="portability" term="portability"/>
        <category label="vendor-neutral" term="vendor-neutral"/>
        <category label="architecture" term="architecture"/>
        <category label="multi-cloud" term="multi-cloud"/>
        <category label="kubernetes" term="kubernetes"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Making Certificate Expiry Boring]]></title>
        <id>https://blog.base14.io/make-certificate-expiry-boring</id>
        <link href="https://blog.base14.io/make-certificate-expiry-boring"/>
        <updated>2025-11-20T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Certificate expiry outages are preventable. Learn how to detect, automate, and rotate TLS certificates across Kubernetes, VMs, and cloud environments without downtime.]]></summary>
        <content type="html"><![CDATA[<div class="blog-cover"><img src="https://blog.base14.io/assets/images/cover-c8fabdf6910ec5abb0d9fda35a801ea1.png" alt="Certificate expiry issues are entirely preventable"></div>
<p>On 18 November 2025, GitHub had an hour-long outage that affected the
heart of their product: Git operations. The post-incident
<a href="https://www.githubstatus.com/incidents/5q7nmlxz30sk" target="_blank" rel="noopener noreferrer" class="">summary</a> was brief
and honest - the outage was triggered by an internal TLS certificate that
had quietly expired, blocking service-to-service communication inside
their platform. It's the kind of issue every engineering team knows <em>can</em>
happen, yet it still slips through because certificates live in odd
corners of a system, often far from where we normally look.</p>
<p>What struck me about this incident wasn't that GitHub "missed something."
If anything, it reminded me how easy it is, even for well-run, highly
mature engineering orgs, to overlook certificate expiry in their
observability and alerting posture. We monitor CPU, memory, latency,
error rates, queue depth, request volume - but a certificate that's about
to expire rarely shows up as a first-class signal. It doesn't scream. It
doesn't gradually degrade. It just keeps working… until it doesn't.</p>
<p>And that's why these failures feel unfair. They're fully preventable, but
only if you treat certificates as operational assets, not just security
artefacts. This article is about building that mindset: how to surface
certificate expiry as a real reliability concern, how to detect issues
early, and how to ensure a single date on a single file never brings down
an entire system.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-certificate-expiry-outages-happen">Why certificate-expiry outages happen<a href="https://blog.base14.io/make-certificate-expiry-boring#why-certificate-expiry-outages-happen" class="hash-link" aria-label="Direct link to Why certificate-expiry outages happen" title="Direct link to Why certificate-expiry outages happen" translate="no">​</a></h2>
<p>Most outages have a shape: a graph that starts bending the wrong way, an
error budget that begins to evaporate, a queue that grows faster than it
drains. Teams get early signals. They get a chance to react.</p>
<p>Certificate expiry is different. It behaves more like a trapdoor.
Everything works perfectly… until the moment it doesn't.</p>
<p>And because certificates sit at the intersection of security and
infrastructure, ownership is often ambiguous. One team issues them,
another deploys them, a third operates the service that depends on them.
Over time, as systems evolve, certificates accumulate in places no one
remembers - a legacy load balancer here, a forgotten internal endpoint
there, an old mutual-TLS handshake powering a background job that hasn't
been touched in years. Each one quietly counts down to a date that may
not exist anywhere in your dashboards.</p>
<p>It's not that engineering teams are careless. It's that distributed
systems create <em>distributed responsibilities</em>. And unless expiry is
treated as an operational metric - something you can alert on, page on,
and practice recovering from - it becomes a blind spot.</p>
<p>The GitHub incident is just a recent reminder of a pattern most of us
have seen in some form: the system isn't failing, but our visibility into
its prerequisites is.</p>
<p>That's what we'll fix next.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="where-certificates-actually-live-in-a-modern-system">Where certificates actually live in a modern system<a href="https://blog.base14.io/make-certificate-expiry-boring#where-certificates-actually-live-in-a-modern-system" class="hash-link" aria-label="Direct link to Where certificates actually live in a modern system" title="Direct link to Where certificates actually live in a modern system" translate="no">​</a></h2>
<p>Before we talk about detection and automation, it helps to map the
terrain. Certificates don't sit in one place; they're spread across a
system the same way responsibilities do. And when teams are busy shipping
features, it's easy to forget how many places depend on a valid chain of
trust.</p>
<p>A few common patterns:</p>
<p><strong>1. Public entry points</strong>
These are the obvious ones - the certificates on your API gateway, load
balancers, reverse proxies, or CDN. They're usually tracked because
they're customer-facing. But even here, expiry can slip through if
ownership rotates or if the renewal mechanism silently fails.</p>
<p><strong>2. Internal service-to-service communication</strong>
Modern systems often use mTLS internally. That means each service,
sidecar, or pod may hold its own certificate, usually short-lived and
automatically rotated. The catch: these automation pipelines need
monitoring too. When they fail, the failure is often invisible until the
cert expires.</p>
<p><strong>3. Databases, message brokers, and internal control planes</strong>
Many teams enable TLS for PostgreSQL, MongoDB, Kafka, Redis, or internal
admin endpoints - and then forget about those certs entirely. These can
be some of the hardest outages to debug because the components are not
exposed externally and failures manifest as connection resets or
handshake errors deep inside a dependency chain.</p>
<p><strong>4. Cloud-managed infrastructure</strong>
AWS ALBs, GCP Certificate Manager, Azure Key Vault, CloudFront, IoT
gateways - each keeps its own certificate store. These systems usually
help with automation, but they don't always alert when renewal fails, and
they certainly don't alert when your <em>usage</em> patterns change.</p>
<p><strong>5. Legacy or security-adjacent components</strong>
Some of the most outage-prone certificates sit in places we rarely
revisit:</p>
<ul>
<li class="">VPN servers</li>
<li class="">old NGINX or HAProxy nodes</li>
<li class="">staging environments</li>
<li class="">batch jobs calling external APIs</li>
<li class="">IoT devices or firmware-level certs</li>
<li class="">integrations with third-party partners</li>
</ul>
<p>If even one of these expires, the blast radius can be surprisingly wide.</p>
<p>What all of this shows is that certificate expiry isn't a single problem -
it's an inventory problem. You can't secure or monitor what you don't know
exists. And you can't rely on tribal memory to keep track of everything.</p>
<p>The next step, naturally, is stitching visibility back into the system:
turning this scattered landscape into something observable, alertable,
and resilient.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="detecting-certificate-expiry-across-different-environments">Detecting certificate expiry across different environments<a href="https://blog.base14.io/make-certificate-expiry-boring#detecting-certificate-expiry-across-different-environments" class="hash-link" aria-label="Direct link to Detecting certificate expiry across different environments" title="Direct link to Detecting certificate expiry across different environments" translate="no">​</a></h2>
<p>Once you understand where certificates tend to hide, the next question
becomes: <em>how do we surface their expiry in a way that fits naturally
into our observability stack?</em>
The good news is that we don't need anything exotic. We just need a
reliable way to extract expiry information and feed it into whatever
monitoring and alerting system we already trust.</p>
<p>The exact approach varies by environment, but the principle stays the
same: <strong>expiry should show up as a first-class metric</strong> - just like
latency, errors, or disk space.</p>
<p>Let’s break this down across the most common setups.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-kubernetes-with-cert-manager"><strong>1. Kubernetes (with cert-manager)</strong><a href="https://blog.base14.io/make-certificate-expiry-boring#1-kubernetes-with-cert-manager" class="hash-link" aria-label="Direct link to 1-kubernetes-with-cert-manager" title="Direct link to 1-kubernetes-with-cert-manager" translate="no">​</a></h3>
<p>If you're using cert-manager, you already have expiry information
available - it's just a matter of surfacing it.</p>
<p>Cert-manager stores certificate metadata in the Kubernetes API, including
<code>status.notAfter</code>. Expose that through:</p>
<ul>
<li class="">cert-manager’s built-in metrics</li>
<li class="">a Kubernetes metadata exporter</li>
<li class="">or a lightweight custom controller if you prefer tighter control</li>
</ul>
<p>Once the metric is flowing into your observability stack, you can build
straightforward alerts:</p>
<ul>
<li class="">30 days → warning</li>
<li class="">14 days → urgent</li>
<li class="">7 days → critical</li>
</ul>
<p>This handles most cluster-level certificates, especially ingress TLS and
ACME-issued certs.</p>
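<p>As a minimal sketch of surfacing that data, here is stdlib Python that pulls
expiry timestamps out of a Prometheus-format metrics scrape. The metric name
and labels are assumptions for illustration - check what your cert-manager
version actually exports:</p>

```python
import re
import time

# Hypothetical sample of a cert-manager /metrics scrape; the metric name
# and label set here are assumptions for illustration.
SAMPLE_SCRAPE = """\
certmanager_certificate_expiration_timestamp_seconds{name="ingress-tls",namespace="prod"} 1.7721e+09
certmanager_certificate_expiration_timestamp_seconds{name="api-tls",namespace="prod"} 1.7400e+09
"""

METRIC = re.compile(
    r'certmanager_certificate_expiration_timestamp_seconds'
    r'\{name="(?P<name>[^"]+)",namespace="(?P<ns>[^"]+)"\} (?P<ts>\S+)'
)

def days_remaining(scrape_text, now=None):
    """Parse expiry timestamps from Prometheus text format and
    return {(namespace, name): days_until_expiry}."""
    now = time.time() if now is None else now
    out = {}
    for m in METRIC.finditer(scrape_text):
        expiry = float(m.group("ts"))
        out[(m.group("ns"), m.group("name"))] = (expiry - now) / 86400
    return out
```

<p>In practice you would point this (or your collector's equivalent) at the
cert-manager metrics endpoint rather than a hard-coded sample.</p>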
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-kubernetes-without-cert-manager"><strong>2. Kubernetes (without cert-manager)</strong><a href="https://blog.base14.io/make-certificate-expiry-boring#2-kubernetes-without-cert-manager" class="hash-link" aria-label="Direct link to 2-kubernetes-without-cert-manager" title="Direct link to 2-kubernetes-without-cert-manager" translate="no">​</a></h3>
<p>Many clusters use:</p>
<ul>
<li class="">TLS secrets created manually</li>
<li class="">certificates provisioned by CI/CD</li>
<li class="">certificates embedded inside service mesh CA infrastructure</li>
<li class="">or certificates uploaded to cloud load balancers</li>
</ul>
<p>In these cases, you can extract expiry from:</p>
<ul>
<li class="">the <code>tls.crt</code> in Kubernetes Secrets</li>
<li class="">mesh control plane metrics (e.g., Istio’s CA exposes rotation details)</li>
<li class="">endpoint probes from blackbox exporters</li>
<li class="">cloud provider API calls that list certificate metadata</li>
</ul>
<p>The pattern stays the same: gather expiry → convert to a metric → alert early.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-virtual-machines-bare-metal-or-traditional-workloads"><strong>3. Virtual machines, bare metal, or traditional workloads</strong><a href="https://blog.base14.io/make-certificate-expiry-boring#3-virtual-machines-bare-metal-or-traditional-workloads" class="hash-link" aria-label="Direct link to 3-virtual-machines-bare-metal-or-traditional-workloads" title="Direct link to 3-virtual-machines-bare-metal-or-traditional-workloads" translate="no">​</a></h3>
<p>This is where certificate expiry issues happen the most, often because
the monitoring setup predates the current system complexity.</p>
<p>Your options here are simple and effective:</p>
<ul>
<li class="">Run a small cron job that calls <code>openssl</code> against known endpoints</li>
<li class="">Parse certificates from local files or keystores</li>
<li class="">Use a Prometheus blackbox exporter to probe TLS endpoints</li>
<li class="">Query cloud APIs for LB or certificate manager expiry</li>
<li class="">Forward results as metrics or events into your observability system</li>
</ul>
<p>Nearly every major outage caused by certificate expiry outside Kubernetes
happens in these environments - mostly because there's no single place
where certificates live, so no single tool naturally monitors them. A
tiny script with a 30-second probe loop can save hours of downtime.</p>
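<p>Such a probe needs nothing beyond the standard library. A minimal sketch,
assuming you supply your own host list and alert thresholds - the expiry date
comes straight out of a TLS handshake:</p>

```python
import ssl
import socket
import time

def days_until(not_after, now=None):
    """Convert an OpenSSL-style notAfter string
    (e.g. 'Jun  1 12:00:00 2030 GMT') into days remaining."""
    expiry = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expiry - now) / 86400

def probe(host, port=443, timeout=5):
    """Fetch the leaf certificate from a live endpoint and report days left.
    Network call - run it against endpoints you actually own."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return days_until(cert["notAfter"])
```

<p>Drop <code>probe()</code> into a cron job, emit the result as a metric or
event, and the "odd corners" become just another time series.</p>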
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="4-cloud-managed-ecosystems"><strong>4. Cloud-managed ecosystems</strong><a href="https://blog.base14.io/make-certificate-expiry-boring#4-cloud-managed-ecosystems" class="hash-link" aria-label="Direct link to 4-cloud-managed-ecosystems" title="Direct link to 4-cloud-managed-ecosystems" translate="no">​</a></h3>
<p>AWS, GCP, and Azure all provide mature certificate stores:</p>
<ul>
<li class=""><strong>AWS ACM</strong>, <strong>CloudFront</strong>, <strong>API Gateway</strong></li>
<li class=""><strong>GCP Certificate Manager</strong>, <strong>Load Balancing</strong></li>
<li class=""><strong>Azure Key Vault</strong>, <strong>App Gateway</strong></li>
</ul>
<p>They usually renew automatically, but renewals can fail silently for reasons like:</p>
<ul>
<li class="">unnecessary domain validation retries</li>
<li class="">DNS misconfigurations</li>
<li class="">permissions regressions</li>
<li class="">quota limits</li>
<li class="">or expired upstream intermediates</li>
</ul>
<p>The fix: poll these APIs on a schedule and compare expiry timestamps
with your policy thresholds. Treat those just like metrics from a node or
pod.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="5-the-hard-to-see-corners"><strong>5. The hard-to-see corners</strong><a href="https://blog.base14.io/make-certificate-expiry-boring#5-the-hard-to-see-corners" class="hash-link" aria-label="Direct link to 5-the-hard-to-see-corners" title="Direct link to 5-the-hard-to-see-corners" translate="no">​</a></h3>
<p>No matter how modern your architecture is, you’ll find certificates in:</p>
<ul>
<li class="">internal admin endpoints</li>
<li class="">Kafka, RabbitMQ, or PostgreSQL TLS configs</li>
<li class="">legacy VPN boxes</li>
<li class="">IoT gateways</li>
<li class="">partner API integrations</li>
<li class="">staging environments that don’t receive the same scrutiny</li>
</ul>
<p>These deserve monitoring too, and the process is no different: probe, parse, publish.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="focus-on-expiry-as-a-metric">Focus on expiry as a metric<a href="https://blog.base14.io/make-certificate-expiry-boring#focus-on-expiry-as-a-metric" class="hash-link" aria-label="Direct link to Focus on expiry as a metric" title="Direct link to Focus on expiry as a metric" translate="no">​</a></h3>
<p>When certificate expiry becomes just another number that your dashboards
understand - a timestamp that can be plotted, queried, alerted on - the
problem changes shape. It stops being a last-minute surprise and becomes
part of your normal operational rhythm.</p>
<p>The next question, then, is how to automate renewals and rotations so
that even when alerts happen, they're nothing more than a nudge.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="automating-certificate-renewal-and-rotation">Automating certificate renewal and rotation<a href="https://blog.base14.io/make-certificate-expiry-boring#automating-certificate-renewal-and-rotation" class="hash-link" aria-label="Direct link to Automating certificate renewal and rotation" title="Direct link to Automating certificate renewal and rotation" translate="no">​</a></h2>
<p>Detecting certificates before they expire is necessary, but it's not the
end goal. The real win is when expiry becomes uninteresting - when
certificates rotate quietly in the background, without paging anyone, and
without becoming a stress point before every major release.</p>
<p>Most organisations get stuck on renewals for one of two reasons:</p>
<ol>
<li class="">They assume automation is risky.</li>
<li class="">Their infrastructure is too fragmented for a single renewal flow.</li>
</ol>
<p>But automation doesn't have to be fragile. It just has to be explicit.</p>
<p>Here are the most reliable patterns that work across environments.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-acme-based-automation-lets-encrypt-and-internal-acme-servers"><strong>1. ACME-based automation (Let’s Encrypt and internal ACME servers)</strong><a href="https://blog.base14.io/make-certificate-expiry-boring#1-acme-based-automation-lets-encrypt-and-internal-acme-servers" class="hash-link" aria-label="Direct link to 1-acme-based-automation-lets-encrypt-and-internal-acme-servers" title="Direct link to 1-acme-based-automation-lets-encrypt-and-internal-acme-servers" translate="no">​</a></h3>
<p>If your certificates can be issued via ACME, life becomes dramatically
simpler. ACME clients - whether cert-manager inside Kubernetes or
acme.sh / lego on a traditional VM - handle the full cycle:</p>
<ul>
<li class="">request</li>
<li class="">validation</li>
<li class="">issuance</li>
<li class="">renewal</li>
<li class="">rotation</li>
</ul>
<p>And because ACME certificates are intentionally short-lived, your system
gets frequent practice, making renewal failures visible long before a
real expiry.</p>
<p>For internal systems, tools like <strong>Smallstep</strong>, <strong>HashiCorp Vault</strong> (ACME
mode), or <strong>Pebble</strong> can act as internal ACME CAs, giving you automatic
rotation without public DNS hoops.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-renewal-via-internal-ca-vault-pki-venafi-active-directory-ca"><strong>2. Renewal via internal CA (Vault PKI, Venafi, Active Directory CA)</strong><a href="https://blog.base14.io/make-certificate-expiry-boring#2-renewal-via-internal-ca-vault-pki-venafi-active-directory-ca" class="hash-link" aria-label="Direct link to 2-renewal-via-internal-ca-vault-pki-venafi-active-directory-ca" title="Direct link to 2-renewal-via-internal-ca-vault-pki-venafi-active-directory-ca" translate="no">​</a></h3>
<p>Some environments need tighter control than ACME allows. In those cases:</p>
<ul>
<li class="">Vault's PKI engine can issue short-lived certs on demand</li>
<li class="">Venafi integrates with enterprise workflows and HSM-backed keys</li>
<li class="">Active Directory Certificate Services can automate internal certs for
Windows-heavy stacks</li>
</ul>
<p>The trick is to treat issuance and renewal as API-driven processes - not
as manual handoffs.</p>
<p>The pipeline should be able to:</p>
<ul>
<li class="">generate or reuse keys</li>
<li class="">request a new certificate</li>
<li class="">store it securely</li>
<li class="">trigger a reload or rotation</li>
<li class="">validate that clients accept the new chain</li>
</ul>
<p>Once this flow exists, adding observability around it is straightforward.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-automating-the-distribution-step"><strong>3. Automating the <em>distribution</em> step</strong><a href="https://blog.base14.io/make-certificate-expiry-boring#3-automating-the-distribution-step" class="hash-link" aria-label="Direct link to 3-automating-the-distribution-step" title="Direct link to 3-automating-the-distribution-step" translate="no">​</a></h3>
<p>Most certificate outages happen <em>after</em> renewal succeeds - when the new
certificate exists but hasn't been rolled out cleanly.</p>
<p>To make rotation safe and predictable:</p>
<ul>
<li class="">Upload the new certificate <em>alongside</em> the old one</li>
<li class="">Switch your service or load balancer to the new certificate atomically</li>
<li class="">Gracefully reload instead of restarting</li>
<li class="">Keep the old cert around for a short overlap window</li>
<li class="">Validate that clients, proxies, and edge layers all trust the new
chain</li>
</ul>
<p>This overlap pattern avoids the "everything broke because we reloaded too
aggressively" class of outages, which is surprisingly common.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="4-cloud-managed-rotation"><strong>4. Cloud-managed rotation</strong><a href="https://blog.base14.io/make-certificate-expiry-boring#4-cloud-managed-rotation" class="hash-link" aria-label="Direct link to 4-cloud-managed-rotation" title="Direct link to 4-cloud-managed-rotation" translate="no">​</a></h3>
<p>Cloud providers do a decent job of renewing certificates automatically,
but they won't validate your whole deployment chain. That's on you.</p>
<p>The safe pattern:</p>
<ul>
<li class="">Let the cloud provider renew</li>
<li class="">Poll for renewal events</li>
<li class="">Verify that listeners, API gateways, and CDN distributions have
<em>updated attachments</em></li>
<li class="">Validate downstream systems that import or pin certificates</li>
<li class="">Raise alerts if anything gets stuck on an older version</li>
</ul>
<p>This closes the gap between "cert renewed" and "cert in use."</p>
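<p>One simple way to check that gap is to compare fingerprints: the certificate
the provider says it renewed versus the one the listener actually serves. A
sketch - in practice the served DER would come from a live handshake (e.g.
<code>getpeercert(binary_form=True)</code>):</p>

```python
import hashlib

def fingerprint(der_bytes):
    """SHA-256 fingerprint of a DER-encoded certificate, colon-separated."""
    digest = hashlib.sha256(der_bytes).hexdigest()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2)).upper()

def rotation_complete(renewed_der, served_der):
    """True once the endpoint is actually serving the renewed certificate."""
    return fingerprint(renewed_der) == fingerprint(served_der)
```

<p>Alert whenever <code>rotation_complete</code> stays false for longer than
your rollout window.</p>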
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="5-rotation-in-service-meshes-and-sidecar-based-systems"><strong>5. Rotation in service meshes and sidecar-based systems</strong><a href="https://blog.base14.io/make-certificate-expiry-boring#5-rotation-in-service-meshes-and-sidecar-based-systems" class="hash-link" aria-label="Direct link to 5-rotation-in-service-meshes-and-sidecar-based-systems" title="Direct link to 5-rotation-in-service-meshes-and-sidecar-based-systems" translate="no">​</a></h3>
<p>Istio, Linkerd, Consul Connect, and similar meshes issue short-lived
certificates to workloads and rotate them frequently. This is excellent
for security - but only if rotation stays healthy.</p>
<p>You want to monitor:</p>
<ul>
<li class="">workload certificate rotation age</li>
<li class="">control-plane CA expiry</li>
<li class="">sidecar rotation errors</li>
<li class="">issuance backoff or throttling</li>
</ul>
<p>If rotation falls behind, it should be alerted on long before expiry.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-goal-is-predictability-not-cleverness">The goal is predictability, not cleverness<a href="https://blog.base14.io/make-certificate-expiry-boring#the-goal-is-predictability-not-cleverness" class="hash-link" aria-label="Direct link to The goal is predictability, not cleverness" title="Direct link to The goal is predictability, not cleverness" translate="no">​</a></h3>
<p>A good renewal system doesn't try to be "smart."
It tries to be <strong>boring</strong> - predictable, transparent, observable, and
easy to test.</p>
<p>The next step is tying this predictability into your alerting strategy:
you want enough signal to catch problems early, but not so much noise
that expiry becomes background static.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="alerting-strategies-that-actually-prevent-downtime">Alerting strategies that actually prevent downtime<a href="https://blog.base14.io/make-certificate-expiry-boring#alerting-strategies-that-actually-prevent-downtime" class="hash-link" aria-label="Direct link to Alerting strategies that actually prevent downtime" title="Direct link to Alerting strategies that actually prevent downtime" translate="no">​</a></h2>
<p>Once certificates are visible in your monitoring system, the next
challenge is deciding <em>when</em> to alert and <em>how loudly</em>. Expiry isn't
like latency or saturation - it doesn't fluctuate minute-to-minute. It
moves slowly, predictably, and without drama. That means your alerts
should feel the same: calm, early, and useful.</p>
<p>A good alert for certificate expiry does two things:</p>
<ol>
<li class="">It tells you early enough that the fix is routine.</li>
<li class="">It doesn't page the team unless the system is genuinely at risk.</li>
</ol>
<p>At the risk of being prescriptive, here's how to design that balance.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-use-long-staggered-alert-windows"><strong>1. Use long, staggered alert windows</strong><a href="https://blog.base14.io/make-certificate-expiry-boring#1-use-long-staggered-alert-windows" class="hash-link" aria-label="Direct link to 1-use-long-staggered-alert-windows" title="Direct link to 1-use-long-staggered-alert-windows" translate="no">​</a></h3>
<p>A 90-day certificate doesn't need a red alert at day 89.
But it also shouldn't wait until day 3.</p>
<p>A common, reliable pattern is:</p>
<ul>
<li class=""><strong>30 days</strong> → warning (non-paging)</li>
<li class=""><strong>14 days</strong> → urgent (may page depending on environment)</li>
<li class=""><strong>7 days</strong> → critical (should page)</li>
</ul>
<p>This staggered approach ensures:</p>
<ul>
<li class="">your team has multiple chances to notice</li>
<li class="">you can distinguish "renewal hasn't happened yet" from "renewal
failed"</li>
<li class="">you avoid last-minute firefighting, especially around holidays or
weekends</li>
</ul>
<p>The goal is to turn expiry into a background piece of operational hygiene
- not an adrenaline spike.</p>
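<p>The staggered windows above reduce to a tiny classification step. A sketch
of the mapping from days-remaining to severity, using the thresholds suggested
here (adjust them to your environment):</p>

```python
def expiry_severity(days_remaining,
                    thresholds=((7, "critical"), (14, "urgent"), (30, "warning"))):
    """Map days-until-expiry onto staggered alert levels.
    Thresholds are checked from most to least severe; anything beyond
    the widest window is healthy ('ok')."""
    for limit, level in thresholds:
        if days_remaining <= limit:
            return level
    return "ok"
```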
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-alert-on-renewal-failures-not-just-expiry"><strong>2. Alert on renewal failures, not just expiry</strong><a href="https://blog.base14.io/make-certificate-expiry-boring#2-alert-on-renewal-failures-not-just-expiry" class="hash-link" aria-label="Direct link to 2-alert-on-renewal-failures-not-just-expiry" title="Direct link to 2-alert-on-renewal-failures-not-just-expiry" translate="no">​</a></h3>
<p>A certificate expiring is usually a <em>symptom</em>.
The real problem is that the renewal automation stopped working.</p>
<p>So your monitoring should include:</p>
<ul>
<li class="">ACME failures (DNS, HTTP-01/ALPN-01 challenges failing)</li>
<li class="">mesh-sidecar rotation failures</li>
<li class="">Vault or CA issuance errors</li>
<li class="">permissions regressions (role can no longer request or upload certs)</li>
<li class="">cloud-provider renewal stuck in "pending validation"</li>
</ul>
<p>These alerts often matter more than the expiry date itself.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-detect-chain-issues-and-intermediate-expiries"><strong>3. Detect chain issues and intermediate expiries</strong><a href="https://blog.base14.io/make-certificate-expiry-boring#3-detect-chain-issues-and-intermediate-expiries" class="hash-link" aria-label="Direct link to 3-detect-chain-issues-and-intermediate-expiries" title="Direct link to 3-detect-chain-issues-and-intermediate-expiries" translate="no">​</a></h3>
<p>Sometimes the leaf certificate is fine - but an intermediate in the chain
is not. Many teams miss this, because they only check the surface-level
cert.</p>
<p>Your probes should validate the <em>full</em> chain:</p>
<ul>
<li class="">intermediate expiry</li>
<li class="">missing intermediates</li>
<li class="">mismatched issuer</li>
<li class="">unexpected CA</li>
<li class="">weak algorithms</li>
</ul>
<p>Broken chains can create outages that look like TLS handshake mysteries,
even when the leaf cert is fresh.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="4-surface-expiry-as-a-metric-your-dashboards-understand"><strong>4. Surface expiry as a metric your dashboards understand</strong><a href="https://blog.base14.io/make-certificate-expiry-boring#4-surface-expiry-as-a-metric-your-dashboards-understand" class="hash-link" aria-label="Direct link to 4-surface-expiry-as-a-metric-your-dashboards-understand" title="Direct link to 4-surface-expiry-as-a-metric-your-dashboards-understand" translate="no">​</a></h3>
<p>A certificate's expiry date is just a timestamp. Expose it like any other
metric:</p>
<ul>
<li class=""><code>ssl_not_after_seconds</code></li>
<li class=""><code>cert_expiry_timestamp</code></li>
<li class=""><code>x509_validity_seconds</code></li>
</ul>
<p>Once it’s a metric:</p>
<ul>
<li class="">you can plot trends</li>
<li class="">you can compare environments</li>
<li class="">you can find components with unusually short or long TTLs</li>
<li class="">you can build SLOs around the rotation process</li>
</ul>
<p>It becomes part of your observability ecosystem, not an afterthought.</p>
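<p>To make the idea concrete, here is a sketch that renders expiry timestamps
in Prometheus text exposition format; the metric name mirrors the examples
above and the label name is illustrative:</p>

```python
def expiry_metrics(certs):
    """Render certificate expiry timestamps in Prometheus text
    exposition format. `certs` maps an endpoint label to a unix
    expiry timestamp (seconds)."""
    lines = ["# TYPE ssl_not_after_seconds gauge"]
    for endpoint, not_after in sorted(certs.items()):
        lines.append('ssl_not_after_seconds{endpoint="%s"} %d'
                     % (endpoint, not_after))
    return "\n".join(lines) + "\n"
```

<p>Serve that from a tiny exporter (or push it through your collector) and the
dashboards, queries, and SLOs listed above come for free.</p>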
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="5-dont-rely-on-humans-to-remember-edge-cases"><strong>5. Don't rely on humans to remember edge cases</strong><a href="https://blog.base14.io/make-certificate-expiry-boring#5-dont-rely-on-humans-to-remember-edge-cases" class="hash-link" aria-label="Direct link to 5-dont-rely-on-humans-to-remember-edge-cases" title="Direct link to 5-dont-rely-on-humans-to-remember-edge-cases" translate="no">​</a></h3>
<p>If your alerts depend on tribal knowledge - someone remembering that
"there's an old VPN gateway in staging with a cert that expires in March"
- then you don't have an alerting strategy, you have a memory test that
your team <strong>will</strong> fail.</p>
<p>Every certificate, in every environment, should be:</p>
<ul>
<li class="">discoverable</li>
<li class="">monitored</li>
<li class="">alertable</li>
</ul>
<p>The moment monitoring depends on someone remembering "that one place we
keep certs," you're back to hoping instead of observing.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="alerting-should-create-confidence-not-anxiety">Alerting should create confidence, not anxiety<a href="https://blog.base14.io/make-certificate-expiry-boring#alerting-should-create-confidence-not-anxiety" class="hash-link" aria-label="Direct link to Alerting should create confidence, not anxiety" title="Direct link to Alerting should create confidence, not anxiety" translate="no">​</a></h3>
<p>Good alerts help teams sleep better. They remove uncertainty and allow
engineers to trust that the system will tell them when something
important is off. Certificate expiry should fall squarely into this camp
- predictable, early, and boring.</p>
<p>With detection and alerting covered, the next piece is ensuring the
system behaves safely when certificates actually rotate: how to design
zero-downtime deployment patterns so rotation never becomes an outage
event.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="zero-downtime-rotation-patterns">Zero-downtime rotation patterns<a href="https://blog.base14.io/make-certificate-expiry-boring#zero-downtime-rotation-patterns" class="hash-link" aria-label="Direct link to Zero-downtime rotation patterns" title="Direct link to Zero-downtime rotation patterns" translate="no">​</a></h2>
<p>Even with good monitoring and robust automation, certificate renewals can
still cause trouble if the rotation process itself is fragile. A
surprising number of certificate-related outages happen <em>after</em> a new
certificate has already been issued - during the switch-over phase where
services, load balancers, or sidecars pick up the new credentials.</p>
<p>Zero-downtime rotation isn't complicated, but it does require deliberate
patterns. Most of these boil down to one principle:</p>
<blockquote><p><strong>Never replace a certificate in a way that surprises the system.</strong></p></blockquote>
<p>Here are the patterns that make rotation predictable and safe.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-overlap-the-old-and-new-certificates"><strong>1. Overlap the old and new certificates</strong><a href="https://blog.base14.io/make-certificate-expiry-boring#1-overlap-the-old-and-new-certificates" class="hash-link" aria-label="Direct link to 1-overlap-the-old-and-new-certificates" title="Direct link to 1-overlap-the-old-and-new-certificates" translate="no">​</a></h3>
<p>A simple but powerful rule:
<strong>Always have a window where both the old and new certificates are valid
and deployed.</strong></p>
<p>This overlap ensures:</p>
<ul>
<li class="">long-lived clients can finish their sessions</li>
<li class="">short-lived clients pick up the new cert seamlessly</li>
<li class="">you avoid "half the system has the new cert, half has the old one"
situations</li>
</ul>
<p>In practice, this can mean:</p>
<ul>
<li class="">adding the new certificate as a second chain in a load balancer</li>
<li class="">rotating the private key but temporarily supporting both versions</li>
<li class="">waiting for a full deployment cycle before removing the old cert</li>
</ul>
<p>Overlap is your safety net.</p>
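<p>The overlap rule is easy to enforce mechanically. A minimal sketch, assuming you can read validity dates from both certificates (the seven-day minimum is an illustrative default, not a standard):</p>

```python
from datetime import datetime, timezone

def overlap_ok(old_not_after, new_not_before, min_overlap_days=7):
    """True if the new certificate becomes valid at least
    `min_overlap_days` before the old one expires, so both are
    deployable during the switch-over window."""
    overlap = old_not_after - new_not_before
    return overlap.days >= min_overlap_days
```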
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-use-atomic-attachment-for-load-balancers-and-gateways"><strong>2. Use atomic attachment for load balancers and gateways</strong><a href="https://blog.base14.io/make-certificate-expiry-boring#2-use-atomic-attachment-for-load-balancers-and-gateways" class="hash-link" aria-label="Direct link to 2-use-atomic-attachment-for-load-balancers-and-gateways" title="Direct link to 2-use-atomic-attachment-for-load-balancers-and-gateways" translate="no">​</a></h3>
<p>Cloud load balancers usually support:</p>
<ul>
<li class="">uploading a new certificate</li>
<li class="">switching the listener to the new certificate in a single update</li>
</ul>
<p>This is vastly safer than:</p>
<ul>
<li class="">deleting and re-adding</li>
<li class="">reloading configuration mid-traffic</li>
<li class="">relying on an external script to get timing right</li>
</ul>
<p>Atomic attachment ensures that the traffic shift is instantaneous and consistent.</p>
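<p>As an illustration with AWS as the assumed provider, the switch can be expressed as a single listener update. The helper and ARNs below are hypothetical; only the commented boto3 call at the end touches a real API:</p>

```python
def listener_cert_update(listener_arn, new_cert_arn):
    """Build the payload for a single-update listener switch: the
    listener moves to the new certificate in one API call instead of
    a delete-and-re-add sequence."""
    return {
        "ListenerArn": listener_arn,
        "Certificates": [{"CertificateArn": new_cert_arn}],
    }

# With boto3 available and configured, the switch itself would be:
# boto3.client("elbv2").modify_listener(**listener_cert_update(arn, cert))
```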
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-prefer-graceful-reloads-over-restarts"><strong>3. Prefer graceful reloads over restarts</strong><a href="https://blog.base14.io/make-certificate-expiry-boring#3-prefer-graceful-reloads-over-restarts" class="hash-link" aria-label="Direct link to 3-prefer-graceful-reloads-over-restarts" title="Direct link to 3-prefer-graceful-reloads-over-restarts" translate="no">​</a></h3>
<p>Some services pick up new certificates on reload, others need restarts.
Where you can, choose the reload path.</p>
<p>Graceful reloads:</p>
<ul>
<li class="">avoid dropping connections</li>
<li class="">preserve in-flight requests</li>
<li class="">avoid spikes in error rates and latency</li>
<li class="">allow blue-green or rolling processes inside Kubernetes, Nomad, or VMs</li>
</ul>
<p>If a service truly cannot reload (rare today), wrap rotation in a:</p>
<ul>
<li class="">rolling restart</li>
<li class="">node-by-node drain</li>
<li class="">health-checked deployment sequence</li>
</ul>
<p>The idea is the same: no hard cuts.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="4-validate-after-rotation---not-just-before"><strong>4. Validate after rotation - not just before</strong><a href="https://blog.base14.io/make-certificate-expiry-boring#4-validate-after-rotation---not-just-before" class="hash-link" aria-label="Direct link to 4-validate-after-rotation---not-just-before" title="Direct link to 4-validate-after-rotation---not-just-before" translate="no">​</a></h3>
<p>Many teams validate certificates before they rotate:</p>
<ul>
<li class="">subject, issuer</li>
<li class="">SAN list</li>
<li class="">expiry date</li>
<li class="">chain</li>
<li class="">signature</li>
</ul>
<p>All good - but not enough.</p>
<p>You also need <strong>post-rotation validation</strong>:</p>
<ul>
<li class="">do clients still trust the chain?</li>
<li class="">is OCSP/CRL working?</li>
<li class="">did any pinned-certificate clients break?</li>
<li class="">did any intermediate certificates unexpectedly change?</li>
<li class="">did the system propagate the new certificate everywhere?</li>
</ul>
<p>Treat rotation as a deployment, not a file update.</p>
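<p>One simple post-rotation check is to compare the certificate an endpoint actually serves against the one you just issued, by fingerprint. A sketch using Python's standard library, where host, port, and the expected DER bytes come from your own rotation pipeline:</p>

```python
import hashlib
import ssl

def fingerprint(der_bytes):
    """SHA-256 fingerprint of a DER-encoded certificate."""
    return hashlib.sha256(der_bytes).hexdigest()

def served_matches(host, port, expected_der):
    """Fetch the certificate a live endpoint actually serves and
    compare it, by fingerprint, with the one that was just issued."""
    pem = ssl.get_server_certificate((host, port))
    der = ssl.PEM_cert_to_DER_cert(pem)
    return fingerprint(der) == fingerprint(expected_der)
```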
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="5-treat-service-meshes-as-first-class-rotation-systems"><strong>5. Treat service meshes as first-class rotation systems</strong><a href="https://blog.base14.io/make-certificate-expiry-boring#5-treat-service-meshes-as-first-class-rotation-systems" class="hash-link" aria-label="Direct link to 5-treat-service-meshes-as-first-class-rotation-systems" title="Direct link to 5-treat-service-meshes-as-first-class-rotation-systems" translate="no">​</a></h3>
<p>Sidecar-based meshes like Istio or Linkerd already rotate certificates
frequently. But the control-plane CA certificates still need careful
handling.</p>
<p>When rotating a CA certificate in a mesh:</p>
<ul>
<li class="">introduce the new root or intermediate</li>
<li class="">allow both chains temporarily</li>
<li class="">ensure workloads are receiving new leaf certs under the new CA</li>
<li class="">only retire the old CA when no workload depends on it</li>
</ul>
<p>Skipping these steps can break mTLS cluster-wide.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="6-keep-rotation-logs---theyre-your-only-breadcrumb-trail"><strong>6. Keep rotation logs - they're your only breadcrumb trail</strong><a href="https://blog.base14.io/make-certificate-expiry-boring#6-keep-rotation-logs---theyre-your-only-breadcrumb-trail" class="hash-link" aria-label="Direct link to 6-keep-rotation-logs---theyre-your-only-breadcrumb-trail" title="Direct link to 6-keep-rotation-logs---theyre-your-only-breadcrumb-trail" translate="no">​</a></h3>
<p>Certificate rotation has a habit of failing silently.
Most debugging sessions start with, "Did the certificate get picked up?"
and end in grepping logs or diffing secrets.</p>
<p>A good rotation system records:</p>
<ul>
<li class="">when certificates were requested</li>
<li class="">when they were issued</li>
<li class="">where they were distributed</li>
<li class="">when services reloaded/restarted</li>
<li class="">which version is currently active</li>
</ul>
<p>This is invaluable during an incident, and equally helpful for audits or
compliance. Drop the records into your #release or #deployment Slack
channel so others can debug faster when things go wrong.</p>
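<p>A rotation log doesn't need to be elaborate. A minimal in-memory sketch of the events listed above (a real system would persist these and ship them to your observability stack):</p>

```python
from datetime import datetime, timezone

class RotationLog:
    """Append-only record of rotation events: requested, issued,
    distributed, reloaded, active."""

    def __init__(self):
        self.events = []

    def record(self, cert_name, event, detail=""):
        self.events.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "cert": cert_name,
            "event": event,
            "detail": detail,
        })

    def active_version(self, cert_name):
        """The most recently activated version, or None."""
        for entry in reversed(self.events):
            if entry["cert"] == cert_name and entry["event"] == "active":
                return entry["detail"]
        return None
```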
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="rotation-should-feel-like-any-other-deploy">Rotation should feel like any other deploy<a href="https://blog.base14.io/make-certificate-expiry-boring#rotation-should-feel-like-any-other-deploy" class="hash-link" aria-label="Direct link to Rotation should feel like any other deploy" title="Direct link to Rotation should feel like any other deploy" translate="no">​</a></h3>
<p>The most reliable teams treat certificate rotation exactly like they
treat code deployment:</p>
<ul>
<li class="">staged</li>
<li class="">observable</li>
<li class="">reversible</li>
<li class="">tested</li>
<li class="">boring</li>
</ul>
<p>When a certificate rotation feels as uninteresting as a config push or a
canary rollout, you've reached operational maturity in this area.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="building-organisation-wide-guardrails-around-certificate-management">Building organisation-wide guardrails around certificate management<a href="https://blog.base14.io/make-certificate-expiry-boring#building-organisation-wide-guardrails-around-certificate-management" class="hash-link" aria-label="Direct link to Building organisation-wide guardrails around certificate management" title="Direct link to Building organisation-wide guardrails around certificate management" translate="no">​</a></h2>
<p>Everything we've covered so far - inventory, monitoring, renewal,
rotation - solves the <em>technical</em> side of certificate expiry. But
outages rarely happen because of a missing script or exporter. They
happen because systems grow, responsibilities shift, and operational
assumptions slowly drift out of sync with reality.</p>
<p>Preventing certificate-expiry outages at scale requires more than good
automation. It needs <strong>guardrails</strong>: lightweight, durable structures that
support engineers without slowing them down. This isn't governance, and
it isn't process for its own sake. It's about giving teams the clarity and
safety they need so certificates don't become an invisible failure mode.</p>
<p>Many of these guardrails become unnecessary once you have a single,
well-known, automated way of handling certificates. Often that isn't the
case, and that's where guardrails earn their keep. Here are the ones that
have helped me manage the complexity of a manual certificate
lifecycle.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-make-ownership-explicit---for-every-certificate"><strong>1. Make ownership explicit - for every certificate</strong><a href="https://blog.base14.io/make-certificate-expiry-boring#1-make-ownership-explicit---for-every-certificate" class="hash-link" aria-label="Direct link to 1-make-ownership-explicit---for-every-certificate" title="Direct link to 1-make-ownership-explicit---for-every-certificate" translate="no">​</a></h3>
<p>Every certificate in your system should have:</p>
<ul>
<li class="">an owner</li>
<li class="">a renewal mechanism</li>
<li class="">a rotation mechanism</li>
<li class="">a monitoring hook</li>
<li class="">an escalation path</li>
</ul>
<p>This sounds formal, but it can be as simple as three fields in an internal inventory:</p>
<ul>
<li class=""><em>Service name</em></li>
<li class=""><em>Team</em></li>
<li class=""><em>Contact channel</em></li>
</ul>
<p>When ownership is clear, expiry becomes a maintenance task, not a detective story.</p>
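<p>A minimal sketch of such an inventory record and a completeness check; the field names are illustrative, matching the three fields above:</p>

```python
from dataclasses import dataclass

@dataclass
class CertRecord:
    service: str
    team: str
    contact_channel: str

def missing_ownership(records):
    """Flag services whose certificate has no clear owner."""
    return [r.service for r in records
            if not (r.team and r.contact_channel)]
```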
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-set-policy-but-keep-it-lightweight"><strong>2. Set policy, but keep it lightweight</strong><a href="https://blog.base14.io/make-certificate-expiry-boring#2-set-policy-but-keep-it-lightweight" class="hash-link" aria-label="Direct link to 2-set-policy-but-keep-it-lightweight" title="Direct link to 2-set-policy-but-keep-it-lightweight" translate="no">​</a></h3>
<p>Certificate policies often fail because they become too rigid or too
verbose. A practical policy should answer only the essentials:</p>
<ul>
<li class="">What is the recommended TTL?</li>
<li class="">Which CAs are approved?</li>
<li class="">How should private keys be stored?</li>
<li class="">What is the expected rotation pattern?</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-use-the-same-observability-channels-you-use-for-everything-else"><strong>3. Use the same observability channels you use for everything else</strong><a href="https://blog.base14.io/make-certificate-expiry-boring#3-use-the-same-observability-channels-you-use-for-everything-else" class="hash-link" aria-label="Direct link to 3-use-the-same-observability-channels-you-use-for-everything-else" title="Direct link to 3-use-the-same-observability-channels-you-use-for-everything-else" translate="no">​</a></h3>
<p>A certificate expiring should appear in:</p>
<ul>
<li class="">the same dashboard</li>
<li class="">the same alerting system</li>
<li class="">the same on-call rotation</li>
<li class="">the same incident workflow</li>
</ul>
<p>If you need a separate tool or a second inbox to monitor certificates,
you've already created friction and added to the confusion. The best
guardrail is simply: "This is part of our normal operational
metrics."</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="4-run-periodic-expiry-audits-without-blame"><strong>4. Run periodic "expiry audits" without blame</strong><a href="https://blog.base14.io/make-certificate-expiry-boring#4-run-periodic-expiry-audits-without-blame" class="hash-link" aria-label="Direct link to 4-run-periodic-expiry-audits-without-blame" title="Direct link to 4-run-periodic-expiry-audits-without-blame" translate="no">​</a></h3>
<p>Once or twice a year, do a small audit:</p>
<ul>
<li class="">list certificates expiring within N days</li>
<li class="">identify certificates with missing owners</li>
<li class="">catch stray certs on forgotten hosts</li>
<li class="">verify mesh CA rotations</li>
<li class="">clean up unused secrets</li>
</ul>
<p>The best option is to automate this audit.</p>
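<p>The audit above can be a single scheduled function over the inventory. A minimal sketch with hypothetical fields, covering the expiry and missing-owner checks:</p>

```python
from datetime import datetime, timedelta, timezone

def expiry_audit(certs, days=60, now=None):
    """One scheduled pass over the inventory: certificates expiring
    within `days`, and certificates nobody owns."""
    now = now or datetime.now(timezone.utc)
    horizon = now + timedelta(days=days)
    return {
        "expiring": [c["name"] for c in certs if c["not_after"] <= horizon],
        "unowned": [c["name"] for c in certs if not c.get("owner")],
    }
```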
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="5-practice-a-certificate-rotation-drill"><strong>5. Practice a certificate-rotation drill</strong><a href="https://blog.base14.io/make-certificate-expiry-boring#5-practice-a-certificate-rotation-drill" class="hash-link" aria-label="Direct link to 5-practice-a-certificate-rotation-drill" title="Direct link to 5-practice-a-certificate-rotation-drill" translate="no">​</a></h3>
<p>Just like fire drills, rotation drills can build confidence by exposing
vulnerabilities and gaps.
Pick a non-critical service once a quarter:</p>
<ul>
<li class="">issue a new certificate</li>
<li class="">rotate it using your recommended method</li>
<li class="">validate behaviour</li>
<li class="">document any rough edges</li>
</ul>
<p>This helps teams become comfortable with rotations, and uncovers issues
that only show up during real renewals - mismatched trust stores, pinned
clients, stale intermediates, or forgotten nodes. Better still, run the
drill against a production service.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="6-encourage-teams-to-prefer-automation-over-manual-fixes"><strong>6. Encourage teams to prefer automation over manual fixes</strong><a href="https://blog.base14.io/make-certificate-expiry-boring#6-encourage-teams-to-prefer-automation-over-manual-fixes" class="hash-link" aria-label="Direct link to 6-encourage-teams-to-prefer-automation-over-manual-fixes" title="Direct link to 6-encourage-teams-to-prefer-automation-over-manual-fixes" translate="no">​</a></h3>
<p>When a certificate is close to expiring, the fastest fix is often manual:
generate a cert, upload it, restart a service - done.</p>
<p>It works in the moment, but creates a hidden cost: the automation is
bypassed, and the system drifts.</p>
<p>Guardrails help by making the automated path the default:</p>
<ul>
<li class="">CI pipelines that issue certs consistently</li>
<li class="">templates that enforce expiry monitoring</li>
<li class="">runbooks that always reference the automated flow</li>
<li class="">dashboards that show rotation health</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="guardrails-keep-engineering-energy-focused-where-it-matters">Guardrails keep engineering energy focused where it matters<a href="https://blog.base14.io/make-certificate-expiry-boring#guardrails-keep-engineering-energy-focused-where-it-matters" class="hash-link" aria-label="Direct link to Guardrails keep engineering energy focused where it matters" title="Direct link to Guardrails keep engineering energy focused where it matters" translate="no">​</a></h3>
<p>Good guardrails don't feel heavy. They feel like support structures - the
kind that keep important details visible even when everyone is moving
fast. They reduce cognitive load, eliminate invisible traps, and give
teams a shared mental model for how certificates behave in their
environment.</p>
<p>When these guardrails are in place, certificate expiry stops being a
background anxiety. It becomes just another part of the system that's
well understood, continuously monitored, and quietly maintained.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="bringing-it-all-together---from-trapdoor-failures-to-predictable-operations">Bringing it all together - from trapdoor failures to predictable operations<a href="https://blog.base14.io/make-certificate-expiry-boring#bringing-it-all-together---from-trapdoor-failures-to-predictable-operations" class="hash-link" aria-label="Direct link to Bringing it all together - from trapdoor failures to predictable operations" title="Direct link to Bringing it all together - from trapdoor failures to predictable operations" translate="no">​</a></h2>
<p>Certificate-expiry outages feel disproportionate. They don't arise from a
complex scaling limit or an unexpected dependency interaction. They come
from a single date embedded in a file - a detail that quietly counts down
while everything else appears healthy. And when that date finally
arrives, the failure is abrupt. No slow burn, no early symptoms. Just a
trapdoor.</p>
<p>But it doesn't need to be this way.
Expiry is one of the few reliability risks that is both entirely
predictable and entirely preventable.</p>
<p>When we treat certificates as operational assets - things we can
inventory, observe, rotate, and practice with - the problem changes
shape. Instead of scrambling during an incident, teams build a steady
rhythm around expiry:</p>
<ul>
<li class="">certificates are visible as metrics</li>
<li class="">renewals happen automatically</li>
<li class="">rotations are safe and boring</li>
<li class="">alerts arrive early and calmly</li>
<li class="">ownership is clear</li>
<li class="">guardrails carry the organisational weight</li>
</ul>
<p>And the result is a system that behaves the way resilient systems should:
not because people remembered every corner, but because the structure
makes forgetting impossible.</p>
<p>The GitHub outage was a reminder, not a criticism. It showed that even
the most sophisticated engineering organisations can be caught off-guard
by something small and silent. But it also demonstrated why it's worth
building a culture - and a set of practices - where small and silent
things are surfaced early.</p>
<p>If your team can get certificate expiry out of the class of "we hope this
doesn't bite us" and into the class of "this is a well-managed part of
our infrastructure," you've eliminated an entire category of avoidable
outages.</p>
<p>That's the goal. Not perfect governance. Just clear guardrails, steady
habits, and a system you can trust - even on the days when nothing looks
wrong.</p>]]></content>
        <author>
            <name>Ranjan Sakalley</name>
            <uri>https://base14.io/about</uri>
        </author>
        <category label="security" term="security"/>
        <category label="certificates" term="certificates"/>
        <category label="automation" term="automation"/>
        <category label="observability" term="observability"/>
        <category label="tls" term="tls"/>
        <category label="kubernetes" term="kubernetes"/>
        <category label="devops" term="devops"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[base14 Product Engineering Principles]]></title>
        <id>https://blog.base14.io/base14-product-engineering-principles</id>
        <link href="https://blog.base14.io/base14-product-engineering-principles"/>
        <updated>2025-11-19T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Craftsmanship, ownership, collaboration, and frugal innovation—the principles that guide how we build at Base14. Everyone ships, everyone supports production.]]></summary>
        <content type="html"><![CDATA[<p>At base14, everyone is always</p>
<ul>
<li class="">shipping</li>
<li class="">forward deployed</li>
<li class="">helping customers</li>
<li class="">on production support</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="principles">Principles<a href="https://blog.base14.io/base14-product-engineering-principles#principles" class="hash-link" aria-label="Direct link to Principles" title="Direct link to Principles" translate="no">​</a></h2>
<p>Craftsmanship</p>
<ul>
<li class="">Take time, do the right thing</li>
<li class="">Leave the codebase better than you found it</li>
<li class="">Build for the long term</li>
</ul>
<p>Ownership</p>
<ul>
<li class="">Own your learnings</li>
<li class="">Enforce radical transparency</li>
<li class="">Figure out the best thing to do to help our customers</li>
</ul>
<p>Collaboration</p>
<ul>
<li class="">Communicate clearly</li>
<li class="">Ask the hard questions</li>
<li class="">When in doubt, ask the customer</li>
<li class="">Assume good intent, seek shared understanding</li>
</ul>
<p>Frugal innovation</p>
<ul>
<li class="">Do more with less</li>
<li class="">Automate everything</li>
<li class="">Choose the simplest tool that works</li>
<li class="">Let constraints drive better solutions</li>
</ul>]]></content>
        <author>
            <name>Ranjan Sakalley</name>
            <uri>https://base14.io/about</uri>
        </author>
        <category label="organization" term="organization"/>
        <category label="engineering" term="engineering"/>
        <category label="principles" term="principles"/>
        <category label="culture" term="culture"/>
        <category label="product-engineering" term="product-engineering"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Understanding What Increases and Reduces MTTR]]></title>
        <id>https://blog.base14.io/factors-influencing-mttr</id>
        <link href="https://blog.base14.io/factors-influencing-mttr"/>
        <updated>2025-11-03T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Tool fragmentation, alert noise, and tribal knowledge slow recovery. Learn what disciplined, observable teams do differently to reduce Mean Time to Recovery.]]></summary>
        <content type="html"><![CDATA[<p><em>What makes recovery slower — and what disciplined, observable teams do
differently.</em></p>
<hr>
<p>In reliability engineering, MTTR (Mean Time to Recovery) is one of the
clearest indicators of how mature a system — and a team — really is. It
measures not just how quickly you fix things, but how well your organization
detects, communicates, and learns from failure.</p>
<p>Every production incident is a test of the system's design, the team's
reflexes, and the clarity of their shared context. MTTR rises when friction
builds up in those connections — between tools, roles, or data. It falls when
context flows freely and decisions move faster than confusion.</p>
<p>The table below outlines what typically increases MTTR, and what helps reduce
it.</p>
<table><thead><tr><th><strong>What Increases MTTR</strong></th><th><strong>What Reduces MTTR</strong></th></tr></thead><tbody><tr><td><strong>Tool fragmentation</strong> — Engineers switching between 5–6 systems to correlate metrics, logs, and traces.</td><td><strong>Unified observability</strong> — One system of record for signals, context, and dependencies.</td></tr><tr><td><strong>Ambiguous ownership</strong> — No clear incident lead or decision-maker during crises.</td><td><strong>Clear incident command</strong> — Defined roles: Incident Lead, Scribe, Technical Actors, Comms Lead.</td></tr><tr><td><strong>Tribal knowledge dependency</strong> — Critical know-how lives in people's heads, not in runbooks or documentation.</td><td><strong>Documented runbooks &amp; shared context</strong> — Institutionalize recovery steps and system behavior.</td></tr><tr><td><strong>Delayed or low-quality alerts</strong> — Issues detected late, or alerts lack relevance or context.</td><td><strong>Contextual and prioritized alerting</strong> — Alerts linked to user impact, with clear severity and ownership.</td></tr><tr><td><strong>Unstructured communication</strong> — Slack chaos, overlapping updates, unclear status.</td><td><strong>War-room discipline</strong> — Structured updates, timestamped actions, single-threaded communication.</td></tr><tr><td><strong>Noisy or false-positive monitoring</strong> — Engineers waste time triaging irrelevant alerts.</td><td><strong>Adaptive thresholds &amp; anomaly detection</strong> — Focus attention on meaningful deviations.</td></tr><tr><td><strong>Complex release pipelines</strong> — Hard to correlate incidents with recent deployments or config changes.</td><td><strong>Deployment correlation</strong> — Automated linkage between system changes and emerging anomalies.</td></tr><tr><td><strong>Lack of observability in dependencies</strong> — Blind spots in upstream or third-party systems.</td><td><strong>End-to-end visibility</strong> — 
Instrumentation across services and dependencies.</td></tr><tr><td><strong>No post-incident learning</strong> — Same issues recur because lessons aren't captured.</td><td><strong>Structured postmortems</strong> — Document root causes, timelines, and action items for systemic fixes.</td></tr><tr><td><strong>Overly reactive culture</strong> — Teams firefight repeatedly without addressing systemic issues.</td><td><strong>Reliability mindset</strong> — Invest in prevention: better testing, chaos drills, resilience engineering.</td></tr></tbody></table>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="tool-fragmentation--unified-observability">Tool Fragmentation → Unified Observability<a href="https://blog.base14.io/factors-influencing-mttr#tool-fragmentation--unified-observability" class="hash-link" aria-label="Direct link to Tool Fragmentation → Unified Observability" title="Direct link to Tool Fragmentation → Unified Observability" translate="no">​</a></h2>
<p>One of the biggest sources of friction during incidents is tool fragmentation.
When every function — metrics, logs, traces — lives in a separate system,
engineers lose time stitching context instead of resolving the issue. Database
monitoring is a common blind spot—see how <a class="" href="https://blog.base14.io/introducing-pgx">pgX unifies PostgreSQL
observability</a> with application telemetry.</p>
<p>Unified observability doesn't mean one vendor or dashboard. It means a single,
correlated view where you can trace a signal from symptom to cause without
tab-switching or guesswork.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="ambiguous-ownership--clear-incident-command">Ambiguous Ownership → Clear Incident Command<a href="https://blog.base14.io/factors-influencing-mttr#ambiguous-ownership--clear-incident-command" class="hash-link" aria-label="Direct link to Ambiguous Ownership → Clear Incident Command" title="Direct link to Ambiguous Ownership → Clear Incident Command" translate="no">​</a></h2>
<p>The first few minutes of an incident often determine the total MTTR. If no one
knows who's in charge, time is lost to hesitation.</p>
<p>A clear incident command structure — with a Lead, a Scribe, and defined
technical owners — turns panic into coordination. Clarity is a multiplier for
speed.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="tribal-knowledge-dependency--documented-runbooks">Tribal Knowledge Dependency → Documented Runbooks<a href="https://blog.base14.io/factors-influencing-mttr#tribal-knowledge-dependency--documented-runbooks" class="hash-link" aria-label="Direct link to Tribal Knowledge Dependency → Documented Runbooks" title="Direct link to Tribal Knowledge Dependency → Documented Runbooks" translate="no">​</a></h2>
<p>Systems recover faster when knowledge isn't person-bound. When only one
engineer "knows" how a component behaves under failure, every minute of their
absence adds to downtime.</p>
<p>Runbooks and architectural notes make recovery procedural, not heroic.
Institutional knowledge beats tribal knowledge, every time.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="delayed-or-low-quality-alerts--contextual-and-prioritized-alerting">Delayed or Low-Quality Alerts → Contextual and Prioritized Alerting<a href="https://blog.base14.io/factors-influencing-mttr#delayed-or-low-quality-alerts--contextual-and-prioritized-alerting" class="hash-link" aria-label="Direct link to Delayed or Low-Quality Alerts → Contextual and Prioritized Alerting" title="Direct link to Delayed or Low-Quality Alerts → Contextual and Prioritized Alerting" translate="no">​</a></h2>
<p>MTTR starts at detection. If alerts arrive late, or worse, arrive noisy and
without context, the system is already behind.</p>
<p>Good alerting surfaces what matters first: alerts linked to user impact,
enriched with context and severity. A well-designed alert doesn't just notify
— it orients.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="unstructured-communication--war-room-discipline">Unstructured Communication → War-Room Discipline<a href="https://blog.base14.io/factors-influencing-mttr#unstructured-communication--war-room-discipline" class="hash-link" aria-label="Direct link to Unstructured Communication → War-Room Discipline" title="Direct link to Unstructured Communication → War-Room Discipline" translate="no">​</a></h2>
<p>Incident channels often devolve into noise — too many voices, overlapping
updates, and no clear sequence of events.</p>
<p>War-room discipline restores order: timestamped updates, designated leads, and
a single thread of record. The structure may feel rigid, but it accelerates
clarity.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="noisy-monitoring--adaptive-thresholds">Noisy Monitoring → Adaptive Thresholds<a href="https://blog.base14.io/factors-influencing-mttr#noisy-monitoring--adaptive-thresholds" class="hash-link" aria-label="Direct link to Noisy Monitoring → Adaptive Thresholds" title="Direct link to Noisy Monitoring → Adaptive Thresholds" translate="no">​</a></h2>
<p>When everything is "critical," nothing is.</p>
<p>Teams lose urgency when faced with hundreds of alerts of equal importance.
Adaptive thresholds and anomaly detection help focus human attention where it
matters — on genuine deviations from normal behavior.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="complex-releases--deployment-correlation">Complex Releases → Deployment Correlation<a href="https://blog.base14.io/factors-influencing-mttr#complex-releases--deployment-correlation" class="hash-link" aria-label="Direct link to Complex Releases → Deployment Correlation" title="Direct link to Complex Releases → Deployment Correlation" translate="no">​</a></h2>
<p>During incidents, teams often waste time rediscovering that the issue began
right after a deploy.</p>
<p>Correlating incidents with deployment timelines or configuration changes
reduces uncertainty. This isn't about assigning blame — it's about shrinking
the search space quickly.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="dependency-blind-spots--end-to-end-visibility">Dependency Blind Spots → End-to-End Visibility<a href="https://blog.base14.io/factors-influencing-mttr#dependency-blind-spots--end-to-end-visibility" class="hash-link" aria-label="Direct link to Dependency Blind Spots → End-to-End Visibility" title="Direct link to Dependency Blind Spots → End-to-End Visibility" translate="no">​</a></h2>
<p>Systems rarely fail in isolation. An API latency spike in one service can
cascade into failures elsewhere.</p>
<p>End-to-end visibility helps teams see across boundaries — understanding not
just their own service, but how it fits into the larger reliability graph.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="no-post-incident-learning--structured-postmortems">No Post-Incident Learning → Structured Postmortems<a href="https://blog.base14.io/factors-influencing-mttr#no-post-incident-learning--structured-postmortems" class="hash-link" aria-label="Direct link to No Post-Incident Learning → Structured Postmortems" title="Direct link to No Post-Incident Learning → Structured Postmortems" translate="no">​</a></h2>
<p>If an incident doesn't produce learning, it's bound to repeat.</p>
<p>Structured postmortems — with clear timelines, decisions, and next actions —
transform operational pain into organizational learning. Reliability improves
when teams close the feedback loop.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="reactive-culture--reliability-mindset">Reactive Culture → Reliability Mindset<a href="https://blog.base14.io/factors-influencing-mttr#reactive-culture--reliability-mindset" class="hash-link" aria-label="Direct link to Reactive Culture → Reliability Mindset" title="Direct link to Reactive Culture → Reliability Mindset" translate="no">​</a></h2>
<p>Finally, reliability isn't built during incidents — it's built between them.</p>
<p>A reactive culture celebrates firefighting; a reliability mindset values
prevention. Investing in chaos drills, resilience patterns, and testing failure
paths ensures MTTR naturally trends downward over time.</p>
<hr>
<p>MTTR reflects not just the health of systems, but the health of collaboration.</p>
<p>Reliable systems recover quickly not because they never fail, but because when
they do, everyone knows exactly what to do next.</p>]]></content>
        <author>
            <name>base14 Team</name>
            <uri>https://base14.io/about</uri>
        </author>
        <category label="observability" term="observability"/>
        <category label="mttr" term="mttr"/>
        <category label="reliability" term="reliability"/>
        <category label="engineering" term="engineering"/>
        <category label="best-practices" term="best-practices"/>
        <category label="collaboration" term="collaboration"/>
        <category label="incident-management" term="incident-management"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Why Unified Observability Matters for Growing Engineering Teams]]></title>
        <id>https://blog.base14.io/unified-observability</id>
        <link href="https://blog.base14.io/unified-observability"/>
        <updated>2025-08-18T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Stop context-switching between monitoring tools. Unified observability reduces MTTR by 50-60% and cuts alert noise by 90%.]]></summary>
        <content type="html"><![CDATA[<div class="blog-cover"><img src="https://blog.base14.io/assets/images/cover-43fb316b4eaa8ba665563d056e31a8e0.png" alt="Why Unified Observability Matters for Growing Engineering Teams"></div>
<p>Last month, I watched a senior engineer spend three hours debugging what should
have been a fifteen-minute problem. The issue wasn't complexity—it was context
switching between four different monitoring tools, correlating timestamps
manually, and losing their train of thought every time they had to log into yet
another dashboard. If this sounds familiar, you're not alone. This is the hidden
tax most engineering teams pay without realizing there's a better way.</p>
<p>As engineering teams grow from 20 to 200 people, the observability sprawl
becomes a significant drag on velocity. What starts as "let's use the best tool
for each job" often ends up as a maze of disconnected systems that make simple
questions surprisingly hard to answer. The cost of this fragmentation compounds
over time, much like technical debt, but it's often invisible until it becomes
painful.</p>
<p>Unified observability isn't about having fewer tools for the sake of simplicity.
It's about creating a coherent system where your teams can move from question to
answer without losing context, where correlation happens automatically, and
where the cognitive load of understanding your systems doesn't grow
exponentially with their complexity.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-real-cost-of-fragmented-observability">The Real Cost of Fragmented Observability<a href="https://blog.base14.io/unified-observability#the-real-cost-of-fragmented-observability" class="hash-link" aria-label="Direct link to The Real Cost of Fragmented Observability" title="Direct link to The Real Cost of Fragmented Observability" translate="no">​</a></h2>
<p>Most teams don't set out to create observability sprawl. It happens
gradually—the infrastructure team picks a metrics solution, the application team
chooses an APM tool, someone adds a log aggregator, and before you know it, you
have what I call the "observability tax." Every new engineer needs to learn
multiple tools, every incident requires juggling browser tabs, and every
post-mortem reveals gaps between systems that no one noticed until something
broke.</p>
<p>The immediate costs are obvious: longer incident resolution times, frustrated
engineers, and missed SLA breaches. But the hidden costs are what really hurt.
Engineers start avoiding investigations because they're too cumbersome. They
make decisions based on partial data because getting the full picture takes too
long. Worse, they begin to distrust the tools themselves, creating a culture
where gut feelings override data-driven decisions.</p>
<p>I've seen teams where senior engineers keep personal docs on "which tool to
check for what". When your observability strategy requires tribal knowledge to
navigate, you've already lost. The irony is that these teams often have
excellent coverage—they can observe everything, they just can't make sense of it
efficiently.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="faster-incident-resolution">Faster Incident Resolution<a href="https://blog.base14.io/unified-observability#faster-incident-resolution" class="hash-link" aria-label="Direct link to Faster Incident Resolution" title="Direct link to Faster Incident Resolution" translate="no">​</a></h2>
<p>The most immediate benefit of unified observability is dramatically faster
incident resolution. But it's not just about speed—it's about maintaining
context and reducing the cognitive load during high-stress situations. When an
incident hits at 2 AM, the difference between clicking through one interface
versus four isn't just minutes saved; it's the difference between a focused
investigation and a frantic scramble.</p>
<p>Consider a typical scenario: your payment service starts failing. With
fragmented tools, you check application logs in one system, infrastructure
metrics in another, trace the request flow in a third, and finally correlate
user impact in a fourth. Each transition loses context, each tool has different
time formats, and by the time you've gathered all the data, you've lost the
thread of your investigation. With unified observability, you start with the
symptom and drill down through correlated data without context switches. The
failed payments lead directly to the slow database queries, which link to the
infrastructure metrics showing disk I/O saturation—all in one flow. This is
exactly the kind of correlation that <a class="" href="https://blog.base14.io/introducing-pgx">pgX</a> enables for
PostgreSQL workloads.</p>
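<p>The mechanics behind that single flow are worth making concrete. The toy records below are hypothetical, but they show the property that makes drill-down possible: signals that share identifiers (a trace id, a host name) can be joined programmatically instead of by eyeballing timestamps across four tabs:</p>

```python
# Hypothetical events from three signal types, already sharing identifiers.
logs = [{"trace_id": "t1", "msg": "payment failed"}]
spans = [{"trace_id": "t1", "name": "db.query", "duration_ms": 900}]
metrics = [{"host": "db-1", "disk_io_util": 0.98}]
hosts_by_span = {"db.query": "db-1"}  # span name -> host, e.g. from resource attributes

def drill_down(trace_id):
    """Follow one failed payment from its log to the slow span to the saturated host."""
    span = next(s for s in spans if s["trace_id"] == trace_id)
    host = hosts_by_span[span["name"]]
    metric = next(m for m in metrics if m["host"] == host)
    return span["name"], metric["disk_io_util"]

culprit_span, io_util = drill_down("t1")
# The failed payment leads to db.query, running on a host at 98% disk I/O.
```
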
<p>The real magic happens when your tools share the same understanding of your
system. Service names, tags, and timestamps align automatically. What used to
require manual correlation now happens instantly. I've seen teams reduce their
mean time to resolution (MTTR) by 50-60% just by eliminating the friction of
tool-switching. But more importantly, incidents become learning opportunities
rather than fire drills, because engineers can focus on understanding the
problem rather than wrestling with the tools.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="reduced-context-switching-and-cognitive-load">Reduced Context Switching and Cognitive Load<a href="https://blog.base14.io/unified-observability#reduced-context-switching-and-cognitive-load" class="hash-link" aria-label="Direct link to Reduced Context Switching and Cognitive Load" title="Direct link to Reduced Context Switching and Cognitive Load" translate="no">​</a></h2>
<p>Engineers are expensive, and not just in salary terms. Their ability to maintain
flow state and solve complex problems is your competitive advantage. Every
context switch—whether between tools, documentation, or mental models—degrades
this ability. Unified observability isn't just about efficiency; it's about
preserving your team's cognitive capacity for the problems that matter.</p>
<p>The math is simple but often overlooked. If an engineer spends 30% of their
debugging time just navigating between tools and correlating data manually,
that's 30% less time understanding and fixing the actual problem. Multiply this
across every engineer, every incident, every investigation, and you're looking
at significant productivity loss. But it's worse than just time lost—context
switching increases error rates and decision fatigue.</p>
<p>What's less obvious is how this affects your team's willingness to investigate
issues proactively. When checking a hypothesis requires logging into three
different systems, engineers stop checking hunches. They wait for problems to
become critical enough to justify the effort. This reactive stance means you're
always playing catch-up, fixing problems after they've impacted customers rather
than preventing them. A unified system lowers the activation energy for
investigation, encouraging engineers to dig deeper and catch issues early.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="cost-optimization-through-correlation">Cost Optimization Through Correlation<a href="https://blog.base14.io/unified-observability#cost-optimization-through-correlation" class="hash-link" aria-label="Direct link to Cost Optimization Through Correlation" title="Direct link to Cost Optimization Through Correlation" translate="no">​</a></h2>
<p>The conversation about observability costs often focuses on the wrong metrics.
Yes, unified platforms can reduce licensing fees and infrastructure costs, but
the real savings come from correlation and deduplication. When your metrics,
logs, and traces live in separate silos, you're not just paying for storage
three times—you're missing the insights that come from connecting the dots.</p>
<p>Take a real example: a team I worked with discovered they were spending $50K
monthly on log storage, with 70% being redundant debug logs from a misconfigured
service. This wasn't visible in their log aggregator alone—it only became clear
when they correlated log volume with service deployment patterns and actual
incident investigations. The logs looked important in isolation but were noise
when viewed in context. Unified observability makes these patterns visible.</p>
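<p>The analysis that surfaces redundant logs can be sketched in a few lines. The services, levels, and counts below are made up, but the approach is the same: group volume by service and level, then rank by share of the total:</p>

```python
from collections import Counter

# Hypothetical per-log-line records; real ones would come from your pipeline.
logs = (
    [{"service": "cart", "level": "DEBUG"}] * 7000
    + [{"service": "cart", "level": "INFO"}] * 1000
    + [{"service": "payments", "level": "ERROR"}] * 200
    + [{"service": "payments", "level": "INFO"}] * 1800
)

volume = Counter((l["service"], l["level"]) for l in logs)
total = sum(volume.values())

# Share of total volume per (service, level) pair, largest first.
shares = sorted(
    ((count / total, svc, lvl) for (svc, lvl), count in volume.items()),
    reverse=True,
)
# The cart service's DEBUG logs dominate: 7000 of 10000 lines, 70% of volume.
```
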
<p>The strategic advantage goes beyond cost cutting. When you can correlate
resource usage with business metrics in real-time, you make better scaling
decisions. You can see that the spike in infrastructure costs correlates with a
specific customer behavior pattern, not just increased load. This visibility
helps you optimize for the right things—maybe that expensive query is worth it
because it drives significant revenue, or maybe that efficient service is
actually hurting customer experience. Without unified observability, these
trade-offs remain invisible.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="proactive-problem-detection">Proactive Problem Detection<a href="https://blog.base14.io/unified-observability#proactive-problem-detection" class="hash-link" aria-label="Direct link to Proactive Problem Detection" title="Direct link to Proactive Problem Detection" translate="no">​</a></h2>
<p>The shift from reactive to proactive operations is where unified observability
really shines. It's not about having more alerts—most teams already have too
many. It's about having smarter, correlated detection that understands your
system holistically. When your observability platform understands the
relationships between services, it can detect patterns that would be invisible
to isolated monitoring tools.</p>
<p>Consider service degradation that doesn't breach any individual threshold.
Response times increase by 20%, error rates bump up by 0.5%, and throughput
drops by 10%. Individually, none of these trigger alerts, but together they
indicate a problem brewing. Unified observability platforms can detect these
composite patterns, surfacing issues before they become incidents. More
importantly, they can correlate these patterns with changes—deployments,
configuration updates, or traffic shifts—giving you not just detection but
probable cause.</p>
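<p>One way to sketch composite detection: score each signal as a fraction of its own alert threshold, weight it, and alert on the sum. The thresholds and weights below are illustrative placeholders, not tuning advice:</p>

```python
# Each signal is a fractional change from its baseline.
SIGNALS = {
    "latency_increase":    {"value": 0.20,  "alert_at": 0.50, "weight": 1.0},
    "error_rate_increase": {"value": 0.005, "alert_at": 0.01, "weight": 2.0},
    "throughput_drop":     {"value": 0.10,  "alert_at": 0.30, "weight": 1.0},
}
COMPOSITE_THRESHOLD = 1.5  # illustrative

def individually_alerting(signals):
    """Signals that would fire a conventional per-metric alert."""
    return [name for name, s in signals.items() if s["value"] >= s["alert_at"]]

def composite_score(signals):
    """Weighted sum of each signal's fraction of its own alert threshold."""
    return sum(s["weight"] * s["value"] / s["alert_at"] for s in signals.values())

score = composite_score(SIGNALS)
degraded = score >= COMPOSITE_THRESHOLD
# No individual alert fires, but the composite score (about 1.73) does.
```

<p>Production systems use far richer models than a weighted sum, but the principle is the same: the composite view catches what each metric alone misses.</p>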
<p>The real transformation happens when teams internalize this capability.
Engineers start thinking in terms of system health rather than individual
metrics. They set up learning alerts that identify new patterns rather than just
threshold breaches. Product teams begin incorporating observability into feature
design, asking "how will we know if this is working?" before they build. This
proactive mindset, enabled by unified observability, is what separates teams
that scale smoothly from those that lurch from crisis to crisis.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="better-cross-team-collaboration">Better Cross-Team Collaboration<a href="https://blog.base14.io/unified-observability#better-cross-team-collaboration" class="hash-link" aria-label="Direct link to Better Cross-Team Collaboration" title="Direct link to Better Cross-Team Collaboration" translate="no">​</a></h2>
<p>Observability silos create organizational silos. When the frontend team uses
different tools than the backend team, and infrastructure has its own stack,
you're not just fragmenting your data—you're fragmenting your culture. Unified
observability becomes a shared language that breaks down these barriers.</p>
<p>The transformation is subtle but powerful. In incident reviews, instead of each
team presenting their view from their tools, everyone looks at the same data.
The frontend engineer can see how their API calls impact backend services. The
infrastructure team can trace how capacity affects application performance.
Product managers can directly see how technical metrics relate to user
experience. This shared visibility creates shared ownership.</p>
<p>More importantly, it changes how teams design and build systems. When everyone
can see the full impact of their decisions, they make better choices. API
designers think about client-side impact. Frontend developers consider backend
load. Infrastructure teams understand application patterns. This isn't about
making everyone responsible for everything—it's about making the impacts visible
so teams can collaborate effectively. The best architectural decisions I've seen
have come from these moments of shared understanding, enabled by unified
observability.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="implementation-considerations">Implementation Considerations<a href="https://blog.base14.io/unified-observability#implementation-considerations" class="hash-link" aria-label="Direct link to Implementation Considerations" title="Direct link to Implementation Considerations" translate="no">​</a></h2>
<p>The right time to invest in unified observability is before you think you need
it. Like setting up continuous integration or automated testing, the cost of
implementation grows exponentially with system complexity. If you're past Series
A and haven't thought seriously about this, you're already behind—but it's not
too late if you approach it strategically.</p>
<p>The build versus buy decision usually comes down to a false economy. Yes, you
can stitch together open-source tools and build your own correlations. But
unless observability is your core business, you're better off buying a platform
and customizing it to your needs. The real cost isn't in the initial setup—it's
in maintaining, upgrading, and training people on a bespoke system. I've seen
too many teams build "simple" observability platforms that become full-time jobs
to maintain.</p>
<p>Cultural change is the hardest part. Engineers comfortable with their tools
resist change, especially if they've built expertise in navigating the current
maze. The key is to start with a pilot team solving real problems, not a
big-bang migration. Show, don't tell. When other teams see the pilot team
resolving incidents faster and catching problems earlier, adoption becomes
organic. Avoid the temptation to mandate adoption before proving value—you'll
create compliance without buy-in, which is worse than fragmentation.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="measuring-success">Measuring Success<a href="https://blog.base14.io/unified-observability#measuring-success" class="hash-link" aria-label="Direct link to Measuring Success" title="Direct link to Measuring Success" translate="no">​</a></h2>
<p>Success metrics for unified observability should focus on outcomes, not usage.
Tool adoption rates and dashboard views tell you nothing about value. That's
<a href="https://rnjn.in/articles/observability-theatre/" target="_blank" rel="noopener noreferrer" class="">Observability Theatre</a>.
Instead, measure what matters: mean time to resolution, proactive issue
detection rate, and engineering satisfaction scores. If these aren't improving,
you're just consolidating complexity without solving the underlying problems.</p>
<p>Set realistic timelines. You won't see dramatic MTTR improvements in the first
month—teams need time to learn new workflows and build confidence. The typical
pattern I've observed is: months one to three show mild improvement as teams
learn the tools, months three to six show significant gains as teams optimize
their workflows, and after six months, you see transformational changes as teams
shift from reactive to proactive operations.</p>
<p>The most telling sign of success is what engineers do when they're curious. Do
they open the observability platform to explore hypotheses, or do they wait for
alerts? When debugging, do they start with broad system views and drill down, or
do they still check individual tools? When planning new features, do they
consider observability from the start? These behavioral changes indicate true
adoption and value realization.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="looking-forward">Looking Forward<a href="https://blog.base14.io/unified-observability#looking-forward" class="hash-link" aria-label="Direct link to Looking Forward" title="Direct link to Looking Forward" translate="no">​</a></h2>
<p>Unified observability is a capability that evolves with your system. The goal
isn't to have one tool that does everything, but rather a coherent system where
data flows naturally, correlation happens automatically, and insights emerge
from connection rather than isolation. It's about building a culture where
observability is a first-class concern, not an afterthought.</p>
<p>The teams that get this right don't just resolve incidents faster—they build
more reliable systems from the start. They make better architectural decisions
because they can see the implications. They ship faster because they have
confidence in their ability to understand and fix problems. Most importantly,
they create an engineering culture that values understanding over guessing, data
over opinions, and proactive improvement over reactive firefighting.</p>
<p>If you're on the fence about investing in unified observability, consider this:
the cost of implementation is finite and decreasing, while the cost of
fragmentation is ongoing and increasing. Every new service you add, every new
engineer you hire, every new customer you onboard increases the complexity that
fragmented observability has to handle. At some point, the weight of this
complexity will force your hand. The only question is whether you'll act
proactively or reactively. Based on everything I've seen, being proactive is
significantly less painful.</p>
<hr>
<p><em>Thanks for reading. If you're in the process of evaluating or implementing
unified observability for your team, I'd love to hear about your experience. The
patterns I've described are common, but every team's journey is unique.</em></p>]]></content>
        <author>
            <name>Ranjan Sakalley</name>
            <uri>https://base14.io/about</uri>
        </author>
        <category label="observability" term="observability"/>
        <category label="engineering" term="engineering"/>
        <category label="best-practices" term="best-practices"/>
        <category label="collaboration" term="collaboration"/>
        <category label="mttr" term="mttr"/>
        <category label="incident-response" term="incident-response"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Observability Theatre]]></title>
        <id>https://blog.base14.io/observability-theatre</id>
        <link href="https://blog.base14.io/observability-theatre"/>
        <updated>2025-08-12T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Tool sprawl, dead dashboards, alert fatigue—signs your observability investment isn't delivering. Learn why treating observability as infrastructure changes everything.]]></summary>
        <content type="html"><![CDATA[<div class="blog-cover"><img src="https://blog.base14.io/assets/images/cover-678c53e9061175326211272d7f9d327e.png" alt="Observability Theatre"></div>
<p><strong>the·a·tre</strong> (also the·a·ter) <em>/ˈθiːətər/</em> <em>noun</em></p>
<p><strong>:</strong> the performance of actions or behaviors for appearance rather than
substance; an elaborate pretense that simulates real activity while lacking its
essential purpose or outcomes</p>
<p><em>Example: "The company's security theatre gave the illusion of protection
without addressing actual vulnerabilities."</em></p>
<hr>
<p>Your organization has invested millions in observability tools. You have
dashboards for everything. Your teams dutifully instrument their services. Yet
when incidents strike, engineers still spend hours hunting through disparate
systems, correlating timestamps manually, and guessing at root causes. When the
CEO forwards a customer complaint asking "are we down?", that's how the dev
team learns about incidents.</p>
<p>You're experiencing observability theatre—the expensive illusion of system
visibility without its substance.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-symptoms">The Symptoms<a href="https://blog.base14.io/observability-theatre#the-symptoms" class="hash-link" aria-label="Direct link to The Symptoms" title="Direct link to The Symptoms" translate="no">​</a></h2>
<p>Walk into any engineering organization practicing observability theatre and
you'll find:</p>
<p><strong>Tool sprawl.</strong> Different teams have purchased different monitoring
solutions—Datadog here, New Relic there, Prometheus over there, ELK stack in the
corner. Each tool was bought to solve an immediate problem, creating a patchwork
of incompatible systems that cannot correlate data when you need it most.</p>
<p><strong>Dead dashboards.</strong> Over 90% of dashboards are created once and never viewed
again. Engineers build them for specific incidents or projects, then abandon
them. Your Grafana instance becomes a graveyard of good intentions, each
dashboard a monument to a problem solved months ago.</p>
<p><strong>Alert noise.</strong> When 90% of your alerts are meaningless, teams adapt by
ignoring them all. Slack channels muted. Email filters sending alerts straight
to trash.</p>
<p><strong>Sampling and Rationing.</strong> To manage observability costs, teams sample data
down to 50% or less. They keep data for days instead of months. During an
incident, you discover you can't analyze the problem because half the relevant
data was discarded. That critical trace showing the root cause? It was in the
50% you threw away to save money.</p>
<p><strong>Fragile self-hosted systems.</strong> The observability stack requires constant
nursing. Engineers spend days debugging why Prometheus is dropping metrics, why
Jaeger queries timeout, or why Elasticsearch ran out of disk space again. During
major incidents—when twenty engineers simultaneously open dashboards—the system
slows to a crawl or crashes entirely. The tools meant to help you debug problems
become problems themselves.</p>
<p><strong>Instrumentation chaos.</strong> Debug logs tagged as errors flood your systems with
noise. Critical errors buried in info logs go unnoticed. One service emits
structured JSON, another prints strings, a third uses a custom format. Service A
calls it "user_id", Service B uses "userId", Service C prefers "customer.id".
When you need to trace an issue across services, you're comparing apples to
jackfruits.</p>
<p><strong>Uninstrumented code everywhere.</strong> New services ship with zero metrics.
Features go live without trace spans. Error handling consists of
<code>console.log("error occurred")</code>. When incidents happen, you're debugging
blind—no metrics to check, no traces to follow, no structured logs to query.
Entire microservices are black boxes, visible only through their side effects on
other systems.</p>
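<p>The fix is structured, queryable events rather than bare strings. A minimal sketch using Python's standard <code>logging</code> module (the field names like <code>user_id</code> and <code>order_id</code> are illustrative):</p>

```python
import io
import json
import logging

# A handler that emits one JSON object per log record, so fields are queryable.
class JsonHandler(logging.StreamHandler):
    def emit(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Extra context travels as fields, not prose.
            **getattr(record, "ctx", {}),
        }
        self.stream.write(json.dumps(payload) + "\n")

buf = io.StringIO()
logger = logging.getLogger("payments")
logger.setLevel(logging.INFO)
logger.addHandler(JsonHandler(buf))

# Instead of console.log("error occurred"):
logger.error("charge failed", extra={"ctx": {"user_id": "u-42", "order_id": "o-7"}})

event = json.loads(buf.getvalue())
# event carries level, message, user_id, and order_id as queryable fields.
```

<p>The difference shows up at query time: "all ERROR events for user u-42" is one filter instead of a grep through free-form strings.</p>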
<p><strong>Archaeological dig during incidents.</strong> Every incident becomes an hours-long
excavation. Engineers share screenshots in Slack because they can't share
dashboard links. They manually correlate timestamps across three different
tools. Someone always asks "which timezone is this log in?" The same
investigations happen repeatedly because there's no shared context or runbooks.</p>
<p><strong>Vanity metrics.</strong> Dashboards full of technical measurements that tell you
nothing about what matters. Engineers know CPU is at 80%, memory usage is
climbing, p99 latency increased 50ms. Meanwhile, checkout conversion plummeted
30%, revenue is down $100K per hour, and customers are abandoning carts in
droves. Observability tracks server health while business bleeds money.</p>
<p><strong>Reactive-only mode.</strong> Your customers are your monitoring system. They discover
bugs before your engineers do. They report outages before your alerts fire. You
only look at dashboards after Twitter lights up with complaints or support
tickets spike. No proactive monitoring, no SLOs, no error budgets—just perpetual
firefighting mode. The CEO forwards a customer complaint asking "are we down?",
and then you check your dashboards.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-organizations-fall-into-observability-theatre">Why Organizations Fall Into Observability Theatre<a href="https://blog.base14.io/observability-theatre#why-organizations-fall-into-observability-theatre" class="hash-link" aria-label="Direct link to Why Organizations Fall Into Observability Theatre" title="Direct link to Why Organizations Fall Into Observability Theatre" translate="no">​</a></h2>
<p>These symptoms don't appear in isolation. They emerge from fundamental
organizational patterns and human tendencies that push observability to the
margins. Understanding these root causes is the first step toward meaningful
change.</p>
<p><strong>Never anyone's first priority.</strong> Business wants to ship new features.
Engineers want to learn new frameworks, design patterns, or distributed
systems—not observability tools. It's perpetually someone else's problem. Even
in organizations that preach "you build it, you run it," observability remains
an afterthought.</p>
<p><strong>No instant karma.</strong> Bad observability practices don't hurt immediately. Like
technical debt, its pain compounds slowly. The engineer who skips
instrumentation ships faster and gets praised. By the time poor observability
causes a major incident, they've been promoted or moved on. Without immediate
consequences, there's no learning loop.</p>
<p><strong>Siloed responsibilities.</strong> In most companies, a small SRE team owns
observability while hundreds of engineers ship code. This 100:1 ratio guarantees
failure. The people building systems aren't responsible for making them
observable. No one adds observability to acceptance criteria. It's always
someone else's job—until 3 AM when it's suddenly everyone's problem.</p>
<p><strong>Reactive budgeting.</strong> Observability never gets proactive budget allocation.
Teams cobble together tools reactively. Three months later, sticker shock hits.
Panicked cost-cutting follows—sampling, shortened retention, tool consolidation.
The very capabilities you need during incidents get sacrificed to control costs
you never planned for.</p>
<p><strong>Data silos and fragmentation.</strong> Different teams implement different tools,
creating isolated islands of data. Frontend uses one monitoring service, backend
another, infrastructure a third. When issues span systems—which they always
do—you can't correlate. Each team optimizes locally while system-wide
observability degrades.</p>
<p><strong>No business alignment.</strong> Observability remains a technical exercise divorced
from business outcomes. Dashboards track CPU and memory, not customer experience
or revenue. Leaders see it as a cost center, not a business enabler. Without
clear connection to business value, observability always loses budget battles.</p>
<p><strong>The magic tool fallacy.</strong> Organizations buy tools expecting them to solve
structural problems automatically. Without standards, training, or cultural
change, expensive tools become shelfware. Now they have N+1 problems.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="root-cause-analysis--the-mechanisms-at-work">Root Cause Analysis: The Mechanisms at Work<a href="https://blog.base14.io/observability-theatre#root-cause-analysis--the-mechanisms-at-work" class="hash-link" aria-label="Direct link to Root Cause Analysis: The Mechanisms at Work" title="Direct link to Root Cause Analysis: The Mechanisms at Work" translate="no">​</a></h2>
<p>Understanding how these root causes transform into symptoms reveals why
observability theatre is so persistent. These aren't isolated failures—they're
interconnected mechanisms that reinforce each other.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="poor-planning-leads-to-tool-proliferation">Poor planning leads to tool proliferation<a href="https://blog.base14.io/observability-theatre#poor-planning-leads-to-tool-proliferation" class="hash-link" aria-label="Direct link to Poor planning leads to tool proliferation" title="Direct link to Poor planning leads to tool proliferation" translate="no">​</a></h3>
<p>No upfront observability strategy means each team solves immediate problems with
whatever tool seems easiest. Frontend adopts Sentry. Backend chooses Datadog.
Infrastructure runs Prometheus. Data science uses something else entirely.
Without coordination, you get:</p>
<ul>
<li class="">Multiple overlapping tools with partial coverage</li>
<li class="">Inability to correlate issues across system boundaries</li>
<li class="">Escalating costs from redundant functionality</li>
<li class="">Integration nightmares when trying to build unified views</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="cost-cutting-degrades-incident-response">Cost-cutting degrades incident response<a href="https://blog.base14.io/observability-theatre#cost-cutting-degrades-incident-response" class="hash-link" aria-label="Direct link to Cost-cutting degrades incident response" title="Direct link to Cost-cutting degrades incident response" translate="no">​</a></h3>
<p>The cycle is predictable. No budget planning leads to bill shock. Panicked
executives demand cost reduction. Teams implement aggressive sampling and short
retention. Then:</p>
<ul>
<li class="">Critical data missing during incidents (the error happened in the discarded
50%)</li>
<li class="">Can't identify patterns in historical data (it's already deleted)</li>
<li class="">Slow-burn issues remain invisible until they explode</li>
<li class="">MTTR increases, causing more business impact than the saved tooling costs</li>
</ul>
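<p>Part of why naive sampling hurts: head-based sampling decides before knowing whether a trace matters, while tail-based sampling decides after the trace completes and can always keep errors. A simplified simulation, with made-up trace counts and error rate:</p>

```python
import random

random.seed(0)

# 1000 hypothetical traces, 1% of which contain an error.
traces = [{"id": i, "error": i % 100 == 0} for i in range(1000)]

def head_sample(traces, p=0.5):
    """Decide up front, before knowing whether the trace matters."""
    return [t for t in traces if random.random() < p]

def tail_sample(traces, p=0.5):
    """Decide after the trace completes: always keep errors, sample the rest."""
    return [t for t in traces if t["error"] or random.random() < p]

errors_head = sum(t["error"] for t in head_sample(traces))  # roughly half of 10
errors_tail = sum(t["error"] for t in tail_sample(traces))  # all 10, by design
```

<p>With head sampling at 50%, the trace showing your root cause survives only half the time; tail sampling keeps every error trace while still cutting routine volume.</p>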
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="missing-standards-multiply-debugging-time">Missing standards multiply debugging time<a href="https://blog.base14.io/observability-theatre#missing-standards-multiply-debugging-time" class="hash-link" aria-label="Direct link to Missing standards multiply debugging time" title="Direct link to Missing standards multiply debugging time" translate="no">​</a></h3>
<p>Without instrumentation guidelines, every service becomes a unique puzzle:</p>
<ul>
<li class="">Inconsistent log formats require custom parsing per service</li>
<li class="">Naming conventions vary (is it "user_id", "userId", or "uid"?)</li>
<li class="">Critical context missing from some services but not others</li>
<li class="">Engineers waste hours translating between formats during incidents</li>
</ul>
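One lightweight remedy for the naming drift above is to encode the convention in a shared helper rather than a wiki page. A hypothetical sketch (the helper, the mapping, and the "user.id" convention are assumptions for illustration, loosely modeled on OpenTelemetry-style attribute names):

```python
import json

# Hypothetical shared convention: one canonical attribute name per concept,
# so no service invents "userId" or "uid" on its own.
CANONICAL = {
    "user_id": "user.id", "userId": "user.id", "uid": "user.id",
    "request_id": "request.id", "reqId": "request.id",
}

def log_event(message, **attrs):
    """Emit one JSON log line with attribute names normalized."""
    normalized = {CANONICAL.get(k, k): v for k, v in attrs.items()}
    print(json.dumps({"message": message, **normalized}, sort_keys=True))
    return normalized

# Two services with different local habits still emit the same key:
log_event("login", userId=42)
log_event("checkout", uid=42)
```

Both calls produce a `"user.id"` field, so a query during an incident only has to know one name, not three.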
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="knowledge-loss-perpetuates-bad-practices">Knowledge loss perpetuates bad practices<a href="https://blog.base14.io/observability-theatre#knowledge-loss-perpetuates-bad-practices" class="hash-link" aria-label="Direct link to Knowledge loss perpetuates bad practices" title="Direct link to Knowledge loss perpetuates bad practices" translate="no">​</a></h3>
<p>The slow feedback loop creates a vicious cycle:</p>
<ul>
<li class="">Engineers implement quick fixes without understanding long-term impact</li>
<li class="">By the time problems manifest (months later), they've moved to new teams or
companies</li>
<li class="">New engineers inherit the mess without context</li>
<li class="">They make similar decisions, not knowing the history</li>
<li class="">Documentation, if it exists, captures what was built, not why it fails</li>
<li class="">Each generation repeats the same mistakes</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="alert-fatigue-becomes-normalized-dysfunction">Alert fatigue becomes normalized dysfunction<a href="https://blog.base14.io/observability-theatre#alert-fatigue-becomes-normalized-dysfunction" class="hash-link" aria-label="Direct link to Alert fatigue becomes normalized dysfunction" title="Direct link to Alert fatigue becomes normalized dysfunction" translate="no">​</a></h3>
<p>The progression is insidious:</p>
<ul>
<li class="">Initial alerts seem reasonable</li>
<li class="">Without standards, everyone adds their own "important" alerts</li>
<li class="">Alert volume grows exponentially</li>
<li class="">Teams start ignoring non-critical alerts</li>
<li class="">Soon they're ignoring all alerts</li>
<li class="">Channels get muted, rules send alerts to /dev/null</li>
<li class="">Real incidents go unnoticed until customers complain</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-self-hosted-software-trap-deepens-over-time">The self-hosted software trap deepens over time<a href="https://blog.base14.io/observability-theatre#the-self-hosted-software-trap-deepens-over-time" class="hash-link" aria-label="Direct link to The self-hosted software trap deepens over time" title="Direct link to The self-hosted software trap deepens over time" translate="no">​</a></h3>
<p>What starts as cost-saving becomes a resource sink:</p>
<ul>
<li class="">"Free" OSS tools require dedicated engineering time</li>
<li class="">At scale, they need constant tuning, upgrades, capacity planning</li>
<li class="">Your best engineers get pulled into observability infrastructure</li>
<li class="">The system works fine in steady state but fails under incident load</li>
<li class="">Upgrades get deferred (too risky during business hours)</li>
<li class="">Technical debt accumulates until the system is barely functional</li>
<li class="">By then, migration seems impossible</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="observability-as-infrastructure">Observability as Infrastructure<a href="https://blog.base14.io/observability-theatre#observability-as-infrastructure" class="hash-link" aria-label="Direct link to Observability as Infrastructure" title="Direct link to Observability as Infrastructure" translate="no">​</a></h2>
<p>The solution isn't another tool or methodology. It's a fundamental shift in how
we think about observability. Stop treating it as an add-on. Start treating it
as infrastructure—as fundamental to your systems as your database or load
balancer.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="start-with-what-you-already-understand">Start with what you already understand<a href="https://blog.base14.io/observability-theatre#start-with-what-you-already-understand" class="hash-link" aria-label="Direct link to Start with what you already understand" title="Direct link to Start with what you already understand" translate="no">​</a></h3>
<p>You wouldn't run production without:</p>
<ul>
<li class="">Databases to store your data</li>
<li class="">Load balancers to distribute traffic</li>
<li class="">Security systems to protect assets</li>
<li class="">Backup systems to ensure recovery</li>
<li class="">Version control to track changes</li>
</ul>
<p>Yet many organizations run production without observable systems. Observability
isn't optional infrastructure; it's foundational infrastructure. You need it
before you need it.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-business-case-is-undeniable">The business case is undeniable<a href="https://blog.base14.io/observability-theatre#the-business-case-is-undeniable" class="hash-link" aria-label="Direct link to The business case is undeniable" title="Direct link to The business case is undeniable" translate="no">​</a></h3>
<p>When observability is foundational infrastructure:</p>
<ul>
<li class=""><strong>Incidents resolve 50-70% faster.</strong> Unified tools and standards mean
engineers find root causes in minutes, not hours</li>
<li class=""><strong>False alerts drop by 90%.</strong> Thoughtful instrumentation replaces noise with
signal</li>
<li class=""><strong>Engineering productivity increases.</strong> Less time firefighting, more time
building</li>
<li class=""><strong>Customer experience improves.</strong> You detect issues before customers do</li>
<li class=""><strong>Costs become predictable.</strong> Planned investment replaces reactive spending</li>
</ul>
<p>When observability is theatre:</p>
<ul>
<li class=""><strong>Every incident is a marathon.</strong> Hours spent correlating data across tools</li>
<li class=""><strong>Engineers burn out.</strong> Constant firefighting with broken tools</li>
<li class=""><strong>Customers find your bugs.</strong> They're your most expensive monitoring system</li>
<li class=""><strong>Costs spiral unpredictably.</strong> Emergency tool purchases, extended downtime,
lost customers</li>
</ul>
<table><thead><tr><th style="text-align:left">Metric</th><th style="text-align:left">Observability Theatre</th><th style="text-align:left">Observability as Infrastructure</th></tr></thead><tbody><tr><td style="text-align:left"><strong>Incident Resolution</strong></td><td style="text-align:left">Hours wasted correlating across systems</td><td style="text-align:left">50-70% faster MTTR with unified tools</td></tr><tr><td style="text-align:left"><strong>Alert Quality</strong></td><td style="text-align:left">Noise drowns out real issues</td><td style="text-align:left">90% reduction in false positives</td></tr><tr><td style="text-align:left"><strong>Engineering Focus</strong></td><td style="text-align:left">Constant firefighting and tool debugging</td><td style="text-align:left">Building features and improving systems</td></tr><tr><td style="text-align:left"><strong>Issue Detection</strong></td><td style="text-align:left">Customers report problems first</td><td style="text-align:left">Proactive detection before customer impact</td></tr><tr><td style="text-align:left"><strong>Cost Management</strong></td><td style="text-align:left">Reactive spending and hidden downtime costs</td><td style="text-align:left">Predictable, planned investment</td></tr><tr><td style="text-align:left"><strong>Team Health</strong></td><td style="text-align:left">Burnout from broken tools and processes</td><td style="text-align:left">Sustainable on-call, clear procedures</td></tr><tr><td style="text-align:left"><strong>Business Impact</strong></td><td style="text-align:left">Lost sales, damaged reputation</td><td style="text-align:left">Protected revenue, better customer trust</td></tr></tbody></table>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="how-treating-observability-as-infrastructure-transforms-decisions">How treating observability as infrastructure transforms decisions<a href="https://blog.base14.io/observability-theatre#how-treating-observability-as-infrastructure-transforms-decisions" class="hash-link" aria-label="Direct link to How treating observability as infrastructure transforms decisions" title="Direct link to How treating observability as infrastructure transforms decisions" translate="no">​</a></h3>
<p>When leadership recognizes observability as infrastructure, everything changes:</p>
<p><strong>Budgeting:</strong> You allocate observability budget upfront, just like you do for
databases or cloud infrastructure. No more scrambling when bills arrive. No more
choosing between visibility and cost. You plan for the observability your system
scale requires.</p>
<p><strong>Staffing:</strong> Observability becomes everyone's responsibility. You hire
engineers who understand instrumentation. You train existing engineers on
observability principles. You don't dump it on a small SRE team—you embed it in
your engineering culture.</p>
<p><strong>Development practices:</strong> Observability requirements appear in every design
document. Story tickets include instrumentation acceptance criteria. Code
reviews check for proper logging, metrics, and traces. You build observable
systems from day one, not bolt on monitoring as an afterthought.</p>
<p><strong>Tool selection:</strong> You choose tools strategically for the long term, not
reactively for immediate fires. You prioritize integration and correlation
capabilities over feature lists. You invest in tools that grow with your needs,
not fragment your visibility.</p>
<p><strong>Standards first:</strong> Before the first line of code, you establish
instrumentation standards. Log formats. Metric naming. Trace attribution. Alert
thresholds. These become as fundamental as your coding standards.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-widening-gap-your-competition-isnt-waiting">The widening gap: Your competition isn't waiting<a href="https://blog.base14.io/observability-theatre#the-widening-gap-your-competition-isnt-waiting" class="hash-link" aria-label="Direct link to The widening gap: Your competition isn't waiting" title="Direct link to The widening gap: Your competition isn't waiting" translate="no">​</a></h2>
<p>Here's the stark reality: while you're performing observability theatre, your
competitors are building genuinely observable systems. The gap compounds daily.</p>
<table><thead><tr><th style="text-align:left">Capability</th><th style="text-align:left">Organizations Stuck in Theatre</th><th style="text-align:left">Organizations with Observability</th></tr></thead><tbody><tr><td style="text-align:left"><strong>Deployment Velocity</strong></td><td style="text-align:left">Ship slowly, fearing invisible problems</td><td style="text-align:left">Ship features faster with confidence</td></tr><tr><td style="text-align:left"><strong>Incident Management</strong></td><td style="text-align:left">Learn about problems from customers</td><td style="text-align:left">Resolve incidents before customers notice</td></tr><tr><td style="text-align:left"><strong>Technical Decisions</strong></td><td style="text-align:left">Architecture based on guesses and folklore</td><td style="text-align:left">Data-driven decisions on architecture and investment</td></tr><tr><td style="text-align:left"><strong>Talent Retention</strong></td><td style="text-align:left">Lose engineers tired of broken tooling</td><td style="text-align:left">Attract top talent who demand proper tools</td></tr><tr><td style="text-align:left"><strong>Scaling Ability</strong></td><td style="text-align:left">Hit mysterious walls they can't diagnose</td><td style="text-align:left">Scale confidently with full visibility</td></tr><tr><td style="text-align:left"><strong>On-Call Experience</strong></td><td style="text-align:left">3 AM debugging sessions with fragmented tools</td><td style="text-align:left">Efficient resolution with unified observability</td></tr></tbody></table>
<p>Organizations with observability:</p>
<ul>
<li class="">Ship features faster because they trust their visibility</li>
<li class="">Resolve incidents before customers notice</li>
<li class="">Make data-driven decisions about architecture and investment</li>
<li class="">Attract top engineering talent who refuse to work blind</li>
<li class="">Scale confidently, knowing they can see what's happening</li>
</ul>
<p>Organizations stuck in theatre:</p>
<ul>
<li class="">Ship slowly, fearing what they can't see</li>
<li class="">Learn about problems from Twitter and support tickets</li>
<li class="">Make architectural decisions based on guesses and folklore</li>
<li class="">Lose engineers tired of 3 AM debugging sessions with broken tools</li>
<li class="">Hit scaling walls they can't diagnose</li>
</ul>
<p>This gap isn't linear—it's exponential. Every month you delay treating
observability as infrastructure, your competitors pull further ahead. They're
iterating faster, learning quicker, and serving customers better. Your
observability theatre isn't just costing money. It's costing market position.</p>
<p>The choice is stark: evolve or become irrelevant. Your systems will only grow
more complex. Customer expectations will only increase. The organizations that
can see, understand, and respond to their systems will win. Those performing
theatre in the dark will not.</p>]]></content>
        <author>
            <name>Ranjan Sakalley</name>
            <uri>https://base14.io/about</uri>
        </author>
        <category label="observability" term="observability"/>
        <category label="engineering" term="engineering"/>
        <category label="best-practices" term="best-practices"/>
        <category label="monitoring" term="monitoring"/>
        <category label="incident-response" term="incident-response"/>
        <category label="alert-fatigue" term="alert-fatigue"/>
    </entry>
</feed>