An alarm goes off in the production environment; the dashboard turns red, and after 40 minutes of analyzing metrics and logs, the on-call engineer says, "It's probably the database connection pool." If this scenario sounds familiar, the real problem is not your tooling, but the fact that you do not have a causal graph of your system.
Causal graphs are mathematical representations that model cause-and-effect relationships between components in a system (services, hosts, metrics, log sources) using a Directed Acyclic Graph (DAG) structure. While each node represents a metric or service, each edge expresses a causal dependency such as "If A changes, B also changes."
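As a minimal sketch of this structure, the snippet below stores a causal graph as a plain adjacency map and checks that it really is a DAG by computing a cause-before-effect ordering. The service names and the `causal_order` helper are hypothetical, chosen only for illustration.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical causal edges: each cause maps to its direct effects.
CAUSAL_EDGES = {
    "db-connection-pool": ["payment-service"],
    "payment-service": ["order-service"],
    "order-service": ["checkout-frontend"],
    "checkout-frontend": [],
}

def causal_order(edges):
    """Return nodes ordered so that every cause precedes its effects.

    Raises ValueError if the edges contain a cycle, i.e. are not a DAG.
    """
    # TopologicalSorter expects node -> predecessors, so invert the edge map.
    predecessors = {node: set() for node in edges}
    for cause, effects in edges.items():
        for effect in effects:
            predecessors.setdefault(effect, set()).add(cause)
    try:
        return list(TopologicalSorter(predecessors).static_order())
    except CycleError as exc:
        raise ValueError(f"not a DAG: {exc.args[1]}") from exc

order = causal_order(CAUSAL_EDGES)
print(order)
```

Rejecting cycles up front matters because the later steps (ancestor lookups, propagation chains) assume that "upstream" is well defined.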
While traditional monitoring systems rely largely on correlation, causal graphs aim to identify the true trigger (who triggers whom) by applying conditional independence tests and causal discovery algorithms to time-series data.
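To make the difference between correlation and a conditional independence test concrete, here is a small sketch using partial correlation on synthetic data. The metric names and the data-generating process are assumptions made for the example; real pipelines would use a dedicated statistics library and a proper significance test.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def partial_corr(xs, ys, zs):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Synthetic data: db_latency drives both api_latency and queue_depth.
rng = random.Random(0)
db_latency = [rng.gauss(0, 1) for _ in range(2000)]
api_latency = [z + rng.gauss(0, 0.5) for z in db_latency]
queue_depth = [z + rng.gauss(0, 0.5) for z in db_latency]

marginal = pearson(api_latency, queue_depth)
conditional = partial_corr(api_latency, queue_depth, db_latency)
print(f"corr={marginal:.2f}, partial corr given db_latency={conditional:.2f}")
```

The two downstream metrics look strongly related on a dashboard, yet the relationship largely vanishes once the shared cause is conditioned on. That is the signal a causal discovery algorithm uses to avoid drawing an edge between them.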
A causal graph typically consists of three layers:
To understand why modern systems are turning to causal graphs, we need to look at where the classic observability stack gets stuck.
The system performs anomaly scoring independently for each node (metric, service) using statistical methods (z-score, EWMA), machine learning (isolation forest), or time-series models. An "anomaly timestamp" and an "anomaly severity" score are generated for each node.
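A minimal sketch of per-node scoring with a z-score is shown below. The `score_node` function, the baseline window size, and the threshold are illustrative assumptions, not the behavior of any particular product.

```python
from statistics import mean, stdev

def score_node(samples, baseline_size=5, threshold=3.0):
    """Score one node's series with a z-score against its own baseline.

    samples: list of (timestamp, value) pairs in time order.
    Returns (first_anomaly_timestamp, peak_severity), or (None, 0.0)
    if no sample reaches the threshold.
    """
    baseline = [value for _, value in samples[:baseline_size]]
    mu, sigma = mean(baseline), stdev(baseline)
    first_ts, severity = None, 0.0
    for ts, value in samples[baseline_size:]:
        z = abs(value - mu) / sigma if sigma > 0 else 0.0
        if z >= threshold:
            if first_ts is None:
                first_ts = ts  # anomaly timestamp: first breach
            severity = max(severity, z)  # anomaly severity: worst breach
    return first_ts, severity

# Hypothetical p95 latency in ms, one sample every 10 seconds.
latency = [(0, 100), (10, 102), (20, 98), (30, 101), (40, 99),
           (50, 100), (60, 180), (70, 240)]
first_ts, severity = score_node(latency)
print(first_ts, round(severity, 1))
```

Each node is scored in isolation here; the graph only enters in the next step, when the timestamps are compared along causal edges.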
The system extracts a "propagation chain" by looking at the locations of anomalous nodes on the causal graph and their anomaly timestamps. If the graph contains an edge from A to B and the anomaly at node A starts earlier than the one at B, this is treated as evidence that the fault propagated from A to B; the time lag between the two is also included in the model during this process.
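The timestamp comparison can be sketched as follows. The graph, the timestamps, and the `propagation_chain` helper are hypothetical; the sketch only keeps hops where both ends are anomalous and the cause does not lag behind its effect.

```python
def propagation_chain(edges, anomaly_ts):
    """List cause -> effect hops that are consistent with anomaly timing.

    edges: dict mapping a cause node to its direct effects.
    anomaly_ts: dict mapping each anomalous node to its anomaly timestamp.
    Returns (cause, effect, lag_seconds) tuples, earliest cause first.
    """
    hops = []
    for cause, effects in edges.items():
        for effect in effects:
            if cause in anomaly_ts and effect in anomaly_ts:
                lag = anomaly_ts[effect] - anomaly_ts[cause]
                if lag >= 0:  # the cause must not start after its effect
                    hops.append((cause, effect, lag))
    return sorted(hops, key=lambda hop: anomaly_ts[hop[0]])

# Hypothetical graph and anomaly timestamps (seconds since incident start).
edges = {
    "db-pool": ["payment-service"],
    "cache": ["order-service"],
    "payment-service": ["order-service"],
    "order-service": [],
}
anomaly_ts = {"db-pool": 100, "payment-service": 115, "order-service": 130}

chain = propagation_chain(edges, anomaly_ts)
for cause, effect, lag in chain:
    print(f"{cause} -> {effect} (+{lag}s)")
```

The healthy `cache` node drops out even though it has an edge into an anomalous service, which is how the graph prunes plausible but innocent upstream components.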
A simulation/intervention process is performed for the found root cause candidate to verify whether other anomalies would occur if that node behaved normally. This method provides a truly causal claim: "This is the cause of these anomalies, because if we removed it, the others would also disappear."
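The intervention idea can be illustrated with a toy linear structural model. Everything here (the model, its weights, the normal limits) is an assumption made for the sketch; a production system would fit such a model from data.

```python
# Hypothetical model: latency = base + shift + sum(weight * parent latency).
MODEL = {
    "db-pool": {"base": 10.0, "parents": {}},
    "payment": {"base": 20.0, "parents": {"db-pool": 1.0}},
    "order": {"base": 30.0, "parents": {"payment": 1.0}},
}
ORDER = ["db-pool", "payment", "order"]  # causes listed before effects
NORMAL_LIMIT = {"db-pool": 20.0, "payment": 50.0, "order": 90.0}

def simulate(exogenous_shift, interventions=None):
    """Propagate values through the model, optionally forcing some nodes."""
    interventions = interventions or {}
    values = {}
    for node in ORDER:
        if node in interventions:
            values[node] = interventions[node]  # do(node = value)
            continue
        spec = MODEL[node]
        values[node] = (spec["base"] + exogenous_shift.get(node, 0.0)
                        + sum(w * values[p] for p, w in spec["parents"].items()))
    return values

def anomalous(values):
    """Nodes whose simulated value exceeds the normal limit."""
    return {node for node, value in values.items() if value > NORMAL_LIMIT[node]}

incident = {"db-pool": 200.0}  # the fault: extra latency injected at db-pool
observed = anomalous(simulate(incident))
fix_db = anomalous(simulate(incident, {"db-pool": 10.0}))
fix_payment = anomalous(simulate(incident, {"payment": 30.0}))
print(sorted(observed), sorted(fix_db), sorted(fix_payment))
```

Forcing `db-pool` back to normal clears every anomaly, while forcing `payment` back to normal leaves `db-pool` anomalous. Only the first candidate passes the counterfactual check.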
The common causal ancestor of anomalous nodes is mathematically isolated, reducing the search space from hundreds of nodes to a single-digit number of candidates.
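One simple way to sketch this isolation step is to intersect the ancestor sets of all anomalous nodes. The graph below and the `common_ancestors` helper are hypothetical.

```python
def ancestors_or_self(parents, node):
    """The node itself plus every node with a directed path into it."""
    seen, stack = {node}, list(parents.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen

def common_ancestors(parents, anomalous_nodes):
    """Nodes that sit upstream of (or are) every anomalous node."""
    sets = [ancestors_or_self(parents, node) for node in anomalous_nodes]
    return sorted(set.intersection(*sets)) if sets else []

# Hypothetical graph: each node maps to its direct causes.
parents = {
    "db-pool": [],
    "cache": [],
    "payment": ["db-pool"],
    "order": ["payment", "cache"],
    "frontend": ["order"],
}

candidates = common_ancestors(parents, ["payment", "order", "frontend"])
print(candidates)
```

With three anomalous services, the five-node graph is narrowed to two candidates; `cache` is excluded because it cannot explain the `payment` anomaly.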
RCA accelerates post-mortem documentation and knowledge sharing by generating a step-by-step traceable chain (e.g., payment-service -> latency increase -> order-service timeout).
Fifty alarms triggered by a single root cause are classified as "1 root cause + 49 consequences", so engineers treat only the one cause, which directly improves MTTR.
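The grouping can be sketched with one rule: an alarm is a consequence if any of its causal ancestors is also firing. The graph and the `group_alarms` helper below are hypothetical.

```python
def has_firing_ancestor(parents, node, firing):
    """True if any node upstream of `node` is itself firing."""
    stack, seen = list(parents.get(node, ())), set()
    while stack:
        current = stack.pop()
        if current in firing:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return False

def group_alarms(parents, firing):
    """Split firing alarms into root causes and consequences."""
    roots = sorted(n for n in firing
                   if not has_firing_ancestor(parents, n, firing))
    consequences = sorted(n for n in firing if n not in roots)
    return roots, consequences

# Hypothetical incident: 49 services depend on one connection pool.
parents = {f"svc-{i:02d}": ["db-pool"] for i in range(1, 50)}
firing = set(parents) | {"db-pool"}

roots, consequences = group_alarms(parents, firing)
print(f"{len(roots)} root cause + {len(consequences)} consequences")
```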
Graphs that carry temporal information directly answer the question "Did X happen first, or Y?", which helps avoid the common mistake of reversing cause and effect.
| Component | Description |
|---|---|
| Required Data Sources | Metrics, distributed traces, and logs form the first layer; topology data like service mesh configurations and event data like deployment logs provide the other layers. |
| Discovery Algorithms | Causal discovery algorithms like the PC algorithm, FCI, GES, NOTEARS, and PCMCI are used; PCMCI is preferred for high-dimensional and lagged system metrics. |
| Creation Process | Raw telemetry data is collected and time-synchronized, topology information is injected into the graph as prior knowledge, and the algorithm is run to continuously retrain the graph. |
| Data Quality Needs | Insufficient sampling frequency, incomplete trace propagation, and inconsistent timestamps are the most common data quality problems. |
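The creation process in the table can be sketched in a heavily simplified form: topology edges act as prior knowledge that limits which links are tested, and a lagged correlation decides which of them the data supports. This is a plain lagged-correlation stand-in, not PCMCI or any other algorithm from the table, and all names, thresholds, and data are assumptions.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs)
                           * sum((y - my) ** 2 for y in ys))

def best_lag(cause, effect, max_lag):
    """Lag (in samples) at which past `cause` best correlates with `effect`."""
    scores = {lag: abs(pearson(cause[:-lag], effect[lag:]))
              for lag in range(1, max_lag + 1)}
    lag = max(scores, key=scores.get)
    return lag, scores[lag]

def discover_edges(series, topology_edges, max_lag=5, min_corr=0.5):
    """Keep only the topology edges that show a strong lagged correlation."""
    edges = []
    for cause, effect in topology_edges:
        lag, strength = best_lag(series[cause], series[effect], max_lag)
        if strength >= min_corr:
            edges.append((cause, effect, lag))
    return edges

# Synthetic, time-synchronized metrics: payment follows db with a lag of 2.
rng = random.Random(1)
n = 500
db = [rng.gauss(0, 1) for _ in range(n)]
cache = [rng.gauss(0, 1) for _ in range(n)]
payment = [(db[t - 2] if t >= 2 else 0.0) + rng.gauss(0, 0.3) for t in range(n)]

series = {"db": db, "cache": cache, "payment": payment}
topology_edges = [("db", "payment"), ("cache", "payment")]  # e.g. from a service mesh
edges = discover_edges(series, topology_edges)
print(edges)
```

Restricting the search to edges that exist in the topology keeps the number of tests small, which is one practical reason to inject topology as a prior before running a full discovery algorithm.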
Dynatrace creates a real-time dependency map (Smartscape) with the Davis AI engine and tracks faults over the topology; Datadog Watchdog correlates anomalies across services, but its analysis is mostly correlation-heavy.
Next-generation tools like NeuBird AI work with a context engineering approach, gathering dynamic information and operating in a distributed manner by connecting to multiple monitoring stacks (Prometheus, Datadog, etc.).
The Tigramite library, which implements the PCMCI family of algorithms, is widely used in academic work and serves as a cornerstone for teams that want to build custom RCA pipelines.
Causal graphs represent the shift from "what happened" to "why it happened", from correlation to causality, in the monitoring world. Teams building this technology on the right data infrastructure measurably reduce their MTTR while moving on-call engineers to a data-driven decision-making position. As the complexity of your systems increases, doing root cause analysis without causal graphs is like driving in a city without a map; you arrive at a point, but you never know which way you took or why it took so long.