Learn how Judea Pearl’s causal models help teams explain AI behavior, debug failures, and make clearer product decisions beyond correlations.

A team notices something “obvious” in their dashboard: users who receive more notifications come back more often. So they crank up notification volume. A week later, retention dips and churn complaints rise. What happened?
The original pattern was real—but misleading. The most engaged users naturally trigger more notifications (because they use the product more), and they also naturally return more. Notifications didn’t cause retention; engagement caused both. The team acted on correlation and accidentally created a worse experience.
Causal thinking is the habit of asking: what causes what, and how do we know? Instead of stopping at “these two things move together,” you try to separate patterns that merely co-occur from changes that would actually move the outcome if you acted on them.
It’s not about being skeptical of data—it’s about being specific about the question. “Do notifications correlate with retention?” is different from “Will sending more notifications increase retention?” The second question is causal.
This post focuses on three practical areas where pattern-spotting often fails: debugging model failures, explaining AI behavior, and making clearer product decisions.
This isn’t a math-heavy tour of causal inference. You won’t need to learn do-calculus notation to get value here. The goal is a set of mental models and a workflow your team can use to ask sharper causal questions, design cleaner tests, and act on results with more confidence.
If you’ve ever shipped a change that “looked good in the data” but didn’t work in reality, causal thinking is the missing link.
Judea Pearl is a computer scientist and philosopher of science whose work reshaped how many teams think about data, AI, and decision-making. Before his causal revolution, much of “learning from data” in computing focused on statistical associations: find patterns, fit models, predict what happens next. That approach is powerful—but it often breaks down the moment you ask a product or engineering question that contains the word because.
Pearl’s core shift was to treat causality as a first-class concept, not a vague intuition layered on top of correlations. Instead of only asking, “When X is high, is Y also high?”, causal thinking asks, “If we change X, will Y change?” That difference sounds small, but it separates prediction from decision-making.
Association answers “what tends to co-occur.” Causation aims to answer “what would happen if we intervened.” This matters in computing because many real decisions are interventions: shipping a feature, changing rankings, adding a guardrail, altering a training set, or tweaking a policy.
Pearl made causality more practical by framing it as a modeling choice plus explicit assumptions. You don’t “discover” causality automatically from data alone; you propose a causal story (often based on domain knowledge) and then use data to test, estimate, and refine it.
Causal diagrams, interventions, and counterfactuals gave teams a shared language to move from pattern-spotting to answering causal questions with clarity and discipline.
Correlation means two things move together: when one goes up, the other tends to go up (or down). It’s extremely useful—especially in data-heavy teams—because it helps with prediction and detection.
If ice cream sales spike when temperature rises, a correlated signal (temperature) can improve forecasting. In product and AI work, correlations power ranking models (“show more of what similar users clicked”), anomaly spotting (“this metric usually tracks that one”), and quick diagnostics (“errors rise when latency rises”).
The trouble starts when we treat correlation as an answer to a different question: what happens if we change something on purpose? That’s causation.
A correlated relationship may be driven by a third factor that affects both variables. Changing X doesn’t necessarily change Y—because X might not be the reason Y moved in the first place.
Imagine you plot weekly marketing spend against weekly sales and see a strong positive correlation. It’s tempting to conclude “more spend causes more sales.”
But suppose both rise during holidays. The season (a confounder) drives higher demand and also triggers bigger budgets. If you increase spend in a non-holiday week, sales might not rise much—because the underlying demand isn’t there.
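A tiny simulation makes the trap concrete. The numbers below are made up purely for illustration: season drives both spend and sales, and spend has no effect on sales at all, yet the two still correlate strongly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Confounder: 1 during holiday weeks, 0 otherwise.
holiday = rng.binomial(1, 0.3, size=n)

# Holidays push budgets up and demand up; in this toy world,
# spend has NO direct effect on sales.
spend = 10 + 20 * holiday + rng.normal(0, 2, size=n)
sales = 100 + 50 * holiday + rng.normal(0, 5, size=n)

print(np.corrcoef(spend, sales)[0, 1])  # strong positive correlation overall

# Hold the confounder fixed: within non-holiday weeks, the correlation ~vanishes.
mask = holiday == 0
print(np.corrcoef(spend[mask], sales[mask])[0, 1])
```

The point is not the specific numbers; it is that “more spend, more sales” can appear in the data without any causal link from spend to sales.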
You’re in causal territory when you hear yourself asking: What happens if we change this default? Should we launch this feature? What breaks if we remove it, or improves if we reduce it?
When the verb is change, launch, remove, or reduce, correlation is a starting clue—not the decision rule.
A causal diagram—often drawn as a DAG (Directed Acyclic Graph)—is a simple way to make a team’s assumptions visible. Instead of arguing in vague terms (“it’s probably the model” or “maybe the UI”), you put the story on paper.
The goal isn’t perfect truth; it’s a shared draft of “how we think the system works” that everyone can critique.
Suppose you’re evaluating whether a new onboarding tutorial (T) increases activation (A).
A common analytics reflex is to “control for all available variables.” In DAG terms, that can mean accidentally adjusting for mediators (steps on the very causal path you want to measure) or colliders (variables caused by both the treatment and the outcome), which distorts the estimate instead of cleaning it up.
With a DAG, you adjust for variables for a reason—typically to block confounding paths—rather than because they exist.
Start with a whiteboard and three steps: list the variables that matter, draw arrows for the causal relationships you believe exist, and mark which variables you can measure and which you can intervene on.
Even a rough DAG aligns product, data, and engineering around the same causal question before you run numbers.
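If the whiteboard photo tends to get lost, the same draft can live next to the analysis code. A minimal sketch with hypothetical variable names for the onboarding example (user_intent is the assumed confounder, tutorial_completed a mediator):

```python
# Draft DAG for "does the tutorial (T) increase activation (A)?"
# Arrows point from cause to effect; this dict IS the team's causal assumptions.
dag = {
    "user_intent": ["tutorial_started", "activation"],  # confounder: adjust for it
    "tutorial_started": ["tutorial_completed"],
    "tutorial_completed": ["activation"],                # mediator: do NOT adjust for it
    "activation": [],
}

def parents(node: str) -> list[str]:
    """Return the assumed direct causes of a node."""
    return [cause for cause, effects in dag.items() if node in effects]

print(parents("activation"))  # ['user_intent', 'tutorial_completed']
```

Even this plain dictionary makes the “adjust for this, not that” conversation explicit and reviewable.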
A big shift in Judea Pearl’s causal thinking is separating observing something from changing it.
If you observe that users who enable notifications retain better, you’ve learned a pattern. But you still don’t know whether notifications cause retention, or whether engaged users are simply more likely to turn notifications on.
An intervention is different: it means you actively set a variable to a value and ask what happens next. In product terms, that’s not “users chose X,” it’s “we shipped X.”
Pearl often labels this difference as “seeing” versus “doing”: observing that a variable happened to take a value versus setting it to that value yourself.
The “do” idea is basically a mental note that you’re breaking the usual reasons a variable takes a value. When you intervene, notifications aren’t ON because engaged users opted in; they’re ON because you forced the setting (or nudged it). That’s the point: interventions help isolate cause-and-effect.
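To make “see vs. do” concrete, here is a toy simulation of the notifications story from the introduction (all numbers invented): engagement drives both opt-in and retention, and notifications themselves do nothing.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

engaged = rng.binomial(1, 0.4, size=n)  # hidden engagement

# "See": engaged users are far more likely to opt in to notifications,
# and retention is driven by engagement only.
opted_in = rng.binomial(1, 0.1 + 0.7 * engaged)
retained = rng.binomial(1, 0.2 + 0.5 * engaged)

see_gap = retained[opted_in == 1].mean() - retained[opted_in == 0].mean()

# "Do": force notifications on for a random half, breaking the link to engagement.
forced_on = rng.binomial(1, 0.5, size=n)
do_gap = retained[forced_on == 1].mean() - retained[forced_on == 0].mean()

print(f"see gap: {see_gap:.2f}")  # large: opted-in users retain much more
print(f"do gap:  {do_gap:.2f}")   # near zero: the intervention changes nothing here
```

The observed gap is real, but it answers “who opts in?”, not “what does turning notifications on do?”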
Most real product work is intervention-shaped: shipping a feature, changing a ranking, adding a guardrail, changing a default, rolling out a pricing policy.
These actions aim to change outcomes, not merely describe them. Causal thinking keeps the question honest: “If we do this, what will it change?”
You can’t interpret an intervention (or even design a good experiment) without assumptions about what affects what—your causal diagram, even if it’s informal.
For example, if seasonality influences both marketing spend and sign-ups, then “doing” a spend change without accounting for seasonality can still mislead you. Interventions are powerful, but they only answer causal questions when the underlying causal story is at least approximately right.
A counterfactual is a specific kind of “what if?” question: for this exact case, what would have happened if we had taken a different action (or if one input had been different)? It’s not “What happens on average?”—it’s “Would this outcome have changed for this person, this ticket, this transaction?”
Counterfactuals show up whenever someone asks for a path to a different outcome: Would this person have stayed if onboarding had gone differently? Would this ticket have been resolved faster under the old routing? Would this transaction have been flagged if the amount were smaller?
These questions are naturally user-level. They’re also concrete enough to guide product changes, policies, and explanations.
Imagine a loan model that rejects an application. A correlation-based explanation might say, “Low savings correlates with rejection.” A counterfactual asks:
If the applicant’s savings were $3,000 higher (everything else the same), would the model approve them?
If the answer is “yes,” you’ve learned something actionable: a plausible change that flips the decision. If the answer is “no,” you’ve avoided giving misleading advice like “increase savings” when the real blocker is debt-to-income or unstable employment history.
Counterfactuals depend on a causal model—a story about how variables influence each other—not just a dataset. You must decide what can realistically change, what would change as a consequence, and what must stay fixed. Without that causal structure, counterfactuals can become impossible scenarios (“increase savings without changing income or spending”) and produce unhelpful or unfair recommendations.
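A minimal sketch of that check, using a made-up scoring rule as a stand-in for a real loan model (weights and thresholds are invented): hold everything else fixed, apply one realistic change, and see whether the decision flips.

```python
# Hypothetical stand-in for a trained model's decision function.
def approve(applicant: dict) -> bool:
    score = (
        0.002 * applicant["savings"]
        - 40.0 * applicant["debt_to_income"]
        + 5.0 * applicant["years_employed"]
    )
    return score >= 20

applicant = {"savings": 4_000, "debt_to_income": 0.45, "years_employed": 3}
print(approve(applicant))  # False: rejected in this toy setup

# Counterfactual: same applicant, savings $3,000 higher, everything else unchanged.
counterfactual = {**applicant, "savings": applicant["savings"] + 3_000}
print(approve(counterfactual))  # still False: here, debt-to-income is the real blocker
```

A production version would also encode which inputs can realistically change and what else would move with them; that is exactly the causal structure the paragraph above warns you not to skip.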
When an ML model fails in production, the root cause is rarely “the algorithm got worse.” More often, something in the system changed: what data you collect, how labels are produced, or what users do. Causal thinking helps you stop guessing and start isolating which change caused the degradation.
A few repeat offenders show up across teams: changes in what data you collect, shifts in how labels are produced, changes in what users do, and features that quietly become proxies for the labeling policy rather than the underlying phenomenon.
These can look “fine” in aggregate dashboards because correlation can stay high even when the reason the model is right has changed.
A simple causal diagram (DAG) turns debugging into a map. It forces you to ask: is this feature a cause of the label, a consequence of it, or a consequence of how we measure it?
For example, if your diagram shows Labeling policy → Feature engineering → Model inputs, you may have built a pipeline where the model predicts the policy rather than the underlying phenomenon. A DAG makes that pathway visible so you can block it (remove the feature, change instrumentation, or redefine the label).
Instead of only inspecting predictions, try controlled interventions: ablate a suspect feature and re-score a held-out set, perturb inputs within realistic ranges, replay older data through the new pipeline, or roll back one component at a time behind a flag. A sketch of the first option follows below.
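Assuming a scikit-learn-style model and a 2D NumPy feature matrix (the names and shapes here are illustrative), a feature ablation is just “neutralize one input and measure how much the metric moves”:

```python
def ablation_report(model, X_valid, y_valid, feature_names, metric):
    """Re-score the model with each feature replaced by its column mean.

    A large drop for one feature means the model leans on it heavily; pair that
    with your DAG to ask whether the feature is a cause, a proxy, or leakage.
    Assumes X_valid is a 2D NumPy array and model has a .predict() method.
    """
    baseline = metric(y_valid, model.predict(X_valid))
    drops = {}
    for i, name in enumerate(feature_names):
        X_ablated = X_valid.copy()
        X_ablated[:, i] = X_valid[:, i].mean()  # "turn off" this feature
        drops[name] = baseline - metric(y_valid, model.predict(X_ablated))
    return baseline, drops

# Usage (hypothetical names):
# baseline, drops = ablation_report(churn_model, X_valid, y_valid,
#                                   ["tenure", "tickets", "plan"], accuracy_score)
```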
Many “explainability” tools answer a narrow question: Why did the model output this score? They often do this by highlighting influential inputs (feature importance, saliency maps, SHAP values). That can be useful—but it’s not the same as explaining the system the model sits inside.
A prediction explanation is local and descriptive: “This loan was declined mainly because income was low and utilization was high.”
A system explanation is causal and operational: “If we increased verified income (or reduced utilization) in a way that reflects a real intervention, would the decision change—and would downstream outcomes improve?”
The first helps you interpret model behavior. The second helps you decide what to do.
Causal thinking ties explanations to interventions. Instead of asking which variables correlate with the score, you ask which variables are valid levers and what effects they produce when changed.
A causal model forces you to be explicit about which variables are genuine levers, which are proxies or symptoms, and what you expect to change downstream if you intervene on each one.
This matters because an “important feature” might be a proxy—useful for prediction, dangerous for action.
Post‑hoc explanations can look persuasive while staying purely correlational. If “number of support tickets” strongly predicts churn, a feature-importance plot may tempt a team to “reduce tickets” by making support harder to reach. That intervention could increase churn, because tickets were a symptom of underlying product issues—not a cause.
Correlation-based explanations are also brittle during distribution shifts: once user behavior changes, the same highlighted features may no longer mean the same thing.
Causal explanations are especially valuable when decisions have consequences and accountability: lending and credit decisions, pricing, churn interventions, or any setting where someone can reasonably ask, “What would have to change for a different outcome?”
When you need to act, not just interpret, explanation needs a causal backbone.
A/B testing is causal inference in its simplest, most practical form. When you randomly assign users to variant A or B, you’re performing an intervention: you’re not just observing what people chose, you’re setting what they see. In Pearl’s terms, randomization makes “do(variant = B)” real—so differences in outcomes can credibly be attributed to the change, not to who happened to pick it.
Random assignment breaks many hidden links between user traits and exposure. Power users, new users, time of day, device type—these factors still exist, but they’re (on average) balanced across groups. That balance is what turns a metric gap into a causal claim.
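For reference, the arithmetic of a simple two-arm test is short; a minimal sketch assuming one binary outcome per user and a pooled two-proportion z-test (the counts below are invented):

```python
import numpy as np
from scipy import stats

def two_proportion_test(conv_a, n_a, conv_b, n_b):
    """Difference in conversion rates between variants, with a z-test p-value."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))
    return p_b - p_a, p_value

# Hypothetical results: 1,200 of 10,000 converted on A, 1,320 of 10,000 on B.
lift, p = two_proportion_test(1_200, 10_000, 1_320, 10_000)
print(f"lift: {lift:.3f}, p-value: {p:.3f}")
```

The statistics are the easy part; the causal claim comes from the random assignment itself.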
Even great teams can’t always run clean randomized tests: the change is global (like a price increase), the sample is too small, randomization would be unethical or too slow, or users in one group affect users in the other.
In these cases, you can still think causally—you just need to be explicit about assumptions and uncertainty.
Common options include difference-in-differences (compare changes over time between groups), regression discontinuity (use a cutoff rule like “only users above score X”), instrumental variables (a natural nudge that changes exposure without directly changing the outcome), and matching/weighting to make groups more comparable. Each method trades randomization for assumptions; a causal diagram can help you state those assumptions clearly.
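As one example, difference-in-differences reduces to arithmetic on four group means. A sketch with made-up retention rates:

```python
# Difference-in-differences with illustrative numbers (not real data).
# "Treated" markets got the change; "control" markets did not.
treated_before, treated_after = 0.40, 0.46
control_before, control_after = 0.41, 0.43

treated_change = treated_after - treated_before  # 0.06
control_change = control_after - control_before  # 0.02 (the shared trend)

# DiD estimate: the extra change in treated markets beyond the shared trend.
did = treated_change - control_change
print(f"estimated effect: {did:.2f}")  # 0.04, valid only if trends would have stayed parallel
```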
Before shipping a test (or an observational study), write down: the primary metric, guardrails, target population, duration, and decision rule. Pre-registration won’t eliminate bias, but it reduces metric shopping and makes causal claims easier to trust—and easier to debate as a team.
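One lightweight way to make that write-up hard to skip is to keep it next to the code. A sketch only; the field names are a suggestion, not a standard:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TestPlan:
    """Pre-registered plan for an experiment or observational study."""
    hypothesis: str
    primary_metric: str
    guardrail_metrics: list[str]
    target_population: str
    duration_days: int
    decision_rule: str

plan = TestPlan(
    hypothesis="New onboarding tutorial increases 7-day activation",
    primary_metric="activation_7d",
    guardrail_metrics=["uninstall_rate", "support_tickets_per_user"],
    target_population="new users on the latest app version",
    duration_days=14,
    decision_rule="Ship if activation_7d lift >= 1pp and no guardrail regresses",
)
```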
Most product debates sound like: “Metric X moved after we shipped Y—so Y worked.” Causal thinking tightens that into a clearer question: “Did change Y cause metric X to move, and by how much?” That shift turns dashboards from proof into starting points.
Pricing change: instead of “Did revenue go up after the price increase?”, ask: “Did the price increase itself cause the revenue change once seasonality and customer mix are accounted for, and what did it do to cancellations?”
Onboarding tweak: instead of “New users complete onboarding more often now,” ask: “Did the new flow cause the lift, or are we mostly seeing a different mix of users (newer app versions, more engaged cohorts) complete it?”
Recommendation ranking change: instead of “CTR improved,” ask: “Did the new ranking cause users to find more of what they wanted, or did it just shift clicks toward clickier items without improving retention?”
Dashboards often mix “who got the change” with “who would have done well anyway.” A classic example: you ship a new onboarding flow, but it’s first shown to users on the newest app version. If newer versions are adopted by more engaged users, your chart may show a lift that’s partly (or mostly) version adoption, not onboarding.
Other frequent confounders in product analytics: seasonality, marketing campaigns, app-version adoption, platform and device mix, time of day, and self-selection into features.
A useful PRD section is literally titled “Causal Questions,” and includes: the change you’re making, the primary metric and guardrails, the target population, the confounders you suspect, and the decision rule you’ll follow when results come in.
If you’re using a rapid build loop (especially with LLM-assisted development), this section becomes even more important: it prevents “we can ship it fast” from turning into “we shipped it without knowing what it caused.” Teams building in Koder.ai often bake these causal questions into planning mode up front, then implement feature-flagged variants quickly, with snapshots/rollback to keep experimentation safe when results (or side effects) surprise you.
PMs define the decision and success criteria. Data partners translate it into measurable causal estimates and sanity checks. Engineering ensures the change is controllable (feature flags, clean exposure logging). Support shares qualitative signals—pricing changes often “work” while silently increasing cancellations or ticket volume. When everyone agrees on the causal question, shipping becomes learning—not just shipping.
Causal thinking doesn’t need a PhD-level rollout. Treat it like a team habit: write down your causal story, pressure-test it, then let data (and experiments when possible) confirm or correct it.
To make progress, collect four inputs up front: the decision you’re actually trying to make, a draft causal diagram (even a rough one), the primary and guardrail metrics, and the mechanism for controlling exposure (feature flags, clean logging).
In practice, speed matters here: the faster you can turn a causal question into a controlled change, the less time you spend arguing about ambiguous patterns. That’s one reason teams adopt platforms like Koder.ai to go from “hypothesis + plan” to a working, instrumented implementation (web, backend, or mobile) in days instead of weeks—while still keeping rigor through staged rollouts, deployments, and rollback.
If you want a refresher on experiments, see /blog/ab-testing-basics. For common traps in product metrics that mimic “effects,” see /blog/metrics-that-mislead.
Causal thinking is a shift from “what tends to move together?” to “what would change if we acted?” That shift—popularized in computing and statistics by Judea Pearl—helps teams avoid confident-sounding stories that don’t survive real-world interventions.
Correlation is a clue, not an answer.
Causal diagrams (DAGs) make assumptions visible and discussable.
Interventions (“do”) are different from observations (“see”).
Counterfactuals help explain single cases: “what if this one thing were different?”
Good causal work documents uncertainty and alternative explanations.
Causality requires care: hidden confounders, measurement errors, and selection effects can flip conclusions. The antidote is transparency—write down assumptions, show what data you used, and note what would falsify your claim.
If you want to go deeper, browse related articles on /blog and compare causal approaches with other analytics and “explainability” methods to see where each one helps—and where it can mislead.
Correlation helps you predict or detect (e.g., “when X rises, Y often rises too”). Causation answers a decision question: “If we change X on purpose, will Y change?”
Use correlation for forecasting and monitoring; use causal thinking when you’re about to ship a change, set a policy, or allocate budget.
Because the correlation may be driven by confounding. In the notifications example, highly engaged users both trigger/receive more notifications and return more.
If you increase notifications for everyone, you’ve changed the experience (an intervention) without changing the underlying engagement—so retention may not improve and can even worsen.
A DAG (Directed Acyclic Graph) is a simple diagram where nodes are variables and arrows point from causes to effects, with no cycles allowed.
It’s useful because it makes assumptions explicit, helping teams agree on what to adjust for, what not to adjust for, and what experiment would actually answer the question.
A common mistake is “control for everything,” which can accidentally adjust for mediators or colliders and bias the result.
“See” is observing what naturally happened (users opted in, a score was high). “Do” is actively setting a variable (shipping a feature, forcing a default).
The key idea: an intervention breaks the usual reasons a variable takes a value, which is why it can reveal cause-and-effect more reliably than observation alone.
A counterfactual asks: for this specific case, what would have happened if we had done something else.
It’s useful for explaining individual decisions, offering realistic recourse (“what would need to change to flip this outcome?”), and debugging single cases.
It requires a causal model so you don’t propose impossible changes.
Focus on what changed upstream and what the model might be exploiting: data collection and pipelines, labeling policies, user behavior, and features that act as proxies for the label or the policy that produced it.
A causal mindset pushes you to test targeted interventions (ablations, perturbations) instead of chasing coincident metric movements.
Not necessarily. Feature importance explains what influenced the prediction, not what you should change.
A highly “important” feature can be a proxy or symptom (e.g., support tickets predict churn). Intervening on the proxy (“reduce tickets by hiding support”) can backfire. Causal explanations tie importance to valid levers and expected outcomes under intervention.
Randomized A/B tests are best when feasible, but you may need alternatives when randomization is impractical (a global pricing change), unethical, too slow to answer the question, or contaminated by spillover between groups.
In those cases, consider quasi-experiments like difference-in-differences, regression discontinuity, instrumental variables, or matching/weighting—while being explicit about assumptions.
Add a short section that forces clarity before analysis: the causal question, the intervention, the primary and guardrail metrics, the confounders you suspect, and the decision rule.
This keeps the team aligned on a causal question rather than post-hoc dashboard storytelling.