Learn Brendan Gregg’s practical methods (USE, RED, flame graphs) to investigate latency and production bottlenecks with data, not guesswork.

Brendan Gregg is one of the most influential voices in systems performance, especially in the Linux world. He’s written widely used books, built practical tooling, and—most importantly—shared clear methods for investigating real production problems. Teams adopt his approach because it works under pressure: when latency spikes and everyone wants answers, you need a way to move from “maybe it’s X” to “it’s definitely Y” with minimal drama.
A performance methodology isn’t a single tool or a clever command. It’s a repeatable way to investigate: a checklist for what to look at first, how to interpret what you see, and how to decide what to do next.
That repeatability is what reduces guesswork. Instead of relying on whoever has the most intuition (or the loudest opinion), you follow a consistent process that starts with questions and measurements rather than fixes.
Many latency investigations go wrong in the first five minutes. People jump straight to fixes: “add CPU,” “restart the service,” “increase the cache,” “tune the GC,” “it must be the network.” Sometimes those actions help—often they hide the signal, waste time, or introduce new risk.
Gregg’s methods push you to delay “solutions” until you can answer simpler questions: What’s saturated? What’s erroring? What got slower—throughput, queueing, or individual operations?
This guide helps you narrow the scope, measure the right signals, and confirm the bottleneck before you optimize. The goal is a structured workflow for investigating latency and profiling issues in production so results don’t depend on luck.
Latency is a symptom: users wait longer for work to finish. The cause is usually elsewhere—CPU contention, disk or network waits, lock contention, garbage collection, queueing, or remote dependency delays. Measuring latency alone tells you that pain exists, not where it originates.
Latency, throughput, and error rate are coupled: before tuning, capture all three for the same time window. Otherwise you may “fix” latency by dropping work or failing faster.
Average latency hides spikes users remember. A service with a 50 ms average can still have frequent 2 s stalls.
Track percentiles such as p50, p95, and p99 rather than the mean.
Also watch the shape of latency: a stable p50 with a rising p99 often indicates intermittent stalls (e.g., lock contention, I/O hiccups, stop-the-world pauses) rather than a general slowdown.
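Below is a minimal sketch, in Python, of computing those percentiles from raw per-request latencies. In practice a metrics library or monitoring system does this for you; the sample values here are made up.

```python
# Nearest-rank percentiles from raw per-request latencies (milliseconds).
# A sketch for ad-hoc analysis; metrics systems normally compute these for you.

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile, with p in [0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[rank]

latencies_ms = [42, 45, 48, 51, 60, 75, 90, 120, 480, 2100]  # illustrative window
for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies_ms, p):.0f} ms")
# A calm p50 next to a large p99 is the "intermittent stall" shape described above.
```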
A latency budget is a simple accounting model: “If the request must finish in 300 ms, how can that time be spent?” Break it into buckets such as application compute, database and cache calls, downstream dependencies, and time spent queueing.
This budget frames the first measurement task: identify which bucket grew during the spike, then investigate that area instead of tuning blindly.
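As a sketch of that accounting, the snippet below compares hypothetical bucket timings from a healthy window against the spike; the bucket names and numbers are illustrative, not from a real trace.

```python
# Latency budget: where does the 300 ms go in a healthy window vs. during the spike?
BUDGET_MS = 300

baseline     = {"app_handler": 40, "database": 70, "cache": 10, "downstream_api": 60, "queueing": 15}
during_spike = {"app_handler": 45, "database": 75, "cache": 12, "downstream_api": 65, "queueing": 520}

for bucket in baseline:
    delta = during_spike[bucket] - baseline[bucket]
    flag = "  <-- this bucket grew; investigate here" if delta > 0.2 * BUDGET_MS else ""
    print(f"{bucket:15s} {baseline[bucket]:4d} -> {during_spike[bucket]:4d} ms ({delta:+d} ms){flag}")
```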
Latency work goes sideways when the “problem” is described as “the system is slow.” Gregg’s methods start earlier: force the issue into a specific, testable question.
Write down two sentences before you touch any tools: one stating the symptom (what is slow, for whom, since when) and one stating the scope you will check first.
This prevents you from optimizing the wrong layer—like host CPU—when the pain is isolated to one endpoint or one downstream dependency.
Pick a window that matches the complaint and includes a “good” comparison period if possible.
Scope your investigation explicitly: which service, which endpoints, which hosts or region, and which user cohort.
Being precise here makes later steps (USE, RED, profiling) faster because you’ll know what data should change if your hypothesis is right.
Note deploys, config changes, traffic shifts, and infra events—but don’t assume causality. Write them as “If X, then we’d expect Y,” so you can confirm or reject quickly.
A small log prevents duplicated work across teammates and makes handoffs smoother.
Time | Question | Scope | Data checked | Result | Next step
Even five lines like this can turn a stressful incident into a repeatable process.
The USE Method (Utilization, Saturation, Errors) is Gregg’s quick checklist for scanning the “big four” resources—CPU, memory, disk (storage), and network—so you can stop guessing and start narrowing the problem.
Instead of staring at dozens of dashboards, ask the same three questions for each resource: How utilized is it? Is it saturated, with work queued or waiting? Is it producing errors?
Applied consistently, this becomes a fast inventory of where “pressure” exists.
For CPU, utilization is CPU busy %, saturation shows up as run-queue pressure or threads waiting to run, and errors can include throttling (in containers) or misbehaving interrupts.
For memory, utilization is used memory, saturation often appears as paging or frequent garbage collection, and errors include allocation failures or OOM events.
For disk, utilization is device busy time, saturation is queue depth and read/write wait time, and errors are I/O errors or timeouts.
For network, utilization is throughput, saturation is drops/queues/latency, and errors are retransmits, resets, or packet loss.
When users report slowness, saturation signals are often the most revealing: queues, wait time, and contention tend to correlate more directly with latency than raw utilization.
Service-level metrics (like request latency and error rate) tell you impact. USE tells you where to look next by identifying which resource is under strain.
A practical loop is: confirm user impact with service-level signals, run the USE checklist across CPU, memory, disk, and network, then drill into whichever resource shows saturation.
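As a rough illustration of the inventory step, here is a minimal sketch that reads a couple of Linux /proc files directly; production setups pull the same signals from vmstat, iostat, or a node exporter, and the thresholds below are only illustrative.

```python
# USE-style spot check on a Linux host: CPU saturation via load vs. cores,
# memory pressure via MemAvailable. Thresholds are illustrative only.
import os

def cpu_load_and_cores() -> tuple[float, int]:
    with open("/proc/loadavg") as f:
        load1 = float(f.read().split()[0])       # 1-minute load average
    return load1, os.cpu_count() or 1

def memory_available_ratio() -> float:
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0])     # values are reported in kB
    return info["MemAvailable"] / info["MemTotal"]

load1, cores = cpu_load_and_cores()
mem_ratio = memory_available_ratio()
print(f"CPU: 1-min load {load1:.2f} on {cores} cores"
      + ("  <-- run-queue pressure" if load1 > cores else ""))
print(f"Memory: {mem_ratio:.0%} available"
      + ("  <-- possible memory pressure" if mem_ratio < 0.10 else ""))
```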
The RED Method (Rate, Errors, Duration) keeps you anchored to user experience before you dive into host graphs.
RED prevents you from chasing “interesting” system metrics that don’t affect users. It forces a tighter loop: which endpoint is slow, for which users, and since when? If Duration spikes only on a single route while overall CPU is flat, you already have a sharper starting point.
A useful habit: keep RED broken down by service and top endpoints (or key RPC methods). That makes it easy to distinguish a broad degradation from a localized regression.
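A sketch of that habit in code: the wrapper below keeps in-process RED counters per endpoint. Real services would emit these through a metrics client (Prometheus, OpenTelemetry, or similar); the handler names in the usage comment are hypothetical.

```python
# Minimal in-process RED bookkeeping: Rate, Errors, Duration per endpoint.
import time
from collections import defaultdict

red = defaultdict(lambda: {"requests": 0, "errors": 0, "durations_ms": []})

def observe(endpoint: str, handler):
    """Run a request handler and record RED signals under its endpoint."""
    start = time.perf_counter()
    try:
        return handler()
    except Exception:
        red[endpoint]["errors"] += 1
        raise
    finally:
        red[endpoint]["requests"] += 1
        red[endpoint]["durations_ms"].append((time.perf_counter() - start) * 1000)

# Usage (hypothetical handler): observe("POST /checkout", lambda: process_checkout(order))
# Slicing these counters by endpoint is what makes a localized regression stand out.
```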
RED tells you where the pain is. USE helps you test which resource is responsible.
Examples: a Duration spike on one route combined with CPU run-queue growth suggests compute contention; the same spike combined with disk queueing suggests slow I/O on that path.
Keep the dashboard layout focused: RED for the affected service and its top endpoints, with the USE view of the hosts behind it alongside.
If you want a consistent incident workflow, pair this section with the USE inventory in /blog/use-method-overview so you can move from “users are feeling it” to “this resource is the constraint” with less thrash.
A performance investigation can explode into dozens of charts and hypotheses in minutes. Gregg’s mindset is to keep it narrow: your job is not to “collect more data,” but to ask the next question that most quickly eliminates uncertainty.
Most latency problems are dominated by a single cost (or a small pair): one hot lock, one slow dependency, one overloaded disk, one GC pause pattern. Prioritization means hunting for that dominant cost first, because shaving 5% off five different places rarely moves user-visible latency.
A practical test: “What could explain most of the latency change we’re seeing?” If a hypothesis can only account for a tiny slice, it’s a lower-priority question.
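With made-up numbers, that prioritization test looks like this: estimate how much of the observed latency delta each hypothesis could plausibly explain, then chase the largest share first.

```python
# Which hypothesis could explain most of the latency change? All numbers are illustrative.
baseline_p99_ms, current_p99_ms = 120, 900
delta_ms = current_p99_ms - baseline_p99_ms          # 780 ms to account for

hypotheses = {
    "lock contention on a shared cache": 600,        # estimated extra wait per request (ms)
    "slower downstream API": 90,
    "extra JSON serialization": 25,
}

for name, explained_ms in sorted(hypotheses.items(), key=lambda kv: -kv[1]):
    print(f"{name:35s} could explain ~{explained_ms / delta_ms:.0%} of the change")
# A hypothesis that can only explain a small slice is a lower-priority question.
```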
Use top-down when you’re answering “Are users impacted?” Start from endpoints (RED-style signals): latency, throughput, errors. This helps you avoid optimizing something that isn’t on the critical path.
Use bottom-up when the host is clearly sick (USE-style symptoms): CPU saturation, runaway memory pressure, I/O wait. If a node is pegged, you’ll waste time staring at endpoint percentiles without understanding the constraint.
When an alert hits, pick a branch and stay on it until you confirm or falsify it: either top-down from the affected endpoints or bottom-up from the strained resource.
Limit yourself to a small starting set of signals, then drill down only when something moves. If you need a checklist to keep focus, link your steps to a runbook like /blog/performance-incident-workflow so every new metric has a purpose: answering a specific question.
Production profiling can feel risky because it touches the live system—but it’s often the fastest way to replace debate with evidence. Logs and dashboards can tell you that something is slow. Profiling tells you where time goes: which functions run hot, which threads wait, and what code paths dominate during the incident.
Profiling is a “time budget” tool. Instead of debating theories (“it’s the database” vs “it’s GC”), you get evidence like “45% of CPU samples were in JSON parsing” or “most requests are blocked on a mutex.” That narrows the next step to one or two concrete fixes.
On-CPU and off-CPU profiles answer different questions. High latency with low CPU often points to off-CPU or lock time rather than CPU hot spots.
Many teams start on-demand, then graduate to always-on once they trust the safety and see recurring issues.
Production-safe profiling is about controlling cost. Prefer sampling (not tracing every event), keep capture windows short (for example, 10–30 seconds), and measure overhead in a canary first. If you’re unsure, start with low-frequency sampling and increase only if the signal is too noisy.
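To make the sampling idea concrete, here is a toy in-process sampler: a low rate, a short window, and folded stack counts at the end. It is not a production profiler (real deployments use perf or a language-specific sampling profiler), but the same two knobs, rate and duration, are what control overhead.

```python
# Toy stack sampler: low rate, short window, folded-stack counts.
# Illustrates the overhead knobs only; use a real profiler in production.
import sys
import time
import threading
from collections import Counter

def sample_stacks(duration_s: float = 15.0, hz: float = 10.0) -> Counter:
    counts: Counter = Counter()
    deadline = time.monotonic() + duration_s
    me = threading.get_ident()
    while time.monotonic() < deadline:
        for thread_id, frame in sys._current_frames().items():
            if thread_id == me:
                continue                      # skip the sampler thread itself
            stack = []
            while frame is not None:
                stack.append(frame.f_code.co_name)
                frame = frame.f_back
            counts[";".join(reversed(stack))] += 1   # folded-stack key
        time.sleep(1.0 / hz)
    return counts

# Run in a background thread during the capture window, then inspect counts.most_common(10).
```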
Flame graphs visualize where sampled time went during a profiling window. Each “box” is a function (or stack frame), and each stack shows how execution reached that function. They’re excellent for spotting patterns quickly—but they don’t automatically tell you “the bug is here.”
A flame graph usually represents on-CPU samples: time the program was actually running on a CPU core. It can highlight CPU-heavy code paths, inefficient parsing, excessive serialization, or hotspots that truly burn CPU.
It does not directly show waiting on disk, network, scheduler delays, or time blocked on a mutex (that’s off-CPU time and needs different profiling). It also doesn’t prove causality for user-visible latency unless you tie it to a scoped symptom.
The widest box is tempting to blame, but ask: is it a hotspot you can change, or just “time spent in malloc, GC, or logging” because the real issue is upstream? Also watch for missing context (JIT, inlining, symbols) that can make a box look like the culprit when it’s only the messenger.
Treat a flame graph as an answer to a scoped question: which endpoint, which time window, which hosts, and what changed. Compare “before vs after” (or “healthy vs degraded”) flame graphs for the same request path to avoid profiling noise.
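One way to do that comparison is with Brendan Gregg's folded-stack text format (one `func_a;func_b;func_c count` line per unique stack), which many profilers and the stackcollapse scripts can emit. The sketch below diffs two such captures; the file names are hypothetical.

```python
# Diff two folded-stack profiles to see which stacks grew between
# a healthy window and a degraded one.
from collections import Counter

def load_folded(path: str) -> Counter:
    counts: Counter = Counter()
    with open(path) as f:
        for line in f:
            stack, _, samples = line.rpartition(" ")
            if stack:
                counts[stack] += int(samples)
    return counts

healthy = load_folded("checkout-healthy.folded")     # hypothetical capture files
degraded = load_folded("checkout-degraded.folded")

growth = Counter(degraded)
growth.subtract(healthy)
for stack, delta in growth.most_common(5):           # stacks that gained the most samples
    print(f"{delta:+6d}  {stack}")
```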
When latency spikes, many teams stare at CPU% first. That’s understandable—but it often points in the wrong direction. A service can be “only 20% CPU” and still be painfully slow if its threads spend most of their life not running.
CPU% answers “how busy is the processor?” It does not answer “where did my request time go?” Requests can stall while threads are waiting, blocked, or parked by the scheduler.
A key idea: a request’s wall-clock time includes both on-CPU work and off-CPU waiting.
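A quick way to see that split for a single operation is to compare wall-clock time against thread CPU time: Python's `time.thread_time()` only advances while the current thread is running on a CPU, so the gap to wall-clock time is off-CPU waiting. A minimal sketch:

```python
# Wall-clock time = on-CPU work + off-CPU waiting.
import time

def timed(operation):
    wall_start, cpu_start = time.perf_counter(), time.thread_time()
    result = operation()
    wall = time.perf_counter() - wall_start
    cpu = time.thread_time() - cpu_start
    print(f"wall {wall * 1000:.1f} ms, on-CPU {cpu * 1000:.1f} ms, "
          f"off-CPU (waiting) ~{(wall - cpu) * 1000:.1f} ms")
    return result

# A sleep is almost entirely off-CPU time: high wall time, near-zero CPU time.
timed(lambda: time.sleep(0.2))
```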
Off-CPU time typically hides behind dependencies and contention: disk and network I/O, lock waits, calls to remote services, and scheduler delays.
A few signals often correlate with off-CPU bottlenecks: latency rising while CPU% stays low, many threads sitting in blocked or waiting states, and growing queue or wait times at dependencies.
These symptoms tell you “we’re waiting,” but not what you’re waiting on.
Off-CPU profiling attributes time to the reason you weren’t running: blocked in syscalls, waiting on locks, sleeping, or descheduled. That’s powerful for latency work because it turns vague slowdowns into actionable categories: “blocked on mutex X,” “waiting on read() from disk,” or “stuck in connect() to an upstream.” Once you can name the wait, you can measure it, confirm it, and fix it.
Performance work often fails at the same moment: someone spots a suspicious metric, declares it “the problem,” and starts tuning. Gregg’s methods push you to slow down and prove what’s limiting the system before you change anything.
A bottleneck is the resource or component that currently caps throughput or drives latency. If you relieve it, users see improvement.
A hot spot is where time is spent (for example, a function that appears frequently in a profile). Hot spots can be real bottlenecks—or simply busy work that doesn’t affect the slow path.
Noise is everything that looks meaningful but isn’t: background jobs, one-off spikes, sampling artifacts, caching effects, or “top talkers” that don’t correlate with the user-visible problem.
Start by capturing a clean before snapshot: the user-facing symptom (latency or error rate) and the leading candidate signals (CPU saturation, queue depth, disk I/O, lock contention, etc.). Then apply a controlled change that should affect only your suspected cause.
Examples of causal tests: shift traffic away from a suspect node, disable a suspect code path behind a flag, add capacity to the suspected resource, or remove the suspected work, then watch whether the user-facing symptom follows.
Correlation is a hint, not a verdict. If “CPU goes up when latency goes up,” verify by changing CPU availability or reducing CPU work and observing whether latency follows.
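A sketch of that verification step: capture the same percentile before and after the one controlled change, and require a meaningful move rather than any move at all. The samples and the 20% threshold below are illustrative.

```python
# Did the single change move the user-visible signal enough to confirm the hypothesis?
def p99(samples_ms: list[float]) -> float:
    ordered = sorted(samples_ms)
    return ordered[max(0, int(round(0.99 * len(ordered))) - 1)]

before = [120, 130, 150, 880, 890, 895, 900, 900, 905, 910]   # degraded window
after  = [118, 125, 140, 148, 150, 152, 155, 158, 160, 170]   # after the change

improvement = 1 - p99(after) / p99(before)
print(f"p99 before {p99(before):.0f} ms, after {p99(after):.0f} ms ({improvement:.0%} better)")
if improvement < 0.20:
    print("Too small to confirm this cause; revert and test the next candidate.")
```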
Write down: what was measured, the exact change made, the before/after results, and the observed improvement. This turns a one-time win into a reusable playbook for the next incident—and prevents “intuition” from rewriting history later.
Performance incidents feel urgent, which is exactly when guesswork slips in. A lightweight, repeatable workflow helps you move from “something’s slow” to “we know what changed” without thrashing.
Detect: alert on user-visible latency and error rate, not just CPU. Page when p95/p99 latency crosses a threshold for a sustained window.
Triage: immediately answer three questions: what’s slow, when did it start, and who is affected? If you can’t name the scope (service, endpoint, region, cohort), you’re not ready to optimize.
Measure: collect evidence that narrows the bottleneck. Prefer time-bounded captures (e.g., 60–180 seconds) so you can compare “bad” vs “good.”
Fix: change one thing at a time, then re-measure the same signals to confirm improvement and rule out placebo.
Keep a shared dashboard that everyone uses during incidents. Make it boring and consistent: the same RED panels for key services, the same USE panels for the hosts behind them, and an annotation track for deploys and config changes.
The goal isn’t to graph everything; it’s to shorten time-to-first-fact.
Instrument the endpoints that matter most (checkout, login, search), not every endpoint. For each, agree on: expected p95, max error rate, and key dependency (DB, cache, third-party).
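One way to write that agreement down is as a small, checkable config; the endpoints, thresholds, and dependencies below are hypothetical.

```python
# Per-endpoint expectations agreed on before an incident, not during one.
ENDPOINT_SLOS = {
    "POST /checkout": {"p95_ms": 250, "max_error_rate": 0.010, "key_dependency": "payments DB"},
    "POST /login":    {"p95_ms": 150, "max_error_rate": 0.005, "key_dependency": "session cache"},
    "GET /search":    {"p95_ms": 300, "max_error_rate": 0.010, "key_dependency": "search cluster"},
}

def breached(endpoint: str, observed_p95_ms: float, observed_error_rate: float) -> bool:
    slo = ENDPOINT_SLOS[endpoint]
    return observed_p95_ms > slo["p95_ms"] or observed_error_rate > slo["max_error_rate"]
```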
Before the next outage, agree on a capture kit: which profiler to run, at what sampling rate, for how long, and on which hosts.
Document it in a short runbook (e.g., /runbooks/latency), including who can run captures and where artifacts are stored.
Gregg’s methodology is fundamentally about controlled change and fast verification. If your team builds services using Koder.ai (a chat-driven platform for generating and iterating on web, backend, and mobile apps), two habits of that workflow map cleanly to this mindset: shipping small, reviewable changes and rolling a change back quickly when the measurement disagrees.
Even if you’re not generating new code during an incident, those habits—small diffs, measurable outcomes, and quick reversibility—are the same habits Gregg promotes.
It’s 10:15am and your dashboard shows p99 latency for the API climbing from ~120ms to ~900ms during peak traffic. Error rate is flat, but customers report “slow” requests.
Begin service-first: Rate, Errors, Duration.
You slice Duration by endpoint and see one route dominating p99: POST /checkout. Rate is up 2×, errors are normal, but Duration spikes specifically when concurrency rises. That points to queueing or contention, not an outright failure.
Next, check whether the latency is compute time or waiting time: compare application “handler time” vs total request time (or upstream vs downstream spans if you have tracing). The handler time is low, total time is high—requests are waiting.
Inventory likely bottlenecks: Utilization, Saturation, Errors for CPU, memory, disk, and network.
CPU utilization is only ~35%, but CPU run queue and context switches climb. Disk and network look steady. That mismatch (low CPU%, high waiting) is a classic hint: threads aren’t burning CPU—they’re blocked.
You capture an off-CPU profile during the spike and find heavy time in a mutex around a shared “promotion validation” cache.
You replace the global lock with a per-key lock (or a lock-free read path), deploy, and watch p99 return to baseline while Rate stays high.
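The shape of that fix, sketched in Python and assuming a cache guarded by a single global lock; the real service and its locking primitives will differ, but the idea is the same: only requests touching the same key should wait on each other.

```python
# Before: every request serializes on one global lock around the shared cache.
# After: contention is limited to requests that touch the same key.
import threading
from collections import defaultdict

class PerKeyLockCache:
    def __init__(self):
        self._data = {}
        self._locks = defaultdict(threading.Lock)
        self._locks_guard = threading.Lock()        # protects the lock table itself

    def _lock_for(self, key):
        with self._locks_guard:                     # held only briefly
            return self._locks[key]

    def get_or_compute(self, key, compute):
        with self._lock_for(key):                   # only same-key requests wait here
            if key not in self._data:
                self._data[key] = compute()         # e.g. validate one promotion code
            return self._data[key]
```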
Post-incident checklist: record the scoped question, the captures and profiles you took, the exact change you made, and the before/after p99; then fold what you learned into the runbook and the shared dashboard.