Learn Brendan Gregg’s practical methods (USE, RED, flame graphs) to investigate latency and production bottlenecks with data, not guesswork.

Brendan Gregg is one of the most influential voices in systems performance, especially in the Linux world. He’s written widely used books, built practical tooling, and—most importantly—shared clear methods for investigating real production problems. Teams adopt his approach because it works under pressure: when latency spikes and everyone wants answers, you need a way to move from “maybe it’s X” to “it’s definitely Y” with minimal drama.
A performance methodology isn’t a single tool or a clever command. It’s a repeatable way to investigate: a checklist for what to look at first, how to interpret what you see, and how to decide what to do next.
That repeatability is what reduces guesswork. Instead of relying on whoever has the most intuition (or the loudest opinion), you follow a consistent process that starts with questions and measurements rather than fixes.
Many latency investigations go wrong in the first five minutes. People jump straight to fixes: “add CPU,” “restart the service,” “increase the cache,” “tune the GC,” “it must be the network.” Sometimes those actions help—often they hide the signal, waste time, or introduce new risk.
Gregg’s methods push you to delay “solutions” until you can answer simpler questions: What’s saturated? What’s erroring? What got slower—throughput, queueing, or individual operations?
This guide helps you narrow the scope, measure the right signals, and confirm the bottleneck before you optimize. The goal is a structured workflow for investigating latency and profiling issues in production so results don’t depend on luck.
Latency is a symptom: users wait longer for work to finish. The cause is usually elsewhere—CPU contention, disk or network waits, lock contention, garbage collection, queueing, or remote dependency delays. Measuring latency alone tells you that pain exists, not where it originates.
Latency, throughput, and error rate are coupled: before tuning, capture all three for the same time window. Otherwise you may “fix” latency by dropping work or failing faster.
Average latency hides spikes users remember. A service with a 50 ms average can still have frequent 2 s stalls.
Track percentiles such as p50, p95, and p99 rather than the mean.
Also watch the shape of latency: a stable p50 with a rising p99 often indicates intermittent stalls (e.g., lock contention, I/O hiccups, stop-the-world pauses) rather than a general slowdown.
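Below is a minimal sketch, in Python, of computing those percentiles from raw per-request latencies. In practice a metrics library or monitoring system does this for you; the sample values here are made up.

```python
# Nearest-rank percentiles from raw per-request latencies (milliseconds).
# A sketch for ad-hoc analysis; metrics systems normally compute these for you.

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile, with p in [0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[rank]

latencies_ms = [42, 45, 48, 51, 60, 75, 90, 120, 480, 2100]  # illustrative window
for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies_ms, p):.0f} ms")
# A calm p50 next to a large p99 is the "intermittent stall" shape described above.
```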
A latency budget is a simple accounting model: “If the request must finish in 300 ms, how can that time be spent?” Break it into buckets such as application compute, database and cache calls, downstream dependencies, and time spent queueing.
This budget frames the first measurement task: identify which bucket grew during the spike, then investigate that area instead of tuning blindly.
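As a sketch of that accounting, the snippet below compares hypothetical bucket timings from a healthy window against the spike; the bucket names and numbers are illustrative, not from a real trace.

```python
# Latency budget: where does the 300 ms go in a healthy window vs. during the spike?
BUDGET_MS = 300

baseline     = {"app_handler": 40, "database": 70, "cache": 10, "downstream_api": 60, "queueing": 15}
during_spike = {"app_handler": 45, "database": 75, "cache": 12, "downstream_api": 65, "queueing": 520}

for bucket in baseline:
    delta = during_spike[bucket] - baseline[bucket]
    flag = "  <-- this bucket grew; investigate here" if delta > 0.2 * BUDGET_MS else ""
    print(f"{bucket:15s} {baseline[bucket]:4d} -> {during_spike[bucket]:4d} ms ({delta:+d} ms){flag}")
```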
Latency work goes sideways when the “problem” is described as “the system is slow.” Gregg’s methods start earlier: force the issue into a specific, testable question.
Write down two sentences before you touch any tools: one stating the symptom (what is slow, for whom, since when) and one stating the scope you will check first.
This prevents you from optimizing the wrong layer—like host CPU—when the pain is isolated to one endpoint or one downstream dependency.
Pick a window that matches the complaint and includes a “good” comparison period if possible.
Scope your investigation explicitly: which service, which endpoints, which hosts or region, and which user cohort.
Being precise here makes later steps (USE, RED, profiling) faster because you’ll know what data should change if your hypothesis is right.
Note deploys, config changes, traffic shifts, and infra events—but don’t assume causality. Write them as “If X, then we’d expect Y,” so you can confirm or reject quickly.
A small log prevents duplicated work across teammates and makes handoffs smoother.
Time | Question | Scope | Data checked | Result | Next step
Even five lines like this can turn a stressful incident into a repeatable process.
The USE Method (Utilization, Saturation, Errors) is Gregg’s quick checklist for scanning the “big four” resources—CPU, memory, disk (storage), and network—so you can stop guessing and start narrowing the problem.
Instead of staring at dozens of dashboards, ask the same three questions for each resource: How utilized is it? Is it saturated, with work queued or waiting? Is it producing errors?
Applied consistently, this becomes a fast inventory of where “pressure” exists.
For CPU, utilization is CPU busy %, saturation shows up as run-queue pressure or threads waiting to run, and errors can include throttling (in containers) or misbehaving interrupts.
For memory, utilization is used memory, saturation often appears as paging or frequent garbage collection, and errors include allocation failures or OOM events.
For disk, utilization is device busy time, saturation is queue depth and read/write wait time, and errors are I/O errors or timeouts.
For network, utilization is throughput, saturation is drops/queues/latency, and errors are retransmits, resets, or packet loss.
When users report slowness, saturation signals are often the most revealing: queues, wait time, and contention tend to correlate more directly with latency than raw utilization.
Service-level metrics (like request latency and error rate) tell you impact. USE tells you where to look next by identifying which resource is under strain.
A practical loop is: confirm user impact with service-level signals, run the USE checklist across CPU, memory, disk, and network, then drill into whichever resource shows saturation.
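As a rough illustration of the inventory step, here is a minimal sketch that reads a couple of Linux /proc files directly; production setups pull the same signals from vmstat, iostat, or a node exporter, and the thresholds below are only illustrative.

```python
# USE-style spot check on a Linux host: CPU saturation via load vs. cores,
# memory pressure via MemAvailable. Thresholds are illustrative only.
import os

def cpu_load_and_cores() -> tuple[float, int]:
    with open("/proc/loadavg") as f:
        load1 = float(f.read().split()[0])       # 1-minute load average
    return load1, os.cpu_count() or 1

def memory_available_ratio() -> float:
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0])     # values are reported in kB
    return info["MemAvailable"] / info["MemTotal"]

load1, cores = cpu_load_and_cores()
mem_ratio = memory_available_ratio()
print(f"CPU: 1-min load {load1:.2f} on {cores} cores"
      + ("  <-- run-queue pressure" if load1 > cores else ""))
print(f"Memory: {mem_ratio:.0%} available"
      + ("  <-- possible memory pressure" if mem_ratio < 0.10 else ""))
```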
The RED Method (Rate, Errors, Duration) keeps you anchored to user experience before you dive into host graphs.
RED prevents you from chasing “interesting” system metrics that don’t affect users. It forces a tighter loop: which endpoint is slow, for which users, and since when? If Duration spikes only on a single route while overall CPU is flat, you already have a sharper starting point.
A useful habit: keep RED broken down by service and top endpoints (or key RPC methods). That makes it easy to distinguish a broad degradation from a localized regression.
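A sketch of that habit in code: the wrapper below keeps in-process RED counters per endpoint. Real services would emit these through a metrics client (Prometheus, OpenTelemetry, or similar); the handler names in the usage comment are hypothetical.

```python
# Minimal in-process RED bookkeeping: Rate, Errors, Duration per endpoint.
import time
from collections import defaultdict

red = defaultdict(lambda: {"requests": 0, "errors": 0, "durations_ms": []})

def observe(endpoint: str, handler):
    """Run a request handler and record RED signals under its endpoint."""
    start = time.perf_counter()
    try:
        return handler()
    except Exception:
        red[endpoint]["errors"] += 1
        raise
    finally:
        red[endpoint]["requests"] += 1
        red[endpoint]["durations_ms"].append((time.perf_counter() - start) * 1000)

# Usage (hypothetical handler): observe("POST /checkout", lambda: process_checkout(order))
# Slicing these counters by endpoint is what makes a localized regression stand out.
```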
RED tells you where the pain is. USE helps you test which resource is responsible.
Examples: a Duration spike on one route combined with CPU run-queue growth suggests compute contention; the same spike combined with disk queueing suggests slow I/O on that path.
Keep the dashboard layout focused: RED for the affected service and its top endpoints, with the USE view of the hosts behind it alongside.
If you want a consistent incident workflow, pair this section with the USE inventory in /blog/use-method-overview so you can move from “users are feeling it” to “this resource is the constraint” with less thrash.
A performance investigation can explode into dozens of charts and hypotheses in minutes. Gregg’s mindset is to keep it narrow: your job is not to “collect more data,” but to ask the next question that most quickly eliminates uncertainty.
Most latency problems are dominated by a single cost (or a small pair): one hot lock, one slow dependency, one overloaded disk, one GC pause pattern. Prioritization means hunting for that dominant cost first, because shaving 5% off five different places rarely moves user-visible latency.
A practical test: “What could explain most of the latency change we’re seeing?” If a hypothesis can only account for a tiny slice, it’s a lower-priority question.
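With made-up numbers, that prioritization test looks like this: estimate how much of the observed latency delta each hypothesis could plausibly explain, then chase the largest share first.

```python
# Which hypothesis could explain most of the latency change? All numbers are illustrative.
baseline_p99_ms, current_p99_ms = 120, 900
delta_ms = current_p99_ms - baseline_p99_ms          # 780 ms to account for

hypotheses = {
    "lock contention on a shared cache": 600,        # estimated extra wait per request (ms)
    "slower downstream API": 90,
    "extra JSON serialization": 25,
}

for name, explained_ms in sorted(hypotheses.items(), key=lambda kv: -kv[1]):
    print(f"{name:35s} could explain ~{explained_ms / delta_ms:.0%} of the change")
# A hypothesis that can only explain a small slice is a lower-priority question.
```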
Use top-down when you’re answering “Are users impacted?” Start from endpoints (RED-style signals): latency, throughput, errors. This helps you avoid optimizing something that isn’t on the critical path.
Use bottom-up when the host is clearly sick (USE-style symptoms): CPU saturation, runaway memory pressure, I/O wait. If a node is pegged, you’ll waste time staring at endpoint percentiles without understanding the constraint.
When an alert hits, pick a branch and stay on it until you confirm or falsify it: either top-down from the affected endpoints or bottom-up from the strained resource.
Limit yourself to a small starting set of signals, then drill down only when something moves. If you need a checklist to keep focus, link your steps to a runbook like /blog/performance-incident-workflow so every new metric has a purpose: answering a specific question.
Production profiling can feel risky because it touches the live system—but it’s often the fastest way to replace debate with evidence. Logs and dashboards can tell you that something is slow. Profiling tells you where time goes: which functions run hot, which threads wait, and what code paths dominate during the incident.
Profiling is a “time budget” tool. Instead of debating theories (“it’s the database” vs “it’s GC”), you get evidence like “45% of CPU samples were in JSON parsing” or “most requests are blocked on a mutex.” That narrows the next step to one or two concrete fixes.
On-CPU and off-CPU profiles answer different questions. High latency with low CPU often points to off-CPU or lock time rather than CPU hot spots.
Many teams start on-demand, then graduate to always-on once they trust the safety and see recurring issues.
Production-safe profiling is about controlling cost. Prefer sampling (not tracing every event), keep capture windows short (for example, 10–30 seconds), and measure overhead in a canary first. If you’re unsure, start with low-frequency sampling and increase only if the signal is too noisy.
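To make the sampling idea concrete, here is a toy in-process sampler: a low rate, a short window, and folded stack counts at the end. It is not a production profiler (real deployments use perf or a language-specific sampling profiler), but the same two knobs, rate and duration, are what control overhead.

```python
# Toy stack sampler: low rate, short window, folded-stack counts.
# Illustrates the overhead knobs only; use a real profiler in production.
import sys
import time
import threading
from collections import Counter

def sample_stacks(duration_s: float = 15.0, hz: float = 10.0) -> Counter:
    counts: Counter = Counter()
    deadline = time.monotonic() + duration_s
    me = threading.get_ident()
    while time.monotonic() < deadline:
        for thread_id, frame in sys._current_frames().items():
            if thread_id == me:
                continue                      # skip the sampler thread itself
            stack = []
            while frame is not None:
                stack.append(frame.f_code.co_name)
                frame = frame.f_back
            counts[";".join(reversed(stack))] += 1   # folded-stack key
        time.sleep(1.0 / hz)
    return counts

# Run in a background thread during the capture window, then inspect counts.most_common(10).
```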
Flame graphs visualize where sampled time went during a profiling window. Each “box” is a function (or stack frame), and each stack shows how execution reached that function. They’re excellent for spotting patterns quickly—but they don’t automatically tell you “the bug is here.”
A flame graph usually represents on-CPU samples: time the program was actually running on a CPU core. It can highlight CPU-heavy code paths, inefficient parsing, excessive serialization, or hotspots that truly burn CPU.
It does not directly show waiting on disk, network, scheduler delays, or time blocked on a mutex (that’s off-CPU time and needs different profiling). It also doesn’t prove causality for user-visible latency unless you tie it to a scoped symptom.
The widest box is tempting to blame, but ask: is it a hotspot you can change, or just “time spent in malloc, GC, or logging” because the real issue is upstream? Also watch for missing context (JIT, inlining, symbols) that can make a box look like the culprit when it’s only the messenger.
Treat a flame graph as an answer to a scoped question: which endpoint, which time window, which hosts, and what changed. Compare “before vs after” (or “healthy vs degraded”) flame graphs for the same request path to avoid profiling noise.
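One way to do that comparison is with Brendan Gregg's folded-stack text format (one `func_a;func_b;func_c count` line per unique stack), which many profilers and the stackcollapse scripts can emit. The sketch below diffs two such captures; the file names are hypothetical.

```python
# Diff two folded-stack profiles to see which stacks grew between
# a healthy window and a degraded one.
from collections import Counter

def load_folded(path: str) -> Counter:
    counts: Counter = Counter()
    with open(path) as f:
        for line in f:
            stack, _, samples = line.rpartition(" ")
            if stack:
                counts[stack] += int(samples)
    return counts

healthy = load_folded("checkout-healthy.folded")     # hypothetical capture files
degraded = load_folded("checkout-degraded.folded")

growth = Counter(degraded)
growth.subtract(healthy)
for stack, delta in growth.most_common(5):           # stacks that gained the most samples
    print(f"{delta:+6d}  {stack}")
```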
When latency spikes, many teams stare at CPU% first. That’s understandable—but it often points in the wrong direction. A service can be “only 20% CPU” and still be painfully slow if its threads spend most of their life not running.
CPU% answers “how busy is the processor?” It does not answer “where did my request time go?” Requests can stall while threads are waiting, blocked, or parked by the scheduler.
A key idea: a request’s wall-clock time includes both on-CPU work and off-CPU waiting.
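A quick way to see that split for a single operation is to compare wall-clock time against thread CPU time: Python's `time.thread_time()` only advances while the current thread is running on a CPU, so the gap to wall-clock time is off-CPU waiting. A minimal sketch:

```python
# Wall-clock time = on-CPU work + off-CPU waiting.
import time

def timed(operation):
    wall_start, cpu_start = time.perf_counter(), time.thread_time()
    result = operation()
    wall = time.perf_counter() - wall_start
    cpu = time.thread_time() - cpu_start
    print(f"wall {wall * 1000:.1f} ms, on-CPU {cpu * 1000:.1f} ms, "
          f"off-CPU (waiting) ~{(wall - cpu) * 1000:.1f} ms")
    return result

# A sleep is almost entirely off-CPU time: high wall time, near-zero CPU time.
timed(lambda: time.sleep(0.2))
```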
Off-CPU time typically hides behind dependencies and contention: disk and network I/O, lock waits, calls to remote services, and scheduler delays.
A few signals often correlate with off-CPU bottlenecks: latency rising while CPU% stays low, many threads sitting in blocked or waiting states, and growing queue or wait times at dependencies.
These symptoms tell you “we’re waiting,” but not what you’re waiting on.
Off-CPU profiling attributes time to the reason you weren’t running: blocked in syscalls, waiting on locks, sleeping, or descheduled. That’s powerful for latency work because it turns vague slowdowns into actionable categories: “blocked on mutex X,” “waiting on read() from disk,” or “stuck in connect() to an upstream.” Once you can name the wait, you can measure it, confirm it, and fix it.
Performance work often fails at the same moment: someone spots a suspicious metric, declares it “the problem,” and starts tuning. Gregg’s methods push you to slow down and prove what’s limiting the system before you change anything.
A bottleneck is the resource or component that currently caps throughput or drives latency. If you relieve it, users see improvement.
A hot spot is where time is spent (for example, a function that appears frequently in a profile). Hot spots can be real bottlenecks—or simply busy work that doesn’t affect the slow path.
Noise is everything that looks meaningful but isn’t: background jobs, one-off spikes, sampling artifacts, caching effects, or “top talkers” that don’t correlate with the user-visible problem.
Start by capturing a clean before snapshot: the user-facing symptom (latency or error rate) and the leading candidate signals (CPU saturation, queue depth, disk I/O, lock contention, etc.). Then apply a controlled change that should affect only your suspected cause.
Examples of causal tests: shift traffic away from a suspect node, disable a suspect code path behind a flag, add capacity to the suspected resource, or remove the suspected work, then watch whether the user-facing symptom follows.
Correlation is a hint, not a verdict. If “CPU goes up when latency goes up,” verify by changing CPU availability or reducing CPU work and observing whether latency follows.
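A sketch of that verification step: capture the same percentile before and after the one controlled change, and require a meaningful move rather than any move at all. The samples and the 20% threshold below are illustrative.

```python
# Did the single change move the user-visible signal enough to confirm the hypothesis?
def p99(samples_ms: list[float]) -> float:
    ordered = sorted(samples_ms)
    return ordered[max(0, int(round(0.99 * len(ordered))) - 1)]

before = [120, 130, 150, 880, 890, 895, 900, 900, 905, 910]   # degraded window
after  = [118, 125, 140, 148, 150, 152, 155, 158, 160, 170]   # after the change

improvement = 1 - p99(after) / p99(before)
print(f"p99 before {p99(before):.0f} ms, after {p99(after):.0f} ms ({improvement:.0%} better)")
if improvement < 0.20:
    print("Too small to confirm this cause; revert and test the next candidate.")
```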
Write down: what was measured, the exact change made, the before/after results, and the observed improvement. This turns a one-time win into a reusable playbook for the next incident—and prevents “intuition” from rewriting history later.
Performance incidents feel urgent, which is exactly when guesswork slips in. A lightweight, repeatable workflow helps you move from “something’s slow” to “we know what changed” without thrashing.
Detect: alert on user-visible latency and error rate, not just CPU. Page when p95/p99 latency crosses a threshold for a sustained window.
Triage: immediately answer three questions: what’s slow, when did it start, and who is affected? If you can’t name the scope (service, endpoint, region, cohort), you’re not ready to optimize.
Measure: collect evidence that narrows the bottleneck. Prefer time-bounded captures (e.g., 60–180 seconds) so you can compare “bad” vs “good.”
Fix: change one thing at a time, then re-measure the same signals to confirm improvement and rule out placebo.
Keep a shared dashboard that everyone uses during incidents. Make it boring and consistent: the same RED panels for key services, the same USE panels for the hosts behind them, and an annotation track for deploys and config changes.
The goal isn’t to graph everything; it’s to shorten time-to-first-fact.
Instrument the endpoints that matter most (checkout, login, search), not every endpoint. For each, agree on: expected p95, max error rate, and key dependency (DB, cache, third-party).
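One way to write that agreement down is as a small, checkable config; the endpoints, thresholds, and dependencies below are hypothetical.

```python
# Per-endpoint expectations agreed on before an incident, not during one.
ENDPOINT_SLOS = {
    "POST /checkout": {"p95_ms": 250, "max_error_rate": 0.010, "key_dependency": "payments DB"},
    "POST /login":    {"p95_ms": 150, "max_error_rate": 0.005, "key_dependency": "session cache"},
    "GET /search":    {"p95_ms": 300, "max_error_rate": 0.010, "key_dependency": "search cluster"},
}

def breached(endpoint: str, observed_p95_ms: float, observed_error_rate: float) -> bool:
    slo = ENDPOINT_SLOS[endpoint]
    return observed_p95_ms > slo["p95_ms"] or observed_error_rate > slo["max_error_rate"]
```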
Before the next outage, agree on a capture kit: which profiler to run, at what sampling rate, for how long, and on which hosts.
Document it in a short runbook (e.g., /runbooks/latency), including who can run captures and where artifacts are stored.
Gregg’s methodology is fundamentally about controlled change and fast verification. If your team builds services using Koder.ai (a chat-driven platform for generating and iterating on web, backend, and mobile apps), two habits of that workflow map cleanly to this mindset: shipping small, reviewable changes and rolling a change back quickly when the measurement disagrees.
Even if you’re not generating new code during an incident, those habits—small diffs, measurable outcomes, and quick reversibility—are the same habits Gregg promotes.
It’s 10:15am and your dashboard shows p99 latency for the API climbing from ~120ms to ~900ms during peak traffic. Error rate is flat, but customers report “slow” requests.
Begin service-first: Rate, Errors, Duration.
You slice Duration by endpoint and see one route dominating p99: POST /checkout. Rate is up 2×, errors are normal, but Duration spikes specifically when concurrency rises. That points to queueing or contention, not an outright failure.
Next, check whether the latency is compute time or waiting time: compare application “handler time” vs total request time (or upstream vs downstream spans if you have tracing). The handler time is low, total time is high—requests are waiting.
Inventory likely bottlenecks: Utilization, Saturation, Errors for CPU, memory, disk, and network.
CPU utilization is only ~35%, but CPU run queue and context switches climb. Disk and network look steady. That mismatch (low CPU%, high waiting) is a classic hint: threads aren’t burning CPU—they’re blocked.
You capture an off-CPU profile during the spike and find heavy time in a mutex around a shared “promotion validation” cache.
You replace the global lock with a per-key lock (or a lock-free read path), deploy, and watch p99 return to baseline while Rate stays high.
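The shape of that fix, sketched in Python and assuming a cache guarded by a single global lock; the real service and its locking primitives will differ, but the idea is the same: only requests touching the same key should wait on each other.

```python
# Before: every request serializes on one global lock around the shared cache.
# After: contention is limited to requests that touch the same key.
import threading
from collections import defaultdict

class PerKeyLockCache:
    def __init__(self):
        self._data = {}
        self._locks = defaultdict(threading.Lock)
        self._locks_guard = threading.Lock()        # protects the lock table itself

    def _lock_for(self, key):
        with self._locks_guard:                     # held only briefly
            return self._locks[key]

    def get_or_compute(self, key, compute):
        with self._lock_for(key):                   # only same-key requests wait here
            if key not in self._data:
                self._data[key] = compute()         # e.g. validate one promotion code
            return self._data[key]
```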
Post-incident checklist: record the scoped question, the captures and profiles you took, the exact change you made, and the before/after p99; then fold what you learned into the runbook and the shared dashboard.