Learn what Werner Vogels meant by “You Build It, You Run It” and how to apply it: ownership, on-call, SLOs, incident response, and safer shipping.

“You build it, you run it” is one of those lines that sticks because it’s blunt. It’s not about motivation posters or “being more DevOps.” It’s a clear statement about responsibility: the team that ships a service also stays accountable for how that service behaves in production.
In practice, this means the same product team that designs features and writes code also deploys the service, watches it in production, takes the page when it breaks, and keeps improving how it runs.
It doesn’t mean everyone becomes an infrastructure expert overnight. It means the feedback loop is real: if you release something that increases outages, pager noise, or customer pain, your team feels it directly—and learns quickly.
This philosophy is easy to repeat and hard to implement unless you treat it as an operating model with explicit expectations. “Run it” typically includes being on-call (in some form), owning incident response, writing runbooks, maintaining dashboards, and continuously improving the service.
It also implies constraints: you can’t ask teams to “run it” without giving them the tools, access, and authority to fix issues—plus the time in their roadmap to do the work.
Before “You Build It, You Run It,” many companies organized software work as a relay race: developers wrote code, then “threw it over the wall” to an operations team to deploy and keep it running.
That handoff solved a short-term problem—someone experienced was watching production—but it created bigger ones.
When a separate ops team owns production, developers often learn about issues late (or not at all). A bug might show up as a vague ticket days later: “service is slow” or “CPU is high.” By then, context is missing, logs have rotated, and the people who made the change have moved on.
Handoffs also blur ownership. If an outage happens, dev might assume “ops will catch it,” while ops assumes “dev shipped something risky.” The result is predictable: longer incident resolution, repeated failure modes, and a culture where teams optimize locally instead of for the customer experience.
“You Build It, You Run It” tightens the loop. The same team that ships a change is accountable for how it behaves in production. That pushes practical improvements upstream: clearer alerts, safer rollouts, better dashboards, and code that’s easier to operate.
Paradoxically, it often leads to faster delivery. When teams trust their release process and understand production behavior, they can ship smaller changes more frequently—reducing the blast radius of mistakes and making problems easier to diagnose.
Not every organization starts with equal staffing, compliance requirements, or legacy systems. The philosophy is a direction, not a switch. Many teams adopt it gradually—starting with shared on-call, better observability, and clearer service boundaries—before taking full end-to-end ownership.
Werner Vogels, Amazon’s CTO, popularized the phrase “You build it, you run it” while describing how Amazon (and later AWS) wanted teams to think about software: not as a project you hand off, but as a service you operate.
The key shift was psychological as much as technical. When a team knows it will be paged for failures, design decisions change. You care about sane defaults, clear alerting, graceful degradation, and deploy paths you can roll back. In other words, building includes planning for the messy parts of real life.
AWS-era service thinking made reliability and speed non-negotiable. Cloud customers expect APIs to be available around the clock, and they expect improvements to arrive continuously—not in quarterly “big release” waves.
That pressure encouraged smaller, more frequent releases, heavy automation around deployment and operations, and teams that own their services end to end.
This philosophy overlaps with the broader DevOps movement: close the gap between “dev” and “ops,” reduce handoffs, and make outcomes (availability, latency, support load) part of the development loop. It also fits the idea of small autonomous teams that can ship independently.
It’s tempting to treat Amazon’s approach as a template to copy. But “You Build It, You Run It” is more of a direction than a strict org chart. Your team size, regulatory constraints, product maturity, and uptime requirements may call for adaptations—shared on-call rotations, platform support, or phased adoption.
If you want a practical way to translate the mindset into action, jump to /blog/how-to-adopt-you-build-it-you-run-it-step-by-step.
“You Build It, You Run It” is really a statement about ownership. If your team ships a service, your team is responsible for how that service behaves in the real world—not just whether it passes tests on release day.
Running a service means caring about outcomes end-to-end: availability, latency, error rates, cost, and the support load the service creates for the rest of the organization.
On a normal week, “run it” is less about heroics and more about routine operations: reviewing dashboards, tuning alerts, keeping runbooks current, and handling the occasional page or support question.
This model works only when accountability means “we own the fix,” not “we hunt for a person to punish.” When something breaks, the goal is to understand what in the system allowed it—missing alerts, unclear limits, risky deployments—and improve those conditions.
Ownership gets messy when services are fuzzy. Define service boundaries (what it does, what it depends on, what it promises) and assign a named owning team. That clarity reduces handoffs, speeds up incident response, and makes priorities obvious when reliability and features compete.
On-call is central to “You Build It, You Run It” because it closes the feedback loop. When the same team that ships a change also feels the operational impact (latency spikes, failed deploys, customer complaints), priorities get clearer: reliability work stops being “someone else’s problem,” and the fastest way to ship more is often to make the system calmer.
Healthy on-call is mostly about predictability and support.
Define severity levels so the system doesn’t page for every imperfection.
A simple rule: if waking someone up won’t change the outcome, it should be a ticket, not a page.
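To make that rule concrete, here is a minimal Go sketch of an alert-routing decision. The severity levels, the Alert shape, and the Route function are illustrative assumptions, not the API of any particular paging tool.

```go
package alerting

// Severity is a hypothetical scale; use whatever levels your team defines.
type Severity int

const (
	SevInfo     Severity = iota // informational, no action needed
	SevLow                      // worth looking at this week
	SevHigh                     // needs attention today
	SevCritical                 // users are impacted right now
)

// Alert is a deliberately simplified alert shape for this example.
type Alert struct {
	Severity       Severity
	UserImpact     bool // are users affected right now?
	HumanCanFixNow bool // would waking someone change the outcome?
}

// Route applies the rule above: if waking someone up won't change
// the outcome, it becomes a ticket (or just a log line), not a page.
func Route(a Alert) string {
	switch {
	case a.Severity == SevCritical && a.UserImpact && a.HumanCanFixNow:
		return "page" // wake the on-call engineer
	case a.Severity >= SevLow:
		return "ticket" // review during working hours
	default:
		return "log" // keep for trend analysis, no interruption
	}
}
```

The exact categories matter less than writing the decision down so it can be reviewed and tuned like any other code.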
On-call isn’t a punishment; it’s a signal. Every noisy alert, repeated failure, or manual fix should feed back into engineering work: better alerts, automation, safer releases, and system changes that remove the need to page at all.
If “you run it” is real, teams need a shared way to talk about reliability without turning every discussion into opinions. That’s what SLIs, SLOs, and error budgets provide: clear targets and a fair trade-off between moving fast and keeping things stable.
A useful way to remember it: SLI = metric, SLO = target, SLA = external commitment.
Good SLIs are specific and tied to user experience, such as the share of requests that succeed, the p95 or p99 latency of key endpoints, or how fresh critical data is.
An error budget is the amount of “badness” you can afford while still meeting your SLO (for example, if your SLO is 99.9% availability, your monthly error budget is 0.1% downtime).
When the service is healthy and you’re within budget, teams can take more delivery risk (ship features, run experiments). When you’re burning budget too fast, reliability work gets priority.
SLOs turn reliability into a planning input. If your error budget is low, the next sprint might emphasize rate limiting, safer rollouts, or fixing flaky dependencies—because missing the SLO has a clear cost. If budget is plentiful, you can confidently prioritize product work without guessing whether “ops will be fine.”
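To make the arithmetic concrete, here is a small, self-contained Go sketch with made-up traffic numbers. It computes an availability SLI from request counters, the downtime allowed by a 99.9% SLO over 30 days, and how much of that budget has already been spent; nothing here is a real monitoring API.

```go
package main

import (
	"fmt"
	"time"
)

// availabilitySLI is the fraction of requests that succeeded in the window.
func availabilitySLI(successful, total float64) float64 {
	if total == 0 {
		return 1.0 // no traffic means nothing failed
	}
	return successful / total
}

// errorBudget converts an SLO target into allowed "badness" for the window.
func errorBudget(slo float64, window time.Duration) time.Duration {
	return time.Duration((1 - slo) * float64(window))
}

func main() {
	const slo = 0.999             // 99.9% availability target
	window := 30 * 24 * time.Hour // 30-day rolling window

	// Illustrative counters: 10M requests, 4,000 of them failed.
	sli := availabilitySLI(9_996_000, 10_000_000)

	budget := errorBudget(slo, window) // what the SLO allows
	spent := errorBudget(sli, window)  // what has actually been "spent"

	fmt.Printf("allowed downtime this window: %v\n", budget.Round(time.Minute)) // ~43m
	fmt.Printf("budget spent so far: %v (%.0f%% of budget)\n",
		spent.Round(time.Minute), 100*float64(spent)/float64(budget))
}
```

A burn-rate alert is the same idea over a shorter window: if the service is spending budget several times faster than the window allows, page someone before the month is lost.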
“You build it, you run it” only works if shipping to production is routine—not a high-stakes event. The goal is to reduce uncertainty before launch and to limit blast radius after launch.
Before a service is considered “ready,” teams typically need a few operational basics in place: monitoring and alerts, linked runbooks, a rollback plan, and a clear on-call owner.
Instead of releasing everything to everyone at once, progressive delivery limits impact: feature flags, canary releases, and percentage-based rollouts expose a change to a small slice of traffic before it reaches everyone.
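Under the hood, percentage-based rollouts usually come down to deterministic bucketing: hash a stable identifier so the same user consistently falls in or out of the new path while the percentage ramps up. A rough Go sketch follows; the feature name and rolloutPercent value are placeholders you would normally read from your feature-flag system.

```go
package rollout

import "hash/fnv"

// inRollout reports whether this user should see the new code path.
// Bucketing is deterministic: the same userID always lands in the same
// bucket, so users don't flip between old and new behavior mid-ramp.
func inRollout(feature, userID string, rolloutPercent uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(feature + ":" + userID)) // separate buckets per feature
	bucket := h.Sum32() % 100               // 0..99
	return bucket < rolloutPercent
}

// Example: start a new checkout flow at 5% of users, watch the
// dashboards and error budget, then ramp to 25, 50, and 100.
//
//	if inRollout("new-checkout-flow", user.ID, 5) { ... }
```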
If your team is standardizing rollback, treat it as a first-class capability: the faster you can revert safely, the more realistic “you run it” becomes.
Two kinds of tests reduce “unknown unknowns”: load tests that confirm the service holds up under expected traffic, and failure tests that confirm it degrades gracefully when a dependency breaks.
Keep it lightweight: a one-page checklist in your repo or ticket template (e.g., “Observability,” “On-call readiness,” “Data protection,” “Rollback plan,” “Capacity tested,” “Runbooks linked”). Make “not ready” a normal status—far better than learning in production.
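If the checklist lives in the repo, it can also gate the release pipeline. A minimal Go sketch using the same item names as above; the structure and the Ready function are illustrative, not a standard format.

```go
package readiness

import "fmt"

// Item is one line of the one-page launch checklist.
type Item struct {
	Name string
	Done bool
	Note string // link to the runbook, dashboard, load-test results, etc.
}

// Ready reports whether launch can proceed and what is still missing.
// "Not ready" is a normal, expected answer, not a failure.
func Ready(items []Item) (bool, []string) {
	var missing []string
	for _, it := range items {
		if !it.Done {
			missing = append(missing, fmt.Sprintf("%s (%s)", it.Name, it.Note))
		}
	}
	return len(missing) == 0, missing
}

// Example checklist for a hypothetical service.
var checkoutChecklist = []Item{
	{Name: "Observability", Done: true, Note: "dashboards and alerts linked"},
	{Name: "On-call readiness", Done: true, Note: "rotation configured"},
	{Name: "Data protection", Done: false, Note: "backup restore untested"},
	{Name: "Rollback plan", Done: true, Note: "redeploy previous version"},
	{Name: "Capacity tested", Done: false, Note: "load test scheduled"},
	{Name: "Runbooks linked", Done: true, Note: "see repo /runbooks"},
}
```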
Incidents are where “you run it” becomes real: a service has degraded, customers notice, and the team has to respond quickly and clearly. The goal isn’t heroics—it’s a repeatable workflow that reduces impact and produces improvements.
Most teams converge on the same phases: detect the problem, triage its severity and impact, mitigate or roll back, keep stakeholders informed, and review what happened afterward.
If you want a practical template for this flow, keep a lightweight checklist handy (see /blog/incident-response-checklist).
A blameless postmortem doesn’t mean “nobody made mistakes.” It means you focus on how the system and process allowed the mistake to reach production, not on shaming individuals. That’s what makes people share details early, which is essential for learning.
Document the timeline, the customer impact, the contributing causes, and what helped or slowed the response.
Good postmortems end with concrete, owned follow-ups, typically in four buckets: tooling improvements (better alerts/dashboards), tests (regressions and edge cases), automation (safer deploy/rollback, guardrails), and documentation (runbooks, clearer operational steps). Assign an owner and due date—otherwise learning stays theoretical.
Tooling is the leverage that makes “You Build It, You Run It” sustainable—but it can’t substitute for real ownership. If a team treats operations as “someone else’s problem,” the fanciest dashboard will just document the chaos. Good tools reduce friction: they make the right thing (observing, responding, learning) easier than the wrong thing (guessing, blaming, ignoring).
At minimum, service owners need a consistent way to see what their software is doing in production and act quickly when it isn’t.
If your monitoring story is fragmented, teams spend more time hunting than fixing. A unified observability approach helps; see /product/observability.
As organizations grow, “who owns this?” becomes a reliability risk. A service catalog (or internal developer portal) solves this by keeping ownership and operational context in one place: team name, on-call rotation, escalation path, runbooks, dependencies, and links to dashboards.
The key is ownership metadata that stays current. Make it part of the workflow: new services can’t go live without an owner, and ownership changes are treated like code changes (reviewed, tracked).
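A catalog entry doesn't need to be elaborate; what matters is that a few fields are required and reviewed like code. Here is a minimal Go sketch of that shape; the field names are illustrative and not tied to any particular developer portal.

```go
package catalog

import (
	"errors"
	"strings"
)

// Service is one entry in the service catalog.
type Service struct {
	Name           string
	OwnerTeam      string   // the named owning team
	OnCallRotation string   // rotation ID in your paging tool
	EscalationPath string   // who to pull in when the rotation is stuck
	Runbooks       []string // links to operational docs
	Dashboards     []string // links to "are users affected?" views
	DependsOn      []string // other catalog entries
}

// Validate enforces "no service goes live without an owner."
// Run it in CI so ownership changes are reviewed like code changes.
func Validate(s Service) error {
	var missing []string
	if s.OwnerTeam == "" {
		missing = append(missing, "owner team")
	}
	if s.OnCallRotation == "" {
		missing = append(missing, "on-call rotation")
	}
	if len(s.Runbooks) == 0 {
		missing = append(missing, "runbook link")
	}
	if len(missing) > 0 {
		return errors.New(s.Name + " is missing: " + strings.Join(missing, ", "))
	}
	return nil
}
```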
The best setups nudge teams toward healthy behavior: templates for runbooks, automated alerts tied to SLOs, and dashboards that answer “are users affected?” in seconds. But the human system still matters—teams need time to maintain these tools, prune alerts, and continuously improve how they operate the service.
Platform teams make “You Build It, You Run It” easier to live with. Their job isn’t to run production for everyone—it’s to provide a well-lit path (sometimes called “paved roads”) so product teams can own services without reinventing operations every sprint.
A good platform offers defaults that are hard to mess up and easy to adopt: service templates, standard CI/CD pipelines, built-in observability, and shared building blocks like auth and secrets management.
Guardrails should prevent risky behavior without blocking shipping. Think “secure by default” rather than “open a ticket and wait.”
Platform teams can run shared services—without taking ownership of product services.
The boundary is simple: the platform team owns the platform’s uptime and support; product teams own how their services use it.
When teams don’t have to become experts in CI/CD, auth, or secrets on day one, they can focus on the service’s behavior and user impact.
Examples that remove busywork: a managed deploy pipeline, standard authentication, centrally handled secrets, and dashboards and alerts that exist from day one.
The result is faster delivery with fewer “custom ops snowflakes,” while keeping the core promise intact: the team that builds the service still runs it.
“You build it, you run it” can improve reliability and speed—but only if the organization changes the conditions around the team. Many failures look like the slogan was adopted, but the supporting habits weren’t.
A few patterns show up again and again: ownership is declared without the access or authority to act on it, on-call is added without roadmap time to fix whatever pages, and alerting is so noisy that people stop trusting it.
Some environments need a tailored approach: heavily regulated industries, very small teams, and organizations with large legacy estates often start with shared on-call, platform support, or phased ownership rather than jumping straight to the full model.
This philosophy fails fastest when reliability work is treated as “extra.” Leadership must explicitly reserve capacity for alert tuning, automation, runbook upkeep, and post-incident follow-ups.
Without that protection, on-call becomes a tax—rather than a feedback loop that improves the system.
Rolling this out works best as a phased change, not a company-wide announcement. Start small, make ownership visible, and only then expand.
Pick a single, well-bounded service (ideally one with clear users and manageable risk).
Define the service boundary (what it does and what it promises), the owning team, an initial SLO, the on-call rotation, and where runbooks and dashboards live.
The key: the team that ships changes also owns the operational outcomes for that service.
Before you expand to more services, make sure the pilot team can operate without heroics: alerts tied to the SLO, runbooks that actually get used, a rollback path the team trusts, and a pager load that stays humane.
Use a small set of indicators that show whether ownership is improving shipping and stability: deployment frequency, change failure rate, time to restore service, pages per on-call week, and SLO attainment.
If you’re adopting “you build it, you run it” while also trying to speed up delivery, the bottleneck is often the same: getting from idea → a production-ready service with clear ownership and a safe rollback story.
Koder.ai is a vibe-coding platform that helps teams build web, backend, and mobile apps through a chat interface (React on the web, Go + PostgreSQL on the backend, Flutter for mobile). For teams leaning into service ownership, a few features map cleanly to the operating model: building and deploying happen in one place, hosting and custom domains are available as options, and shipping stays in the hands of the team that owns the service.
Pick your pilot service this week and schedule a 60-minute kickoff to set the first SLO, on-call rotation, and runbook owners. If you’re evaluating tooling to support this (shipping, rollback, and the workflows around ownership), see /pricing for Koder.ai’s free, pro, business, and enterprise tiers—plus options like hosting, deployment, and custom domains.
It means the team that designs, builds, and deploys a service also owns what happens after it’s live: monitoring, on-call response, incident follow-ups, and reliability improvements.
It’s a responsibility model (clear ownership), not a tool choice or job title change.
It doesn’t mean every engineer must become a full-time infrastructure specialist.
It means the feedback loop is real: the team that ships a change also sees and handles its operational consequences, with platform support where it helps.
With a separate ops team, feedback is delayed and responsibility gets blurry: developers may not feel production pain, and ops may not have context for recent changes.
End-to-end ownership typically improves incident response time, alert quality, release safety, and overall delivery speed.
“Run it” usually includes being on-call in some form, owning incident response, writing runbooks, maintaining dashboards, and continuously improving the service.
Start with humane defaults: predictable rotations, clear severity levels, and paging only for issues that need a human right now.
A good on-call system aims to reduce pages next month, not normalize heroics.
Use a simple rule: if waking someone up won’t change the outcome, make it a ticket.
Practically, that means paging only for user-facing impact that needs immediate action, and filing tickets for everything else to review during working hours.
They create shared, measurable reliability targets: the SLI measures what users actually experience, the SLO sets the target, and the error budget defines how much failure is acceptable within that target.
When the budget is burning fast, prioritize reliability work; when it’s healthy, take more delivery risk.
Adopt release practices that lower uncertainty and blast radius: readiness checks before launch, progressive rollouts after launch, and a fast, well-rehearsed rollback path.
Run incidents with a repeatable flow: detect, triage, mitigate, communicate, and review.
Then write blameless postmortems focused on system and process gaps, with follow-ups that are concrete, owned, and given a due date.
A lightweight checklist like /blog/incident-response-checklist can help standardize the workflow.
A platform team should provide paved roads (templates, CI/CD, guardrails, shared services like auth and observability) while product teams keep ownership of their services’ outcomes.
A practical boundary: the platform team owns the platform’s uptime and support, while product teams own how their services use it.