Learn Eric Brewer’s CAP theorem as a practical mental model: how consistency, availability, and partitions shape distributed system decisions.

When you store the same data on more than one machine, you gain speed and fault tolerance—but you also inherit a new problem: disagreement. Two servers can receive different updates, messages can arrive late or not at all, and users might read different answers depending on which replica they hit. CAP became popular because it gives engineers a clean way to talk about that messy reality without hand-waving.
Eric Brewer, a computer scientist and co-founder of Inktomi, introduced the core idea in 2000 as a practical statement about replicated systems under failure. It spread quickly because it matched what teams were already experiencing in production: distributed systems don’t just fail by going down; they fail by splitting.
CAP is most useful when things go wrong—especially when the network doesn’t behave. On a healthy day, many systems can look both consistent and available enough. The pressure test is when machines can’t reliably communicate and you must decide what to do with reads and writes while the system is divided.
That framing is why CAP became a go-to mental model: it doesn’t argue about best practices; it forces a concrete question—what will we sacrifice during a split?
By the end of this article, you should be able to explain consistency, availability, and partition tolerance in plain terms, recognize when a partition forces a choice, and decide, per operation, what your system should give up during a split.
CAP endures because it turns vague “distributed is hard” talk into a decision you can make—and defend.
A distributed system is, in plain terms, many computers trying to act like one. You might have several servers in different racks, regions, or cloud zones, but to the user it’s “the app” or “the database.”
To make that shared system work at real-world scale, we usually replicate: we keep multiple copies of the same data on different machines.
Replication is popular for three practical reasons: it keeps the service running when a machine fails, it puts data closer to users so reads are faster, and it spreads read load across more hardware.
So far, replication sounds like a straightforward win. The catch is that replication creates a new job: keeping all copies in agreement.
If every replica could always talk to every other replica instantly, they could coordinate updates and stay aligned. But real networks aren’t perfect. Messages can be delayed, dropped, or routed around failures.
When communication is healthy, replicas can usually exchange updates and converge on the same state. But when communication breaks (even temporarily), you can end up with two valid-looking versions of “the truth.”
For example, a user changes their shipping address. Replica A receives the update, replica B doesn’t. Now the system has to answer a seemingly simple question: what is the current address?
This is the difference between answering immediately with whatever the local replica believes and answering only after confirming the latest value with the other replicas.
CAP thinking starts exactly here: once replication exists, disagreement under communication failure isn’t an edge case—it’s the central design problem.
CAP is a mental model for what users actually feel when a system is spread across multiple machines (often in multiple locations). It doesn’t describe “good” or “bad” systems—just the tension you must manage.
Consistency is about agreement. If you update something, will the next read (from anywhere) reflect that update?
From a user’s point of view, it’s the difference between “I just changed it, and everyone sees the same new value” versus “some people still see the old value for a while.”
Availability means the system responds to requests—reads and writes—with a success result. Not “the fastest possible,” but “it doesn’t refuse to serve you.”
During trouble (a server down, a network hiccup), an available system keeps accepting requests, even if it has to answer with data that might be slightly out of date.
A partition is when the network splits: machines are running, but messages between some of them can’t get through (or arrive too late to be useful). In distributed systems, you can’t treat this as impossible—you have to define behavior when it happens.
Imagine two retail shops that both sell the same product and share “1 inventory count.” A customer buys the last item in Shop A, so Shop A writes inventory = 0. At the same time, a network partition prevents Shop B from hearing about it.
If Shop B stays available, it may sell an item it no longer has (accepting the sale while partitioned). If Shop B enforces consistency, it may refuse the sale until it can confirm the latest inventory (denying service during the split).
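To make that choice concrete, here is a minimal Go sketch (the Shop type and its fields are hypothetical, not any real inventory API): each shop holds a local count, knows whether it can reach its peer, and follows one of the two stances above while the split lasts.

```go
package main

import (
	"errors"
	"fmt"
)

// Shop holds a local replica of the shared inventory count.
type Shop struct {
	Name              string
	LocalInventory    int
	PeerReachable     bool // false while the network is partitioned
	PreferConsistency bool // the stance this shop takes during a split
}

var errUnavailable = errors.New("cannot confirm inventory right now, try again later")

// Sell decides what to do with a sale request during a partition.
func (s *Shop) Sell() error {
	if !s.PeerReachable && s.PreferConsistency {
		// Consistency-first: refuse rather than risk overselling.
		return errUnavailable
	}
	if s.LocalInventory <= 0 {
		return errors.New("out of stock (as far as this replica knows)")
	}
	// Availability-first: accept the sale based on local knowledge,
	// even though the other shop may be selling the same last item.
	s.LocalInventory--
	return nil
}

func main() {
	shopB := &Shop{Name: "Shop B", LocalInventory: 1, PeerReachable: false}
	fmt.Println(shopB.Sell()) // <nil>: sale accepted, possibly overselling
	shopB.PreferConsistency = true
	fmt.Println(shopB.Sell()) // refused until it can coordinate again
}
```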
A “partition” isn’t just “the internet is down.” It’s any situation where parts of your system can’t reliably talk to each other—even though each part may still be running fine.
In a replicated system, nodes constantly exchange messages: writes, acknowledgements, heartbeats, leader elections, read requests. A partition is what happens when those messages stop arriving (or arrive too late), creating disagreement about reality: “Did the write happen?” “Who is the leader?” “Is node B alive?”
Communication can fail in messy, partial ways: messages can be delayed, dropped, or reordered; a link can work in one direction but not the other; a node can be running yet too overloaded to answer in time.
The important point: partitions are often degradation, not a clean on/off outage. From the application’s point of view, “slow enough” can be indistinguishable from “down.”
As you add more machines, more networks, more regions, and more moving parts, there are simply more opportunities for communication to break temporarily. Even if individual components are reliable, the overall system experiences failures because it has more dependencies and more cross-node coordination.
You don’t need to assume any exact failure rate to accept the reality: if your system runs long enough and spans enough infrastructure, partitions will happen.
Partition tolerance means your system is designed to keep operating during a split—even when nodes can’t agree or can’t confirm what the other side has seen. That forces a choice: either keep serving requests (risking inconsistency) or stop/reject some requests (preserving consistency).
Once you have replication, a partition is simply a communication break: two parts of your system can’t reliably talk to each other for a while. Replicas are still running, users are still clicking buttons, and your service is still receiving requests—but the replicas can’t agree on the latest truth.
That’s the CAP tension in one sentence: during a partition, you must choose to prioritize Consistency (C) or Availability (A). You don’t get both at the same time.
Choosing consistency means saying: “I’d rather be correct than responsive.” When the system can’t confirm that a request will keep all replicas in sync, it must fail or wait.
Practical effect: some users see errors, timeouts, or “try again” messages—especially for operations that change data. This is common when you’d rather reject a payment than risk charging twice, or block a seat reservation than oversell.
Choosing availability means saying: “I’d rather respond than block.” Each side of the partition will keep accepting requests, even if it can’t coordinate.
Practical effect: users get successful responses, but the data they read may be stale, and concurrent updates can conflict. You then rely on reconciliation later (merge rules, last-write-wins, manual review, etc.).
This isn’t always a single global setting. Many products mix strategies: strict, consistency-first behavior for money, inventory, and permissions, and availability-first behavior for browsing, feeds, and search.
The key moment is deciding—per operation—what’s worse: blocking a user now, or fixing conflicting truth later.
The slogan “pick two” is memorable, but it regularly misleads people into thinking CAP is a menu of three features where you only get to keep two forever. CAP is about what happens when the network stops cooperating: during a partition (or anything that looks like one), a distributed system must choose between returning consistent answers and staying available for every request.
In real distributed systems, partitions aren’t a setting you can disable. If your system spans machines, racks, zones, or regions, then messages can be delayed, dropped, reordered, or routed oddly. That’s a partition from the perspective of the software: nodes can’t reliably agree on what’s happening.
Even if the physical network is fine, failures elsewhere create the same effect—overloaded nodes, GC pauses, noisy neighbors, DNS hiccups, flaky load balancers. The result is the same: some parts of the system can’t talk to others well enough to coordinate.
Applications don’t experience “partition” as a neat, binary event. They experience latency spikes and timeouts. If a request times out after 200 ms, it doesn’t matter whether the packet arrived at 201 ms or never arrived at all: the app must decide what to do next. From the app’s point of view, slow communication is often indistinguishable from broken communication.
Many real systems are mostly consistent or mostly available, depending on configuration and operating conditions. Timeouts, retry policies, quorum sizes, and “read your writes” options can shift the behavior.
Under normal conditions, a database might look strongly consistent; under stress or cross-region hiccups, it may start failing requests (favoring consistency) or returning older data (favoring availability).
CAP is less about labeling products and more about understanding the trade you’re making when disagreement happens—especially when that disagreement is caused by plain old slowness.
CAP discussions often make consistency sound binary: either “perfect” or “anything goes.” Real systems offer a menu of guarantees, each with a different user experience when replicas disagree or a network link breaks.
Strong consistency (often “linearizable” behavior) means once a write is acknowledged, every later read—no matter which replica it hits—returns that write or something newer.
What it costs: during a partition or when a minority of replicas are unreachable, the system may delay or reject reads/writes to avoid showing conflicting states. Users notice this as timeouts, “try again,” or temporarily read-only behavior.
Eventual consistency promises that if no new updates happen, all replicas will converge. It does not promise that two users reading right now will see the same thing.
What users may notice: a recently updated profile photo that “reverts,” counters that lag behind, or a just-sent message that isn’t visible on another device for a short time.
You can often buy a better experience without demanding full strong consistency: session guarantees such as read-your-writes (you always see your own updates) and monotonic reads (you never see data go backwards in time) cover many everyday cases.
These guarantees map well to how people think (“don’t show me my own changes disappearing”) and can be easier to maintain during partial failures.
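As a rough illustration, here is a small Go sketch of how read-your-writes and monotonic reads can be enforced with version numbers. The Replica and Session types are assumptions for this example, not any particular database’s API.

```go
package capdemo

import "errors"

// Replica exposes the data it has applied so far, tagged with a version.
type Replica struct {
	AppliedVersion int64
	Value          string
}

// Session tracks the highest version this user has written or read,
// which is enough for read-your-writes and monotonic reads.
type Session struct {
	MinVersion int64
}

var errNotCaughtUp = errors.New("replica is behind this session; try another replica or the primary")

// Read returns the replica's value only if it is at least as new as
// anything this session has already observed.
func (s *Session) Read(r *Replica) (string, error) {
	if r.AppliedVersion < s.MinVersion {
		return "", errNotCaughtUp
	}
	s.MinVersion = r.AppliedVersion // never go backwards on later reads
	return r.Value, nil
}

// RecordWrite is called after a successful write so that later reads
// by the same session never appear to "lose" the change.
func (s *Session) RecordWrite(version int64) {
	if version > s.MinVersion {
		s.MinVersion = version
	}
}
```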
Start with user promises, not jargon: promises like “you always see your own changes” or “a completed order never disappears,” rather than terms like linearizable or eventual.
Consistency is a product choice: define what “wrong” looks like to a user, then choose the weakest guarantee that prevents that wrongness.
Availability in CAP isn’t a bragging metric (“five nines”)—it’s a promise you make to users about what happens when the system can’t be certain.
When replicas can’t agree, you often choose between responding with the best data you have (possibly stale) and refusing to respond until you can be sure.
Users feel this as “the app works” versus “the app is correct.” Neither is universally better; the right choice depends on what “wrong” means in your product. A slightly outdated social feed is annoying. An outdated account balance can be harmful.
Two common behaviors show up during uncertainty: failing open (serve the best answer you have and reconcile later) and failing closed (refuse or delay the request until you can be certain).
This is not a purely technical call; it’s a policy decision. The product needs to define what is acceptable to show, and what must never be guessed.
Availability is rarely all-or-nothing. During a split, you might see partial availability: some regions, networks, or user groups succeed while others fail. This can be a deliberate design (keep serving where the local replica is healthy) or an accidental one (routing imbalances, uneven quorum reach).
A practical middle ground is degraded mode: continue serving safe actions while restricting risky ones. For example, allow browsing and search, but temporarily disable “transfer funds,” “change password,” or other operations where correctness and uniqueness matter.
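A minimal sketch of such a degraded-mode gate, assuming a hypothetical Gate type flipped by health checks when cross-replica coordination looks unreliable: safe operations always pass, risky ones are refused for now.

```go
package capdemo

import "errors"

// Operation classifies requests by how dangerous they are to serve
// while replicas may disagree.
type Operation struct {
	Name  string
	Risky bool // e.g. "transfer funds", "change password"
}

// Gate is a simple degraded-mode switch, flipped by health checks
// when cross-replica coordination looks unreliable.
type Gate struct {
	Degraded bool
}

var errDegraded = errors.New("this action is temporarily disabled; browsing still works")

// Allow permits safe operations always, and risky ones only when the
// system is confident it can coordinate.
func (g *Gate) Allow(op Operation) error {
	if g.Degraded && op.Risky {
		return errDegraded
	}
	return nil
}
```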
CAP feels abstract until you map it to what your users experience during a network split: do you prefer the system to keep responding, or to stop and avoid returning (or accepting) conflicting data?
Imagine two data centers that both accept orders while they can’t talk to each other.
If you keep the checkout flow available, each side may sell the “last item” and you’ll oversell. That can be acceptable for low-stakes goods (you backorder or apologize), but painful for limited inventory drops.
If you choose consistency-first behavior, you might block new orders when you can’t confirm stock globally. Users see “try again later,” but you avoid selling something you can’t fulfill.
Money is the classic “being wrong is expensive” domain. If two replicas accept withdrawals independently during a split, an account can go negative.
Systems often prefer consistency during critical writes: decline or delay actions if they can’t confirm the latest balance. You’ll trade some availability (temporary payment failures) for correctness, auditability, and trust.
In chat and social feeds, users usually tolerate small inconsistencies: a message arrives a few seconds late, a like count is off, a view metric updates later.
Here, designing for availability can be a good product choice, as long as you’re clear which elements are “eventually correct” and you can merge updates cleanly.
The “right” CAP choice depends on the cost of being wrong: refunds, legal exposure, user trust, or operational chaos. Decide where you can accept temporary staleness—and where you must fail closed.
Once you’ve decided what you’ll do during a network split, you need mechanisms that make that decision real. These patterns show up across databases, message systems, and APIs—even if the product never mentions “CAP.”
A quorum is just “most of the replicas agree.” If you have 5 copies of some data, a majority is 3.
By requiring reads and/or writes to contact a majority, you reduce the chance of returning stale or conflicting data. For example, if a write must be acknowledged by 3 replicas, it’s harder for two isolated groups to both accept different “truths.”
The tradeoff is speed and reach: if you can’t reach a majority (because of a partition or outages), the system may refuse the operation—choosing consistency over availability.
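Here is a rough Go sketch of that majority rule, assuming hypothetical replica-acknowledgement callbacks rather than any specific database client: the write succeeds only if most replicas confirm it, and is rejected otherwise.

```go
package capdemo

import "errors"

// majority returns the smallest number of replicas that forms a quorum:
// for 5 replicas this is 3, for 3 replicas it is 2.
func majority(replicas int) int {
	return replicas/2 + 1
}

var errNoQuorum = errors.New("could not reach a majority; rejecting the write to stay consistent")

// quorumWrite asks every replica to acknowledge the write and succeeds
// only if a majority answers. ackFns stands in for real replica calls.
func quorumWrite(ackFns []func() bool) error {
	acks := 0
	for _, ack := range ackFns {
		if ack() {
			acks++
		}
	}
	if acks < majority(len(ackFns)) {
		return errNoQuorum // choosing consistency over availability
	}
	return nil
}
```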
Many “availability” issues are not hard failures but slow responses. Setting a short timeout can make the system feel snappy, but it also increases the chance you’ll treat slow successes as failures.
Retries can recover from transient blips, but aggressive retries can overload an already struggling service. Backoff (waiting a bit longer between retries) and jitter (randomness) help keep retries from turning into a traffic spike.
The key is aligning these settings with your promise: “always respond” usually means more retries and fallbacks; “never lie” usually means tighter limits and clearer errors.
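As a sketch of the retry side, assuming a generic op callback rather than any particular client library: exponential backoff, jitter, a short per-attempt timeout, and an overall deadline carried by the context.

```go
package capdemo

import (
	"context"
	"math/rand"
	"time"
)

// retryWithBackoff runs op up to maxAttempts times, waiting longer after
// each failure and adding jitter so many clients don't retry in lockstep.
// The context carries the overall deadline: when it expires, we stop.
func retryWithBackoff(ctx context.Context, maxAttempts int, op func(context.Context) error) error {
	base := 100 * time.Millisecond
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		// Give each attempt its own short timeout so one slow call
		// doesn't silently eat the whole budget.
		attemptCtx, cancel := context.WithTimeout(ctx, 200*time.Millisecond)
		err = op(attemptCtx)
		cancel()
		if err == nil {
			return nil
		}
		// Exponential backoff (100ms, 200ms, 400ms, ...) plus a random
		// jitter of up to the same amount.
		sleep := base << attempt
		sleep += time.Duration(rand.Int63n(int64(sleep)))
		select {
		case <-time.After(sleep):
		case <-ctx.Done():
			return ctx.Err() // overall deadline hit; stop retrying
		}
	}
	return err
}
```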
If you choose to stay available during partitions, replicas may accept different updates and you must reconcile later. Common approaches include last-write-wins, application-specific merge rules, and flagging conflicts for manual review.
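For example, a minimal last-write-wins sketch (the type and field names are illustrative). Note that it quietly discards the losing update, which is only acceptable for data where that loss doesn’t matter.

```go
package capdemo

import "time"

// Versioned pairs a value with the time its write was accepted locally.
type Versioned struct {
	Value     string
	WrittenAt time.Time
}

// mergeLastWriteWins reconciles two divergent copies after a partition
// heals by keeping whichever write carries the later timestamp. Simple,
// but the "losing" update silently disappears.
func mergeLastWriteWins(a, b Versioned) Versioned {
	if b.WrittenAt.After(a.WrittenAt) {
		return b
	}
	return a
}
```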
Retries can create duplicates: double-charging a card or submitting the same order twice. Idempotency prevents that.
A common pattern is an idempotency key (request ID) sent with each request. The server stores the first result and returns the same result for repeats—so retries improve availability without corrupting data.
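A minimal sketch of that server-side pattern, using an in-memory map keyed by the client’s idempotency key; a real service would persist these entries and expire them.

```go
package capdemo

import "sync"

// Result is whatever the operation produced the first time (e.g. an
// order ID); retries get the same Result back instead of a second charge.
type Result struct {
	OrderID string
}

// IdempotencyStore remembers results by client-supplied request key.
type IdempotencyStore struct {
	mu   sync.Mutex
	seen map[string]Result
}

func NewIdempotencyStore() *IdempotencyStore {
	return &IdempotencyStore{seen: make(map[string]Result)}
}

// Execute runs op only the first time a key is seen; repeats return the
// stored result, so client retries cannot duplicate the side effect.
// (Holding the lock across op keeps the sketch simple; a real service
// would mark the key as "in progress" instead.)
func (s *IdempotencyStore) Execute(key string, op func() Result) Result {
	s.mu.Lock()
	defer s.mu.Unlock()
	if r, ok := s.seen[key]; ok {
		return r
	}
	r := op()
	s.seen[key] = r
	return r
}
```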
Most teams “choose” a CAP stance on a whiteboard—then discover in production that the system behaves differently under stress. Validation means intentionally creating the conditions where CAP tradeoffs become visible, and checking that your system reacts the way you designed it to.
You don’t need a real cable cut to learn something. Use controlled fault injection in staging (and carefully in production) to simulate partitions: block traffic between specific nodes, add artificial latency, or drop a percentage of messages between services.
The goal is to answer concrete questions: Do writes get rejected or accepted? Do reads serve stale data? Does the system recover automatically, and how long does reconciliation take?
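One low-tech way to approximate this in a staging service, offered as a sketch rather than a recommendation of any particular tool: wrap an HTTP handler so calls through it gain extra latency and a configurable failure rate, which from the caller’s perspective looks just like a partition.

```go
package capdemo

import (
	"math/rand"
	"net/http"
	"time"
)

// FaultInjector wraps a handler and makes the link between services
// look partitioned: every request slows down, and some fail outright.
type FaultInjector struct {
	Next      http.Handler
	ExtraLag  time.Duration // added to every request
	DropRatio float64       // fraction of requests answered with an error
}

func (f FaultInjector) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	time.Sleep(f.ExtraLag) // simulate a slow, congested link
	if rand.Float64() < f.DropRatio {
		// Simulate messages that never arrive.
		http.Error(w, "injected fault: upstream unreachable", http.StatusBadGateway)
		return
	}
	f.Next.ServeHTTP(w, r)
}
```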
If you want to validate these behaviors early (before you’ve invested weeks into wiring services together), it can help to spin up a realistic prototype quickly. For example, teams using Koder.ai often start by generating a small service (commonly a Go backend with PostgreSQL plus a React UI) and then iterating on behaviors like retries, idempotency keys, and “degraded mode” flows in a sandbox environment.
Traditional uptime checks won’t catch “available but wrong” behavior. Track replication lag and read staleness, the rate of rejected or conflicting writes, and how long reconciliation takes after a partition heals.
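As an illustration of what that telemetry might look like (field names are hypothetical), the point is to export staleness and conflict counts alongside plain up/down status.

```go
package capdemo

import "time"

// ReplicaHealth is the kind of per-replica signal worth exporting to
// your metrics system alongside plain uptime.
type ReplicaHealth struct {
	LastAppliedWrite time.Time // newest write this replica has applied
	ConflictsPending int       // divergent updates awaiting reconciliation
	RejectedWrites   int       // requests refused to protect consistency
}

// Staleness is how far this replica lags the newest write accepted
// anywhere in the system: "available but wrong" shows up here even
// when every node reports itself as up.
func (h ReplicaHealth) Staleness(newestWriteAnywhere time.Time) time.Duration {
	return newestWriteAnywhere.Sub(h.LastAppliedWrite)
}
```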
Operators need pre-decided actions when a partition happens: when to freeze writes, when to fail over, when to degrade features, and how to validate re-merge safety.
Also plan the user-facing behavior. If you choose consistency, the message might be “We can’t confirm your update right now—please retry.” If you choose availability, be explicit: “Your update may take a few minutes to appear everywhere.” Clear wording reduces support load and preserves trust.
When you’re making a system decision, CAP is most useful as a quick “what breaks during a split?” audit—not a theoretical debate. Use this checklist before you pick a database feature, caching strategy, or replication mode.
Ask these in order:
If a network partition happens, you’re deciding which of these you’ll protect first.
Avoid a single global setting like “we are an AP system.” Instead, decide per operation, per endpoint, and per data type.
Example: during a split, you might block writes to payments (prefer consistency) but keep reads for product_catalog available with cached data.
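One way to make that per-operation stance explicit and reviewable is a small policy table; the operation names below are illustrative.

```go
package capdemo

// PartitionPolicy says what an operation should do when the system
// can't coordinate across replicas.
type PartitionPolicy int

const (
	RejectUntilConsistent PartitionPolicy = iota // fail closed
	ServeBestEffort                              // fail open, possibly stale
)

// policies makes the per-operation decision explicit, instead of hiding
// it in a single global "AP or CP" setting.
var policies = map[string]PartitionPolicy{
	"payments.write":       RejectUntilConsistent,
	"inventory.reserve":    RejectUntilConsistent,
	"product_catalog.read": ServeBestEffort, // cached data is acceptable
	"feed.read":            ServeBestEffort,
}
```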
Write down what you can tolerate, with examples: “a like count can lag by a few minutes,” “an account balance must never be stale,” “a seat can never be sold twice.”
If you can’t describe inconsistency in plain examples, you’ll struggle to test it and explain incidents.
Next topics that pair well with this checklist: consensus (/blog/consensus-vs-cap), consistency models (/blog/consistency-models-explained), and SLOs/error budgets (/blog/sre-slos-error-budgets).
CAP is a mental model for replicated systems under communication failure. It’s most useful when the network is slow, lossy, or split, because that’s when replicas can’t reliably agree and you’re forced to choose between returning consistent answers and staying available for every request.
It helps turn “distributed is hard” into a concrete product and engineering decision.
A true CAP scenario requires both replication (the same state stored on more than one node) and the possibility that those nodes can’t communicate reliably.
If your system is a single node, or if you don’t replicate state, CAP tradeoffs aren’t the central issue.
A partition is any situation where parts of your system can’t communicate reliably or within required time limits—even if every machine is still running.
Practically, “partition” often shows up as timeouts, dropped or delayed messages, overloaded nodes, GC pauses, or flaky links between zones and regions.
From the application’s perspective, “too slow” can be the same as “down.”
Consistency (C) means reads reflect the latest acknowledged write from anywhere. Users experience it as “I changed it, and everyone sees it.”
Availability (A) means every request receives a successful response (not necessarily the newest data). Users experience it as “the app keeps working,” possibly with stale results.
During a partition, you typically can’t guarantee both simultaneously for all operations.
Because partitions are not optional in distributed systems that span machines, racks, zones, or regions. If you replicate, you must define behavior when nodes can’t coordinate.
So “tolerating partitions” usually means: when communication breaks, the system still has a defined way to operate—either by rejecting/pausing some actions (favoring consistency) or by serving best-effort results (favoring availability).
If you favor consistency, you typically reject or delay operations you can’t safely coordinate, returning errors or “try again” rather than possibly wrong answers.
This is common for money movement, inventory reservation, and permission changes—places where being wrong is worse than being briefly unavailable.
If you favor availability, you typically keep accepting reads and writes on each side of the split and reconcile the differences after communication is restored.
Users see fewer hard errors, but may see stale data, duplicated effects without idempotency, or conflicting updates that need cleanup.
You can choose differently per endpoint/data type. Common mixed strategies include consistency-first handling for payments, inventory, and permission changes, with availability-first handling for feeds, search, and catalog reads.
This avoids a single global “we are AP/CP” label that rarely matches real product needs.
Useful options include quorum reads and writes, timeouts with retries and backoff, idempotency keys, and explicit conflict-resolution rules.
Validate by creating conditions where disagreement becomes visible: inject latency or block traffic between nodes in staging, then check whether writes are rejected or accepted, whether reads serve stale data, and how long recovery and reconciliation take.
Pick the weakest guarantee that prevents user-visible “wrongness” you can’t tolerate.
Also prepare runbooks and user messaging that match your chosen behavior (fail closed vs fail open).