Learn how to design and build a web app that tracks SLA compliance: define metrics, collect events, calculate results, alert on breaches, and report accurately.

SLA compliance means meeting the measurable promises in a Service Level Agreement (SLA)—a contract between a provider and a customer. Your app’s job is to answer a simple question with evidence: Did we meet what we promised, for this customer, during this time period?
It helps to separate three related terms:
Most SLA tracking web apps start with a small set of metrics that map to real operational data:
Different users want the same truth, presented differently:
This product is about tracking, proof, and reporting: collecting signals, applying agreed rules, and generating audit-friendly results. It does not guarantee performance; it measures it—accurately, consistently, and in a way you can defend later.
Before you design tables or write code, get painfully clear on what “compliance” means for your business. Most SLA tracking problems aren’t technical—they’re requirements problems.
Start by collecting the sources of truth:
Write these down as explicit rules. If a rule can’t be stated clearly, it can’t be calculated reliably.
List the real-world “things” that can affect an SLA number:
Also identify who needs what: support wants real-time breach risk, managers want weekly rollups, customers want simple summaries (often for a status page).
Keep scope small. Choose the minimum set that proves the system works end-to-end, such as:
Create a one-page checklist that you can test later:
Success looks like this: two people compute the same sample month manually and your app matches it exactly.
A correct SLA tracker starts with a data model that can explain why a number is what it is. If you can’t trace a monthly availability figure back to the exact events and rules used, you’ll fight customer disputes and internal uncertainty.
At minimum, model:
A useful relationship is: customer → service → SLA policy (possibly via plan). Incidents and events then reference the service and customer.
Time bugs are the #1 cause of wrong SLA math. Store:
- occurred_at as UTC (timestamp with timezone semantics)
- received_at (when your system saw it)
- source (monitor name, integration, manual)
- external_id (to dedupe retries)
- payload (raw JSON for future debugging)

Also store customer.timezone (IANA string like America/New_York) for display and “business hours” logic, but don’t use it to rewrite event time.
If response-time SLAs pause outside business hours, model calendars explicitly:
- working_hours per customer (or per region/service): day-of-week + start/end times
- holiday_calendar linked to a region or customer, with date ranges and labels

Keep the rules data-driven so ops can update a holiday without a deploy.
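For illustration, here is a minimal sketch of those rules as data; the TypeScript shapes and field names are assumptions, not a prescribed schema:

```ts
// Hypothetical data-driven calendar rules (illustrative names, not a fixed schema).

/** One weekly working-hours rule, e.g. Mon–Fri 09:00–17:00 in the customer's timezone. */
interface WorkingHoursRule {
  customerId: string;
  dayOfWeek: 0 | 1 | 2 | 3 | 4 | 5 | 6; // 0 = Sunday
  startLocal: string; // "09:00", interpreted in the customer's IANA timezone
  endLocal: string;   // "17:00"
}

/** A holiday entry attached to a region or customer calendar. */
interface HolidayEntry {
  calendarId: string;
  label: string;     // e.g. "New Year's Day"
  startDate: string; // "2026-01-01" (local date)
  endDate: string;   // inclusive end date
}

// Ops can edit rows like these without a deploy.
const acmeWeekdays: WorkingHoursRule[] = [1, 2, 3, 4, 5].map((day) => ({
  customerId: "acme",
  dayOfWeek: day as WorkingHoursRule["dayOfWeek"],
  startLocal: "09:00",
  endLocal: "17:00",
}));
```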
Store raw events in an append-only table, and store calculated results separately (e.g., sla_period_result). Each result row should include: period boundaries, inputs version (policy version + engine version), and references to the event IDs used. This makes recomputation safe and gives you an audit trail when customers ask, “Which outage minutes did you count?”
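One possible shape for a result row, again sketched in TypeScript with illustrative names; the point is that every published number carries its period boundaries, the versions used, and the events that fed it:

```ts
// Hypothetical persisted result row; raw events stay in their own append-only table.
interface SlaPeriodResult {
  customerId: string;
  serviceId: string;
  periodStart: string;       // ISO 8601 UTC, e.g. "2026-03-01T00:00:00Z"
  periodEnd: string;         // exclusive end of the period
  policyVersion: string;     // SLA policy version used as input
  engineVersion: string;     // calculation code version used as input
  eligibleMinutes: number;
  downtimeMinutes: number;
  availabilityPercent: number;
  breached: boolean;
  countedEventIds: string[]; // exact event IDs included, for audits and recomputation
  computedAt: string;        // when this result was produced
}
```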
Your SLA numbers are only as trustworthy as the events you ingest. The goal is simple: capture every change that matters (an outage started, an incident acknowledged, service restored) with consistent timestamps and enough context to calculate compliance later.
Most teams end up pulling from a mix of systems:
Webhooks are usually best for real-time accuracy and lower load: the source system pushes events to your endpoint.
Polling is a good fallback when webhooks aren’t available: your app periodically fetches changes since the last cursor. You’ll need rate-limit handling and careful “since” logic.
CSV import helps with backfills and migrations. Treat it as a first-class ingestion path so you can reprocess historical periods without hacks.
Normalize everything into a single internal “event” shape, even if the upstream payloads differ:
- event_id (required): unique and stable across retries. Prefer the source’s event GUID; otherwise generate a deterministic hash.
- source (required): e.g., datadog, servicenow, manual.
- event_type (required): e.g., incident_opened, incident_acknowledged, service_down, service_up.
- occurred_at (required): the time the event happened (not when you received it), with timezone.
- received_at (system): when your app ingested it.
- service_id (required): the SLA-relevant service the event affects.
- incident_id (optional but recommended): links multiple events to one incident.
- attributes (optional): priority, region, customer segment, etc.

Store event_id with a unique constraint to make ingestion idempotent: retries won’t create duplicates.
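As a sketch of that normalized shape, with a deterministic fallback ID for sources that don’t send a stable GUID (the hash recipe and field names are assumptions):

```ts
import { createHash } from "node:crypto";

// Hypothetical normalized event; fields mirror the list above.
interface NormalizedEvent {
  eventId: string;                     // unique, stable across retries
  source: string;                      // e.g. "datadog", "servicenow", "manual"
  eventType: string;                   // e.g. "incident_opened", "service_down"
  occurredAt: string;                  // ISO 8601 with timezone: when it happened
  receivedAt: string;                  // when your app ingested it
  serviceId: string;
  incidentId?: string;
  attributes?: Record<string, unknown>;
}

/** Deterministic fallback: the same upstream payload always maps to the same event_id. */
function deterministicEventId(
  source: string,
  eventType: string,
  occurredAt: string,
  serviceId: string,
): string {
  return createHash("sha256")
    .update([source, eventType, occurredAt, serviceId].join("|"))
    .digest("hex");
}
```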
Reject or quarantine events that:
- have an occurred_at far in the future
- reference a service_id you can’t map (or require an explicit “unmapped” workflow)
- are missing an event_id

This discipline upfront saves you from arguing about SLA reports later—because you’ll be able to point to clean, traceable inputs.
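A small validation sketch in the same spirit; the five-minute clock-skew tolerance and the serviceExists lookup are assumptions you would replace with your own rules:

```ts
// Hypothetical quarantine check run before an event reaches the append-only table.
type QuarantineReason = "occurred_at_in_future" | "unmapped_service" | "missing_event_id";

function quarantineReasons(
  event: { eventId?: string; serviceId?: string; occurredAt: string },
  serviceExists: (id: string) => boolean,
  now: Date = new Date(),
): QuarantineReason[] {
  const reasons: QuarantineReason[] = [];
  const skewMs = 5 * 60 * 1000; // assumed tolerance for clock skew

  if (Date.parse(event.occurredAt) > now.getTime() + skewMs) {
    reasons.push("occurred_at_in_future");
  }
  if (!event.serviceId || !serviceExists(event.serviceId)) {
    reasons.push("unmapped_service");
  }
  if (!event.eventId) {
    reasons.push("missing_event_id");
  }
  return reasons; // an empty array means the event can be ingested normally
}
```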
Your calculation engine is where “raw events” become SLA outcomes you can defend. The key is to treat it like accounting: deterministic rules, clear inputs, and a replayable trail.
Convert everything into a single ordered stream per incident (or per service-impact):
From this timeline, compute durations by summing intervals, not by subtracting two timestamps blindly.
Define TTFR (time to first response) as the elapsed “chargeable” time between incident_start and first_agent_response (or acknowledged, depending on your SLA wording). Define TTR (time to resolution) as the elapsed “chargeable” time between incident_start and resolved.
“Chargeable” means you remove intervals that shouldn’t count, such as approved maintenance windows, time outside contractual business hours, and any exclusions your contract explicitly defines.
Implementation detail: store a calendar function (business hours, holidays) and a rule function that takes a timeline and returns billable intervals.
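Here is a minimal sketch of that split, assuming maintenance windows and out-of-hours gaps have already been expanded into concrete exclusion intervals:

```ts
interface Interval {
  start: number; // epoch milliseconds, UTC
  end: number;
}

/** Clip one interval against a list of exclusions; return the remaining (chargeable) pieces. */
function subtractExclusions(span: Interval, exclusions: Interval[]): Interval[] {
  let remaining: Interval[] = [span];
  for (const ex of exclusions) {
    remaining = remaining.flatMap((r) => {
      if (ex.end <= r.start || ex.start >= r.end) return [r]; // no overlap
      const pieces: Interval[] = [];
      if (ex.start > r.start) pieces.push({ start: r.start, end: ex.start });
      if (ex.end < r.end) pieces.push({ start: ex.end, end: r.end });
      return pieces; // fully covered -> nothing remains
    });
  }
  return remaining;
}

/** Chargeable minutes: sum the surviving intervals instead of subtracting raw endpoints. */
function chargeableMinutes(span: Interval, exclusions: Interval[]): number {
  const ms = subtractExclusions(span, exclusions).reduce((sum, r) => sum + (r.end - r.start), 0);
  return ms / 60_000;
}

// Example: a 100-minute incident with a 30-minute maintenance window inside it
// leaves 70 chargeable minutes:
// chargeableMinutes({ start: 0, end: 100 * 60_000 }, [{ start: 40 * 60_000, end: 70 * 60_000 }])
```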
Decide upfront whether you calculate:
For partial outages, weight by impact only if your SLA contract requires it; otherwise treat “degraded” as a separate breach category.
Every calculation should be reproducible. Persist the period boundaries, the policy and engine versions used, and the IDs of the events that were counted.
When rules change, you can re-run calculations by version without rewriting history—crucial for audits and customer disputes.
Reporting is where SLA tracking either earns trust—or gets questioned. Your app should make it clear what time range is being measured, which minutes count, and how the final numbers are derived.
Support the common reporting periods your customers actually use:
Store periods as explicit start/end timestamps (not “month = 3”) so you can replay calculations later and explain results.
A frequent source of confusion is whether the denominator is the whole period or only “eligible” time.
Define two values per period: eligible_minutes (the time that counts toward the SLA after agreed exclusions) and downtime_minutes (the downtime counted within that eligible time).
Then calculate:
availability_percent = 100 * (eligible_minutes - downtime_minutes) / eligible_minutes
If eligible minutes can be zero (for example, a service that is only monitored during business hours and the period contains none), define the rule up front: either “N/A” or treat as 100%—but be consistent and document it.
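A sketch of that rule made explicit in code; returning null to mean “N/A” is one choice, and treating the period as 100% is the other, as long as the choice is documented:

```ts
/** Availability for a period. Returns null ("N/A") when there are no eligible minutes. */
function availabilityPercent(eligibleMinutes: number, downtimeMinutes: number): number | null {
  if (eligibleMinutes <= 0) return null;                      // documented rule: no eligible time -> N/A
  const counted = Math.min(downtimeMinutes, eligibleMinutes); // downtime never exceeds eligible time
  return (100 * (eligibleMinutes - counted)) / eligibleMinutes;
}

// Example: 43_200 eligible minutes (a 30-day month) with 18 downtime minutes ≈ 99.958%
```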
Most SLAs need both a percentage and a binary outcome.
Also keep the “distance to breach” (remaining downtime budget) so dashboards can warn before the threshold is crossed.
Finally, keep the raw inputs (included/excluded events and adjustments) so every report can answer “why is this number what it is?” without hand-waving.
Your calculation engine can be perfect and still fail users if the UI doesn’t answer the basic question instantly: “Are we meeting the SLA right now, and why?” Design the app so each screen starts with a clear status, then lets people drill into the numbers and the raw events that produced them.
Overview dashboard (for operators and managers). Lead with a small set of tiles: current period compliance, availability, response-time compliance, and “time remaining before breach” where applicable. Keep labels explicit (e.g., “Availability (this month)” rather than “Uptime”). If you support multiple SLAs per customer, show the worst status first and let users expand.
Customer detail (for account teams and customer-facing reporting). A customer page should summarize all services and SLA tiers for that customer, with a simple pass/warn/fail state and a short explanation (“2 incidents counted; 18m downtime counted”). Add links to /status (if you provide a customer-facing status page) and to a report export.
Service detail (for deep investigation). This is where you show the exact SLA rules, the calculation window, and a breakdown of how the compliance number was formed. Include a chart of availability over time and a list of incidents that counted toward the SLA.
Incident timeline (for audits). A single incident view should show a timeline of events (detected, acknowledged, mitigated, resolved) and which timestamps were used for “response” and “resolution” metrics.
Make filters consistent across screens: date range, customer, service, tier, and severity. Use the same units everywhere (minutes vs seconds; percentages with the same decimals). When users change the date range, update every metric on the page so there’s no mismatch.
Every summary metric should have a “Why?” path:
Use tooltips sparingly to define terms like “Excluded downtime” or “Business hours,” and show the exact rule text on the service page so people don’t guess.
Prefer plain language over abbreviations (“Response time” instead of “MTTA” unless your audience expects it). For status, combine color with text labels (“At risk: 92% of error budget used”) to avoid ambiguity. If your app supports audit logs, add a small “Last changed” box on SLA rules and exclusions linking to /audit so users can verify when definitions changed.
Alerting is where your SLA tracking web app stops being a passive report and starts helping teams avoid penalties. The best alerts are timely, specific, and actionable—meaning they tell someone what to do next, not just that something is “bad.”
Start with three trigger types: approaching breach, breach occurred, and repeated violations.
Make triggers configurable per customer/service/SLA, since different contracts tolerate different thresholds.
Send alerts to where people actually respond:
Every alert should include deep links like /alerts, /customers/{id}, /services/{id}, and the incident or event detail page so responders can verify the numbers quickly.
Implement deduplication by grouping alerts with the same key (customer + service + SLA + period) and suppressing repeats for a cooldown window.
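A sketch of that grouping logic; the key format and the in-memory cooldown store are assumptions (in production this state would live in your database or cache):

```ts
// Hypothetical dedup: suppress alerts with the same key during a cooldown window.
const lastSentAt = new Map<string, number>();

function alertKey(customerId: string, serviceId: string, slaId: string, period: string): string {
  return [customerId, serviceId, slaId, period].join(":");
}

/** Returns true if the alert should be sent now, false if it falls inside the cooldown. */
function shouldSend(key: string, cooldownMinutes: number, now: number = Date.now()): boolean {
  const previous = lastSentAt.get(key);
  if (previous !== undefined && now - previous < cooldownMinutes * 60_000) {
    return false; // repeat within the cooldown window -> suppress
  }
  lastSentAt.set(key, now);
  return true;
}
```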
Add quiet hours (per team time zone) so non-critical “approaching breach” alerts wait until business hours, while “breach occurred” can override quiet hours if severity is high.
Finally, support escalation rules (e.g., notify on-call after 10 minutes, escalate to a manager after 30) to prevent alerts from stalling in one inbox.
SLA data is sensitive because it can expose internal performance and customer-specific entitlements. Treat access control as part of the SLA “math”: the same incident can produce different compliance results depending on which customer’s SLA is applied.
Keep roles simple, then grow into finer-grained permissions.
A practical default is RBAC + tenant scoping:
Be explicit about customer-specific data:
Start with email/password and require MFA for internal roles. Plan for SSO later (SAML/OIDC) by separating identity (who they are) from authorization (what they can access). For integrations, issue API keys tied to a service account with narrow scopes and rotation support.
Add immutable audit entries for:
Store who, what changed (before/after), when, where (IP/user agent), and a correlation ID. Make audit logs searchable and exportable (e.g., /settings/audit-log).
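One possible shape for such an entry, sketched with illustrative field names:

```ts
// Hypothetical append-only audit entry; rows are only ever inserted, never updated.
interface AuditEntry {
  id: string;
  actorId: string;        // who made the change
  action: string;         // e.g. "sla_policy.updated", "exclusion.added"
  before: unknown;        // snapshot prior to the change
  after: unknown;         // snapshot after the change
  occurredAt: string;     // ISO 8601 UTC
  ipAddress?: string;
  userAgent?: string;
  correlationId: string;  // ties the entry to the request or job that caused it
}
```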
An SLA tracking app is rarely an island. You’ll want an API that lets monitoring tools, ticketing systems, and internal workflows create incidents, push events, and pull reports without manual work.
Use a versioned base path (for example, /api/v1/...) so you can evolve payloads without breaking existing integrations.
Essential endpoints to cover most use cases:
- POST /api/v1/events to ingest state changes (up/down, latency samples, maintenance windows); GET /api/v1/events for audits and debugging.
- POST /api/v1/incidents, PATCH /api/v1/incidents/{id} (acknowledge, resolve, assign), GET /api/v1/incidents.
- GET /api/v1/slas, POST /api/v1/slas, PUT /api/v1/slas/{id} to manage contracts and thresholds.
- GET /api/v1/reports/sla?service_id=...&from=...&to=... for compliance summaries.
- POST /api/v1/alerts/subscriptions to manage webhooks/email targets; GET /api/v1/alerts for alert history.

For pagination and filtering, pick one convention and use it everywhere: for example, limit with cursor pagination, plus standard filters like service_id, sla_id, status, from, and to. Keep sorting predictable (e.g., sort=-created_at).
Return structured errors with stable fields:
{ "error": { "code": "VALIDATION_ERROR", "message": "service_id is required", "fields": { "service_id": "missing" } } }
Use clear HTTP statuses (400 validation, 401/403 auth, 404 not found, 409 conflict, 429 rate limit). For event ingestion, consider idempotency (Idempotency-Key) so retries don’t duplicate incidents.
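For example, an ingestion call against the events endpoint above might look like the sketch below; the token handling and error wrapping are assumptions:

```ts
// Hypothetical client call to the event-ingestion endpoint described above.
async function pushEvent(baseUrl: string, token: string, event: object, idempotencyKey: string) {
  const response = await fetch(`${baseUrl}/api/v1/events`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": `Bearer ${token}`,
      "Idempotency-Key": idempotencyKey, // same key on retry -> same result, no duplicate event
    },
    body: JSON.stringify(event),
  });
  if (!response.ok) {
    // Structured errors (see the example above) carry a stable code for programmatic handling.
    const body = await response.json().catch(() => null);
    throw new Error(`Ingestion failed: ${response.status} ${body?.error?.code ?? ""}`);
  }
  return response.json();
}
```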
Apply reasonable rate limits per token (and stricter limits for ingestion endpoints), sanitize inputs, and validate timestamps/time zones. Prefer scoped API tokens (read-only reporting vs. write access to incidents), and always log who called what endpoint for traceability (details in your audit log section at /blog/audit-logs).
SLA numbers are only useful if people trust them. Testing for an SLA tracking web app should focus less on “does the page load” and more on “does time math behave exactly the way the contract says.” Treat your calculation rules as a product feature with its own test suite.
Start by unit testing your SLA calculation engine with deterministic inputs: a timeline of events (incident opened, acknowledged, mitigated, resolved) and a clearly defined SLA rule set.
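As a sketch of that style of test using Node’s built-in test runner, where computeTimeToAcknowledge is a hypothetical stand-in for your engine’s real entry point:

```ts
import test from "node:test";
import assert from "node:assert/strict";

// Hypothetical, simplified engine function under test: minutes between two timeline events.
function computeTimeToAcknowledge(timeline: { type: string; at: string }[]): number {
  const opened = timeline.find((e) => e.type === "incident_opened");
  const acked = timeline.find((e) => e.type === "incident_acknowledged");
  if (!opened || !acked) throw new Error("incomplete timeline");
  return (Date.parse(acked.at) - Date.parse(opened.at)) / 60_000;
}

test("time to acknowledge is computed from fixed, deterministic timestamps", () => {
  // Frozen inputs: the test never reads the real clock.
  const timeline = [
    { type: "incident_opened", at: "2026-03-06T17:55:00Z" },
    { type: "incident_acknowledged", at: "2026-03-06T18:10:00Z" },
  ];
  assert.equal(computeTimeToAcknowledge(timeline), 15);
});
```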
Use fixed timestamps and “freeze time” so your tests never depend on the clock. Cover edge cases that often break SLA compliance reporting:
Add a small set of end-to-end tests that run the full flow: ingest events → calculate compliance → generate report → render UI. These catch mismatches between “what the engine computed” and “what the dashboard shows.” Keep the scenarios few but high value, and assert on final numbers (availability %, breach yes/no, time-to-ack).
Create test fixtures for business hours, holidays, and time zones. You want repeatable cases like “incident occurs Friday 17:55 local time” and “holiday shifts response time counting.”
Testing doesn’t stop at deploy. Add monitoring for job failures, queue/backlog size, recalculation duration, and error rates. If ingestion lags or a nightly job fails, your SLA report can be wrong even if the code is correct.
Shipping an SLA tracking app is less about fancy infrastructure and more about predictable operations: your calculations must run on time, your data must be safe, and reports must be reproducible.
Start with managed services so you can focus on correctness.
Keep environments minimal: dev → staging → prod, each with its own database and secrets.
SLA tracking isn’t purely request/response; it depends on scheduled work.
Run jobs via a worker process + queue, or a managed scheduler invoking internal endpoints. Make jobs idempotent (safe to retry) and log every run for auditability.
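A sketch of one way to make a scheduled recalculation idempotent, keyed on the period and versions it covers; the in-memory run store is an assumption (use a unique-constrained table in practice):

```ts
// Hypothetical guard: a job runs once per key, so a duplicated schedule invocation is a no-op.
const completedRuns = new Set<string>(); // in production: a table with a unique constraint

async function runOnce(jobKey: string, work: () => Promise<void>, log: (msg: string) => void) {
  if (completedRuns.has(jobKey)) {
    log(`skip ${jobKey}: already completed`);
    return;
  }
  log(`start ${jobKey}`);
  await work();              // the actual recalculation; safe to retry if it throws
  completedRuns.add(jobKey); // marked done only after success
  log(`done ${jobKey}`);
}

// Example key: "recalculate-sla:2026-03:policy-v3:engine-v12"
```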
Define retention by data type: keep derived compliance results longer than raw event streams. For exports, offer CSV first (fast, transparent), then PDF templates later. Be clear: exports are “best-effort formatting,” while the database remains the source of truth.
If you want to validate your data model, ingestion flow, and reporting UI quickly, a vibe-coding platform like Koder.ai can help you get to a working end-to-end prototype without committing to a full engineering cycle up front. Because Koder.ai generates full applications via chat (web UI plus backend), it’s a practical way to spin up:
Once the requirements and calculations are proven (the hard part), you can iterate, export the source code, and move into a more traditional build-and-operate workflow—while keeping features like snapshots and rollback available during rapid iteration.
An SLA tracker answers one question with evidence: did you meet the contractual commitments for a specific customer and time period?
In practice, it means ingesting raw signals (monitoring, tickets, manual updates), applying the customer’s rules (business hours, exclusions), and producing an audit-friendly pass/fail plus supporting details.
Use an SLO for your internal reliability targets and an SLA for the contractual commitments you report to customers.
Model them separately so you can improve reliability (SLO) without accidentally changing contractual reporting (SLA).
A strong MVP usually tracks 1–3 metrics end-to-end:
These map cleanly to real data sources and force you to implement the tricky parts (periods, calendars, exclusions) early.
Requirements failures usually come from unstated rules. Collect and write down:
If a rule can’t be expressed clearly, don’t try to “infer” it in code—flag it and get it clarified.
Start with boring, explicit entities:
Aim for traceability: every reported number should link back to the exact events and the policy version used to compute it.
Store time correctly and consistently:
- occurred_at in UTC with timezone semantics
- received_at (when you ingested it)

Then make periods explicit (start/end timestamps) so you can reproduce reports later—even across DST changes.
Normalize everything into a single internal event shape with a stable unique ID:
- event_id (unique, stable across retries)
- source, event_type, occurred_at, service_id
- incident_id and attributes

Enforce idempotency with a unique constraint on event_id. For missing mappings or out-of-order arrivals, quarantine/flag them—don’t silently “fix” the data.

Compute durations by summing intervals on a timeline, not by subtracting two timestamps.
Define “chargeable time” explicitly by removing intervals that don’t count, such as:
Persist the derived intervals and the reason codes so you can explain exactly what was counted.
Track two denominators explicitly: total minutes in the period, and eligible minutes after agreed exclusions.
Then calculate:
availability_percent = 100 * (eligible - downtime) / eligible

Also decide what happens if eligible minutes is zero (e.g., show “N/A” or treat the period as 100%). Document this rule and apply it consistently.
Make the UI answer “are we meeting the SLA, and why?” in one glance:
For alerts, prioritize actionable triggers: approaching breach, breach occurred, and repeated violations—each linking to relevant pages like /customers/{id} or /services/{id}.