Uptime Monitoring for Site Reliability Engineers

StatusDrift gives you external SLIs measured from outside your network, SLA policies that track uptime, error budget, and burn rate against the target you set, and calendar-based on-call with multi-step escalation. The signal behind your SLOs lives in the same place that pages your team and writes your post-mortem timeline.

Start Free

Read API Docs

Free forever tier. No credit card. No sales call.

Why SREs Pick StatusDrift

Your internal telemetry tells you what your cluster thinks is happening. StatusDrift tells you what the outside world actually sees — the signal your SLOs are supposed to be measured against.

Measure availability from the outside

Your pods can be healthy, your liveness probes green, your internal dashboards 100% — and users still can’t reach the service. External checks from probes outside your network catch the network, DNS, CDN, certificate, and edge problems that internal observability cannot see. That’s the “real user availability” number your SLO was always supposed to report.

Monitor, incident, and escalation in one place

A check fails, an incident opens, the on-call gets paged on the right channel, the escalation kicks in if nobody acks, and the public status page updates — all from the same product, off the same data. Fewer tools to wire together, fewer seams where the truth can drift.

SLA policies with budget built in

Set an uptime target, pick a window (calendar week, calendar month, rolling 30 days, or calendar year), attach it to the monitors it covers, and StatusDrift tracks compliance, error-budget usage, and burn rate for you, both per monitor and rolled up. Metrics that aren’t tracked natively, such as MTTR, MTTA, and custom aggregations, can be computed from check results and incident timestamps, all of which are retrievable via the REST API, so you can feed Grafana, BigQuery, or an internal dashboard alongside the built-in view.

SLA Policies With Error Budget and Burn Rate Built In

An SLO is only as honest as the SLI it’s built on. Define an SLA policy in StatusDrift, attach it to the monitors, tags, or groups it covers, and the platform tracks compliance — uptime percentage, error-budget usage, and burn rate — against the window you chose. No spreadsheet, no separate dashboard to maintain.

Configurable uptime target — pick the availability percentage that matches the commitment you’re making
Four window types — calendar week, calendar month, rolling 30 days, or calendar year
Optional response time SLA — layer a latency threshold on top of availability for services where “slow” is the same as “down”
Per-monitor compliance — every covered monitor shows compliant / at-risk / breached with its own uptime, budget used, and burn rate
Organization-wide or group-scoped policies — one policy for the whole org, or distinct policies per team, product, or environment
Prefer your own dashboard? Pull check results and incident timestamps via REST API and feed Grafana, Datadog, BigQuery, or an internal tool alongside the built-in view
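
The arithmetic behind error budget and burn rate is short enough to sketch. The Python below uses the common SRE definitions and is an illustration only, not StatusDrift's implementation; the function names and the choice to measure budget in seconds of downtime are assumptions.

```python
def error_budget_seconds(target_pct: float, window_seconds: float) -> float:
    """Downtime the window tolerates before the uptime target is missed."""
    return window_seconds * (1 - target_pct / 100)


def budget_used_fraction(downtime_seconds: float, target_pct: float,
                         window_seconds: float) -> float:
    """Share of the error budget already spent (1.0 means fully spent)."""
    return downtime_seconds / error_budget_seconds(target_pct, window_seconds)


def burn_rate(downtime_seconds: float, elapsed_seconds: float,
              target_pct: float) -> float:
    """Observed error rate divided by the allowed error rate.

    A value of 1.0 spends the budget exactly at the end of the window;
    2.0 spends it in half the window.
    """
    allowed = 1 - target_pct / 100
    observed = downtime_seconds / elapsed_seconds
    return observed / allowed


WINDOW = 30 * 24 * 3600  # rolling 30 days, in seconds
budget = error_budget_seconds(99.9, WINDOW)  # about 2592 s, i.e. 43.2 minutes
```

With a 99.9% target over 30 days, 1,296 seconds of downtime in the first 15 days uses half the budget at a burn rate of about 1.0, which is exactly on pace to exhaust it when the window closes.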

Creating and reviewing SLA policies →

Figure: SLA policy detail page showing metadata and monitor compliance results. Open a policy to review compliance details, budget usage, burn rate, and status for each monitored service.

Incident Management Built for Post-Mortems

A good post-mortem starts with an honest timeline. StatusDrift keeps one automatically — every state change, every acknowledgement, every investigation note stamped with the time it actually happened.

Automatic incident creation from failing checks, with the monitor, failure reason, and affected regions pre-attached
Full timeline — detection, first acknowledgement, each status change, investigation notes, and resolution — all timestamped
MTTR & MTTA data — pull ack/resolution timestamps via the REST API and compute mean/percentile times however your team defines them
Public status page components update automatically based on the monitors behind them — your users see the same truth your on-call does
Blameless by design — the timeline is about what happened, not who was slow. Drop it into your post-mortem template without redacting Slack threads
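
Once the timestamps are out of the API, MTTA and MTTR are a few lines of arithmetic. The sketch below works on hypothetical incident records: the field names `detected_at`, `acknowledged_at`, and `resolved_at` are illustrative and are not taken from the StatusDrift API schema, and the decision to skip unacknowledged incidents is one possible methodology, not the only one.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records. The field names are illustrative and are
# not taken from the StatusDrift API schema.
INCIDENTS = [
    {"detected_at": "2024-05-01T10:00:00+00:00",
     "acknowledged_at": "2024-05-01T10:02:00+00:00",
     "resolved_at": "2024-05-01T10:30:00+00:00"},
    {"detected_at": "2024-05-09T22:15:00+00:00",
     "acknowledged_at": "2024-05-09T22:19:00+00:00",
     "resolved_at": "2024-05-09T23:15:00+00:00"},
    {"detected_at": "2024-05-20T03:00:00+00:00",
     "acknowledged_at": None,  # resolved before anyone acknowledged
     "resolved_at": "2024-05-20T04:30:00+00:00"},
]


def minutes_between(start: str, end: str) -> float:
    """Minutes elapsed between two ISO 8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 60


def mtta_minutes(incidents) -> float:
    """Mean time to acknowledge, skipping incidents nobody acknowledged."""
    return mean(minutes_between(i["detected_at"], i["acknowledged_at"])
                for i in incidents if i["acknowledged_at"])


def mttr_minutes(incidents) -> float:
    """Mean time to resolve, measured from detection."""
    return mean(minutes_between(i["detected_at"], i["resolved_at"])
                for i in incidents if i["resolved_at"])
```

For the sample data, `mtta_minutes(INCIDENTS)` is 3.0 and `mttr_minutes(INCIDENTS)` is 60.0. Swapping `mean` for a percentile function changes the statistic without touching the rest.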

Explore incident management →

Escalation Policies & On-Call, Built In

Business-plan accounts get calendar-based on-call scheduling and multi-step escalation paths without a second product. Define who carries the pager, what happens if the first responder doesn’t acknowledge, and when to pull in the backup — then attach the schedule to any monitor.

Calendar-based rotations — weekly, daily, or custom shift patterns, with timezone-aware follow-the-sun support
Multi-step escalation — if step 1 doesn’t acknowledge within N minutes, step 2 gets paged, then step 3, until someone responds
Channel-level escalation — start on Slack, push to the on-call phone, then email the backup — all within a single policy
Overrides — cover holidays, PTO, or last-minute swaps without rewriting the schedule
Already on PagerDuty or Opsgenie? Route natively and your existing rotations handle paging. Use StatusDrift’s built-in on-call, theirs, or both per monitor
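
The multi-step logic above can be modeled in a few lines. This is a sketch of how such a policy might be evaluated, not StatusDrift's engine; the `Step` type, the sample policy, and the rule that each step's wait starts when that step is paged are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    target: str        # who gets notified
    channel: str       # how they get notified
    wait_minutes: int  # time this step has to acknowledge before the next one


# Hypothetical policy: Slack first, then the on-call phone, then email the backup.
POLICY = [
    Step("primary on-call", "slack", 5),
    Step("primary on-call", "phone", 10),
    Step("backup", "email", 15),
]


def steps_paged(policy, minutes_unacknowledged):
    """Return the steps already notified after this long without an ack."""
    paged = []
    starts_at = 0  # minute at which the current step is triggered
    for step in policy:
        if minutes_unacknowledged < starts_at:
            break
        paged.append(step)
        starts_at += step.wait_minutes
    return paged
```

Under these assumptions, an incident that has gone 5 minutes without an acknowledgement has reached the phone step, and the backup is emailed at minute 15.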

Explore escalation policies →

On-call schedule with multi-step escalation policy and calendar-based rotations

Alert Tuning That Respects the On-Call

Every false page costs trust in the entire alerting system. Per-monitor thresholds let you tune the signal-to-noise ratio service by service, instead of one blunt global setting.

30-second checks on paid plans

Paid monitors run every 30 seconds from multiple regions. Free monitors run every 5 minutes. Pick the frequency that matches how much downtime actually burns budget for that service — not a single account-wide cadence.

Per-monitor thresholds

Set a notification delay, a consecutive-checks-down threshold, and a locations-down threshold per monitor. Page instantly on the critical-tier API, wait out a blip on a lower-tier service.

Smart alerting docs →

Multi-region verification

Require failures from multiple regions before a monitor goes red. A single checking node having a bad minute doesn’t page anyone; a confirmed regional or global incident does. Fewer 3am false positives, more trust in the alerts that fire.
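
To make the interaction between the three per-monitor thresholds concrete, here is a minimal sketch of one way they could combine. The semantics are assumptions for illustration and may differ from how StatusDrift evaluates them: a round counts as down only when enough regions fail, the last N rounds must all be down, and the notification delay must have elapsed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    consecutive_checks_down: int     # rounds in a row that must be down
    locations_down: int              # regions that must fail within a round
    notification_delay_seconds: int  # grace period before anyone is paged


def should_alert(rounds, thresholds, seconds_since_first_failure):
    """Decide whether to page.

    rounds: recent check rounds, oldest first. Each round maps a region
    name to True (check passed) or False (check failed).
    """
    n = thresholds.consecutive_checks_down
    if len(rounds) < n:
        return False
    confirmed = all(
        sum(1 for ok in r.values() if not ok) >= thresholds.locations_down
        for r in rounds[-n:]
    )
    delay_elapsed = (seconds_since_first_failure
                     >= thresholds.notification_delay_seconds)
    return confirmed and delay_elapsed


# Hypothetical tiers: page at once on the critical API, wait out blips elsewhere.
CRITICAL = Thresholds(consecutive_checks_down=1, locations_down=1,
                      notification_delay_seconds=0)
LOW_TIER = Thresholds(consecutive_checks_down=3, locations_down=2,
                      notification_delay_seconds=120)
```

With these sample values, a single failing region pages immediately for `CRITICAL` but never for `LOW_TIER`, which needs two regions down for three rounds in a row and two minutes of sustained failure.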

Monitoring as Code

Everything you can do in the dashboard, you can do in Terraform or the REST API. Keep monitors in version control next to the services they watch, review them in pull requests, and apply them in the same CI pipeline that ships your app.

Terraform provider

Declare monitors, alert contacts, on-call schedules, escalation policies, maintenance windows, and status pages as resources. Monitoring ships in the same PR as the service it watches, and decommissioning the service removes the monitor too.

View Terraform provider

Full REST API

Every dashboard action has a REST endpoint — monitors, alert contacts, maintenance windows, incidents, schedules, status pages, and teams. Pull check results and incident timelines to feed your own SLO dashboard or post-mortem pipeline.

Read API docs

CI/CD & webhooks

Open and resolve incidents from deploy pipelines. Schedule maintenance windows before a canary. Send every monitor and incident event to your own webhook for custom routing — or forward to a Prometheus pushgateway, BigQuery sink, or SIEM.

Alerts Where Your SRE Team Already Lives

Attach any combination of channels to each monitor, or hand it off to a calendar-based escalation path. Page Slack for lower-tier services, push to the on-call phone for the critical tier, forward incidents to PagerDuty for carrier-grade paging — all configured per monitor.

PagerDuty

Opsgenie

Slack

Microsoft Teams

Mobile Push

Webhook

Discord

Also supported: Telegram, Google Chat, Mattermost, ServiceNow, Splunk On-Call, Pushover, Pushbullet, Zapier, n8n, and generic webhooks for anything we haven’t listed. Need SMS or voice? Route through PagerDuty or Opsgenie and let them page.

View All Integrations

What You Actually Get on the Free Plan

No credit card, no trial clock. Enough to watch a homelab, a side project, or the critical-tier services you want external-check coverage on while you prove the tool out.

Up to 5 monitors
5-minute check interval
HTTP, keyword, ping, port, and cron/heartbeat checks
SSL certificate expiry monitoring
Multi-region checks & configurable alert thresholds
Email, Slack & webhook alerts
One public status page
90-day data retention
Full REST API access
No credit card, no time limit

Questions SRE Teams Usually Ask

Does StatusDrift compute SLOs and error budgets for me?

Yes. Define an SLA policy with an uptime target (and an optional response-time threshold), attach it to the monitors, tags, or groups it covers, and StatusDrift tracks uptime percentage, error-budget usage, burn rate, and compliance status (compliant / at-risk / breached) against the window you chose — calendar week, calendar month, rolling 30 days, or calendar year. See SLA policy docs →
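
The four window types differ only in where the measurement period starts. The sketch below shows one plausible reading; the window names, the Monday week start, and the use of the timestamp's own timezone are assumptions, not documented StatusDrift behavior.

```python
from datetime import datetime, timedelta, timezone


def window_start(kind: str, now: datetime) -> datetime:
    """Start of the SLA window that contains `now`."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    if kind == "calendar_week":
        # Assumes weeks start on Monday (weekday() == 0).
        return midnight - timedelta(days=now.weekday())
    if kind == "calendar_month":
        return midnight.replace(day=1)
    if kind == "rolling_30_days":
        # Rolling windows move with the clock instead of resetting.
        return now - timedelta(days=30)
    if kind == "calendar_year":
        return midnight.replace(month=1, day=1)
    raise ValueError(f"unknown window type: {kind}")


now = datetime(2024, 5, 15, 12, 30, tzinfo=timezone.utc)  # a Wednesday
```

The practical difference: calendar windows reset the error budget at a boundary, while a rolling 30-day window keeps an old outage on the books until it ages out.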

How do I track MTTR and MTTA?

Every incident exposes detection, acknowledgement, and resolution timestamps. Pull them from the REST API and compute mean/percentile values however your team defines “acknowledged” and “resolved” — so the numbers match the methodology in your post-mortem template, not a vendor’s.

Does on-call scheduling come built in?

Yes, on the Business plan — calendar-based rotations and multi-step escalation policies live in StatusDrift. If you already live in PagerDuty or Opsgenie, route there natively and use your existing rotations. Both models work per monitor, and you can mix them.

Can I define monitors and schedules in Terraform?

Yes. The StatusDrift Terraform provider covers monitors, alert contacts, on-call schedules, escalation policies, maintenance windows, status pages, and teams. Keep them in the same repo as the service they watch, diff in PRs, apply in CI.

Can I silence alerts during a canary or deploy?

Yes. Schedule one-off or recurring maintenance windows from the dashboard, the REST API, or your deploy pipeline. Alerts stay silent for the window, and the public status page shows scheduled maintenance — not an incident that burns budget.
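
If you recompute uptime in your own dashboard, you will want to apply the same exclusion. A minimal sketch, assuming times are epoch seconds and that maintenance windows do not overlap one another; the function name and tuple layout are illustrative.

```python
def counted_downtime(outage, maintenance_windows):
    """Seconds of an outage that fall outside scheduled maintenance.

    outage: (start, end) in epoch seconds.
    maintenance_windows: list of (start, end) pairs that do not overlap
    each other.
    """
    start, end = outage
    total = end - start
    for m_start, m_end in maintenance_windows:
        # Length of the intersection between the outage and this window.
        overlap = min(end, m_end) - max(start, m_start)
        if overlap > 0:
            total -= overlap
    return total
```

An outage from second 100 to second 400 with maintenance scheduled from 200 to 300 counts as 200 seconds of budget-burning downtime; an outage that sits entirely inside a window counts as zero.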

Does the status page stay up when production doesn’t?

Yes. Status pages are hosted on infrastructure separate from the services they report on. When your production goes dark, the page your users check stays online — which is the whole point of having one.

External Signal for Serious SRE

Defend your SLOs with honest, outside-the-network availability data, and keep the incident timeline in the same place that’s paging the on-call.


No credit card required

Free forever tier

Cancel anytime