Common Alert Workflows

This guide covers common alert configuration patterns for different monitoring scenarios. Use these workflows as starting points and adjust them based on your team’s specific requirements.

Production Website Monitoring

For business-critical production websites where downtime directly impacts revenue or user experience:

  • Check interval: 1 minute
  • Alert delay: 2 consecutive failures
  • Location confirmation: 2 or more locations
  • Notification channels: Slack channel for immediate team visibility, PagerDuty for on-call escalation

This configuration balances quick detection with false positive reduction. The 2-minute effective delay (two 1-minute checks) catches real outages promptly while filtering out momentary network blips.
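The "consecutive failures confirmed from multiple locations" pattern above can be sketched in a few lines. This is a minimal illustration, not any particular monitoring product's API; the class and method names (`AlertPolicy`, `record_round`, `should_alert`) are hypothetical.

```python
from collections import deque

class AlertPolicy:
    """Sketch of a consecutive-failure alert policy (hypothetical names)."""

    def __init__(self, failure_threshold=2, min_failing_locations=2):
        self.failure_threshold = failure_threshold          # consecutive failed rounds before alerting
        self.min_failing_locations = min_failing_locations  # locations that must agree per round
        self.history = deque(maxlen=failure_threshold)      # only the most recent rounds matter

    def record_round(self, failing_locations):
        """Record one check round: the set of locations that reported failure."""
        self.history.append(set(failing_locations))

    def should_alert(self):
        """Alert only after `failure_threshold` rounds, each confirmed by enough locations."""
        if len(self.history) < self.failure_threshold:
            return False
        return all(len(locs) >= self.min_failing_locations for locs in self.history)

policy = AlertPolicy(failure_threshold=2, min_failing_locations=2)
policy.record_round({"us-east", "eu-west"})   # round 1: both locations fail
policy.record_round({"us-east", "eu-west"})   # round 2: still failing
print(policy.should_alert())                  # True: a real outage, not a blip
```

A momentary blip seen from a single location never accumulates two multi-location rounds, so it is filtered out exactly as described above.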

API Endpoint Monitoring

For API services where response time matters as much as availability:

  • Check interval: 1 minute
  • Alert delay: 3 consecutive failures
  • Location confirmation: 1 to 2 locations, depending on business priority
  • Response time threshold: Alert if response exceeds 2 seconds
  • Notification channels: Slack for the API team, email to the engineering lead

API endpoints can experience brief latency spikes during deployments or traffic surges. The 3-failure threshold and location confirmation requirement help distinguish temporary slowdowns from actual service degradation.
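For APIs, a check should count as failed when the endpoint either errors or responds too slowly. The sketch below shows one way to combine the 2-second response-time threshold with the 3-failure rule; the function names and the list-of-tuples check format are illustrative assumptions, not a real monitoring API.

```python
RESPONSE_TIME_LIMIT = 2.0  # seconds, from the workflow above
FAILURE_THRESHOLD = 3      # consecutive failing checks before alerting

def check_failed(status_code, response_seconds):
    """A check fails if the endpoint errors OR responds slower than the limit."""
    return status_code >= 500 or response_seconds > RESPONSE_TIME_LIMIT

def should_alert(recent_checks):
    """recent_checks: list of (status_code, response_seconds), newest last."""
    if len(recent_checks) < FAILURE_THRESHOLD:
        return False
    return all(check_failed(code, secs) for code, secs in recent_checks[-FAILURE_THRESHOLD:])

# A brief latency spike during a deploy does not page anyone:
spike = [(200, 0.3), (200, 2.4), (200, 0.4)]
print(should_alert(spike))     # False: only one slow check

# Sustained degradation does:
degraded = [(200, 2.6), (503, 0.1), (200, 3.1)]
print(should_alert(degraded))  # True: three consecutive failing checks
```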

Internal Tools and Dashboards

For internal applications used by your team but not customer-facing:

  • Check interval: 5 minutes
  • Alert delay: 2 consecutive failures
  • Location confirmation: Any location
  • Notification channels: Email to the responsible team, Slack channel

Internal tools typically have more tolerance for brief outages. The longer check interval reduces monitoring overhead while still catching extended downtime within 10 minutes.
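The "within 10 minutes" figure above follows from a simple calculation: in the worst case, an outage begins just after a passing check, so the alert fires after the check interval multiplied by the required number of consecutive failures. A quick sketch (the function name is illustrative):

```python
def worst_case_detection_minutes(check_interval_min, consecutive_failures):
    """Worst case: the outage starts just after a passing check, so every
    required failing check takes one full interval to arrive."""
    return check_interval_min * consecutive_failures

# Internal tools workflow: 5-minute interval, 2 consecutive failures
print(worst_case_detection_minutes(5, 2))  # 10 minutes
# Production website workflow: 1-minute interval, 2 consecutive failures
print(worst_case_detection_minutes(1, 2))  # 2 minutes
```

The same arithmetic lets you compare any workflow in this guide at a glance: tightening either the interval or the failure count shortens detection time proportionally.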

Staging and Development Environments

For non-production environments where you want visibility without urgent alerts:

  • Check interval: 15 minutes
  • Alert delay: 3 consecutive failures
  • Location confirmation: Any location
  • Notification channels: Email or Slack channel (no paging)

Staging environments often go down intentionally during deployments. The longer intervals and delays prevent alert noise while still notifying the team about prolonged outages that might affect testing.

E-commerce Checkout Flow

For critical transaction paths where every minute of downtime means lost sales:

  • Check interval: 1 minute
  • Alert delay: 1 failure (immediate)
  • Location confirmation: Any location
  • Notification channels: PagerDuty with high urgency to on-call engineer, Slack to commerce team

Checkout flows warrant aggressive alerting since any failure directly impacts revenue. The trade-off of occasional false positives is acceptable given the business impact of missing a real outage.

Third-Party Service Dependencies

For monitoring external services your application depends on:

  • Check interval: 5 minutes
  • Alert delay: 2 consecutive failures
  • Location confirmation: 3 or more locations
  • Notification channels: Slack channel for awareness, no paging (since you cannot fix third-party issues)

Third-party services often have their own status pages and support channels. Your alerts serve as early warning for your team to prepare workarounds or communicate with customers, not to trigger incident response.
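The routing rule implied throughout this guide, page only for problems your team can actually fix, can be expressed as a small decision function. The channel names and parameters below are hypothetical placeholders, not a specific tool's configuration:

```python
def notification_channels(owned_by_us, business_critical):
    """Sketch of channel routing: awareness for everything,
    paging only for critical services your team owns."""
    channels = ["slack"]          # always post for team visibility
    if owned_by_us:
        channels.append("email")  # responsible team follows up
        if business_critical:
            channels.append("pagerduty")  # on-call escalation
    return channels

# Third-party dependency: awareness only, no paging
print(notification_channels(owned_by_us=False, business_critical=True))   # ['slack']
# Owned, business-critical service: full escalation
print(notification_channels(owned_by_us=True, business_critical=True))    # ['slack', 'email', 'pagerduty']
```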

Customizing These Workflows

These workflows represent common patterns, but every organization has unique requirements. Consider these factors when customizing:

  • Team size and availability: Smaller teams may need longer delays to avoid alert fatigue
  • Business hours vs. 24/7: Adjust notification channels based on when your team can respond
  • Service dependencies: Services with many dependencies may need more conservative alerting
  • Historical reliability: Services with known flaky behavior benefit from longer confirmation requirements
