5 On-Call Best Practices Every SRE Team Should Follow

On-call gets a bad reputation, and honestly, a lot of teams earn it. Unclear escalation paths, 3 AM pages for non-issues, rotations that land on the same two people every weekend. It doesn’t have to be this way.
The best SRE teams treat on-call as a first-class operational concern — not an afterthought bolted onto whoever deployed last. Here are five practices that separate functional on-call from the kind that makes engineers update their LinkedIn.
1. Define What Actually Deserves a Page
This sounds obvious. It isn’t.
Most teams start with good intentions: “We’ll only page for critical issues.” But over time, alerts accumulate. Someone adds a CPU threshold. Another person hooks up a log pattern match. Before you know it, on-call engineers are getting 40 notifications per shift, and maybe three of them require action.
The rule is simple: If the alert doesn’t require immediate human intervention to prevent or mitigate user impact, it shouldn’t page anyone.
Everything else belongs in a dashboard, a Slack channel, or a daily review queue. Be aggressive about this. When in doubt, don’t page.
How to implement it:
- Audit every paging alert quarterly. For each one, ask: “In the last 90 days, did this ever require someone to take action within 15 minutes?”
- If the answer is no, downgrade it.
- Maintain a written definition of P1 (page-worthy) vs P2 (next business day) vs P3 (informational).
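The P1/P2/P3 policy above can be expressed as a simple decision rule that runs over your alert definitions. A minimal sketch, assuming each rule records whether firing implies user impact and whether a human must act within ~15 minutes (field and function names here are illustrative, not from any particular monitoring tool):

```python
# Sketch: mapping alert rules to notification tiers per the policy above.
# "requires_immediate_action" and "user_impact" are assumed fields, not a
# real tool's schema.

from dataclasses import dataclass

@dataclass
class AlertRule:
    name: str
    user_impact: bool                # does firing imply user-visible impact?
    requires_immediate_action: bool  # must a human act within ~15 minutes?

def severity(rule: AlertRule) -> str:
    """Map an alert rule to a notification tier."""
    if rule.user_impact and rule.requires_immediate_action:
        return "P1"  # page the on-call engineer
    if rule.user_impact:
        return "P2"  # next-business-day review queue
    return "P3"      # dashboard / informational only

rules = [
    AlertRule("checkout-error-rate", user_impact=True, requires_immediate_action=True),
    AlertRule("disk-70-percent", user_impact=False, requires_immediate_action=False),
]
for r in rules:
    print(r.name, severity(r))  # checkout-error-rate P1 / disk-70-percent P3
```

Running a check like this during the quarterly audit makes downgrades mechanical: if an alert's answers changed since last quarter, its tier changes with them.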
2. Build Fair, Sustainable Rotations
Bad rotations create resentment faster than almost anything else in engineering. When the same people keep getting stuck with holiday shifts, or when rotations don’t account for team size changes, trust erodes.
What fair looks like:
- Equal distribution. Everyone in the rotation gets roughly the same number of weekday and weekend shifts per quarter.
- Reasonable shift lengths. 1-week rotations are common, but some teams do better with 3-4 day rotations for high-volume services.
- Follow-the-sun where possible. If you have team members across time zones, use it. Nobody should be on-call for 24 hours straight if there’s a teammate who could cover a different 12.
- Override support. Life happens. Make it easy to swap shifts with zero bureaucracy.
What to avoid:
- “Voluntold” rotations where the newest engineer always gets the worst slots.
- Rotations that don’t include senior engineers. If leadership doesn’t carry on-call, they won’t prioritize fixing alert quality.
- On-call without compensation or time-off-in-lieu. This tells your team you don’t value their time.
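Fairness is easy to verify if the schedule is data. A minimal sketch of a round-robin weekly rotation plus a load check, assuming one-week shifts and ignoring overrides and swaps (which any real scheduler must handle):

```python
# Sketch: generate a round-robin weekly rotation and check shift balance.
# Engineer names and the quarter length are illustrative.

from collections import Counter
from datetime import date, timedelta

def weekly_rotation(engineers, start: date, weeks: int):
    """Assign each week to an engineer round-robin; return (week_start, name) pairs."""
    return [(start + timedelta(weeks=i), engineers[i % len(engineers)])
            for i in range(weeks)]

schedule = weekly_rotation(["ana", "ben", "chen", "dev"], date(2024, 1, 1), 13)
load = Counter(name for _, name in schedule)
print(load)
# A 13-week quarter over 4 people can't divide evenly: one engineer carries
# an extra week. Rotate who that is from quarter to quarter.
```

The same counting approach extends to weekend and holiday shifts: count them per person per quarter, and treat any gap larger than one shift as a scheduling bug.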
3. Every Alert Gets a Runbook
At 3 AM, nobody is at their cognitive best. Your on-call engineer shouldn't need to reverse-engineer a production system from first principles. They need a clear, specific set of steps.
A good runbook answers:
- What is this alert telling me?
- What’s the likely user impact?
- What should I check first? (Dashboard links, log queries, specific hosts)
- What are the common fixes?
- When should I escalate, and to whom?
A runbook doesn’t need to be a novel. The best ones are 10-20 bullet points with links. They’re updated every time someone handles the incident and discovers the runbook was missing something.
The test: Could a new team member who’s never seen this alert before follow the runbook and either resolve or correctly escalate the issue? If not, the runbook isn’t done.
Pro tip: Link the runbook directly in the alert payload. Don’t make people search a wiki at 3 AM. The alert message itself should contain the URL.
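The "runbook in the payload" rule is also checkable in CI. A minimal sketch that flags paging rules shipping without a runbook link, assuming rules are plain dicts with a `runbook_url` field (the field names are illustrative, not a specific tool's schema):

```python
# Sketch: fail fast on any paging (P1) rule that lacks a runbook link.
# Rule structure is an assumption for illustration.

paging_rules = [
    {"name": "api-5xx-rate", "severity": "P1",
     "runbook_url": "https://wiki.example.com/runbooks/api-5xx"},
    {"name": "queue-depth", "severity": "P1", "runbook_url": None},
]

def missing_runbooks(rules):
    """Return names of P1 rules with no runbook link attached."""
    return [r["name"] for r in rules
            if r["severity"] == "P1" and not r.get("runbook_url")]

print(missing_runbooks(paging_rules))  # ['queue-depth']
```

Wiring this into the pipeline that deploys alert rules turns "every alert gets a runbook" from a norm into an enforced invariant.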
4. Do Blameless Post-Incident Reviews
Every significant incident should get a post-incident review (sometimes called a postmortem, though “blameless review” better captures the intent). This isn’t about finding fault. It’s about finding the systemic issues that let the incident happen.
Why this matters for on-call:
- Reviews surface alert gaps (things that should have been detected earlier).
- They identify alert noise (things that fired but weren’t useful during the incident).
- They reveal routing problems (right alert, wrong person).
- They build institutional knowledge that reduces MTTR for future incidents.
The format doesn’t need to be heavy:
- Timeline: What happened, when?
- Detection: How did we find out? (Alert? Customer report? Dashboard?)
- Response: What did the on-call engineer do?
- Resolution: What fixed it?
- Action items: What changes do we make to prevent recurrence or improve response?
The crucial part: Follow through on action items. A review that generates five tasks and zero completions is worse than no review at all — it teaches the team that reviews are performative.
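Follow-through is measurable. A minimal sketch that computes an action-item completion rate across reviews, assuming each review records its items and whether each was done (the data shape is illustrative):

```python
# Sketch: measure follow-through on post-incident action items.
# Review structure is an assumption for illustration.

reviews = [
    {"incident": "2024-03-12 checkout outage",
     "action_items": [{"done": True}, {"done": True}, {"done": False}]},
    {"incident": "2024-04-02 cache stampede",
     "action_items": [{"done": False}, {"done": False}]},
]

def completion_rate(reviews):
    """Fraction of all action items across reviews that were completed."""
    items = [item for r in reviews for item in r["action_items"]]
    return sum(item["done"] for item in items) / len(items)

print(f"{completion_rate(reviews):.0%}")  # 40%
```

Tracking this number over time tells you whether reviews are producing change or producing theater.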
5. Invest in Alert Routing and Escalation
The right alert needs to reach the right person through the right channel at the right time. This sounds straightforward, but most teams get at least one of these wrong.
Common routing failures:
- Everyone gets everything. A shared PagerDuty or Slack channel where all alerts land. This creates diffusion of responsibility — “someone else will get it.”
- Single point of failure. One person is the expert for a service, so all alerts go to them. When they’re unavailable, nobody knows what to do.
- Wrong channel for urgency. P1 alerts going to Slack (where they might not be seen for 20 minutes) instead of SMS or phone call.
Good routing looks like:
- Service ownership mapping. Every alerting rule is tied to a specific team and rotation.
- Escalation policies. If the primary on-call doesn’t acknowledge within 5 minutes, it escalates to the secondary. If the secondary doesn’t acknowledge within 5 more, it goes to the team lead.
- Multi-channel delivery. Critical alerts go to phone/SMS. Important-but-not-critical goes to Slack or push notification. Informational goes to a dashboard.
- Schedule awareness. The system knows who’s on-call right now and routes accordingly — no manual updating of distribution lists.
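The escalation policy described above is just tiers plus timeouts, and can be modeled as data. A minimal sketch that, given how long an alert has gone unacknowledged, resolves who should be paged now (tier names and timeouts mirror the example; none of this reflects a specific tool's API):

```python
# Sketch: an escalation policy as data, plus a resolver that walks tiers
# based on minutes elapsed without an acknowledgement. Illustrative only.

ESCALATION_POLICY = [
    {"target": "primary on-call",   "ack_timeout_min": 5},
    {"target": "secondary on-call", "ack_timeout_min": 5},
    {"target": "team lead",         "ack_timeout_min": None},  # last resort
]

def current_target(policy, minutes_unacked: int) -> str:
    """Walk tiers until the elapsed unacked time fits within a tier's window."""
    elapsed = minutes_unacked
    for tier in policy:
        timeout = tier["ack_timeout_min"]
        if timeout is None or elapsed < timeout:
            return tier["target"]
        elapsed -= timeout
    return policy[-1]["target"]

print(current_target(ESCALATION_POLICY, 0))   # primary on-call
print(current_target(ESCALATION_POLICY, 7))   # secondary on-call
print(current_target(ESCALATION_POLICY, 12))  # team lead
```

Keeping the policy as data (rather than hard-coded logic) is what makes schedule awareness possible: the resolver looks up who currently fills "primary on-call" from the rotation, not from a static list.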
Putting It All Together
Good on-call isn’t about any single practice. It’s about the system: clear definitions of what pages, fair rotation of who gets paged, runbooks for what to do, reviews to learn from what happened, and routing to make sure the right person gets the right alert.
Each of these practices reinforces the others. Fair rotations keep people engaged enough to write good runbooks. Good runbooks reduce MTTR, which makes reviews more productive. Better routing reduces noise, which makes on-call bearable.
Start with whichever one your team is weakest on. You don’t need to fix everything at once. But don’t ignore on-call health and expect your team to stay.
Quick Self-Assessment
Rate your team on each practice (1-5):
| Practice | Score |
|---|---|
| Alert definitions are clear and enforced | _/5 |
| Rotations are fair and sustainable | _/5 |
| Every paging alert has a runbook | _/5 |
| Post-incident reviews happen and produce action | _/5 |
| Alert routing matches urgency and ownership | _/5 |
If your total is under 15, you have significant on-call debt. Start with the lowest-scoring area.
PagerBolt helps teams route alerts to the right people through the right channels, with built-in escalation and multi-channel delivery. Join the waitlist to learn more.