Alerting principles
Every alert = actionable
An alert should say "what is broken right now" and "what to do about it". If it fires and the team says "meh, whatever", you do not need it.
Pages only for critical
severity: critical → PagerDuty / SMS. Only when an immediate response is required, 24/7.
severity: warning → the team's Slack channel. Reviewed during working hours.
severity: info → an audit channel. For context, not for action.
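The severity-to-channel mapping can be sketched as a small lookup table. This is only an illustration: the target names ("pagerduty", "slack:team-channel", and so on) and the route function are hypothetical, not part of any real notification tool.

```python
# Hypothetical routing table: target names are placeholders for illustration.
ROUTES = {
    "critical": ["pagerduty", "sms"],   # pages a human, 24/7
    "warning": ["slack:team-channel"],  # reviewed during working hours
    "info": ["slack:audit-channel"],    # context only, nobody is expected to act
}


def route(severity: str) -> list[str]:
    """Return the notification targets for a severity level."""
    try:
        return ROUTES[severity]
    except KeyError:
        # Refuse unknown severities instead of silently paging or dropping them.
        raise ValueError(f"unknown severity: {severity!r}") from None


if __name__ == "__main__":
    for level in ("critical", "warning", "info"):
        print(level, "->", route(level))
```

Rejecting unknown severities is a design choice in this sketch: a typo such as "critcal" then fails loudly at configuration time rather than going unnoticed during an incident.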
For-duration protects against a blip
for_duration_seconds: 60 is the standard. A shorter window lets brief spikes through and the alert flaps. A longer window delays the moment you notice a real incident.
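A minimal sketch of how a for-duration check might behave: the alert fires only once the condition has held continuously for the whole window, and any recovery resets the timer. The ForDurationGate class and its observe method are hypothetical names used for illustration; real alerting systems may evaluate this differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ForDurationGate:
    """Fires only after the condition has held for for_duration_seconds."""

    for_duration_seconds: float = 60.0
    _breach_started_at: Optional[float] = None

    def observe(self, breaching: bool, now: float) -> bool:
        """Record one evaluation at time `now` (seconds); return True if the alert should fire."""
        if not breaching:
            # Any recovery resets the timer, so a short blip never fires.
            self._breach_started_at = None
            return False
        if self._breach_started_at is None:
            self._breach_started_at = now
        return now - self._breach_started_at >= self.for_duration_seconds


if __name__ == "__main__":
    gate = ForDurationGate(for_duration_seconds=60)
    # (time in seconds, is the threshold breached?)
    samples = [(0, True), (30, True), (40, False), (50, True), (110, True)]
    for t, breaching in samples:
        print(t, breaching, gate.observe(breaching, now=t))
```

In the sample run the breach at 0 to 30 seconds is a blip that clears at 40 seconds, so nothing fires. The breach that starts at 50 seconds fires at 110 seconds, once it has lasted the full 60.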
Anomaly for business metrics
A fixed threshold does not work when "normal" depends on the time of day. An anomaly check with offset=7d compares the current value with the same moment one week earlier, so it catches "not like the usual Tuesday".
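The idea can be sketched as a comparison against the value from exactly seven days earlier. Everything here is an assumption made for illustration: the function names, the 30% relative tolerance, and the in-memory dict standing in for a metrics store.

```python
from datetime import datetime, timedelta

OFFSET = timedelta(days=7)  # corresponds to offset=7d


def is_anomalous(current: float, baseline: float, tolerance: float = 0.3) -> bool:
    """True if current deviates from baseline by more than `tolerance` (relative)."""
    if baseline == 0:
        return current != 0
    return abs(current - baseline) / abs(baseline) > tolerance


def check(series: dict[datetime, float], at: datetime, tolerance: float = 0.3) -> bool:
    """Compare the value at `at` with the value one OFFSET earlier."""
    baseline = series.get(at - OFFSET)
    if baseline is None:
        return False  # no history yet: stay silent rather than guess
    return is_anomalous(series[at], baseline, tolerance)


if __name__ == "__main__":
    # Hypothetical orders-per-minute samples on two consecutive Tuesdays.
    series = {
        datetime(2024, 1, 2, 3, 0): 38.0,    # last Tuesday, 03:00
        datetime(2024, 1, 2, 12, 0): 400.0,  # last Tuesday, 12:00
        datetime(2024, 1, 9, 3, 0): 40.0,    # this Tuesday, 03:00
        datetime(2024, 1, 9, 12, 0): 40.0,   # this Tuesday, 12:00
    }
    print(check(series, datetime(2024, 1, 9, 3, 0)))
    print(check(series, datetime(2024, 1, 9, 12, 0)))
```

The same value of 40 is normal at 03:00 and a serious drop at noon. A static threshold cannot express that: any fixed line either fires every night or misses the midday drop.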