One action item from our postmortem last week was to improve how we get alerted about serious incidents. Today, we held a meeting with people from all teams to see what needs to be done. We spent some time talking about what kinds of alerts we want to start with, because last week’s incident was caused by one application that still looked like it was up and running but wasn’t actually working. If we had noticed this problem and fixed it, the other apps wouldn’t have crashed three hours later. In other words, if we had been properly alerted about the cause, the symptom “a different app crashed” wouldn’t even have materialised.
For right now, though, it’s more important to us to have symptom-based monitoring just for “app is down” in a central place, where everyone who needs to be alerted can set up whatever notification scheme works best for them.
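To make the idea of a central “app is down” check with per-person notifications a bit more concrete, here is a minimal sketch in Python. It is not what we have built or a description of any particular tool: the names `CentralMonitor`, `register_app`, `subscribe` and `check_all` are made up for illustration, and the probes are plain functions that stand in for whatever health check an app would really expose.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical types for illustration; not taken from any monitoring product.
Probe = Callable[[], bool]        # returns True if the app responds
Notifier = Callable[[str], None]  # receives a human-readable alert message


@dataclass
class CentralMonitor:
    """One central place that knows all apps and who wants to hear about them."""
    probes: Dict[str, Probe] = field(default_factory=dict)
    subscribers: Dict[str, List[Notifier]] = field(default_factory=dict)
    last_up: Dict[str, bool] = field(default_factory=dict)

    def register_app(self, app: str, probe: Probe) -> None:
        self.probes[app] = probe
        self.subscribers.setdefault(app, [])

    def subscribe(self, app: str, notifier: Notifier) -> None:
        # Each person or team plugs in the notification scheme they prefer.
        self.subscribers.setdefault(app, []).append(notifier)

    def check_all(self) -> None:
        for app, probe in self.probes.items():
            try:
                up = bool(probe())
            except Exception:
                up = False  # a probe that raises counts as "down"
            was_up = self.last_up.get(app, True)
            if was_up and not up:
                # Alert only on the up -> down transition to avoid repeats.
                for notify in self.subscribers[app]:
                    notify(f"{app} is down")
            self.last_up[app] = up
```

In this sketch a notifier could be anything callable, for example a function that sends a chat message or an email. Alerting only on the transition from up to down is one possible design choice to keep the noise low; repeating the alert after some time would be just as valid.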
We have also established a task force for all things monitoring and alerting, where we want to update each other about what we do inside our teams so that everyone can benefit from what we learn.