Postmortem: 2020-02-27 Cosmos Hub Validator Outage
By Tony Arcieri
Today we experienced our worst Cosmos Hub Validator outage yet: we were down for approximately 4 hours. In this post, we’d like to share the details of that outage with you, along with our plans for avoiding future outages and some clues about what went wrong that may be helpful to other validator operators.
Let’s begin with a timeline.
All times are in PST (i.e. this was a 5AM outage):
- 04:54: Outage detected by Hubble
- 04:57: Zaki is notified of the outage by Adriana Kalpa, confirms it via Hubble, and alerts the rest of the team. Unfortunately, the rest of the team remained asleep for the next 3 hours.
- 07:42: Tony begins responding to outage
- 07:53: Outage detected by our internal monitoring, and the first PagerDuty alert is sent
- 08:27: Sentries isolated as the cause of the problem and fixed. Validator begins syncing back up
- 08:52: Signing resumed