Postmortem: 2019-03-29 DNS-related Cosmos Hub Validator Incident
It began with a series of PagerDuty alerts on our phones. We occasionally have false positives, but this was different: several alarms in a row. We looked up at the display in our NOC (pictured above, though on a different day) and saw that this was not a false alarm: Hubble was also showing that we were down. At least our alerts were working!
Though we had been making several changes throughout the day to move onto a new internal sentry architecture, almost all of these changes were additive, so it was surprising to see that we had an outage. After hopping onto our redundant validator hosts and checking the process status, we saw something rather scary: both the active validator and backup validator processes had crashed!
A quick examination of the stack trace found the cause: DNS. We had just deleted a DNS record we thought was no longer in use, which seemed like a likely culprit. So first...