Postmortem: 2020-02-27 Cosmos Hub Validator Outage
By Tony Arcieri
Today we experienced our worst Cosmos Hub Validator outage yet: we were down for approximately 4 hours. In this post, we’d like to share the details of that outage with you, along with our plans for avoiding future outages and some clues about what went wrong that may be helpful to other validator operators.
Let’s begin with a timeline.
Incident timeline #
All times are in PST (i.e. this was a 5AM outage):
- 04:54: outage detected by Hubble
- 04:57: Zaki is notified of outage by Adriana Kalpa, confirms outage via Hubble, and alerts the rest of the team. Unfortunately the rest of the team remained asleep for the next 3 hours
- 07:42: Tony begins responding to outage
- 07:53: Outage detected by our internal monitoring and first PagerDuty alert is sent
- 08:27: Sentries isolated as the cause of the problem and fixed. Validator begins syncing back up
- 08:52: Signing resumed
Looking at that, there are a few things that clearly went wrong.
- Long incident response: there was a pretty big gap between when the incident was originally noticed and when we determined the cause
- Not detected by our internal monitoring: we first noticed this problem via Hubble (thanks Hubble!); our own alerting didn’t fire until nearly three hours later
In the next section, we’ll cover both of these along with our investigation.
Incident response #
This was a tricky problem to investigate, and I (as in Tony) personally wasn’t able to dig in until approximately 3 hours into it.
Why was that? Well, I had gone to bed fairly late the night before and diligently put my phone down next to my bed just in case something went wrong, but unfortunately the ringer was still silenced, so I didn’t become aware of the issue until I woke up and checked my phone. Mea culpa.
I’d like to say “in the future I won’t do that again”, but this isn’t the first time this has happened. So perhaps I should look into having something next to my bed besides my personal phone to alert me. With that noted, back to the investigation.
I first checked the status of the validator, KMS, and sentry nodes. Everything appeared to be up and running, which was quite confusing. However, while the sentry node was synced up, the validator was “stuck” and not receiving new blocks.
I checked the validator’s /net_info RPC endpoint, and it showed two connections, both open: one to our standby validator, and another to our external sentry. There was one connection down, though: the connection to our “out-of-band” (OOB) sentry, which is peered to other validators using VPNs and private (RFC 1918) IP addresses.
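For reference, here’s roughly what that kind of check looks like from the command line. This is a sketch, assuming the RPC port is the Tendermint default (26657) and that jq is available; field names are as of the Tendermint version we were running, so check your node’s actual output:
# Ask the validator's RPC port which peers it's actually connected to
# (26657 is the Tendermint default; adjust for your deployment).
curl -s http://127.0.0.1:26657/net_info |
  jq '{n_peers: .result.n_peers, peers: [.result.peers[] | {id: .node_info.id, ip: .remote_ip, outbound: .is_outbound}]}'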
At this point I bounced the validator and the KMS. However, after doing so, the validator wasn’t able to establish a connection to either of our sentries:
Feb 27 16:00:58 cosmoshub[30406]: Error dialing peer
module=p2p err="dial tcp [OOB sentry IP]: i/o timeout"
Feb 27 16:01:00 cosmoshub[30406]: Stopping peer for error
module=p2p peer="Peer{MConn{[external sentry IP]:26656} out}" err=EOF
…so we were getting two different errors trying to reach our two sentry nodes.
After restarting the validator, the n_peers count in our validator’s /net_info dropped from 2 to 1. This was detected by our internal monitoring, as we configured it to page when there are < 2 connections to the validator, and shortly thereafter we got our first PagerDuty alert. Unfortunately, that one remaining peer was our standby validator, which had no other path to the outside world.
Our validators were unable to reach the OOB sentry due to a networking misconfiguration, which we discovered today and are still investigating. It seems likely this has been the case for a while and we have been running “on one leg”: while we monitor the health of our Cosmos nodes themselves, aside from the coarse-grained n_peers check we don’t yet monitor the connection health between them. And with the n_peers threshold set to 2 instead of 3, we also didn’t notice this connection going down.
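As a sketch of what a per-connection check could look like (the node IDs, RPC address, and bash/jq approach here are placeholders, not our actual tooling), the idea is to verify that each peer we expect is actually present in /net_info, rather than just counting peers:
#!/bin/bash
# Page if any expected peer is missing from the validator's peer list.
# These node IDs are placeholders for the OOB sentry, external sentry, and standby validator.
EXPECTED_PEERS="oob_sentry_node_id external_sentry_node_id standby_validator_node_id"
CONNECTED=$(curl -s http://127.0.0.1:26657/net_info | jq -r '.result.peers[].node_info.id')
for peer in $EXPECTED_PEERS; do
  if ! grep -qx "$peer" <<< "$CONNECTED"; then
    echo "CRITICAL: not connected to expected peer $peer"   # hook this up to PagerDuty/Datadog
    exit 2
  fi
done
echo "OK: all expected peers connected"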
As for our external Internet-facing sentry, it had an active session with our primary validator until we restarted it, so n_peers remained at 2 despite the connection being unhealthy. After we restarted the validator, we started getting an EOF error, which indicates that it was able to open a TCP connection, but that the other side was closing it.
Looking at things from the sentry’s perspective, we saw this:
Feb 27 16:00:58 cosmoshub-sentry: Committed state
module=state height=950656 txs=0 appHash=F128089A0CD30DC4DB14FD6030876E8B9D00CBA337B30B3525E1CCBE09537667
Feb 27 16:01:00 cosmoshub-sentry: MConnection flush failed
module=p2p peer=[validator IP] err="write tcp [external sentry IP]->[validator IP]: write: connection reset by peer"
This indicates the sentry was connected and synced up to cosmoshub-3, but was receiving a TCP RST packet from the validator when it tried to initiate P2P traffic. That may sound like the validator or some piece of networking gear between them was misbehaving, but it’s also what you’d expect if the socket had simply been closed. Regardless of what was happening, there was some issue establishing a P2P connection between the validator and the sentry, and it’s unclear exactly which side was at fault.
We confirmed the sentry was still reachable from the validator via a TCP connection:
[validator ~]# nc -v [external sentry IP] 26656
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to [external sentry IP]:26656.
! �{8�y�/X^4k���S�����ң���" R
After several attempts at restarting the validator resulting in these same errors every time, we restarted the external sentry… and… suddenly everything was working and the validator started to sync up! 🎉
Our best guess is that something was malfunctioning within the Tendermint P2P networking code, causing it to get into a bad state. Notably, when we first started investigating, the TCP connection between the validator and the external sentry was still open; multiple attempts at restarting the validator failed, while everything connected right up after restarting the external sentry. Unfortunately, we don’t have any additional log lines to go on for debugging this problem beyond the ones we’ve included in this post.
Action items #
It’s clear from this incident we have a few things to work on. It’s been a long time since we had a major incident, and perhaps we’ve been a bit complacent because things have otherwise been smooth sailing.
We’ve built everything to be fully redundant, which is great for tolerating failures, but unless you can detect partial failures in redundant systems, fault tolerance can lull you into a false sense that everything is fine while an incident is already in progress.
We have a decent amount of monitoring in place, but unfortunately none of it was helpful in detecting this outage, and we first noticed it via Hubble. I’ll come back to that a bit later; first, let me talk about gaiad itself.
It seems we’ve been running a somewhat outdated version of gaiad (v2.0.3), so we’re rolling out the latest version. We’re also excited about the upcoming (but breaking) changes to the P2P layer in Tendermint v0.33, especially the ADR-050 changes implemented in tendermint/tendermint#4176, which we hope will make private connections from validators to sentries significantly more robust.
Back to monitoring: beyond standard host-based monitoring agents, we presently leverage the Tendermint Prometheus endpoints to ingest observability metrics about our running nodes into Datadog, which dispatches various alarms via PagerDuty. The problem is that, according to all of our existing monitoring, everything was working great, when in fact we were completely down!
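For context, those Datadog checks are built on Tendermint’s built-in Prometheus instrumentation (enabled with prometheus = true in config.toml and served on port 26660 by default). A quick way to eyeball the peer-count gauge behind an n_peers-style alarm, with the caveat that the exact metric name can vary between Tendermint versions:
# Scrape the node's Prometheus endpoint and pull out the peer-count gauge
# (26660 is the default prometheus_listen_addr port).
curl -s http://127.0.0.1:26660/metrics | grep '^tendermint_p2p_peers'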
For starters, we’ll be bumping the n_peers check on the validator up to 3 so we can properly detect when we’re running “on one leg”. We’ll also be turning up the verbosity of Tendermint’s P2P logging in hopes that we’ll have more information to debug these sorts of problems in the future.
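For the logging change, this is roughly what we mean. It’s a sketch that assumes the module-based log_level syntax Tendermint used at the time, the default ~/.gaiad home directory, and a systemd service named gaiad, all of which may differ in your deployment:
# In ~/.gaiad/config/config.toml, raise the P2P module's verbosity while keeping
# everything else at its usual level:
#   log_level = "p2p:debug,*:info"
# Then restart the node and watch the extra P2P detail come through:
sudo systemctl restart gaiad
journalctl -u gaiad -f | grep 'module=p2p'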
That said, P2P network observability is something we’ve been concerned with for a while. It’s why we began developing a Tendermint monitoring application called Sagan: a sort of “sidecar” agent for a Tendermint node (implemented as a lightweight Rust process) which uses the RPC port to extract information beyond what’s available in the Prometheus metrics.
Sagan is already capable of extracting this information from each node and sending it to a central collector process, which builds a global view of the current state and health of the P2P network, as well as the health of the validator and whether our signatures are getting included. This information is presently available via Sagan’s JSON API; however, we have not yet integrated Sagan with any sort of alerting system.
Unfortunately, as other priorities came up, we put Sagan on the back burner. C'est la vie. An incident like this is a poignant reminder that we should dust it off and get it to an MVP state: deployed to production and integrated with our existing monitoring, where it can detect actionable problems with the state of our internal P2P network and validators. We intend to do that soon, and in the process end up with a tool that might be useful to other validators as well.
Thanks for reading, and we hope to do better in the future!