Postmortem: 2019-08-07 Tendermint KMS-related Cosmos Hub Validator Incident

Last night and early this morning we encountered a series of small outages in iqlusion’s Cosmos Hub validator related to a recently released version of Tendermint KMS: v0.6.1. We are the primary contributors to Tendermint KMS and generally try to run the latest version of the code at all times, in order to smoke test each release prior to a wider announcement. So far this has gone pretty well, and we had not had an outage like this before.

Unfortunately, while the v0.6.1 release appeared to work during the day, to the point that we made a release announcement about it on Twitter, the issues didn’t crop up until late at night and early the next morning, as sometimes happens. We missed around 200 blocks when the KMS first crashed last night, and another 300 when it crashed again this morning. We weren’t alone: two issues were opened by people who encountered the same problems on testnets.

The incident was permanently resolved by shipping a bugfix release of Tendermint KMS, v0.6.2, followed by a v0.6.3 release in which we tried to further eliminate some particularly nasty corner cases we encountered that led to the crash causing prolonged outages rather than quick blips. In this post, we hope to cover what went wrong and what we’ve done to fix it.

Tendermint Consensus and Double Signing #

One of the primary functions of Tendermint KMS is preventing double signing, i.e. signing consensus messages for different blocks at the same stage of the consensus protocol. Preventing double signing is critical to the security of BFT consensus: a node in a BFT network sending conflicting information to other nodes is known as “equivocation”, and validators guilty of it are heavily penalized by “slashing”, in which the validator is permanently jailed and the amount staked with it “slashed” by 10%. There have been a few such incidents on Tendermint networks, and in one particular case the penalty was the equivalent of ~$1 million, so double signing is something we stringently endeavor to avoid.

Tendermint KMS v0.6.0 included some new features which really put its double signing defenses to the test: it supports connecting to multiple validators on the same chain and concurrently accepting signing requests (oftentimes inconsistent ones) from all of them. As KMS developers we find this rather scary, as a single bug in the double signing code is all that stands between a validator and getting slashed/jailed.

For this reason we recommend validators don’t run this sort of configuration in perpetuity, but instead use the functionality to fail over between validators. Based on this incident, and others we’ll describe below, we in fact recommend you only run this configuration on testnets for now, because we think this operational mode needs a lot more testing before it can be considered safe.

Ideally we’d like to see support for a coordination service/protocol that elects a single validator as active, so that running multiple validators in an active/passive(/passive/etc.) configuration becomes a belt-and-suspenders way to avoid double signing.

That said, we’ve had a number of volunteer validator operators from the community stress test the concurrent validator support on testnets, running up to five concurrent validators all hammering the KMS with concurrent and inconsistent signing requests, while we watched what happened.

After a number of false positives in which the KMS printed scary logs that looked like double signing, but were in fact safe, we finally encountered a legitimate double signing bug in a corner case of the consensus protocol.

The nature of this bug requires understanding some nuances of the Tendermint Byzantine Consensus Algorithm, namely a notion called “proof-of-lock-change” (PoLC):

A set of +2/3 of prevotes for a particular block or <nil> at (H,R) is called a proof-of-lock-change or PoLC for short.

To make progress on a block proposed by a validator, more than 2/3rds of validators perform a “prevote” at a particular height, round, and round step, in which they either vote for a particular block ID or vote <nil>. A number of things can go wrong during the consensus process: the proposer may be down and never send a block, or it may send a block that another validator considers invalid. In these cases Tendermint consensus needs a way to vote to move on to the next proposer in order to keep making progress, so a validator will “prevote” <nil>, i.e. there is no block ID that validator has identified to make progress on.
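Since the double signing checks we’ll look at below are written in Rust, it’s worth noting that a <nil> vote maps naturally onto Rust’s Option type. Here’s a minimal sketch of that representation (a hypothetical Prevote struct, not the actual KMS types):

/// Minimal sketch, not the actual KMS types: a prevote at a given
/// height/round/step either names a block ID or is <nil>.
struct Prevote {
    height: u64,
    round: u64,
    step: u8,
    /// `None` models a <nil> prevote: no block ID to make progress on
    block_id: Option<String>,
}

fn main() {
    // Roughly the two kava-testnet-2000 votes shown in the log excerpt below
    let nil_vote = Prevote { height: 34499, round: 0, step: 6, block_id: None };
    let block_vote = Prevote { block_id: Some("2FC0C142C5".into()), ..nil_vote };
    println!("{:?} vs {:?}", nil_vote.block_id, block_vote.block_id);
}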

In the KMS double signing event encountered on kava-testnet-2000, the prevotes looked like this:

04:46:58 [info] [kava-testnet-2000@tcp://47.101.10.160:26658] signed PreVote:<nil> at h/r/s 34499/0/6 (102 ms)
04:46:59 [info] [kava-testnet-2000@tcp://kava-test.ping.pub:26658] signed PreVote:2FC0C142C5 at h/r/s 34499/0/6 (123 ms)

In this particular case, a testnet validator prevoted twice: once for <nil>, and once for block ID 2FC0C142C5.... This was allowed under the original double signing logic the KMS used, which was implemented as the following Rust code:

// The original check: votes only count as conflicting when *both* the
// previous and new votes carry a block ID, so votes involving <nil>
// (i.e. None) are silently ignored
if new_state.block_id.is_some()
    && self.consensus_state.block_id.is_some()
    && self.consensus_state.block_id != new_state.block_id
{
    [...]
}

In other words, the old logic was agnostic to <nil> votes, hence the double signing vulnerability seen in practice. In Tendermint KMS PR#334, it was changed to look like this:

// The stricter check: any change of block ID, including to or from
// <nil>, is treated as a potential double sign
if self.consensus_state.block_id != new_state.block_id {
    [...]
}

This logic was much stricter about double signing, disallowing any conflicting votes regardless of whether a <nil> block ID was involved. It was relaxed slightly in PR#335, which attempts to allow the case where a validator originally prevotes <nil> but then, when Tendermint consensus advances to the “precommit” phase, votes for a valid block ID it had originally missed.
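To make the relaxed rule concrete, here is a heavily simplified sketch of the idea (a hypothetical function, not the actual PR#335 code): a transition from <nil> to a concrete block ID is tolerated only when the round step has advanced, while any other change of block ID is still treated as an attempted double sign.

/// Simplified sketch of the relaxed rule, not the actual KMS logic.
/// `None` represents a <nil> block ID.
fn conflicts(
    old_step: u8,
    new_step: u8,
    old_block_id: &Option<String>,
    new_block_id: &Option<String>,
) -> bool {
    match (old_block_id, new_block_id) {
        // Same block ID (or still <nil>): no conflict
        (old, new) if old == new => false,
        // <nil> prevote, then a real block ID at a later step (e.g.
        // precommit): the validator saw the block late; allowed
        (None, Some(_)) if new_step > old_step => false,
        // Any other change of block ID is a double sign attempt
        _ => true,
    }
}

fn main() {
    // <nil> prevote then a block precommit at a later step: allowed
    assert!(!conflicts(6, 7, &None, &Some("2FC0C142C5".into())));
    // Two different block IDs at the same step: double sign attempt
    assert!(conflicts(6, 6, &Some("AAAA".into()), &Some("BBBB".into())));
}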

What Went Wrong? (And How We Addressed It) #

Unfortunately PR#334 contained a small bug which was not caught by any of the existing tests. Here is the full code block after the change:

if self.consensus_state.block_id != new_state.block_id {
    fail!(
        StateErrorKind::DoubleSign,
        "Attempting to sign a second proposal at height:{} round:{} step:{} old block id:{} new block {}",
        new_state.height,
        new_state.round,
        new_state.step,
    self.consensus_state.block_id.as_ref().unwrap(), // panics if the old vote was <nil>!
    new_state.block_id.unwrap() // panics if the new vote is <nil>!
    )
}

The problem lies in the self.consensus_state.block_id.as_ref().unwrap() and new_state.block_id.unwrap() expressions: these assumed the block_id was not <nil>, and panicked if it was! Ironically, the crash happened entirely while building and formatting an error message. The fix was quite small (PR#342), using some helpers designed to format log messages which may contain <nil> block IDs.
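The shape of the fix is simple to sketch. Assuming a hypothetical fmt_block_id helper (the actual PR#342 helpers live in the KMS codebase), the idea is to render a missing block ID as the string “<nil>” instead of unwrapping it:

/// Hypothetical helper sketching the PR#342 idea: format an optional
/// block ID without panicking, rendering `None` as "<nil>".
fn fmt_block_id(block_id: Option<&str>) -> &str {
    block_id.unwrap_or("<nil>")
}

fn main() {
    assert_eq!(fmt_block_id(None), "<nil>");
    assert_eq!(fmt_block_id(Some("2FC0C142C5")), "2FC0C142C5");
}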

All of that said, these crashes were manifestations of some deeper problems:

1. The double signing logic had no test coverage for votes involving <nil> block IDs, which is how the bug shipped undetected.
2. The KMS’s panic-recovery logic turned what should have been a fatal PoisonError into an endless crash/retry loop.

The solution to the first problem is simple enough: add more tests! That’s what we did in PR#344, which expanded the test cases around the double signing logic; we made sure the new tests would’ve caught the original bug.
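As an illustration of the kind of case the expanded suite covers, here is a sketch against a toy model (not the actual PR#344 test code): the crucial regression is a <nil> vote on one side of the comparison, which must produce a clean error rather than a panic.

/// Toy model of the double-sign check, not the actual KMS code.
fn check_double_sign(
    old_block_id: Option<&str>,
    new_block_id: Option<&str>,
) -> Result<(), String> {
    if old_block_id != new_block_id {
        // Format both sides without unwrap(), so <nil> can't panic here
        return Err(format!(
            "conflicting votes: old block id {}, new block id {}",
            old_block_id.unwrap_or("<nil>"),
            new_block_id.unwrap_or("<nil>"),
        ));
    }
    Ok(())
}

fn main() {
    // run `cargo test` to exercise the cases below
}

#[cfg(test)]
mod tests {
    use super::*;

    // The regression that bit us: a <nil> vote followed by a vote for a
    // real block ID must error cleanly instead of panicking
    #[test]
    fn nil_then_block_id_errors_without_panicking() {
        assert!(check_double_sign(None, Some("2FC0C142C5")).is_err());
    }

    #[test]
    fn matching_block_ids_are_allowed() {
        assert!(check_double_sign(Some("2FC0C142C5"), Some("2FC0C142C5")).is_ok());
    }
}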

The solution to PoisonError was a bit more nuanced, and can be found in PR#345. The correct thing for a Rust program to do when it encounters PoisonError is to crash, so a supervisor like systemd can restart it. However, Tendermint KMS attempts to rescue panics, as many of them are retryable, such as spurious network I/O errors. PoisonError is specifically the one class of error in Rust which should not be retried, but can only be fixed by crashing and cleanly restarting the entire process.
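The underlying mechanics are easy to demonstrate with just the Rust standard library: once a thread panics while holding a Mutex, the lock is poisoned and every subsequent lock() call fails, so catching the panic and retrying can never succeed. A minimal demonstration:

use std::panic;
use std::sync::Mutex;

fn main() {
    let state = Mutex::new(0u64);

    // A panic while the lock is held "poisons" the mutex...
    let _ = panic::catch_unwind(|| {
        let _guard = state.lock().unwrap();
        panic!("simulated crash while holding the lock");
    });

    // ...and from then on every lock() returns Err(PoisonError).
    // Retrying in a loop can never succeed; the only clean recovery is
    // to crash and let a supervisor like systemd restart the process.
    assert!(state.lock().is_err());
}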

The failure mode we witnessed on our own validator was one in which a panic left the lock guarding a chain’s consensus state poisoned, and the error handling logic continually “caught” the resulting PoisonError and retried in an endless loop. Had the KMS simply crashed hard, systemd would’ve restarted it at each crash and we wouldn’t have experienced downtime.

PR#345 implements exactly that behavior: the KMS now considers any PoisonError fatal and crashes the process outright.

Conclusion #

The fixes from PR#342 (bugfix for consensus error messages), PR#344 (tests for double signing errors), and PR#345 (crash hard on PoisonError) are all available in the latest Tendermint KMS bugfix release: v0.6.3. We hope the new handling of PoisonError in these releases will prevent future outages resulting from these sorts of failure modes.

We are running Tendermint KMS v0.6.3 in production on both Cosmos Hub and Terra, thus far without issues.

We also plan on working with the Tendermint team to nail down the KMS’s handling of corner cases in the consensus protocol and to ensure that validators operating the KMS do not encounter costly double signing vulnerabilities. There have been some hard lessons learned from this incident, but we hope the outcomes are ultimately for the best.

 