Stalled parachains on Kusama - post mortem

Special thanks to @tsvetomir and @eskimor for writing the majority of this post mortem.

On September 1st, 2023 at around 06:14 UTC all parachains on Kusama stopped producing blocks (stalled). The relay chain was not affected - it was producing and finalizing blocks without any interruptions.

A TL;DR version is available at the end.

Investigation

Further inspection showed that a dispute had been initiated and concluded “Invalid” at block 19485524.

A dispute concluding invalid means that a bad candidate was included and the corresponding fork should be reverted up to the last known good block. Generally speaking this is undesirable but not a catastrophic event. The network is designed to handle such situations gracefully. The problem in this case was that the disputed block was already finalized. This can be confirmed by the logs of one of our validators:

04:11:30.831 DEBUG tokio-runtime-worker parachain::candidate-backing: Candidate backed candidate_hash=0x0f36afd6742ffbf9a1e807e56dea0c168ecf40f84b7f524a11f5a6d41fd24ba1 relay_parent=0xbcc6bd1d372e57366eb27d8382501a51a59bcfd9c37a7deb627912ea9b66b3b0 para_id=2106 traceID=20222370379600306452081162187952491542

We can see that the candidate was backed at 04:11:30 UTC. The dispute was raised and concluded around 06:14:36 UTC in block 19485524 according to Subscan:

[Subscan screenshot: dispute concluded in block 19485524]

We can see that around two hours passed between the block's inclusion and the dispute! Why would anyone raise a dispute that late - long after finality? The dispute concluded against the candidate, triggering a revert event in block 19485524:

The chain should be reverted to block 19,484,297 (while the current block is 19,485,524). Kusama (and Polkadot) is designed not to finalize invalid blocks. If it still happens, the chain is frozen until governance steps in. This means that the relay chain still produces blocks but parachain consensus is halted. This explains why all the parachains had stalled while the relay chain kept producing and finalizing new blocks. The frozen chain can be confirmed by the validator logs too:

WARN tokio-runtime-worker parachain::chain-selection: The hammer has dropped. The protocol has indicated that a finalized block be reverted. Please inform an adult. maybe_revert_hash=None revert_target=19484297 maybe_reporting_hash=Some(0xe018f4d30dcb3976537cd37be0ab6e17f2304a6d85350fdf230a161d9869f1a6) maybe_reporting_number=Some(19485524)
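
To make the freezing mechanism more concrete, here is a minimal, deliberately simplified Rust sketch of the behaviour described above. It is not the actual runtime code, and the names (Disputes, revert_and_freeze, can_process_candidates) only loosely mirror the real disputes pallet:

/// A deliberately simplified model of the runtime's dispute handling; the real
/// pallet is considerably more involved.
#[derive(Default)]
struct Disputes {
    /// `Some(last_good_block)` while parachain consensus is frozen.
    frozen: Option<u32>,
}

impl Disputes {
    /// Called when a dispute concludes "invalid" against an already included
    /// candidate: the chain should be reverted to `last_good_block`.
    fn revert_and_freeze(&mut self, last_good_block: u32) {
        if self.frozen.is_none() {
            // Emit a revert signal for the node side and freeze new inclusions.
            println!("revert to block {last_good_block} and freeze");
            self.frozen = Some(last_good_block);
        }
    }

    /// While frozen, no new parachain candidates are accepted, which is why
    /// every parachain stalled even though the relay chain kept running.
    fn can_process_candidates(&self) -> bool {
        self.frozen.is_none()
    }
}

fn main() {
    let mut disputes = Disputes::default();
    assert!(disputes.can_process_candidates());

    // The dispute against the already finalized candidate concludes "invalid":
    disputes.revert_and_freeze(19_484_297);
    assert!(!disputes.can_process_candidates()); // all parachains stall
}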

At this point we knew that the stall was caused by a dispute concluding invalid against a finalized block, which froze the chain. But there were still too many open questions.

Was the block actually invalid?

The most important question at this point was what caused the dispute. There were two possibilities - either an invalid block was finalized or a valid block was (wrongly) getting disputed. To figure this out we double checked that the candidate was correctly approved by validators. As an additional sanity check we also manually verified the disputed candidate via kuddelmuddel:

     Running `target/release/kuddelmuddel validate-candidate --network kusama --candidate-hash 0x0f36afd6742ffbf9a1e807e56dea0c168ecf40f84b7f524a11f5a6d41fd24ba1 --rpc-url 'wss://kusama-rpc.polkadot.io:443'`
Successfully fetched PoV for 0x0f36afd6742ffbf9a1e807e56dea0c168ecf40f84b7f524a11f5a6d41fd24ba1, para_id=2106
Fetching Pvf 0x748cff124604a12694cbf022d24e2c1fe35ed6a29ac3e2429a910c6f72d3a56d
PoV size: 27kb
Pvf prechecking...
Pvf preparation took 4836ms
Pvf execution...
Execution took 22ms

The candidate was indeed valid: the network behaved correctly when it finalized it. One question answered, but it was still unclear why the dispute was initiated in the first place, why validators participated, and why the majority of the nodes considered the candidate invalid.

Why did validators vote invalid?

As the candidate was valid, we had a suspicion that dispute participants voted invalid because they couldn’t execute the block for some reason. With the help of the validator community we got this log:

Sep 01 06:14:28 cannot query the runtime API version: Api called for an unknown Block: State already discarded for 0xbcc6bd1d372e57366eb27d8382501a51a59bcfd9c37a7deb627912ea9b66b3b0
Sep 01 06:14:28 Failed to acquire executor params for the session validation_code_hash=0x52c615ef5a0bafcd3ff2757611c21f568775c792a83af1c5bcc0d63b05b7694d para_id=Id(2239)
Sep 01 06:14:28 Candidate 0xb0c6a8f38fca5949f3d613c5fc43c7a63021f35baa90e127dcfab1e20b42fa96 considered invalid: BadParent

It shows that the validator couldn't execute the block: it failed to fetch the executor params from the runtime because the relay chain block in use had already been finalized and its state pruned.

Multiple validators reported the same problem (a runtime API call failing with a 'state discarded' error). We had actually been aware of this issue already, and a PR fixing it was ready but not yet merged or released; hence the issue.

At this point, the new information we had obtained was that the dispute concluded invalid because of unavailable executor parameters, which are a prerequisite for verifying candidate validity. They were missing because the disputed candidate had been finalized long before and the relevant state had been pruned as a consequence.
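
To illustrate the failure mode, here is a self-contained Rust sketch. The types and functions (ChainView, session_executor_params) and the numbers are made up for illustration; the real node goes through the runtime API subsystem:

// Made-up types for illustration; the real node uses the runtime API
// subsystem and its own error types.
struct ExecutorParams;

enum RuntimeApiError {
    // Corresponds to "Api called for an unknown Block: State already discarded for …"
    StateAlreadyDiscarded,
}

// Pretend chain view: on a non-archive node only recent blocks still have
// their state available.
struct ChainView {
    pruning_horizon: u32,
    best_number: u32,
}

impl ChainView {
    // Pretend runtime API call: executor params for `session`, read at the
    // block with number `at_number`.
    fn session_executor_params(
        &self,
        at_number: u32,
        _session: u32,
    ) -> Result<ExecutorParams, RuntimeApiError> {
        if self.best_number.saturating_sub(at_number) > self.pruning_horizon {
            Err(RuntimeApiError::StateAlreadyDiscarded)
        } else {
            Ok(ExecutorParams)
        }
    }
}

fn main() {
    // Illustrative numbers: 256 is a common state-pruning default.
    let chain = ChainView { pruning_horizon: 256, best_number: 19_485_524 };
    let session = 32_938;

    // What the faulty path effectively did: read at the disputed candidate's
    // relay parent, roughly 1200 blocks (about two hours) in the past.
    let relay_parent_number = chain.best_number - 1_200;
    assert!(chain.session_executor_params(relay_parent_number, session).is_err());

    // The gist of the fix: executor params only change at session boundaries,
    // so they can be read at a recent block of the same session instead.
    assert!(chain.session_executor_params(chain.best_number, session).is_ok());
}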

We had therefore answered the question of why other nodes, once the dispute was raised, would vote invalid. We still didn't know exactly why the dispute was raised in the first place.

Why was the dispute started?

A good guess as to why the dispute was started in the first place is that some validator was offline, came back and caught up, but was still lagging behind finality and started approval voting. Then, while candidate validation was underway, finality caught up and the state was discarded, causing validation to fail: a dispute was born!

This is just a guess, and there are still open questions potentially relating to bugs: in particular, even if a node was offline for a while, approval voting is not supposed to do anything during a major sync.

Why did validators participate in the dispute?

To be more precise:

Why did honest nodes participate in a dispute against a finalized block?

The dispute-coordinator (the module handling disputes in a validator) caches all the data required to participate in disputes up to some block depth behind finality. The reason for this depth is to be able to participate in a dispute on an abandoned fork, so that no offender can get away without a slash. But this depth is 10 blocks, not two hours' worth of blocks! Any older data should be pruned, causing validators that see a dispute for a block beyond this point to mostly ignore it (they record the votes, but do not participate). This pruning security measure (spam prevention), however, did not work as expected.
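
A small Rust sketch of the intended behaviour (not the actual dispute-coordinator code; the constant name only loosely mirrors the real one and the block numbers are illustrative):

// Candidates included more than this many blocks before finality should have
// been pruned, and disputes about them should only be recorded, not acted on.
const DISPUTE_CANDIDATE_LIFETIME_AFTER_FINALIZATION: u32 = 10;

// Should an honest validator actively participate (fetch the PoV, re-execute,
// vote) in a dispute about a candidate included at block `included_at`, given
// the current finalized block number?
fn should_participate(included_at: u32, finalized: u32) -> bool {
    included_at + DISPUTE_CANDIDATE_LIFETIME_AFTER_FINALIZATION >= finalized
}

fn main() {
    let finalized = 19_485_500;

    // Recently included (possibly on an abandoned fork): participate, so that
    // no offender can get away without a slash.
    assert!(should_participate(finalized - 3, finalized));

    // Included ~1200 blocks (two hours) earlier: the data should have been
    // pruned and honest nodes should only record the votes. Because of the
    // pruning bug, this is effectively the check that did not kick in, so
    // validators joined the old dispute anyway.
    assert!(!should_participate(finalized - 1_200, finalized));
}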

We noticed in the logs that our validators indeed had not pruned the data for the candidate in question, and for this reason they participated in the dispute. This led us to suspect a bug in the pruning logic, and indeed there was one.

Why did nobody get slashed?

While the candidate was indeed valid, the network believed at the time that it was not: therefore a slash should have happened!

Inspecting the storage items, we noticed that the runtime had correctly saved the information about pending slashes (only one entry is shown for brevity):

parasSlashing.unappliedSlashes: Option<PolkadotRuntimeParachainsDisputesSlashingPendingSlashes>

[
  [
    [
      32,938
      0x0f36afd6742ffbf9a1e807e56dea0c168ecf40f84b7f524a11f5a6d41fd24ba1
    ]
    {
      keys_: {
        30: 0xde584bf7dfad115726ec0a4b2f75fab6b0b02dd5343a12015a93743f0eaa8173
        31: 0x82c5eaf9d580e0c613d42226ae91a87120f75497f90e7925813b7632e9830405
        32: 0x66b33067fcaadb0f87ea4fe9347d90a090794fcac4f71e3948f8b44ef2edf57c
        33: 0x7cbab980c86ace4ecd371a7179aa382e33c76a6db2337b2d6cdda94f0cf2f705
        34: 0x748496ef28092ec0c87c731b4d369da71637eaaab00c7b6e286584ffc36c8638
      }
      kind: ForInvalid
    }
  ]
]

From this entry, we can see that an invalid parachain candidate (0x0f36afd6742ffbf9a1e807e56dea0c168ecf40f84b7f524a11f5a6d41fd24ba1) was backed in session 32938 by 5 validators with respective session keys.

These validators were waiting to be slashed. We were only missing the so-called KeyOwnershipProofs, which need to be submitted by validators for the slashes to be applied.
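
The flow that should have applied the slashes looks roughly like the following Rust sketch; all names here (block_in_session, key_ownership_proof, submit_report_dispute_lost) are hypothetical stand-ins for the real node and runtime APIs:

type Hash = [u8; 32];
type SessionIndex = u32;

struct SessionKey(Hash);
struct KeyOwnershipProof(Vec<u8>);
struct DisputeProof {
    session: SessionIndex,
    candidate_hash: Hash,
    offender: SessionKey,
}

// Find a relay chain block belonging to `session` whose data is still
// available. This is the step that failed here ("Could not find an including
// block for candidate …") because the relevant data had been pruned.
fn block_in_session(_session: SessionIndex) -> Option<Hash> {
    None
}

// Pretend runtime API call: prove that `key` belonged to the validator set,
// evaluated at block `at` (so it needs that block's state).
fn key_ownership_proof(_at: Hash, _key: &SessionKey) -> Option<KeyOwnershipProof> {
    Some(KeyOwnershipProof(Vec::new()))
}

// Pretend unsigned extrinsic submission that would actually apply the slash.
fn submit_report_dispute_lost(_dispute: DisputeProof, _ownership: KeyOwnershipProof) {}

fn try_apply_pending_slash(dispute: DisputeProof) {
    match block_in_session(dispute.session) {
        Some(at) => {
            if let Some(ownership) = key_ownership_proof(at, &dispute.offender) {
                submit_report_dispute_lost(dispute, ownership);
            }
        }
        // What happened on Kusama: no suitable block was found, so the key
        // ownership proofs were never produced and the slashes stayed pending.
        None => eprintln!("could not find an including block for the candidate"),
    }
}

fn main() {
    try_apply_pending_slash(DisputeProof {
        session: 32_938,
        candidate_hash: [0u8; 32],
        offender: SessionKey([0u8; 32]),
    });
}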

Again, pruning on finality was preventing that from happening:

2023-09-01 06:14:38.125 parachain::dispute-coordinator: Could not find an including block for candidate against which a dispute has concluded. candidate_hash=0x0f36afd6742ffbf9a1e807e56dea0c168ecf40f84b7f524a11f5a6d41fd24ba1 session=32938

It was a similar situation here: we already knew about this problem and had a solution implemented, but not yet released.

Steps taken to unstall the Parachains

For the case when a chain freezes, there’s an extrinsic named force_unfreeze which can be used by Governance to unblock the chain in case of a bug.
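
In terms of the simplified model sketched in the investigation section (and again, not the actual pallet code), force_unfreeze does little more than clear the frozen marker, after which parachain candidates are processed again:

/// Simplified model of the relevant part of the disputes pallet.
struct Disputes {
    /// `Some(last_good_block)` while parachain consensus is frozen.
    frozen: Option<u32>,
}

impl Disputes {
    /// Root-only call; on Kusama it was dispatched via governance.
    fn force_unfreeze(&mut self) {
        self.frozen = None;
    }
}

fn main() {
    let mut disputes = Disputes { frozen: Some(19_484_297) };
    disputes.force_unfreeze();
    assert!(disputes.frozen.is_none()); // parachains can produce blocks again
}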

This is exactly what was submitted as the OpenGov referendum “Fixing Inclusion Issues On Parachains Side” and whitelisted by the Kusama Fellowship on SubSquare.

Thanks to everyone who voted in a timely fashion to ensure the quick enactment period!

Lessons learned and the path forward

The underlying dispute issue was caused by a code change that did not take into account that the state of the candidate's relay parent is not guaranteed to be available during disputes and slashing.

Although we have fixed this regression, to ensure it doesn't happen again, extensive automated testing has been proposed and will be implemented as soon as possible.

In general, test coverage is being increased as we speak.

TL;DR

In all brevity, let's summarize what happened and how it was fixed:

What happened?

  1. A candidate got backed at around 4 AM UTC - nothing special yet.
  2. That same candidate got approved and finalized just a bit later - still nothing special.
  3. Two hours later, some validator disputed that candidate … this is unexpected!
  4. All validators participated in that dispute … even more unexpected!
  5. Most participating validators voted “invalid”! … seriously?!
  6. The dispute resolved against the candidate, but the candidate was already finalized. No reversion possible. The chain got frozen.
  7. Nobody got slashed, because past-session slashing was not working correctly for finalized blocks.

How did we fix it?

  1. Double-checked that the candidate was indeed valid - good.
  2. Initiated unfreezing the chain via governance.
  3. Executor-params-related bugfix, so that executor params are no longer fetched via the disputed candidate's relay parent, whose state might already be pruned.
  4. Bugfix for the under-pruning in the dispute-coordinator, which had allowed the validators to participate in that old dispute.
  5. Improved pruning resilience for slashing.

Thank you for the quick recovery!
Thanks to the quick repair, we managed the artists' mint on Kusama Assets Hub on KodaDot for an offline gallery in Berlin - this incident happened just as we were onboarding 48 artists to Assets Hub 🙂


Good job. Felt like Rush Hour. 🙂


I wonder if we’ve considered the scenario that an attacker may have seen the PR which exposes the vulnerability and decided to exploit it.

And, potentially, whether we need to review how security vulnerabilities get patched. Perhaps we should not make them public through issues/PRs until they are patched.

Yeah, for sure this was considered. However, the parachains stalled first and only then were there PRs to fix it. You are right that someone could still have tried the same thing after seeing these PRs come up, but this didn't happen. In any case, when there are really critical things we keep them closed for as long as possible so that the fixes can be rolled out first.


It's also important to note that this issue was a parachain liveness issue and not a security issue, since there was no way to revert already finalized blocks.

Even then, we decided to publish this post mortem only after a supermajority of validators had upgraded to the patched release, according to various telemetry metrics.
