[2025-08-24] Kusama stall postmortem

Timeline

All timestamps are UTC.

GitHub issue: Kusama block production stall (24.08.2025) · Issue #9546 · paritytech/polkadot-sdk

08/24/2025 12:27:54

New session starts at block #29798491.

08/24/2025 12:28:24

A dispute against a valid parachain block concluded on-chain in block 29798496. The relay chain runtime disabled the validators that voted against the valid candidate:

  • Cad3MXUdmKLPyosPJ67ZhkQh7CjKjBFvb4hyjuNwnfaAGG5,
  • Cax6oB6GCvGv89NnsG5mbTpK5ZkqA3Fa6fKk1sgNibmBmeR.

08/24/2025 12:28:24 - 12:39:36

No blocks were produced during this period.

All block authors experienced the following panic when they attempted to build a block:

2025-08-24 12:42:48.626 ERROR tokio-runtime-worker runtime: panicked at /home/builder/cargo/registry/src/index.crates.io-6f17d22bba15001f/polkadot-runtime-parachains-19.2.0/src/paras_inherent/mod.rs:203:5: Bitfields and heads must be included every block

2025-08-24 12:42:48.628 WARN tokio-runtime-worker babe: Proposing failed: Import failed: Error at calling runtime api: Execution failed: Execution aborted due to trap: wasm trap: wasm unreachable instruction executed WASM backtrace: error while executing at wasm backtrace: 0: 0xa9cd - staging_kusama_runtime.wasm!rust_begin_unwind 1: 0x383b - staging_kusama_runtime.wasm!core::panicking::panic_fmt::hf6523d5adc4038e0 2: 0x42ed2d - staging_kusama_runtime.wasm!<(TupleElement0,TupleElement1,TupleElement2,TupleElement3,TupleElement4,TupleElement5,TupleElement6,TupleElement7,TupleElement8,TupleElement9,TupleElement10,TupleElement11,TupleElement12,TupleElement13,TupleElement14,TupleElement15,TupleElement16,TupleElement17,TupleElement18,TupleElement19,TupleElement20,TupleElement21,TupleElement22,TupleElement23,TupleElement24,TupleElement25,TupleElement26,TupleElement27,TupleElement28,TupleElement29,TupleElement30,TupleElement31,TupleElement32,TupleElement33,TupleElement34,TupleElement35,TupleElement36,TupleElement37,TupleElement38,TupleElement39,TupleElement40,TupleElement41,TupleElement42,TupleElement43,TupleElement44,TupleElement45,TupleElement46,TupleElement47,TupleElement48,TupleElement49,TupleElement50,TupleElement51,TupleElement52,TupleElement53,TupleElement54,TupleElement55,TupleElement56,TupleElement57,TupleElement58,TupleElement59,TupleElement60,TupleElement61,TupleElement62,TupleElement63,TupleElement64) as frame_support::traits::hooks::OnFinalize>::on_finalize::hc7f22551b7f542d9 3: 0x61efc9 - staging_kusama_runtime.wasm!BlockBuilder_finalize_block

08/24/2025 12:39:36

Block #29798497 is built. Importantly, it contains an empty InherentData: no candidates were backed or made available, and no dispute votes were put on-chain.

08/24/2025 12:39:42 - 12:48:30

No blocks were produced. Block authors continued to hit the same panic while trying to build blocks.

08/24/2025 12:48:30

Block #29798498 was built and the network fully recovered.

Root cause

The issue was caused by a bug in the parainherent.create_inherent runtime API. This API is used by the block authors to craft the parainherent.enter call and arguments. The node calls this runtime API with an input InherentData containing backed candidates, bitfields and dispute votes. The API filters the input data and returns a curated list of backed candidates, bitfields and dispute votes to be used for the enter call.
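As a rough sketch of this flow (all names below are placeholders for illustration, not the actual polkadot-sdk types or node code):

// Rough sketch of the flow described above. All names are placeholders,
// not the actual polkadot-sdk types or APIs.

#[derive(Clone, Default)]
struct ProvisionerData {
    backed_candidates: Vec<&'static str>,
    bitfields: Vec<&'static str>,
    dispute_votes: Vec<&'static str>,
}

// Runtime API side: sanitize what the node collected off-chain and return the
// curated payload that becomes the arguments of `parainherent.enter`.
fn create_inherent(input: &ProvisionerData) -> ProvisionerData {
    // ... filtering happens here: drop votes from disabled validators,
    // expired candidates, stale disputes, and so on ...
    input.clone()
}

// Node side: call the runtime API, then put the curated result on-chain.
fn author_block(collected: ProvisionerData) -> ProvisionerData {
    let curated = create_inherent(&collected);
    // `parainherent.enter(curated)` is pushed into the block being built here.
    curated
}

fn main() {
    let payload = author_block(ProvisionerData::default());
    println!("{} backed candidates in the inherent", payload.backed_candidates.len());
}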

The bug is located in the code that filters the backed candidates to ensure that validity votes coming from disabled validators are removed. This code is wrong because it does not map the index in indices_to_drop (which is an index into the validator group) to the corresponding index in the validity votes vector. One of the candidates provided by the nodes had fewer validity votes than there are validators in its backing group. This resulted in the removal of the wrong validity vote. Removing the wrong vote shifted the indices of the other votes in the vector, causing the validity vote signature checks to fail later in the code.
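A minimal sketch of the mismatch, assuming a simplified representation of a backed candidate (a "who voted" bitfield over the backing group plus a vector of signatures in group order). This is illustrative code, not the actual runtime implementation; it uses the concrete group and vote layout from the logs discussed below.

// Minimal sketch of the bug (illustrative only, not the actual runtime code).
// A backed candidate carries a bitfield over the backing group saying who voted,
// plus a vector of signatures in group order -- only from validators that voted,
// so it can be shorter than the group.

fn vote_to_drop_buggy(voted: &[bool], disabled_group_idx: usize) -> usize {
    // BUG: the group index is used directly as an index into the signatures
    // vector. If an earlier group member did not vote, the two disagree.
    let _ = voted;
    disabled_group_idx
}

fn vote_to_drop_fixed(voted: &[bool], disabled_group_idx: usize) -> usize {
    // Fixed: map the group index to a position in the signatures vector by
    // counting how many earlier group members actually voted.
    voted[..disabled_group_idx].iter().filter(|v| **v).count()
}

fn main() {
    // Incident shape: group [50, 51, 52, 53, 54], validator 50 did not vote,
    // validator 51 (group index 1) was disabled.
    let voted = [false, true, true, true, true];
    let signatures = ["sig51", "sig52", "sig53", "sig54"];
    let disabled_group_idx = 1;

    let buggy = vote_to_drop_buggy(&voted, disabled_group_idx);
    let fixed = vote_to_drop_fixed(&voted, disabled_group_idx);
    // The buggy version drops signatures[1] (validator 52's vote) and keeps the
    // disabled validator's vote; the remaining signatures no longer line up with
    // the bitfield, so the signature checks fail and the inherent is discarded.
    println!("buggy removes {} ({}), fixed removes {} ({})",
        buggy, signatures[buggy], fixed, signatures[fixed]);
}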

We can find proof that this is what happened in the logs:

2025-08-24 12:29:17.807 DEBUG tokio-runtime-worker parachain::candidate-backing: Importing statement statement=CompactStatement::Valid(0x707a96ea766dbf25c4d2033d96ef171ae6f0cfc2d7568f1a95fdc45378ce1540) validator_index=52 candidate_hash=0x707a96ea766dbf25c4d2033d96ef171ae6f0cfc2d7568f1a95fdc45378ce1540 traceID=149510056682134309566292573539200341786

2025-08-24 12:28:23.294 DEBUG tokio-runtime-worker parachain::candidate-backing: Importing statement statement=CompactStatement::Valid(0x707a96ea766dbf25c4d2033d96ef171ae6f0cfc2d7568f1a95fdc45378ce1540) validator_index=53 candidate_hash=0x707a96ea766dbf25c4d2033d96ef171ae6f0cfc2d7568f1a95fdc45378ce1540 traceID=149510056682134309566292573539200341786

2025-08-24 12:28:23.291 DEBUG tokio-runtime-worker parachain::candidate-backing: Importing statement statement=CompactStatement::Valid(0x707a96ea766dbf25c4d2033d96ef171ae6f0cfc2d7568f1a95fdc45378ce1540) validator_index=51 candidate_hash=0x707a96ea766dbf25c4d2033d96ef171ae6f0cfc2d7568f1a95fdc45378ce1540 traceID=149510056682134309566292573539200341786

2025-08-24 12:28:23.291 DEBUG tokio-runtime-worker parachain::candidate-backing: Importing statement statement=CompactStatement::Seconded(0x707a96ea766dbf25c4d2033d96ef171ae6f0cfc2d7568f1a95fdc45378ce1540) validator_index=54 candidate_hash=0x707a96ea766dbf25c4d2033d96ef171ae6f0cfc2d7568f1a95fdc45378ce1540 traceID=149510056682134309566292573539200341786


2025-08-24 12:28:19.294  INFO tokio-runtime-worker parachain::dispute-coordinator: Disabled offchain for voting invalid against a valid candidate candidate_hash=0xdd9398f89d8c245572610b9ecc381334cf66b1f8839cfa7cf8892791627ceab1 validator_index=ValidatorIndex(622) session=50299 traceID=294525757330273653902701241479959417652

2025-08-24 12:28:19.294  INFO tokio-runtime-worker parachain::dispute-coordinator: Disabled offchain for voting invalid against a valid candidate candidate_hash=0xdd9398f89d8c245572610b9ecc381334cf66b1f8839cfa7cf8892791627ceab1 validator_index=ValidatorIndex(51) session=50299 traceID=294525757330273653902701241479959417652

The parachain block candidate backed by the disabled validators is identified by this hash: 0x707a96ea766dbf25c4d2033d96ef171ae6f0cfc2d7568f1a95fdc45378ce1540.

Validators with indices 50, 51, 52, 53, 54 are part of the backing group that backed this candidate.

However, validator 50 did not provide its validity vote, as we only see votes from 51, 52, 53 and 54 in the logs. In block 29798496, validator 51 gets disabled on-chain. Validator 51 has index 1 in the group, so the runtime removed index 1 from the validity votes, when it should have removed index 0, because position 0 holds the signature from validator 51. This led to failed signature checks later in the code, resulting in the missing inherent data and the panic we see in the logs.

The fix

After identifying the root cause, we fixed the offending code by correctly mapping the index into the validity_votes vector: Parachains runtime: properly filter backed candidate votes by sandreim · Pull Request #9564 · paritytech/polkadot-sdk

The fix was backported to the 2503 and 2507 releases, and new relay chain runtimes were released and proposed via referendums:

  • 1.7.1 for Kusama (referendum #584)
  • 1.6.2 for Polkadot (referendum #1736)

Why did validators initiate the dispute?

The reason for initiating the dispute cannot be proven on-chain, and we could not find it in the available logs. However, we have seen from the previous attempt to enable RFC103 that some validators run older node versions which do not support RFC103. When such a validator attempts to validate a parachain block that uses this feature, validation fails and the validator considers the parachain block invalid.

Why did block production stall?

When validators build a relay chain block they use the BlockBuilder runtime APIs. Every relay chain block contains two mandatory extrinsics, called inherents:

  • timestamp.set
  • parainherent.enter

The last step of the block building process is block finalization. Pallets can implement the on_finalize hook if they need specific logic to be executed at the end of a block. The paras_inherent pallet, which is responsible for processing the parachains inherent, implements this hook and checks that the block actually contains the parachains inherent extrinsic (enter).

But during the incident this inherent data was missing, with no warning or error in the logs other than the panic, which originates from that check.
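A simplified model of that check (illustrative only; the real logic lives in the paras_inherent pallet's on_finalize referenced above):

// Simplified model of the check described above (not the actual pallet code).
// `enter` records that the parachains inherent was processed; `on_finalize`
// panics if that never happened in the current block.

struct ParasInherent {
    included: Option<()>,
}

impl ParasInherent {
    fn enter(&mut self /* bitfields, backed candidates, dispute votes */) {
        // ... process the inherent data ...
        self.included = Some(());
    }

    fn on_finalize(&mut self) {
        if self.included.take().is_none() {
            // This is the panic seen in the validator logs during the incident.
            panic!("Bitfields and heads must be included every block");
        }
    }
}

fn main() {
    let mut pallet = ParasInherent { included: None };
    // During the incident the `enter` extrinsic was never part of the block...
    // pallet.enter();
    // ...so finalizing the block panicked and block authorship failed.
    pallet.on_finalize();
}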

Why was the parainherent extrinsic missing from the block?

The runtime API used by the node to craft the extrinsic failed due to a bug in processing the backed candidates supplied in the InherentData argument of the create_inherent call, and returned None to the caller. From the client's perspective a None return is valid, so no error was logged.

While the runtime code contains a warning message explaining the error, this was not visible in the client because logging is completely disabled in production runtimes for performance reasons.
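A sketch of why this failure was silent, assuming the Option-returning shape of the runtime's inherent-creation hook (placeholder types; not the actual runtime code):

// Sketch of the silent failure path (placeholder types; the real hook is the
// runtime's inherent-creation API, which returns an Option).

struct Call; // stand-in for the parainherent.enter call

fn create_inherent(sanitization_failed: bool) -> Option<Call> {
    if sanitization_failed {
        // The runtime emits a warning here, but logging is compiled out of
        // production runtimes, so nothing reaches the node's logs.
        return None;
    }
    Some(Call)
}

fn main() {
    // To the node, `None` simply means "nothing to include": no extrinsic is
    // added to the block and no error is reported.
    match create_inherent(true) {
        Some(_call) => println!("push parainherent.enter into the block"),
        None => println!("no parachains inherent this block"),
    }
}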

Why did the block production continue to stall after 29798497 was built?

We know that for this block the parainherent.enter extrinsic contained no backed candidates, bitfields or disputes. These are collected by the node via gossip as a result of the off-chain consensus in backing, availability and disputes, and are kept in memory until they are provided to the runtime via the parainherent.enter extrinsic.

The only plausible explanation is that the author of this block restarted during the stall. When it came back online it had no backed candidates, bitfields or dispute votes to put in the inherent, so building the block succeeded.

Why did the network recover?

The network recovered as soon as the candidate backed by the validators disabled in block 29798496 expired. Backed candidates expire when their relay parent is older than 3 blocks.

The last block where this candidate was still valid is block 29798497. However, we got lucky: the author of this block was poorly connected to the network and had never seen the candidate.
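A back-of-the-envelope check of the expiry rule; the relay parent block number below is hypothetical, chosen only to be consistent with the timeline above.

// Back-of-the-envelope check of the expiry rule. The relay parent number is
// hypothetical, chosen only to be consistent with the timeline above.

// A backed candidate can still be included while its relay parent is at most
// 3 blocks older than the block being built; after that it expires.
fn still_includable(relay_parent: u64, block_being_built: u64) -> bool {
    block_being_built - relay_parent <= 3
}

fn main() {
    let relay_parent = 29_798_494; // hypothetical relay parent of the stuck candidate
    assert!(still_includable(relay_parent, 29_798_497)); // last block it fit into
    assert!(!still_includable(relay_parent, 29_798_498)); // expired from here on
    println!("the candidate could no longer be included after block 29798497");
}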

How do we avoid similar issues in the future?

We have previously run heavy dispute load tests on our private testnet (Versi). The test network does not emulate real-world conditions: latencies, node restarts or failures. Because of these ideal conditions, the number of validity votes on a backed candidate always equals the backing group size, so the issue did not surface.

To prevent future similar issues we need to:

  • change the testing environment to match real-world conditions as closely as possible
  • fuzz test the parachain consensus components of the relay chain runtime
  • improve runtime error reporting (during this incident, the relevant runtime error information was missing)