Polkadot Dispute Storm - The Postmortem

On March 11th, 2023 (block 14,605,287), a validator started to dispute valid parachain blocks. This led to block times of up to 18 seconds on Polkadot. In addition, parachains stalled because they could not include any new blocks on the relay chain.

This validator continued to initiate disputes for roughly 20 hours until block 14,614,592, generating 60,614 disputes in total. Block times returned to normal after 12 hours (block 14,619,249), but it took another 12 hours for the network to fully recover.

What exactly happened, how it was fixed, and the lessons we learned will be discussed in the following sections.

What happened?

Prometheus metrics produced by an RPC node during the incident provide a high-level view of the impact.

Polkadot block time degradation during the incident.

The rate of disputes raised and concluded per relay chain block. It can be observed that most of them conclude quickly.

60k parachain blocks were disputed during the incident. For each dispute ~200 votes are put on chain.

What is a dispute?

The disputes protocol ensures that any attempt to finalize an invalid parachain block is detected and the offending validators are punished according to the severity of the offence.

A dispute is a process that is triggered by the protocol when there are two opposing votes about the validity of a specific parachain block. All parachain validators are required to participate in it by checking the parachain block and casting a vote. Only after the dispute is resolved can the relay block that includes it be finalized. To achieve consensus and resolve a dispute, a 2/3 supermajority is needed. If the parachain block is deemed invalid, the chain is reverted and the offending block is discarded. Otherwise, the parachain block can be finalized.
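As a rough sketch (not Polkadot source code), the 2/3 supermajority needed to resolve a dispute among n participating validators can be written down using the usual Byzantine fault-tolerance bound; the function name and formula here are illustrative:

```rust
// Sketch of the 2/3 supermajority rule: tolerating up to f = (n - 1) / 3
// faulty validators, a dispute concludes once n - f validators agree on
// one side (valid or invalid).
fn supermajority_threshold(n: usize) -> usize {
    let f = (n - 1) / 3; // maximum number of tolerated faulty validators
    n - f
}

fn main() {
    // With 200 participating validators, 134 matching votes conclude a dispute.
    println!("{}", supermajority_threshold(200));
}
```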

The cost of disputes

Validators participating in a dispute download the parachain block and execute it to check its outputs and validity. This process consumes CPU and network bandwidth which scales linearly with the number of ongoing disputes.

When there are many ongoing disputes, our system prioritizes security over availability. Parachain block times are therefore expected to degrade, as block authors prefer to include dispute votes in blocks rather than create new parachain blocks, which would add even more work to the queue. The on-chain votes are required to slash and disable the offenders, preventing further disputes and the associated high load.

Why did the validator incorrectly dispute the parachain blocks?

The validator operator updated the node binary but did not restart it as expected after the update. The new PVF workers (responsible for executing blocks) spawned by the node used the new binary, but the PVF artifacts were compiled with the old version. During execution, the client code detected the version mismatch and reported it as an invalid parachain block. This led to disputes being initiated.

The execution failed because the PVF (Wasm) artifact created with wasmtime 6.0.0 could not be loaded by 6.0.1. This is an internal error and should not raise any disputes. Had the operator restarted the node, all PVF artifacts would have been wiped at startup and the bug would not have surfaced.
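A guard of the following shape would have caught the mismatch before execution. This is a hypothetical sketch, not the actual Polkadot worker code; the type, function name, and version strings are illustrative:

```rust
// Hypothetical pre-execution guard: if the cached PVF artifact was compiled
// by a different node/wasmtime version, report an internal error (wipe and
// recompile the artifact) instead of treating the candidate as invalid.
#[derive(Debug, PartialEq)]
enum PreCheck {
    Proceed,               // versions match, safe to execute
    InternalError(String), // node-side fault, must never raise a dispute
}

fn check_artifact_version(node_version: &str, artifact_version: &str) -> PreCheck {
    if node_version == artifact_version {
        PreCheck::Proceed
    } else {
        PreCheck::InternalError(format!(
            "artifact built by {artifact_version}, node runs {node_version}"
        ))
    }
}

fn main() {
    // The incident scenario: artifact from wasmtime 6.0.0 loaded by 6.0.1.
    let outcome = check_artifact_version("6.0.1", "6.0.0");
    assert_ne!(outcome, PreCheck::Proceed);
    println!("mismatch detected: {:?}", outcome);
}
```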

Slashing for disputes against valid blocks was not enabled during the incident

The ideal way to prevent spam is to slash validators who raise disputes against valid blocks. Because this requires a complex design that balances multiple requirements and prevents attacks, it has not yet been implemented. It is urgently needed, however: without it, malicious backers can exploit the difference in execution times between the backing and approval-checking timeouts to get honest approval checkers slashed.

We are currently in the design phase for a solution called ‘time disputes’ (Time Disputes · Issue #6510 · paritytech/polkadot · GitHub).

Why did the relay chain slow down?

We expected a high rate of disputes to slow down only parachains, not relay chain block times. The validator initiated 60k disputes, and even once they stopped, it took another 12 hours for relay chain block times to return to normal.

Among the first things we noticed were increased block import times and inherent data creation times. The maximum execution duration for a block is capped at 2 seconds, corresponding to the maximum allowed block weight. As soon as disputes started, block import times gradually increased and peaked at 3-5 seconds. As a result, block authors missed their slots, and a higher number of forks resulted, as authors started crafting blocks at the same height while still importing the heavy sibling block.

Block import time before, during the incident, and after (once the network returned to normal).

Why did imports take between 3-5s?

Block execution time depends on the time it takes to execute all of its transactions. A relay chain block contains an unsigned extrinsic (inherent) for processing parachain inherent data, which consists of proofs of block backing, availability, and dispute resolution. Slashing requires that all the votes of the participating validators are put on chain.

The number of dispute votes that need to go on-chain is large when there are active disputes (around 60k in our incident). On Polkadot and Kusama this is 200 votes per dispute, assuming all parachain validators participate. Each vote is signed by the validator who issued it, and during block execution we check the signatures of all votes imported in that block.
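The resulting signature-check volume is simple to estimate; a sketch using the figures from the text above:

```rust
// 60,000 disputes, each carrying ~200 signed votes that must be verified
// during block execution, gives the total number of signature checks.
fn total_votes(disputes: u64, votes_per_dispute: u64) -> u64 {
    disputes * votes_per_dispute
}

fn main() {
    // 60,000 * 200 = 12,000,000 signatures to verify across the incident.
    println!("{}", total_votes(60_000, 200));
}
```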

Signature check time histogram on reference hardware:

Running benchmarks on the blocks crafted during the incident showed that the consumed weight was much larger than the block weight estimated when blocks were created:

2023-03-14 12:20:32 Block 8427 with    13 tx used  84.62% of its weight (   441,319,387 of    521,532,364 ns)    
2023-03-14 12:20:43 Block 8428 with     2 tx used 165.23% of its weight ( 3,509,763,887 of  2,124,232,762 ns) - OVER WEIGHT!    
2023-03-14 12:20:54 Block 8429 with     2 tx used 173.14% of its weight ( 3,681,854,294 of  2,126,555,888 ns) - OVER WEIGHT!    
2023-03-14 12:20:56 Block 8430 with    21 tx used  82.18% of its weight (   536,352,058 of    652,686,508 ns)    
2023-03-14 12:20:57 Block 8431 with    15 tx used  73.91% of its weight (   571,469,565 of    773,157,660 ns)    
2023-03-14 12:21:08 Block 8432 with     2 tx used 162.42% of its weight ( 3,460,068,835 of  2,130,279,730 ns) - OVER WEIGHT!    
2023-03-14 12:21:09 Block 8433 with    32 tx used 273.82% of its weight (   404,748,840 of    147,816,954 ns) - OVER WEIGHT!    
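The OVER WEIGHT ratios in the log can be reproduced from the reported nanosecond figures; a small sketch (the function is illustrative, not the benchmarking tool's code):

```rust
// Weight usage as reported by the benchmark: consumed execution time as a
// percentage of the weight estimated at block-authoring time. Anything over
// 100% means the block was under-estimated ("OVER WEIGHT").
fn weight_usage_percent(consumed_ns: u64, estimated_ns: u64) -> f64 {
    consumed_ns as f64 / estimated_ns as f64 * 100.0
}

fn main() {
    // Block 8428 from the log: 3,509,763,887 ns consumed vs 2,124,232,762 ns estimated.
    println!("{:.2}%", weight_usage_percent(3_509_763_887, 2_124_232_762));
}
```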

We can safely conclude that the overweight blocks were caused by inaccurate weight estimates.

Why did the blocks consume more weight than reported?

To understand this we looked at how disputes are stored on chain and when/how we access that storage:

	pub(super) type Disputes<T: Config> = StorageDoubleMap<
		_,
		Twox64Concat,
		SessionIndex,
		Blake2_128Concat,
		CandidateHash,
		DisputeState<T::BlockNumber>,
	>;

We use a double storage map, keyed by session and candidate hash, to keep the state of all disputes that occurred during the past six sessions. Each new relay chain block introduces additional dispute votes, which we use to update the state of the disputes they reference. As dispute votes are processed in the runtime, we mark them accordingly in storage and emit specific events: DisputeInitiated, DisputeConcluded, or DisputeTimedOut.

The most interesting event in the context of figuring out the problem is the timeout event. It should be emitted when a supermajority is not reached with votes passed to the runtime within a given period (600 blocks on Polkadot).

Looking closer at when this event is emitted reveals our problem: we iterate over all 60k dispute entries on every block in the disputes initializer. Each read costs about 20μs, which adds up to 1.2 seconds of execution time.
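The per-block overhead follows directly from those numbers; a sketch using the figures above:

```rust
// Cost of the disputes initializer: one storage read per stored dispute,
// at roughly 20 microseconds per read.
fn initializer_cost_us(stored_disputes: u64, read_cost_us: u64) -> u64 {
    stored_disputes * read_cost_us
}

fn main() {
    // 60,000 entries * 20us = 1,200,000us = 1.2s added to every block import.
    println!("{}", initializer_cost_us(60_000, 20));
}
```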

Why didn’t the relay chain block time recover as soon as disputes ended?

Disputes are not pruned from storage after they conclude. Instead, they are kept for six sessions before pruning. This means that even when disputes stopped, we were still paying the cost of iterating over all of the disputes in storage on every block. As can be observed in the chart above, the import times recovered as soon as those dispute entries were pruned from storage on each new session. Polkadot sessions are four hours long, so it took 24 hours to prune all disputes from storage and revert the initializer cost to zero.
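The recovery timeline follows from the pruning schedule; a trivial sketch with the figures from the text:

```rust
// Disputes are kept for six sessions, and Polkadot sessions last four
// hours, so fully draining the stored disputes takes 6 * 4 = 24 hours.
fn full_prune_hours(sessions_kept: u64, session_hours: u64) -> u64 {
    sessions_kept * session_hours
}

fn main() {
    println!("{}", full_prune_hours(6, 4)); // hours until storage is empty again
}
```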

The fix

The design of the disputes pallet assumes disputes are rare events because offending validators are slashed and disabled when they spam the network with disputes. So, the best long-term fix is to fully implement and enable dispute slashing.

As a short-term fix, we need to change the disputes pallet to not iterate all the disputes in storage.

We decided to remove the dispute timeout events in this PR. The timeout events do not add any benefit, but demand a lot of block space. It is safe to remove them because:

  • Disputes that hit the runtime should never time out. They have been confirmed, so they should conclude.
  • There is no timeout on the node side. Timing them out in the runtime is out of sync with what the node would assume about the dispute.
  • Disputes are pruned after six sessions regardless of whether they were concluded or not.
  • Timed-out disputes should never trigger slashing since validators who vote are usually honest.

We suggest this fix be included in the next runtime upgrade (9410).

Lessons Learned

  • We need to invest more in hardening PVF execution to prevent disputes from being initiated due to operational errors. The issue that triggered this incident could have been prevented simply by checking the worker version against the node version before execution.
  • The block import thread must be run with the highest possible priority. When the load on a machine is high, this allows us to still import blocks in proper time frames.
  • We will continue to improve our dispute load testing and benchmarking block weights.

Thanks for the postmortem. This explains why it takes forever for Chopsticks to process recent blocks and why we have to override disputes to make block production within Chopsticks complete in a reasonable amount of time: e2e-tests/polkadot.ts at c67af7fa0d1711bae728c4ec8a2a972f09fc9c00 · AcalaNetwork/e2e-tests · GitHub