[2025-05-09] Kusama dispute storm postmortem

Timeline

2025-05-09 12:35:30 - Runtime upgrade

The runtime of Kusama was upgraded to v1.5.0 at block #28279019 after referendum 519 was successfully executed.

The new runtime enabled the v2 candidate receipts (RFC-103) node feature flag. This configuration change was not applied immediately in session #47,731; instead, it was scheduled to take effect in session #47,733.
The upgrade was applied successfully and the network continued to function normally until session #47,733, when the node feature bits changed to 0b11010000, signalling that v2 candidate receipts were now accepted by the runtime and nodes.
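
For illustration, a node-feature check boils down to testing a single bit in the runtime's feature bitfield. The sketch below models the bitfield in the order the bits are printed above (leftmost bit = feature index 0); the v2-receipts index used here is a placeholder, not the actual RFC-103 assignment.

```rust
// Minimal sketch of a node-feature lookup. The index below is hypothetical,
// not the real RFC-103 feature index.
const CANDIDATE_RECEIPT_V2_INDEX: usize = 3; // placeholder index

fn is_feature_enabled(features: &[bool], index: usize) -> bool {
    features.get(index).copied().unwrap_or(false)
}

fn main() {
    // 0b11010000 from the timeline, expanded bit by bit (leftmost bit = index 0).
    let features = [true, true, false, true, false, false, false, false];
    println!(
        "v2 receipts enabled: {}",
        is_feature_enabled(&features, CANDIDATE_RECEIPT_V2_INDEX)
    );
}
```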

2025-05-09 14:30 - V2 receipts are enabled

A small group (~40) of Kusama validators was running node versions older than v1.17. These older validators do not support v2 candidate receipts; v2 removed the collator signature field from the receipt.
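
As a rough illustration of that change (field names and layout simplified, not the exact primitives in polkadot-sdk), the v1 descriptor carries a collator signature that old nodes verify during approval checking, while the v2 descriptor no longer has one:

```rust
// Simplified sketch of the descriptor change behind RFC-103. The real types in
// polkadot-sdk have more fields and a different layout; this only shows the
// field that matters for this incident.

#[allow(dead_code)]
struct CandidateDescriptorV1 {
    para_id: u32,
    relay_parent: [u8; 32],
    collator_id: [u8; 32],
    collator_signature: [u8; 64], // verified by pre-v1.17 validators
}

#[allow(dead_code)]
struct CandidateDescriptorV2 {
    para_id: u32,
    relay_parent: [u8; 32],
    // The collator signature field is gone, so a check written against the
    // v1 layout cannot succeed on these receipts.
}
```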

Collators running versions v1.17 or later started to produce blocks that use the new v2 receipts. These blocks were backed on-chain, as most of the validators had upgraded.

During approval checking, the old validators failed the collator signature check on every v2 candidate receipt and constantly initiated disputes.

2025-05-09 14:30 - Finality lag and high parachain block times

Finality started to lag at 2025-05-09 14:30 UTC, as soon as the disputes started and network connectivity degraded.

Parachain block times started to degrade as on-chain backing of candidates slowed down, since nodes prioritised dispute participation and approval checking.

2025-05-09 18:50 - Finality stalls

As CPU and network load increased due to the dispute and approval checking work, the nodes started falling behind and could not keep up with the rate of incoming network messages. The chart below shows that subsystems had a huge backlog of messages to process.

Due to the load and network connectivity issues, GRANDPA stopped finalising blocks at 2025-05-09 18:50 UTC.

Finality did not catch up for the next 23 hours, as can be observed in the chart above.

2025-05-10 05:31 - Finality starts to recover

The complete stall lasted around 11 hours, until the emergency fix deployment made significant progress and the network load started to trend downward. GRANDPA slowly started finalising blocks again, but the finality lag persisted for another 8 hours until enough validators had applied the emergency fix.

2025-05-10 13:30 - Finality and parachain block times recover

Finality was fully restored after the emergency fix deployment reached 2/3 of the validators.

As the load of the network decreased to normal operational levels, all parachains using v1 descriptors returned to normal block times.

Parachains whose collators used v2 receipts still experienced high block times, as candidates with v2 receipts were not backed: after the emergency fix, more than 2/3 of the validators no longer accepted v2 receipts.

2025-05-11 13:13 - All parachain block times are normal

All parachain block times recovered as soon as the referendum disabling v2 receipts was executed at block #28,304,492.

The emergency fix

A referendum to disable v2 receipts on Kusama was posted at 2025-05-09 20:30:36 UTC.

A temporary emergency fix was created to restore finality. It included the following changes (sketched in code after the list):

  • Disablement of the dispute logic on the node (effectively ignoring all dispute protocol messages) to prevent any new disputes from being raised.
  • Re-enablement of the dispute logic in two weeks, at block height #28,486,744.
  • Disablement of v2 candidate receipts on the node side (overriding the runtime flags) to ensure that the fixed nodes could not back v2 candidate receipts.
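
A hypothetical sketch of that node-side gating (illustrative names and shapes, not the actual patch shipped in stable2503-3): dispute protocol messages are dropped until the re-enable height, and v2 receipts are rejected regardless of what the runtime flags say.

```rust
// Illustrative sketch of the emergency-fix gating; constants and function
// names are invented for this example, not taken from the actual release.
const DISPUTES_REENABLE_AT: u32 = 28_486_744; // block height from the fix

// Drop all incoming dispute protocol messages until the re-enable height.
fn should_process_dispute_message(current_block: u32) -> bool {
    current_block >= DISPUTES_REENABLE_AT
}

// Refuse to back v2 candidate receipts even if the runtime feature flag says
// they are enabled (the node-side override from the fix).
fn can_back_candidate(receipt_is_v2: bool, _runtime_allows_v2: bool) -> bool {
    !receipt_is_v2
}
```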

The fix was released as a Kusama-only node release, stable2503-3.

Why did the disputes start?

We expected that v2 receipt candidates would be disputed by old validators, but we also expected off-chain disablement to mitigate the dispute spam and prevent dispute load by ignoring disputes initiated by the old validators.

Off-chain disablement did not fully mitigate the dispute spam.

Why did off-chain disabling not mitigate the issue?

Off-chain disabling only works for unconfirmed disputes, meaning disputes in which 1/3 or fewer of the validators participate. Nodes participate even in spam disputes once they are confirmed.

The disable mechanism works by storing a list of validators in memory and ignoring disputes initiated by any validator on that list. A validator that loses a dispute against a valid candidate is disabled for the current session only. These old validators repeatedly initiated disputes in every new session; the logs show around 40 validators being disabled at the beginning of each session.

However, if a validator is restarted, it comes back online with an empty list of disabled validators. We believe that validator restarts reduced the effectiveness of off-chain disabling, causing more dispute participation and spam.
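
A minimal sketch of this behaviour, with hypothetical types: the disabled set lives only in memory and is keyed by session, so a restarted node starts from an empty set and participates in spam disputes again until the offenders lose a dispute in the new process.

```rust
use std::collections::{HashMap, HashSet};

type SessionIndex = u32;
type ValidatorIndex = u32;

// Sketch of the in-memory off-chain disabling state (hypothetical shape).
// Because it is not persisted, a restarted node begins with `disabled` empty.
#[derive(Default)]
struct OffchainDisabled {
    disabled: HashMap<SessionIndex, HashSet<ValidatorIndex>>,
}

impl OffchainDisabled {
    // Called when a validator loses a dispute against a valid candidate;
    // the disablement only covers the current session.
    fn disable(&mut self, session: SessionIndex, validator: ValidatorIndex) {
        self.disabled.entry(session).or_default().insert(validator);
    }

    // Ignore disputes initiated by disabled validators, but only while the
    // dispute is unconfirmed (1/3 or fewer validators participating).
    fn should_ignore_dispute(
        &self,
        session: SessionIndex,
        initiator: ValidatorIndex,
        confirmed: bool,
    ) -> bool {
        !confirmed
            && self
                .disabled
                .get(&session)
                .map_or(false, |set| set.contains(&initiator))
    }
}
```

The persistence fix mentioned in the mitigation section below roughly amounts to seeding this state from recent dispute outcomes at startup instead of always starting empty.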

Additionally, with sufficient CPU load, even updated validators started to dispute valid candidates when the PVF execution timeout was hit.

Why did the finality training wheels not work?

This mechanism is triggered when either disputes or approval voting keep finality from progressing for at least 500 blocks. This covers any unconcluded dispute or unapproved candidate in any unfinalised Relay Chain block. Once the 500-block threshold is hit, all candidates included in the unfinalised chain are considered approved.
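
In sketch form (hypothetical names, not the actual implementation), the check amounts to comparing the finality lag attributable to unapproved or disputed candidates against the 500-block threshold:

```rust
// Illustrative threshold check for the "training wheels"; names are invented.
const TRAINING_WHEELS_THRESHOLD: u32 = 500;

// If the lag caused by approval voting or disputes reaches the threshold,
// treat every candidate in the unfinalised chain as approved so the finality
// vote can move forward.
fn force_approve(approval_lag: u32, dispute_lag: u32) -> bool {
    approval_lag.max(dispute_lag) >= TRAINING_WHEELS_THRESHOLD
}
```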

In our case, we did have dispute lag and approval voting lag larger than 500 blocks, and the training wheels were activated.

The training wheels did nothing because GRANDPA had stopped finalising blocks even before the training wheels were activated.

Why did GRANDPA stop finalising blocks?

From past testing, we know that under very high load or bad connectivity it is possible for GRANDPA to stop finalising. In our case, we can see that the number of GRANDPA messages validators were sending dropped significantly during the finality stall.

One important detail is that Kusama is running with the new litep2p stack. At this point we don’t know whether this had a positive or negative impact; this remains to be investigated further.

Why did parachain block times degrade?

During the incident, parachain block times degraded for two reasons:

  1. Backing back pressure. Approval voting and disputes take priority over backing when the node (re-)executes candidates (see the sketch after this list). Approval voting no-shows caused by the existing load further increased the load on the system.
  2. Processing many disputes evicted entries from the node runtime API cache, triggering a bug in statement distribution. More details on the root cause can be found here.
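
To illustrate the back pressure in item 1 (a sketch with an invented priority enum, not the node's actual scheduling code): when dispute and approval work share the same execution resources as backing, backing jobs are the last to run.

```rust
use std::collections::BinaryHeap;

// Illustrative execution priorities: disputes first, then approval checking,
// backing last. This is a sketch, not the node's real queue.
#[derive(PartialEq, Eq, PartialOrd, Ord, Debug, Clone, Copy)]
enum Priority {
    Backing,
    Approval,
    Dispute,
}

fn main() {
    // Under load, the queue fills with dispute and approval jobs, and the
    // backing job keeps being popped last -- parachain block production waits.
    let mut queue = BinaryHeap::new();
    queue.extend([Priority::Backing, Priority::Dispute, Priority::Approval]);
    while let Some(job) = queue.pop() {
        println!("executing {job:?} job");
    }
}
```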

After the emergency fix was applied, not all parachain block times returned to normal. Candidates with v2 receipts were discarded by the fixed nodes, so the block times of the parachains producing them remained severely affected.

How do we avoid similar issues in the future?

We are fixing the root causes of the issues:

  1. Prevent dispute spam when validators restart. A fix that populates the off-chain disabled validator list at startup was merged.
  2. A statement distribution fix was merged. It ensures that backed candidate statements are processed even under heavy dispute spam.

We are implementing stronger dispute spam prevention:

  • On-chain disablement of validators that vote against valid candidates.
  • A validator re-enabling mechanism to ensure that we don’t disable too many (more than 1/3 of the) validators on-chain (sketched below).
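
A rough sketch of how such a cap could work (illustrative only; the actual runtime logic may differ): never keep more than one third of the validator set disabled, re-enabling the earliest-disabled validators first.

```rust
use std::collections::VecDeque;

// Illustrative on-chain disablement cap: keep disabled validators in
// disablement order and never let the disabled set exceed one third of the
// active validator set. Not the actual runtime logic, just the invariant.
fn disable_with_cap(disabled: &mut VecDeque<u32>, offender: u32, validator_count: u32) {
    if !disabled.contains(&offender) {
        disabled.push_back(offender);
    }
    // Re-enable the earliest-disabled validators once the 1/3 cap is exceeded.
    let max_disabled = (validator_count / 3) as usize;
    while disabled.len() > max_disabled {
        let _reenabled = disabled.pop_front();
    }
}
```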

Additionally, we will improve our testing process to reduce the chances of similar issues occurring in the future:

  • Run larger-scale and longer-duration dispute tests to understand failure modes and fix potential issues before they occur on live networks.
  • Dispute load testing.
  • Investigate how litep2p influenced the connectivity issues and the GRANDPA stall.

2025-05-15 - Root causes fixed

Kusama validators were upgraded to stable2503-4 on May 15th, 2025. This release reverts the emergency fix and includes the long-term fixes (1 and 2) numbered in the section above.


Thx for the postmortem. Very insightful!

Tangential question: is there an attack vector where a small number of nodes could collude to produce a dispute storm and stall finality?

Even before this incident we had protection against dispute spam. However, we did not really take restarts into consideration when we implemented off-chain disabling.

With the off-chain disablement persistence fix, honest nodes will ignore dispute spam even after a restart and will not participate in creating storms.