[2025-05-03] Polkadot parachain block time degradation

Timeline

This timeline spans multiple days and covers two incidents of prolonged, significant block time degradation that affected all parachains on Polkadot. Both incidents shared the same root cause.

First incident

2025-05-02 15:37 Disputes start

The first dispute was seen onchain at 2025-05-02 15:37 UTC at block #25832586, almost seven hours before the parachain block times degraded.

Offchain, there were many unconfirmed disputes coming from a single validator, but these were not yet affecting block times.

We have logs from the originating validator, but only for the second incident; during the first one, no one had trace-level logging enabled. The dispute spam can be seen indirectly as error=AuthorityFlooding logs on other validators. The originating validator should already have been disabled offchain, because it had already lost a dispute that was recorded onchain, as seen in the chart.

May 06 22:28:28 DisputeRequest(UncheckedDisputeMessage { candidate_receipt: CandidateReceiptV2 { descriptor: CandidateDescriptorV2 { para_id: Id(2043), relay_parent: 0x8df79ebedb8e78b33853becea9cc8cbbe3418e32d05e986df72ef534949941c5, pov_hash: 0xc18367b17ea6313efb417b48809bc4af2b1bb2f582f186a3cfb869499574e97a, erasure_root: 0xfd262abacb8e990f675109f0c701670bf96615ce929859b135a1678bf30dd620, para_head: 0x566735719ff93a553c86892c76dc8bbfa5d513d5991e8a442f8d9f00c88a5c8c, session_index: 10834, 
invalid_vote: InvalidDisputeVote { validator_index: ValidatorIndex(186), signature: Signature(246417e9823b188ab9749bdab303bfc1ff722f38238924444f37ec60bc1fe868359cc684ab1fe70cf37af856304941984cfa3d90963784a1b6a27b5459141181), kind: InvalidDisputeStatementKind::Explicit }, 
valid_vote: ValidDisputeVote { validator_index: ValidatorIndex(375), signature: Signature(9a308907ddf2e5012c26225957236bda7b604500d0ec02b7c50ebea0ff8eed53b075ecdf8ed396c8073b60b6ce1ee7ab8361c186e75772e4c8767e008b7eae8d), kind: ValidDisputeStatementKind::Explicit } })

May 06 22:28:28 DisputeRequest(UncheckedDisputeMessage { candidate_receipt: CandidateReceiptV2 { descriptor: CandidateDescriptorV2 { para_id: Id(2043), relay_parent: 0xb6b0d2c70b9ab9d5eddcc031287487555ae44e62a66315b6e5f9c235d33cbc2c, pov_hash: 0x1b8cdc6415c3a26f4f94757020c73b9f366144ca1c2c9d171589248dd235bbe8, erasure_root: 0x2f0252f234653d9c6997c9ceae5e650ce453bb03eb33a7ae01e7f2557e56eba1, session_index: 10834, 
invalid_vote: InvalidDisputeVote { validator_index: ValidatorIndex(186), signature: Signature(2e544e4335cb698f2ed6013421fc0e04fefc03a5fa3956f70aa8933ec5aed454d289260525aaa2f28ba9e2be636745c1a4f865fe3967c22cf26c1cc6c2af2b84), kind: InvalidDisputeStatementKind::Explicit }, 
valid_vote: ValidDisputeVote { validator_index: ValidatorIndex(375), signature: Signature(4a4f7894717dbbadf7f5a06024edfaeee69722ae19e165cbdd995975c28741539d613a6e30646e354b744b71370d031d3cad849f51f9775456f1d8c9a3e94689), kind: ValidDisputeStatementKind::Explicit } })

2025-05-03 02:15 Parachain block times degrade

All Polkadot parachain block times suffered significant degradation. The block times increased from an average of 10s to as high as 30s during spikes.

At this point we did not have much data to identify the root cause. All logs provided by validators showed heavy dispute spam, but validators were not participating in the disputes and there were no signs of high network or CPU load.

We asked a few validators to restart with debug logs. After a restart, we observed no further dispute spam flood logs on those nodes.

2025-05-03 07:30 More validator restarts

We asked more validators to restart gradually, hoping that this would mitigate the dispute flood.

2025-05-03 07:45 Block times return to normal

Validators gradually stopped spamming, and all parachain block times returned to normal after five hours and 30 minutes.

Second incident

2025-05-06 07:38 First dispute

The first dispute was seen onchain in block #25884954, and the second in block #25887326 at 2025-05-06 11:36.

2025-05-06 11:15 New node release: stable2503-1

Validators were notified and some validators automatically started the upgrade process, restarting their nodes. Other validators gradually restarted while further disputes were raised.

Similarly to the first incident, unconfirmed disputes were being spammed.

The originator of these disputes was the same validator as in the first incident. As more validators restarted, they started participating in the disputes, which increased the volume of dispute spam.

2025-05-06 12:45 Parachain block times degrade

We asked validators to restart, but it had no effect on dispute spam and block times.

2025-05-07 07:45 Parachain block times recover

The block times recovered 24 hours after the first dispute that we believe triggered the block time degradation. This is exactly the size of the dispute session window on Polkadot (6 sessions * 4h).
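The arithmetic behind this observation can be checked with a short sketch. The window size and the timestamps are taken from the text and timeline above; the variable names are illustrative, and the real window is counted in sessions rather than wall-clock time, so this is only an approximation.

```python
from datetime import datetime, timedelta

# From the text above: the dispute window covers 6 sessions of 4 hours each.
SESSION_LENGTH = timedelta(hours=4)
DISPUTE_WINDOW_SESSIONS = 6
dispute_window = DISPUTE_WINDOW_SESSIONS * SESSION_LENGTH

# Timestamps copied from the timeline (minute granularity).
first_dispute = datetime(2025, 5, 6, 7, 38)
block_times_recovered = datetime(2025, 5, 7, 7, 45)

# Moment at which the first dispute is a full window old.
window_elapsed_at = first_dispute + dispute_window

# How far the observed recovery is from that moment.
offset = block_times_recovered - window_elapsed_at

print(f"dispute window: {dispute_window}")
print(f"recovery offset from window end: {offset}")
```

With the timeline's timestamps, the recovery lands within minutes of the point where the first dispute falls out of the window.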

Why did validators initiate disputes?

We currently have few logs and little information about how the disputes were initiated or about the number of unconfirmed disputes. However, we have seen disputes occur on Polkadot on rare occasions in the past.

The single validator raising disputes reported disk failures. We know that disk corruption can lead to disputes as documented here. Another data point is that we have seen one validator disputing because it failed the storage root checks, as documented in this ticket.

However, a single validator spamming disputes should neither overload nodes nor increase parachain block times.

Why did the other validators spam disputes?

All the other validators that participated in and spammed disputes were ones that had restarted. Nodes started with an empty list of offchain disabled validators and populated it only from disputes that concluded after the restart. As a result, the restarted nodes did not ignore the messages from the originator and joined the spam.

Offchain disablement lasted for just one session. So, at the beginning of every session the initiator was disabled again as soon as it lost the first dispute. Validators which restarted in the previous session would mark the validator as disabled for the current session.

However, if a validator was disabled in session N, other validators would accept and participate in disputes raised by the same validator in prior sessions where the validator was not marked as disabled.

This created a dispute storm of variable intensity, depending on how many validators had restarted, participated in the unconfirmed disputes, and relayed messages for them.
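The behaviour described in this section can be illustrated with a small toy model. This is a simplified sketch and not the node's actual code: the class, its method names, the validator index and the session index are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class OffchainDisabled:
    """Toy model of a node's offchain disabled list, kept per session."""

    # session index -> set of validator indices disabled for that session
    by_session: dict = field(default_factory=dict)

    def on_dispute_lost(self, session: int, validator: int) -> None:
        # A dispute concluded against `validator` in `session`.
        self.by_session.setdefault(session, set()).add(validator)

    def restart(self) -> None:
        # Behaviour described above: nothing is preserved across a restart.
        self.by_session.clear()

    def accepts_dispute_from(self, validator: int, dispute_session: int) -> bool:
        # Only the session the dispute belongs to is checked.
        return validator not in self.by_session.get(dispute_session, set())


ORIGINATOR = 7   # hypothetical validator index
SESSION_N = 100  # hypothetical session index

node = OffchainDisabled()
node.on_dispute_lost(SESSION_N, ORIGINATOR)

# Disputes for session N are ignored, but disputes for a prior session are not.
ignored_in_n = not node.accepts_dispute_from(ORIGINATOR, SESSION_N)
accepted_for_prior = node.accepts_dispute_from(ORIGINATOR, SESSION_N - 1)

# After a restart the list is empty, so the originator is accepted again.
node.restart()
accepted_after_restart = node.accepts_dispute_from(ORIGINATOR, SESSION_N)
```

In this model, both gaps from the text show up: a disablement recorded for session N does not cover disputes from earlier sessions, and a restart forgets the disablement entirely.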

Why did parachain block times degrade?

The root cause of the degradation can be traced back to a long-standing bug in the candidate backing pipeline.

The statement distribution subsystem handles backed candidate statements from various forks. Upon importing a new block, it queries the runtime API (disabled_validators) for a list of validators disabled onchain for all blocks where the node must accept candidates. If this query fails for any block, the process is interrupted.

During the incident, some of these queries failed due to block pruning. Consequently, the subsystem skipped processing the most recently imported blocks and any backed candidates built on top of them. This explains the absence of backed candidates in many Relay Chain blocks.
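The failure mode can be sketched as follows. This is a simplified model under stated assumptions, not the statement distribution code: RuntimeApiError, query_disabled and both update functions are hypothetical names, and the tolerant variant is only one possible way to avoid the problem, not necessarily the merged fix.

```python
class RuntimeApiError(Exception):
    """Hypothetical stand-in for a failed disabled_validators query."""


def query_disabled(block, available):
    # `available` maps block -> disabled validator indices for blocks whose
    # query still succeeds; any other block fails (for example, it was pruned).
    if block not in available:
        raise RuntimeApiError(f"disabled_validators failed for {block}")
    return available[block]


def update_all_or_nothing(blocks, available):
    """Behaviour described above: one failed query interrupts the whole update."""
    result = {}
    for block in blocks:
        try:
            result[block] = query_disabled(block, available)
        except RuntimeApiError:
            return None  # the newly imported blocks are never processed
    return result


def update_skipping_failures(blocks, available):
    """A more tolerant variant (illustrative, not necessarily the merged fix)."""
    result = {}
    for block in blocks:
        try:
            result[block] = query_disabled(block, available)
        except RuntimeApiError:
            continue  # skip only the block whose query failed
    return result


blocks = ["old_pruned", "recent_a", "recent_b"]
available = {"recent_a": set(), "recent_b": {3}}

strict = update_all_or_nothing(blocks, available)
tolerant = update_skipping_failures(blocks, available)
```

Here a single failing query for an old block makes the strict update drop the recent blocks as well, while the tolerant variant still processes them.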

Normally, pruning is not a problem because the disabled_validators runtime API call does not fail: the result of the call can still be found in the LRU cache of the runtime-api subsystem. However, during the spam of old disputes, the LRU cache evicted the entry for the block in favour of the blocks with disputes, so the call failed.
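The eviction effect can be reproduced with a minimal LRU cache. This is an illustrative sketch only: the cache class, its capacity, the block names and the disabled_validators function below are hypothetical and much simpler than the runtime-api subsystem.

```python
from collections import OrderedDict


class LruCache:
    """Minimal LRU cache: the least recently used entry is evicted first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used


def disabled_validators(block, cache, pruned_blocks):
    cached = cache.get(block)
    if cached is not None:
        return cached  # served from cache, even if the block state is pruned
    if block in pruned_blocks:
        raise LookupError(f"state for block {block} is pruned")
    value = frozenset()  # placeholder result for a block whose state is available
    cache.put(block, value)
    return value


cache = LruCache(capacity=2)
pruned_blocks = {"pruned_block"}

# The result for the pruned block is still cached, so the call succeeds.
cache.put("pruned_block", frozenset({3}))
served_from_cache = disabled_validators("pruned_block", cache, pruned_blocks)

# Queries for old disputed blocks fill the cache and push that entry out.
disabled_validators("old_dispute_1", cache, pruned_blocks)
disabled_validators("old_dispute_2", cache, pruned_blocks)

try:
    disabled_validators("pruned_block", cache, pruned_blocks)
    call_failed = False
except LookupError:
    call_failed = True
```

The same call succeeds while its result is cached and fails once unrelated lookups have evicted it, which mirrors the sequence described above.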

The full root cause is explained in detail here.

Why did offchain disablement fail?

Offchain disablement failed because the list of disabled validators was not preserved at restart. After restart, nodes participated and sent dispute messages for the unconfirmed disputes for an entire session, unless another dispute concluded against the originator.

How do we prevent future similar incidents?

Fixes:

  1. Prevent dispute spam when validators restart. A fix that populates the offchain disabled validators at startup has been merged.
  2. A statement distribution fix has been merged. It ensures that backed candidate statements are processed even under heavy dispute spam.
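The idea behind the first fix can be sketched as rebuilding the offchain disabled list at startup instead of starting empty. This is a hypothetical illustration, not the merged change: the function name, the shape of the concluded_disputes records and the use of a 6-session window are assumptions.

```python
def rebuild_offchain_disabled(concluded_disputes, current_session, window=6):
    """Rebuild session -> disabled validators from recorded dispute outcomes.

    `concluded_disputes` is assumed to be an iterable of
    (session index, set of validator indices that lost) records.
    """
    oldest = current_session - window + 1
    disabled = {}
    for session, losers in concluded_disputes:
        # Only sessions still inside the dispute window matter.
        if oldest <= session <= current_session:
            disabled.setdefault(session, set()).update(losers)
    return disabled


# Hypothetical records: validator 7 lost disputes in sessions 95 and 100,
# validator 1 lost one in session 90 (outside the window at session 100).
records = [(95, {7}), (100, {7}), (90, {1})]
disabled_at_startup = rebuild_offchain_disabled(records, current_session=100)
```

A node that starts with this information can ignore the originator's messages immediately after a restart, rather than waiting for another dispute to conclude.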

Testing:

  • Improve dispute activity monitoring and alerting
  • Create a long-duration dispute testing pipeline covering offchain disablement, slashing and validator restarts
  • Regular dispute load tests

2025-05-15 Root cause fixed

Polkadot validators started to upgrade to stable2503-4 on May 15th, 2025. This release includes fixes for dispute spam and statement distribution (1 and 2 as numbered in the section above).
