Summary
On 2025-11-25 15:54 UTC, during a spammening session, Kusama test parachains were used to spam the network with transactions; finality started lagging until the failsafe kicked in.
Why did finality start lagging?
As more load was added on the spammening test parachains, nodes became heavily loaded. At the peak of the spam we saw up to 85 candidates per relay chain block, with all the spammening parachain blocks taking ~2s of PVF execution time.
CPU usage was very high, with our nodes reporting up to 800% at the peak, the biggest CPU hog being libp2p (at 400%). That caused a lot of no-shows, especially on Kusama nodes that are below the reference hardware specification. With no-shows, other nodes have to do more parachain block checking, which increases the load on them as well.
In the end, this puts back-pressure on backing, so we end up including fewer parachain blocks, which in turn gives nodes time to catch up on checking. In theory that should limit the finality lag we get from nodes being overloaded; however, that's not what happened.
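To make the overload concrete, here is a back-of-envelope estimate of the approval-checking work generated per relay chain block at the peak. The number of checkers per candidate is an assumption for illustration (the real value is a runtime parameter and grows with no-shows); the other numbers come from the observations above.

```rust
// Rough per-validator PVF load at the peak of the spam.
fn main() {
    let candidates_per_relay_block = 85.0_f64; // observed at the peak
    let pvf_exec_secs = 2.0;                   // observed per spammening parachain block
    let checkers_per_candidate = 30.0;         // ASSUMPTION: illustrative number of approval checkers
    let validators = 500.0;                    // active set size seen in the logs
    let relay_block_time_secs = 6.0;

    // Total CPU-seconds of approval checking generated by one relay chain block.
    let total_check_secs = candidates_per_relay_block * pvf_exec_secs * checkers_per_candidate;
    // Share of that work landing on each validator, per relay chain block.
    let per_validator_secs = total_check_secs / validators;

    println!(
        "~{per_validator_secs:.1}s of PVF execution per validator per {relay_block_time_secs}s relay chain block \
         ({:.0}% of one core), before the extra checks triggered by no-shows",
        100.0 * per_validator_secs / relay_block_time_secs
    );
}
```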
Why did finality lag that much?
In approval-distribution we have two mechanisms called aggression levels, which are meant to help when finality lags. The levels are:
- L1 aggression: Kicks in when the finality lag is greater than 16 blocks. At this point all nodes send all the messages they have created to all other validators in the network, and on top of that they continuously re-send those messages every 8 blocks.
- L2 aggression: Kicks in when the finality lag is greater than 28 blocks. All nodes then send all the messages (assignments and approvals) to their X and Y grid neighbours. The result is that each node receives every unique message in the network 64 times and sends it to 64 peers; so if you have, for example, 10_000 unique messages for a relay chain block, which is not unreasonable in the presence of no-shows and high load, you have to process 640_000 messages for that block and send 640_000 messages in one go (see the sketch after this list).
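A minimal sketch of the two thresholds and of the L2 message amplification described above. The constants mirror the numbers in the text; the function names are illustrative, not the actual approval-distribution API.

```rust
const L1_THRESHOLD: u32 = 16; // finality lag (in blocks) at which L1 kicks in
const L2_THRESHOLD: u32 = 28; // finality lag at which L2 kicks in
const RESEND_PERIOD: u32 = 8; // under L1, everything is re-sent every 8 blocks

#[derive(Debug, PartialEq)]
enum Aggression {
    None,
    L1,
    L2,
}

fn aggression_for_lag(finality_lag: u32) -> Aggression {
    if finality_lag > L2_THRESHOLD {
        Aggression::L2
    } else if finality_lag > L1_THRESHOLD {
        Aggression::L1
    } else {
        Aggression::None
    }
}

fn main() {
    assert_eq!(aggression_for_lag(10), Aggression::None);
    assert_eq!(aggression_for_lag(20), Aggression::L1);
    assert_eq!(aggression_for_lag(30), Aggression::L2);
    println!("under L1, everything is re-sent every {RESEND_PERIOD} blocks");

    // Under L2, each unique message is received from and sent to ~64 grid peers.
    let grid_peers: u64 = 64;
    let unique_messages: u64 = 10_000; // plausible for one relay chain block under heavy load
    println!(
        "L2: ~{} messages received and ~{} sent for that block",
        unique_messages * grid_peers,
        unique_messages * grid_peers
    );
}
```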
So, when L2 kicked in, we saw ~100 validators going offline. The validators went offline because they started crashing and restarting with the error below, since there were many more messages than the subsystem channel can hold (64k).
Subsystem approval-distribution-subsystem appears unresponsive when sending a message of type polkadot_node_subsystem_types::messages::ApprovalDistributionMessage.
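For illustration, here is a small sketch of that failure mode using a plain bounded channel from the standard library: once the consumer stops keeping up, producers can no longer enqueue, which is what the "appears unresponsive" error above is signalling. This is only an analogy; the node uses its own bounded subsystem channels, not std's.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

fn main() {
    // Hypothetical stand-in for the 64k-message subsystem channel bound.
    let (tx, _rx) = sync_channel::<u32>(64_000);

    // Simulate a burst bigger than the capacity while the consumer is stalled.
    for i in 0..700_000u32 {
        match tx.try_send(i) {
            Ok(()) => {}
            Err(TrySendError::Full(_)) => {
                println!("channel full after {i} messages; the subsystem looks unresponsive");
                break;
            }
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
}
```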
Why didn’t finality recover after validators restarted?
There were a few blocks that more than 1/3 of the nodes could not approve; the logs looked like this:
status=247 assignments triggered/162 approvals/500 validators
We can see here that not all 500 validators triggered their assignments, so what we think happened is that, because of the restarts, the network got segmented into two sets. One set thought the candidate was approved because they had processed enough messages, while the other set thought it was not approved, because the messages did not reach them due to the restarts.
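Plain arithmetic on the logged status line, just to make the 1/3 figure concrete (this is not the actual approval-voting logic):

```rust
fn main() {
    // Numbers taken directly from the status line above.
    let validators = 500u32;
    let assignments_triggered = 247u32;
    let approvals = 162u32;

    let missing = validators - assignments_triggered;
    println!(
        "{assignments_triggered}/{validators} triggered, {approvals} approved, \
         {missing} never triggered ({:.1}% of the set)",
        100.0 * f64::from(missing) / f64::from(validators)
    );
    // More than 1/3 of the validators never triggered their assignments,
    // matching the observation above that these blocks could not be
    // approved by over a third of the nodes.
    assert!(missing * 3 > validators);
}
```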
Normally, if you end up in this situation, approval-distribution L1 aggression would solve it, but because of a bug (tracked in a separate issue) validators don't re-distribute their approvals after a restart. On top of that, validators that restarted would take a while to reconnect to the rest of the network and be able to fetch PoVs to approve blocks, so after a restart they would most likely also contribute no-shows.
How do we avoid this happening in the future?
- From January 2025, validators are supposed to start running nodes with two times more cores, 8 instead of 4; that should greatly help in this use case. @validators please make sure you respect the hardware specifications.
- Reduce the biggest CPU hog by switching from libp2p to litep2p in the future.
- Fix the approval-distribution aggression to not be that spammy and lead to nodes crashing.
- Fix approval-voting, so that nodes correctly re-distribute their approvals after restart.
- Make approval-voting more reliable by retrying to approve a candidate if it is still needed and the initial failure was because we couldn't fetch the PoV due to the node being poorly connected at restart (a retry sketch follows after this list).
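As a rough illustration of that last point, here is a sketch of what retrying could look like: retry only when the failure was a PoV fetch error and the candidate is still needed. All names and types here are hypothetical, not approval-voting's actual API.

```rust
use std::{thread, time::Duration};

#[derive(Debug)]
enum ApprovalError {
    PovFetchFailed,   // transient: peers may simply not be connected yet
    InvalidCandidate, // permanent: never retried, handled elsewhere
}

// Stand-in for recovering the PoV and re-executing the PVF; here the first
// two attempts fail as if the PoV could not be fetched.
fn try_approve(attempt: u32) -> Result<(), ApprovalError> {
    if attempt < 2 {
        Err(ApprovalError::PovFetchFailed)
    } else {
        Ok(())
    }
}

fn approve_with_retry(candidate_still_needed: impl Fn() -> bool) -> Result<(), ApprovalError> {
    let mut backoff = Duration::from_secs(1);
    for attempt in 0u32.. {
        if !candidate_still_needed() {
            return Ok(()); // finalized or no longer relevant, stop retrying
        }
        match try_approve(attempt) {
            Ok(()) => return Ok(()),
            Err(ApprovalError::PovFetchFailed) => {
                // Back off and try again once the node is better connected.
                thread::sleep(backoff);
                backoff = (backoff * 2).min(Duration::from_secs(30));
            }
            Err(err) => return Err(err), // permanent failures are not retried
        }
    }
    unreachable!()
}

fn main() {
    approve_with_retry(|| true).expect("approval eventually succeeds in this sketch");
    println!("candidate approved after retrying the PoV fetch");
}
```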