[2024-04-21] Polkadot parachains stalled until next session

Description

Immediately after runtime 1.2 was enacted on Polkadot, all parachains almost halted, with blocks only rarely being included on the relay chain. This happened because parachain candidates were not being backed, as the statement-distribution-v2 subsystem was not distributing backing statements.

Statement-distribution-v2 is a new subsystem, written for async backing, that is responsible for distributing backing statements between nodes so that all nodes become aware that a parachain candidate can be included on the relay chain; more details about it here. The subsystem gets enabled the moment the ParachainHost runtime API is bumped to a version that includes async_backing_params, which happened in Polkadot runtime release 1.2.
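To illustrate the gating described above, here is a minimal sketch (not the actual polkadot-sdk code) of how a node-side subsystem can switch protocols based on the runtime API version it observes; the names ParachainHostInfo, ASYNC_BACKING_API_VERSION and select_statement_distribution, as well as the version number, are assumptions made for the example:

```rust
// Hypothetical sketch: gate the v2 statement-distribution protocol on the
// ParachainHost runtime API version advertised by the runtime. None of the
// identifiers below are taken from polkadot-sdk; they only illustrate the
// version check described above.

/// Minimal stand-in for the information a node reads from the runtime.
struct ParachainHostInfo {
    /// Version of the ParachainHost runtime API exposed by the runtime.
    api_version: u32,
}

/// Assumed value: the first API version that exposes `async_backing_params`.
const ASYNC_BACKING_API_VERSION: u32 = 7;

#[derive(Debug, PartialEq)]
enum StatementDistribution {
    /// Legacy protocol, used while the runtime predates async backing.
    V1,
    /// Async-backing-aware protocol, enabled as soon as the runtime
    /// advertises `async_backing_params`.
    V2,
}

fn select_statement_distribution(info: &ParachainHostInfo) -> StatementDistribution {
    if info.api_version >= ASYNC_BACKING_API_VERSION {
        StatementDistribution::V2
    } else {
        StatementDistribution::V1
    }
}

fn main() {
    // Before the 1.2 runtime is enacted the node keeps using v1 ...
    assert_eq!(
        select_statement_distribution(&ParachainHostInfo { api_version: 6 }),
        StatementDistribution::V1
    );
    // ... and switches to v2 the moment the upgraded runtime is enacted,
    // which can happen mid-session.
    assert_eq!(
        select_statement_distribution(&ParachainHostInfo { api_version: 7 }),
        StatementDistribution::V2
    );
}
```

The key consequence is the second assertion: the switch happens at enactment time, which can land in the middle of a session, and therefore after the per-session state described below was already built.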

Statement-distribution-v2 uses the network topology to decide how to gossip backed statements. The topology is updated at the beginning of every session. When the subsystem received the new topology for the current session, it concluded it did not have to store it, because async backing was not enabled on any leaf. When the runtime was then upgraded mid-session, the subsystem had no topology to use for distributing backed statements, so all parachains stalled.
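A minimal sketch of this failure mode, under the assumption that the decision to keep the topology was conditioned on async backing being enabled on some active leaf; all types and functions below are illustrative, not the real subsystem code:

```rust
// Hypothetical sketch of the failure mode described above.

/// Stand-in for the per-session grid topology used to route statements.
struct SessionTopology {
    session: u32,
    // ... peer grid, shuffling, etc.
}

#[derive(Default)]
struct State {
    topology: Option<SessionTopology>,
}

/// Buggy variant: the topology is only stored if async backing is already
/// enabled on some active leaf. If the runtime upgrade lands mid-session,
/// the check was false when the topology arrived, so `state.topology`
/// stays `None` and no backed statements can be routed until next session.
fn on_new_topology_buggy(state: &mut State, topo: SessionTopology, async_backing_on_any_leaf: bool) {
    if async_backing_on_any_leaf {
        state.topology = Some(topo);
    }
}

/// Fixed variant: always keep the topology; whether to act on it is
/// decided later, per relay parent, when statements are actually routed.
fn on_new_topology_fixed(state: &mut State, topo: SessionTopology) {
    state.topology = Some(topo);
}

fn main() {
    // The session starts while async backing is still disabled on all leaves.
    let mut buggy = State::default();
    on_new_topology_buggy(&mut buggy, SessionTopology { session: 100 }, false);
    assert!(buggy.topology.is_none()); // a mid-session upgrade now has nothing to use

    let mut fixed = State::default();
    on_new_topology_fixed(&mut fixed, SessionTopology { session: 100 });
    assert!(fixed.topology.is_some()); // topology available even after a mid-session upgrade
}
```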

How was it fixed

The network self-healed at the next session, when it received a new topology; no further impact is expected.

Root-cause of the incident

Why did this issue exist in the first place?

Async backing is a huge update to the network, and upgrading with minimal downtime proved challenging since both the old protocols and the new ones had to coexist. Given the surface of the update, this bug snuck in.

Why wasn’t this caught on our test networks?

This bug was caught when async backing was enabled on Kusama, and at the time a fix deemed to be correct was provided.

Why did the issue repeat on Polkadot?

The fix was not correct. The issue was correctly root-caused and the provided fix looked sane; however, a mistake in the author's process meant the fix did not achieve its goal.

This happened because, at that moment, the only runtime without async backing was Polkadot, so the developer ran a hacked experiment to reproduce what happened on Kusama and to confirm that the fix solves the issue, then proceeded to capture the invariants of the scenario in a unit test. However, a late cleanup addition to the pull request made the fix incorrect, and the newly added unit test did not catch it. This could have been avoided if the developer had re-run the hacked experiment after the late addition; more details here.

Why didn’t any of our tests catch this issue before runtime deployment?

All of the polkadot-sdk tests use the latest runtime version, so if there are subtle interactions between node and runtime at the moment of the transition, our tests will miss them.
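As a rough sketch of the kind of transition test that is missing, under the assumption of a test harness that can spawn a network on the previous runtime and enact an upgrade mid-session; every helper below (TestNetwork, spawn_with_runtime, enact_upgrade_mid_session, parachain_block_rate) is invented for illustration and does not exist in polkadot-sdk:

```rust
// Hypothetical shape of a node/runtime transition test.

/// Stand-ins for a test network handle and a runtime build.
struct TestNetwork;
struct RuntimeWasm;

impl TestNetwork {
    /// Spawn a relay chain plus parachains running the previous (N-1) runtime.
    fn spawn_with_runtime(_runtime: RuntimeWasm) -> Self { TestNetwork }
    /// Enact a runtime upgrade mid-session, like the 1.2.0 enactment did.
    fn enact_upgrade_mid_session(&mut self, _runtime: RuntimeWasm) {}
    /// Average parachain blocks per relay-chain block over a window.
    fn parachain_block_rate(&self) -> f64 { 1.0 }
}

fn runtime(_version: &str) -> RuntimeWasm { RuntimeWasm }

#[test]
fn parachains_keep_producing_across_runtime_upgrade() {
    let mut net = TestNetwork::spawn_with_runtime(runtime("1.1.x"));
    let baseline = net.parachain_block_rate();

    // The property under test: enacting runtime N mid-session must not
    // degrade parachain throughput before the next session boundary.
    net.enact_upgrade_mid_session(runtime("1.2.0"));
    assert!(net.parachain_block_rate() >= 0.9 * baseline);
}
```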

Why didn’t we immediately rollback to statement-distribution-v1 ?

Runtime upgrades cannot be rolled back, and the new subsystems were activated by the introduction of new runtime APIs.

How do we avoid this happening in the future?

  1. The easiest way to avoid this category of bugs in the future would be to have a step in both polkadot-sdk and the runtimes where we explicitly test that the transition from runtime version N-1 to runtime version N happens with zero impact on all properties of the network. An idea of how this could be implemented is here: Polkadot Doppelganger · Issue #4230 · paritytech/polkadot-sdk · GitHub

  2. Our node binary caches a lot of state at session boundaries, so a way to reduce the risk here would be to always schedule runtime upgrades just a few blocks before a new session begins (see the sketch after this list).

  3. Features that can be safely rolled back should always be enabled through a mechanism where we can disable them quickly if deemed necessary. Such a mechanism does not exist at the moment, since all configuration changes need to go through a whitelist and a public referendum; this item is here to consider the opportunity of building such a mechanism.
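For item 2, a minimal sketch of the scheduling arithmetic, assuming 6-second blocks and Polkadot's 4-hour (2400-block) sessions; the 10-block safety margin is an illustrative value, not an established policy:

```rust
// Hypothetical sketch of item 2: pick an enactment block that sits only a
// few blocks before the next session boundary, so any node-side state that
// is cached per session gets rebuilt almost immediately after the upgrade.

const SESSION_LENGTH: u64 = 2_400; // blocks per session (4h at 600 blocks/h)
const MARGIN: u64 = 10; // enact this many blocks before the session change

/// First block number at or after `not_before` that is `MARGIN` blocks
/// before a session boundary.
fn upgrade_enactment_block(not_before: u64) -> u64 {
    // Round `not_before + MARGIN` up to the next multiple of SESSION_LENGTH,
    // then step back by MARGIN.
    let next_boundary = (not_before + MARGIN).div_ceil(SESSION_LENGTH) * SESSION_LENGTH;
    next_boundary - MARGIN
}

fn main() {
    // E.g. if the upgrade is ready at block 20_000_123, enact it at
    // 20_001_590, i.e. 10 blocks before the session change at 20_001_600.
    let enact_at = upgrade_enactment_block(20_000_123);
    assert_eq!(enact_at, 20_001_590);
    assert_eq!((enact_at + MARGIN) % SESSION_LENGTH, 0);
}
```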

Timeline

  • [2024-04-21 10:27:48 UTC] Runtime v1.2.0 gets enacted and all parachains halt; the issue can be clearly noticed on the polkadot-introspector dashboards.

  • [2024-04-21 10:47:00 UTC] The parachain team gets notified on the Parachain core implementation team channel and starts investigating the issue.

  • [2024-04-21 11:17:00 UTC] The issue gets identified as a possible repeat of the Kusama incident. The team is not sure this is the problem, because that issue should already be fixed, so other avenues, such as storage migrations, are investigated as well.

  • [2024-04-21 11:30:00 UTC] The session changes and all parachains start producing blocks at the rate they were producing before; the issue is resolved.

  • [2024-04-21 11:36:00 UTC] We concur it was a repeat of the Kusama incident, and since enabling the async backing APIs is a one-off event, we expect the issue to be resolved for good and no further stalls are expected, but we still do not know why the fix for it did not work.

  • [2024-04-21 15:18:00 UTC] We find the root cause of why the fix did not work as expected (more details in the root-cause section above); no further impact is expected.

Time to detection

The issue was detected immediately by parachain teams; however, it took around 20 minutes until the signal arrived on the right channel and came to the attention of the core parachain development team. The team has alarms in place, but this issue was not detected by them. Time to detection could be further improved by having alarms based on polkadot-introspector that notify us of any unusual pattern in parachain block production.

Action items:

  • Add alarms for the parachains team covering parachain block production on Polkadot.
  • Implement an integrated, automated CI check for runtime upgrades for each of our networks.

Thanks a lot for this post mortem and the quick handling of the incident by the Parity team.

As feedback, I would like to ask whether, next time, Parity’s incident manager could communicate in an established channel with parachain teams.
A very high-level channel of communication would be enough, with a restricted number of participants (parachain teams), allowing us to spread the word to the user community and improve communication overall.

Right now, it seems that each user/builder/parachain team member goes to ask for or share information in many different channels, while we have Matrix groups like League of Parachains that would allow this close communication with everyone and would help reduce the surrounding noise.

Thanks a lot!

bLd
Astar Network
