Description
Immediately after runtime 1.2 got enacted on Polkadot, all parachains almost halted, with blocks only rarely being included into the relay chain. This happened because parachain candidates weren’t backed by the statement-distribution-v2 subsystem.
Statement-distribution-v2 is a new subsystem written for async backing. It is responsible for distributing backed statements between nodes, so that all nodes become aware that a parachain candidate can be included on the relay chain; more details about it here. The subsystem got enabled the moment the runtime ParachainHost API got bumped to a version that includes async_backing_params, in Polkadot runtime release 1.2.
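To make the activation mechanism concrete, below is a minimal sketch of how a node can gate the async-backing code path on the ParachainHost API version it observes for a given relay-chain leaf. The version constant and all type and function names are illustrative assumptions, not the exact polkadot-sdk code.

```rust
// Hypothetical sketch: the names and the version constant are assumptions for
// illustration, not the real polkadot-sdk implementation.

/// ParachainHost API version at which `async_backing_params` is assumed to exist.
const ASYNC_BACKING_API_VERSION: u32 = 7;

/// Parameters the runtime exposes once async backing is available.
struct AsyncBackingParams {
    max_candidate_depth: u32,
    allowed_ancestry_len: u32,
}

/// Which statement-distribution behaviour to use for a given relay-chain leaf.
enum ProspectiveParachainsMode {
    /// Legacy behaviour (statement-distribution v1 semantics).
    Disabled,
    /// Async backing active: the v2 topology and gossip rules apply.
    Enabled(AsyncBackingParams),
}

fn mode_for_leaf(
    api_version: u32,
    params: Option<AsyncBackingParams>,
) -> ProspectiveParachainsMode {
    match params {
        // The runtime advertises a new-enough API and returns the parameters:
        // the node switches to the async-backing (v2) path for this leaf.
        Some(p) if api_version >= ASYNC_BACKING_API_VERSION => {
            ProspectiveParachainsMode::Enabled(p)
        }
        // Older runtime: stay on the legacy path.
        _ => ProspectiveParachainsMode::Disabled,
    }
}
```

The important consequence for this incident is that the switch happens per relay-chain leaf: the moment the upgraded runtime is enacted, new leaves report the new API version and the v2 code path becomes active immediately, even in the middle of a session.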
Statement-distribution-v2 uses the network topology to know how to gossip backed statements. The network topology gets updated at the beginning of every session. When the subsystem received the new topology, it concluded that it did not have to save it, because async backing was not enabled on any leaf at that point. Hence, when the runtime got upgraded mid-session, the subsystem had no topology to use for distributing the backed statements, and all parachains stalled.
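A simplified sketch of the problematic pattern follows; the type and function names are illustrative placeholders, not the actual subsystem code.

```rust
// Hypothetical sketch, not the actual subsystem code: it illustrates why dropping
// the session topology based on the *current* async-backing status is fragile.

/// Placeholder for the per-session gossip topology used to route backed statements.
struct SessionGridTopology;

struct State {
    topology: Option<SessionGridTopology>,
}

impl State {
    /// Buggy pattern: the topology is only retained if async backing is already
    /// enabled on some active leaf. If the runtime upgrade enables async backing
    /// mid-session, `topology` stays `None` and backed statements cannot be
    /// gossiped until the next session delivers a fresh topology.
    fn handle_new_topology_buggy(
        &mut self,
        topology: SessionGridTopology,
        async_backing_enabled_on_any_leaf: bool,
    ) {
        if async_backing_enabled_on_any_leaf {
            self.topology = Some(topology);
        }
        // else: the topology is silently discarded.
    }

    /// Safer pattern: always retain the latest session topology, regardless of the
    /// current async-backing status, so a mid-session enablement still has a
    /// topology to distribute backed statements with.
    fn handle_new_topology(&mut self, topology: SessionGridTopology) {
        self.topology = Some(topology);
    }
}
```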
How was it fixed
The network self-healed at the next session, when it received a new topology; no further impact is expected.
Root-cause of the incident
Why did this issue exist in the first place?
Async backing is a huge update to the network, and upgrading the network with minimal downtime proved to be challenging since both the old protocols and the new ones had to exist at the same time. Given the surface of the update, this bug snuck in.
Why wasn’t this caught on our test networks?
This bug was actually caught when async backing was enabled on Kusama, and at the time a fix deemed to be correct was provided.
Why did the issue repeat on Polkadot?
The fix was not correct. The issue was correctly root-caused and the provided fix looked sane; however, a mistake in the author’s process prevented the fix from achieving its goal.
This happened because, at that moment, the only runtime without async backing was Polkadot, so the developer ran a hacked experiment to confirm what had happened on Kusama and to confirm that the fix solved the issue, and then proceeded to capture the invariants of the scenario in a unit test. However, a late cleanup addition to the pull request made the fix incorrect, and the newly added unit test did not catch it. This could have been avoided if the developer had re-run the hacked experiment after the late addition; more details here.
Why didn’t any of our tests catch this issue before runtime deployment?
All of the polkadot-sdk tests use the latest runtime version, so if there are subtle interactions between the node and the runtime at the moment of the transition, our tests will miss them.
Why didn’t we immediately roll back to statement-distribution-v1?
Runtime upgrades cannot be rolled back, and the new subsystems were activated by the introduction of new runtime APIs.
How do we avoid this happening in the future?
- The easiest way to avoid this category of bugs in the future would be to have a step in both polkadot-sdk and the runtimes where we explicitly test that the transition from runtime version N-1 to runtime version N happens with zero impact on all properties of the network (a rough sketch of what such a check could look like follows this list). An idea of how this could be implemented is here: Polkadot Doppelganger · Issue #4230 · paritytech/polkadot-sdk · GitHub
- Our node binary caches a lot of state on session boundaries, so a way to reduce the risk here would be to always schedule runtime upgrades just a few blocks before a new session begins.
- Features that can be safely rolled back should always be enabled through a mechanism where we can disable them quickly if deemed necessary. Such a mechanism does not exist at the moment, since all configuration changes need to go through a whitelist and a public referendum; this item is here to consider the opportunity of building such a mechanism.
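As referenced in the first item above, here is a rough skeleton of what an N-1 to N transition check could look like. Every helper and type below is an assumed stub that a real harness (for example a zombienet-based setup) would have to provide; none of these names exist in polkadot-sdk as written here.

```rust
// Hypothetical skeleton of a runtime-transition check; all helpers are assumed
// stubs for a real network-test harness, not existing polkadot-sdk APIs.

struct Network; // handle to a local test network (relay chain + parachains)

enum RuntimeVersion {
    DeployedNMinus1, // the runtime currently live on the network
    CandidateN,      // the runtime about to be deployed
}

fn spawn_network(_runtime: RuntimeVersion) -> Network {
    unimplemented!("start validators and a few collators, wait for block production")
}

fn parachain_block_rate(_net: &Network, _sessions: u32) -> f64 {
    unimplemented!("average parachain blocks included per relay-chain block")
}

fn enact_runtime_upgrade_mid_session(_net: &mut Network, _runtime: RuntimeVersion) {
    unimplemented!("authorize and apply the upgrade at an arbitrary point in a session")
}

#[test]
#[ignore] // requires the (hypothetical) network harness above
fn runtime_transition_has_zero_impact_on_parachains() {
    // Start from the runtime that is currently deployed on the live network (N-1).
    let mut network = spawn_network(RuntimeVersion::DeployedNMinus1);

    // Baseline parachain block production over a couple of sessions.
    let baseline = parachain_block_rate(&network, 2);

    // Enact the candidate runtime (N) mid-session, mirroring how the incident
    // unfolded on Polkadot.
    enact_runtime_upgrade_mid_session(&mut network, RuntimeVersion::CandidateN);

    // Throughput across and after the upgrade must not regress.
    let after = parachain_block_rate(&network, 2);
    assert!(
        after >= 0.95 * baseline,
        "parachain block production regressed across the N-1 -> N transition"
    );
}
```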
Timeline
- [2024-04-21 10:27:48 UTC] Runtime v1.2.0 gets enacted and all parachains halt; the issue can be clearly noticed on the polkadot-introspector dashboards.
- [2024-04-21 10:47:00 UTC] The parachain team gets notified on the channel Parachain core implementation team and starts investigating the issue.
- [2024-04-21 11:17:00 UTC] The issue gets identified as a possible repeat of the Kusama incident. The team is not sure this is the problem, because that issue should already have been fixed, so other avenues, such as storage migrations, are investigated as well.
- [2024-04-21 11:30:00 UTC] A new session begins and all parachains start producing blocks at the rate they were producing before; the stall is resolved.
- [2024-04-21 11:36:00 UTC] We concur that this was the Kusama incident and, since the enablement of the async backing APIs is a one-off, we expect the issue to be resolved for good and no further stalls are expected, but we still don’t know why the fix for it did not work.
- [2024-04-21 15:18:00 UTC] We find the root cause of why the fix did not work as expected (more details in the root-cause section above); no further impact is expected.
Time to detection
The issue was detected immediately by parachain teams; however, it took around 20 minutes until the signal arrived on the right channel and came to the attention of the core parachain development team. The team has alarms in place, but they did not detect this incident. The time to detection could be further improved by having alarms based on polkadot-introspector that notify us of any unusual pattern in parachain block production.
Action items:
- Add alarms for the parachain team for parachain block production on Polkadot.
- Implement an integrated, automated CI check for runtime upgrades for each of our networks.