[2024-09-17] Polkadot finality lag/slow parachain production immediately after runtime upgrade - post mortem

Description

Immediately after the runtime was upgraded to 1.3.0 at block 22,572,435, block finalization started lagging by up to ~70 blocks and parachain block production became very slow or halted entirely.

How it was fixed

Relay chain finality

Most of the Polkadot validators crashed and restarted. When enough of them came back online, finality started catching up, roughly ~10 minutes after the start of the incident.

Parachain block production

Parachain collators got stuck and all needed to be restarted manually to resume block production.

Root-cause of the incident

On validators the candidate-backing subsystem was marked as stalled by the overseer and the node was restarted:

Overseer exited with error err=Generated(SubsystemStalled("candidate-backing-subsystem", "signal", "polkadot_node_subsystem_types::OverseerSignal"))

Why was the subsystem stalling?

Because it was stuck waiting for the result of the runtime-api function claim_queue.

Sep 17 08:04:13 dot1 taskset[688103]: 2024-09-17 08:04:13 Task "polkadot-runtime-api-request" (Group: runtime-api) was still running with 3 instances after waiting 60 seconds to finish.

Why was the claim_queue runtime API stuck/really slow?

Because at the block where the new runtime was enacted it read a corrupted version of HostConfiguration containing a lot of garbage values, which most likely made the loops in this function run for an extremely long time.
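
To make the failure mode concrete, here is a hypothetical sketch of a claim-queue-style loop whose bounds come from configuration; the names and shape are assumptions, not the actual claim_queue implementation.

    // Hypothetical claim-queue-style loop whose bounds come from HostConfiguration.
    // If `lookahead` decodes to a garbage value (e.g. millions instead of 2-3),
    // the nested loop runs effectively forever and the runtime API request never
    // returns within the 60 seconds the node is willing to wait.
    fn build_claim_queue(num_cores: u32, lookahead: u32) -> Vec<(u32, u32)> {
        let mut queue = Vec::new();
        for core in 0..num_cores {
            for depth in 0..lookahead {
                // (core index, position in that core's claim queue)
                queue.push((core, depth));
            }
        }
        queue
    }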

Why was the HostConfiguration corrupted?

The runtime upgrade contained a migration of the host configuration, parachains_configuration::migration::v12::MigrateToV12, in which some fields were moved around. Because of a known limitation (see item 2 below), the migration is applied only at the end of the block, so runtime API calls executed at that block use the new code and field layout, but against storage that is still in the old format. Hence HostConfiguration was decoded as corrupted and contained garbage values for the new fields.
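
A minimal sketch of the corruption mechanism, using made-up config types rather than the real HostConfiguration: SCALE decoding is purely positional, so old-format bytes decoded with the new field layout land in the wrong fields.

    // Made-up types, not the real HostConfiguration; requires parity-scale-codec
    // with the "derive" feature.
    use parity_scale_codec::{Decode, Encode};

    // Layout before the migration: on-chain storage still holds this encoding.
    #[derive(Encode)]
    struct ConfigV11 {
        max_code_size: u32,
        lookahead: u32,
        on_demand_queue_max_size: u32,
    }

    // Layout after the upgrade: fields were moved around, but the migration that
    // rewrites storage has not run yet at the block where the new code is enacted.
    #[derive(Decode, Debug)]
    struct ConfigV12 {
        lookahead: u32,
        on_demand_queue_max_size: u32,
        max_code_size: u32,
    }

    fn main() {
        let old_bytes = ConfigV11 {
            max_code_size: 3_145_728,
            lookahead: 3,
            on_demand_queue_max_size: 10_000,
        }
        .encode();

        // The new code decodes the old bytes positionally: every field ends up with
        // another field's value, e.g. lookahead becomes 3_145_728.
        let misread = ConfigV12::decode(&mut &old_bytes[..])
            .expect("same byte length, so decoding succeeds");
        println!("{misread:?}");
    }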

More technical details can also be found in [Root cause] Finality lag and slow parachain block production immediately after runtime upgrade · Issue #5738 · paritytech/polkadot-sdk · GitHub.

How do we avoid this happening in the future?

  1. Until the issue in 2) gets fixed, we need to use defensive coding like here to work around its limitations (a minimal sketch of this idea follows this list).
  2. Prioritize a solution for After runtime upgrade, Runtime API calls use new code with unmigrated storage · Issue #64 · paritytech/polkadot-sdk · GitHub, or make it really hard to trip over this limitation by adding compile-time/runtime checks.
  3. Better monitoring and sanity checking of pallets with migrations on test networks. The corruption happened on the test network as well, but we were lucky/unlucky and the network moved on, because the corruption did not cause the nodes to crash. However, we could have noticed the garbage values if we had called the runtime APIs at the block where the runtime was enacted. These checks can be automated using chopsticks.
  4. An E2E sanity-check test: spawn a local network with the old runtime, upgrade it to the new runtime version, call all the runtime APIs the node calls, and check that they return sane values.
  5. The generic solution for avoiding this category of bugs in the future would be to have a step, in both polkadot-sdk and the runtimes, where we explicitly test that the transition from runtime version N-1 to runtime version N happens with zero impact on all properties of the network. Polkadot Doppelganger is the proposed solution for that, although it is not a guarantee that it will always catch this type of corruption.
  6. Make the node code a bit more robust, so it does not fail hard if an API hangs; for example, in this case it could have timed out, worked with an empty claim queue for this block, and moved on to the next relay-chain block (see the second sketch after this list).
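
For item 1, a minimal sketch of what such defensive coding could look like, with a hypothetical bound and field name rather than the actual polkadot-sdk code:

    // Clamp any configuration value read from storage to a sane upper bound before
    // it drives loops in a runtime API, so a misdecoded (garbage) field cannot make
    // the call run for minutes.
    const MAX_SANE_LOOKAHEAD: u32 = 16; // illustrative bound, not a real constant

    fn sane_lookahead(raw_from_storage: u32) -> u32 {
        // A value beyond the bound is almost certainly a decoding artifact;
        // clamp it rather than trusting it.
        raw_from_storage.min(MAX_SANE_LOOKAHEAD)
    }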
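
For item 6, a rough node-side sketch of the timeout-and-fall-back idea; the types, function name, and two-second budget are assumptions for illustration, not the actual subsystem API:

    // Wrap the runtime API request in a timeout and fall back to an empty claim
    // queue for this relay-chain block instead of letting the subsystem stall.
    use std::{collections::BTreeMap, future::Future, time::Duration};

    type CoreIndex = u32; // simplified stand-ins for the real polkadot primitives
    type ParaId = u32;

    async fn claim_queue_or_empty<F>(request: F) -> BTreeMap<CoreIndex, Vec<ParaId>>
    where
        F: Future<Output = BTreeMap<CoreIndex, Vec<ParaId>>>,
    {
        match tokio::time::timeout(Duration::from_secs(2), request).await {
            Ok(queue) => queue,
            // The runtime API hung: keep the node alive, use an empty queue for
            // this block and move on to the next relay-chain block.
            Err(_) => BTreeMap::new(),
        }
    }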

Timeline

  • [2024-09-17 07:57:48] Runtime v1.3.0 gets enacted
  • [2024-09-17 08:03:xx] Most of the validators restart because of the SubsystemStalled error
  • [2024-09-17 08:12:00] Finality catches up; parachains are still producing blocks very slowly
  • [2024-09-17 08:30:00] Parachains get notified to restart their collator and start recovering

[Screenshot: The finality lag dashboard]

[Screenshot: Average parachain block production]

Action items:

  • Validators and collators that have not auto-restarted need to be restarted manually to recover. For example, there are still 11 (3%) Polkadot validators that did not recover; they can be found here: Apps by Turboflakes. Because they are not producing any blocks, this will affect relay-chain block times.
  • Prioritize the items 1 to 5 presented above.

Open questions

  • This incident required action from multiple stakeholders (validators and parachains); was everyone informed in a timely manner that they needed to recover their validators/collators?

Shout-out to @alindima, @andronik, @bkchr, @sandreim, @tsvetomir for being of tremendous help during this investigation.
