Partial Polkadot Parachains stall on runtime upgrade 2.1.1 - postmortem

Post-Mortem: Partial Polkadot Parachain Stalls following Runtime Upgrade 2.1.1

Incident Date: 23 March 2026

Summary

On 23 March 2026, immediately following the enactment of the Polkadot 2.1.1 runtime upgrade, a small subset of the parachains experienced a stall in block production. The root cause was identified as a decoding failure in a deprecated Runtime API.

While usage of the API had been removed from collator nodes in June 2024 and the API had been marked as deprecated in April 2025, a change in its returned data structure in December 2025 broke backwards compatibility for collator nodes built before June 2024. These collators stopped sending blocks to relay chain validators, triggering a stall or block time degradation for parachains whose collators were all or partly unupgraded.

The issue was an implementation oversight whose impact would have been avoided by having more recent collator nodes, which is also why only a subset of parachains were affected. The issue would have been caught by an automated test that had been disabled as flaky.

Impact

  • Parachains: A small subset of parachains running legacy node software stopped producing blocks. Parachains running releases newer than stable2407 did not experience any problems. According to our data, 5 out of the 37 active parachains were affected.
  • Relay Chain: No direct impact; the Polkadot Relay Chain continued to function and finalize blocks normally.
  • Users: Users of affected parachains experienced a service outage until collators were updated or patched.

Timeline (All times in UTC, format YYYY-MM-DD HH:MM)

  • 2024-06-12: Polkadot-sdk v1.13.0 is released - the first release that removed the usage of the offending runtime API from collator nodes (PR #4471)
  • 2024-07-29: The first polkadot-sdk stable release, stable2407, is published. With it, the new release process is established: a 3-month cadence for new stable releases and a 1-year support window for all stable releases. During the support window, stable releases are updated with patch releases for bugfixes and security fixes. stable2407 also includes PR #4471.
  • 2025-04-08: Polkadot-sdk stable2503 is published, deprecating the offending backing_state runtime API and replacing its node-side usage with a new, more comprehensive one, for the purposes of fixing a security issue (PR #6867). This release does not break compatibility.
  • 2025-12-22: Polkadot-sdk stable2512 is published. This is the first release that includes #9443, and therefore the first to break backwards compatibility for all validator and collator nodes older than stable2503 and for collators older than the v1.13.0 release of 12 June 2024.
  • 2026-03-13: Fellowship runtime release 2.1.1 is published.
  • 2026-03-23 12:37:48: Polkadot relay chain runtime is upgraded to the 2.1.1 fellowship release, breaking compatibility of unupgraded nodes. Affected parachains stall.
  • 2026-03-23 13:38: First reports of stalled parachains come in via private messages from the affected teams. Parity makes first contact within two minutes of receiving the message. No alarms were triggered on Parity's side by this incident, so developers were notified of the issue only via private and public messaging.
  • 2026-03-23 13:58: Parachains team at Parity is notified and investigation begins.
  • 2026-03-23 15:10: Root cause is identified by Parity engineers and communicated to the teams that had made contact. A hotfix is supplied immediately, in the form of a proposal to cherry-pick PR #4471 on top of their polkadot-sdk forks. We provided technical support to the teams that reached out until the fix was confirmed to be working.
  • 2026-03-23 19:13: First blocks are produced by collators upgraded to the hotfix. We receive reports that more collators are starting to upgrade and we see the situation improving, confirming that the fix is correct.
  • 2026-03-23 21:12: After getting confirmation that the fix is working, we post updates on public-facing channels, such as the Technical Fellowship public channel, the Polkadot Developer Support channel and GitHub, with a summary of the root cause and fix.

Root Cause Analysis

Why did some parachains stall?

The incident was caused by a break in the SCALE codec compatibility of the return value of a deprecated runtime API. This was evident from the logs gathered from affected parachain collators:

DEBUG [Relaychain] Encountered issue during run iteration:
RuntimeApi(Execution { 
    runtime_api_name: "para_backing_state", 
    source: FailedToDecodeReturnValue { 
        function: "ParachainHost_para_backing_state", 
        error: "Could not decode Option::Some(T)" 
    } 
})

The collator was still able to author blocks, but because of this error the blocks were not sent to the relay chain validators, causing the parachain to stall.

Why did the API break?

TL;DR: It was an accident.

Introduction of the new runtime API (PR #6867): For a security fix, we added a new runtime API (backing_constraints) that returns the same data as the previous one (backing_state), plus one extra field. The old API was deprecated and its usage was removed in the same PR. This maintained compatibility for old nodes.

The Breaking Cleanup (PR #9443): As part of the standard development cleanup procedure, primitive types used by new, staging runtime APIs are moved into the main primitive module once they have been released on all production networks. This time, however, the legacy data structure still in use by the deprecated API was removed by accident as part of the regular cleanup. The deprecated API remained, but it now pointed to the new structure. Because the new structure contained an additional field, the encoding of the response changed, making old nodes unable to query the old runtime API on new runtimes built from this PR onwards.
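To illustrate why one additional field is enough to break old nodes, here is a minimal sketch in Python. It is a toy stand-in for SCALE, not the real codec: it only mimics the properties that matter here (an Option is prefixed with a 0x00/0x01 tag, integers are little-endian, and fields are concatenated with no names or tags). The field names `limit`, `extra` and `pending`, their values, the position of the extra field and the fixed-width length prefix are all assumptions made for illustration; the real structures are larger and real SCALE uses a compact length prefix.

```python
import struct


class DecodeError(Exception):
    """Raised when the bytes do not match the layout the decoder expects."""


class Reader:
    """Cursor over a byte string, consumed strictly in field order."""

    def __init__(self, data: bytes) -> None:
        self.data = data
        self.pos = 0

    def take(self, n: int) -> bytes:
        if self.pos + n > len(self.data):
            raise DecodeError("ran out of bytes")
        chunk = self.data[self.pos:self.pos + n]
        self.pos += n
        return chunk

    def u32(self) -> int:
        return struct.unpack("<I", self.take(4))[0]


def u32(value: int) -> bytes:
    return struct.pack("<I", value)


def encode_old(limit: int, pending: bytes) -> bytes:
    # Option::Some tag, then the fields back to back: limit, len(pending), pending.
    return b"\x01" + u32(limit) + u32(len(pending)) + pending


def encode_new(limit: int, extra: int, pending: bytes) -> bytes:
    # Same layout, but with one additional (hypothetical) field after `limit`.
    return b"\x01" + u32(limit) + u32(extra) + u32(len(pending)) + pending


def decode_old(payload: bytes):
    """Decode with the layout an unupgraded node was compiled against."""
    reader = Reader(payload)
    tag = reader.take(1)
    if tag == b"\x00":
        return None
    if tag != b"\x01":
        raise DecodeError("invalid Option tag")
    limit = reader.u32()
    length = reader.u32()  # on a new payload this actually reads `extra`
    pending = reader.take(length)
    return limit, pending


def old_node_can_decode(payload: bytes) -> bool:
    try:
        decode_old(payload)
        return True
    except DecodeError:
        return False


old_payload = encode_old(limit=5_000_000, pending=b"\x01\x02")
new_payload = encode_new(limit=5_000_000, extra=1_000, pending=b"\x01\x02")
```

In this sketch `old_node_can_decode(old_payload)` succeeds, while `old_node_can_decode(new_payload)` fails because the old decoder interprets the extra field as a length and runs out of bytes. With a small `extra` value the toy decoder would not fail at all but would silently return wrong data, which is why a format without field tags depends on both sides agreeing on the exact layout.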

Why wasn’t the bug caught during review or testing?

This was an oversight during development and review, made more likely by the amount of code bundled in the same PR. Large cleanups are always harder to review and make it easier for subtle bugs to go unnoticed.

A similar issue was detected by a developer on another runtime API that was broken by the same PR. It was fixed after the merge, before the bug was released.

It was not caught during automated testing. There was one backwards-compatibility test that, if enabled, would have caught this. This test was disabled in a batch together with other flaky tests, and disabling it hid this exact issue. This was also an oversight, made more likely by the number of flaky CI tests in polkadot-sdk that had been disabled.

However, it was only by chance that this disabled test was using a release this old, as there is not a clear backwards-compatibility testing strategy currently in place.

It was also not caught during deployments to testnets. We did not receive similar reports when applying the runtime change on testnets, most likely because the nodes running there are not as old and were no longer using the deprecated runtime APIs.

Why were only a subset of parachains affected?

TL;DR: Because there are very few parachains running nodes older than version 1.13.

The specific conditions required to trigger this problem made the incident highly improbable in a production environment. No polkadot-sdk release within the official one-year support window was affected. Collator nodes less than 19 months old and relay chain validators less than 11 months old remained fully functional.

However, the 1-year support period was explicitly stated in the release procedure to be for bugfixes and security fixes. It was not expected that an unsupported release would cause compatibility issues, regardless of how old it was.

Why weren’t Parity engineers notified sooner?

The on-call alerting system did not have a correctly configured alert that would trigger in this scenario.
It did include an alert for degradation of the average parachain block time, but that alert had been disabled a few months earlier due to false positives caused by a couple of parachains that were degraded for other reasons.
In the end, it took about an hour from the runtime upgrade to the first message from a parachain team notifying us about the problem.

Resolution and Recovery

Mitigation

Affected parachain teams were advised to take a two-step approach:

  1. Patch: Teams were advised to backport PR #4471 to their polkadot-sdk forks, removing the call to the affected runtime API for collator nodes entirely. Alternatively, for the scenario where a backport of #4471 would be too complex due to merge conflicts on outdated code, we provided a simpler patch.
  2. Upgrade: Teams were advised to move collator nodes to a supported Long-Term Support (LTS) stable release and to set up a process for regularly upgrading their nodes (using the omni-node as the preferred approach, which should make it a seamless experience).

After the patch was applied, all collators of a parachain had to be upgraded before its expected block time fully recovered, as the block time recovery scaled with the number of upgraded collators.
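The gradual recovery can be approximated with a small back-of-the-envelope model. The sketch below assumes a round-robin, slot-based authoring scheme in which every collator gets an equal share of slots and slots assigned to unupgraded collators yield no block on the relay chain. Both the scheme and the 6-second slot time in the usage example are assumptions for illustration, not measurements from the incident.

```python
def expected_block_time(slot_seconds: float, total_collators: int, upgraded: int) -> float:
    """Average time between parachain blocks under the round-robin assumption.

    Only `upgraded` out of `total_collators` slots produce a block, so the
    average interval grows by the factor total_collators / upgraded.
    """
    if total_collators <= 0:
        raise ValueError("total_collators must be positive")
    if not 0 <= upgraded <= total_collators:
        raise ValueError("upgraded must be between 0 and total_collators")
    if upgraded == 0:
        return float("inf")  # no collator can submit blocks: the parachain is stalled
    return slot_seconds * total_collators / upgraded


# Hypothetical parachain with 4 collators and a 6-second slot time.
recovery_curve = [expected_block_time(6.0, 4, k) for k in range(5)]
```

Under these assumptions the curve goes from a full stall with no upgraded collators, through 24, 12 and 8 seconds, to the nominal 6 seconds only once all four collators are upgraded.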

Action items

Fix and re-enable flaky CI tests

Existing CI tests need to be stabilised and re-enabled, to ensure regressions are caught early and fixed before being merged. Increase the stability of the CI runners to minimise false negatives.

Backwards compatibility testing

Set up an automated process for testing the compatibility of runtimes with a set of old releases on some common criteria that ensure compatibility of the relay chain and parachains with the new runtimes.
Moreover, set up a testnet QA pipeline that picks up polkadot-sdk runtime releases and monitors common quality criteria of the network.

Define better alerts for on-call monitoring

The existing alert for parachain block time had been paused due to false positives, in favor of only monitoring a couple of particular parachains. Make the alert more resilient, so that on-call engineers are notified when there is a real issue with parachain degradation.
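One possible way to make such an alert less sensitive to chronically slow parachains is to compare each parachain against its own baseline rather than against a global average. The following is a minimal sketch of that idea only, not a description of Parity's monitoring stack: the function names, the input shape (seconds between blocks per parachain, or seconds since the last block for a stalled one), the factor of 3 and the minimum count are all hypothetical.

```python
from typing import Dict, List


def newly_degraded(
    baseline: Dict[str, float],
    recent: Dict[str, float],
    factor: float = 3.0,
) -> List[str]:
    """Return parachains whose recent block interval far exceeds their own baseline.

    baseline: typical seconds between blocks per parachain, from an earlier window.
    recent:   seconds between blocks in the latest window, or seconds since the
              last block if the parachain has stalled.
    """
    return sorted(
        para
        for para, interval in recent.items()
        if para in baseline and interval > factor * baseline[para]
    )


def should_alert(
    baseline: Dict[str, float],
    recent: Dict[str, float],
    factor: float = 3.0,
    min_parachains: int = 1,
) -> bool:
    """Fire only when enough parachains degrade relative to their own history."""
    return len(newly_degraded(baseline, recent, factor)) >= min_parachains


# Hypothetical data: "C" was already slow before the upgrade, "B" stalls after it.
baseline = {"A": 6.0, "B": 6.0, "C": 30.0}
recent = {"A": 6.0, "B": 600.0, "C": 32.0}
flagged = newly_degraded(baseline, recent)
```

Here only "B" is flagged: "C" is slow but no slower than its own history, so it does not produce the kind of false positive that led to the original alert being paused.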

Other learnings

These are some lessons learned that would not have prevented the issue from being released, but would have mitigated or limited its impact.

Setting up a deprecation and compatibility policy

The release process of polkadot-sdk does not address the deprecation or compatibility of old releases; it only covers the backporting of bugfixes and security fixes to stable releases still within the support window.

A predictable compatibility policy is needed, so that node developers and operators have clear visibility into what releases are still guaranteed to be working and compatible.
Security fixes should be the only reason why this compatibility policy would be broken.
Otherwise, clearing technical debt or implementing breaking features would only happen according to the compatibility policy.

Upgrading parachain collators

This incident also sheds light on a different issue: some parachain nodes are still running legacy, unsupported code.
Parachain teams are strongly encouraged to adopt the Omni-node to make these transitions seamless and to pick up new releases frequently. The new release process and support calendar also guarantee that any stable release will be supported for 1 year, bringing down the maintenance cost even further.
