If something can go wrong it will go wrong at some point. It’s important to have processes in place to ensure that the likelihood of mistakes is reduced to a minimum. But even the best and most thorough processes can’t ensure that mistakes and errors never happen. So there always has to be a recovery path in case of errors.
For parachains, one of the most severe errors that can happen is when the relay chain has stored a different state transition function (a.k.a. the Wasm code) than the parachain. The relay chain stores the Wasm code for all parachains and parathreads that are registered. If the registered Wasm code for a parachain is different from the Wasm code that is stored on the parachain itself, blocks won’t be accepted by the relay chain.
In practice there are very few circumstances where this could happen. But it did happen to us. In the following text I will document the steps we took to recover from this situation, so that others who find themselves in a similar situation can benefit from what we learned.
The Problem
Why is it problematic when the wrong Wasm code is registered on the relay chain? The parachain collators and the relay chain validators have to agree on how input is interpreted. For example, if collators run a newer version of the democracy pallet, the relay chain validators might not understand the new logic inside that pallet and reject the block, since it’s invalid from their point of view.
Another problem is that the runtime version could differ between the Wasm code on the parachain and the one on the relay chain. Signed extrinsics are bound to a specific runtime version. If extrinsics were valid for multiple versions, they could have different and unintended outcomes. For example, in the old version voting didn’t lock any tokens; in the new version voting locks funds. When signing the extrinsic with the old version in mind, you didn’t agree to any locking. To ensure that an extrinsic is bound to a specific logic, the runtime version is included when signing an extrinsic.
If the runtime version on the parachain is different from the version the relay chain has registered for that parachain, extrinsics are only valid for one of the two. So even if blocks are built, signed extrinsics wouldn’t be possible since they cannot be verified by both the relay chain and the parachain.
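To make the binding concrete, here is a minimal sketch of how the bytes that get signed depend on the runtime version. It is a simplification and not taken from our tooling: the call and extra data are opaque placeholder bytes, the hashes are all zeros, and the field layout (spec version, transaction version, genesis hash, block hash appended as implicit data) reflects my understanding of a typical Substrate chain.

```python
import hashlib
import struct


def signing_payload(call: bytes, extra: bytes, spec_version: int,
                    tx_version: int, genesis_hash: bytes,
                    block_hash: bytes) -> bytes:
    """Build the bytes to sign: call, extra data, then the implicit data."""
    implicit = (
        struct.pack("<I", spec_version)  # u32, little-endian
        + struct.pack("<I", tx_version)  # u32, little-endian
        + genesis_hash
        + block_hash
    )
    payload = call + extra + implicit
    # Assumption: payloads longer than 256 bytes are hashed before signing.
    if len(payload) > 256:
        payload = hashlib.blake2b(payload, digest_size=32).digest()
    return payload


call = bytes.fromhex("0a00")     # placeholder call bytes
extra = bytes.fromhex("000000")  # placeholder era / nonce / tip
zero_hash = bytes(32)            # placeholder genesis and block hash

old = signing_payload(call, extra, 10101, 1, zero_hash, zero_hash)
new = signing_payload(call, extra, 10730, 1, zero_hash, zero_hash)
print(old != new)  # True
```

Because the version is part of the signed bytes, a signature made for one version can never verify against the payload reconstructed with the other.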
The Solution
Once there is a mismatch between the registered Wasm on the relay chain and the Wasm on the parachain, the possibilities for fixing this are rather slim. One option is to force a Wasm change on the relay chain side. This can be achieved by calling paras.forceSetCurrentCode(paraID, WASM). This call requires a root origin. Thus, any Kusama/Polkadot parachain would need to go through the governance process, which could take weeks or even months.
For many projects such a downtime is not acceptable. Luckily there is a second option to bring the Wasm code back in sync. If the Wasm can’t be changed on the relay chain side, the only option is to change it on the parachain. Triggering a runtime upgrade is not possible, since this would require building blocks. But there is another way to change the Wasm on the parachain side.
The path to fixing the Wasm code goes as follows:
- Overwrite the Wasm that should be used by the collators with the one that is registered on the relay chain.
- Depending on the Wasm that is stored on the relay chain, it might be necessary to adjust the collator to be compatible with the old Wasm code.
- Build blocks using the old Wasm.
- Sign extrinsics with the runtime version of the old Wasm, so that they get accepted by both the relay chain and the parachain.
- Trigger a runtime upgrade which will bring the Wasm code back in sync.
In the following sections I will describe how this works in detail.
Overwrite the on-chain Wasm
Overwriting the Wasm is relatively easy. The chain specification has a special field for exactly that purpose. It’s called codeSubstitutes. This field contains an object that maps from block number to substitute Wasm. The collator will use this Wasm code starting from the block number that you specified until there is a Wasm code on the parachain with a higher version number.
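As a rough illustration, a substitute could be added to a JSON chain specification like this. The helper name, the example chain spec and the block number are hypothetical, and the sketch assumes that the block number is used as a string key and that the Wasm is stored as a 0x-prefixed hex string.

```python
import json


def add_code_substitute(chain_spec: dict, from_block: int, wasm: bytes) -> dict:
    """Return a copy of the chain spec with a substitute Wasm for `from_block`."""
    spec = dict(chain_spec)
    substitutes = dict(spec.get("codeSubstitutes", {}))
    # Assumption: keys are block numbers as strings, values are hex-encoded Wasm.
    substitutes[str(from_block)] = "0x" + wasm.hex()
    spec["codeSubstitutes"] = substitutes
    return spec


# Hypothetical chain spec and a tiny placeholder instead of a real runtime blob.
spec = {"name": "Example Parachain", "id": "example"}
placeholder_wasm = b"\x00asm\x01\x00\x00\x00"

patched = add_code_substitute(spec, 1_000_000, placeholder_wasm)
print(json.dumps(patched, indent=2))
```

In a real recovery the bytes would be read from the Wasm file that is registered on the relay chain, and the patched specification would be written back to disk before restarting the collators.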
Say the Wasm code on the relay chain has the version number 10101 while on the parachain it is at 10730. Simply setting the Wasm code from the relay chain as a substitute cannot work since its version is too low. Luckily, the version can be overwritten without changing the Wasm code! We wrote a tool just for this purpose: given the Wasm code with the old version, it generates the same Wasm with the new version annotation. It’s important not to change the runtime API version, as this dictates how your collator communicates with the Wasm code. If the runtime API version doesn’t match the implemented runtime API, you can’t run the Wasm correctly. The next section explains how we dealt with such a runtime API mismatch.
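To give an idea of where such a version annotation can live, the sketch below locates a named custom section in an uncompressed Wasm module. This is not our tool. It assumes, to the best of my knowledge, that Substrate runtimes carry their encoded version in a custom section called runtime_version. The demo module and its 4-byte payload are placeholders, not a real encoded runtime version.

```python
import struct
from typing import Optional, Tuple


def read_leb128(data: bytes, pos: int) -> Tuple[int, int]:
    """Decode an unsigned LEB128 integer; return (value, next position)."""
    value, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return value, pos
        shift += 7


def find_custom_section(wasm: bytes, name: str) -> Optional[bytes]:
    """Return the payload of the first custom section called `name`, if any."""
    if wasm[:4] != b"\x00asm":
        raise ValueError("not an uncompressed Wasm module")
    pos = 8  # skip the 4-byte magic and the 4-byte format version
    while pos < len(wasm):
        section_id = wasm[pos]
        size, body_start = read_leb128(wasm, pos + 1)
        body_end = body_start + size
        if section_id == 0:  # id 0 marks a custom section
            name_len, name_start = read_leb128(wasm, body_start)
            name_end = name_start + name_len
            if wasm[name_start:name_end].decode("utf-8") == name:
                return wasm[name_end:body_end]
        pos = body_end
    return None


def custom_section(name: str, payload: bytes) -> bytes:
    """Build a small custom section for the demo (all lengths below 128)."""
    body = bytes([len(name)]) + name.encode("utf-8") + payload
    return bytes([0, len(body)]) + body


# Placeholder module: header plus two custom sections.
module = (
    b"\x00asm\x01\x00\x00\x00"
    + custom_section("other", b"x")
    + custom_section("runtime_version", struct.pack("<I", 10101))
)
print(find_custom_section(module, "runtime_version").hex())  # 75270000
```

Rewriting the version would then mean replacing that section's payload and fixing up the section size while leaving the code sections untouched, which is why the logic inside the Wasm keeps its old version.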
Backwards compatibility
After a collator builds a block, it submits this block together with the collation information. This collation information is collected by calling a runtime API. In our case, a breaking change was introduced to this runtime API in between our old and new Wasm code. The old Wasm code that was registered on the relay chain used the old runtime API, while the new Wasm code on the parachain used the new runtime API. This wouldn’t normally be an issue, since the runtime API is versioned and the client code is written to be backwards compatible. However, in this specific part of the code, the runtime API version was not taken from the substitute Wasm but from the original Wasm. Therefore, the backwards compatibility code was not used by the collator, which thus failed to submit the block to the relay chain.
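The following toy model shows the shape of the problem. It is not the cumulus code: the Runtime class, the version numbers and the "v1"/"v2" call styles are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Runtime:
    name: str
    collation_api_version: int  # hypothetical runtime API version


def collect_collation_info(executing: Runtime, version_source: Runtime) -> str:
    """Choose the call style from `version_source`, then call `executing`."""
    style = "v2" if version_source.collation_api_version >= 2 else "v1"
    supported = "v2" if executing.collation_api_version >= 2 else "v1"
    if style != supported:
        raise RuntimeError(f"{executing.name} does not support the {style} call")
    return f"collation info via {style} call on {executing.name}"


original = Runtime("on-chain Wasm", collation_api_version=2)
substitute = Runtime("substitute Wasm", collation_api_version=1)

# Version read from the Wasm that is actually executed: the old path is used.
print(collect_collation_info(substitute, version_source=substitute))

# Version read from the original Wasm: the new call is made against old code.
try:
    collect_collation_info(substitute, version_source=original)
except RuntimeError as err:
    print("failed:", err)
```

The backwards-compatible branch exists in both cases. It is just never reached when the version is looked up in the wrong place.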
The solution for this issue was to patch cumulus to only use the old runtime API and build a custom collator with that patch. This collator can only build blocks with the old runtime API. As soon as we upgrade to the new Wasm, building blocks will fail since the new runtime API version would be required at that point.
All these fixes led to a working collator that could build blocks and submit them to the relay chain where they were validated and accepted. This was a huge success. But as soon as we submitted an extrinsic, it failed with “invalid signature”.
Sign extrinsics with a specific version
When you sign extrinsics with the Polkadot-JS Apps, you also sign the version for which this extrinsic is valid. Even though we set the runtime version to 10730 in our old Wasm, the code itself still assumed that it was 10101. The reason for that is that we only changed the description of the Wasm code, not the content! Polkadot-JS Apps will use the version from the description and not from the actual Wasm code. In other words, extrinsics were signed for version 10730 instead of the required version 10101.
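The mismatch can be reproduced in a few lines. This is only a stand-in: an HMAC with a shared demo key replaces the real public-key signature scheme, and the call bytes are a placeholder. The two version numbers are the ones from our case.

```python
import hashlib
import hmac
import struct

SECRET = b"demo-key"  # stand-in for a real signing key pair

ADVERTISED_VERSION = 10730   # what the rewritten Wasm description reports
COMPILED_IN_VERSION = 10101  # what the Wasm logic itself still assumes


def sign(call: bytes, spec_version: int) -> bytes:
    """Sign the call together with the runtime version the signer assumes."""
    payload = call + struct.pack("<I", spec_version)
    return hmac.new(SECRET, payload, hashlib.sha256).digest()


def runtime_accepts(call: bytes, signature: bytes, compiled_in: int) -> bool:
    """The runtime rebuilds the payload with its own compiled-in version."""
    expected = sign(call, compiled_in)
    return hmac.compare_digest(signature, expected)


call = b"placeholder call"

# A client that trusts the advertised version produces an invalid signature.
print(runtime_accepts(call, sign(call, ADVERTISED_VERSION), COMPILED_IN_VERSION))   # False

# Signing for the compiled-in version is accepted.
print(runtime_accepts(call, sign(call, COMPILED_IN_VERSION), COMPILED_IN_VERSION))  # True
```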
As I’m not that gifted when it comes to the JavaScript side, I couldn’t find a way to set the runtime version manually when signing extrinsics using the polkadotjs-api. But the polkadotjs-api is not the only way to sign extrinsics! Have you heard of subxt? It’s a Rust library that lets you interact with your blockchain node over a WebSocket connection. Luckily, it supports specifying the runtime version when signing extrinsics. So I wrote another program to do exactly that.
Then we could build blocks and submit extrinsics! The last step was to schedule a runtime upgrade.
Execute a runtime upgrade
Finally, with the ability to produce blocks and submit valid extrinsics, we could schedule an upgrade to the new and correct version. It was important to increase the version number so that this Wasm code could actually replace the substitute Wasm that we added earlier. In our case, we bumped it to 10731.
Summary
We went to great lengths to get the chain running again and would never have gotten that far without the huge help from Parity devs. Blocks had to be built with an outdated runtime version, transactions signed with a specific version, and the collator client had to be patched. We learned a lot during this process but still wouldn’t want to repeat it.
Epilogue
We could have alternatively addressed this issue via Polkadot Relay Chain governance, which would have been less effort, but that would have taken much more time. With an increasing number of parachains joining the ecosystem, it might happen more often that parachains need the help of Polkadot governance to recover. For enterprises relying on the operation of their parachain, waiting at least 8 weeks isn’t a viable solution.
Having changes applied more quickly would be possible with intervention from the Technical Committee, but such a situation would not fall within the role of the Technical Committee, which generally intervenes only when an issue could affect the security of the Relay Chain. Maybe there could be a solution where the parachain delegates special powers to a separate body. These powers could be subject to conditions like “the parachain didn’t build blocks for X amount of time”. In case the parachain stalls, the body would then be able to recover the parachain using its special rights. Maybe this could be similar to the Fellowship in Gov 2.0, but more focused on parachains. What are your thoughts on that?
In the end, the easiest solution is to simply avoid mistakes in the first place.