How to Recover a Parachain

If something can go wrong it will go wrong at some point. It’s important to have processes in place to ensure that the likelihood of mistakes is reduced to a minimum. But even the best and most thorough processes can’t ensure that mistakes and errors never happen. So there always has to be a recovery path in case of errors.

For parachains, one of the most severe errors that can happen is when the relay chain has stored a different state transition function (a.k.a. the Wasm code) than the parachain. The relay chain stores the Wasm code for all parachains and parathreads that are registered. If the registered Wasm code for a parachain is different to the Wasm code that is stored on the parachain itself, blocks won’t be accepted by the relay chain.

In practice there are very few circumstances where this could happen. But it did happen to us. In the following text I will document the steps we took to recover from this situation, so that others who find themselves in a similar situation can benefit from what we learned.

The Problem

Why is it problematic when there is the wrong Wasm code registered on the relay chain? The parachain collators and the relay chain validators have to agree on how input is interpreted. For example, if collators run a newer version of the democracy pallet, the relay chain validators might not understand the new logic inside that pallet and reject the block, since it’s invalid from their point of view.

Another problem is that the runtime version could differ between the Wasm code on the parachain and the one on the relay chain. Signed extrinsics are bound to a specific runtime version. If extrinsics were valid for multiple versions, they could have different and unintended outcomes. For example, in the old version voting didn’t lock any tokens; in the new version voting locks funds. When signing the extrinsic with the old version in mind, you didn’t agree to any locking. To ensure that an extrinsic is bound to a specific logic, the runtime version is included when signing an extrinsic.

If the runtime version on the parachain is different to the version the relay chain has registered for that parachain, extrinsics are only valid for one of the two. So even if blocks are built, signed extrinsics wouldn’t be possible since they cannot be verified by both the relay chain and the parachain.
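To make this binding concrete, here is a minimal Python sketch. It is an illustration under assumptions, not how Substrate is implemented: HMAC-SHA256 stands in for a real signature scheme, the payload layout is heavily simplified, and the names `signing_payload`, `sign` and `verify` as well as the version numbers are made up for the example.

```python
import hashlib
import hmac


def signing_payload(call: bytes, nonce: int, spec_version: int) -> bytes:
    """Simplified payload: the call plus extra data that is signed along with it."""
    return call + nonce.to_bytes(4, "little") + spec_version.to_bytes(4, "little")


def sign(secret: bytes, call: bytes, nonce: int, spec_version: int) -> bytes:
    # HMAC is only a stand-in for a real signature scheme (assumption).
    payload = signing_payload(call, nonce, spec_version)
    return hmac.new(secret, payload, hashlib.sha256).digest()


def verify(secret: bytes, call: bytes, nonce: int,
           runtime_spec_version: int, signature: bytes) -> bool:
    # The verifier rebuilds the payload with its OWN runtime version,
    # so a signature made for another version no longer matches.
    expected = sign(secret, call, nonce, runtime_spec_version)
    return hmac.compare_digest(expected, signature)


secret = b"alice-secret"
call = b"democracy.vote"
sig = sign(secret, call, nonce=0, spec_version=10101)

print(verify(secret, call, 0, 10101, sig))  # True: same version on both sides
print(verify(secret, call, 0, 10730, sig))  # False: verifier assumes another version
```

The second check is the situation described above: the same extrinsic is valid for one runtime version and rejected by the other.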

The Solution

Once there is a mismatch between the registered Wasm on the relay chain and the Wasm on the parachain, the possibilities for fixing this are rather slim. One option is to force a Wasm change on the relay chain side. This can be achieved by calling paras.forceSetCurrentCode(paraID, WASM). This call requires a root origin. Thus, any Kusama/Polkadot parachain would need to go through the governance process, which could take weeks or even months.

For many projects such a downtime is not acceptable. Luckily there is a second option to bring the Wasm code back in sync. If the Wasm can’t be changed on the relay chain side, the only option is to change it on the parachain. Triggering a runtime upgrade is not possible, since this would require building blocks. But there is another way to change the Wasm on the parachain side.

The path to fixing the Wasm code goes as follows:

  1. Overwrite the Wasm that should be used by the collators with the one that is registered on the relay chain.
  2. Depending on the Wasm that is stored on the relay chain, it might be necessary to adjust the collator to be compatible with the old Wasm code.
  3. Build blocks using the old Wasm.
  4. Sign extrinsics with the runtime version of the old Wasm, so that they get accepted by both the relay chain and the parachain.
  5. Trigger a runtime upgrade which will bring the Wasm code back in sync.

In the following sections I will describe how this works in detail.

Overwrite the on-chain Wasm

Overwriting the Wasm is relatively easy. The chain specification has a special field for exactly that purpose. It’s called codeSubstitutes. This field contains an object that maps from block number to substitute Wasm. The collator will use this Wasm code starting from the block number that you specified until there is a Wasm code on the parachain with a higher version number.
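As a rough illustration, the following Python sketch adds such an entry to a chain specification that has been loaded as a dictionary. The helper name `add_code_substitute` is made up, and the exact encoding (block number as a string key, 0x-prefixed hex as value) is an assumption that should be checked against the chain spec format of your node version.

```python
import json


def add_code_substitute(chain_spec: dict, block_number: int, wasm: bytes) -> dict:
    """Return a copy of the chain spec with a substitute Wasm starting at block_number."""
    spec = dict(chain_spec)
    substitutes = dict(spec.get("codeSubstitutes", {}))
    # Assumption: keys are block numbers as strings, values are 0x-prefixed hex.
    substitutes[str(block_number)] = "0x" + wasm.hex()
    spec["codeSubstitutes"] = substitutes
    return spec


# Placeholder chain spec and placeholder bytes (only the Wasm header, not a real runtime).
chain_spec = {"name": "Example Parachain", "id": "example", "codeSubstitutes": {}}
fake_wasm = b"\x00asm\x01\x00\x00\x00"

patched = add_code_substitute(chain_spec, 1_000_000, fake_wasm)
print(json.dumps(patched["codeSubstitutes"], indent=2))
```

The original dictionary is left untouched, so the patched spec can be written to a new file and compared with the old one before restarting any collator.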

Say the Wasm code on the relay chain has the version number 10101 while on the parachain it is at 10730. Simply setting the Wasm code from the relay chain as a substitute cannot work since the version is too low. Luckily, the version can be overwritten without changing the Wasm code! We wrote a tool only for this purpose. When you input the Wasm code with your old version, you can generate the same Wasm with the new version annotation. It’s important not to change the runtime API version as this dictates how your collator communicates with the Wasm code. If the runtime API version doesn’t match the implemented runtime API, you can’t run the Wasm correctly. The next section explains how to deal with the latter.
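The selection rule as described in this post can be modelled with a few lines of Python. This is a simplified model of the description above, not the client’s actual implementation, and the function name `pick_runtime` and the block numbers are invented for the example.

```python
def pick_runtime(block_number: int, onchain_version: int,
                 substitute_from_block: int, substitute_version: int) -> str:
    """Simplified model: which Wasm does the collator execute at this block?"""
    # The substitute applies from its start block on, until the on-chain
    # Wasm has a higher version number than the substitute.
    if block_number >= substitute_from_block and onchain_version <= substitute_version:
        return "substitute"
    return "on-chain"


# Relay chain Wasm used as-is (10101) while the parachain is at 10730: not picked.
print(pick_runtime(500, 10730, 400, 10101))
# Same Wasm re-annotated as 10730: picked from block 400 on.
print(pick_runtime(500, 10730, 400, 10730))
# After a runtime upgrade to 10731 the on-chain Wasm takes over again.
print(pick_runtime(600, 10731, 400, 10730))
```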

Backwards compatibility

After a collator builds a block, it submits this block together with the collation information. This collation information is collected by calling a runtime API. In our case, a breaking change was introduced to this runtime API in between our old and new Wasm code. The old Wasm code that was registered on the relay chain used the old runtime API, while the new Wasm code on the parachain used the new runtime API. This wouldn’t normally be an issue, since the runtime API is versioned and the code is written to be backwards compatible. However, in this specific part of the code, the runtime API version was not taken from the substitute Wasm but from the original Wasm. Therefore, the backwards compatibility code was not used by the collator, which thus failed to submit the block to the relay chain.
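The mismatch can be sketched as follows. The version numbers 1 and 2 and the call names are assumptions chosen for illustration; the point is only that the version lookup and the executed Wasm have to refer to the same code.

```python
def collation_call(api_version: int) -> str:
    """Pick the shape of the runtime API call for a given API version (names hypothetical)."""
    if api_version >= 2:
        return "collect_collation_info(header)"  # new API
    return "collect_collation_info()"            # backwards-compatible path


ONCHAIN_API_VERSION = 2     # assumed: new Wasm stored on the parachain
SUBSTITUTE_API_VERSION = 1  # assumed: old Wasm that is actually executed

# Problematic lookup: version read from the on-chain Wasm,
# although the call is executed by the substitute Wasm.
used_call = collation_call(ONCHAIN_API_VERSION)

# Intended lookup: version read from the Wasm that is executed.
needed_call = collation_call(SUBSTITUTE_API_VERSION)

print(used_call, "vs", needed_call)
```

In this model the collator issues the new-style call against a Wasm that only implements the old one, which matches the failure described above.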

The solution for this issue was to patch Cumulus to only use the old runtime API and build a custom collator with that patch. This collator can only build blocks with the old runtime API. As soon as we upgrade to the new Wasm, building blocks will fail since the new runtime API version would be required at that point.

All these fixes led to a working collator that could build blocks and submit them to the relay chain where they were validated and accepted. This was a huge success. But as soon as we submitted an extrinsic, it failed with “invalid signature”.

Sign extrinsics with a specific version

When you sign extrinsics with the Polkadot-JS Apps, you also sign the version for which this extrinsic is valid. Even though we set the runtime version to 10730 in our old Wasm, the code itself still assumed that it was 10101. The reason for that is that we only changed the description of the Wasm code, not the content! Polkadot-JS Apps will use the version from the description and not from the actual Wasm code. In other words, extrinsics were signed for version 10730 instead of the required version 10101.

As I’m not that gifted when it comes to the JavaScript side, I couldn’t find a way to set the runtime version manually when signing extrinsics using the polkadotjs-api. But the polkadotjs-api is not the only way to sign extrinsics! Have you heard of subxt? It’s a Rust library that lets you interact with your blockchain node over a WebSocket connection. Luckily, it supports specifying the runtime version when signing extrinsics. So I wrote another program to do exactly that.

Then we could build blocks and submit extrinsics! The last step was to schedule a runtime upgrade.

Execute a runtime upgrade

Finally, with the ability to produce blocks and submit valid extrinsics, we could schedule an upgrade to the new and correct version. It was important to increase the version number so that this Wasm code could actually replace the substitute Wasm that we added earlier. In our case, we bumped it to 10731.

Summary

We went to great lengths to get the chain back running and would have never gotten that far without the huge help from Parity devs. Blocks had to be built with an outdated runtime version, transactions signed with a specific version, and the collator client had to be patched. We learned a lot during this process but still wouldn’t want to repeat it.

Epilogue

We could have alternatively addressed this issue via Polkadot Relay Chain governance, which would have been less effort, but that would have taken much more time. With an increasing number of parachains joining the ecosystem, it might happen more often that parachains need the help of Polkadot governance to recover. For enterprises relying on the operation of their parachain, waiting at least 8 weeks isn’t a viable solution.

Having changes applied quicker would be possible with intervention from the Technical Committee, but such a situation would not fall into the role of the Technical Committee, who generally intervene only when an issue could affect the security of the Relay Chain. Maybe there could be a solution where the parachain delegates special powers to a separate body. These powers could be subject to conditions like “the parachain didn’t build blocks for X amount of time”. In case the parachain stalls, the body would then be able to recover the parachain using its special rights. Maybe this could be similar to the Fellowship in Gov 2.0, but more focused on parachains. What are your thoughts on that?

In the end, the easiest solution is to simply avoid mistakes in the first place.

7 Likes

For the initial period where the Parachain isn’t completely decentralized, the following way could be used: https://github.com/paritytech/polkadot/pull/5451

If a Parachain is owned by some enterprise, they would probably never set this lock and could use this forever.

In general I thought about some sort of recovery mode for Parachains. Something that is living inside the Parachain runtime. Some very easy code path that could be used to only upgrade a Parachain. However, I don’t know whether it is worth creating something like this. The biggest problem would also be the authorization to enter this mode.

Nevertheless, I like this solution the most :smiley:

3 Likes

My main concern is that a parachain might enter a state where it can only recover using the relay chain governance. And that would probably take a long time (might be faster with Gov 2.0?). Keeping the Parachain unlocked and handing the power over the parachain to a single account isn’t an option for most projects. If the parachain registrar could be changed using XCM, it might be possible to build custom solutions for this, e.g. something like the Fellowship in Gov 2.0, but for each parachain. This wouldn’t even need to live on the relay chain and could be a parachain itself.

1 Like

Interesting post.

There is no technical solution to this. It’s a political problem. If you deploy the wrong Wasm code then you have to appeal to governance to decide politically if it was actually the wrong code.

I think it’d be best for technical efforts to be spent on improving the tooling so that these mistakes simply do not happen.

I’m with @rphmeier here. You need to think about this in terms of a Parachain that is decentralized and stops because of some bug. Then you want some external entity to vote on some Wasm blob to fix the bug. This bug fix could be controversial, and the people affected by it wouldn’t have any voting power, or not as much voting power as on the Parachain itself.

1 Like

As a note, this PR has been merged, which makes it one step easier for teams to manage their Wasm and Hash on the relay chain.

The steps needed:

  • Register your parachain with an account managed by your team.
  • Get a parachain slot, which will “lock” your parachain, the default and safe behavior for the network.
  • Have your parachain send an XCM message to the relay chain, unlocking the chain, and giving access to the parachain registrant.
  • Make scheduled changes to your parachain wasm or head using the new extrinsics available to the registrant of an unlocked parachain.

I would like to note that chains which are “unlocked” are basically the same as those with “sudo”, and should be considered permissioned and somewhat centrally controlled chains. BUT, these are still useful, especially in the early days of the network. Looking forward to seeing more writing and guidance on using these tools as parachain teams actually use them.

1 Like

I also fully support building better tools so that this doesn’t happen, but the recent incident has shown us that we cannot live without the political part, especially when we have no way to speed up such a critical update. I would argue that it’s not that hard to see the changes, even for novice users, when the fix is small. In the case of our fix there are only two changes:

  1. Comparing v12.0.0...v12.0.1 · galacticcouncil/Basilisk-node · GitHub
  2. dont wipe authorities · galacticcouncil/substrate@2cb01a5 · GitHub

We could make sure there are tools to make it much easier to explain and show the diff to users.

But even if this is solved, I see three main problems here which are fundamentally technical and are not solved by gov v2 AFAIK

  1. If a parachain stalls there is no way for users to vote on new upgrade apart from relay chain vote
  2. If there is a lot of KSM in that parachain and even KSM of the voters, there is no way for them to vote on unlocking their funds.
  3. There is no way for this to be quick

I was proposing to host the voting on a separate chain with separate tokens, but @bkchr pointed out that an off-chain solution could be made and would probably be much more efficient.

I think we should really think about this, because if a DeFi chain with a lot of liquidity and a money market on top of it stalls, even for a day, it could lead to a catastrophic chain of liquidations and loss-of-funds events across the whole ecosystem. We should have a way to fix things quickly, because even with the best tools it will probably happen, and one time could be enough.

2 Likes

Very good points. We are hosting a workshop about this topic at the barcamp. Would love it if we could work on a solution there. :slight_smile:

4 Likes

Maybe that doesn’t need to be fully off chain. With manual para lock we could probably introduce some kind of “move control to an external body at parachain X”. Let’s assume that your Parachain has done this and it stopped. Then you could send your users to parachain X, where they could prove, based on the latest block of your Parachain, that they own X amount of tokens and are eligible to vote to recover your chain. When enough users from your parachain have voted, the recovery could be done. This would need some extra checks, for example: when the state of your Parachain changes, all previous votes are removed, because they tried to “recover” a working chain.

I think there are some ways to do this kind of thing, but for sure it requires much more thinking.

4 Likes

Maybe we can have a common good parachain for this purpose.

Every other parachain can opt-in to authorize this rescue parachain to have the upgrade permission IFF there is no para block finalized on relaychain for more than X mins.

And then this rescue parachain can implement some simple voting method to allow people to vote on the rescue Wasm runtime. The Interchain Proof Oracle Network would be used to prove holding of funds.

So then, instead of seeking DOT/KSM holders’ approval, the parachain’s token holders can self-serve to rescue their parachain.
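The authorization condition from this proposal could look roughly like the following Python sketch. Nothing like this exists today; the type `ParaStatus`, the function `rescue_allowed` and the threshold value are hypothetical and only restate the “opt-in, and only if stalled for more than X minutes” rule.

```python
from dataclasses import dataclass


@dataclass
class ParaStatus:
    opted_in: bool          # parachain governance authorized the rescue chain
    last_finalized_at: int  # unix time of the last para block finalized on the relay chain


def rescue_allowed(status: ParaStatus, now: int, stall_threshold_secs: int) -> bool:
    """The rescue chain may upgrade the parachain only if it opted in AND has stalled."""
    if not status.opted_in:
        return False
    return now - status.last_finalized_at > stall_threshold_secs


THRESHOLD = 60 * 60  # hypothetical value for "X mins": one hour

stalled = ParaStatus(opted_in=True, last_finalized_at=1_000)
healthy = ParaStatus(opted_in=True, last_finalized_at=9_990)
not_opted_in = ParaStatus(opted_in=False, last_finalized_at=1_000)

for status in (stalled, healthy, not_opted_in):
    print(rescue_allowed(status, now=10_000, stall_threshold_secs=THRESHOLD))
```

A parachain that wants to shut down on purpose would simply never opt in, or opt out first, so the rescue path never becomes available.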

9 Likes

I agree with this idea. I believe a common good “parachain recovery” chain would be a nice compromise. But since this is not a quick solution to implement, it would be wise to explore the possibilities presented by the tools available to us in the near future, specifically parathreads. There might be a better approach where parachains maintain some sort of emergency recovery “mini runtime”, ready to run in a parathread, that gets automatically triggered when the parachain stalls and presents the network participants with a more sovereign solution to recovering the chain. Perhaps this could even allow for basic usage of the network (like token transfers and other features not in constant change), which then gets pushed back to the main parachain along with the fixed validation code that unbricks the chain. This could allow for zero downtime in the parachain with a sovereign recovery solution. It’s just a quick idea, but in my opinion worth exploring in more depth.

How should that work? How would the parathread be authorized to do this upgrade?

Maybe we should really integrate some kind of recovery mode in the Cumulus PoV logic. The only problem there is, how should we authorize the enabling of this mode? It needs to be something that doesn’t involve too many entities, maybe 3/4 of the collator set or something similar. I mean in the end it would be configured by the Parachain on what logic to use for authorizing this. The recovery mode could then be some simple token voting mechanism to authorize a runtime upgrade to fix the broken chain.

So this is obviously a very complex & political topic.

The de-facto approach to governance in most blockchain systems is more or less that code is law. There have always been exceptions when the system itself was threatened, e.g. when Bitcoin had an infinite mint bug in 2010. Ethereum has hardened on this position over time: in 2016, the DAO hack was enough to necessitate a hard fork and in 2018/19 the Parity wallet hack was not. At this point in time, it doesn’t seem that even a $500m DeFi hack is enough for Ethereum governance to get involved and push a hard fork.

In Polkadot, we have on-chain governance, which carries with it a social contract: the token-holders or ‘citizens’ of the chain have ultimate authority. Other lesser bodies may have some minor privileges, which are scoped. The relay-chain governance has the ability to overwrite any parachain deployed on Polkadot. This is a power that should be used with extreme caution.

When evaluating governance authority, it is important to evaluate the worst cases for abuse as well as use. Convenience often gives way to tyranny. And it is extremely difficult to account for the actual intentions of a parachain in a broad technical mechanism. The best way we have of doing that is whether the chain is proceeding as planned according to its state-transition function. If the chain stalls, there is no shortcut for human intervention.

Essentially, I see parachain teams asking for the relay-chain to automatically evaluate proxy signals such as a token-holder vote or collator referendum to get the chain started up again. Given that parachains are a general mechanism akin to smart contracts, there is no impartial way of evaluating whether these signals actually encapsulate the will of the parachain. For instance, if the parachain has deliberately set its code to void as a way of shutting down, that should not be overruled. It goes beyond the social contract.

It seems to me that parachains don’t want to add ‘admin-multisig’ style recovery paths to their own chain, but would like for the relay-chain governance to function as a fast-response admin-multisig, without having the proper interfaces to do this in a general way that suits all use-cases. I think it would be better for admin/recovery to be managed in the parachain Wasm, as @bkchr suggests, and to expose fallback/recovery infrastructure within the Wasm blob itself, to handle corrupted storage or bugs, if that is what’s desired. I assume this would solve for 80-90% of such cases that we have seen historically, with mild errors in parachain logic or storage. Over time, as systems get more stable, they might choose to remove or limit their recovery/admin infrastructure in favor of more decentralization. For cases that aren’t covered by parachain-scope recovery paths, we will just have to wait longer for the top-level relay-chain governance to decide. Which is not even an option in other ecosystems.

I don’t believe these are technical problems at all.

I can ask these political questions about each of these points:

  1. Who should have the power to upgrade a parachain against the explicit observable behavior of the parachain’s Wasm code itself?
  2. Let’s say the parachain intended to burn those users’ KSM by stopping itself. Should those users alone have the ability to override this behavior? Or should they vote alongside all other KSM holders? Is there an impartial way to determine which KSM is intended to be owned by which account on the parachain, even if the parachain code or storage itself is erroneous?
  3. Is it not dangerous to be able to quickly upgrade a parachain or change its storage? From a technical perspective, we could easily all vote to change the governance system to operate 100x faster. This is not a technical issue but a political one, because there needs to be enough time for all interested participants.

In my proposal, this functionality needs to be opt-in, i.e. enabled by the parachain governance. So if the parachain wants to purposely die, it can simply opt out first.

Another way to look at this is: I am requesting a generalized governance chain whose sole purpose is to allow people to vote and dispatch XCM to other parachains or the relaychain. Then a parachain may not need to implement any native token and governance. It just needs to issue a token on Statemint and use this generalized governance chain for any governance actions. In fact, this is the goal of Polkadot: the core relaychain should have no functionality other than finalizing para blocks. All the governance and token functionality lives on system parachains. This shares the same design, except that the system parachain could be used by other community parachains.

Or, once parathreads are mature, parachains can also deploy an alternative governance body on a parathread that can be used to rescue the main parachain under special circumstances.

5 Likes

I think we agree that relaychain voting should only be used in extreme circumstances. In that sense:

  1. Parachain users. This was the point of my post and I would like to find solution for this.

I know it’s not possible right now, but it seems to me like we could find one. As a very generic idea: once we have state proofs on the relay chain for parachain state, can we devise a generic way to gather the balances of its users? (Parachains might choose the rules for this, e.g. who owns lent voting power? Can you use only free balance?) If this were readable by the relaychain or a governance parachain, users of a given parachain could vote with their balances on the next state of that chain. Again, this might not be the best solution.

  2. Vote alongside everybody else. In the case the storage has errors, it is probably the extreme case and relaychain governance could step in as it would now. Relaychain vote has precedence over parachain vote (at least now, and I don’t see an immediate reason to change this).

  3. It depends and I’m not sure… Is there a way to speed up technical fixes if a collective of people deemed technical agree that the upgrade is non-malicious and that it fixes the protocol? If that is kept in governance v2, why not for parachains? If not, how can we find a way to make fixes secure and relatively fast? I think parachains should have a say on these parameters for themselves and should choose parameters they deem reasonable.

All in all, I completely agree we should not re-use or abuse relaychain governance for these things, and all I’m trying to find out is: can we find a way to make parachain governance more robust and use relaychain governance as a last resort only? It might not be the right way and best reward for the effort. That’s why we need this discussion, because there might be better ideas. The only thing I know is that having a DeFi parachain stall for days means it probably never goes back up again.

“Once we have state proofs on the relay chain for parachain state, can we devise a generic way to gather balance of it’s users”

This is a pretty big presupposition. Polkadot is deliberately designed to be general over storage formats of parachains as well as user schema. There isn’t a good way to ‘prove’ users to the relay chain without effectively locking in parachains to a particular type of merkle trie or storage schema. Most chains use hex tries now, although binary tries are strictly better. And not all parachains will follow the current FRAME schema. There will be parachains written with things that are not FRAME, or with future versions of FRAME. It also seems unlikely that balance accounting can be done impartially without having the relay chain reason about specific pallets on specific parachains.

The two tools that have been suggested in this thread seem to me the most viable path as well as the least likely to incur technical debt.

  1. Basti’s suggested parachain-side recovery mode (solves 80-90% of cases)
  2. Bryan’s suggestion: Parachain admin capabilities, which already exist but should be extended to support an arbitrary account ID as the manager. Over XCM this could be controlled by a multisig or a voting mechanism on another parachain.

These can be used in conjunction with each other and use relay-chain tokenholder vote as a fallback.

When it comes to allowing tokenholders on the parachain to vote, this is actually a more general problem and may need a different solution that is relevant here but somewhat beyond the scope of the topic.

Strictly speaking, DOT held on a parachain is owned by the parachain account and claims are internally delegated to its users according to the mechanisms of the parachain. I propose that we only need two functionalities on the relay-chain in order to solve this issue:

  1. Accounts delegate governance voting rights (already exists)
  2. Accounts should be able to cast many governance votes

With these two functionalities, any parachain can delegate all of its governance voting rights to some account which is controlled on another parachain. This account will be a smart contract, which operates according to the following rules:

  1. If an XCM message is received from the parachain account directing it to vote, it votes according to the parachain’s command.
  2. If no XCM message is received from the parachain (e.g. if the parachain is down or non-operating), it provides logic tailored to the parachain for users of the parachain to prove % holding of the parachain’s DOT and votes according to their will.
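The two rules can be sketched in Python as follows. This is only one possible reading of the proposal: the function name `decide_votes`, the use of percentage shares and the proportional split in the fallback case are assumptions made for the example, not part of any existing contract or pallet.

```python
from typing import Dict, List, Optional, Tuple


def decide_votes(xcm_command: Optional[str],
                 proven_votes: List[Tuple[int, str]]) -> Dict[str, int]:
    """Return how the account's voting power (in percent) is cast.

    xcm_command:  vote instructed by the parachain via XCM, or None if none arrived.
    proven_votes: (proven share in percent, vote) pairs submitted by parachain users.
    """
    # Rule 1: a command from the parachain account always wins.
    if xcm_command is not None:
        return {xcm_command: 100}
    # Rule 2: otherwise cast split votes according to the proven holdings.
    tally: Dict[str, int] = {}
    for share, vote in proven_votes:
        tally[vote] = tally.get(vote, 0) + share
    return tally


print(decide_votes("aye", [(30, "nay")]))
print(decide_votes(None, [(30, "aye"), (20, "nay"), (10, "aye")]))
```

The fallback case is where the second required functionality matters: the account has to be able to cast several votes at once to reflect users who disagree with each other.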

There is a lot of flexibility for the specific governance voting mechanism. This smart contract can also be the controller of the relay-chain ‘para-admin’ capabilities of the parachain, if that is desired. It is totally opt-in, and will need to be upgraded whenever the parachain’s own rules, storage format, or storage schema changes, in order to do the accounting correctly.

Note that perfect tracking of claims by users is probably impossible. Users on one parachain might be owners of tokens which correspond to ownership of DOT held by another parachain, or other such cases. While these users might feel like they own DOT, and there is a case to be made that they effectively do, this type of accounting is difficult and can at best be approximate, given opt-in and coordination of delegations by many parachains.

1 Like