Robust chain upgrades: Impossible or Uptane for Substrate (Parachains)

I’ve come across several reports of a parachain upgrade going awry and leaving the chain bricked, unable to produce blocks. Hijinks ensue, and everyone lives happily ever after.

One such case is discussed below. The consensus appears to be that automatic rollback is not possible, and that being inoperable for some period is the way things will be (hijinks required):

I’d assumed any such change works like a kernel update with A/B partitions: if the reboot fails, the system reverts to the previous setup.

I am not suggesting this would be straightforward. In fact, the Uptane details indicate the scope of the issue, albeit in a different context:

Nonetheless, is something like “The Update Framework + failsafe rollbacks for Parachains” possible?

Is there anything along these lines underway that can be tracked?

Update:
There is a Rust implementation of TUF (beta):

No development has been done yet. We are currently discussing what the best mechanism would be, and we are hosting a session at the barcamp next week.

At the moment there are multiple options proposed in the discussion you linked.

  1. A mechanism in Cumulus that enables parachains to recover without external help.
  2. A mechanism on the Relay Chain that allows parachains to delegate recovery powers to an entity. (suggested by Bryan Chen)

Both options have their pros and cons. Option 1 might not solve all the errors that could occur, while option 2 is difficult to implement in a decentralized fashion (the token holders on the bricked parachain should be able to vote).

An interesting discussion point regarding option 2 is how much power should be moved to the relay chain. The code of the parachain defines the rules that must be followed on that chain. If we now move parts of this to the relay chain, the parachain gives up a portion of its sovereignty, and it also becomes more complicated to reason about the rules on a parachain: you would need to take into account the parts that now live on the relay chain.
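To make the sovereignty trade-off in option 2 concrete, here is a minimal sketch of what a relay-chain-side delegation record might look like. Everything in it is hypothetical: `RecoveryRegistry`, `delegate`, `recover`, the `stalled` flag, and the rule that recovery is only allowed while a parachain is stalled are assumptions made for illustration, not an existing relay chain API or an agreed design.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class RecoveryRegistry:
    """Hypothetical relay-chain-side record of who may recover which parachain."""
    delegates: Dict[int, str] = field(default_factory=dict)  # para_id -> delegate
    stalled: Dict[int, bool] = field(default_factory=dict)   # para_id -> stalled?
    code: Dict[int, bytes] = field(default_factory=dict)     # para_id -> runtime code

    def delegate(self, para_id: int, entity: str) -> None:
        # Assumed to be called by the parachain itself while it is still healthy.
        self.delegates[para_id] = entity

    def recover(self, para_id: int, caller: str, new_code: bytes) -> None:
        # Only the registered delegate may act on behalf of the parachain.
        if self.delegates.get(para_id) != caller:
            raise PermissionError("caller is not the recovery delegate")
        # Assumption: recovery powers only apply while the parachain is stalled.
        if not self.stalled.get(para_id, False):
            raise PermissionError("parachain is not stalled")
        self.code[para_id] = new_code
        self.stalled[para_id] = False


registry = RecoveryRegistry()
registry.code[2000] = b"runtime-v2-broken"
registry.delegate(2000, "recovery-council")
registry.stalled[2000] = True
registry.recover(2000, "recovery-council", b"runtime-v1")
```

Note that the two checks in `recover` are rules about the parachain that would live outside the parachain's own code, which is exactly the reasoning overhead described above.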

Another issue is that a stalled parachain might even be the luckiest error case. A security vulnerability that lets you mint tokens could be even worse. Rollbacks might not be possible in these cases, since the tokens could already have been moved to other chains via XCM.

I won’t be at the Barcamp, and I understand it’s held under the Chatham House Rule, which is fine.

My understanding of both those options is that neither is an automatic rollback by the relay and parachain. Both options require voting. Correct?

Agree not every mishap will be reversible. Maybe restrict the initial scope to those that are.

Am I correct that if the upgrade protocol had an immediate ‘block-production’ validation step, and the relay and parachain kept the last known-good wasm to revert to, then some of your ‘hijinks’ may have been avoided and you would have been alerted to the issue immediately?
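As a rough illustration of the idea in this question, the sketch below models an upgrade that stays on probation until the parachain produces a block under the new runtime, and reverts to the retained last known-good code otherwise. It is only a toy model under stated assumptions: `UpgradeGuard`, its method names, and the `probation_blocks` window are hypothetical and do not correspond to any existing Cumulus or relay chain API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UpgradeGuard:
    """Hypothetical tracker for the active runtime and its fallback."""
    active_code: bytes
    probation_blocks: int = 10            # assumed window, in relay chain blocks
    fallback_code: Optional[bytes] = None  # last known-good wasm, kept during probation
    upgraded_at: Optional[int] = None      # relay block at which the upgrade was enacted

    def apply_upgrade(self, new_code: bytes, relay_block: int) -> None:
        if self.upgraded_at is not None:
            raise RuntimeError("previous upgrade is still on probation")
        # Keep the current runtime as the fallback before switching.
        self.fallback_code = self.active_code
        self.active_code = new_code
        self.upgraded_at = relay_block

    def on_parachain_block(self) -> None:
        # A block produced under the new runtime passes the validation step.
        self.fallback_code = None
        self.upgraded_at = None

    def on_relay_block(self, relay_block: int) -> bool:
        """Revert if no parachain block arrived within the probation window.

        Returns True when a rollback happened."""
        if self.upgraded_at is None:
            return False
        if relay_block - self.upgraded_at < self.probation_blocks:
            return False
        self.active_code = self.fallback_code
        self.fallback_code = None
        self.upgraded_at = None
        return True


guard = UpgradeGuard(active_code=b"runtime-v1", probation_blocks=3)
guard.apply_upgrade(b"runtime-v2-broken", relay_block=100)
still_waiting = guard.on_relay_block(101)  # within the probation window
rolled_back = guard.on_relay_block(103)    # window elapsed with no parachain block
```

This only covers the "chain stops producing blocks" class of failure. A runtime that keeps producing blocks but is otherwise faulty would pass this check, which is why the scope would need to be restricted to mishaps that are actually reversible.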

I’m not suggesting some Uptane-like recovery functionality is trivial, nor a cure-all. It is a well-defined starting point. It also means not every recovery is blocked on a vote.

Probably important to address two categories of mishaps separately?

@albi, I have drafted an RFP and submitted it to the Web3 Foundation Grants Program.
Could you be kind enough to bring it to the attention of the attendees at the Polkadot Summit: Barcamp (30 Nov, 1 Dec), under the topic Parachain Emergency Recovery?

In addition to giving general feedback, attendees will likely know of teams or people that could deliver the RFP: