I’ve come across several reports of a parachain upgrade going awry and bricking the chain, leaving it unable to produce blocks - hijinks ensue, and everyone lives happily ever after.
One such case is discussed below, and the consensus appears to be that automatic rollback is not possible and that being inoperable for some period is the way things will be (hijinks required):
I’d assumed any such change works like a kernel update with A/B partitions - if the reboot fails, the system reverts to the previous setup.
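For anyone unfamiliar with the A/B pattern, here is a minimal sketch in Python. All names (`ABSlots`, `boots_ok`, the image strings) are made up for illustration and are not taken from any real bootloader or from Polkadot; the point is only that the working image is never overwritten until the new one has proven itself.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ABSlots:
    """Two image slots; only one is active at a time (hypothetical model)."""
    active: str        # image currently booted and known to work
    standby: str = ""  # slot that receives the next update

    def update(self, new_image: str, boots_ok: Callable[[str], bool]) -> bool:
        """Write new_image to the standby slot and trial-boot it.

        Returns True if the new image was committed, False if the
        previously active image was kept.
        """
        self.standby = new_image
        if boots_ok(new_image):
            # Trial boot succeeded: swap slots, the old image stays as fallback.
            self.active, self.standby = self.standby, self.active
            return True
        # Trial boot failed: the active slot was never touched, nothing to undo.
        return False


slots = ABSlots(active="kernel-v1")
committed = slots.update("kernel-v2-broken", boots_ok=lambda img: "broken" not in img)
```

After the failed update above, `slots.active` is still `"kernel-v1"`; a later update with a bootable image would swap the slots and keep `kernel-v1` as the fallback.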
I am not suggesting this would be straightforward. In fact, the Uptane details indicate the scope of the issue - albeit in a different context:
Nonetheless, is something like “The Update Framework + failsafe rollbacks for Parachains” possible?
Is there anything along these lines underway that can be tracked?
Update:
There is a Rust implementation of TUF (beta):
Both options have their pros and cons. Option 1 might not solve all the errors that could occur, while option 2 is difficult to implement in a decentralized fashion (the token holders on the bricked parachain should be able to vote).
An interesting discussion point regarding option 2 is how much power should be moved to the relay chain. The code of the parachain defines the rules that must be followed on that chain. If we now move parts of this to the relay chain, the parachain gives up a portion of its sovereignty, and it also becomes more complicated to reason about the rules on a parachain: you would need to take into account the parts that now live on the relay chain.
Another issue is that a stalled parachain might even be the luckiest error case. A security vulnerability that lets you mint tokens could be even worse. Rollbacks might not be possible in these cases, since the tokens could already have been moved to other chains via XCM.
I won’t be at the Barcamp, and understand it’s held under Cheltenham House Rules - which is fine.
My understanding of both those options is that neither is an automatic rollback by the relay and parachain. Both options require voting. Correct?
Agree not every mishap will be reversible. Maybe restrict the initial scope to those that are.
Am I correct that if the upgrade protocol had an immediate ‘block-production’ validation step, and the relay chain and parachain kept the last known-good Wasm to revert to, then some of your ‘hijinks’ might have been avoided, and you would have been alerted to the issue immediately?
I’m not suggesting some Uptane-like recovery functionality is trivial, nor a cure-all. It is a well-defined starting point. It also means not every recovery is blocked on a vote.
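To make the question concrete, here is a rough Python sketch of what I mean by a validation step plus a retained known-good Wasm. This is a toy model under my own assumptions, not how the relay chain or Cumulus works today: `ParachainRuntime`, `can_produce_block` and `probation_blocks` are hypothetical names, and the sketch ignores state migrations, which may not be reversible.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ParachainRuntime:
    """Toy model: tracks the current code and the last known-good code."""
    current_wasm: bytes
    last_known_good: Optional[bytes] = None
    alerts: List[str] = field(default_factory=list)

    def apply_upgrade(
        self,
        new_wasm: bytes,
        can_produce_block: Callable[[bytes], bool],
        probation_blocks: int = 3,
    ) -> bool:
        """Switch to new_wasm, then require a run of successful blocks.

        Reverts to the retained code and records an alert as soon as one
        probation block fails. Returns True only if the upgrade sticks.
        """
        # Keep the old code around before switching (assumed to be stored
        # by both the relay chain and the parachain).
        self.last_known_good = self.current_wasm
        self.current_wasm = new_wasm
        for height in range(probation_blocks):
            if not can_produce_block(self.current_wasm):
                self.alerts.append(
                    f"upgrade failed at probation block {height}; reverted"
                )
                self.current_wasm = self.last_known_good
                return False
        return True


chain = ParachainRuntime(current_wasm=b"v1")
stuck = chain.apply_upgrade(b"v2-bad", can_produce_block=lambda w: w != b"v2-bad")
```

In this model the failed upgrade leaves the chain on `b"v1"` with one alert recorded, and no vote is needed for that particular recovery. Mishaps where the new code does produce blocks but is wrong in some other way are out of scope for a check like this.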
Probably important to address two categories of mishaps separately?