Proposed Solutions for Unbricking an Enterprise Parachain with OpenGov

This post follows up on the recent Nodle Parachain upgrade which halted block production. A governance proposal, which required locking 100,000 DOT, has been opened to resume block production. With a 28-day decision period, this will lead to an estimated total Parachain downtime of about 31 days.

The support from the community has been incredible, with numerous Polkadot and Parachain builders stepping up by offering help, advice, and support on the proposal itself.

We wanted to use this opportunity as a way for the ecosystem to identify areas of improvement within Polkadot itself and further strengthen its position within the Web3 industry.

From a Parachain builder's point of view, the reason for choosing Polkadot as a platform is the set of unique "services" it provides. We can think of Polkadot as a true decentralized service provider for:

  • Shared Security: Polkadot’s validators ensure a Parachain’s security by validating its blocks as they are proposed by collators.
  • Interoperability: Parachains can use XCM to communicate with other Parachains and build composable applications.
  • Uptime: if Parachains produce valid blocks, they are finalized and included within the network. Without uptime, the other two value propositions are essentially void; in fact, without uptime there is significant negative value to operating as a Parachain.

Parachains pay for these services by either bidding on their slots or allocating some of their own tokens to have the community support them via a crowdloan. This is an investment equivalent to anywhere from $300,000 to millions of dollars depending on market conditions. In addition to the economic factors, significant time is spent to fundraise, bid, and acquire a slot. For a startup project, time is its most valuable resource, and having to acquire or maintain a Parachain slot can be a distraction that slows down the search for product-market fit or the financing of the project.

In addition, in the case of Nodle, where most applications serve enterprise purposes, Parachain downtime has a real-world economic impact for businesses. It also creates a potential reputational impact for the ecosystem, since it raises questions about Polkadot's robustness. With a Parachain halted, mission-critical applications are stopped, and real-world customers (many of whom are using Web3 for the first time) are impacted.

The Nodle team takes responsibility for its upgrades, but the ecosystem as a whole is operating at the cutting edge, with many upgrade parameters impossible to verify in a testnet environment. To prevent uptime reliability issues in the future, we propose a handful of improvements. We believe these will not only be helpful for other projects, but essential for the Polkadot ecosystem to be taken seriously and considered more often for enterprise applications.

Proposed Solutions for Parachain Uptime Reliability

We believe that for the Parachain ecosystem to be sustainable for enterprise use cases, Polkadot should maintain its uptime, interoperability, and shared security services even in the case of a problematic upgrade.

Auto Reverting Failed Upgrades

This could, for instance, be achieved by dropping a failed upgrade and reverting to the previous version if no new block is produced before a deadline. This is of course not how it works today and would require adapting the current implementation; if others agree, we intend to further research how to implement it and contribute to the Polkadot core code.
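
To make this concrete, here is a minimal sketch of what such a revert rule could look like. The types, field names, and deadline value are hypothetical simplifications for illustration, not actual Polkadot runtime code:

```rust
// Hypothetical sketch only; not actual Polkadot code. The idea: the relay chain keeps
// the previous validation code around and falls back to it if the parachain fails to
// produce a block within a deadline after an upgrade is enacted.

/// Simplified stand-in for the relay chain's view of a parachain.
struct ParaState {
    previous_code_hash: Option<[u8; 32]>, // last known good runtime
    last_backed_at: u32,                  // relay block of the para's last backed block
    upgrade_enacted_at: Option<u32>,      // relay block at which the new code went live
}

/// Illustrative deadline: roughly one hour of 6-second relay blocks.
const REVERT_DEADLINE: u32 = 600;

/// If the parachain has not produced a block since the upgrade was enacted and the
/// deadline has passed, return the code hash to revert to.
fn maybe_revert(state: &ParaState, now: u32) -> Option<[u8; 32]> {
    let enacted_at = state.upgrade_enacted_at?;
    let stalled_since_upgrade = state.last_backed_at < enacted_at;
    if stalled_since_upgrade && now.saturating_sub(enacted_at) > REVERT_DEADLINE {
        state.previous_code_hash
    } else {
        None
    }
}
```

Whether such a rule can be made safe, for example against collators deliberately withholding blocks to force a revert, is exactly the kind of question this research would need to answer.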

Polkadot’s uptime as a relay chain is excellent, without any major recorded incidents throughout its existence.

Looking at other blockchain platforms, downtimes are usually resolved within hours or days (1, 2, 3). Any downtime is considered antithetical to the purpose of a blockchain, and even a few minutes of downtime results in serious reputational impact. Looking at Web2 services, downtimes are resolved within a matter of hours, and many businesses contractually guarantee uptimes of 99.999% per year. Yet on Polkadot, resolving a failed upgrade would take over 28 days. This means that the maximum possible uptime when a problem is uncovered would be approximately 92%.

In the Web2 space, one ITIC survey found that the cost of a single hour of server downtime totals $300,000 or more for 91% of the surveyed corporations. For a company like Amazon, using its net sales over the past 12 months, 28 days of downtime would represent a loss of over $41 billion USD. For Polkadot to even be considered for future enterprise applications, it needs to reach five-nines uptime.
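
For reference, here is the rough arithmetic behind these figures, using the approximate values above:

```latex
% 28 days of downtime as a share of a year
\frac{28}{365} \approx 7.7\% \quad\Rightarrow\quad \text{uptime} \approx 92.3\%

% "Five nines" (99.999\%) allows roughly
365 \times 24 \times 60 \times (1 - 0.99999) \approx 5.3 \text{ minutes of downtime per year}
```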

Building better testing tools

Nodle is committed to providing better tools for testing Parachain upgrades and making them available to the community. In the case of the current upgrade, Nodle attempted to migrate more than 47,000 NFTs to a different pallet. While this worked perfectly on testnet, few Parachains have pushed Substrate to its limits the way Nodle has. We propose building better testing tools to help simulate migrations closer to production conditions. For instance, try-runtime failed to detect the migration's high PoV size and the time it would take to produce the block.
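
As a back-of-the-envelope illustration of the kind of check such tooling should make trivial: the per-item proof size below is an assumption for illustration (not a measured Nodle figure), and the ~5 MiB PoV limit is the commonly cited parachain block proof budget.

```rust
// Rough pre-flight check: does a one-shot migration plausibly fit within a single
// parachain block's proof (PoV) budget? Numbers here are illustrative assumptions.

const MAX_POV_BYTES: usize = 5 * 1024 * 1024; // ~5 MiB, commonly cited PoV limit
const ITEMS_TO_MIGRATE: usize = 47_000;       // order of magnitude of the NFT migration
const PROOF_BYTES_PER_ITEM: usize = 300;      // assumed average read+write proof footprint

fn main() {
    let estimated_proof = ITEMS_TO_MIGRATE * PROOF_BYTES_PER_ITEM;
    // Roughly 14 MB of proof against a ~5 MiB budget: such a migration cannot fit in
    // one block and needs to be batched or spread over multiple blocks.
    println!(
        "estimated proof: {} bytes, budget: {} bytes, fits: {}",
        estimated_proof,
        MAX_POV_BYTES,
        estimated_proof <= MAX_POV_BYTES
    );
}
```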

The Polkadot community has already been extremely active and provided updates to testing tools since we highlighted this issue. We will investigate whether the Pull Request opened on try-runtime on August 23rd is sufficient or whether it needs more improvements which we could contribute.

Getting back online through decentralized governance

Fortunately, Polkadot includes systems within OpenGov to shorten a 28-day revert and resolve uptime issues much faster. Unfortunately, these features are not directly accessible to Parachain teams; they are restricted to a handful of people in the Polkadot Fellowship.

Therefore, considering all the reasons mentioned above, we would like to ask the community and the Fellowship: does Polkadot have a way to restore its core business services for a Parachain as soon as possible, and if not, what is preventing Polkadot from doing so?

With the purpose of enabling the ecosystem to reach enterprise-grade service levels, we would like to hear from the community and the Fellowship.

11 Likes

On other blockchains, when there is an issue due to a failed upgrade, the dev team and node operators work together to revert or resolve the issue ASAP. Usually the chain resumes operation within a day or two. This process is usually very centralized.

As a Polkadot parachain, we have a way to resolve this kind of issue in a fully decentralized way. However, as stated, 28 days is simply unacceptable for every use case. Decentralization is one of the key factors that all of us here are working towards, but it cannot come at the cost of, well, not working. A fully decentralized product that doesn't work is not going to compete with a centralized product that does. We all know that.

Can we improve it? Yes, we already had some prior discussions and a planned RFC on improving the situation. Eventually we will have a solution that is decentralized, and also efficient and secure. But that will take time; I would say at least 6 months, most likely 12 months or more, to build, test, and deploy. This doesn't help with the problems that we are facing NOW.

Can we do something NOW? Maybe. OpenGov does have the ability to fast-track proposals. In order to avoid abuse, such proposals must be whitelisted by the Fellowship. And again, to prevent abuse, we shouldn't whitelist proposals unless it is necessary.

As a Dan 4 Fellowship member, for the reasons stated above, I believe it is right for the Fellowship to help in this case. A working blockchain is better than one that isn't working.

6 Likes

Hey Eliot, one of the maintainers of try-runtime-cli here.

Would love to chat about how we can collab, feel free to reach out on element - liam:parity.io

4 Likes

Thank you for this extensive, not-too-technical explanation of the 28-day fiasco.

I really hope this will bring better tooling and ways to prevent this, or, in the worst case, revert it in a faster manner.

And also a big thank you to all the supporters, builders and teams that have been so helpful since day one :raised_hands:

2 Likes

The whitelisted caller track is not a fast-track solution: the vote can still last up to 28 days, as it depends on how many token holders are voting and on the support and approval thresholds. What happens here is that the thresholds are lower.

Another possible solution that was discussed some time ago was to code a new governance track for Parachains: one used exclusively for submissions aiming to unbrick their chains, with lower thresholds, some level of certainty that all submissions were checked, and lower decision deposits than the admin track (maybe ecosystem devs can take care of this? or parachain devs can code this and open an RFC?). Of course, this type of solution will take time. In the meantime, and if the Fellowship does not want to approve something like the Nodle proposal, the entire community should be voting on this to ensure a shorter time than 28 days (the 28-day period is not a hard rule: the time shortens as thresholds are met).

4 Likes

I think this is a good time to re-share this post from a couple months ago: Polkadot Summit - Ecosystem Technical Fellowship Workshop Notes - #13 by Birdo.

TL;DR from the post is to create a Parachain Technical Fellowship that has, as part of its mandate, working on these scenarios.

IMO we shouldn't lose sight of the fact that Polkadot is the Relay Chain + all Parachains; however, executing these actions also carries a great deal of controversy, and a lot of things can go wrong. A couple of ideas that could be executed here:

  1. Creating some tooling, maybe even leveraging Chopsticks, that allows dry-running a runtime upgrade. In this case, before actually pushing it on chain, there would be one last step with more 'real' conditions than what the current tooling offers.
  2. Creating a new track on OpenGov just for these situations. It would IMO need a high deposit, but probably also curves that would allow it to pass quickly with a large amount of votes.
  3. A technical fellowship that can make it go even faster through proper whitelisting of the calls.
6 Likes

This time the problem wasn't that the migration failed, but more that the migration took too long. This is also not the first time this has happened, and it will probably not be the last. As I had said around the "famous" How to recover a parachain talks, we need to "test test test and test". We already have Chopsticks from @xlc, which is a very good tool that we should leverage more. We also have try-runtime, and both are probably capable of showing you the PoV size of the migration, which would have revealed the issue here.

However, what I would like to see, instead of reacting after the shit has hit the fan, is people asking before applying. Maybe we should create some repo or discussions here in the forum for when someone wants to apply an upgrade. People could then ask others for help, other people from across the ecosystem who may already have more knowledge when it comes to runtime upgrades. Just go out and ask for help :slight_smile: I would also be very very happy if we could do this outside of Parity :stuck_out_tongue: I would also be willing to look into this from time to time. Maybe we could create this repo under the fellowship or somewhere else? We just need someone who would be willing to make this happen. We could also write down learnings from these migrations to help others in the future.

I also started to write down instructions on how to rescue a chain, instead of just writing them down in a chat. My idea is that people can use this to learn. Here are the two documents I currently have:

  1. Unfuck Moonsama - HackMD
  2. Unfuck Westend Number 123524 - HackMD
8 Likes

To be clear, try-runtime wasn't able to track these until a week or two ago, as far as I know. However, the latest patches added support for this. I personally already took the liberty of making a GitHub Action to further automate its use for both Nodle and other Parachains.

We may also be able to develop more tooling to better test chain upgrades, potentially by taking some form of fork of both the relay chain and a target Parachain. Such a system could be handy for testing more complex migrations, especially for projects that use multi-block migrations. I am not aware of any tooling supporting exactly this yet, so there should be room to implement it. An alternative could be to extend Chopsticks, which would go in the direction of @santi's suggestion.
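
To illustrate the general shape a multi-block migration could take, here is a minimal cursor-based sketch in plain Rust. The types, names, and per-block budget are hypothetical and not the actual FRAME multi-block migration APIs:

```rust
// Hypothetical sketch: a cursor-based migration that moves a bounded batch of items per
// block instead of everything in one `on_runtime_upgrade`, keeping each block's PoV small.

const ITEMS_PER_BLOCK: usize = 500; // illustrative per-block budget

/// Persisted between blocks (in real code this would live in storage).
struct MigrationCursor {
    next: usize,
    total: usize,
}

/// Process at most `ITEMS_PER_BLOCK` items; returns true once everything is migrated.
fn migrate_step(cursor: &mut MigrationCursor, mut migrate_item: impl FnMut(usize)) -> bool {
    let end = (cursor.next + ITEMS_PER_BLOCK).min(cursor.total);
    for index in cursor.next..end {
        migrate_item(index); // move one item from the old pallet's storage to the new one
    }
    cursor.next = end;
    cursor.next >= cursor.total
}
```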

Would this fall under the Ecosystem Fellowship or the Polkadot Fellowship? It sounds like the main hurdle here would be to find somebody or some entity who is willing to support and maintain such an effort.

Nodle and I can contribute some writings from this experience as well. I believe HydraDX also had a similar post, though I am unable to find the link right now. Where would be the best place to compile these writings? Could this be under the fellowship org for now? I assume it would be better there than on Parity's GitHub.

Yes, this would be essential. Every Parachain could elect one representative (which could also be a multisig or a DAO on another Parachain) to join this fellowship. Unfortunately, the efforts to set this one up stalled a little, as there was little traction in establishing the initial manifesto. Maybe we will be able to resume them later.

An alternative could be what @RTTI-5220 is proposing: a track dedicated to Parachain unbricking. Potentially, this track could be usable only if a Parachain has not produced a block for the last X units of time, to further strengthen the guarantees offered.

Either way, it sounds a little constraining to have to escalate to governance whenever an issue arises. Shouldn't Polkadot and the Parachains have a failsafe in case upgrades fail? I wonder whether it would be acceptable to modify the implementation of the GoAhead signal and the Parachain upgrade logic to allow a Parachain to auto-revert to its last known good runtime if an upgrade is failing after X units of time. Would anyone more familiar with the matter see counter-indications to this? If not, I would be willing to take a stab at it, as this sounds like it would improve Polkadot's uptime guarantees for Parachains.

It has been there since the beginning of 2022, but sadly it is a debug log. I had assumed it was an info log :see_no_evil:

Good job providing this GHA! We have also had try-runtime integrated in our CI for quite some time!

I also don't know for sure. However, I think that Chopsticks or try-runtime should be ready to support multi-block migrations :slight_smile:

Not sure if this directly needs some form of "collective". I would really hope that we can find builders who help other builders, so that we build some kind of collective knowledge base on how to do runtime upgrades. Everyone running a parachain will at some point need to do a migration, so it should be in everybody's interest to help out and get some help back at some point.

How should the relay chain verify that the Parachain is in a bad state, and how should it verify that the new state someone presents is a "known good state"? We already discussed all of these kinds of solutions in the How to recover a Parachain topic.

1 Like

I totally understand this, but at the same time, I can’t keep myself from thinking that the definition of robustness is broken here.

The downtime exists, foremost, because Polkadot is robust when it comes to its governance mechanisms and chooses conservative parameters, as opposed to spinning up a Multisig controlled by Parity, W3F, and a few other key ecosystem players, and letting this Multisig fix parachains (the equivalent of this setup is, unfortunately, not unheard of in the blockchain industry). If we are to build something equivalent, it would have to be done the proper way, similar to this RFC idea: Parachain management & recovery parachain · Issue #16 · polkadot-fellows/RFCs · GitHub.

I skimmed over the 3 articles that you shared and couldn't find much detailed info on how e.g. Ethereum or Solana go about resolving these uptime issues. But my guess is that neither resorts to robust onchain governance mechanisms.

That all being said, I am not against relaxing onchain governance so that parachain recovery can happen faster, as long as robustness is not compromised, but I wanted to point out the core reason why it is slow today.

In the short term, I think the "Test Test Test" mantra from @bkchr is actually the most realistic suggestion. None of the parachain stalls I have seen so far have been caused by some unknown bug or unknown reason. They have mostly been caused by mistakenly abusing on_runtime_upgrade or on_initialize, both of which can be well tested with, and reported on by, @xlc's Chopsticks or try-runtime (mostly maintained by @liamaharon and @piomiko these days).

For what it is worth, the documentation of both of the methods above is indeed clear about the risks to some extent:

Warning
The weight returned by this is treated as DispatchClass::Mandatory, meaning that it MUST BE EXECUTED. If this is not the case, consider using Hooks::on_idle instead.
Try to keep any arbitrary execution deterministic and within minimal time complexity. For example, do not execute any unbounded iterations.

and

Very similar to Hooks::on_initialize, any code in this block is mandatory and MUST execute. Use with care.
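
To make the "unbounded iterations" warning concrete, here is a minimal sketch with simplified, hypothetical types (not the actual FRAME hooks or Weight type) contrasting mandatory work that scales with an unbounded queue against idle work bounded by the block's leftover weight:

```rust
// Illustrative only: hypothetical stand-ins for the hooks discussed above.

/// Simplified weight type (the real one also tracks proof size).
type Weight = u64;

const WEIGHT_PER_ITEM: Weight = 1_000; // assumed cost of processing one item

/// Dangerous pattern: mandatory work that scales with the (unbounded) queue length.
/// If `queue` is large, the block exceeds its weight/PoV budget and cannot be produced.
fn on_initialize_mandatory(queue: &mut Vec<u64>) -> Weight {
    let consumed = queue.len() as Weight * WEIGHT_PER_ITEM;
    queue.clear(); // processes everything, no matter how much there is
    consumed
}

/// Safer pattern: deferrable work only consumes whatever weight is left in the block.
fn on_idle_bounded(queue: &mut Vec<u64>, remaining_weight: Weight) -> Weight {
    let budget_items = (remaining_weight / WEIGHT_PER_ITEM) as usize;
    let to_process = budget_items.min(queue.len());
    queue.drain(..to_process); // leftover items wait for the next block
    to_process as Weight * WEIGHT_PER_ITEM
}
```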

Possibly better education could also help here, and we are working on that, but it is hard for me to pinpoint one location where we can document something like "YOU SHALL NOT GO OVERWEIGHT IN A PARACHAIN" and be sure that everyone sees it. I recall a parachain upgrade checklist somewhere, but I can't find it now.

2 Likes

I think the MR from @joepetrowski to introduce System Collectives would be a step in this direction.
Such a new System Collective could then be added for the parachain teams and given privileges to reset the runtime of another parachain in case of bricking.

1 Like

I agree also that some “Parachain Technical Fellowship” would be good for this role (as well as others).

Parachains would probably want their local governance track that can lock their parachain to be shorter than the collective's track for re-setting code/head data, since the same power that can unbrick a parachain can also brick one. Then, if the para is "alive" and sees a referendum come up regarding itself, it can lock itself to prevent the substitution.

1 Like

To be clear, I do not think relaxing on-chain governance should be the priority nor the best long term choice. Governance or otherwise privileged actions should be used only when absolutely necessary.

Nonetheless, it is important to note that the ecosystem we are all building here remains young and sometimes unstable. More teams will make mistakes as they onboard to Polkadot, and more issues or challenges will be discovered.

It is also worth highlighting that, in the case of Nodle, testing was done with the tools previously mentioned and according to a precise, month-long process. This process and the tools used were not enough to catch the issue and prevent a month-long downtime, as covered in prior posts. Could this issue have been avoided with better tooling or more precise testing? Maybe. However, we cannot reasonably expect every single Parachain team (some of them much younger and less experienced than Nodle) to understand every single edge case that may or may not arise.

As such, the core of my suggestion wasn't to relax governance on the relay chain (albeit it could be a welcome change in the future…), but to adjust how upgrades are applied to ensure more fault tolerance. If an upgrade fails on your laptop, it won't get bricked; the system will most likely revert to a prior version. If you deploy a new container to your k8s cluster or favorite cloud function provider, a failure would cause it to stick to the prior known good version. My suggestion was to allow for similar behavior within Polkadot and its Parachains. If an upgrade fails, shouldn't we revert to the prior known good runtime instead of stalling and requiring a governance-level action?

4 Likes

This is not a matter of robustness. There is a design limitation that arises when a parachain halts. While a body of governance has all the rights to do upgrades or any other root-privileged action on that parachain, the same body of governance suddenly loses all its permissions if the parachain is not producing blocks. In some other layer 1 or layer 2 situations, the same body of governance can resort to a fork through internal coordination, and that takes hours, not 28 days. Polkadot itself has substituted code before and didn't need to be down for 28 days.

1 Like

The relay chain doesn't know whether an upgrade is failing because the runtime upgrade was broken, or because all collators colluded and stopped producing blocks because they didn't want the runtime upgrade to be applied. This would just weaken the security of your chain.

This was also already discussed last year. There could be a governance chain that provides the ability to have your governance run by this chain. This then also means that if your chain goes down, governance still works. All of that is already possible.