This post follows up on the recent Nodle Parachain upgrade, which halted block production. A governance proposal, which required locking 100,000 DOT, has been opened to resume block production. Given the 28-day decision period, this will lead to an estimated total Parachain downtime of about 31 days.
The support from the community has been incredible, with numerous Polkadot and Parachain builders stepping up by offering help, advice, and support on the proposal itself.
We want to use this opportunity to help the ecosystem identify areas of improvement within Polkadot itself, to further strengthen the ecosystem’s position within the Web3 industry.
From a Parachain builder’s point of view, the reason for choosing Polkadot as a platform is the set of unique “services” it provides. We can think of Polkadot as a truly decentralized service provider for:
- Shared Security: Polkadot’s validators ensure a Parachain’s security by validating its blocks as they are proposed by collators.
- Interoperability: Parachains can use XCM to communicate with other Parachains and build composable applications.
- Uptime: if Parachains produce valid blocks, they are finalized and included within the network. Without uptime, the other two value propositions are essentially void. In fact, without uptime, operating as a Parachain carries significant negative value.
Parachains pay for these services by either bidding on their slots or allocating some of their own tokens to have the community support them via a crowdloan. This is an investment equivalent to anywhere from $300,000 to millions of dollars, depending on market conditions. In addition to the economic cost, significant time is spent fundraising, bidding on, and acquiring a slot. For a startup project, time is its most valuable resource, and having to acquire or maintain a Parachain slot can be a distraction that slows down the search for product-market fit or the financing of the project.
In addition, in the case of Nodle, where most applications serve enterprise purposes, Parachain downtime has a real-world economic impact on businesses. It also risks damaging the ecosystem's reputation, since it raises questions about Polkadot's robustness. With a Parachain halted, mission-critical applications are stopped, and real-world customers (many of whom are using Web3 for the first time) are impacted.
The Nodle team takes responsibility for its upgrades, but the ecosystem as a whole is operating at the cutting edge, with many upgrade parameters impossible to verify in a testnet environment. To prevent uptime reliability issues in the future, we propose a handful of improvements. We believe these will not only help other projects, but will also be essential for the Polkadot ecosystem to be taken seriously and considered more often for enterprise applications.
We believe that for the Parachain ecosystem to be sustainable for enterprise use cases, Polkadot should maintain its uptime, interoperability, and shared security services even in the case of a problematic upgrade.
This could, for instance, be achieved by dropping a failed upgrade and reverting to the previous version if a deadline to produce a new block is not met. This is of course not the case today and would require adapting the current implementation. If others agree, we intend to research further how to implement it and contribute to the Polkadot core code.
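To make the idea concrete, here is a minimal sketch of the revert rule described above. It is only a toy model under stated assumptions, not how Polkadot or Cumulus work today: the `ParachainState` type, the function names, and the `deadline_blocks` parameter are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParachainState:
    """Hypothetical view of a parachain as tracked by the relay chain."""
    runtime: str                            # runtime version currently in force
    previous_runtime: Optional[str] = None  # kept only while an upgrade is unconfirmed
    upgrade_block: Optional[int] = None     # relay block at which the upgrade was applied


def apply_upgrade(state: ParachainState, new_runtime: str, relay_block: int) -> None:
    """Apply an upgrade but remember the old runtime until a block is produced."""
    state.previous_runtime = state.runtime
    state.runtime = new_runtime
    state.upgrade_block = relay_block


def note_block_produced(state: ParachainState) -> None:
    """A valid parachain block under the new runtime confirms the upgrade."""
    state.previous_runtime = None
    state.upgrade_block = None


def revert_if_stalled(state: ParachainState, relay_block: int, deadline_blocks: int) -> bool:
    """Drop an unconfirmed upgrade once the (assumed) deadline has passed.

    Returns True if a revert happened.
    """
    if state.upgrade_block is None:
        return False  # no pending upgrade, nothing to revert
    if relay_block - state.upgrade_block <= deadline_blocks:
        return False  # still within the grace period
    state.runtime = state.previous_runtime
    state.previous_runtime = None
    state.upgrade_block = None
    return True


# Usage: an upgrade at relay block 100 that never produces a block
# is reverted once more than 100 relay blocks have elapsed.
chain = ParachainState(runtime="v1")
apply_upgrade(chain, "v2", relay_block=100)
revert_if_stalled(chain, relay_block=150, deadline_blocks=100)  # no revert yet
revert_if_stalled(chain, relay_block=201, deadline_blocks=100)  # reverts to "v1"
```

A real implementation would also have to deal with storage migrations that partially ran, which this sketch deliberately ignores.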
Polkadot’s uptime as a relay chain is excellent, without any major recorded incidents throughout its existence.
Looking at other blockchain platforms, downtimes are usually resolved within hours or days (1, 2, 3). Any downtime is considered antithetical to the purpose of a blockchain, and even a few minutes of downtime results in serious reputational impact. Looking at Web2 services, downtimes are resolved within a matter of hours, and many businesses contractually guarantee uptimes of 99.999% per year. Yet on Polkadot, resolving a failed upgrade would take over 28 days. This means that the maximum possible uptime over a year in which such a problem is uncovered would be approximately 92%.
In the Web2 space, one ITIC survey found that the cost of a single hour of server downtime totals $300,000 or more for 91% of the corporations interviewed. For a company like Amazon, based on its past 12 months of net sales, 28 days of downtime would represent a loss of over $41 billion. For Polkadot to even be considered for future enterprise applications, it needs to reach five-nines uptime.
Nodle is committed to providing better tools for testing parachain upgrades and making them available to the community. In the case of the current upgrade, Nodle attempted to migrate over 47,000 NFTs to a different pallet. While this worked perfectly on testnet, few parachains have pushed Substrate to its limits like Nodle. We propose building better testing tools to help simulate migrations closer to production conditions. For instance, try-runtime failed to detect the high PoV size and the time needed to produce a block.
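The uptime figures above can be checked with simple arithmetic. The short sketch below assumes a 365-day year and treats the outage as the only downtime in that year; the function names are ours and purely illustrative.

```python
DAYS_PER_YEAR = 365  # assumption: non-leap year


def max_uptime(downtime_days: float, period_days: float = DAYS_PER_YEAR) -> float:
    """Best achievable uptime fraction given a single outage of `downtime_days`."""
    return 1 - downtime_days / period_days


def downtime_budget_minutes(uptime: float, period_days: float = DAYS_PER_YEAR) -> float:
    """Minutes of downtime allowed per period for a target uptime fraction."""
    return (1 - uptime) * period_days * 24 * 60


print(f"28-day outage: {max_uptime(28):.1%} uptime")   # about 92%
print(f"31-day outage: {max_uptime(31):.1%} uptime")   # slightly lower
print(f"Five nines allows {downtime_budget_minutes(0.99999):.2f} minutes of downtime per year")
```

In other words, a 28-day decision period alone uses up thousands of times the yearly downtime budget of a five-nines service, which allows only a few minutes per year.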
The Polkadot community has already been extremely active and provided updates to testing tools since we highlighted this issue. We will investigate whether the Pull Request opened on try-runtime on August 23rd is sufficient or whether it needs more improvements which we could contribute.
Fortunately, Polkadot includes systems within OpenGov that can shorten a 28-day revert and resolve uptime issues much faster. Unfortunately, these features are not directly accessible to Parachain teams; they are restricted to the Polkadot Fellowship and therefore available to only a handful of people.
Therefore, considering all the reasons mentioned above, we would like to ask the community and the Fellowship: does Polkadot have a way to restore its core services for a Parachain as soon as possible? And if it does not, what is preventing Polkadot from doing so?
With the goal of enabling the ecosystem to reach an enterprise-grade level of service, we would like to hear from the community and the Fellowship.