Stabilizing Polkadot

Polkadot is shipping updates at an insane pace. We have assumed that this is the best way to add value: more features + optimizations = more value. But is this still true?

What makes Polkadot valuable?

The fact that parachain teams are willing to build their business here instead of elsewhere. And the prospect of attracting more such teams in the future.

It is therefore paramount to retain them and attract new parachain teams. But how do we do this? I would say by offering value to them. This may sound simple enough, but it implies that we let our decisions be guided by their business needs.
Ultimately, the value that we offer greatly depends on how we offer it.

What is our value proposition?

Polkadot sells access to a resource. This resource has been called “core time” or “block space”. Polkadot’s value proposition is to offer this resource to parachains.

While doing so, it has levers to tweak this resource in various dimensions.
What we are currently offering looks a bit like this:

The feedback that Parity received from parachain builders indicates that this is not aligned with their needs. Our current offer introduces breaking changes to their production code quite often, either for optimizations or new features.

But breaking changes are not free (literally!). Every time there is a breaking change, parachain teams evaluate the following factors:

    1. Opportunity cost of missing out on the new feature.
    2. Developer cost of integrating the change.
    3. Risk of downtime due to bugs in the code.

As it turns out, the developer cost and risk of downtime almost always outweigh the opportunity cost of missing the new feature. This is a purely business-side observation, but parachain teams are businesses and they think economically.

New features are nice, but if the profit that they generate is not worth the risk, then what is their worth? This brings me to think that what we should offer looks more like this:

Valuing stability over development velocity would make it much easier for teams to build on Polkadot. They could rely on Polkadot without regularly getting rug-pulled by new development decisions.

What to pick here is a difficult decision to make. But it is one that the Polkadot Community needs to make eventually. Other ecosystems that cater more to the needs of their builders are more than happy to welcome our outflow of talent.

Polkadot as Product

Polkadot development has been mostly about the Tech. But to stay relevant, we need to shift towards thinking more like a business:

  • What do parachain teams need?
  • How can we help them to stay profitable?
  • How can we ensure that they stay here?

We have to let the answers to these questions deeply affect our development decisions.

Moving forward

This is not just a task for Parity, but for all builders in the Ecosystem.
Nonetheless, Parity has set out three goals to reduce the maintenance and integration cost for parachain teams: the parachain-omni-node, a new release process, and runtime integration testing.

This is possibly not enough. If you are a parachain builder, please comment with more ways to make building on Polkadot easier for you.

Parachain Omni Node

The Polkadot-SDK has a huge surface area. We publish about 380 crates that are considered part of the developer-facing interface. This makes it very difficult for parachain teams to ensure that they are using the right versions and integrating changes correctly.

This is all done in the name of customizability. But only a fraction of parachain teams actually want this level of customizability. For most of them, having a node that “just works” is fine.

The parachain-omni-node is intended to be a universal node that can be used to collate for any parachain (as long as it does not rely on custom node-side changes).
This massively reduces our developer-facing surface area and abstracts away much of the complexity.

:point_right: What do you (parachain team) think about this?

New Release Process

We currently release three things: Polkadot-SDK crates, Relay+Para Nodes, and the Runtimes. All are relevant for parachain builders. To introduce more stability, Parity is proposing a 3-month LTS release cycle for the SDK and the Node software. This would mean that breaking changes only occur every three months (see RELEASE.md). There would still be minor version bumps for bug fixes within this period.
As the Runtimes are handled by The Fellowship, they may or may not follow suit.

To make this not just an internal project checklist but a success, we need to check with parachain teams if this aligns with what they need.

:point_right: What do you (parachain team) think about this?

Runtime Integration Testing

Distributed systems are tricky. There is no way around integration testing. Parity is currently in the process of hiring for exactly this purpose.
The idea is to have an end-to-end parachain integration test (similar to this) that we can run before releasing a runtime change.
Hopefully, this will catch more XCM bugs before they hit production networks.

40 Likes

Finding that sweet spot between rolling out exciting new features and keeping everything running smoothly is key for keeping Polkadot on the growth track for sure. The plans you’ve laid out definitely hit major pain points I have experienced as a parachain developer since 2019.

Regarding the Parachain Omni Node, I see it as a particularly promising development for simplifying the integration process and reducing the overhead for developers. However, I’m curious about how it will accommodate the addition of custom RPCs, especially since our parachain, along with several others, relies on the EVM RPC for essential functionalities. Will the Omni Node design easily support the integration of such custom RPCs? This feature is critical for us and, I believe, for others in the ecosystem who leverage EVM compatibility.

The new release process with a 3-month LTS cycle is another move I support wholeheartedly. It strikes a good balance between allowing for innovation and maintaining a stable environment for development, reducing the risk and cost associated with frequent breaking changes.

Lastly, the initiative for runtime integration testing is indispensable. It’s a proactive approach to ensure the reliability and smooth operation of the ecosystem, minimizing disruptions and potential bugs in production networks.

6 Likes

We needed the Omni Node like two years ago, but better late than never.

For teams that need an EVM RPC, I would like to highlight that for Acala EVM+, we implement the RPC on a separate Node.js server, and that has many advantages over Frontier. I believe this is the right direction for any custom RPC needs, rather than a custom-built node binary.

With Omni Node, LTS is much less a concern as there won’t be any maintenance burden of the node binary.

2 Likes

It is really great to see that Parity is reacting to the market demand here!

One thing that I am missing in this discussion is the mention of Kusama. The original “promise” was that Kusama is for fast iteration and Polkadot is the stable thing. What we have essentially seen is that both Polkadot and Kusama are iterating at a high pace in lockstep, with Kusama acting as the first recipient of releases.

There is still high demand from the market for what Polkadot should be able to deliver, and there is always the question of why any feature should be withheld from Polkadot once it is done. But this still seems to me like a striking mismatch between the original idea and what is happening. I am not sure what the consequence is, but I wonder if we could stick more closely to the original idea.

2 Likes

Parachain Omni Node

Will the Omni Node support EVM-related needs like

  • EVM block import queue
  • EVM database

For teams that need an EVM RPC, I would like to highlight that for Acala EVM+, we implement the RPC on a separate Node.js server, and that has many advantages over Frontier. I believe this is the right direction for any custom RPC needs, rather than a custom-built node binary.

@xlc can you link this here? Does this mean you are exposing a single RPC endpoint for the whole network w.r.t. EVM queries?

Runtime Integration Testing

For me this is probably the most important section to ensure we are not breaking anything. My main concern is that new tests will not be consistently added. The engineers writing the changes use Rust, while this is JavaScript + dedicated YAML config files. From my experience, having a) a separate codebase and b) a different language for these kinds of tests leads to just not writing new tests.

I personally would much prefer to have these tests in the same codebase as my runtime (see here). Ideally, we should have something like forge that allows running these tests against different states of live networks. Chopsticks would be great for that, but then again the language barrier makes writing tests a pain IMO.

1 Like

bodhi.js/packages/eth-rpc-adapter at master · AcalaNetwork/bodhi.js · GitHub is the ETH RPC server for Acala. It is just a Node.js server that talks to our Substrate RPC node and a SubQuery instance for historical data.

There are multiple benefits to such a model, including Chopsticks support, light-client friendliness, and openness to future alternative clients.
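
To make the idea concrete, here is a minimal sketch of such an adapter (purely illustrative, not the actual bodhi.js implementation): a small JSON-RPC server that answers a couple of eth_* methods by querying an upstream Substrate node via @polkadot/api. The endpoints, chain id, and method mapping are placeholders.

```typescript
// Minimal sketch of the "separate ETH RPC adapter" pattern described above.
// NOT the real bodhi.js code: endpoints, chain id and method mapping are
// illustrative placeholders only.
import { createServer } from "node:http";
import { ApiPromise, WsProvider } from "@polkadot/api";

const SUBSTRATE_WS = process.env.SUBSTRATE_WS ?? "ws://127.0.0.1:9944";

async function main(): Promise<void> {
  const api = await ApiPromise.create({ provider: new WsProvider(SUBSTRATE_WS) });

  const server = createServer((req, res) => {
    let body = "";
    req.on("data", (chunk) => (body += chunk));
    req.on("end", async () => {
      const { id, method } = JSON.parse(body);
      let result: unknown = null;

      if (method === "eth_blockNumber") {
        // Map the ETH query onto the Substrate chain head.
        const header = await api.rpc.chain.getHeader();
        result = "0x" + header.number.toNumber().toString(16);
      } else if (method === "eth_chainId") {
        result = "0x313"; // placeholder; a real adapter derives this from chain state
      }
      // A real adapter covers the full eth_* surface and uses an indexer
      // (e.g. SubQuery) for historical queries.

      res.setHeader("content-type", "application/json");
      res.end(JSON.stringify({ jsonrpc: "2.0", id, result }));
    });
  });

  server.listen(8545, () => console.log("ETH RPC adapter listening on :8545"));
}

main().catch(console.error);
```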

1 Like

That doesn’t make sense. Chopsticks mocks a Substrate RPC. You can write tests in whatever language you want.

1 Like

Should be possible; in the end this is just a small difference.

I don’t think that writing these tests in Rust is such a smart idea. We have tons of tooling in TypeScript/JS for interacting with the chains. Not having a strictly typed language like Rust also makes it much easier to write this stuff. For sure, I would not want to see YAML there; using YAML for these tests is not a great idea IMO. However, if there is a good set of basic tests in JS, you will just need to copy them and modify them as you need. Most of these tests will be about sending XCM and then checking the state, i.e. most tests will look the same. It is also much easier to find people who can give you some tips for writing these tests. (I’m writing this while not being a JS dev myself and would also need to use Stack Overflow :P)
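
To illustrate what such a basic test could look like, here is a rough sketch in TS. It assumes Chopsticks forks of the relay chain and one parachain on ws://localhost:8000/8001; the parachain id, call arguments and XCM version are placeholders and depend on the actual runtimes.

```typescript
// Rough sketch of a typical "send XCM, then check state" test.
// Endpoints, parachain id and call arguments are illustrative assumptions.
import { ApiPromise, WsProvider } from "@polkadot/api";
import { Keyring } from "@polkadot/keyring";
import { cryptoWaitReady } from "@polkadot/util-crypto";

async function reserveTransferTest(): Promise<void> {
  await cryptoWaitReady();
  const relayProvider = new WsProvider("ws://localhost:8000"); // Chopsticks relay chain fork
  const paraProvider = new WsProvider("ws://localhost:8001"); // Chopsticks parachain fork
  const relay = await ApiPromise.create({ provider: relayProvider });
  const para = await ApiPromise.create({ provider: paraProvider });
  const alice = new Keyring({ type: "sr25519" }).addFromUri("//Alice");

  // 1. Send the XCM from the relay chain (exact shapes depend on the runtime's XCM version).
  const dest = { V3: { parents: 0, interior: { X1: { Parachain: 2000 } } } };
  const beneficiary = {
    V3: { parents: 0, interior: { X1: { AccountId32: { network: null, id: alice.addressRaw } } } },
  };
  const assets = {
    V3: [{ id: { Concrete: { parents: 0, interior: "Here" } }, fun: { Fungible: 1_000_000_000_000n } }],
  };
  await relay.tx.xcmPallet
    .limitedReserveTransferAssets(dest, beneficiary, assets, 0, "Unlimited")
    .signAndSend(alice);

  // 2. Ask Chopsticks to build the next blocks so the message gets processed.
  await relayProvider.send("dev_newBlock", [{ count: 1 }]);
  await paraProvider.send("dev_newBlock", [{ count: 2 }]);

  // 3. Check the resulting state on the parachain.
  const account = await para.query.system.account(alice.address);
  console.log("balance on parachain after transfer:", account.toHuman());
}

reserveTransferTest().catch(console.error);
```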

3 Likes

For Kusama, opinions range from “expect chaos” to “I am doing business there”.
One day people want feature velocity and push for more frequent releases and then complain when it crashes.

That’s a very common pattern. I think we need a few things to get out of it:

  1. We should agree on what “good” means. That’s what SLOs are good for.
    Let’s agree on a number (in practice, several numbers) for what good means. We could say:

KSM will produce blocks 99.5% of the time measured on a monthly window.

or whatever the number is; there is no perfect answer here: the higher the number, the more stable you are and the slower you are in terms of pushing features. The community needs to pick a number.

Note that 99.5% means about 44h of downtime per year, 99.9% means about 9h per year, etc. (see the quick calculation after this list).

The Fellowship should be responsible for the SLO. If we are down, then no new release; if we are up, they can release. It could even be automated in the release system. Since the SLO is built on a time window, every day your SLO goes up and you recover from your previous issue.

  2. We need better testing before the release. We (Parity) are working on it (see above), but we also need tests from the parachains to define what “working” means for a parachain, because the Fellowship can only guess.

  3. We need a rollout strategy, and it could be different for the runtime and the node. How long should we support an LTS? Also, if the dates of the releases are predictable (à la Ubuntu), that is a good way to get organised.

We could also incentivize nodes to be on recent releases (note the plural). We do not want everyone to be on the same version, but if we had most nodes nicely spread over the last 3 stable versions, that would provide resilience while keeping version diversity bounded, which is good for debugging.

  4. Collectively, we should be more principled about when to break an API, and we should have a strong case to do it.
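
As a quick sanity check on the numbers in point 1: the yearly downtime budget for a given uptime target is simply (1 - target) × 8760 hours. A tiny illustrative calculation:

```typescript
// Downtime budget per year for a few SLO targets (pure arithmetic).
const hoursPerYear = 365 * 24; // 8760
for (const target of [0.995, 0.999, 0.9999]) {
  const budgetHours = (1 - target) * hoursPerYear;
  console.log(`${(target * 100).toFixed(2)}% uptime -> ${budgetHours.toFixed(1)}h/year allowed downtime`);
}
// 99.50% -> 43.8h, 99.90% -> 8.8h, 99.99% -> 0.9h per year
```
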
10 Likes

That’s a good idea!

44h per year sounds reasonable. 9h would imply that we always have on-call devs and that it is reasonable to expect to recover from a chain halt within 9h, which I don’t think is always realistic. And then blocking runtime releases for another year wouldn’t be that good either.

Picking the measurement window size would be crucial here. Maybe a 2-3 month window would be good, since it would provide sufficient time to look into the systemic root causes of a chain halt and develop countermeasures.

But what is also very relevant is that the discussion is not only about the downtime of the chain but also the downtime of XCM/interop. Interop is one of the core promises of Polkadot and we had a small number of XCM issues in recent months that created follow-up costs for a lot of teams.

So having a defined service level for interop should be considered in the context of this discussion.

3 months is a good start, but I think it is still hectic for many of the parachain teams with more limited resources (who sometimes only manage to release new runtime upgrades every 3 months themselves). I think 6 months would be considerably better and would signal stability.

Still, we should discuss what to expect from an LTS. If only critical bug fixes are expected to be backported, we can backport quite far. The more stuff we want to have backported, the shorter the period gets.

Also, off the top of my head, the only thing that really required an upgrade in the last 2 years was the introduction of XCM V3. This for sure should have been prevented. Otherwise there were no changes (please correct me if I’m wrong) that required any updating from the parachains. The main driver in the early days was the introduction of new host functions, which has almost stopped in the last years. But for that we built the solution for connecting the parachain nodes via RPC to a relay chain node. This way the parachain nodes can continue operating without having to support a new host function that the relay chain requires.

1 Like

Long term, we can achieve velocity without sacrificing stability by finding ways to break up core functionality of Polkadot into smaller services managed by distributed teams. Additionally, this can be accelerated by researching, in tandem, how the protocol can be run in a more distributed way, e.g. instead of the parachain node doing everything as a monolith, it could be broken up into small services that are spun up as needed.

I would like to see a future where the Polkadot SDK consists of hundreds of micro-services all working together to run the protocol in an efficient manner.

Given how many integration tests we are seeing being written in Polkadot at the moment, I would say this makes a lot of sense. The language barrier does not refer to the technical difficulties of using Chopsticks or whatever to test the codebase; it is rather about writing the tests in the first place.

Anyway, happy to see your JS tests and to copy-paste them when they are ready.

1 Like

That’s what confused me. You can absolutely write integration tests using subxt.

But yeah, it is less practical to write integration tests in Rust compared to TS because:

  • We also want to test the JS SDK because that’s what the dApps use
  • It is easier to find TS devs than Rust devs
  • It is easier to write tests in TS than Rust
  • It is easier to use a scripting language to handle a moving target (i.e. the runtime), as there will be changes to properties or methods between runtime versions

Here are our e2e tests: GitHub - AcalaNetwork/e2e-tests
There are still a lot of improvements that can be made to the repo structure, test coverage, etc.

5 Likes

Thanks for the comments!
To sum it up, the most important topics seem to be:

  • Omni Node
  • EVM
  • Testing
  • LTS

@kianenigma is currently working on the Omni Node. AFAIK it will be extensible with a builder pattern, so it should be possible to do something like .with_custom_rpc(..) to add EVM RPC support.

The interest in EVM compatibility is a bit surprising to me; I think we currently don’t put much focus on it. But thanks for the feedback.

About the SLO:
Yeah, it’s a good idea to define some. This would also help us reflect on what worked out well and where we urgently need to improve.
I think the way to do this in a binding manner would be to create a Wish-For-Change track proposal and lay out the specification. That would basically mandate The Fellowship to do it.
Personally, I have no experience with SLOs. @bkchr do you have an idea how to concretely move forward with this?

1 Like

Why not do what @xlc proposed? I mean yes, it should be possible to add some custom RPC quite easily. However, the idea behind the omni node is that people don’t need to touch the node at all and that it just gets released by Parity as well.

Just for the record, we are putting focus on EVM compatibility :wink:

Generally an interesting topic, but not 100% sure how we can map this onto Polkadot. I think we should triple down on testing. Put up more “barriers” for these nasty bugs etc.

Why I find putting up an SLO quite hard is that, while we would probably identify the problem quite fast, if we need a runtime upgrade or similar, this is out of our hands. We cannot predict how long governance will take. I mean, I would assume that it acts quite fast if something is broken, but that is nothing we can plan for or should “sign”.

Nevertheless, now with the Fellowship secretary and with organizing/writing things down a little bit better, we could come up with some paragraphs around this topic.

1 Like

Okay, if this is enough for the most urgent cases, then let’s use it.

Yes it is more difficult to do for a decentralized service than a centrally managed one.

I think we should at least have something like https://www.githubstatus.com/ for Polkadot, where we record incidents and report on them in real time. It helps to build trust with companies that are evaluating Polkadot as a possible solution.
IIUC, SLOs are used by companies to evaluate service providers in order to manage their own risk of downtime. If we cannot really provide this now, we could at least show a good track record with a tracker like the one above.

1 Like

Would like to see this for ETH/BTC :see_no_evil:

Anyone can build this; I am not sure the Fellowship should do it. It also brings over a bit of a “controlled by” vibe. For centralized services you need this because you cannot look into them, but for a decentralized one basically anyone can do this.

1 Like