Stabilizing Polkadot

For Kusama, opinions range from “expect chaos” to “I am doing business there”.
One day people want feature velocity and push for more frequent releases, and the next they complain when something crashes.

That’s a very common pattern. I think we need a few things to get out of it:

  1. We should agree on what “good” means. That’s what SLOs are good at.
    Let’s agree on a number (in practice, several numbers) that defines “good”. We could say:

KSM will produce blocks 99.5% of the time, measured over a monthly window.

or whatever the number is. There is no perfect answer here: the higher the number, the more stable you are and the slower you are at pushing features. The community needs to pick a number.

Note that 99.5% means about 44h of downtime per year, 99.9% means about 9h per year, etc.
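Those downtime figures follow directly from the availability target. A minimal sketch of the arithmetic, assuming a 365-day year and a 30-day month (the function and constant names are made up for illustration):

```python
HOURS_PER_YEAR = 365 * 24   # 8760, ignoring leap years
HOURS_PER_MONTH = 30 * 24   # 720, assuming a 30-day monthly window


def allowed_downtime_hours(availability_pct: float, window_hours: float) -> float:
    """Hours of downtime an availability target permits over a window."""
    return (1 - availability_pct / 100) * window_hours


for target in (99.5, 99.9):
    per_year = allowed_downtime_hours(target, HOURS_PER_YEAR)
    per_month = allowed_downtime_hours(target, HOURS_PER_MONTH)
    print(f"{target}%: {per_year:.2f}h per year, {per_month:.2f}h per month")
```

On a monthly window, 99.5% leaves roughly 3.6h of downtime per month, which is the budget that actually matters if the SLO is measured monthly.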

The fellowship should be responsible for the SLO. If we are below the target, then no new release. If we are above it, they can release. It could even be automated in the release system. Since the SLO is built on a time window, every good day your measured availability goes back up and you recover from your previous issue.
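To make the "automated in the release system" idea concrete, here is a rough sketch of what such a gate could look like. Everything in it is an assumption for illustration: it measures "producing blocks" as produced blocks over expected block slots, assumes 6-second slots and a 30-day window, and the names `SloWindow` and `release_allowed` are hypothetical, not part of any existing release tooling.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Block production over one measurement window (hypothetical model)."""
    expected_blocks: int  # slots in which a block should have been produced
    produced_blocks: int  # blocks actually produced

    @property
    def availability_pct(self) -> float:
        if self.expected_blocks == 0:
            return 100.0  # nothing expected, nothing missed
        return 100.0 * self.produced_blocks / self.expected_blocks


def release_allowed(window: SloWindow, target_pct: float = 99.5) -> bool:
    """Allow a release only while the measured availability meets the SLO."""
    return window.availability_pct >= target_pct


# Assumption: 6-second slots over a 30-day window.
SLOTS_PER_MONTH = 30 * 24 * 3600 // 6  # 432_000

healthy = SloWindow(SLOTS_PER_MONTH, SLOTS_PER_MONTH - 1_500)
degraded = SloWindow(SLOTS_PER_MONTH, SLOTS_PER_MONTH - 3_000)
print(release_allowed(healthy), release_allowed(degraded))
```

Because the window slides, the missed slots from an incident eventually fall out of it and the gate reopens without anyone having to decide when the chain is "stable enough" again.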

  2. We need better testing before the release. We (Parity) are working on it (see above), but we also need tests from parachains to define what “working” means for each parachain, because the fellowship can only guess.

  3. We need a rollout strategy, and it could be different for the runtime and the node. How long should we support an LTS? Also, if release dates are predictable (à la Ubuntu), that’s a good way to get organised.

We could also incentivize nodes to be on recent releases (note the plural). We do not want everyone to be on the same version, but if most nodes were nicely spread over the last 3 stable versions, that would provide resilience while limiting diversity, which is good for debugging.
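As a sketch of how that spread could be tracked, the snippet below computes the share of nodes running one of the last 3 stable releases. The version labels and node counts are invented for illustration, and `share_on_recent` is a hypothetical helper, not an existing tool.

```python
from collections import Counter


def share_on_recent(node_versions, stable_releases, keep=3):
    """Return (share of nodes on the newest `keep` releases, per-release counts).

    `stable_releases` is assumed to be ordered oldest to newest.
    """
    recent = stable_releases[-keep:]
    counts = Counter(node_versions)
    on_recent = sum(counts[v] for v in recent)
    total = len(node_versions)
    share = on_recent / total if total else 0.0
    return share, {v: counts[v] for v in recent}


# Hypothetical version labels and node counts, for illustration only.
releases = ["v1", "v2", "v3", "v4"]
nodes = ["v4"] * 40 + ["v3"] * 35 + ["v2"] * 20 + ["v1"] * 5

share, spread = share_on_recent(nodes, releases)
print(f"{share:.0%} of nodes are on the last 3 stable releases: {spread}")
```

The per-release counts matter as much as the total share: a healthy picture is one where no single version holds nearly all the nodes.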

  4. Collectively, we should be more principled about when to break an API, and we should have a strong case for doing it.