Infrastructure UX

Starting a thread on what I would broadly call Infrastructure UX.

For context: the topic of community-run bootnodes came up, and it was suggested that people running them should have them listen via WebSockets with a TLS cert, as this will help facilitate things like Smoldot / light clients. This other forum post has some more details about that in particular.

Other things mentioned were warp sync, checkpointing, having people specify checkpoints, what the security model of each of these is, and minimizing trust.

I had a few random thoughts on this from the perspective of people running infrastructure:

  • Node operators will value reliability, accessibility, ease of use, and minimizing time spent on things over trust minimization. End users and the public are generally the people who care more about trust minimization
  • Trust minimization and the above are not mutually exclusive, but we would need to actively care about the UX and functionality of the options there are for running things (i.e., is it easy to make a db snapshot that is safe, is there a way to avoid having to trust that other people or nodes are not malicious, what can be baked into the client to eliminate the need for people to do things like mess around with folders, etc.)
  • Node operators (validators, collators, rpc providers, etc) need to have a playbook for situations where they spin up new things, need to get something running in a critical time period with minimal overhead (bugs, things go down, upgrades, etc), or need to replace current infra.
  • Everyone wants as many different ways to mitigate downtime as possible
  • Node operators generally run infrastructure for multiple ecosystems (outside of the polkadot ecosystem)
  • Some of the Infra UX that people expect is tooling that helps facilitate high availability
  • There needs to be a good solution for people to get things up quickly and easily, that also leaves little room for error or messing things up
  • Given some of the time periods in Kusama, that means getting something like a validator synced to the tip of the chain and signing things (with possibly new or the same keys) in under 1 hour; ideally something should be able to get running in under 10 minutes. If the only solution, given the constraints of a fixed set of validators with BABE/GRANDPA, is using the same keys, there need to be good ways to prevent double signing and people shooting themselves in the foot. Things like remote signing with double-signing protection exist and work well in other ecosystems
  • One of the most common things people are used to is restoring from a db snapshot
    • For historical reasons, 1000-block pruning became the most common setting for making db snapshots and hosting them
    • Generally when people make db backups they leave out the folders for keys and parachains (though it seems that not having the parachains folder may cause issues in some circumstances)
    • Unless there’s a way to better accommodate archive nodes, people will still need a decent db hosting/backup solution to get an (archive) rpc node up and running quickly
  • Using something like Warp Sync is probably a fine solution for people to get Validators / Full nodes (with pruning) up and running quickly, so long as it’s very reliable and consistent
    • it often hasn’t been in the past, which is why a lot of people are reluctant to use it
    • making db snapshots, upgrading to new versions, and general compatibility all need to work in a consistent and non-breaking manner with warp sync
  • Most people still use RocksDB as the main db for running nodes. The switch to ParityDB as the default db should be made soon, and any blockers or features needed before switching should be outlined. This will somewhat change which kinds of db snapshots people prioritize hosting, as many still mainly use RocksDB.
  • In terms of Infra UX, specifying something like the checkpoint hash for warp sync is fine (other ecosystems do something similar where they specify the block hash and possible node endpoints to quick-sync), although it would help a lot if we could specify cli arguments as a YAML/TOML/JSON file. (This issue about it has been open for a super long time)
  • In terms of ‘security’, infrastructure providers will generally not favor that if there are other options for accessibility, ease of use, etc. If we want to help prioritize security we will need to offer baked in solutions that favor security and ease of use/access. Relying on people to do things ‘outside the client’ and software we offer will usually not favor security when things like critical infrastructure operations rely on having things consistently up and putting new things up quickly and easily.
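
The double-signing protection mentioned in the list above can be illustrated with a small high-water-mark guard. This is a minimal sketch under loud assumptions: `SigningGuard`, `DoubleSignError`, and the idea of tracking one integer slot per key are all hypothetical simplifications, and they do not describe how any existing remote signer or the BABE/GRANDPA equivocation rules actually work. It only shows the core idea: persist the highest slot signed before signing, and refuse anything at or below it.

```python
import json
import os
import tempfile


class DoubleSignError(Exception):
    """Raised when a signing request could conflict with an earlier one."""


class SigningGuard:
    """Hypothetical guard: remember the highest slot signed per key."""

    def __init__(self, path):
        self.path = path
        self.high_water = {}
        if os.path.exists(path):
            with open(path) as f:
                self.high_water = json.load(f)

    def _persist(self):
        # Write to a temp file, then rename, so a crash cannot leave a
        # half-written record behind.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(self.high_water, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)

    def authorize(self, key_id, slot):
        """Record the slot before the caller signs; raise if already covered."""
        last = self.high_water.get(key_id, -1)
        if slot <= last:
            raise DoubleSignError(
                f"{key_id}: slot {slot} <= last signed slot {last}"
            )
        self.high_water[key_id] = slot
        self._persist()
```

The guard only helps if every node using the same keys consults the same record, which is why this kind of check usually lives in a single remote signer rather than on each validator host.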

I’d be curious to hear from anyone who does DevOps or deals with Polkadot/Substrate-related infrastructure: what are the pain points you experience? And what kinds of things do you think could be made better for infrastructure UX?


IMO this is fine and was fine. I would just like to see that the treasury isn’t paying for this. If people want to use this, they can, but they need to pay for it on their own. As described in the linked thread, the treasury should pay for checkpoint availability once this is 100% usable.

The checkpoint would just be the chain spec file, so nothing complicated to pass to the CLI.

Regarding the linked issue, if people want such things, they should show it. GitHub introduced reactions precisely so that you can show you want something without polluting the issue with “+1” comments. So we need to get the people who want this to go out and add at least some sort of reaction. It would be even better to see some sort of proposal for what they would like to see. If there is enough demand for a topic, it could, for example, also be paid for by the treasury, and some external contributor could help with fixing this :wink:
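
To make the request in the linked issue concrete for discussion, one possible shape is a thin wrapper that reads a JSON file and expands it into command-line arguments. This is only a sketch and not an existing feature of the client: `config_to_args` and the `EXAMPLE` contents are hypothetical, and the option names are assumed to mirror flags of the `polkadot` binary, so check them against `--help` before relying on any of them.

```python
import json


def config_to_args(config):
    """Flatten a mapping of option names to values into CLI arguments."""
    args = []
    for name, value in config.items():
        flag = "--" + name
        if value is True:
            args.append(flag)        # boolean switch: emit the bare flag
        elif value is False or value is None:
            continue                 # disabled or unset: omit entirely
        elif isinstance(value, list):
            for item in value:       # repeatable option: one flag per item
                args += [flag, str(item)]
        else:
            args += [flag, str(value)]
    return args


# Hypothetical config file contents; the keys are assumed flag names.
EXAMPLE = """
{
  "chain": "kusama",
  "sync": "warp",
  "database": "paritydb",
  "validator": true,
  "name": "example-node"
}
"""

if __name__ == "__main__":
    # The binary name is an assumption; the command is printed, not executed.
    print(["polkadot"] + config_to_args(json.loads(EXAMPLE)))
```

A wrapper like this keeps the client untouched, at the cost of the config format living outside the software people actually run.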

Fair point, and that is also something that I want to have improved before we make this “production” ready, i.e. have more tests, especially zombienet tests.


One thing about RPC nodes is that it is simply impossible to horizontally auto-scale them. I guess it will be possible with warp sync, so that it only takes seconds to spin up a new node instead of hours. But the point is, it may be worth applying some web2 microservice structure to blockchain nodes.

It could be something like this:

  • A shared database that supports a single blocking writer and multiple non-blocking readers
  • One or more stateless syncing nodes that download blocks, verify them, and write them to the shared db with idiomatic updates
  • Multiple stateless serving nodes that read from the db to serve RPC requests, scaling with load
  • One or more stateless collators that subscribe to incoming blocks, produce parachain blocks, and submit them to relaychain validators directly via p2p connections

A full node is simply a db + syncer + rpc server + optional collator within a single binary.
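
The single-writer, multiple-reader split described above can be tried out with a toy model. The sketch below uses SQLite in WAL mode purely as a stand-in, since it is a database known to allow one writer alongside concurrent readers; it says nothing about whether RocksDB or ParityDB can be shared this way. The table layout and the function names (`open_db`, `write_block`, `best_block`) are hypothetical.

```python
import os
import sqlite3
import tempfile


def open_db(path):
    """Open a connection in WAL mode, where readers do not block the writer."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    return conn


def init_schema(conn):
    """Create the toy block table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS blocks "
        "(number INTEGER PRIMARY KEY, hash TEXT NOT NULL)"
    )
    conn.commit()


def write_block(conn, number, block_hash):
    """Syncer role: commit one verified block in a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO blocks (number, hash) VALUES (?, ?)",
            (number, block_hash),
        )


def best_block(conn):
    """Serving role: return the highest block visible to this reader."""
    return conn.execute(
        "SELECT number, hash FROM blocks ORDER BY number DESC LIMIT 1"
    ).fetchone()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "chain.db")
    writer = open_db(path)
    init_schema(writer)
    readers = [open_db(path) for _ in range(3)]  # stand-ins for RPC servers
    for n in range(1, 4):
        write_block(writer, n, f"0x{n:064x}")
    print([best_block(r) for r in readers])
```

Here the readers are separate connections in one process; in the proposed architecture they would be separate processes. SQLite's WAL mode only works when all processes are on the same host, so a real deployment across machines would need a different storage layer.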

Substrate is already very modular and I see no reason we cannot break it apart into multiple binaries and have them connected via RPC. In fact, Cumulus already supports using an external relaychain node, which is a step in this direction. The only thing I am not sure about is whether it is technically feasible for RocksDB or ParityDB to support shared usage, or whether the additional RPC communication overhead will introduce too much latency. However, for the use case of an RPC node, this level of latency is definitely acceptable, and we can simply still run everything within a single process for collators / validators.

The general strategy forward is to shift away from the RPC nodes paradigm.

A full node is nothing more than a database, and when a light client queries something from a full node, it is like a database query.
