Starting a thread on what I would broadly call Infrastructure UX.
Some context: the topic of community-run bootnodes came up, and it was suggested that people running them should have them listen via WebSockets with a TLS cert, as this helps facilitate things like Smoldot / light clients. This other forum post has some more details about that in particular.
Other topics mentioned were warp sync, checkpointing, having people specify checkpoints, what the security model of each of these is, and minimizing trust.
I had a few random thoughts on this from the perspective of people running infrastructure:
- Node operators will value reliability, accessibility, ease of use, and minimizing time spent on things over trust minimization. End users and the public are the people who generally care more about trust minimization.
- Trust minimization and the above are not mutually exclusive, but we would need to actively care about the UX and functionality of the options for running things (i.e., is it easy to make a db snapshot that is safe, is there a way to avoid having to trust that other people or nodes are not malicious, what can be baked into the client so that people don’t have to do things like mess around with folders, etc.).
- Node operators (validators, collators, rpc providers, etc) need to have a playbook for situations where they spin up new things, need to get something running in a critical time period with minimal overhead (bugs, things go down, upgrades, etc), or need to replace current infra.
- Everyone wants as many different ways to mitigate downtime as possible.
- Node operators generally run infrastructure for multiple ecosystems (not just the Polkadot ecosystem).
- Part of the Infra UX that people expect is tooling that helps facilitate high availability.
- There needs to be a good solution for people to get things up quickly and easily, that also leaves little room for error or messing things up
- Given some of the time periods in Kusama, that means getting something like a validator synced to the tip of the chain and signing things (with possibly new or the same keys) in under 1 hour; ideally it should be possible in under 10 minutes. If the only option, given the constraints of a fixed validator set with BABE/GRANDPA, is to reuse the same keys, there need to be good ways to prevent double signing and to keep people from shooting themselves in the foot. Things like remote signing with double-signing protection exist and work well in other ecosystems.
- One of the most common things people are used to is restoring from a db snapshot
- For historical reasons, people settled on 1000-block pruning as the most common setting for making db snapshots and hosting them.
- Generally, when people make db backups they leave out the folders for keys and parachains (though it seems that not having the parachains folder can cause issues in some circumstances).
- Unless there’s a way to better accommodate archive nodes, people will still need a decent db hosting/backup solution to get an (archive) RPC node up and running quickly.
- Using something like Warp Sync is probably a fine solution for people to get Validators / Full nodes (with pruning) up and running quickly, so long as it’s very reliable and consistent
- It often hasn’t been in the past, which is why a lot of people are reluctant to use it.
- We need to make sure that things like making db snapshots, upgrading to new versions, and general compatibility work in a consistent and non-breaking manner with warp sync.
- Most people still use RocksDB as the main db for running nodes. The switch to ParityDB as the default db should be made soon, and any blockers or features needed before switching should be outlined. This will somewhat change which kinds of db snapshots people in the ecosystem prioritize hosting, since most hosted snapshots today are RocksDB ones.
- In terms of Infra UX, specifying something like the checkpoint hash for warp sync is fine (other ecosystems do something similar, where they specify the block hash and possible node endpoints to quick-sync from), although it would help a lot if we could specify CLI arguments as a YAML/TOML/JSON file. (This issue about it has been open for a super long time.)
- In terms of ‘security’, infrastructure providers will generally not favor it when other options offer better accessibility, ease of use, etc. If we want to prioritize security, we will need to offer baked-in solutions that favor both security and ease of use/access. Relying on people to do things ‘outside the client’ and the software we offer will usually work against security, because critical infrastructure operations depend on keeping things consistently up and putting new things up quickly and easily.
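To make the remote signing point above a bit more concrete, here is a rough Python sketch of the kind of double-signing protection I mean: a "high watermark" guard that refuses to sign anything at or below the last slot it already signed for a given key. This is not how any existing Polkadot/Substrate tool works. `SigningGuard`, the JSON state file, and treating what gets signed as a single integer "slot" are all assumptions I made up for illustration; a real signer would need to track whatever BABE and GRANDPA actually sign, and would want a sturdier store than a JSON file.

```python
import json
import os
import tempfile
from pathlib import Path


class DoubleSignError(Exception):
    """Raised when a request could conflict with an earlier signature."""


class SigningGuard:
    """Hypothetical high-watermark guard: a key may only sign increasing slots."""

    def __init__(self, state_path):
        self.state_path = Path(state_path)
        self.watermarks = {}
        # Reload the last signed slot per key so the guard survives restarts.
        if self.state_path.exists():
            self.watermarks = json.loads(self.state_path.read_text())

    def _persist(self):
        # Write to a temp file and rename it, so a crash cannot leave a
        # half-written state file behind.
        fd, tmp = tempfile.mkstemp(dir=self.state_path.parent)
        with os.fdopen(fd, "w") as f:
            json.dump(self.watermarks, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.state_path)

    def sign(self, key_id, slot, payload, signer):
        last = self.watermarks.get(key_id, -1)
        if slot <= last:
            raise DoubleSignError(
                f"key {key_id} already signed slot {last}, refusing slot {slot}"
            )
        # Record the slot BEFORE signing: if we crash in between we lose one
        # slot, which is much cheaper than signing the same slot twice.
        self.watermarks[key_id] = slot
        self._persist()
        return signer(payload)
```

The idea is that the state lives with the signer rather than with the node, so a replacement validator brought up in a hurry with the same keys goes through the same guard and cannot sign a slot the old one already signed.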
I’d be curious to hear from anyone who does DevOps or deals with Polkadot/Substrate-related infrastructure: what are the pain points you experience? And what kinds of things do you think could be made better for infrastructure UX?