I believe that we do have very good tech, and I think that we should prove it. Just claiming this without data to back it up is what everybody else does.
Sure, anybody could do this… but Parity & the Fellowship are almost always the ones who wrote the buggy code, investigate the issue, fix it, and write the post-mortem. Yes, in theory anybody else could also do that.
But by doing this ourselves, we are not preventing anyone else from also doing it. So I don't think that is an argument against us doing it.
Anyway, regardless of who ends up doing this, it would still help to define some metrics that allow anybody to build useful monitoring. Maybe that should be its own topic, but for now I will put it here:
SLIs (Service Level Indicators) & SLOs (Service Level Objectives)
These are metrics and objectives that help paint a picture of how well our tech stack is working. I think there is some low-hanging fruit here that should be easy to define.
Polkadot Relay Chain
Authoring Rate
SLI: The number of authored blocks per time window.
SLO: Within [95, 100] per 10-minute window.
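As a rough sketch of how this SLI could be computed, assuming we already collected the timestamps (in milliseconds) of all authored blocks; the function name and input format are made up for illustration, not an existing API:

```rust
/// Counts the blocks whose timestamp falls into the given window.
/// `block_timestamps_ms` is a hypothetical input: the timestamps (ms) of all
/// blocks we observed being authored.
fn authoring_rate(block_timestamps_ms: &[u64], window_start_ms: u64, window_len_ms: u64) -> usize {
    block_timestamps_ms
        .iter()
        .filter(|&&ts| ts >= window_start_ms && ts < window_start_ms + window_len_ms)
        .count()
}

fn main() {
    // With a 6-second target block time, a 10-minute window ideally holds 100 blocks.
    let timestamps: Vec<u64> = (0u64..100).map(|i| i * 6_000).collect();
    let count = authoring_rate(&timestamps, 0, 10 * 60 * 1_000);
    println!("authored blocks: {count}, SLO met: {}", (95..=100).contains(&count));
}
```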
Slow/Fast Block Rate
SLI: The number of blocks per time window whose timestamp differs from their parent's by more than 10 seconds.
SLO: No more than 10 per 1-hour window.
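A sketch for this one, under the assumption that we have the on-chain timestamps (in milliseconds) of consecutive blocks in parent-to-child order; the helper is hypothetical:

```rust
/// Counts blocks whose timestamp is more than `threshold_ms` ahead of their parent's.
/// `timestamps_ms` is assumed to contain consecutive block timestamps in chain order.
fn slow_blocks(timestamps_ms: &[u64], threshold_ms: u64) -> usize {
    timestamps_ms
        .windows(2)
        .filter(|w| w[1].saturating_sub(w[0]) > threshold_ms)
        .count()
}

fn main() {
    // Mostly 6 s blocks, with two slow gaps of 12 s and 18 s.
    let timestamps = vec![0u64, 6_000, 18_000, 24_000, 42_000, 48_000];
    let slow = slow_blocks(&timestamps, 10_000);
    println!("slow blocks: {slow}, SLO met: {}", slow <= 10);
}
```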
Finalization Rate
SLI: The number of finalized blocks per time window.
SLO: Within [95, 100] per 10-minute window.
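The finalization rate could be sampled in the same spirit, e.g. by comparing the finalized head height at the start and end of a window (again a hypothetical helper, not an existing API):

```rust
/// Number of blocks finalized during the window, derived from the finalized
/// head height sampled at the window boundaries (hypothetical inputs).
fn finalized_in_window(height_at_start: u64, height_at_end: u64) -> u64 {
    height_at_end.saturating_sub(height_at_start)
}

fn main() {
    let finalized = finalized_in_window(1_000, 1_098);
    println!("finalized blocks: {finalized}, SLO met: {}", (95..=100).contains(&finalized));
}
```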
Slow Finalization Rate
SLI: The number of blocks per time window that take more than 20 seconds to finalize.
SLO: Fewer than 10 per 1-hour window.
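And a sketch for the slow-finalization SLI, assuming we record for every block when it was first imported and when it was observed as finalized (both made-up observability inputs, in milliseconds):

```rust
/// Counts blocks whose observed finalization lag exceeds `threshold_ms`.
/// Each tuple is (imported_at_ms, finalized_at_ms) for one block; both values
/// are assumed to come from our own monitoring, not from the chain itself.
fn slow_finalizations(observations: &[(u64, u64)], threshold_ms: u64) -> usize {
    observations
        .iter()
        .filter(|&&(imported, finalized)| finalized.saturating_sub(imported) > threshold_ms)
        .count()
}

fn main() {
    // One of three blocks took 25 s from import to finalization.
    let observations = vec![(0u64, 18_000u64), (6_000, 24_000), (12_000, 37_000)];
    let slow = slow_finalizations(&observations, 20_000);
    println!("slowly finalized blocks: {slow}, SLO met: {}", slow < 10);
}
```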
Relevance
The relevance of these metrics is difficult to assess, but they should give us a one-way implication of the form "SLO not reached" => "bad end-user experience". The other direction of the implication is more difficult to establish, but I think this is a good start.
(This approach is a bit similar to the sTPS (standard TPS) idea, where we quantify some exactly defined metrics to prove our point.)