Ah I understand the problem now. Specifically, when bringing up new nodes behind a load balancer there’s a period of time during which the node isn’t “ready” yet.
Independently of the new API, it is fundamentally impossible to do that in a clean way. For example, maybe node A is at block 10000 and node B is at block 50 (because it has only managed to connect to one other peer and this other peer is at block 50). They will both report that they’re healthy, but you probably want the load balancer to redirect clients to node A.
Even with warp syncing, you have no way to actually know whether you’re at the head of the chain or not. It’s actually even worse with warp syncing, because nodes currently can’t switch back to warp syncing after they’ve finished it the first time. So you might end up accidentally warp syncing to a very old block, and your node will then sync very slowly from there.
The only clean solution is to have a custom load balancer specifically for Substrate/Polkadot that will compare the latest block of all the nodes that it is load balancing.
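As a rough sketch of what that comparison could look like: given the best block each node reports, keep only the nodes that are within a few blocks of the highest one. The function name, the node names, and the `max_lag` threshold are all hypothetical, and how you actually collect the block numbers is left out. Note that this only compares nodes against each other, so if every node is stuck on an old block it won't notice.

```python
def pick_nodes_near_head(best_blocks, max_lag=5):
    """Return the nodes whose best block is within max_lag of the highest one reported.

    best_blocks maps a node name to its best block number, or to None if the
    node did not answer. Both the mapping and max_lag are illustrative choices.
    """
    known = {name: height for name, height in best_blocks.items() if height is not None}
    if not known:
        return []
    # The highest block seen among the nodes is the best guess for the chain head.
    head = max(known.values())
    return sorted(name for name, height in known.items() if head - height <= max_lag)


# Node B from the example above has only synced to block 50, so it is dropped.
reported = {"node-a": 10000, "node-b": 50, "node-c": 9998, "node-d": None}
print(pick_nodes_near_head(reported))  # ['node-a', 'node-c']
```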
This is a typical example where web2 conflicts with web3, because in the web2 world you never have that problem. Your web server is either started or not; it is never in a “started but can’t answer yet” phase.
Anyway, the AWS feature described in “Health checks for your target groups - Elastic Load Balancing” isn’t related to the JSON-RPC API, as AWS can’t do live checks by sending JSON-RPC requests.
My opinion, for the sake of pragmatism, would indeed be to add some endpoint (like the issue “readiness and liveness endpoints for service health monitoring” (#1017 in paritytech/substrate) suggests), but implement it on top of the Prometheus server rather than use system_health.
If this is done by an external tool, the format in which Prometheus exposes metrics is insanely simple, and arguably even easier to obtain through a script than with a JSON-RPC request.
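To give an idea of how little is needed, here is a minimal sketch that pulls one value out of the Prometheus text format. The metric name `substrate_block_height` and the `status="best"` label are assumptions about what the node exposes, so check them against your node's actual `/metrics` output. The sample text is made up, and the label parsing is simplified (it ignores escaped quotes inside label values).

```python
import re

# One sample line looks like: name{label="value",...} 123
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def read_metric(text, name, wanted_labels):
    """Return the first value of metric `name` whose labels include wanted_labels, else None."""
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None or match.group(1) != name:
            continue
        labels = dict(LABEL_RE.findall(match.group(2) or ""))
        if all(labels.get(key) == value for key, value in wanted_labels.items()):
            return float(match.group(3))
    return None


# Made-up sample of what a metrics page could contain (assumed metric and label names).
SAMPLE = """\
# HELP substrate_block_height Block height info of the chain
# TYPE substrate_block_height gauge
substrate_block_height{chain="dev",status="best"} 10000
substrate_block_height{chain="dev",status="finalized"} 9997
"""

print(read_metric(SAMPLE, "substrate_block_height", {"status": "best"}))  # 10000.0
```

In practice the text would come from an HTTP GET on the node's Prometheus endpoint instead of a hard-coded string.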