We have started the process of increasing the number of validators participating in parachain consensus from 300 (current) to 500! The first step, activating the prerequisite approval voting improvements on the node side, has already been initiated via this referendum. Enactment will happen on the 25th of May.
Details and what to expect
From a performance/optimisation perspective, our testnet experiments and benchmarks showed a drop of up to 6x in the CPU and network bandwidth usage of the approval voting protocol. We expect to observe the same on Kusama.
Relevant node side changes enabled by https://kusama.polkassembly.io/referenda/392:
- v2 assignment certificates - validators are assigned to vrf_modulo_samples (6) candidates in tranche0 using a single message/signature instead of one per candidate. One such signature costs around 250 microseconds to verify, so this is where the most significant reduction of CPU usage comes from.
- approval of multiple candidates with a single signature - at the expense of latency, nodes minimize the number of approval vote messages and signatures. The latency is controlled by the configuration parameter max_approval_coalesce_count.
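To make the CPU saving from v2 assignment certificates concrete, here is a rough back-of-the-envelope sketch. It only uses the two numbers quoted above (6 candidates per certificate, roughly 250 microseconds per signature check); the function names are hypothetical and are not part of the node code.

```rust
/// Approximate cost of verifying one assignment signature, in microseconds
/// (the "around 250 microseconds" figure quoted in the post).
const VERIFY_COST_US: u64 = 250;

/// Number of signatures a node has to verify for `candidates` tranche0
/// assignments when one certificate covers up to `candidates_per_cert`
/// candidates. Hypothetical helper, for illustration only.
fn signatures_needed(candidates: u64, candidates_per_cert: u64) -> u64 {
    assert!(candidates_per_cert > 0, "a certificate must cover at least one candidate");
    // Ceiling division: a partially filled certificate still needs one signature.
    (candidates + candidates_per_cert - 1) / candidates_per_cert
}

/// Total verification time in microseconds for the same inputs.
fn verify_cost_us(candidates: u64, candidates_per_cert: u64) -> u64 {
    signatures_needed(candidates, candidates_per_cert) * VERIFY_COST_US
}
```

With one signature per candidate, `verify_cost_us(6, 1)` comes to 1500 microseconds per validator, while a single certificate covering all 6 candidates, `verify_cost_us(6, 6)`, costs 250 microseconds. This ignores every other cost in the protocol, so treat it as an illustration of where the saving comes from rather than a prediction.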
More details on the work done so far to make this possible are described here.
Impact
With 500 validators involved in parachain consensus, the total core count can be increased to 100 (or even more, considering we are oversubscribing via max_validators_per_core == 3), security is improved and backing vote rewards are more fairly distributed.
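The core count figures can be sanity-checked with simple division. The sketch below assumes that the number of cores is bounded by the validator count divided by the number of validators assigned per core; the group size of 5 is an assumption picked so that the arithmetic reproduces the 100-core figure, and the function name is hypothetical.

```rust
/// Upper bound on the number of cores that `validators` can serve when each
/// core gets `validators_per_core` backing validators.
/// Hypothetical helper; the real assignment logic lives in the runtime.
fn max_cores(validators: u32, validators_per_core: u32) -> u32 {
    assert!(validators_per_core > 0, "each core needs at least one validator");
    // Integer division: leftover validators cannot form a full group.
    validators / validators_per_core
}
```

Under these assumptions, `max_cores(500, 5)` gives 100 cores, and `max_cores(500, 3)` gives 166 with max_validators_per_core == 3, which is the "or even more" case mentioned above.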
At the same time, we might observe a higher approval voting base lag of up to 1.5 relay chain blocks. This happens because nodes hold off signing an approval message until no_show_slots/2 (9s) has passed or until at least max_approval_coalesce_count candidates have been approved. If max_approval_coalesce_count is reduced to lower values, the impact on approval checking and finality is reduced or even completely removed.
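The back-off rule described above can be sketched as a small decision function. This is a minimal illustration under stated assumptions, not the actual node implementation: the struct, field and function names are hypothetical, and the 6 second relay chain block time is used only to show where the 1.5 block figure comes from.

```rust
use std::time::Duration;

/// Hypothetical view of the two knobs mentioned in the post.
struct CoalesceConfig {
    /// Sign as soon as this many approvals are waiting.
    max_approval_coalesce_count: usize,
    /// Longest time an approval may wait before being signed
    /// (no_show_slots/2, quoted as 9s in the post).
    max_delay: Duration,
}

/// Returns true when the node should stop waiting and sign one message
/// covering all pending approvals.
fn should_sign_now(cfg: &CoalesceConfig, pending: usize, oldest_pending_age: Duration) -> bool {
    pending > 0
        && (pending >= cfg.max_approval_coalesce_count || oldest_pending_age >= cfg.max_delay)
}

/// Worst-case added lag in relay chain blocks, assuming a 6 second block time.
fn worst_case_lag_blocks(cfg: &CoalesceConfig) -> f64 {
    cfg.max_delay.as_secs_f64() / 6.0
}
```

With a 9 second delay, `worst_case_lag_blocks` evaluates to 1.5. Setting max_approval_coalesce_count to 1 makes `should_sign_now` return true as soon as a single approval is pending, which is why lowering the value reduces or removes the added lag.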
Plan
- Prerequisites enacted (Enable approval voting protocol improvements | Polkassembly)
- Wait 1 week to gather data
- Referendum to increase set size to 400
- Referendum to increase set size to 450
- Referendum to increase set size to 500
- Gather data for a while to conclude that the same plan will work flawlessly on Polkadot.
What can potentially go wrong?
We trade off latency to improve bandwidth/CPU usage by waiting to approve multiple candidates with one message. This might increase finality lag beyond the current baseline. If this happens, we might be forced to dial down max_approval_coalesce_count and potentially end up scaling to fewer than 500 validators.
Future plans
By the end of 2024, the goal is to have solved the remaining scalability bottlenecks, including raising the hardware requirements of validators. To be clear, this means having the implementation ready, with testnet runs and benchmarks giving us a strong signal that we are ready to go to production networks in 2025.
With these bottlenecks out of the way the available block space will be doubled.
Some of the challenges we know of are related to node subsystem design, availability and potentially the need for parachain consensus runtime refactoring and optimizations.