[RFC] Increasing recommended minimum core count for reference hardware

The polkadot binary is a multithreaded program that benefits from having more hardware cores available, and all configurations are tuned with the worst-case scenario on the weakest recommended hardware configuration in mind. Currently, the minimum hardware requirements recommend at least 4 hardware cores with SMT disabled.

This thread has been started to gather feedback on raising the minimum requirements for a Polkadot validator to at least 8 hardware CPU cores, so if you have any concerns of any kind about this, please let us know.

Why would the network benefit from raising the minimum hardware requirements?

With async backing enabled, parachains can now produce bigger blocks which take much more time to execute: collator authoring time can be increased from 500ms up to 2000ms. That means the PVF execution performed by the backing and approval processes might need up to 4 times more CPU time to execute those blocks and confirm they are valid.

Currently, the PVF execution pool is capped at a maximum of 50% of the 4 hardware cores. That means for each relay chain block we have a maximum of 12s of CPU time for parachain block execution (6s relay chain block time * 2 cores). On average a validator needs to execute around 7 parachain blocks (1 backing + 6 tranche0 approvals), so dividing the maximum CPU time by the number of parablocks, the average parachain block execution time cannot be bigger than roughly 1.7s (12s / 7).

By increasing the minimum required core count we can then safely increase the hard capacity of the PVF execution pool. For example, if we raised the capacity to 50% of 8 hardware cores, the theoretical average budget for parachain block execution would increase from 1.7s to 3.4s (24s / 7), significantly above the 2s backing time recommended by async backing.
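To make the arithmetic above concrete, here is a minimal sketch (not part of any validator code, just the back-of-the-envelope math) that recomputes the per-parachain-block CPU budget under both configurations, assuming the 6s relay chain block time, the 50% PVF pool cap and the 7 parablocks (1 backing + 6 tranche0 approvals) mentioned above:

```rust
// Rough budget calculation for PVF execution time, using the assumptions
// from this post: 6s relay chain block time, PVF pool capped at 50% of the
// hardware cores, and ~7 parachain blocks executed per relay chain block
// (1 backing + 6 tranche0 approvals).
fn avg_pvf_budget_secs(hw_cores: u32, pvf_cap: f64, relay_block_time_s: f64, parablocks: u32) -> f64 {
    let pvf_cores = hw_cores as f64 * pvf_cap;                     // cores available to the PVF pool
    let cpu_time_per_relay_block = pvf_cores * relay_block_time_s; // total CPU seconds per relay block
    cpu_time_per_relay_block / parablocks as f64                   // average budget per parachain block
}

fn main() {
    // 4 cores: 4 * 0.5 * 6s = 12s of CPU time, / 7 blocks ≈ 1.7s per block
    println!("4 cores: {:.1}s per parachain block", avg_pvf_budget_secs(4, 0.5, 6.0, 7));
    // 8 cores: 8 * 0.5 * 6s = 24s of CPU time, / 7 blocks ≈ 3.4s per block
    println!("8 cores: {:.1}s per parachain block", avg_pvf_budget_secs(8, 0.5, 6.0, 7));
}
```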

Why don’t we raise the PVF execution hard capacity?

Currently, the PVF execution hard capacity is 50% of 4 hardware cores. Increasing it would give us more execution time for parachain blocks; however, this analysis concluded that it would not be safe on validators running with the minimum requirement of 4 cores, because it would steal valuable CPU time from other subsystems doing critical work as well.

Other optimisation avenues

In parallel with this, other optimisations are also in progress that would reduce the amount of work a validator needs to do.

However, since the optimisations being worked on stack on top of raising the minimum required hardware, if we want to be able to support 10x more usage on the network we would actually want to do both.

Current state of affairs

Looking at Polkadot telemetry, it seems that the majority of validators already use more than 4 cores.
Note! This data is not fully reliable, because it includes collator nodes and because on some orchestration systems the binary may see a certain number of cores while being rate-limited via other mechanisms.
[Kusama: telemetry chart of validator CPU core counts]

[Polkadot: telemetry chart of validator CPU core counts]

This raises the question: if the majority of validators are already running on more powerful hardware than the minimum required, then there is already spare/wasted capacity that we can use to execute more parachain blocks. This would lead to better utilization of the provisioned hardware and would allow the network to support a significantly higher number of Polkadot cores (not to be confused with hardware cores).

Alternatively, if validators are running at the bare minimum requirements, then increasing the core count from 4 to 8 might increase the cost of running a validator non-negligibly for some people. I avoided doing cost estimations here because they will vary significantly from setup to setup, but we look forward to hearing from validators in the community whether they think they will be impacted and how.

Preparing for the future

With all that in mind, we think that raising the minimum core count for validators from 4 to 8 is a change that greatly helps the Polkadot network prepare for a future where usage increases 10x or more, and it is a change we should start proactively rather than waiting until the maximum theoretical throughput is reached.

Final note

While we think this is the path we should head towards in the future, we are also aware that this might have an impact on validator costs, so we look forward to hearing from the validator community about what we can do to mitigate the impact as much as possible.


I’d heard elsewhere that not all validators run even 4 cores, although maybe their original source was this data, and maybe the collators etc confuse the results. Anyways…

I think the rewards design in Validator rewards in Polkadot/Kusama - HackMD would safely penalize validators who run under spec hardware. We should discuss it sometime.

I’d heard elsewhere that not all validators run even 4 cores, although maybe their original source was this data, and maybe the collators etc confuse the results. Anyways…

Hard to tell; we don’t have any good data besides https://telemetry.polkadot.io/, and that’s unreliable as well.

I think the rewards design in Validator rewards in Polkadot/Kusama - HackMD would safely penalize validators who run under spec hardware. We should discuss it sometime.

Yes, I agree that mid to long-term this is how these differences should auto-regulate, but I don’t think this will land on a short timeframe. Without that, the best we can do now is raise the minimum recommended spec and give people time to try to follow it.

Could you speak to dedicated physical hardware cores vs shared configurations like virtualization (vCPUs), and also container situations where multiple containers may be sharing X physical hardware cores?

Think about this: a validator running on a Digital Ocean droplet with “8 CPUs”, but in reality other customers are running on the same physical hardware.

Or the opposite extreme: a validator running on bare-metal that really has eight physical cores not shared with anything else.

Or this: a Kusama validator and a Polkadot validator running in containers on the same bare-metal physical hardware, sharing eight cores?

I find ‘core’ is almost meaningless now, given virtualization and various sharing situations.

Trying to understand if it’s about simultaneous thread capacity or actual performance.

We’re primarily discussing recommendations here, aka capacity, but…

At present polkadot & kusama are badly underutilized. It’s possible this changes once ETH bridges work, but if not we’ll hopefully run some glutton parachains soon-ish.

At some point someone should finally launch some applications that really demand the throughput too.

We need validators to spend enough CPU time by that point, which we’ll necessarily enforce through validators’ rewards too.

It’ll become about actual performance eventually.

As Jeff said, both in the Run a Validator (Polkadot) · Polkadot Wiki and in this post, when we talk about a core, the assumption is that you use (almost) the full capacity of that core for a single polkadot validator.

Could you speak to dedicated physical hardware cores vs shared configurations like virtualization (vCPUs), and also container situations where multiple containers may be sharing X physical hardware cores?

The type of instance you purchase from your cloud provider should tell you whether you get those cores entirely for yourself or whether the CPU time is burstable, with the assumption that you are sharing the physical cores with other virtual tenants. The Polkadot protocol has a lot of places that assume operations happen within a certain amount of time, so when running your validator you should always favor a predictable CPU allocation for your polkadot program or container.
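As a rough illustration only (not an official check, and the 8-core threshold is simply the recommendation discussed in this thread), here is a small sketch of how an operator could sanity-check how many logical CPUs their validator process actually sees. Note that this only reports what the process can observe; on shared or burstable instances it says nothing about the CPU time you will actually get.

```rust
use std::thread;

// Hypothetical sanity check: warn if the validator process sees fewer logical
// CPUs than the recommended minimum discussed in this thread. This only
// reflects what the process can observe; CPU bursting or throttling on
// shared/virtualized hosts can still limit the CPU time actually available.
const RECOMMENDED_MIN_CORES: usize = 8;

fn main() {
    let seen = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    if seen < RECOMMENDED_MIN_CORES {
        eprintln!(
            "warning: only {seen} logical CPUs visible, recommended minimum is {RECOMMENDED_MIN_CORES}"
        );
    } else {
        println!("{seen} logical CPUs visible");
    }
}
```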