Distributed validator infrastructure for Polkadot

drew · April 15, 2024, 8:44pm

Hello, I posted an issue on some thoughts on a distributed validator infrastructure for Polkadot validators that would ideally support any Substrate chain with GRANDPA/BABE consensus. Looking for feedback, insights, advice, and interest from anyone who might be open to exploring with me and the Tangle (https://tangle.tools) team!

github.com/paritytech/polkadot-sdk

Distributed validator infrastructure for Polkadot

opened 08:40PM - 15 Apr 24 UTC

drewstone

I10-unconfirmed

I'm starting a bit of research and looking for advice/insight into expanding thi…s issue into a full-featured spec for creating a distributed polkadot validator cluster. This would be similar to what [Obol](https://obol.tech/) and [SSV Network](https://ssv.network/) are doing on Ethereum. Examples of what a distributed validator cluster look like on Ethereum can be found [here](https://github.com/ObolNetwork/charon-distributed-validator-cluster) although I can't speak to the reasonableness of that setup. A **distributed validator infrastructure** (DVI) for Polkadot is an infrastructure for splitting and distributing validator keys into multiple using secret sharing / DKG, for the purpose of running a Polkadot validator across multiple non-trusting nodes. This would lay the foundation for new liquid staking tokens and Lido like products on Polkadot/KSM and any Substrate chain leveraging GRANDPA and BABE, although I imagine the setups for relay chains would differ from standalone chains. ### Distributing the BABE client Below, I outline the functionality I believe needs to be distributed between the operators in a DVI cluster. I am looking for more insight and feasibility here. - BABE authorities have a VRF key that they use each slot to generate random numbers. Can this be distributed using MPC? - BABE authorities also have a signing key that they use to sign blocks. Can this be distributed using MPC? - DVI cluster will each slot generate the VRF output using MPC and if the output meets the threshold requirement, they are selected the slot leader, and they collectively sequence the next block. Each operator in the DVI cluster will run the respective client software needed to handle all operations in **Normal Phase** and **Epoch Update Phase** of the BABE protocol as [outlined in the wiki](https://research.web3.foundation/Polkadot/protocols/block-production/Babe). They can optionally run some fast BFT consensus protocol like [HotShot](https://github.com/EspressoSystems/HotShot) to agree on the best chain to build on as well as other internal consensus matters IF NEEDED. If running the respective clients as is is sufficient, I'm curious if this fits the bill. ### Distributing the GRANDPA client Similarly, we describe what likely needs to be distributed (not necessarily well-ordered or unique) - All operators in cluster identify if their DVI authority is a voter for the GRANDPA finality round. If so, they execute TSS MPC to vote using their distributed GRANDPA key. - Execute [Play-Grandpa-Round](https://spec.polkadot.network/sect-finality#algo-grandpa-round), any node in the cluster can broadcast to the rest of the network or through some relay. ### Questions - Is anyone available for a call or additional insights that prove infeasibility here? - Is this something that's been explored before in Polkadot and is there any prior art/work? I haven't found much. - What big open questions am I missing?

Goal

One major goal with building out a DVI is to build a decentralized staking infrastructure and LST that would earn native yield from Polkadot validation and provide a rich primitive for applications such as restaking. A major goal for Tangle is to bring in DOT liquidity for restaking security and provide a similar ecosystem as Eigenlayer but here in this ecosystem. LRSTs are an effective way to continue to bootstrap security and provide new yield opportunities and I think if this infrastructure is possible to build that it would provide a lot of value to this ecosystem.

0mm-mark · May 5, 2024, 12:38am

Thanks for bringing these pieces of tech to light:

This would be similar to what Obol and SSV Network are doing on Ethereum. Examples of what a distributed validator cluster look like on Ethereum can be found here…

IIUC, this provides a validator with some redundancy - the motivation being that they could avoid some penalty if their infrastructure falls over at a point in time when they are expected to produce/validate a block, etc.

Rather than reinvent the wheel… I wonder if this is a starting point or shared functionality for validating across relay chains?

For motivation I’ll note that some very rough and preliminary calculations suggest that for an equilibrium state you might need in the order of 2K-3K participants. What is the best definition of a participant? Who knows. But for current purposes, let us say it is validators. The good news is that number is not 1 million, the bad news is that number is not 100. For context see these figures from @burdges:

The pertinent observation is that for more that 1K validators you are likely looking at more than one relay chain.

Unfortunately some obstinately refuse to acknowledge the focus of development really needs to be the relay chain.

Any thing you can do to move the ball forward would be great - even if all you do is establish what won’t work or won’t help.

You might wonder if the recent gray/JAM paper improves matters. While it does make several assertions about the economic security, and it does correctly (in my view) acknowledge the critical role this plays. It is, on this topic, unfortunately, another example of crypto-obscurantism. As best I can tell does not provide anything new on the economic security front. And silent on the critical question of what is the ideal number, and more importantly the minimum number, of participants under different conditions.

I’ll reiterate the preliminary and incomplete nature of what I raised above. And point out the obvious problem of not having data from a system (ETH, DOT, etc.) that we think is capable of reaching a steady state, or we can reasonably conjecture is in that state, and so can be used to inform our parameter estimates - in fact we aren’t even sure we have systems capable of maintaining a equilibrium in the face of adverse events. Also I’ll note that my figures above relate to what I would call a non-speculative token design, while ETH, DOT etc are speculator token designs (aka securities).
Finally, there may be more than one way to skin this particular cat, and it is possible the calculations referred to above are correct and irrelevant - a better alternative being available.

burdges · May 5, 2024, 11:23am

A “distributed validator” seems nonsensical. All these blockchains are already a distributed systems.

A “threshold validator” makes sense: It’s several physical machines doing redundant work but doing threshold concensus signatures.

Afaik “distributed validator” could only really mean “threshold validator” with different subnodes under the control of different sysadmin. It’s way less than what people usually mean by “distributed”. It’s possible “threshold validator” would provide operational security benefits for validator operators, maybe even if the same sysadmin controls all subnodes.

I replied to threshold validator on github of course. As I said there…

We do keep our crypto threshold friendly whenever possible, but at least from polkadot’s perspective threshold validators provide little value, even though the underlying idea makes sense. If we needed more “decentralization” then we should adjust our parameters.

Kusama should not use threshold validators. We should debug & deploy NIST post-quantum crypto in concensus on kusama temporarily, which proves polkadot could deploy post-quantum crypto in production. There is no chance that NIST selects crypto with simple & secure threshold flavors.

Unfortunately some obstinately refuse to acknowledge the focus of development really needs to be the relay chain.

Rob’s comments look unrelated.

We do actively develop the relay chain, adding features & improving performance, but…

At present, we’ve no resoruce pressure on polakdot, so we need more development of example applications, like games or whatever, probably both externally and in-house.

As I said elsewhere…

You could bridge polkadot, kusama, and similar projects, but these bridges require 2/3rd honesty on both sides, like what cosmos assumes. You’d expect social engeneering attacks bring down cosmos-like bridge ecosystems eventually.

We envision multiple parallel relay chains randomly divvying up one large validator set, selected by NPoS ellection, using only DOT staking of course. In this, we’d prove 2/3rd honesty on each chain, instead of assuming it like cosmos does. We know only two ways to do this proof:

We make all validator operators run equally many nodes for each relay chain.
We require (a) the heavier 80% honest security assumption, as well as (b) shuffling of validators between the relay chains using (c) threshold randomness.

Anyways…

Threshold validators need threshold crypto, not relay chain features.
Distributed validator could only mean threshold validator. Staking is irrelevant there.
NPoS is already better than other “liquid sataking” ideas.

There do exist other reasons to provide more staking features, like staked DOTs being refrenced by collators, who require only liveness assurances not safety or soundness.

0mm-mark · May 5, 2024, 11:34pm

That seems reasonable.

My understanding was the current use case was as you note:

and to do so independently of the RC configuration/preferences.

Agreed. Although my use case is subtly different: The Attribute X may be adequate for Property Y, but inadequate for Property Z. That is on me - I did weakly cast this as being validator specific, but it needn’t be. What is generic is the presence of more than one RC.

I’m not disputing the RC development pace, and the use case is more strategic than tactical. But, as I acknowledged, this use case is a non-speculative token and that is categorically different from DOT.

So far, as best I can tell, the differences fall within the scope of parameter settings. Apologies for not being clearer, the RC code base is working its way up my todo list.

This should focus things: Are there integration (or unit) tests exercising the multi-RC use case? Or even documented rules-of-thumb about the trade-offs?

Here you mean non-BEEFY bridges? Or does BEEFY share these properties?

I’m inclined to agree with your GH comment that OmniLedger probably has some useful insights/results.

drew · May 22, 2024, 11:22pm

Thanks for the feedback and thoughts.

Yea, distributed validator or threshold validator, I’m using both terms to describe the same system. The MPC would also be relevant presumably for other use cases, key management/custody applications that want to interact with Polkadot.

I’d like to keep pushing out thoughts on a design and see if @burdges @0mm-mark you have more thoughts here, but going deeper begins to expose some possibility of PBS style block building. Specifically, if you distribute the signing of BABE and GRANDPA blocks across a cluster or a network of nodes, you run into the following decisions:

For BABE, consensus needs to be reached on what block to sign.
For GRANDPA, nodes can run the finality client and generate threshold signatures for blocks they want to finalize with less coordination (or maybe also requiring consensus although I don’t see the immediate malicious behavior with not).

For BABE it seems that if the cluster is being managed by different entities they may have some say in how they collectively want to build a block and then sign it with their key shares (still considering the threshold validator). This process seems like an avenue to explore proposer-build separation as I know @rphmeier has mentioned being interesting in Polkadot. Builders could send blocks w/ proofs to the threshold validator (proposer) who signs off on a block without having seen it. Of course a threshold validator is necessary to accomplish this but as I’ve thought about this more I’ve realised the space in between steps has room for exploration. Or maybe not and I still have more to understand.

and to do so independently of the RC configuration/preferences.

IIUC you’re saying that it would be potentially possible to reuse a cluster across relay chains? That does sound like an interesting extension.

0mm-mark · May 23, 2024, 12:47am

No. Closer to saying it MAY be necessary to have more validators than what one RC can support.
This is very much wet paint, and as I emphasized likely depends on how a RC is parameterized/configured. So you may be able to configure a chain so that this threshold changes - its all about tradeoffs - and you’ll need to accept that tradeoff if you want equilibrium pricing of your token.

You should also bear in mind the issues I have in mind arise from a consumer token design, while Polkadot is a speculator token (that consumers have no choice but to use). These are categorically distinct and it seems reasonable to expect that details that are an “issue” for one are inconsequential for the other.

burdges · May 23, 2024, 6:04am

I’d ignore BABE myself since it’s not too sensitive. I’d first ask the question: How much do we gain by backing & approval checks being done on isolated and/or redundant machines? It’s backing where validators could be slashed 100% and approvals could be slashed like 50%.

You could’ve a primary machine that run the full relay chain node, and a secondary machine that duplicates the backing & approvals work, and threhsold sign off on backing & approvals statements.

In fact, we could’ve a few subnets within the validator for example, along with the subnet provided by the ISP. We’ve the “front” relay chain machine that runs BABE, memepool, etc, but itself has two+ network cards, so the backing & approval checks run on “cores” machines with no internet connection.

We could threshold share the approvals & grandpa keys among the “cores” machines maybe. We do not necessarily have multiple cores machines do the work, but the approvals VRF key is extremely sensitive for the system. The problem here is the “front” machine can censor the assignments, which already suffices to break polkadot, but cannot get the validator slashed.

In principle, all these “cores” machines havew two network cards too, with the “finality” machine on this even more isolated subnet, but not sure this helps much.

I suppose these machines could be Rasberry Pis too, except polkavm won’t necessarily run nicely on Rasberry Pis.

I’m not 100% sure the best idea here. It’s be cool if there was an “output only” network option for the annoucements, but this does not really exist. Also, we might change some details here in jam, like using a threshold VRF at the whole network level.

drew · June 28, 2024, 10:30pm

Thanks for the feedback. It certainly narrows the focus down to ignore BABE. This would be a good first milestone to get an architecture similar to what you’ve described working to test.

The problem here is the “front” machine can censor the assignments, which already suffices to break polkadot, but cannot get the validator slashed.

We can create slashing consequences in auxiliary protocols that enforce proper operations no. Why can’t the “cores” have internet connections and can report their completed threshold signatures when they’re being censored to another protocol, such as Tangle. We plan to run this type of service as a restaked service on Tangle, and so we can encode slashing for malicious behavior.

Can you describe the “cores” more specifically? What is the benefit of having no internet connection? How do they communicate with the relay chain front node and why wouldn’t they trust that front node?

burdges · June 29, 2024, 2:51am

Ain’t clear what you mean but in general no you cannot punish most forms of miss-behavior.

I abused terminology there. We’ve CPU cores of course, but in polakdot “availability cores” are boxes where the realy chain handles parachain candidates. Almost all our resource consumption occurs during approvals, which occurs after the relay chain deletes the candidate from an “availability core” (and forgets it exists). “Availability cores” are thus limited not by our ability to do availability, but by our ability to check candidates, so you might call them “availability & approvals cores”, but which we call “cores”.

These “cores” are completely virtual though. “Availability cores” exist on chain, but “approval cores” exist across a random 30+ validators, and move randomly very few seconds.

Anyways I’m saying “the session keys realted to doing parachain work”, which currently is only two I think, and maybe the grandpa key does double duty.

Tangle is an IOTA thing? The trinary guys? lol

In polkadot, you cannot merge validators like on ETH or other L1, which maybe what you envision doing. All validators do somewhat different work

In principle, a hardenned validator would be multiple machines that employs physical data diodes. You’d need them physically co-located for the data diodes.

Also, if one validator were “distributed” across distant machines, then it’d wind up being too slow, and lose its rewards.

drew · July 10, 2024, 6:37pm

If we were to run this as an AVS on Tangle (i.e. offchain service run by active or waiting validators) then we can program our own slashing criteria on Tangle. We would leverage other assets to secure this service on Tangle, which would be open to slashing based on the onchain SLA defined in that specific service instance.

Lol, no. Same name, totally different. And what I mean by we can encode slashing is something specifically deployed to Tangle. The offchain service on Tangle that would run the distributed validator cluster for Polkadot could have a custom SLA on Tangle with the involved parties. If they don’t fulfill their duties as much as they can be defined in a Solidity or Ink! contract, they would have their assets on Tangle open to slashing.

I imagine so, yes we would look to run these distributed nodes in close proximity to one another, either in the same VPC or closer.

burdges · July 10, 2024, 9:43pm

This is irrelevant. All real networks drop packets, making it impossible to know if a packet is dropped for malicious reasons. It follows you cannot slash for dropped packets, and so you cannot slash for direct network level censorship. Anti-censorship tooling exists, but at other levels of the protocol, not relevant here.

It never makes sense to have one validator operated by multiple legal entities, so one validator should always be operated by the same individual person, or the same team of people who trust one another.

We’re discussing the architecture inside one single validator node, but which consists of multiple machines, doping different parts of the protocol, but all operated by the same legal entity. Internal slashing makes no sense. Why would you slash yourself?

As I wrote above…

There are badly designed protocols like ETH where one machine pretends to be multiple validators, which harms their decentralization. It’s possible you envision similarly harming the decentralization of polakdot? This is impossible in polkadot becuase validator have unrealted work loads. You’ll waste money or lose rewards for no reason.

Also…

AWS tooling should never be encuraged, so we should reject all funding proposals that even mention AWS specific tooling. AWS based validators could be banned entirely of course. Instead, we’ll eventually detect IP address ranges in a decentralized way, so then we’d reduce rewards whenver too many validators land in the same IP block. We’d then have a few AWS validators, likely enough so that their rewards degraded, but then the samrt ones would notice and abandon AWS.

drew · July 11, 2024, 4:17am

I said AVS, actively validated services, not AWS.

drew · July 11, 2024, 4:27am

Lot’s of assumptions I haven’t mentioned. These clusters would be operated by people running Tangle validators as individual entities. Nothing would stop someone from leveraging the tech without having anything to do with Tangle or its infrastructure. I’m just explaining how I would like to leverage this technology.

I don’t disagree here. Again, I’m just telling you how I will integrate it. There likely isn’t anything to slash. On the other hand, this cluster could operate a MEV relay or have priority tx flow for certain users or IPs or other configuration over it that benefits a group. Nonetheless, this is separated from implementing the system itself.

I’m literally trying to build something interesting in this ecosystem, lmao. And that thing does the reverse, it uses many machines to facilitate 1 validator…

burdges · July 11, 2024, 1:03pm

These are unrelated things that have nothing to do with running a validator.

Afaik, there are no problems if people want to restake dots to run other things, like collators or catastrophe bonds or whatever. We need another kind of lock that interacts across chains, but I think messages work kinda poorly here since if the parachain disapears then nothing could send the remove lock message.

In principle, we could’ve some lock that applies a slashing origin and an unstaking delay, and some “statement pattern” underneath that origin. A users’ own interface creates a Merkle proof that either the origin no longer exists, or a that the “statement pattern” is satisfied. In essence, a “statement pattern” is an extremely nerfed smart contract that describes flexible chains of hashes.

We might simply permit arbitrary smart contracts on AssetHub of course, but that’s not required per se. We’ll imho need to forbid smart contracts on some chains, like governance, bridgehub, and dkgs, but anyways.

We’d obviously never permit “staking leverage” within polkadot, where the same stake underwrites multiple polkadot validators. The whole point of NPoS, staking curves, etc is to maximize the stake that’s slashable.

This was my intial assumton, aka that you’re talking about validator operational security, but it became increasingly unclear from your terminology.

In fact even still, restaking for purposes besides running validators has (a) nothing to do with validator operational security, and (b) nothing to do with validators at all. It’s purely a nominator function.

As I said in the beginning, it’s really not chear what should exist on the extreme operational security front:

Yes, there are models where one physical validator could run across multiple machines. Yet, any typical model requires physically colocated machines, in which only designated “front” or “firewall” or “data diode” machenes have internet access, and all the orders have only local ethernet access. Your typical security guy should see zero value in multiple internet connected machines being one validator. There is otoh a complexity in polkadot because there are several ways that censorship become a security concerns for us. We’re thus left with missmatch between the operational security measures that benefit the validator vs network.

I’d think complexity is the enemey of security here, or as Adam Langely says “have one joint and keep it well oiled.” We designed polkadot around individual validators run by different people who directly contact one another over the internet, so anything different should be clearly defined and closely analyzed.

None of this has any connection to restaking.

burdges · July 23, 2024, 12:35am

Imho, restaking for collators maybe both easier and higher priority than the operational security stuff you’re raising here. If you want to do restaking, then maybe try to identify something specific that’s useful there?

In general, collators do not require slashing, so the simplest restaking solution does not even require the slashing message complexity mentioned above, just the ability to ensure the restaking is really comming from different dots. This is still doing cross parachain merkle proofs, so beyond the usual runtime work, maybe about the level of doing the on-chain part of an off-chain messaging solution. As a project, it shiould be interesting, but not too hard, useful to a variety of people, and definitely warrants grant money.

Topic		Replies	Views
Polkadot Staking Dashboard (staking.polkadot.network) and Polkadot.js not supporting decentralization of Validators by design validators	28	2275	April 17, 2024
Polkadot: The centralized decentralized ecosystem - a cartel? Ecosystem	18	2401	February 17, 2023
Economic Model for System Para Collators Tech Talk	24	2304	November 19, 2022
Polkadot DA vs competition	25	2323	November 13, 2023
Infrastructure Treasury funding Governance infrastructure	15	1403	August 1, 2023

Distributed validator infrastructure for Polkadot

Goal

Related topics