Resisting (D)DoS attacks by light clients

This post consists of some notes about how to resist light client DDoS attacks.

The conclusions end up being a bit underwhelming, but I’ve decided to post this anyway, because why not.

Problem statement

Light clients, by definition, do not store the chain’s storage. Whenever they need to access some data (e.g. the balance of an account), they send a network request to a full node, and the full node answers.

A full node needs a bit of time in order to read from its database (due to limited CPU and disk speed) and to send back the response (due to its limited bandwidth). If a full node receives more requests per second than it is able to answer, it gets overloaded. In that situation, the queue of unanswered requests accumulates, each request spends a long time in the waiting list, and the time before a request is answered increases. This diminishes the quality of the service offered to the light clients.

While it is possible (and desirable) to reduce the number and size of requests that a light client implementation sends to full nodes, it is not possible to differentiate an “optimized light client” that reduces its number of requests to a minimum from a “wasteful light client”.

It is also not possible to detect when two light clients are actually controlled by the same user or the same machine, as we don’t want to introduce tracking, currencies (end users don’t want to pay just to access the chain), or proof of work.

For this reason, it is possible for a malicious actor to spawn a very large number of light clients and make them send tons of requests per second, with the objective of overloading the full nodes and degrading the quality of service of the legitimate light client users. This is called a DoS or DDoS attack.

How this is traditionally solved

This problem is pretty similar to the problem of resisting (D)DoS attacks in a traditional web2 client-server architecture, where clients are equivalent to light clients and servers equivalent to the full nodes.

The way this problem is traditionally solved is by having an upper bound on the number of connections that a server maintains at any given time, and an upper bound on the number of simultaneous requests per connection. Doing so consequently puts an upper bound on the time needed for a request to be answered, and thus guarantees a minimum quality of service to each connection.

On top of this, the maintainer of the service then dynamically adds or removes servers based on the number of active clients.

If a large number of clients suddenly appears, the maintainer starts new servers in order to increase the total number of connections that can be maintained, and thus “absorb” the load.

In case of an attack, the maintainer of the service spends money running the additional servers, but the attacker also spends money sending the requests. “Resisting” the attack is then about waiting until the attacker gives up.

Adapting this solution to web3

The solution in the previous section can’t be applied as-is for web3 services, as nobody is here to dynamically add (or remove) “servers” when the load increases.

There might exist “manual” solutions, such as having node operators automatically spawn new servers and making the treasury reimburse the cost of resisting a DoS attack. But they are tricky, as it is very hard to verify the amount of money requested, and such a scheme doesn’t encourage reducing the resource consumption of a full node. It might even encourage full node operators to start DoS attacks themselves in order to get money in return.

If we assume that the number of full nodes/servers can only be adjusted very slowly, then putting an upper bound on the number of simultaneous connections that each server maintains would make a DoS attack trivial, as the attacker could simply open as many connections as possible to all the full nodes it can find, until nobody else can connect anymore.

(note: as mentioned in the problem statement, we don’t want to introduce measures such as filtering by IP)

Instead, there is no choice but to leave the number of connections per server unbounded.

(Note that, because each connection maintains some state, there is always some kind of upper bound derived from the amount of memory available. However this bound can in practice be in the order of several millions, which can basically be considered unbounded.)

Note that the number of simultaneous requests per connection should still be bounded, as this guarantees an equal distribution of resources per connection on each full node. This removes one of the two dimensions an attacker can scale: they can still open many connections, but they can no longer also multiply the number of requests per connection.
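To make the per-connection bound concrete, here is a minimal sketch of what a full node could keep for each light client connection; the type, field names, and the limit of one in-flight request are assumptions made for this example (the limit of one matches the conclusions below, but the exact number is a policy choice):

```rust
/// Hypothetical per-connection state kept by a full node for each
/// connected light client.
struct LightClientConnection {
    /// Requests currently being answered on this connection.
    in_flight_requests: usize,
}

/// Illustrative policy: at most one request in flight per connection,
/// so that every connection gets the same share of the node's resources.
const MAX_IN_FLIGHT_PER_CONNECTION: usize = 1;

impl LightClientConnection {
    /// Decides whether a newly received request is processed now or
    /// rejected because the connection already uses its share.
    fn admit_request(&mut self) -> bool {
        if self.in_flight_requests < MAX_IN_FLIGHT_PER_CONNECTION {
            self.in_flight_requests += 1;
            true
        } else {
            false
        }
    }

    /// Called once the response to a request has been sent back.
    fn request_finished(&mut self) {
        self.in_flight_requests -= 1;
    }
}
```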

Topology

Leaving the number of connections per full node unbounded means that an attack could target specific full nodes and is guaranteed to basically render them useless, as opposed to spreading the attack over all the full nodes.

However, because light clients randomly choose the full nodes they connect to, and because this connectivity is kept secret, it isn’t possible for an attacker to target one specific light client, so a targeted attack seems counter-productive.

Light clients should however periodically disconnect from the full node with the highest response time, as this could be an indication that it is under attack.
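As an illustration of this heuristic on the light client side, here is a rough sketch; the struct, the moving-average weighting, and the rotation policy are assumptions made for this example, not smoldot’s actual implementation:

```rust
use std::time::Duration;

/// Hypothetical statistics a light client could keep for each full node
/// it is connected to.
struct FullNodePeer {
    peer_id: String,
    /// Exponential moving average of recent response times.
    avg_response_time: Duration,
}

/// Updates the moving average after each answered request, giving a
/// weight of 1/8 to the newest sample.
fn record_response(peer: &mut FullNodePeer, elapsed: Duration) {
    peer.avg_response_time = (peer.avg_response_time * 7 + elapsed) / 8;
}

/// Called periodically: returns the peer to disconnect from, i.e. the one
/// with the highest average response time, as this could indicate that it
/// is overloaded or under attack. The replacement peer is then picked at
/// random, as in the rest of the design.
fn peer_to_rotate_out(peers: &[FullNodePeer]) -> Option<&FullNodePeer> {
    peers.iter().max_by_key(|p| p.avg_response_time)
}
```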

Conclusions

Here are the conclusions:

  • There shouldn’t be any limit to the number of light client connections a full node simultaneously maintains (currently there’s a limit at 200 I believe).

  • Each light client should be limited to one request simultaneously on each full node.

  • The code of the full node that handles light clients should be bounded in terms of CPU, bandwidth, and memory, in order to protect the rest of the full node (i.e. the block verification and authorship, the connections with other full nodes, the JSON-RPC server, etc.). While this seems complicated, it is easily done by making sure that all the light-client-related work is processed by at most N tasks (see the sketch after this list).

  • It is really important for light clients to randomly choose who they connect to, as it protects them from targeted attacks.

  • Ideally there should be network monitoring systems that allow “us” (the community) to react to the network being overwhelmed without being caught by surprise.
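Regarding the point about bounding the light-client-handling code to N tasks, here is a minimal sketch of the idea, assuming a tokio-based node; the channel size, worker count, and request type are purely illustrative:

```rust
use std::sync::Arc;
use tokio::sync::{mpsc, Mutex};

/// Hypothetical request type; in practice this would be a decoded light
/// client request plus a way to send the response back.
struct LightClientRequest;

/// Spawns exactly `num_workers` tasks dedicated to answering light client
/// requests. Whatever the number of connected light clients, the CPU spent
/// on them is bounded by these tasks; block verification, authorship,
/// connections to other full nodes, etc. run elsewhere and are protected.
fn spawn_light_client_workers(num_workers: usize) -> mpsc::Sender<LightClientRequest> {
    // Bounded channel: if the workers fall behind, senders have to wait,
    // which also bounds the memory used by pending requests.
    let (tx, rx) = mpsc::channel::<LightClientRequest>(1024);
    let rx = Arc::new(Mutex::new(rx));

    for _ in 0..num_workers {
        let rx = rx.clone();
        tokio::spawn(async move {
            loop {
                let request = rx.lock().await.recv().await;
                match request {
                    // Read from the database and send back the response here.
                    Some(_req) => {}
                    // All senders dropped: shut down the worker.
                    None => break,
                }
            }
        });
    }

    tx
}
```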


Hey @tomaka

Disclaimer: I have absolutely no clue about the full node and light client system architecture, so some points might be irrelevant.

How does a light node DoS attack differ from a “normal” RPC (HTTP) attack? You’re unable to scale the full node in both cases, and the transport layer should not play a distinguishing role, right?

The problem sounds similar to a Sybil attack and would thus require some sort of either proven identification or payment which - in your description - is not a go-to way. Is there any possibility to require a light client to lock an amount of DOT which - in case of abuse - could be withdrawn by the respective full node (think of validator slashing)? It’s like a middle ground between a whitelisting and a pay-per-request solution. There are follow-up questions like “how can we verify or prove that a light client misbehaves?” and the UX might suffer from this, but I’m just trying to throw another approach in here.

I think generally it’s extremely hard to limit request abuse unless the light-client has some sort of stake to lose.

However, because light clients randomly choose the full nodes they connect to, and because this connectivity is kept secret, it isn’t possible for an attacker to target one specific light client, so a targeted attack seems counter-productive.

Light clients should, however, periodically disconnect from the full node with the highest response time, as this could be an indication that it is under attack.

I’m unaware of the scale and of how exactly a light client discovers full nodes, but this could open up geo-targeted attacks. An attacker can DDoS full nodes in a (possibly) remote area and make sure that light clients drop the honest connections and thus only communicate with the malicious ones (as they have proper response times).

An RPC attack is basically what I’ve described as a “web2 system”. The owner of the RPC node is supposed to scale up by spawning more servers.

Because of the current situation where RPC operators ask for money from the treasury to operate, this could cause interesting problems for example if only one specific RPC node operator is targeted by an attack.

There are two problems with that idea:

  • It would be a terrible UX to ask an end user to pull out their credit card and buy some DOTs just to be able to access the network. It would also encourage the chain’s maintainers to continue offering a free service in the form of RPC nodes, which is insecure. A bit like with the Let’s Encrypt story, people are only going to move to the secure option if it’s free.

  • How do you define abuse? Whatever your implementation of “abuse” ends up being, an attacker is going to remain below the limit. The only way to punish someone in case of abuse is to keep the abuse detection system secret/closed source, which is what web2 services typically do. But we can’t do that here.

Yeah, that’s a good point. Note that I’m not suggesting connecting to the node with the lowest response time, simply disconnecting from the one with the highest response time, so it might take a looooot of time for this attack to be successful, as you’re still connecting at random. That being said, it’s still a valid point.

To add to the opening post, the direction I’m going with smoldot (which, as a reminder, is also a prototype of a full node) is:

  • Remove the concept of “in” slots. The number of incoming connections and peerings is unbounded.
  • The “out” slots still exist. The nodes to connect to are chosen randomly.
  • Connections are split in two categories: high priority ones, which include all the connections used in “out slots” plus the intra-validator connections, and low priority ones, which are all the others.
  • All the low priority connections share a configurable bandwidth and CPU limit. The high priority connections don’t have any limit other than the system’s. (A rough sketch of this follows the list.)
  • When it comes to collations, a connection where a successful collation was made is moved for a certain period of time to the high priority connections.
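To make the low priority budget concrete, here is a rough sketch of how the shared limit could be enforced; the categories mirror the list above, but the token-bucket style accounting and the numbers are assumptions made for this example, not smoldot’s actual code:

```rust
/// Connection categories as described above: "out slot" and intra-validator
/// connections are high priority, everything else (including all light
/// clients) is low priority.
#[derive(Clone, Copy, PartialEq)]
enum Priority {
    High,
    Low,
}

/// Budget shared by all low priority connections combined.
struct LowPriorityBudget {
    /// Bytes that may still be sent during the current time slice.
    bytes_remaining: u64,
}

impl LowPriorityBudget {
    /// Called at the start of every time slice (e.g. every second).
    fn refill(&mut self, bytes_per_slice: u64) {
        self.bytes_remaining = bytes_per_slice;
    }

    /// A connection asks to send `len` bytes. High priority connections
    /// bypass the budget entirely and are only limited by the system.
    fn try_consume(&mut self, priority: Priority, len: u64) -> bool {
        match priority {
            Priority::High => true,
            Priority::Low => {
                if self.bytes_remaining >= len {
                    self.bytes_remaining -= len;
                    true
                } else {
                    false
                }
            }
        }
    }
}
```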

This design would unfortunately be insanely difficult to port to Substrate, because of the architecture of rust-libp2p.

An RPC attack is basically what I’ve described as a “web2 system”. The owner of the RPC node is supposed to scale up by spawning more servers.

But this would be the exact same scenario as a light-client DDoS attack. You have to spawn more full nodes in both scenarios, as that’s the only way to scale it (regardless of how you expose the interface). Whereas in web2 you can scale different services (more RPC or REST endpoints, more caching instances, sharding on databases, etc.) depending on which type of system is under pressure.
To be fair, I have to read up more on the node architecture before being able to give more valuable input here.

It would be a terrible UX to ask an end user to pull out their credit card and buy some DOTs just to be able to access the network. It would also encourage the chain’s maintainers to continue offering a free service in the form of RPC nodes, which is insecure. A bit like with the Let’s Encrypt story, people are only going to move to the secure option if it’s free.

I totally agree

  • How do you define abuse? Whatever your implementation of “abuse” ends up being, an attacker is going to remain below the limit. The only way to punish someone in case of abuse is to keep the abuse detection system secret/closed source, which is what web2 services typically do. But we can’t do that here.
    Opening idle connections, sending an enormous number of requests per second, …

Another idea could be some sort of PoW proof where the difficulty depends on a combined metric of the points above and a successful proof opens a time-boxed session - but since you mentioned that introducing PoW is not desirable (and I agree to an extent), I guess it’s not really an option.

As I explain in the OP, I go with the assumption that we’re not capable of adjusting the number of nodes very swiftly (if at all).

Instead of relying on the idea that you can just add more servers, blockchain nodes have to be designed to handle a large number of incoming connections and degrade them all equally.

It’s clearly not a mind-blowing new way to handle DoS attacks, it’s just the conclusions of the differences between web2 and web3.

At first glance this idea sounds reasonable. Would you provide more details on why PoW to mitigate DDoS is not a way to go?

Go try to explain to light client users that their browser needs to do meaningless calculations for 30 seconds to 2 minutes before the UI even starts loading, draining their mobile battery in the process.


There shouldn’t be any limit to the number of light client connections a full node simultaneously maintains (currently there’s a limit at 200 I believe).

I’m thinking on the following scenario:

A single program acting as a light client exhausts all the connections by creating p2p nodes with different PeerIds within the same program.

Is there anything that prevents this?

The client puzzle mechanism should be dynamic, so it only applies if there are more than N connections, and the difficulty increases proportionally to the number of light client connections on the full node (or maybe the CPU and network load).
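As a rough sketch of what such a dynamic difficulty could look like (the thresholds and the mapping to leading-zero bits are purely illustrative assumptions):

```rust
/// Below this number of light client connections, no puzzle is required.
const FREE_CONNECTIONS: usize = 1_000;
/// Each additional batch of connections adds one bit of difficulty.
const CONNECTIONS_PER_EXTRA_BIT: usize = 2_000;
/// Hard cap so that legitimate clients can always solve the puzzle.
const MAX_DIFFICULTY_BITS: u32 = 24;

/// Maps the current number of light client connections (this could also be
/// a CPU or network load metric) to a puzzle difficulty, expressed here as
/// the number of leading zero bits required in a hash-based puzzle.
fn puzzle_difficulty(current_light_client_connections: usize) -> u32 {
    if current_light_client_connections <= FREE_CONNECTIONS {
        return 0; // No puzzle while the node is not under pressure.
    }
    let excess = current_light_client_connections - FREE_CONNECTIONS;
    let bits = (excess / CONNECTIONS_PER_EXTRA_BIT) as u32 + 1;
    bits.min(MAX_DIFFICULTY_BITS)
}
```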

This is a nice document on a solution for Onion services:

Still, it’s much better not to add the complexity of client puzzles, since the potential impact of a DDoS attack aimed at preventing light client connections is not clear.

In my opinion, if someone uses a large botnet (renting it or their own), their goal would be to bring down a network and not just target light clients… and of course that requires other kinds of countermeasures, even at the Autonomous Systems level… :smiley:


Didn’t we already conclude here that people need to somehow pay for light client data? Specifically state proof data.

People already pay for so-called “RPC servers”. So a protocol-enshrined data market doesn’t seem far-fetched. I also don’t think it’s far-fetched that bad actors will DDoS all full nodes with light client requests. I mean, just look at the griefing attack on Kusama.

What prevents an adversarial “light client” from spoofing itself as a “full node” in the network?

There’s no difference between bringing down the network and preventing people from using it. What’s the point of a running network that can’t be accessed by anyone?

I know that from a marketing standpoint you can claim that your network continued running, but if we’re honest it’s the same.

I’m not sure who concluded that. I remember mentioning that having a paid system in addition to the free system could be an idea, but not that it was the solution we should go forward with. Pragmatically speaking, an incentivization protocol is insanely hard to design.

Do you pay to connect to RPC servers? Because I don’t.

I don’t think this was a griefing attack? It seems like a boring software bug to me.

Nothing. However, it is much easier to resist DoS attacks coming from full nodes, as you can just refuse new full nodes when you’re full. Full nodes, contrary to light clients, stay online for a long time and have long-lived connections, so refusing new connections doesn’t have a huge impact. It’s also not a big deal if a full node takes several minutes to connect to the network, as opposed to light clients.

Basically, pretending to be a full node instead of a light client makes a DoS attack way more difficult.

Clear enough, thank you. The question was pointing out a problem with the idea of making the light clients pay to access the network (with the escrow account and signed requests mentioned in the other thread), since the role can be spoofed to bypass the payment.

Considering the possibility of network monitoring to detect blockchain protocol attacks, I have some thoughts and questions:

Detecting attacks at the Polkadot protocol layer seems challenging. Typical tap devices cannot inspect Noise-encrypted traffic, but they are effective in detecting high traffic originating from the same address range. So the focus should be on asymmetrical application-level requests and connection slot exhaustion without generating high traffic.

Could implementing a beacon node, i.e. a honeypot approach, specifically for security monitoring be an option worth exploring? Or better, a detection mechanism in the full nodes themselves?

What are the most asymmetric network exchanges, where a request causes a high load on the responder?

It’s a really interesting challenge :grin: Do you have any ideas on how to approach it?