This post consists of some notes on how to resist light client DDoS attacks.
The conclusions end up being a bit underwhelming, but I’ve decided to post this anyway, because why not.
Problem statement
Light clients, by definition, do not store the chain's storage. Whenever they need to access some data (e.g. the balance of an account), they send a network request to a full node, and the full node answers.
A full node needs a bit of time in order to read from its database (due to limited CPU and disk speed) and to send back the response (due to its limited bandwidth). If a full node receives more requests per second than it is able to answer, it gets overloaded. In that situation, the queue of unanswered requests accumulates, each request spends more and more time in the waiting list, and the time before a request is answered keeps increasing. This degrades the quality of the service offered to light clients.
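The overload dynamic can be illustrated with a toy calculation: as long as the arrival rate exceeds the node's service capacity, the backlog grows without bound. Here is a minimal sketch; the `backlog_after` helper and the numbers below are invented purely for illustration.

```rust
/// Toy model of a full node's request queue: the node can answer `capacity`
/// requests per second, while `arrivals` requests come in per second.
/// Returns how many requests are still waiting after `seconds` seconds.
fn backlog_after(seconds: u64, arrivals: u64, capacity: u64) -> u64 {
    let mut backlog = 0u64;
    for _ in 0..seconds {
        backlog += arrivals;
        // The node answers at most `capacity` requests during this second.
        backlog = backlog.saturating_sub(capacity);
    }
    backlog
}
```

For example, at 120 requests per second against a capacity of 100, the backlog grows by 20 every second; after a minute, 1200 requests are waiting, and every new request sits behind them. At or below capacity, the queue stays empty.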
While it is possible (and desirable) to reduce the number and size of the requests that a light client implementation sends to full nodes, it is not possible to differentiate an “optimized light client” that reduces its number of requests to a minimum from a “wasteful light client”.
It is also not possible to detect when two light clients are actually controlled by the same user or the same machine, as we don’t want to introduce tracking, currencies (end users don’t want to pay just to access the chain), or proof of work.
For this reason, a malicious actor can spawn a very large number of light clients and make them send tons of requests per second, with the objective of overloading the full nodes and degrading the quality of service for legitimate light client users. This is called a DoS or DDoS attack.
How this is traditionally solved
This problem is pretty similar to the problem of resisting (D)DoS attacks in a traditional web2 client-server architecture, where clients are equivalent to light clients and servers equivalent to the full nodes.
The way this problem is traditionally solved is by having an upper bound on the number of connections that a server maintains at any given time, and an upper bound on the number of simultaneous requests per connection. Doing so consequently puts an upper bound on the time it takes for a request to be answered, and thus guarantees a minimum quality of service to each connection.
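As a sketch of what these two bounds look like in code: the server refuses new connections beyond a fixed cap, and refuses new requests on a connection beyond a per-connection cap. All names and limit values here are invented for illustration, not taken from any real server.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical limits; a real deployment would tune these.
const MAX_CONNECTIONS: usize = 1_000;
const MAX_INFLIGHT_REQUESTS: usize = 4;

struct Server {
    connections: AtomicUsize,
}

struct Connection {
    inflight: AtomicUsize,
}

impl Server {
    /// Accepts a new connection, or refuses it if the cap is reached.
    fn try_accept(&self) -> Option<Connection> {
        self.connections
            .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |n| {
                (n < MAX_CONNECTIONS).then(|| n + 1)
            })
            .ok()
            .map(|_| Connection { inflight: AtomicUsize::new(0) })
    }
}

impl Connection {
    /// Starts a request, or refuses it if too many are already in flight.
    fn try_start_request(&self) -> bool {
        self.inflight
            .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |n| {
                (n < MAX_INFLIGHT_REQUESTS).then(|| n + 1)
            })
            .is_ok()
    }

    /// Marks a request as answered, freeing one in-flight slot.
    fn finish_request(&self) {
        self.inflight.fetch_sub(1, Ordering::SeqCst);
    }
}
```

Because both caps are hard limits rather than queues, the worst-case load on the server is `MAX_CONNECTIONS × MAX_INFLIGHT_REQUESTS` requests, which directly bounds the worst-case response time.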
On top of this, the maintainer of the service then dynamically adds or removes servers based on the number of active clients.
If a large number of clients suddenly appears, the maintainer starts new servers in order to increase the total number of connections that can be maintained, and thus “absorb” the load.
In case of an attack, the maintainer of the service spends money running the additional servers, but the attacker also spends money sending the requests. “Resisting” the attack is then about waiting until the attacker gives up.
Adapting this solution to web3
The solution in the previous section can’t be applied as-is for web3 services, as nobody is here to dynamically add (or remove) “servers” when the load increases.
There might exist “manual” solutions, such as having node operators automatically spawn new servers and having the treasury reimburse the cost of resisting a DoS attack. But these are tricky: it is very hard to verify the amount of money requested, and they don’t encourage reducing the resource consumption of a full node. This might even encourage full node operators to start DoS attacks themselves in order to get money in return.
If we assume that the number of full nodes/servers can only be adjusted very slowly, then putting an upper bound on the number of simultaneous connections that each server maintains would make a DoS attack trivial, as the attacker could simply open as many connections as possible to all the full nodes it can find, until nobody else can connect anymore.
(note: as mentioned in the problem statement, we don’t want to introduce measures such as filtering by IP)
Instead, there is no choice but to make the number of connections per server unbounded.
(Note that, because each connection maintains some state, there is always some kind of upper bound derived from the amount of memory available. However, this bound can in practice be on the order of several million, which can basically be considered unbounded.)
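As a back-of-the-envelope check (the 2 KiB per-connection figure below is an assumption for the sake of the arithmetic, not a measured value):

```rust
/// Rough upper bound on simultaneous connections, derived purely from the
/// memory available and the state kept per connection.
fn max_connections(memory_bytes: u64, per_connection_bytes: u64) -> u64 {
    memory_bytes / per_connection_bytes
}
```

With 16 GiB of RAM and an assumed 2 KiB of state per connection, that gives 2^34 / 2^11 = 2^23, roughly 8.4 million connections, hence "several million".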
Note that the number of simultaneous requests per connection should still be bounded, as this guarantees an equal distribution of resources between connections on each full node. This reduces the attacker’s leverage from two multiplicative factors (number of connections times requests per connection) to just one (number of connections).
Topology
Leaving the number of connections per full node unbounded means that an attacker could target specific full nodes and essentially render them useless, as opposed to spreading the attack over all the full nodes.
However, because light clients randomly choose the full nodes they connect to, and because this connectivity is kept secret, it isn’t possible for an attacker to target one specific light client, and a targeted attack thus seems counter-productive.
Light clients should, however, periodically disconnect from the full node with the highest response time, as a high response time could be an indication that the node is under attack.
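This eviction policy can be sketched as follows: keep a window of recent response times per full node, and periodically pick the node with the worst average as the candidate for disconnection. This is an illustrative sketch only, not actual smoldot logic; the names are invented.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Returns the peer with the highest average response time over its recent
/// samples, i.e. the best candidate for periodic disconnection.
fn slowest_peer(latencies: &HashMap<String, Vec<Duration>>) -> Option<String> {
    latencies
        .iter()
        // Ignore peers we have no measurements for yet.
        .filter(|(_, samples)| !samples.is_empty())
        .max_by_key(|(_, samples)| {
            // Average response time of this peer's recent requests.
            samples.iter().sum::<Duration>() / samples.len() as u32
        })
        .map(|(peer, _)| peer.clone())
}
```

After disconnecting the slowest node, the light client would pick a replacement at random, preserving the randomness that protects it from targeted attacks.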
Conclusions
Here are the conclusions:
- There shouldn’t be any limit on the number of light client connections a full node simultaneously maintains (currently I believe there’s a limit of 200).
- Each light client should be limited to one simultaneous request per full node.
- The code of the full node that handles light clients should be bounded in terms of CPU, bandwidth, and memory, in order to protect the rest of the full node (i.e. block verification and authorship, the connections with the other full nodes, the JSON-RPC server, etc.). While this seems complicated, it is easily done by making sure that everything light-client-related is processed by at most N tasks.
- It is really important for light clients to randomly choose the full nodes they connect to, as this protects them from targeted attacks.
- Ideally there should be network monitoring systems that allow “us” (the community) to react to the network being overwhelmed without being caught by surprise.
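The point about bounding light-client handling to at most N tasks can be sketched as a fixed worker pool: every light-client request, no matter which connection it arrived on, is pushed onto a single queue consumed by exactly N workers, so light-client traffic can never occupy more than N cores. This is a sketch under assumptions, not actual Substrate or smoldot code; all names are invented.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Spawns `n` worker threads that all pull light-client requests from one
/// shared queue. Returns the sender used to submit requests; when all
/// senders are dropped, the workers shut down.
fn spawn_light_client_pool<F>(n: usize, handle: F) -> mpsc::Sender<u64>
where
    F: Fn(u64) + Send + Sync + 'static,
{
    let (tx, rx) = mpsc::channel::<u64>();
    let rx = Arc::new(Mutex::new(rx));
    let handle = Arc::new(handle);
    for _ in 0..n {
        let rx = Arc::clone(&rx);
        let handle = Arc::clone(&handle);
        thread::spawn(move || loop {
            // Take the lock only long enough to pull the next request.
            let next = rx.lock().unwrap().recv();
            match next {
                Ok(request) => handle(request),
                Err(_) => break, // all senders dropped; shut down
            }
        });
    }
    tx
}
```

Whatever the number of open connections, at most `n` requests are being processed at any moment; excess requests simply wait in the queue, leaving the rest of the node's resources untouched.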