We ran into a problem during transaction throughput testing on our Kusama parachain: transactions get stuck in a revalidation loop, apparently on the gateway nodes.
Context
Our network consists of 7 nodes:
- 3 collators - parachain-collator-swe🇸🇪, parachain-collator-deu🇩🇪, parachain-collator-ita🇮🇹
- 3 gateways - parachain-gateway-kor🇰🇷, parachain-gateway-deu🇩🇪, parachain-gateway-usa🇺🇸
- 1 extra archive node - parachain-archive
Only gateways can have peerings with collators, so there is no direct connectivity between parachain-archive and collators.
We graphed the txpool_validations_scheduled node metric during the test; the numbers in parentheses in the steps that follow refer to marks on that graph.
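For anyone who wants to watch the same metric without a dashboard, here is a minimal sketch that reads one metric out of a Prometheus text-format scrape. It assumes the node exposes its metrics in that format; `readMetric` and the sample text are hypothetical, and the exported metric name may carry a prefix on a real node, so pass whatever name your node actually reports.

```typescript
// Hypothetical helper: sum every sample of one metric in a Prometheus
// text-format scrape. Returns undefined if the metric is not present.
function readMetric(body: string, name: string): number | undefined {
  let total: number | undefined;
  for (const raw of body.split("\n")) {
    const line = raw.trim();
    // Skip blank lines and "# HELP" / "# TYPE" comment lines.
    if (line === "" || line.startsWith("#")) continue;
    // Sample line layout: name, optional {labels}, value, optional timestamp.
    const m = /^([A-Za-z_:][A-Za-z0-9_:]*)(\{.*\})?\s+(\S+)/.exec(line);
    if (!m || m[1] !== name) continue;
    const value = Number(m[3]);
    if (Number.isNaN(value)) continue;
    total = (total ?? 0) + value;
  }
  return total;
}

// Illustrative scrape body (made-up values, not data from our nodes).
const sampleScrape = [
  "# TYPE txpool_validations_scheduled counter",
  'txpool_validations_scheduled{chain="example"} 1200',
  "some_other_metric 7",
].join("\n");

const scheduled = readMetric(sampleScrape, "txpool_validations_scheduled");
```

Since the metric is a counter, the interesting signal is how fast it grows between two scrapes rather than its absolute value.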
During testing, we spam one of the gateways with a large number of balance.transfer calls. The client was located in Europe for every step except (3).
(Load script code: benchmark.ts · GitHub)
We started with parachain-gateway-usa🇺🇸 as our first target, and everything went smoothly; collators handled every transaction (first spike on the graph, 1).
Then we proceeded with parachain-gateway-kor🇰🇷, and everything went well (second spike, 2).
Then we retried with parachain-gateway-usa🇺🇸 again, but with the client located in North America, resulting in success (third spike, 3). The sender location doesn’t make a difference.
But then, interesting things started to happen.
During testing of parachain-gateway-deu🇩🇪 (orange line on the graph), we filled the transaction pool (4), and the collators executed a couple of transactions. The rest then got stuck in a loop on the gateway, being revalidated and moving from the Ready state to the Future state and back (spiky orange line on the graph, 5).
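To make "stuck in a loop" a bit more concrete, here is a rough sketch of how the Ready/Future flapping could be detected from periodic pool samples. Everything in it is an assumption for illustration: the `PoolSample` shape, the function names, and the threshold are hypothetical, and the samples would have to be collected by you (for example from node metrics or logs).

```typescript
// Hypothetical sample of a node's transaction pool at one point in time.
interface PoolSample {
  ready: number;  // transactions in the Ready state
  future: number; // transactions in the Future state
}

// Count how often the Ready count changes direction (rise to fall or back).
function countReversals(samples: PoolSample[]): number {
  let reversals = 0;
  let lastSign = 0;
  for (let i = 1; i < samples.length; i++) {
    const sign = Math.sign(samples[i].ready - samples[i - 1].ready);
    if (sign === 0) continue; // no movement, ignore
    if (lastSign !== 0 && sign !== lastSign) reversals++;
    lastSign = sign;
  }
  return reversals;
}

// Heuristic: the pool is not draining overall, yet Ready keeps flipping.
function looksStuck(samples: PoolSample[], minReversals = 3): boolean {
  if (samples.length < 2) return false;
  const total = (s: PoolSample): number => s.ready + s.future;
  const drained = total(samples[0]) - total(samples[samples.length - 1]);
  return drained <= 0 && countReversals(samples) >= minReversals;
}

// Made-up series: one that flaps between Ready and Future, one that drains.
const flapping: PoolSample[] = [
  { ready: 100, future: 0 },
  { ready: 20, future: 80 },
  { ready: 90, future: 10 },
  { ready: 15, future: 85 },
  { ready: 95, future: 5 },
];
const draining: PoolSample[] = [
  { ready: 100, future: 0 },
  { ready: 60, future: 0 },
  { ready: 20, future: 0 },
  { ready: 0, future: 0 },
];
```

With these made-up series, `looksStuck(flapping)` is true and `looksStuck(draining)` is false. A check like this only flags the symptom; it says nothing about why the collators stop accepting the transactions.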
Then we restarted parachain-gateway-kor🇰🇷 (6), collators executed another part of the initial batch of transactions, and the rest were stuck in the same loop again. Now there are two nodes in this loop: parachain-gateway-deu🇩🇪 and parachain-gateway-kor🇰🇷. (7)
Then we restarted parachain-gateway-usa🇺🇸, with the same result as for parachain-gateway-kor🇰🇷 (8).
Then we restarted parachain-archive (9), and no transactions were executed at all. So restarting a gateway seems to work like a pump: the restarted gateway picks up part of the transaction pool, manages to send some of it to the collators, and then stops for some reason.
And finally, we restarted parachain-collator-deu🇩🇪 (10), and every transaction was finally processed (11).
This behaviour is reproducible, and we have found that adding 100ms of latency to the parachain-gateway-deu🇩🇪 network makes the issue go away.
Summary
parachain-gateway-deu🇩🇪 gets stuck under load, and only a collator restart resolves the issue.
We also tested with transactions sent from Europe and North America (to parachain-gateway-deu🇩🇪 in both cases) with the same result: sender location doesn’t make a difference, and only transactions sent to parachain-gateway-deu🇩🇪 get stuck.
What might be the cause of this behaviour? What can we do to prevent a malicious actor from putting our gateways into this state?