Parachain gateway stuck on transaction revalidation

We ran into a problem during transaction throughput testing on our Kusama parachain: under load, transactions on a gateway node get stuck in what looks like a revalidation loop.

Context

Our network consists of 7 nodes:

  • 3 collators - parachain-collator-swe🇸🇪, parachain-collator-deu🇩🇪, parachain-collator-ita🇮🇹
  • 3 gateways - parachain-gateway-kor🇰🇷, parachain-gateway-deu🇩🇪, parachain-gateway-usa🇺🇸
  • 1 extra archive node - parachain-archive

Only gateways can have peerings with collators, so there is no direct connectivity between parachain-archive and collators.

We have a graph of the txpool_validations_scheduled node metric; the numbers in parentheses below refer to points on that graph.

During our testing, we spam one of the gateways with a large number of balances.transfer calls. The client was located in Europe for every step except (3).
(Load script code: benchmark.ts · GitHub)

We started with parachain-gateway-usa🇺🇸 as our first target, and everything went smoothly; collators handled every transaction (first spike on the graph, 1).

Then we proceeded with parachain-gateway-kor🇰🇷, and everything went well (second spike, 2).

Then we retried with parachain-gateway-usa🇺🇸 again, but with the client located in North America, resulting in success (third spike, 3). The sender location doesn’t make a difference.

But then, interesting things started to happen.

While testing parachain-gateway-deu🇩🇪 (orange line on the graph), we filled the transaction pool with transactions (4), and the collators executed a few of them. The rest then got stuck in a loop on the gateway: they were revalidated over and over, moving from the Ready state to the Future state and back (spiky orange line on the graph, 5).
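For readers less familiar with the Ready and Future queues, here is a deliberately simplified model of how a nonce-ordered pool sorts one sender's transactions. It is a sketch only, not the actual Substrate pool code: the function name partitionByNonce and the idea of representing transactions by bare nonces are assumptions made for illustration. It shows that the same set of transactions can flip between Ready and Future depending on which account nonce the node validates against. Whether that is what happens in our case is not established.

```typescript
// Simplified, hypothetical model: one sender, each transaction is just its nonce.
// Real pools use "requires"/"provides" tags; this only mimics the nonce case.
interface Partition {
  ready: number[];  // can be included now, in order
  future: number[]; // waiting for a missing earlier nonce
  stale: number[];  // nonce already used on chain
}

function partitionByNonce(accountNonce: number, txNonces: number[]): Partition {
  // Deduplicate and sort so we can walk the nonces in order.
  const sorted = [...new Set(txNonces)].sort((a, b) => a - b);
  const result: Partition = { ready: [], future: [], stale: [] };
  let expected = accountNonce;
  for (const nonce of sorted) {
    if (nonce < accountNonce) {
      result.stale.push(nonce);
    } else if (nonce === expected) {
      result.ready.push(nonce);
      expected += 1; // the next contiguous nonce becomes eligible
    } else {
      result.future.push(nonce); // there is a gap before this nonce
    }
  }
  return result;
}

// Same transactions, validated against two different views of the account nonce.
const pooled = [5, 6, 7, 9];
const againstNonce5 = partitionByNonce(5, pooled); // 5, 6, 7 ready; 9 future
const againstNonce4 = partitionByNonce(4, pooled); // nothing ready, all future
```

In this model, a node that alternates between two views of the chain state would see the same transactions bounce between the two queues on every revalidation.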

Then we restarted parachain-gateway-kor🇰🇷 (6), collators executed another part of the initial batch of transactions, and the rest were stuck in the same loop again. Now there are two nodes in this loop: parachain-gateway-deu🇩🇪 and parachain-gateway-kor🇰🇷. (7)

Then we restarted parachain-gateway-usa🇺🇸, with the same result as for parachain-gateway-kor🇰🇷 (8).

Then we restarted parachain-archive (9), and no transactions were executed at all. So restarting a gateway seems to work like a pump: the restarted gateway gathers part of the transaction pool, manages to send some of it to the collators, and then stops for some reason?

And finally, we restarted parachain-collator-deu🇩🇪 (10), and every transaction was finally processed (11).

This behaviour is reproducible, and we have found that adding 100 ms of network latency to parachain-gateway-deu🇩🇪 makes the issue go away.
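To make reproduction runs easier to compare, the stuck state can be flagged automatically from periodic samples instead of by reading the graph. The heuristic below is only a sketch under assumptions: the PoolSample shape, the looksStuck name, and the threshold of three intervals are all made up for illustration, and the caller is expected to fill the samples from the txpool_validations_scheduled counter and from whatever pending (Ready plus Future) count the node exposes.

```typescript
// Hypothetical sample shape: one reading per scrape interval.
interface PoolSample {
  validationsScheduled: number; // cumulative counter, only ever grows
  pending: number;              // Ready + Future transactions in the pool
}

// Heuristic: the pool "looks stuck" when, for minIntervals consecutive
// intervals, validations keep being scheduled while the pool is not draining.
function looksStuck(samples: PoolSample[], minIntervals: number = 3): boolean {
  let streak = 0;
  for (let i = 1; i < samples.length; i++) {
    const prev = samples[i - 1];
    const cur = samples[i];
    const revalidating = cur.validationsScheduled > prev.validationsScheduled;
    const notDraining = cur.pending > 0 && cur.pending >= prev.pending;
    streak = revalidating && notDraining ? streak + 1 : 0;
    if (streak >= minIntervals) return true;
  }
  return false;
}

// Made-up sample data for illustration, not measurements from our nodes.
const stuckRun: PoolSample[] = [
  { validationsScheduled: 100, pending: 50 },
  { validationsScheduled: 120, pending: 50 },
  { validationsScheduled: 140, pending: 50 },
  { validationsScheduled: 160, pending: 50 },
];
const healthyRun: PoolSample[] = [
  { validationsScheduled: 100, pending: 50 },
  { validationsScheduled: 120, pending: 40 },
  { validationsScheduled: 140, pending: 30 },
  { validationsScheduled: 160, pending: 0 },
];
const stuckDetected = looksStuck(stuckRun);     // true
const healthyDetected = looksStuck(healthyRun); // false
```

A check like this could be run once with and once without the added latency to confirm that the difference is consistent across runs.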

Summary

parachain-gateway-deu🇩🇪 gets stuck under load, and only a collator restart resolves the issue.
We also sent transactions to parachain-gateway-deu🇩🇪 from both Europe and North America with the same result, so the sender location doesn’t make a difference; only transactions sent to parachain-gateway-deu🇩🇪 get stuck.

What might be the cause of this behaviour? What can we do to prevent a malicious actor from forcing our gateways into this state?

This is more suitable as a Substrate/Cumulus GitHub issue, rather than a forum post. Not dismissing it, but just want to make sure it lands in the right place for any investigation/bugfixing.