The fork-aware transaction pool

I’d like to share that the new fork-aware transaction pool implementation for Substrate-based nodes is ready and available for testing. This improvement addresses several key issues with the previous transaction pool, which often led to stuck or dropped transactions.

What was wrong with the old implementation?

In short, the previous transaction pool provided an invalid ready set of transactions to the block builder, causing valid transactions to be dropped on collators. This, in turn, resulted in transactions being stuck on the RPC nodes.

This issue is easy to reproduce locally in a small network (e.g., 1 RPC node, 2 collators). After submitting a large number of transactions (enough to fill 3-4 blocks) from a single account, some of the transactions would never make it into a block, becoming stuck on the RPC node. (I can share more details and setup instructions if anyone is interested.)

More technical details about the root cause can be found here and here.

How does the fork-aware transaction pool fix this?

The new implementation maintains a transaction pool state for every fork, meaning it keeps a ready set of transactions for each block that’s notified to it. This ensures that the block builder always works with the most up-to-date ready set of transactions, aligned with the state of the given fork. No future or stale transactions are provided to the block builder.
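
To make this a bit more concrete, here is a tiny conceptual sketch of the idea; the names and shapes below are made up for illustration only and do not reflect the actual code:

```rust
use std::collections::HashMap;

type BlockHash = [u8; 32];

/// Opaque encoded extrinsic (placeholder for illustration).
#[derive(Clone)]
struct Transaction(Vec<u8>);

/// One pool "view": the transactions as seen from a particular block.
struct View {
    ready: Vec<Transaction>,  // ready for inclusion when building on this block
    future: Vec<Transaction>, // not yet valid on this fork (e.g. nonce gaps)
}

struct ForkAwarePool {
    views: HashMap<BlockHash, View>,
}

impl ForkAwarePool {
    /// Called for every block the pool is notified about: the content is
    /// (re)validated against that block's state and stored as its own view.
    fn on_new_block(&mut self, hash: BlockHash, view: View) {
        self.views.insert(hash, view);
    }

    /// The block builder asks for the ready set of the exact fork it builds
    /// on, so stale or future transactions never leak into block building.
    fn ready_at(&self, at: &BlockHash) -> Option<&[Transaction]> {
        self.views.get(at).map(|v| v.ready.as_slice())
    }
}

fn main() {
    let mut pool = ForkAwarePool { views: HashMap::new() };
    let block_a = [0u8; 32];
    pool.on_new_block(
        block_a,
        View { ready: vec![Transaction(vec![1, 2, 3])], future: vec![] },
    );
    // The builder working on top of `block_a` gets exactly that view's ready set.
    assert_eq!(pool.ready_at(&block_a).map(|r| r.len()), Some(1));
}
```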

For a deeper dive into the internals of the transaction pool, you can check out an initial proposal and also this document.

What’s next?

While the new implementation is ready for testing and should resolve the major issues, there are still areas for improvement. A list of known future work and enhancements has been documented, and I’d appreciate your input on prioritizing these.

Feel free to check the planned future work here. If you have specific priorities, please let me know or give a thumbs-up on the GitHub issues related to the future work to help prioritize what should be tackled next.

Grafana

I’ve also set up a Grafana dashboard with insights into the new transaction pool mechanics. You can view it here.

If anyone is interested, I can export and share the dashboard JSON.

How to run it?

The PR description provides a code snippet that needs to be integrated into a custom node. For the omni-node, the fork-aware transaction pool is already available.
To enable the new pool, simply use the following command-line argument: --pool-type=fork-aware.
For debugging purposes, the following log settings will be helpful: "-lbasic-authorship=debug -ltxpool=debug".
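
For example, a full invocation could look roughly like this (the binary name and chain-spec path are placeholders here, adjust them to your own setup): your-node-binary --chain=your-chain-spec.json --pool-type=fork-aware -lbasic-authorship=debug -ltxpool=debug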

Thank you all for your support and patience throughout this process. Your feedback and testing will be invaluable as we continue refining the transaction pool. If you encounter any issues or have suggestions, don’t hesitate to reach out. I’ll be glad to help.

Amazing work @michal! Thanks a lot for putting all the hard work into making this happen!

Awesome!

All Polkadot developers have suffered from BABE’s forkfulness, but this fight improved robustness, as with this improved mempool. We’ll solve BABE’s forkfulness with Sassafras aka Safrole, but all these improvements live on in Polkadot.

JAM development should take this into consideration: JAM is being developed on Sassafras aka Safrole, so without the forks, and JAM simplifies the code base, so we risk losing many robustness improvements in JAM. JAM testing should adopt BABE-like forkfulness, maybe a Sassafras+Aura testing mode, or simply configure Sassafras to double-issue slot tickets.

Working on testing this with the omni-node on 50% of our testnet collators now and it appears to have resolved our transaction pool issues due to forking.

Thanks for all the hard work on getting this out! Will report if we have any issues with it.

Great to hear that, thanks for testing. Let me know if you encounter any issues.

Actually, I didn’t have enough time to test the old and new pool in the same network (theoretically they may have different views on what is a valid transaction, which could in turn lead to some reputation adjustments). I planned this check as a follow-up issue.

Integritee-paseo is now running one “old” collator and one polkadot-parachain@ed231828 OmniNode with --pool-type=fork-aware. So far, our issue seems solved. Will report back if we see any trouble in the next few days.

Thanks a lot for this @michal !

I can confirm: we stressed our mainnet with 136079 extrinsics from a single account within roughly a week (https://integritee.subscan.io/tools/charts?type=extrinsic).

No issues so far with dropped extrinsics.

Frequency Mainnet has also been running 3 of 8 collators with fatxpool over the last week. No issues once we also moved to 6s block time (I assume so that transactions don’t time out before they get to one of those 3 with the fatxpool).

Thanks again for the work on this!

That sounds like great news! Thanks a lot for testing this out!

Just out of curiosity, have you tried any other tests? If so, what kind of tests did you run?

Main scenarios I was already testing on my side before merging:

  • many txs from single account,
  • single tx from multiple accounts,
  • many txs from multiple accounts,
  • checking if limits are obeyed,

This was done in different network configurations: only collators, collators + RPC nodes, and also for relay-chain nodes.
It was also done for different weights of the transactions (which result in a different maximum number of txs in a block).
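
In case it helps anyone reproduce the setup, here is a minimal sketch of how the first scenario (many txs from a single account) can be scripted with subxt; the endpoint, amount and transaction count are my own placeholders, and the dev accounts only exist on test chains (needs the subxt, subxt-signer and tokio crates):

```rust
// Minimal sketch, assuming a local test chain with dev accounts and the
// Balances pallet; endpoint, amount and tx count are arbitrary placeholders.
use subxt::{dynamic::{tx, Value}, OnlineClient, PolkadotConfig};
use subxt_signer::sr25519::dev;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let api = OnlineClient::<PolkadotConfig>::from_url("ws://127.0.0.1:9944").await?;
    let alice = dev::alice();
    let bob = dev::bob().public_key();

    // Submit enough transfers from a single account to fill a few blocks.
    for i in 0..2_000u32 {
        let call = tx(
            "Balances",
            "transfer_keep_alive",
            vec![
                Value::unnamed_variant("Id", [Value::from_bytes(bob.0)]),
                Value::u128(1_000_000_000),
            ],
        );
        // sign_and_submit_default asks the node for the next nonce, which
        // also accounts for transactions already sitting in the pool.
        let hash = api.tx().sign_and_submit_default(&call, &alice).await?;
        if i % 500 == 0 {
            println!("submitted {i}: {hash:?}");
        }
    }
    Ok(())
}
```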

I am now working on testing (and improving if needed) transaction priorities, and next I will take a closer look at testing mortal transactions.

If you have any thoughts on how to test it better and what should be checked, please share.

All three of these have happened without issue on Frequency:

  • many txs from single account,
  • many txs from multiple accounts,
  • single tx from multiple accounts (a bit less of this one, however).

We haven’t really worried about limits. Our setup currently has only collators. Frequency also has a good range of tx weights running through it, so we’re good there as well.

I think my largest worry is the number of forks. Sometimes finality can lag, and I’m not sure what issues that might cause.

Could you share the number of forks, and what the finality lag is? Checking the finality lag is also something on my todo list.

I have not been tracking the number of forks, so I don’t have that data. It is just a worry around pool limits.

Finality lag is when the relay chain is taking a long time to confirm blocks. I’ve seen 20+ blocks when the lag is large (my alarms trip at a lag of 10 for more than 5 minutes). When everything is healthy, the finality lag (with 6s block times) is 6-7 blocks.

So if the relay chain had issues, the finality lag (and thus the potential number of forks) could be high. Not sure how high, but 50 blocks would not be out of the question for a minor incident.

This is a valid concern. To solve it, we are planning to add an LRU cache for views, so the pool will only keep views for up to a given, pre-defined number of blocks/forks. I’ll bump this up on my todo list.
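
To sketch the direction (this is only an illustration of the idea, not the actual implementation; the names and the bound are made up, and a real LRU would also re-order entries on access):

```rust
// Illustration only: keep at most MAX_VIEWS per-fork views, dropping the
// oldest one when a new block/fork is reported and the bound is reached.
use std::collections::{HashMap, VecDeque};

const MAX_VIEWS: usize = 32; // hypothetical pre-defined bound on blocks/forks

struct BoundedViews<V> {
    order: VecDeque<[u8; 32]>,   // insertion order of block hashes
    views: HashMap<[u8; 32], V>, // per-block (per-fork) pool views
}

impl<V> BoundedViews<V> {
    fn new() -> Self {
        Self { order: VecDeque::new(), views: HashMap::new() }
    }

    fn insert(&mut self, block_hash: [u8; 32], view: V) {
        if self.views.len() >= MAX_VIEWS {
            // Evict the oldest view so memory stays bounded even if
            // finality lags and many blocks/forks pile up.
            if let Some(oldest) = self.order.pop_front() {
                self.views.remove(&oldest);
            }
        }
        self.order.push_back(block_hash);
        self.views.insert(block_hash, view);
    }
}

fn main() {
    let mut views: BoundedViews<&str> = BoundedViews::new();
    for i in 0..40u8 {
        views.insert([i; 32], "view");
    }
    // Even after 40 blocks, only MAX_VIEWS views are retained.
    assert_eq!(views.views.len(), MAX_VIEWS);
}
```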

As a rule, relay chain finality lags when ELVES lags because more validators no-show than usual. There are many nasty reasons this could happen, including bugs, but often it just means some under-specced validator got elected, not really a big deal.

If finality lags, then the relay chain still follows the longest chain, well, somewhat, so parachains should be trying to do so too. I’m not sure how this is handled now.