Optimal filesystem for blockchain databases

I’ve been wrestling with subpar write performance using ZFS on some fresh-out-the-oven NVMe disks, trying to get them to play nice with blockchain workloads. Given ZFS’s reputation for being pretty tweakable, I thought I could dial it in: mess with the recordsize, tune a few parameters, and boom, we’d be golden.

But, oh boy, was I wrong. It’s been more of a headache than expected. And trying to find any sort of guidance or benchmarks online? Forget about it. It’s like looking for the proverbial needle in a haystack, except the needle may never have been there to begin with.

So, with the internet giving me the cold shoulder, I decided it was time to roll up my sleeves and get my hands dirty. For the benchmarking setup, I used the Polkadot client benchmark for write operations and fio for read operations:

```sh
curl -s https://api.github.com/gists/868fac581cf66af7780a1ccfd02c7b1b \
  | jq -r '.files | first(.[]).content' > chainbench.sh \
  && chmod +x chainbench.sh && ./chainbench.sh
```
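
For context, the read side is a fairly standard fio random-read job. The invocation below is only illustrative (file path, size, and queue depth are placeholders; the exact parameters live in the gist):

```sh
# Illustrative fio random-read job; exact parameters are in the gist.
fio --name=randread --filename=/srv/node/fio-test.bin \
    --rw=randread --bs=4k --size=8G --ioengine=libaio \
    --iodepth=32 --direct=1 --runtime=60 --time_based \
    --group_reporting
```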

And let me lay out the hardware carpet for ya:

  • CPU: AMD Ryzen 9 7950X3D
  • Motherboard: ASRock Rack B650D4U-2L2T/BCM
  • RAM: 4x MICRON DDR5 UDIMM/ECC 4800MT/s, dialed down to 3600MT/s (CPU limitation afaik)
  • Storage: Samsung SSD 990 PRO 4TB on a PCIe 4.0 interface.

The gauntlet was thrown, and after the dust settled, it turns out ext4 and xfs are the champs for Polkadot’s storage needs. They hit that sweet spot of efficiency. Meanwhile, filesystems like ZFS and NILFS2? Not so much. They kinda dropped the ball, especially with random writes, which, let’s be honest, is a big deal for blockchain operations.

So I highly recommend the trade-off of giving up feature-rich filesystems for the simplicity and performance of ext4. You can pretty much skip snapshots/backups for validators, since with a good blockchain you can always re-sync from the network if there is ever an issue (warp sync is doable in minutes).
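
If you ever do need that re-sync, it is roughly a matter of pointing the client at an empty base path with warp sync enabled. A minimal sketch, assuming a recent polkadot release (double-check the flags against your client version; the base path is just an example):

```sh
# Re-sync a validator from scratch using warp sync.
# Base path and chain are examples; flags per recent polkadot releases.
polkadot --validator --chain polkadot \
  --base-path /srv/polkadot \
  --sync warp
```
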
For some archive nodes it might make sense to use copy-on-write filesystems (btrfs/bcachefs) to enable snapshots and backups, though the way ParityDB keeps rewriting the same huge files over and over is quite terrible for incremental snapshots/backups. I always run 2 instances of archive nodes per network, so I end up using lvmthin/ext4 with an occasional rsync backup to a hard drive, just in case.
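
That lvmthin/ext4 plus rsync arrangement looks roughly like the sketch below (volume group name, sizes, and paths are placeholders):

```sh
# lvmthin pool with an ext4 thin volume for the node data.
lvcreate --type thin-pool -L 3.5T -n thinpool vg0
lvcreate --thin -V 2T -n archive vg0/thinpool
mkfs.ext4 /dev/vg0/archive
mount -o noatime /dev/vg0/archive /srv/archive

# Occasional backup to a spinning disk, just in case.
rsync -aH --delete /srv/archive/ /mnt/backup-hdd/archive/
```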

Looking ahead, keep your eyes peeled for bcachefs (CoW) and SSDFS (pure LFS with a CoW policy). Bcachefs just made its grand entrance into the mainline kernel, and SSDFS is waiting in the wings. My setup couldn’t give them a whirl this time around (limited by the pve-kernel), but they’re definitely ones to watch once they are supported by Linux LTS kernels. SSDFS in particular looks promising: it is designed to significantly reduce write amplification and improve SSD lifespan through logical extents, a log-structured design, and efficient data-migration and garbage-collection schemes, which should help both storage efficiency and performance.

So, there you have it: my little adventure in search of an optimal filesystem.
Hopefully this sheds some light on selecting a filesystem for your nodes and helps you avoid the mistake of using ZFS.


Consider using a mirror instead of raidz for your pool configuration to potentially boost write speed and decrease write amplification. Keep in mind that this change will result in less usable space and redundancy.

Opt for RAW image files over QCOW for your virtual machines for improved performance and reduced overhead. If you require QCOW2 features, remember to preallocate their metadata.
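
For reference, commands along these lines would produce a raw image, or a qcow2 image with preallocated metadata (image names and sizes below are just placeholders):

```sh
# Raw image: simplest, lowest overhead.
qemu-img create -f raw vm-disk.raw 200G

# If you need qcow2 features, preallocate the metadata.
qemu-img create -f qcow2 -o preallocation=metadata vm-disk.qcow2 200G
```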

Opt for a large recordsize (like 1M) for your datasets to minimize fragmentation and enhance sequential write performance. Be aware that this adjustment could lead to higher memory usage and latency for small random writes.

Set sync=disabled for your datasets to bypass the ZIL and enhance write throughput and IOPS. However, be cautious, as this may elevate the risk of data loss or corruption in the event of a power failure or system crash.
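
Put together, these suggestions roughly translate to something like the following (pool, dataset, and device names are placeholders; as noted above, sync=disabled trades crash safety for speed):

```sh
# Mirror pool with 4K-aligned writes, 1M recordsize, ZIL bypassed.
zpool create -o ashift=12 tank mirror /dev/nvme0n1 /dev/nvme1n1
zfs create -o recordsize=1M -o sync=disabled -o atime=off tank/node
```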

Thanks a lot for the optimization tips! I really appreciate your insights and suggestions. I realize my post may have come across as somewhat passive-aggressive towards ZFS; after migrating the filesystems of 4 servers (~50 LXCs) from ZFS to lvmthin/ext4, it ended up a bit provocative. Apologies for that. I do think, though, that ZFS is an outstanding filesystem for object storage, offering excellent space efficiency as well as a top-of-the-line user experience with its CLI.

However, despite the potential for tuning, I have yet to see it be optimal for blockchain operations, which primarily rely on small, random writes. Rob recently wrote a great article about the low-level operations involved, which helps in understanding the workload.

These tests were conducted with a single-disk setup, and I’ve previously run my nodes in striped mode incorporating nearly all the tips you mentioned, except for using a larger recordsize. Instead of VMs, I’ve been leaning towards LX containers to minimize overhead, finding them to be a more efficient choice for my setup. If you manage to find a tuning that even comes close to matching log-structured filesystems, I would be more than happy to see results that convince me to switch back.

On a completely different note, while researching filesystems that could be a great fit for these operations, I came across IPLFS: a log-structured filesystem that does its best to avoid garbage collection by replacing the block bitmap with a discard bitmap.

We already have a blockchain-optimized key-value store in ParityDB, which leads me to ponder the strategic advantages of identifying and supporting the development of the most suitable filesystem for our specific requirements. The creation of such filesystems is often the work of individual developers, undertaken over many years, with limited prospects for integration into the Linux kernel and broader adoption. However, by sponsoring the development of an optimal filesystem, we could not only accelerate its creation but also foster a symbiotic relationship that benefits both our operations and the wider community. After all, low-latency storage seems to be the most impactful factor for scaling up validator performance.


I agree with the suggestions to use mirror instead of raidz, RAW instead of QCOW, and large recordsize for better write performance. However, I would not recommend using sync=disabled unless you have a reliable UPS and backup system. The risk of losing data or corrupting your pool is too high for me. I also use cryptex to encrypt my data at rest, which adds some overhead but also some security benefits. I think it is worth it, especially if you store sensitive or personal data on your pool.


I disagree with the suggestions to use mirror instead of raidz, RAW instead of QCOW, and large recordsize for better write performance. I think these choices sacrifice reliability, flexibility, and efficiency for marginal gains. Mirror vdevs waste more space and are less resilient than raidz vdevs. RAW files are less portable and harder to manage than QCOW files. Large recordsize can cause fragmentation and waste space for small files. I agree with using sync=disabled if you have a good UPS and backup system, as it can improve performance and reduce wear on your disks. I don’t use cryptex, as I trust ZFS’s native encryption to protect my data at rest.