I’ve been wrestling with subpar write performance using ZFS on some fresh-out-the-oven NVMe disks, trying to get them to play nice with blockchain workloads. Given ZFS’s reputation for being pretty tweakable, I thought I could dial it in: mess with the recordsize, tune a few parameters, and boom, we’d be golden.
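For context, the tuning I was attempting looked roughly like this; a minimal sketch, with tank/polkadot as a placeholder dataset name and property values I varied between runs:

zfs set recordsize=16K tank/polkadot   # tried several values here
zfs set compression=lz4 tank/polkadot
zfs set atime=off tank/polkadot
zfs set logbias=throughput tank/polkadot
zfs set primarycache=metadata tank/polkadot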
But, oh boy, was I wrong. It’s been more of a headache than expected. And trying to find any sort of guidance or benchmarks online? Forget about it. It’s like looking for the proverbial needle in the haystack, except the needle may never have been there to begin with.
So, with the internet giving me the cold shoulder, I decided it was time to roll up my sleeves and get my hands dirty. For the benchmarking setup, I used the Polkadot client’s benchmark for write operations and fio for read operations:
curl -s https://api.github.com/gists/868fac581cf66af7780a1ccfd02c7b1b | jq -r '.files | first(.[]).content' > chainbench.sh && chmod +x chainbench.sh && ./chainbench.sh
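On the fio side, a representative random-read run looked something like this (the path and job parameters below are illustrative, not my exact job file):

fio --name=randread --filename=/srv/bench/testfile --size=32G --rw=randread --bs=4k --iodepth=32 --numjobs=4 --ioengine=io_uring --direct=1 --time_based --runtime=120 --group_reporting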
And let me roll out the hardware carpet for ya:
- CPU: AMD Ryzen 9 7950X3D
- Motherboard: ASRock Rack B650D4U-2L2T/BCM
- RAM: 4x MICRON DDR5 UDIMM/ECC 4800MT/s, dialed down to 3600MT/s (CPU limitation afaik)
- Storage: Samsung SSD 990 PRO 4TB on a PCIe 4.0 interface.
The gauntlet was thrown, and after the dust settled, it turns out ext4 and xfs are the champs for Polkadot’s storage needs. They hit that sweet spot of efficiency. Meanwhile, filesystems like ZFS and NILFS2? Not so much. They kinda dropped the ball, especially with random writes, which, let’s be honest, are a big deal for blockchain operations.
So I highly recommend the trade-off of giving up feature-rich filesystems for the simplicity and performance of ext4. You can pretty much skip snapshots/backups for validators, since on a good blockchain you can always re-sync from the network if there’s ever an issue (warp sync is doable in minutes).
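If you go that route, the ext4 setup is about as simple as it gets; device and mount point below are placeholders:

mkfs.ext4 /dev/nvme0n1p1
mount -o noatime /dev/nvme0n1p1 /srv/polkadot
# or persist it in /etc/fstab:
# /dev/nvme0n1p1  /srv/polkadot  ext4  noatime  0  2

(noatime just avoids pointless metadata writes on every read; the rest of the defaults are already sensible for NVMe.)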
For some archive nodes it might make sense to use copy-on-write filesystems (btrfs/bcachefs) to enable snapshots and backups. That said, the way ParityDB keeps rewriting the same huge files over and over is quite terrible for incremental snapshots/backups. I always run 2 instances of archive nodes per network, so I end up using lvmthin/ext4 with an occasional rsync backup to a hard drive, just in case.
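For the curious, that lvmthin/ext4 plus rsync setup boils down to something like the following; volume group, sizes and paths are examples, not my exact layout:

lvcreate --type thin-pool -L 3T -n nodes vg0
lvcreate -V 2T --thinpool nodes -n archive1 vg0
mkfs.ext4 /dev/vg0/archive1
mount -o noatime /dev/vg0/archive1 /srv/archive1
# occasional backup of the node data to a spinning disk
rsync -aH --delete /srv/archive1/ /mnt/hdd/archive1-backup/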
Looking ahead, keep your eyes peeled for bcachefs (CoW) and SSDFS (pure LFS with a CoW policy). bcachefs just made its grand entrance into the mainline kernel, and SSDFS is waiting in the wings. My setup couldn’t give them a whirl this time around (limited by the pve-kernel), but they’re definitely ones to watch once they’re supported by Linux LTS kernels. SSDFS in particular looks promising because it aims to significantly reduce write amplification and improve SSD lifespan through its use of logical extents, a log-structured design, and efficient data migration and garbage collection schemes, which should help both storage efficiency and performance.
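Once bcachefs lands in a kernel you can actually run, giving it a spin should be straightforward (purely illustrative; device and mount point are placeholders):

grep bcachefs /proc/filesystems || modprobe bcachefs
bcachefs format /dev/nvme0n1p2
mount -t bcachefs /dev/nvme0n1p2 /srv/bench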
So, there you have it. My little adventure in search of an optimal filesystem.
Hopefully this sheds some light on choosing a filesystem for your nodes and helps you avoid the mistake of using ZFS.