One of the milestones of my Q4 treasury proposal is adding support for warp syncing to the full node.
When I added this milestone, what I had in mind was to implement it the same way Substrate does: the full node would download a warp sync proof from the network, which proves that the latest finalized block is a certain block, then download the state of the chain at that particular block, and only then actually start syncing.
This approach, however, has a drawback. Other nodes prune the state of blocks after a while, and thus if downloading the state of a block is not fast enough, it might be that the state of that block gets pruned and is no longer downloadable.
While it doesn’t seem to be a problem right now, if we imagine a state of 10 GiB (which isn’t that much), and that the state of a block is pruned after 1024 blocks (about 1.7 hours at a 6-second block time), you would need an average download speed of 1.67 MiB/s in order to download everything in time. It is therefore not an imaginary problem, but something that can really happen.
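As a rough check of that figure, here is a small sketch of the arithmetic. The function name is made up for illustration, and the 6-second block time is an assumption (it is the value that makes the numbers above work out).

```rust
/// Minimum average download speed, in MiB/s, needed to fetch `state_bytes`
/// before the state of the target block gets pruned by the other nodes.
///
/// Hypothetical helper, for illustration only.
fn required_mib_per_sec(
    state_bytes: u64,
    pruning_window_blocks: u64,
    block_time_secs: u64, // assumption: 6 seconds in the example above
) -> f64 {
    // Time available before the state of the block is pruned.
    let window_secs = (pruning_window_blocks * block_time_secs) as f64;
    // Size of the state expressed in MiB.
    let state_mib = state_bytes as f64 / (1024.0 * 1024.0);
    state_mib / window_secs
}
```

With a 10 GiB state, a pruning window of 1024 blocks, and 6-second blocks, this gives 10240 MiB / 6144 s, or roughly 1.67 MiB/s.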
Consequently, I’ve decided to take a different approach in the smoldot implementation.
Just like Substrate, smoldot will download a warp sync proof and download the state of the block it has warp synced to, but contrary to Substrate, it will also immediately start syncing more recent blocks in parallel with the download.
If, when verifying a block, the runtime accesses a storage item that hasn’t been downloaded yet, smoldot will prioritize the download of this specific item, so that the verification can be performed as soon as possible.
(I’m simplifying a bit, as in practice we’d ask for a call proof rather than just a single item.)
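To make the prioritization idea more concrete, here is a minimal sketch. It is not smoldot's actual code: `StateDownload`, `Request`, and all the method names are hypothetical, and it requests single items rather than call proofs, the same simplification as above.

```rust
use std::collections::{BTreeMap, VecDeque};

/// What to ask the network for next (hypothetical type).
#[derive(Debug, PartialEq)]
enum Request {
    /// A storage item that block verification is currently waiting for.
    Urgent(Vec<u8>),
    /// Nothing is urgent: continue the regular background state download.
    Background,
}

/// Hypothetical bookkeeping for a state download that runs while blocks
/// are being verified.
struct StateDownload {
    /// Storage items downloaded so far.
    local: BTreeMap<Vec<u8>, Vec<u8>>,
    /// Keys that verification tried to read but that are not available yet.
    urgent: VecDeque<Vec<u8>>,
}

impl StateDownload {
    fn new() -> Self {
        StateDownload { local: BTreeMap::new(), urgent: VecDeque::new() }
    }

    /// Called when the runtime reads a storage item during verification.
    /// Returns the value if it is already known. Otherwise queues the key
    /// as urgent and returns `None`, meaning verification has to wait.
    fn storage_get(&mut self, key: &[u8]) -> Option<Vec<u8>> {
        if let Some(value) = self.local.get(key) {
            return Some(value.clone());
        }
        if !self.urgent.iter().any(|k| k.as_slice() == key) {
            self.urgent.push_back(key.to_vec());
        }
        None
    }

    /// Picks the next network request: urgent keys always come first.
    /// A real implementation would also track requests that are in flight.
    fn next_request(&mut self) -> Request {
        match self.urgent.pop_front() {
            Some(key) => Request::Urgent(key),
            None => Request::Background,
        }
    }

    /// Records a storage item received from the network.
    fn on_response(&mut self, key: Vec<u8>, value: Vec<u8>) {
        self.local.insert(key, value);
    }
}
```

The point of the design is that the background download never blocks verification for long: as soon as the runtime misses an item, that item jumps ahead of everything else in the queue.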
Verifying a block gives the diff in the state between this block and its parent. Any item that is not in the diff is therefore identical between the parent and its block.
This means that once a block has been verified, we can continue downloading the state, but this time that of the child rather than that of the parent.
Using this method, the full node is therefore able to reach the head of the chain much more quickly, and in a reliable way.
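Here is a minimal sketch of that step, under the assumption that the partially downloaded state and the diff are both plain key-value maps. `PartialState`, `Diff`, and `advance_to_child` are hypothetical names, not smoldot's actual types.

```rust
use std::collections::BTreeMap;

type Key = Vec<u8>;
type Value = Vec<u8>;

/// Result of verifying a block: `Some(value)` means the item was written,
/// `None` means the item was deleted.
type Diff = BTreeMap<Key, Option<Value>>;

/// Partially downloaded state of one block (hypothetical type).
struct PartialState {
    /// Items known so far. `Some(value)` is a known value, `None` means the
    /// item is known to be absent. Keys missing from the map are simply not
    /// downloaded yet.
    known: BTreeMap<Key, Option<Value>>,
}

/// Turns the partial state of a parent block into the partial state of its
/// child by applying the diff produced when verifying the child.
fn advance_to_child(mut parent: PartialState, diff: &Diff) -> PartialState {
    for (key, change) in diff {
        // Items in the diff are now known for the child, whether or not
        // they had been downloaded for the parent.
        parent.known.insert(key.clone(), change.clone());
    }
    // Items not in the diff are identical in parent and child, so what was
    // already downloaded stays valid, and what is still missing can be
    // requested against the child from now on.
    parent
}
```

Nothing that was already downloaded is thrown away when moving to the child, which is why the download target can keep following the chain instead of being stuck on a block that is about to be pruned.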
It is not 100% robust, because you still need enough bandwidth to be able to keep up with the chain, but this currently requires only around 3 to 5 KiB/s.
The drawback is that this is more complicated to implement, but at least I know that I won’t need to revisit it later.