Rethinking historical storage

I wrote a blog post with my thoughts how to improve historical storage. Cross posting here for visibility. Feel free to leave a message either here or on twitter.

The post:

A twitter thread:

3 Likes

substrate-etl is our pretty complete archive of Polkadot/Kusama in BigQuery built by us – there are around 10/70 chains that do not have reliable archive nodes exposed.

state_traceBlock is highly useful but especially expensive. Once you throw out the blockchain maintenance, it can be 10-20x cheaper but it still dwarfs everything else.

Most of the Ethereum ecosystem depends on trace data for deep chain analysis, and I think the world is sophisticated enough to use data warehouses to solve higher level aggregation problems (counting extrinsics/txs/events, summing up $ of some slice of time for some project) that is not a mere matter of lookup.

Currently onfinality and others operate public nodes for indexers for free, but have to rate limit, and do not expose state_traceBlock because its “unsafe” and seriously cannot be public. Most everything else is quite tractable.

A good Polkadot Data Alliance bounty can address the ecosystem needs. See this charter:

2 Likes

To solve this problem, the service will need to gain the capability of executing the state machine2 directly. The storage of the execution should be backed by the RDBMS.

As a heads up, the smoldot full node used to have a very naive database that would just store the state of the finalized block, plus for each non-finalized block on top of it the storage diff between the block and its parent.
Basically, a design a bit similar to what you suggest.
This was done with SQLite.

Unfortunately, because it’s not optimized for trie root calculations, executing a block (even a basic block with zero or one transaction) would take around 5 to 10 seconds because you have to read every single value in the entire storage in order to be able to compute the Merkle root of the new block.

While this code was very unoptimized, I would unfortunately be surprised is you managed to have reasonable calculation times.

EDIT: of course 5 to 10 seconds is only if you need to calculate the Merkle root, but what I’m saying is that it’s slow as hell in general

@tomaka Nice, this is super interesting.

Luckily in the archive use-case you actually don’t need to calculate the state root. The blockchain node (as used in the text) will take care of and you literally need only the state diffs from that node. For most use cases it should be fine, because state_call can be executed within the environment that already provides the state root through the header of a block in which it is executed.

But now I realize that it won’t work in every case. Specifically, in cases the state root calculation is invoked from any other call site than this.

Yeah, the host doesn’t have any guarantee that the state root calculation is done at the end of the block.


To go back to the article, the database code of smoldot is very separate from its syncing code, and they are bound together only at the top level.
See these 400 lines of code (unfortunately this is perhaps one of the most messy pieces of code of smoldot, because the full node is not my priority and the syncing code that lies underneath is still moving quite a bit, but it’s slowly getting stable).
As such, it should be relatively easy to index new blocks in a way that is optimized for archive node usage.

I don’t want to present smoldot as the miracle solution, but I’ve made sure to detach as much as possible what doesn’t need to be tied together, and to remove as many non-obvious assumptions as possible from the bottom layers of code, with precisely the objective of being able to do the kind of things that you mention in your article.

1 Like

I was under impression that smoldot would be able to help me!

My gut feeling is telling me right now, that a solution based on a set of facilities from substrate (an RPC that returns a state diff, etc) that are leveraged by a relatively small ingester/service binary would strike the most simple and efficient solution. That feeling is based on almost nothing, so I may and likely change my mind after (and if) I do the research.

Regarding the state_call. I feel that the aforementioned trick where the state root function called within a state_call returns the context block state root would cover 99% of the cases. After all, not every user would need state_call.

However, here is the solution for the users who need that and willing to pay the price for it. state_call will be instantiated with the state backend, that will in turn be based on a trie backend. The trie won’t have the trie nodes locally and instead it would need to retrieve them from another server that stores all preimages. That trie will be populated by the ingester each block. Thus the blockchain node (substrate) should be capable of returning the set of hash → preimages pairs acessed during block execution. That capability could be the same one that returns the state diffs.

The preimages could be stored in a persistent Redis instance or again the RDBMS. Both of options should be scalable. Also, it should be trivially shardable as well. The total latency of a state_call call would become worse, but not too much if set up properly: an async call to a remote KV db should have a very low latency.

Update: There is an advanced optimization that could smooth out the latency a bit. It’s based on the observation that to run the execution, the flat kv data is sufficient, and the trie nodes are only needed for the state root calculation. That means, the state accesses can be served by the normal flat KV db and the trie could be rebuilt in the background in parallel to the execution.

2 Likes

I just want to point out that it would be nice, eventually, to modify the runtime environment (in order to remove deprecated mechanisms). But if we allow state_call to work on any historical block, we can’t do that, and we will be in the same situation as Ethereum where there are thousands of lines of codes that are basically unused and almost unrefactorable.