The path towards decoding historic blocks/storage in Rust

At present, the PolkadotJS API is the only means (that I’m aware of) of decoding historic blocks on Polkadot chains (aside from desub, which I’ll get to later).

We’d like to add the ability to decode historic blocks in Rust for a couple of reasons:

  1. To open the door for Rust tooling to be built around this ability instead of restricting developers to using TypeScript.
  2. A Rust implementation also opens the door for other languages to leverage the same code (eg TypeScript via WASM compilation, or many other languages via C bindings).
  3. To provide an alternative to PolkadotJS, which is no longer being actively developed by its primary author (although we are aiming to continue maintaining it until suitable alternatives exist).

So, I’d like to share with you all my plan, and the progress so far, towards being able to decode historic blocks using Rust.

(Side note: this post is an elaboration of, and update on, the issue originally posted here)

Introduction

I’ll start by summarizing the overall problem that we face.

Today, if you ask for the metadata from a chain, you’ll probably get back version 14 or version 15 metadata. These versions both contain a scale_info::PortableRegistry, which itself contains all of the type information needed to construct and encode (or decode) valid extrinsics, storage keys and values, and more. Types in this PortableRegistry each have a u32 identifier, and this is used to point to them elsewhere in the metadata when we are describing what calls or storage entries are available.
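
For concreteness, here’s a minimal sketch (not code taken from any of the libraries discussed below) of how one might look up a type by its u32 ID in V14 metadata. It assumes the frame-metadata, scale-info and parity-scale-codec crates with decoding support enabled, and that the metadata bytes have already been fetched from a node (eg via the state_getMetadata RPC method):

```rust
use frame_metadata::{RuntimeMetadata, RuntimeMetadataPrefixed};
use parity_scale_codec::Decode;

fn print_type(metadata_bytes: &[u8], type_id: u32) {
    // Metadata is itself SCALE encoded; decode it into the versioned enum.
    let mut cursor = metadata_bytes;
    let meta = RuntimeMetadataPrefixed::decode(&mut cursor).expect("valid metadata bytes");

    // V14 (and V15) metadata carry a scale_info::PortableRegistry in their
    // `types` field; every type in it is addressed by a u32 ID.
    if let RuntimeMetadata::V14(m) = meta.1 {
        match m.types.resolve(type_id) {
            Some(ty) => println!("type {type_id}: {ty:?}"),
            None => println!("no type with id {type_id}"),
        }
    }
}
```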

Libraries like Subxt (Rust) and Polkadot-API (TypeScript) work by downloading the metadata from a chain and using the type information it contains to SCALE encode and decode values, so that they can build and submit valid extrinsics and such.

If you go back a few years (ie to when Polkadot ran runtimes that contained V13 or below metadata), this type information (ie the scale_info::PortableRegistry) did not exist at all. Instead, all that we had in the metadata were the names of the types used in each call and storage entry. There was no information about what those names meant, or how to encode/decode values of those types into the right shape. So how did we know how to encode/decode anything?

PolkadotJS was created in 2017 as a client which was capable of interacting with Polkadot (and later, its parachains). It required type information to know how to encode/decode things, but none was available, so it had to construct its own. It built a mapping from the name of a type to some description of how to encode/decode it (nowadays this is mostly here). Since the shapes of many types evolved over time, PolkadotJS would add overrides to its type information that would take effect in certain spec versions and on certain chains in order to continue to be able to understand their shapes. Thus, if PolkadotJS knew which chain and spec version you were targeting, it would be able to look up how to decode information for it.

Newer libraries like Subxt and Polkadot-API were able to leverage the type information in modern metadata and so have never developed this ability, meaning that PolkadotJS remains the only way to decode historic information today. This is now changing, as we have recently started work on building the relevant features to be able to decode historic data in Rust.

Decoding historic data in Rust

First, I’ll start with what we had in place until recently in Rust. Then I’ll summarize our overall plan for adding the ability to decode old data in Rust. Finally I’ll explain each step in more detail, as well as where we’re at today.

What we had until recently in Rust

This diagram gives a rough idea of the main Rust libraries that we had until recently that are relevant here. Arrows mean “depends on” and show the rough hierarchy (not every dependency is represented here).

[Diagram: rust-decoding-before]

Let’s summarize each of these, starting from the bottom (follow the links to read more about each one):

  • parity-scale-codec provides the basic SCALE encoding and decoding implementation. This library does not care about any type information, and simply encodes and decodes Rust types according to their static shape. Its main exports are the traits Encode and Decode. Simply put (a minimal sketch follows this list):
    • Encode has a function fn(&self) -> Vec<u8> to SCALE encode self to bytes.
    • Decode has a function fn(bytes: Vec<u8>) -> Self to SCALE decode bytes into Self.
  • scale-info provides a structure (PortableRegistry) which contains the type information needed to know how to SCALE encode and decode types. Types can be obtained from this structure if you know their type ID (a u32). This is present in V14 and V15 metadata.
  • frame-metadata defines the format that metadata takes; one can SCALE encode or decode metadata to and from this format. The format has changed over time, and so the metadata is wrapped in an enum to which a new variant is added each time a new metadata version is produced. Newer versions of the metadata (V14 and V15) contain a PortableRegistry and point to types in it when describing things like the available extrinsics.
  • scale-encode and scale-decode primarily export EncodeAsType and DecodeAsType traits, and implement them for common Rust types. These both build on parity-scale-codec. Simply put:
    • EncodeAsType has a function fn(&self, type_id: u32, types: PortableRegistry) -> Vec<u8>. In other words, this encodes values based on the type information provided (and not based only on the shape of &self).
    • DecodeAsType has a function fn(bytes: Vec<u8>, type_id: u32, types: PortableRegistry) -> Self. In other words, this decodes some SCALE bytes into some type based on the type information provided (and not based only on the shape of Self).
  • scale-value primarily exports a Value type. This type is analogous to serde_json::Value and represents any valid SCALE encodable/decodable type. This Value type has a string and serde representation, and also implements EncodeAsType and DecodeAsType. Any SCALE bytes can be decoded into a Value.
  • subxt is a client library for interacting with chains in Rust, and doing things like submitting transactions or looking up storage values. It relies on all of the above to be able to intelligently encode and decode values to construct transactions and such.
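
To make the parity-scale-codec bullet above concrete, here is a minimal sketch of plain (type-information-free) SCALE encoding and decoding, assuming the crate’s derive feature is enabled; the struct and values are made up for illustration:

```rust
use parity_scale_codec::{Decode, Encode};

#[derive(Encode, Decode, Debug, PartialEq)]
struct Transfer {
    to: [u8; 32],
    amount: u128,
}

fn main() {
    let original = Transfer { to: [0u8; 32], amount: 100 };

    // Encode: &self -> Vec<u8>, driven purely by the static shape of `Transfer`.
    let bytes = original.encode();

    // Decode: bytes -> Self, again driven purely by the static shape.
    let decoded = Transfer::decode(&mut &bytes[..]).expect("valid SCALE bytes");
    assert_eq!(original, decoded);

    // The type-aware traits (EncodeAsType/DecodeAsType) and scale-value's Value
    // instead take a type ID plus a type registry, so bytes can be decoded without
    // knowing the static Rust type up front. Roughly (exact signature elided):
    // let value = scale_value::scale::decode_as_type(&mut &bytes[..], type_id, &types)?;
}
```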

We can see here that scale-info is pretty integral: it provides the type information that all of the higher level libraries use to SCALE encode and decode bytes. This is a problem, though, if we want to re-use these libraries to encode and decode historic data that doesn’t have any associated scale-info type information.

What we’re now working towards

In summary, we’d like to work towards modifying our relevant libraries to look more like this:

[Diagram: rust-decoding-after]

Green boxes represent new crates, and yellow boxes represent areas of significant change.

Let’s dig into this more:

Step 1: scale-type-resolver

In order to be able to re-use scale-encode, scale-decode and scale-value to decode historic data, the approach that we are taking is to remove scale-info from their dependency trees and replace concrete uses of it with a generic trait that can be implemented for anything that is capable of providing the required type information (including scale_info::PortableRegistry).

So, the first step is to create such a trait, which we’ve called TypeResolver and have recently implemented in the new scale-type-resolver crate. This crate is no-std, and the trait exposes an interface that can be implemented on scale-info::PortableRegistry with zero additional overhead (in theory at least). In order to be zero cost, the trait works by being given a visitor which implements ResolvedTypeVisitor; the relevant method on this visitor is called depending on the shape of the type being resolved.
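
To give a feel for the visitor-based design, here is a much-simplified sketch of the idea; it is not the exact published scale-type-resolver API (the real traits also have error handling, many more visitor methods, and some lifetime plumbing):

```rust
// Simplified sketch only; see the scale-type-resolver crate for the real traits.

/// The visitor: the resolver calls whichever method matches the shape of the
/// resolved type, so no intermediate type description needs to be allocated.
trait ResolvedTypeVisitor {
    type TypeId;
    type Value;
    fn visit_not_found(self) -> Self::Value;
    fn visit_primitive(self, name: &str) -> Self::Value;
    fn visit_sequence(self, inner: Self::TypeId) -> Self::Value;
    // ...the real trait has methods for composites, variants, tuples, arrays, etc.
}

/// Anything able to hand back type information can implement this, including
/// scale_info::PortableRegistry (where the TypeId would be a u32) or, later, a
/// legacy resolver whose IDs are closer to type names.
trait TypeResolver {
    type TypeId;
    fn resolve_type<V>(&self, type_id: Self::TypeId, visitor: V) -> V::Value
    where
        V: ResolvedTypeVisitor<TypeId = Self::TypeId>;
}
```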

Step 2: Make use of scale-type-resolver throughout the stack.

The next step is to make use of this new trait in scale-encode, scale-decode and scale-value instead of having any explicit dependency on scale-info. This has the effect of generalizing all of these crates so that they can be used to work with historic types as well as modern ones.

scale-encode version 0.6 and scale-decode version 0.11.1 have already been updated to depend on scale-type-resolver instead of scale-info. We’re now working on porting scale-value and subxt to using the latest versions of these libraries.

Step 3: scale-info-legacy

By this point, our main Rust libraries can all, in theory, decode historic types. But we only have a way to describe modern types via scale-info! So, in the same way that scale-info describes modern types, scale-info-legacy will provide the means to describe historic types. Some notes about this:

  • Historic types are referenced by something like a name rather than a numeric ID: in older metadata versions, we only have type names to go by. So we’ll want to be able to build type mappings that can be handed a type name and resolve it into a description of the type (that’s compatible with TypeResolver).
  • Historic type information doesn’t exist in metadata, so we should also strive to provide a bunch of default type information that is aware of changes across spec versions. This can provide a starting point for chains to then extend with type information for any custom types that they used. We can obtain a bunch of this from PolkadotJS to get us started (a rough, hypothetical sketch of the idea follows this list).
  • It should be really easy for users to provide their own type information on top of (or instead of) the defaults.
  • We need great error messages in the event that type information couldn’t be found, to make it as easy as possible for users to add missing types as they are encountered, until they have provided all of the necessary type information. It’s expected that this will happen a lot to begin with.
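
Since scale-info-legacy doesn’t exist yet, here is a purely hypothetical sketch of the sort of name-based, spec-version-aware type mapping described above; none of the names or shapes below are final:

```rust
use std::collections::HashMap;
use std::ops::Range;

// Hypothetical only: a description of a type's shape, keyed by type *name*
// rather than by a numeric ID.
#[derive(Clone, Debug)]
enum TypeShape {
    Primitive(&'static str), // eg "u128"
    SequenceOf(String),      // eg "Vec<u8>" described via its inner type name
    AliasOf(String),         // eg "Balance" -> "u128"
}

// Hypothetical only: default definitions plus per-spec-version overrides
// layered on top, so the same name can resolve differently over time.
struct LegacyTypes {
    defaults: HashMap<String, TypeShape>,
    overrides: Vec<(Range<u32>, HashMap<String, TypeShape>)>,
}

impl LegacyTypes {
    fn resolve(&self, name: &str, spec_version: u32) -> Option<&TypeShape> {
        // Spec-version-specific overrides win over the defaults.
        for (spec_range, types) in &self.overrides {
            if spec_range.contains(&spec_version) {
                if let Some(shape) = types.get(name) {
                    return Some(shape);
                }
            }
        }
        self.defaults.get(name)
    }
}
```

The real crate would then implement the TypeResolver trait from step 1 over something like this, so that scale-encode, scale-decode and scale-value can consume it directly.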

My plan is to start work on this crate in the next week or two. I am aiming for it to be ready some time during Q2 2024, although there may be a long tail of work involving building up a test suite for decoding historic types and adding missing types to the defaults that we’ll provide for Polkadot/Substrate.

Step 4: desub

A desub crate (well, set of crates) already exists, and was used as part of substrate-archive to decode historic blocks into JSON for storing in a database. It’s marked as green on the diagram because the plan is to effectively replace it with something that can leverage the scale-* crates we’ve developed in order to provide a more generally applicable and better integrated decoding experience (although we’ll adapt and make use of various bits and pieces that it offers).

The goals of this crate will be:

  • To provide generic interfaces for decoding extrinsics and storage values given some bytes and type information in a way that builds on top of the lower level “decoding types” functionality now available to us. Subxt may eventually be able to re-use this logic rather than having its own storage/extrinsic decoding logic, so that’s something we’ll keep in mind here.
  • To put on top of this a simple, high level interface for decoding arbitrary block/storage info from bytes given the relevant metadata and type information.
  • Quite possibly, to also provide a simple RPC layer to connect directly to archive nodes and pull the relevant information, rather than requiring the user to obtain the bytes themselves first.
  • To contain any CLI tooling that might be useful in helping users to construct the correct type information (for example, perhaps we’ll add a scanner to find out which blocks contain a spec version change; something that PolkadotJS historically kept track of internally).

There’s still some uncertainty around exactly what the interface will look like here; we’ll probably need to try some things to see what works.

By the end of Q2 2024 I expect we’ll have made some decent progress on this, with an initial release expected in Q3-Q4. As with scale-info-legacy, I expect there to be a long tail of testing to discover decode issues in historic blocks and storage data.

Step 5: scale-info -> scale-type-resolver

scale-type-resolver currently exposes the TypeResolver trait, and also contains an implementation of TypeResolver for scale-info::PortableRegistry (behind a feature flag). Thus, exactly one scale-info version will implement TypeResolver at any one time (the version that scale-type-resolver is pulling in). If scale-info has a major update, we need to update scale-type-resolver to point to it, which in turn means that the entire hierarchy of crates depending on scale-type-resolver needs updating too.

So, a small thing I’d like to do once the dust has settled is to instead have scale-info depend on scale-type-resolver and then implement the TypeResolver trait itself. This means that multiple versions of scale-info can implement the TypeResolver trait, and our core libraries (scale-encode, scale-decode and scale-value primarily) are no longer impacted at all by scale-info updates.

This should be left until everything is working well and we’ve found no obvious reason to update scale-type-resolver.

Future

With all of this in place, there may be some desire to update subxt to be more generic over how it handles historic types too, so that it can take on the task of fetching historic data as well as modern data, and is able to decode everything nicely. An advantage of subxt doing it all is that we avoid duplicating some of the logic around making RPC calls to nodes and decoding extrinsics/storage bits.

For now though, I think that it’s better to focus subxt on working at the head of some chain, and to keep functions for accessing historic data separate. Let’s see how things shape up in the next year or two!

Alternatives

I considered a couple of alternate approaches prior to this:

  1. Reusing scale_info::PortableRegistry as a means to store legacy type information, rather than being generic over it. I did not pursue this because being generic over the type information gives us more flexibility overall: it makes it more likely that we can create legacy type information that is efficient to query, and less likely that we run into any major road blocks (eg PortableRegistry not being able to handle generic “type names” like Vec<u8> in the way we’d like).
  2. Being more generic! We’ve taken the approach of being generic over the structure that resolves type IDs into the corresponding type information, but we could have gone further and been generic over the entire process of decoding types (ie having a TypeDecoder trait that takes in any ID and returns some decoded thing). This was the original plan, as it allows complete flexibility over how we handle historic type decoding, but I abandoned it when I realized that it would lead to us duplicating a bunch of type encoding/decoding logic, and would prevent us from using libraries like scale-decode in the way that I’d like.

Summary

  • A scale-type-resolver crate has been added to be generic over how we obtain type information.
  • scale-encode and scale-decode now use this instead of directly depending on scale-info.
  • scale-value and subxt are heading this way too (well, subxt will still depend on scale-info, but it’ll use up-to-date versions of things). Expected in a couple of weeks.
  • We’ll build a scale-info-legacy crate for providing historic type information. Expected some time in Q2 2024.
  • We’ll build a new desub crate to contain all of the high level interfaces we’ll want for fetching and decoding historic data. Expected by Q3-Q4 2024.

If you’ve read this far, then well done! I’m open to any questions or thoughts on this.

Great to hear. I used desub when building uptest (Libuptest Decoding Extrinsics - Uptest substrate documentation), because there are so few options to choose from. I bet a lot of people want a lightweight (ie not drowning in dependencies) alternative. ~flipchan

I didn’t know about desub when developing sub-script, which supports v12, v13 and v14 metadata. For v12 metadata support we used schema files similar to those used for Polkadot.js support.

Also, for polymesh-api-client we support dumping historical blocks (v12 - v14 metadata) to JSON (see the dump_blocks example).

Nice initiative. Back in the day we built the substrate graph prototype around substrate-archive, which helped us understand, decode and index historic and current extrinsics. If I understand correctly, with this we would not need to index the whole chain’s extrinsics, but would be able to directly access the extrinsics of a certain block, regardless of whether the types have changed?

Apologies if this is self evident - I haven’t yet cut my teeth on this part of the stack…

One question is about relay-chains that are more recent and don’t have to deal with this legacy issue.

Is there a feature flag to skip compiling this legacy logic, or is this moot by design, with no compile/run-time cost in these cases?

Is there a feature flag to skip compiling this legacy logic, or is this moot by design, with no compile/run-time cost in these cases?

The general approach is to rely, throughout the stack, on the TypeResolver trait defined in the very tiny scale-type-resolver library.

What this means is that, if you want to work with historic data, then you can pull in a (not yet created) crate that can give type information about these historic types, but for working with modern runtimes, you just don’t need to bring in those dependencies at all (so eg subxt won’t depend on any code for decoding old types, since it’s only concerned with interacting with the head of a chain right now).

I’d like to avoid needing feature flags where possible, but if we pull old and new decoding together into a single high level crate with a nicer interface, then we’d definitely end up with feature flags to opt out of the legacy (or modern, perhaps) type decoding if you don’t need it.

If I understand correctly, with this we would not need to index the whole chain’s extrinsics, but would be able to directly access the extrinsics of a certain block, regardless of whether the types have changed?

The code being built here will basically allow somebody to take the bytes from an old block or storage entry and be able to decode them into some meaningful values. We might provide a high level interface that one can use to connect to an archive node and download/decode bytes from a block or whatever, but we’ll also provide all of the code beneath that so that you can obtain the bytes however you like and then be able to decode them. I hope that answers your question? :slight_smile:

Just to clarify, the quoted text wasn’t from me - you may want to edit so the correspondent is notified.