Pieces of a decentralized data lake

It’s been 42 days since Decoded ended and we’ve come a long way since then. To this day, the Parity data team is working on follow-ups with many of the teams we met there. We’re so excited and happy to see the growing momentum around data and infrastructure within the Polkadot ecosystem.

For those who couldn’t make it, we made a recap with all the videos, slides and topics we addressed during the conference. You can find those here: Parity Data - Polkadot Decoded 2023


One common thread we’re actively pursuing is how to make Polkadot data accessible, as frictionlessly as possible, to everyone who wants it. We call this the “democratisation” of the data. It’s a very important topic that has been on our roadmap since day 1, because beyond the raw data and numbers, we believe most of the value for our ecosystem is generated when people, including outsiders, interact with and understand our data.

As you might already know, we’re currently capturing, processing and storing most of the blocks produced in our ecosystem. We’re not the only ones doing so: Subsquid, Substrate-ETL (Colorful Notion), SubQuery, and many other teams are successfully pursuing similar goals.

Today I’m writing this post not to look back, but to look ahead and share some exciting data work we’ve been doing that could take the democratisation of data to the next level.

Status quo: Accessing data

In a previous forum post, we shared the architecture we set up and how it helps us answer a specific set of questions. Simplified, accessing data from it looks roughly like this:


[Diagram setup1: the current data access architecture]


The first obvious way is to access the data through a service or some kind of database; in this case, it’s Google BigQuery. This architecture takes time and expertise to maintain and, to be quite honest, it’s very difficult to open source, as it quickly gets tied up with other things that aren’t 100% related to just getting block data. Setting something like this up also has a high barrier to entry, since you’ll need to run RPC nodes (or connect to some) and do a lot of backend operations work.

Another challenge of this type of architecture is providing access to the data in a simple way. What “simple” means certainly varies depending on whom you ask. In our case, unless you make the tables and BigQuery directly available, there is no other way. Substrate-ETL does this right and it works for many use cases, but not all of them, like when you want to access data from your browser. You need an account and other things in place to be able to query anything on BigQuery. In other cases, you’ll have to download CSVs from websites, put them in some sort of Sheets and hope nothing goes wrong between three successive copy/pastes.

There are alternatives and I want to show you one of them now.

Query all Polkadot transfers from your browser

In the talk presented at Decoded, we shared how it is possible, right now, to query data directly in your browser from any object store, such as IPFS.

To decode what this means, try this:

  • Open this link here (you’ll need the latest Chrome Browser):
  • Run the following command: SELECT block_number, timestamp, sender, receiver, amount FROM 'https://dashboards.data.paritytech.io/datasets/transfers/t1.parquet' LIMIT 10;

What happens behind the scenes is that your browser performs queries directly, in a streaming fashion, on top of a file stored in our Google Cloud Storage :exploding_head:. This file was exported straight from our BigQuery without any processing.
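If you’d like to reproduce the same query outside the browser, here’s a minimal sketch using DuckDB’s Python API (assuming you have the duckdb package installed; the httpfs extension is what performs the HTTP range reads):

```python
# Minimal sketch: the same query from a script instead of the browser shell.
# httpfs lets DuckDB read the parquet file over HTTPS using range requests,
# so only the columns and row groups the query needs are actually fetched.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

rows = con.execute("""
    SELECT block_number, timestamp, sender, receiver, amount
    FROM 'https://dashboards.data.paritytech.io/datasets/transfers/t1.parquet'
    LIMIT 10
""").fetchall()

for row in rows:
    print(row)
```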

It works the same way when the file is stored on IPFS and accessed through a gateway. If you’d like to give it a try, download that parquet file, upload it to Crust Network and run this in the same window:

  • SELECT block_number, timestamp, sender, receiver, amount FROM '{IPFS_GATEWAY_URL}' LIMIT 10;

Here’s a graphical representation of what’s going on:

[Diagram setup2: the browser querying a parquet file in object storage directly]


So what is going on here and why is it cool?

  • The parquet file contains all the transfers made on the Polkadot relay chain on 6 August 2023.
  • Right now the data is exported from our BigQuery with a specific format and ruleset, then pushed to Google Cloud Storage. If stored on IPFS, any other team in the ecosystem applying the same ruleset to the same data will generate the same file (same CID), thus avoiding duplicate data. If the CIDs differ, we know one of the datasets is incorrect and can investigate.

I think you can infer where I’m going with this, but for now I’d like to say that with a proper methodology and process in place as part of the data bounty we’re drafting, we can ensure a very high degree of reliability and data correctness without the coordination overhead.

We can also open source most of the processing that happens after a block is captured.

What would this data capture look like? There would be two steps to the process. First, we capture the blocks from each chain and strip the “chain-specific information” (validation hashes etc.) that is usually not required for analytical queries and takes up a lot of space. This way we can significantly reduce storage usage and keep the data compact.
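Purely as an illustration of what “stripping” could mean, a lite record might be derived roughly like this; the field names are assumptions, not an agreed-upon schema:

```python
# Hypothetical sketch: turn a decoded block into a "lite" analytical record by
# dropping chain-specific fields (state roots, digests/seals, justifications)
# that analytical queries rarely need. All field names are illustrative only.
def to_lite_record(block: dict) -> dict:
    return {
        "block_number": block["number"],
        "timestamp": block["timestamp"],
        "extrinsics": [
            {
                "pallet": xt["pallet"],
                "call": xt["call"],
                "signer": xt.get("signer"),
                "args": xt["args"],
                "success": xt["success"],
            }
            for xt in block["extrinsics"]
        ],
        # deliberately dropped: state_root, extrinsics_root, digest, seal, justifications
    }
```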

Once all the blocks for a day are captured, we store a parquet file according to specific rules (sorting, etc.) to make sure that if we capture the data the same way again, the file will end up with the same CID: the data is written in a “curated” way.
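A minimal sketch of such a curated write, assuming pyarrow and made-up sort keys; the point is only that the same input plus the same rules yields byte-identical files, and therefore the same CID (in practice the writer version would also need to be pinned):

```python
# Hypothetical sketch of a deterministic "curated" write: a fixed sort order and
# fixed writer settings, so re-capturing the same day produces the same bytes.
import hashlib
import pyarrow as pa
import pyarrow.parquet as pq

def write_curated(records: list[dict], path: str) -> str:
    table = pa.Table.from_pylist(records)
    # Agreed-upon, stable sort order (keys are assumptions) so row order never varies.
    table = table.sort_by([("block_number", "ascending"), ("extrinsic_index", "ascending")])
    pq.write_table(table, path, compression="zstd")
    # Same bytes -> same CID; a plain SHA-256 is enough to compare exports locally.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()
```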

Similarly, we can store the files in plain cloud storage (if an entity can pay the bill). The diagram below illustrates this a little.



On top of this, we can collaboratively build a marketplace of curated datasets. Blocks contain all the data, but not all of it is information relevant to a specific use case. This is why we can set up the following, where any team can contribute definitions, code and even augmented datasets:

[Diagram setup4: a marketplace of curated datasets contributed by multiple teams]


This is what’s done for the demo shared above.
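To make “definitions” a little more concrete, here is an entirely hypothetical example of what a contributed dataset definition could look like; none of these names or fields exist today, it’s just the shape of the idea:

```python
# Entirely hypothetical: a dataset "definition" another team could contribute,
# containing enough metadata for anyone to regenerate and verify the same file.
TRANSFERS_DEFINITION = {
    "name": "polkadot-relay-transfers",
    "granularity": "1 day",
    "source": "relay chain blocks (lite records)",
    "transform_sql": """
        SELECT block_number, timestamp, sender, receiver, amount
        FROM extrinsics
        WHERE pallet = 'Balances' AND call IN ('transfer', 'transfer_keep_alive')
    """,
    "ruleset": {"sort_by": ["block_number", "extrinsic_index"], "compression": "zstd"},
    "maintainers": ["parity-data"],  # anyone could add or fork a definition
}
```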

The cool thing about this? It’s all enabled by WASM. This means that once the files and processed datasets are in file storage, any device that can run WASM can query these curated datasets:


[Diagram setup5: any WASM-capable device querying the curated datasets]


Let’s have a look at two more demos.

Building the next generation of data tools

Now, we understand not everyone is a fan of SQL or wants to be bothered with writing queries, and that’s okay. Most people interested in data care more about interactive tools built on top of it, some flashy graphs or explorers, the common denominator being a regular dashboard.

These data tools can build on the previous point to abstract away any sort of API or service required to run queries. By leveraging data that is stored directly (either on IPFS or S3-compatible storage like R2, S3, GCS…), we can build a new kind of data tool.

To drive the point home, we prepared the following:

It leverages the previous “trick” to enable live querying of data and custom visualisations. Nothing is required to use this apart from a web browser, similar to how Polkadot light clients work. You’ll also notice the speed of this demo.
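For a sense of what such a tool does under the hood, here is a sketch of the kind of aggregation a dashboard tile would run straight against the stored file and feed into a chart; the URL and column names come from the demo above, everything else (including the assumption that timestamp is a proper timestamp column) is illustrative:

```python
# Sketch: a dashboard tile's query, run directly against the parquet file in
# object storage with no API server in between. The result feeds a chart library.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

hourly = con.execute("""
    SELECT date_trunc('hour', "timestamp") AS hour,
           count(*)                        AS transfers,
           sum(amount)                     AS volume
    FROM 'https://dashboards.data.paritytech.io/datasets/transfers/t1.parquet'
    GROUP BY 1
    ORDER BY 1
""").df()  # a DataFrame that most plotting libraries can consume directly
```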

All of this of course assumes that somebody prepares a visualisation and shares it, meaning they know about the data. This also assumes somebody prepared and stored the data.

Note: for sensitive data, it’s possible to encrypt the files at rest and decrypt them at query time. AnyType uses this sort of technique, if I’m not mistaken.
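A rough sketch of that idea, assuming symmetric encryption via the cryptography package’s Fernet; this is purely illustrative, not how any of the mentioned tools actually do it:

```python
# Illustrative only: decrypt an encrypted parquet file in memory, then query it.
# Fernet stands in for whatever encryption scheme would actually be used.
import io
import duckdb
import pyarrow.parquet as pq
from cryptography.fernet import Fernet

def query_encrypted(path: str, key: bytes, sql: str):
    with open(path, "rb") as f:
        plaintext = Fernet(key).decrypt(f.read())  # nothing decrypted touches disk
    table = pq.read_table(io.BytesIO(plaintext))
    con = duckdb.connect()
    con.register("decrypted", table)  # expose the Arrow table under the name "decrypted"
    return con.execute(sql).fetchall()

# e.g. query_encrypted("transfers.parquet.enc", key, "SELECT count(*) FROM decrypted")
```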

Just for the fun of it

The most valuable activity when working with data is to distil it into information. For these cases we also prepared another fun demonstration.

In the following notebook, Ask Substrate.ipynb, you can query data from storage/IPFS and ask it questions in natural language using ChatGPT. What is happening here is that ChatGPT translates the written question into a SQL query, runs it against the file and returns the results. This is great for interactive exploration, for when you don’t know what you don’t know.

To get started, I recommend making a copy of that Colab notebook, adding your OpenAI API key and querying away.

This also works across multiple files. While this method currently doesn’t support glob queries (“*.parquet”), we can combine and join as many files on the fly as we want, so we’re not limited to just one parquet file per query. If you encounter any issues with that, you can reach me on Twitter/Telegram.
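Stripped down, the loop in that notebook looks something like this; the prompt, model name and helper function are assumptions rather than the notebook’s actual code:

```python
# Sketch of the "ask in natural language" loop: the model writes SQL, DuckDB runs
# it against the remote parquet file. Prompt and model name are illustrative.
import duckdb
from openai import OpenAI

DATASET = "https://dashboards.data.paritytech.io/datasets/transfers/t1.parquet"
client = OpenAI()  # expects OPENAI_API_KEY in the environment

def ask(question: str):
    prompt = (
        "Write one DuckDB SQL query answering the question below. "
        f"The data is in read_parquet('{DATASET}') with columns "
        "block_number, timestamp, sender, receiver, amount. Return only SQL.\n\n"
        f"Question: {question}"
    )
    sql = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    # Real code would sanitise the response (strip code fences, validate the SQL).
    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    return con.execute(sql).df()

# ask("Who were the ten largest senders by total amount?")
```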

Why am I telling you this?

I mainly wanted to share a written follow-up to the Decoded presentation under this “Tech Talk” format, and I’m looking for feedback/questions/improvements. As you might also know, I’m passionate about data, and when I saw that this worked, I had to share it.

This setup has many benefits for how we want to curate potential data bounties and how we coordinate to create this common good “data lake” maintained by all of us.

  • Decentralised ownership
    • Multiple teams provide and store the data
  • Marketplace for data
    • That is accessible by anyone, anywhere
    • Anyone can create and add new datasets, even for other (block)chains
  • Self serve
    • You don’t need an account to access it or create new datasets
  • Interoperability and standardisation
    • Parquet format is portable and simple to work with
    • CIDs ensure we have the proper datasets and simplify verification

A kind of practical and open source/open access “data mesh”. :slightly_smiling_face:

One thing is missing though: some sort of lookup table/index to know which blocks (and files) should be referenced in the queries, but we could even use our own technologies to make an index like that available. Another important thing to note is that this would definitely require curation to keep a certain level of quality and order, and to avoid it turning into a “data swamp”, as some like to say.
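One very low-tech option, purely an assumption sketched here for discussion: the index is itself just another small parquet file mapping chain, dataset and day to a file location and CID, queried exactly like the data it points to (the index URL and columns below are made up):

```python
# Hypothetical: the lookup index is itself a parquet file with columns like
# (chain, dataset, day, url, cid). Resolve which files cover a range, then query them.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

files = [url for (url,) in con.execute("""
    SELECT url
    FROM 'https://example.org/datasets/index.parquet'  -- made-up index location
    WHERE chain = 'polkadot' AND dataset = 'transfers'
      AND day BETWEEN DATE '2023-08-01' AND DATE '2023-08-07'
""").fetchall()]

file_list = ", ".join(f"'{u}'" for u in files)
weekly_total = con.execute(
    f"SELECT sum(amount) FROM read_parquet([{file_list}])"  # read_parquet accepts a list of files
).fetchone()
```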

If anyone has an idea, I’d love to hear it! In some discussions I heard that a smart contract, or even a git repo, could be helpful for this, but I’m not completely qualified to evaluate that; a demo would help validate the assumptions.

The natural next question is how much this would cost to operate and maintain. To be fair: not much.

The other natural question: will it scale? With the “lite” datasets, we should aim for a few gigabytes per chain and try it out; I think it might just work.

Where do we go from here?

Here’s what we’ll explore within the Parity data team: our data lake is churning along reliably right now and we have a couple of hundred metrics defined. We’ll schedule exports of these parquet files and publish them first to GCS and then to Crust as a test run, and we’ll keep you posted when there’s something workable to try out. If you want to collaborate on this idea, let us know, we’d love to make this happen.

All this magic above is enabled by WASM. I really believe that WASM will massively change the way we manage data and backend workloads.

Now I’m not saying that we have everything in place for a decentralised “Dune” type service but …

13 Likes

Fantastic thread! I am giga bullish on the combo of light client + parquet + in-browser WASM DuckDB or similar.
As a shameless plug, a quick reminder that the Squid SDK offers a rich toolkit for making parquets, so that you can do all the fun of data cleaning and transformation and forget about the burden of RPC access, splitting the datasets into manageable chunks, uploading to an S3 bucket, etc.

1 Like

Are you guys considering Arrow Flight as a complementary technology to parquet storage?

https://arrow.apache.org/docs/format/Flight.html

Just want to point out that a light client verifies all the data and guarantees its accuracy, while the solution presented in this thread relies on blindly trusting the data providers. They’re really very different in this regard.

Thank you very much @dzlzv, it means a lot coming from you! I believe our discussions and exchanges provided their fair share of inspiration for this thread.

I’m also closely following what Daria is developing and can’t wait to try it out myself to get a better understanding of the inner workings of the Squid SDK. I think at some point we can do an experiment, take all the transfers for a day, agree on how we want to store them and check if we get the same CID or not :smiley:

@sodazone for the purposes of this experiment, we went with the defaults that DuckDB supports, in this case parquet, to enable what they call “HTTP range queries”, which avoids downloading the whole file when querying. This works very well with a columnar data format. To be honest, I don’t understand 100% how it works yet, since it feels like magic. From checking their repo, it seems DuckDB doesn’t support Arrow Flight yet, so maybe it won’t fit this specific client-side querying use case?

Of course @tomaka, this solution is mainly focused on historical archival of blocks, for convenience/analytical purposes. For new blocks, where we don’t need an archive node, we currently have a proof of concept leveraging Substrate Connect to remove the need for a data provider. It would essentially work the same way for data collection. One can imagine that something like this could be used as an ongoing “test” to verify whether the blocks stored are indeed correct. However, I’m not an expert in the inner workings of light clients (yet ^^), so please correct me if I’m wrong.

1 Like

I think attestation/verification of static files is tangential to the topic here – by far the biggest benefit of decentralization and client-based queries is near-infinite scalability at diminishing infra costs.

As for data verification, there are multiple approaches:

  • on-chain reputation-based attestations of the CIDs (very easy to implement)
  • ZK proofs like what Axiom is building
  • randomized query result verification against archive nodes
  • running ETLs in SGX to prove the data transformation is sound
1 Like

And the details are left as an exercise for the reader

1 Like

To be honest, I don’t understand 100% how it works yet, since it feels like magic. From checking their repo, it seems DuckDB doesn’t support Arrow Flight yet, so maybe it won’t fit this specific client-side querying use case?

As far as we know, the Arrow Flight protocol is complementary to client-side querying and parquet storage. To our understanding, it takes away the problem of knowing which data and metadata are available (using ListFlights and GetSchema respectively) and how to transfer them in an efficient and scalable way. It also supports write operations to ship data. We found this introductory article that explains it quite well: Apache Arrow Flight: A Primer | Voltron Data

A kind of practical and open source/open access “data mesh”.

Providing a distributed access layer as Flight services could be worth exploring, since it seems to have the potential to solve data coordination and access problems.

This is just an idea that should be validated with a working prototype to know the real trade-offs, hidden intricacies and potential misconceptions.

1 Like