Select * from polkadot;

Data, the lifeblood of any worthwhile endeavor, is the central piece in solving the great puzzles of our time. It is the heart of progress and the ultimate arbiter of trust - to us it is the key to answering the question of “Why should I build on Polkadot?”. Understanding the data is paramount to harnessing Polkadot’s full potential as well as making that potential evident to external observers.

Parity data team

Our newly formed data team within Parity embarked on a quest to build a data warehouse that is currently serving Parity in its mission of decoding Polkadot’s block data as well as Polkadot’s ecosystem data and translating both into valuable community insights. Our main focus so far has been covering Parity’s internal data needs, but we’ve decided to progressively open up and share our initiatives and what we’ve learned.

In this technical post we’ll share what this data platform is made of and the key metrics surrounding it. We’re sharing this as a step towards contributing to what we like to refer to as the “Polkadot Data Alliance”.

This is brought to you by @pavla, @olliecorbs, @wsteves, @hakmark, myself & team.

Why did we build this?

There are many reasons why we built this and why it’s useful, the first one being the lack of standardized metrics and definitions for Polkadot and their representation on data explorer tools like Dune or Nansen.

A frequent reason why Polkadot wasn’t being represented is that it’s a complex ecosystem, and the operations required to get all the data reliably are very tedious.

There was a very long lead time to answer basic questions on the ecosystem.

It doesn’t have to be like this anymore.

While there are already many indexers (Subscan, Subsquid, SubQuery, Sub.id, …), there is a need for a sort of database for the whole ecosystem, where it’s possible to run queries on a single chain or multiple chains at the same time, and where data analysts can just use lightly pre-processed data without spending time fetching the data themselves.

A sort of abstraction layer for the community to collaborate on definitions and insights (à la Dune) using modern data engineering practices to get things like:

  • Number of runtime upgrades across all chains
  • Average block time in the past week for chains A, B, C
  • and more…

We’ve built our platform on top of GCP (Google Cloud Platform) with the phenomenal support of our Infrastructure & DevOps team (thanks @erin & @pierre.besson).

The platform: Dotlake

In a nutshell, we’ve built a data platform to store all the blocks, events, extrinsics and more for all the chains in the Polkadot ecosystem. The platform is built in such a way that it is both scalable and cost-efficient, with the ambition of covering all the chains.

We’ve decided to leverage a lot of already existing technologies - standing on the shoulders of giants, so to speak. The architecture is quite simple and focuses on low ops, meaning we run as few services ourselves as possible:

  • Terraform: orchestrating cloud resources
  • Substrate Sidecar maintained by our Delivery Services team (thanks @gautam & @IkerParity & Tarik)
  • Rust & Python
  • GCP services
    • Cloud Storage (backup & persistence)
    • BigQuery for querying & analysis
    • Cloud Run & Jobs: Sidecar API service & cron job runners
    • Workflows & functions to tie it all together

What it looks like, at a high level, for a single chain:

After this is set up, all that’s left to do is iterate from block number N to block number M and store the raw result, containing all the data we need, as JSON in Google Cloud Storage. This provides us with a convenient abstraction as well as a documented data model that has been reviewed and implemented by our Delivery Services team. Besides, this way we don’t reinvent the wheel with a custom block-parsing strategy that might be prone to errors (unknown unknowns). For now we only get blocks, but the same works for traces and other data points provided by Sidecar too.
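To make this concrete, here is a minimal sketch of that ingest loop, assuming a reachable Substrate API Sidecar instance and a hypothetical GCS bucket (the URL and bucket names are placeholders, not our production setup):

# Minimal ingest sketch: fetch blocks [N, M] from Sidecar and store the raw JSON in GCS.
# SIDECAR_URL and BUCKET_NAME are hypothetical placeholders.
import json
import requests
from google.cloud import storage

SIDECAR_URL = "http://localhost:8080"      # a running substrate-api-sidecar
BUCKET_NAME = "my-dotlake-raw-blocks"      # hypothetical GCS bucket

def ingest_range(chain: str, start: int, end: int) -> None:
    bucket = storage.Client().bucket(BUCKET_NAME)
    for number in range(start, end + 1):
        # Sidecar exposes GET /blocks/{blockId} with extrinsics and events already decoded.
        resp = requests.get(f"{SIDECAR_URL}/blocks/{number}", timeout=30)
        resp.raise_for_status()
        blob = bucket.blob(f"{chain}/raw/{number}.json")
        blob.upload_from_string(json.dumps(resp.json()), content_type="application/json")

if __name__ == "__main__":
    ingest_range("polkadot", 1_000_000, 1_000_100)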

This technical decision has been instrumental in getting us where we are today, at a point where we have a better grasp on chain data based on open source tools.

The same approach also works against an archive node, in which case we loop over bigger block ranges under sustained load for a short period of time. The biggest selling point of this solution is that by running the Substrate Sidecar API in a service called Cloud Run, we can split the traffic towards multiple different archive nodes using revisions. This means we can create N Cloud Run services for one chain, each connected to a different archive node, and split traffic between the ingest service and Cloud Run, hence speeding up the ingest by a factor of N without having to manage or maintain individual servers, regardless of any optimizations to ws-max-connections or runtime-cache-size. This is important for chains with millions of blocks and/or chains where archive nodes are not available in the same region.
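As a rough sketch of what that fan-out looks like from the ingest side, assuming several Sidecar endpoints (one per archive node) with hypothetical URLs:

# Illustrative only: assign blocks round-robin across N Sidecar endpoints and fetch in parallel.
from concurrent.futures import ThreadPoolExecutor
import requests

SIDECAR_URLS = [
    "https://sidecar-polkadot-a.example.app",   # hypothetical endpoints, e.g. one
    "https://sidecar-polkadot-b.example.app",   # Cloud Run service/revision per
    "https://sidecar-polkadot-c.example.app",   # archive node
]

def fetch_block(assignment: tuple[str, int]) -> dict:
    url, number = assignment
    resp = requests.get(f"{url}/blocks/{number}", timeout=30)
    resp.raise_for_status()
    return resp.json()

def ingest_parallel(start: int, end: int) -> list[dict]:
    # Each block is pinned to one endpoint; the thread pool keeps all endpoints busy.
    work = [(SIDECAR_URLS[n % len(SIDECAR_URLS)], n) for n in range(start, end + 1)]
    with ThreadPoolExecutor(max_workers=4 * len(SIDECAR_URLS)) as pool:
        return list(pool.map(fetch_block, work))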

Taking both of these things into consideration, the full architecture can be set up in a way where it works like clockwork, on a schedule and without manual intervention:

One important part of this graph is the component we call the Block Compressor. It’s essentially a piece of code that lets us run some optimizations and reduce the stored data to what is useful for analytical purposes.

The block compressor compacts the JSON records returned by the Sidecar and drops the following fields by setting their data value to “SKIP”:

  • paraInclusion.CandidateIncluded
  • paraInclusion.CandidateBacked
  • parachainSystem.setValidationData

Doing so allows us to massively reduce the size of the blocks. The savings add up the more blocks we include in the compression, as they end up stored in a single Avro file. For instance, given 5 raw JSON blocks returned by the Sidecar, we reduce the size from 635 KB to 14 KB, which directly translates into cost savings in terms of BigQuery storage as well as very performant query times on complete historical blocks. In this “lite” format, the complete data for the Polkadot relay chain can be compressed down to 30 GB and made available as a SQLite/DuckDB database for local analytics, for example.
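A rough sketch of the idea, assuming the Sidecar JSON layout and a simplified, placeholder Avro schema (not the exact schema we run in production):

# Block compressor sketch: blank out heavy payloads, then pack several blocks into one Avro file.
import json
from fastavro import writer

SKIP = {
    ("paraInclusion", "CandidateIncluded"),
    ("paraInclusion", "CandidateBacked"),
    ("parachainSystem", "setValidationData"),
}

SCHEMA = {
    "name": "block_lite",
    "type": "record",
    "fields": [
        {"name": "number", "type": "string"},
        {"name": "block_json", "type": "string"},
    ],
}

def compress_block(block: dict) -> dict:
    # Replace the bulky args/data of the skipped calls and events with the literal string "SKIP".
    for ex in block.get("extrinsics", []):
        method = (ex.get("method", {}).get("pallet"), ex.get("method", {}).get("method"))
        if method in SKIP:
            ex["args"] = "SKIP"
        for ev in ex.get("events", []):
            ev_method = (ev.get("method", {}).get("pallet"), ev.get("method", {}).get("method"))
            if ev_method in SKIP:
                ev["data"] = "SKIP"
    return block

def write_avro(blocks: list[dict], path: str) -> None:
    records = [{"number": b["number"], "block_json": json.dumps(compress_block(b))} for b in blocks]
    with open(path, "wb") as out:
        writer(out, SCHEMA, records)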

Of course, the above can be configured to run as a docker-compose setup for a single chain and run locally. Given the same data model, we can collaborate on definitions, i.e. how to query the data and how to display it in interactive visualizations. Some simple examples we’ve developed to literally and figuratively connect the “DOTs”:

Polkadot & Kusama Block time

Courtesy of @OliverTY

Governance

Auctions

Some example queries

Not only do our data analysts know how to drop bars and normalize data, they’re also working very hard on decoding the data and providing a standardized set of definitions for the Polkadot ecosystem. Some examples:


# Daily Active Addresses for Polkadot Relay & Parachains

WITH
  successful_extrinsics AS (
  SELECT
    number AS block_number,
    TIMESTAMP_MILLIS(timestamp) AS block_timestamp,
    chain,
    authorId,
    relay_chain,
    ex.success,
    ex.signature.signer.id AS address
  FROM
    `dotlake_lite.blocks_polkadot_*`, UNNEST(extrinsics) ex
  WHERE ex.signature.signer.id IS NOT NULL AND ex.success = TRUE ),

  daily_active_addresses AS (
  SELECT
    DATE(block_timestamp) AS date,
    chain,
    relay_chain,
    "Number of successful extrinsic signers" AS metric_value,
    COUNT(DISTINCT address) AS number_of_addresses
  FROM successful_extrinsics
  GROUP BY date, chain, relay_chain )
  /*---------------------------->*/
SELECT * FROM daily_active_addresses
ORDER BY date;

# 2023 Polkadot Relay DOT Transfers by Originating Extrinsic

WITH balances_transfers AS (
  SELECT
    number AS block_number,
    TIMESTAMP_MILLIS(timestamp) AS block_timestamp,
    chain,
    authorId,
    relay_chain,
    ex.success,
    ex.hash AS ex_hash,
    ex.method.pallet AS extrinsics_pallet,
    ex.method.method AS extrinsics_method,
    ev.method.pallet AS events_pallet,
    ev.method.method AS events_method,
    STRING(PARSE_JSON(ev.data)[0]) AS sender,
    STRING(PARSE_JSON(ev.data)[1]) AS receiver

  FROM `dotlake_lite.blocks_polkadot_polkadot`, UNNEST(extrinsics) ex, UNNEST(events) ev
  WHERE
    ex.success = TRUE AND ev.method.method = "Transfer"
    AND ev.method.pallet = "balances" AND ex.signature.signer.id IS NOT NULL )
/*----------------------*/
  SELECT
    extrinsics_pallet,
    extrinsics_method,
    COUNT(*) AS number_of_transfer_events
  FROM balances_transfers
  WHERE DATE(block_timestamp) > "2022-12-31"
  GROUP BY extrinsics_pallet, extrinsics_method
  ORDER BY number_of_transfer_events DESC;

Resulting in:

As you can see, leaving the data untouched and parsing it with SQL afterwards provides a lot of benefits: less code and more flexibility. BigQuery supports UDFs (user-defined functions), which can augment the capabilities of plain SQL queries. As a community, we can work on this layer together, provide the necessary definitions, and build new tools such as views on top of the data (transfers, XCM, treasury activity, …).
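As a hedged illustration of both points (programmatic access plus a small UDF), here is how such a query might be run from Python against the table used in the examples above. The planck-to-DOT helper and the assumption that the third element of the Transfer event’s data array is the amount are ours, not vetted community definitions:

# Illustrative only: a temporary SQL UDF plus a query, executed via the BigQuery Python client.
from google.cloud import bigquery

SQL = """
CREATE TEMP FUNCTION planck_to_dot(amount STRING) RETURNS FLOAT64 AS (
  CAST(amount AS FLOAT64) / POW(10, 10)   -- 1 DOT = 10^10 planck
);
SELECT
  DATE(TIMESTAMP_MILLIS(timestamp)) AS date,
  SUM(planck_to_dot(JSON_VALUE(ev.data, '$[2]'))) AS dot_transferred  -- data[2] assumed to be the amount
FROM `dotlake_lite.blocks_polkadot_polkadot`, UNNEST(events) ev
WHERE ev.method.pallet = "balances" AND ev.method.method = "Transfer"
GROUP BY date
ORDER BY date DESC
LIMIT 7;
"""

for row in bigquery.Client().query(SQL).result():
    print(row.date, row.dot_transferred)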

As well as getting raw block data, we also have storage function results that we store using Cloud Run Jobs (not Sidecar). You can find an example of this on our prototype public data dashboards.

Anyone can run this, reproduce the data, and build up a similar dataset. The current data platform runs at a cost of roughly £150/day for 30 fully ingested Polkadot, Kusama and solo chains.

How can you access data today?

Our goal is not to keep this to ourselves. With posts like these we hope to share and build with the ecosystem. One team in the ecosystem is building a similar data warehouse and is actively working on getting Polkadot data into a BigQuery dataset available to anyone. This team is Colorful Notion (Polkaholic), represented by @sourabhniyogi and @mkchungs.

You can access the data yourself, today, in BigQuery by following this guide: substrate-etl. We’re working closely with Colorful Notion and Google to get this data into blockchain-etl, which would make this the reference Polkadot dataset available to anyone, just like Bitcoin and Ethereum data is available as a public dataset within BigQuery.

Our goal with this collaboration is to make sure we democratize Polkadot data and not lock it in.

This initiative is linked to a WIP bounty; if you’re interested in participating, reach out to @pavla or myself. We’ll follow up on this post once we have more details. We’ll make sure both of these datasets are available to the community going forward.

We’re also working very closely with SubWallet (@djhatrang), TokenTerminal, Polkalytics (@alice_und_bob) and many other teams to make sure the data is available and verified/validated, and that efforts aren’t always restarted from scratch. We’ll share more about this in a separate forum post, as I’m starting to think this post is getting very long.

Your input

We’d love to hear what you think, whether you have ideas for what we could add or try out with this data, and whether you see a pragmatic way of open-sourcing this work and working more with data. One concrete way we identified would be to contribute to Sidecar and leverage that phenomenal resource for anyone who wants to get data without writing too much code.

If you’re interested in creating infographics, interactive data visualizations, getting samples of the data or have any idea how we could make this better, reach out or create a new thread and tag anyone from the data team.

We’re also planning quite a few talks at Polkadot Decoded, you can come join the data telegram group that @dataPhysicist from Apexti created over here: Telegram: Join Group Chat

Conclusion

Ensuring accurate and comprehensive representation of the Polkadot Ecosystem is crucial to its success and growth. Representation, however, isn’t an automatic process—it needs to be nudged and supported. Data is the indispensable tool to achieve this within the Polkadot ecosystem. By harnessing the power of data, we can gain insights into network patterns, detect inefficiencies, and unveil opportunities for improvement. A clear grasp of data is vital for the Polkadot ecosystem to thrive, adapt, and foster a more inclusive environment that truly reflects the ever-evolving landscape of the blockchain and cryptocurrency world.

Now, with our trusty data warehouse as well as Colorful Notion’s substrate-etl dataset in place, we’ve given Polkadot an invaluable resource to understand what is going on. Our team’s efforts are paying off, and as the heart of science beats within our newfound data hub, we’re better equipped to navigate the intricate labyrinth of the Polkadot ecosystem.

With data as our guiding light, we’ll continue exploring the ever-evolving landscape of web3, one byte and one block at a time. :hugs:


This is great and exactly what I wanted.


fantastic work! I can’t wait to dig into it!


Great job, team! Of course we need such a data index infrastructure to help us build a Dapp like Nansen or Dune. But I’m confused about your innovation and the comparison with other index tools like Subsquid.

It seems that you have optimized and compressed the storage when traversing the blocks.

From what I know of Subsquid, they also support indexing all Substrate chains, and they can index more than 50k blocks/s, which seems like plenty of speed. Just wondering what your advantages are compared to them.

Hope it won’t bother you too much, just curious about that.

Thank you all for the kind words as well as for the likes and support!

Allow me to clarify and hopefully satisfy your curiosity, it’s not a bother at all :smiley_cat:

The current data architecture presented here is not meant for building Dapps; it’s instead an analytical dataset for performing queries across different chains, and it serves as a baseline to calculate a whole lot of things. It’s also called a data warehouse: a repository of structured and semi-structured data that is used for reporting, analysis, and decision-making purposes. To us it’s a thin abstraction layer on top of the data generated by all the chains (the blocks + state changes), built using open source tools.

The target audience for something like this is data analysts or people within the ecosystem who’d like to investigate what happened on a specific chain on a specific date, compare that to what happened on another chain, and build up their own definitions and aggregates for metrics (e.g. how do we define an active account in the Polkadot ecosystem?).

Indexers generally provide APIs tailored for access from a browser or mobile apps, based on indexes of various types of blockchain data, such as transactions, addresses, or smart contracts. They improve the speed and efficiency of searching and retrieving data by creating a data structure that maps the relationships between the data elements, but for one chain at a time. These mappings/computations need to be known before indexing. This allows for quick and efficient searches, lookups, and retrieval of information based on pre-defined ways of computing these metrics. They provide REST or GraphQL APIs for access in Dapps, as you correctly point out, which aren’t meant for data analytics but more for interactivity and real-time access. With a few opinionated endpoints, it’s possible to use an indexer and quickly scaffold the functionality you would require for your app, as Subsquid and SubQuery already show perfectly.

On the other hand, a data analyst who wants to build up the history of an event that wasn’t indexed, of a new event that is not yet indexed, or of a different way of mapping/calculating things needs to write a whole lot of code and iterate through a lot of blocks and APIs to manually parse events and extrinsics, figure out types, etc., adding more friction to the real goal, which is data analytics. This additional friction is not helpful if you account for the scale of the ecosystem. Queries like:

  • How many times did the account X call the pallet Y over the past month
  • What was the balance of account Z at date D
  • Etc…

… become complicated the more we’d like to investigate things that aren’t along the beaten path or covered by pre-computed indices. It’s 2023 and it shouldn’t be this way, which is why Colorful Notion and we provide and support such an initiative.

We believe there are different approaches to this particular problem than building and maintaining APIs or forcing data analysts to learn JavaScript or make HTTP calls ^^. One of them, as illustrated here, is to provide access to an SQL-like database of absolutely everything (or only the analytically relevant fields) and to collaboratively build up definitions and investigate what’s in the data. It won’t matter if there’s an update at a specific date: we get the block, store it efficiently and make it available for when we agree on how to parse & decode it. The benefit of having it in SQL is that it’s understandable, standardized, and sharing SQL scripts (like what Dune is doing) is a much more scalable approach than having each data analyst rely on a series of locally maintained scripts using a multitude of APIs that would be subject to change.
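As a rough, hedged illustration of the first question above (not a vetted community definition), such a query could be run directly against the table layout shown in the original post, with placeholder parameter values:

# Illustrative only: count how often a given account called a given pallet in the past 30 days.
from google.cloud import bigquery

SQL = """
SELECT
  ex.method.pallet AS pallet,
  COUNT(*) AS calls
FROM `dotlake_lite.blocks_polkadot_polkadot`, UNNEST(extrinsics) ex
WHERE ex.signature.signer.id = @address
  AND ex.method.pallet = @pallet
  AND DATE(TIMESTAMP_MILLIS(timestamp)) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY pallet
"""

job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("address", "STRING", "<ss58-address>"),  # placeholder
        bigquery.ScalarQueryParameter("pallet", "STRING", "staking"),          # placeholder
    ]
)
for row in bigquery.Client().query(SQL, job_config=job_config).result():
    print(row.pallet, row.calls)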

These solutions are complementary to each other, but one point is clear: you can build indexers on top of solutions like these, but it’s laborious to build an exhaustive solution like this on top of an indexer. As another forum post shows, computing the same thing using different indexers can yield different numbers. No matter how fast the data is fetched: if it’s wrong or open to different interpretations, it isn’t helpful.

So, in terms of speed of historical ingest: when building a database for archival and analysis, the historical ingest only needs to be run once. This means we won’t be rebuilding the dataset every day, which would require us to optimize for speed. Indexers, however, need this functionality since many features are client-facing and need faster sync times for faster iteration, and they also have to accommodate applications that don’t require a full index of things but just a subset.

As far as speed goes, as I mentioned in the post, the current architecture runs without any optimizations to the nodes themselves or any tweaking of concurrency values on the nodes. It is, however, good enough. But for the sake of the exercise, given that it is possible to load-balance multiple Sidecar APIs (within Cloud Run revisions), each of which could be connected to a different node (each with a high concurrency value set), I am highly confident we could beat such a benchmark, since we’d be leveraging the scale of the cloud versus the speed of syncing on a single node. We’re only bound by I/O in our case since we write JSON; writing things to a DB or in memory would make it even faster. (This would be similar to running multiple Firesquids in parallel to compare apples to apples, but then I’m certain we’d hit limits imposed by physics & nature.) :rocket::rocket:

To recap, we’re not competing with current solutions, but rather complementing the data offering in the Polkadot ecosystem to provide advantages to a different demographic.

So I assure you, @Whisker17 the team is doing quite a bit more here than just slowly compressing blocks. Joke aside, I hope this answers your questions and clarifies things a little bit. :wink:

Today we can easily answer many more questions about the ecosystem than we could a few months ago - and we decided to share that insight, our know-how, as well as a pragmatic way forward.


Love the initiative here, and really looking forward to more and more valuable data being surfaced to the ecosystem.

I just wanted to call out that there are likely other avenues where the data team can contribute and make a large impact. Note specifically the article released today from Messari, which compared the decentralization of various proof-of-stake protocols:

As you can see, Polkadot was not included. Someone in the comments queried about this, and the response was simply that a data pipeline needed to be contributed to the repo used to scrape this data.

I guess my high-level comment is: the initiative is great, and I really look forward to seeing all that comes from it, but I can speak from first-hand experience that sometimes we are so deep into our own thing that we forget where impact can actually be generated. My guess is that creating our own dashboards and pipelines may at times be less important than giving high-quality data to people who already have a large audience.

Some other examples of places where the data team could contribute to existing platforms:


Thanks @Karim amazing work!

And congrats to the team @pavla, @olliecorbs, @wsteves, @hakmark !

Is Dotlake open source?

Is there a git repo?


Thank you for the words of support @shawntabrizi ! I must have missed the notification but allow me to respond now.

A lot has changed since the time I posted this and things are shaping up great this year to deliver on our goal of democratizing data and ensuring appropriate, correct and exhaustive coverage of the Polkadot ecosystem.

Here are just a few things we’ve already done:

So all in all, the Parity Data Team isn’t just a data collection team anymore, but an active driver of Polkadot’s representation across multiple dimensions. Our focus is on tech, but it’s beautiful to see that this can incidentally have extremely positive effects on the ecosystem. We deliver and will keep delivering.

In fact, we’ve been compiling a list of all the posts referencing, linking to or mentioning the results of the efforts outlined here, and the reception has been tremendous in terms of reach and impact, so sharing more of these insights is the right call. It is among the most popular topics that people, within and outside of the ecosystem, enjoy learning about, as this poll by Cris Pap shows.

The team hasn’t only been busy collecting a 360° overview dataset of the Polkadot ecosystem, but also reporting on it and sharing technical tidbits and POCs of what’s possible, as outlined in this forum post: Pieces of a decentralized data lake.

We’ll follow up on this and more. The team is now strengthened by working side by side with the Infrastructure department within Parity. If you have more ideas or things we can tackle, let me know and we’ll add them to our roadmap, which we look forward to sharing and talking about more in this forum.

We have quite a few surprises in store for this year and we’re truly excited to share and advertise much more. I’m personally a strong believer that Parity (Infra & Data, …) should talk more about what it’s doing and what’s being developed. Happy also to engage in discussions around the tech part of things, anytime.

@rvalle thank you very much for your kind words, I’ll relay them to the team right away. Our plans will certainly include releasing DotLake (code, data), as many teams have already reached out wanting to build on and reference a reliable resource for their projects. DotLake was initially designed to be an internal Parity dataset, and we’re taking steps towards changing that. But I’m happy to do a demo and run some queries for you in the meanwhile; we should be able to answer most of the questions - and if not, we’ll fix that :wink:


Thanks @Karim ,

My motivation is actually to align our initiatives. We run Polkawatch, funded by the community, so there is no point in duplicating effort when creating our roadmap. When possible we reuse other community efforts and also contribute to other projects.

Our project is very different in nature, with datasets living in IPFS instead of Google, but we are considering adopting the sidecar for indexing, and wanted to check that part.

I obviously support your call for openness, code and roadmap, in DotLake.

Thanks for your offer to answer questions, will reach out.


That’s amazing, the website looks very well designed.

Storing datasets in IPFS is a great idea; I believe moving in that direction is a smart choice and a way to ensure correctness and avoid gatekeeping at the same time. In my Decoded 2023 talk I presented just that: how to query datasets directly on IPFS, without any centralized entity in between, from the comfort of your browser, leveraging DuckDB & WASM (however, it was very late in the day and there were few people present :stuck_out_tongue_winking_eye:).
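For anyone curious what that pattern looks like outside the browser, here is a hedged sketch using DuckDB’s Python client and its httpfs extension to query a Parquet file through an IPFS HTTP gateway; the CID and file name are placeholders, not a published dataset:

# Illustrative only: query a (hypothetical) Parquet dataset over an IPFS HTTP gateway with DuckDB.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs;")   # enables reading files over HTTP(S)
con.execute("LOAD httpfs;")

# <CID> is a placeholder for the content identifier of the published dataset.
rows = con.execute(
    "SELECT COUNT(*) AS blocks "
    "FROM read_parquet('https://ipfs.io/ipfs/<CID>/polkadot_blocks.parquet')"
).fetchall()
print(rows)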

Sure thing, looking forward to chat.

For info, we are already working on sharing DotLake data with a few other teams; I’m sure we can figure something out to help your roadmap. Our mission is to find a robust and sustainable way to make DotLake one of the data sources you can confidently build on or base design choices upon, and not just a place to get data.
