`SELECT * FROM polkadot;`

Thank you all for the kind words as well as for the likes and support!

Allow me to clarify and hopefully satisfy your curiosity; it's not a bother at all :smiley_cat:

The current data architecture presented here is not meant for building Dapps; it's an analytical dataset for performing queries across different chains, and it serves as a baseline for calculating a whole lot of things. This is also called a data warehouse: a repository of structured and semi-structured data used for reporting, analysis, and decision-making. To us, it's a thin abstraction layer on top of the data generated by all the chains (the blocks + state changes), built using open-source tools.
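To make that abstraction layer a bit more concrete, here's a minimal sketch of the kind of layout such a warehouse could expose. The table and column names below are illustrative assumptions, not our actual schema:

```sql
-- Illustrative warehouse layout; all names here are assumptions.
CREATE TABLE blocks (
  chain        TEXT      NOT NULL,  -- e.g. 'polkadot', 'kusama', a parachain
  block_number BIGINT    NOT NULL,
  block_hash   TEXT      NOT NULL,
  block_time   TIMESTAMP NOT NULL,
  PRIMARY KEY (chain, block_number)
);

CREATE TABLE extrinsics (
  chain        TEXT      NOT NULL,
  block_number BIGINT    NOT NULL,
  extrinsic_id TEXT      NOT NULL,  -- e.g. '1234567-2'
  signer       TEXT,                -- NULL for unsigned extrinsics
  pallet       TEXT      NOT NULL,  -- e.g. 'balances'
  call         TEXT      NOT NULL,  -- e.g. 'transfer'
  success      BOOLEAN   NOT NULL,
  block_time   TIMESTAMP NOT NULL,  -- denormalized from blocks for convenience
  PRIMARY KEY (chain, extrinsic_id)
);
```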

The target audience for something like this is data analysts, or people within the ecosystem who'd like to investigate what happened on a specific chain on a specific date, compare that to what happened on another chain, and build up their own definitions and aggregates for metrics (e.g. how do we define an active account in the Polkadot ecosystem?).
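As a concrete illustration, such a shared definition could boil down to an agreed-upon SQL snippet. This is only a sketch against the hypothetical `extrinsics` table above, and the 30-day window is an arbitrary choice, not an official definition:

```sql
-- One possible (hypothetical) definition: an account is "active" if it
-- signed at least one successful extrinsic in the last 30 days.
SELECT chain,
       COUNT(DISTINCT signer) AS active_accounts
FROM extrinsics
WHERE signer IS NOT NULL
  AND success
  AND block_time >= CURRENT_TIMESTAMP - INTERVAL '30 days'
GROUP BY chain;
```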

Indexers generally provide APIs tailored for access from a browser or mobile app, based on indexes of various types of blockchain data, such as transactions, addresses, or smart contracts. They improve the speed and efficiency of searching and retrieving data by creating a data structure that maps the relationships between the data elements, but for one chain at a time. These mappings/computations need to be known before indexing. This allows for quick and efficient searches, lookups, and retrieval of information based on pre-defined methods of computing these metrics. They provide REST or GraphQL APIs for access in Dapps, as you correctly point out, which aren't meant for data analytics but rather for interactivity and real-time access. With a few opinionated endpoints, it's possible to use an indexer and quickly scaffold the functionality you would require for your app, as Subsquid and Subquery already demonstrate perfectly.

On the other hand, a data analyst who wants to build up a history of an event that wasn't indexed, of a new event that is not yet indexed, or of a different way of mapping/calculating things, needs to write a whole lot of code and iterate through a lot of blocks and APIs to manually parse events and extrinsics, figure out types, etc., adding more friction to the real goal, which is data analytics. This additional friction is not helpful if you account for the scale of the ecosystem. Queries like:

  • How many times did account X call pallet Y over the past month?
  • What was the balance of account Z at date D?
  • Etc…

… become complicated the more we'd like to investigate things that aren't along the beaten path or covered by pre-computed indices (see the SQL sketch below). It's 2023 and it shouldn't be this way, which is why we, together with Colorful Notion, provide and support such an initiative.
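For contrast, here's what the first two questions above could look like once the data sits in SQL. Again a sketch against the hypothetical tables sketched earlier; the `balance_changes` table and the concrete values are assumed purely for illustration:

```sql
-- How many times did account X call pallet Y over the past month?
SELECT COUNT(*) AS calls
FROM extrinsics
WHERE chain  = 'polkadot'
  AND signer = '1abc...'               -- account X (placeholder)
  AND pallet = 'staking'               -- pallet Y (placeholder)
  AND block_time >= CURRENT_TIMESTAMP - INTERVAL '1 month';

-- What was the balance of account Z at date D? Assumes a hypothetical
-- balance_changes table derived from the stored state changes.
SELECT free_balance
FROM balance_changes
WHERE chain   = 'polkadot'
  AND account = '1xyz...'              -- account Z (placeholder)
  AND block_time <= TIMESTAMP '2023-01-01 00:00:00'  -- date D (placeholder)
ORDER BY block_time DESC
LIMIT 1;
```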

We believe there are other approaches to this particular problem than building and maintaining APIs or forcing data analysts to learn JavaScript or make HTTP calls ^^. One of them, as illustrated here, is to provide access to an SQL-like database of absolutely everything (or only the analytically relevant fields) and collaboratively build up definitions and investigate what's in the data. It won't matter if there's an update at a specific date: we get the block, store it efficiently, and make it available for when we agree on how to parse and decode it. The benefit of having it in SQL is that it's understandable and standardized, and sharing SQL scripts (like what Dune is doing) is a much more scalable approach than having each data analyst rely on a series of locally maintained scripts built on a multitude of APIs that are subject to change.
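"Store first, agree on decoding later" also plays well with SQL, since most warehouses can query semi-structured data directly. A sketch, assuming a hypothetical `raw_blocks` table holding the undecoded block in a JSON column (Postgres-flavored syntax; the JSON shape is an assumption):

```sql
-- Peek into raw, not-yet-decoded blocks straight from the JSON:
-- count extrinsics per block without waiting for a decoded table.
SELECT chain,
       block_number,
       json_array_length(raw -> 'extrinsics') AS extrinsic_count
FROM raw_blocks
WHERE chain = 'polkadot'
  AND block_number BETWEEN 1000000 AND 1000100;
```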

These solutions are complementary to each other, but one point is clear: you can build indexers on top of solutions like this, but it's laborious to build an exhaustive solution like this on top of an indexer. As another forum post shows, computing the same thing using different indexers can yield different numbers. No matter how fast the data is fetched, if it's wrong or open to different interpretations, it isn't helpful.

So, in terms of speed of historical ingest: when building a database for archival and analysis, the historical ingest only needs to be done once. This means we won't be rebuilding the dataset every day, which would require us to optimize for speed. Indexers, however, do need this functionality, since many features are client-facing and call for faster sync times to allow faster iteration, and to accommodate applications that don't require a full index of things but just a subset.

As far as speed goes, as I mentioned in the post, the current architecture runs without any optimizations to the nodes themselves or any tuning of concurrency values on the nodes. It is, however, good enough. But for the sake of the exercise: given that it's possible to load-balance multiple Sidecar APIs (within Cloud Run revisions), each connected to a different node (each with a high concurrency value set), I am highly confident we could beat such a benchmark, since we'd be leveraging the scale of the cloud vs. the speed of syncing on a single node. We're only bound by I/O in our case, since we write JSON; writing to a DB or to memory would make it even faster. (This would be similar to running multiple Firesquids in parallel, to compare apples to apples, but then I'm certain we'd hit limits imposed by physics & nature.) :rocket::rocket:

To recap, we’re not competing with current solutions, but rather complementing the data offering in the Polkadot ecosystem to provide advantages to a different demographic.

So I assure you, @Whisker17, the team is doing quite a bit more here than just slowly compressing blocks. Jokes aside, I hope this answers your questions and clarifies things a little bit. :wink:

Today we can easily answer many more questions about the ecosystem than we could a few months ago - and we decided to share that insight, our know-how, and a pragmatic way forward.
