DotSentry: Ecosystem-wide Monitoring Solution

With a hint of realtime & open source data access for the ecosystem

Introduction

Many of the recent incidents within the Polkadot ecosystem were reported by ecosystem members & users, in the form of “Hey, there seems to be an issue”. Some examples are:

This certainly isn’t the kind of experience we’re envisioning for Polkadot, and it would only worsen as we add a growing number of even faster upgradeable chains.

Hence the need for increased observability & monitoring to directly address the point mentioned by Alexandru in the post linked above.

Note: I didn’t think this post would get so long, I might present the solution at Decoded (if my talk gets accepted :crossed_fingers: :disguised_face: :crossed_fingers: )


Within the Parity Infra/Data department we’re working alongside engineers to ensure that all the tools and data are available to deal with situations like these, but the current tools have a few limitations.

DotLake, while providing 360° insights on the ecosystem, contains data that is delayed by a few minutes to hours (and requires SQL skills). Our Prometheus/Grafana stack mainly provides information about node health and other metrics scraped at regular intervals ranging from tens of seconds to minutes; node data is also not representative of the state of the network, just the state of the nodes. This makes detecting and reacting to issues a challenge, as there is a gap between what we can detect and how fast we should detect it. The complexity of this challenge will only increase as more people adopt and build on Polkadot.

When troubleshooting issues, it is also necessary to rely on public block explorers, which aren’t built for debugging in the first place or which don’t cover the chain we want to investigate. All this introduces friction that translates directly into additional cognitive load and more time to resolve the problems that inevitably happen. As our initiative around Stabilizing Polkadot gets more traction and support, we’re making a first proposal to address this downstream problem, with downstream solutions (aka where the code is produced) following in the next months. Among these downstream solutions already gaining traction is the new ecosystem test environment bounty, and certainly more will follow. Our overarching goals are to prevent problems from happening, detect & mitigate them when they inevitably happen, fix them, and track the efficiency of the fixes over time.

The solution we’re proposing today isn’t a “DotLake but made realtime” nor a “Prometheus/Grafana stack made block aware” but something more: it is a purpose-built system, based on open source tools, to observe and detect in real time, as well as react to, anything that happens within the Polkadot ecosystem.

After receiving feedback from the ecosystem, we added a few key design constraints: the system can be operated in a cost effective manner (i.e. without breaking the bank or the treasury), with low maintenance and simplicity, and without relying on any complex deployments (Kubernetes, I’m looking at you). The whole stack is a docker compose up away from running.

We’re sharing this write-up for a few reasons:

  • The first is to underline our commitment to excellence in engineering and make Polkadot a rock-solid foundation you can build on
  • The second is to open up the initiative for people who’d like to provide their opinions/pain-points and feedback for what they would like to see
  • We need your feedback to make this better: monitoring is good, but it’s a technical solution, and improving things takes time and effort on top of that

The architecture we ended up with has a few additional advantages that we’ll elaborate on later. Let’s have a look at the solution.

How to get real-time data

In order to understand the operational and transactional integrity of blockchain systems, it’s crucial to monitor both the nodes and the content of the blocks themselves. So the first thing we need to figure out is how to get real-time data for both the chains and the nodes.

For this article, we’ll focus on the chain data; we’re working with ecosystem participants to augment our setup with P2P data and more (the latter is less of a challenge).

A typical Polkadot node exposes RPC endpoints to which it’s possible to connect directly using websockets, without requiring any libraries. While maintaining / persisting long-lived websocket connections is a pain, this is better done on the backend, especially if we have to connect to multiple chains at the same time. These connections need only listen to the head of the chain and can be set up to connect to multiple nodes, reducing the risk of missing out on data. Anytime a new block, whether finalized or unfinalized, is formed, it will be captured and sent downstream for processing.

This, while not really being anything novel, clarifies how to get notified in real-time when new blocks are produced on any chain within the ecosystem, without requiring any libraries at all. This mechanism will serve as a “clock” that we’ll compare with the real world time for our monitoring system. We’re calling this monitoring system DotSentry.
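To make the "no libraries needed" point concrete, here is a minimal sketch of the frames involved when subscribing to the head of a chain over a raw WebSocket, using the legacy Substrate JSON-RPC method `chain_subscribeNewHeads`. The wiring at the bottom is illustrative; only the frame-building and parsing logic is shown as testable functions.

```javascript
// Request sent right after the socket opens: subscribe to new chain heads.
function newHeadsRequest(id = 1) {
  return JSON.stringify({
    jsonrpc: "2.0",
    id,
    method: "chain_subscribeNewHeads",
    params: [],
  });
}

// Each subsequent notification carries the new header; the block number
// arrives hex-encoded, so we decode it to a plain integer.
function parseNewHead(frame) {
  const msg = JSON.parse(frame);
  if (msg.method !== "chain_newHead") return null; // ignore non-notifications
  const header = msg.params.result;
  return { blockNumber: parseInt(header.number, 16), parentHash: header.parentHash };
}

// Wiring it up in a browser (or any runtime with a global WebSocket):
// const ws = new WebSocket("wss://rpc.polkadot.io");
// ws.onopen = () => ws.send(newHeadsRequest());
// ws.onmessage = (e) => {
//   const head = parseNewHead(e.data);
//   if (head) console.log(`new head: ${head.blockNumber}`);
// };
```

This is exactly the "clock" described above: every `chain_newHead` notification is one tick that can be forwarded downstream.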

DotSentry is a “sentry” for DOT, a guard for the Polkadot ecosystem. It leverages the following key technologies that allow massive scale and most importantly flexibility at an unbeatable price tag:

  • NATS: an open source message broker
  • ClickHouse: an extremely performant database that integrates natively with NATS, open source
  • Substrate Sidecar: a stellar utility maintained by Parity to help decode blocks from the chains and the basis for DotLake, also open source

The system has now been running for a couple of weeks and a few demos have been shared here and there; I’d like to specifically thank @kianenigma and @OliverTY for their feedback. This write-up consolidates the learnings, offers a plan for where we want to take the idea and elaborates on a few opportunities.

In the end, we want to propose 2 things:

  • An open source code base to run this locally for yourself
  • A hosted option with all the bells and whistles that we can make available at dotsentry.xyz

Let’s get deeper into the weeds of the system.

Working with real-time data

The system is architected around a central broker called NATS. NATS’ sole responsibility is to receive data and forward it in real-time to other systems. Its aim isn’t to persist data. Without such a system, it would be necessary to have different services communicate directly with each other, which adds a lot of coupling and friction the more the system grows. This comes at a certain cost we’ll elaborate on in a bit, but for now this is how it looks:

The block listener service also has one single job and it needs to be superb at it. Anytime a new block is produced by the chain, it notifies NATS by sending a message containing the number of the block and some light metadata. In other terms, the block listener essentially says “Chain X has produced block number N” and that’s it. There can be multiple block listeners for the same chain; NATS can be configured to deal with this pretty well, and since it’s more important to capture produced blocks, it is perfectly fine to have a “more than once” notification for the same block. This ensures that we’re monitoring the chain instead of monitoring the RPC nodes (or a light client).
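A sketch of what the listener publishes. The subject layout mirrors the pattern used later in this post (`Polkadot.*.*.*.Blocks.*.BlockNumber`), but the meaning of the middle segments here (network / chain / source) is an assumption for illustration, not the final schema:

```javascript
// Build a NATS subject for a block-number notification. The segment names
// besides "Blocks" and "BlockNumber" are illustrative assumptions.
function blockNumberSubject({ ecosystem, network, chain, source, status }) {
  return [ecosystem, network, chain, source, "Blocks", status, "BlockNumber"].join(".");
}

// Wrap any publish function (e.g. a real NATS client's publish). Delivery is
// "more than once": duplicates are acceptable, consumers deduplicate downstream.
function makePublisher(publish) {
  return (meta, blockNumber) => publish(blockNumberSubject(meta), String(blockNumber));
}
```

With a real client, `publish` would be `(subject, data) => nc.publish(subject, sc.encode(data))`, matching the browser example further down.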

This block listener is the example entrypoint for DotSentry.

If we abstract things a bit, other configurations will look like this:

We could have services both listening to newly produced messages on NATS (in this case, block data) and writing back new data, and we can also have services that only read and ones that only produce data.

An example of a service that only produces would be a storage function call that is run at specific intervals and pushes the data to the broker. An example of a service that does both would be a storage function call that is run whenever a specific on-chain event (or group of events) is detected. One can also create a service that calls a set of storage functions with precise parameters and broadcasts the results, to understand whether there was a regression or a specific update to an API. Similarly, if you’re not a fan of RPCs, light clients would work just as well as the input for the data.

Another example that is more pertinent here would be the block fetcher service:

This service complements the block listener by leveraging an external Substrate Sidecar to fetch the full data for a block: in essence, some JSON containing all the data for this specific block. This happens on auto-pilot whenever a new block is produced by any chain. This is quite neat because the block fetcher service is critical, and it can be triggered independently from the block listener in case of failures or in the event of a necessary backfill.
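The fetcher's single job fits in a few lines. Sidecar's `GET /blocks/{blockId}` endpoint does the heavy lifting; the `publish` and `fetchFn` functions are injected here so the sketch stays decoupled and easy to re-trigger for backfills (the subject name in the publish call is illustrative, not the deployed schema):

```javascript
// On a "chain X produced block N" notification, ask Sidecar for the full
// decoded block and push the JSON back onto the broker.
async function fetchAndPublishBlock({ sidecarUrl, blockNumber, publish, fetchFn = fetch }) {
  const res = await fetchFn(`${sidecarUrl}/blocks/${blockNumber}`);
  if (!res.ok) throw new Error(`Sidecar returned ${res.status} for block ${blockNumber}`);
  const block = await res.json();
  // Illustrative subject; a real deployment follows the shared subject schema.
  publish(`Polkadot.Blocks.${blockNumber}.BlockContent`, JSON.stringify(block));
  return block;
}
```

Because the function only needs a block number, the exact same code path serves live ingestion and backfills alike.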

Two more examples (to really get the point across):

NATS can be natively connected to ClickHouse, meaning that any new JSON block data that is incoming (or any other data stream for that matter), can be persisted into one or multiple tables, thus providing capabilities to do real-time SQL querying on block data for free. This will allow us to leverage all the work we did in DotLake and reuse the parsing logic for this new use case.

Moreover, NATS also exposes a Websocket endpoint, so anything we could previously do for backend services can be natively made available to any browser-based client, in real time. We can construct a real-time overview of the whole ecosystem using this same architecture.

This was all to drive one point home: working with real-time data is a bit trickier and requires quite a few failsafes, because problems will inevitably happen. Having decoupled services allows us to remediate these problems in a nice way, as well as coordinate efforts across multiple parties, plus we get some neat added benefits. All we need is this shared data stream provided by NATS, and then tests, checks and more can be added on top without any constraints on the programming language.

Incidentally, this sort of architecture will have the additional positive effect of providing a free and open source real-time data feed for the ecosystem to build on top of, and serve as an addition to (or evolution of) the DotLake infrastructure. By providing it in two parts, one open source and one freely available hosted service, we can strive to avoid any centralization or single point of failure while providing a solid foundation for ecosystem teams to build on. In fact, setting up an ephemeral block explorer (keeping just a few blocks) on top of this will prove handy for teams who are just starting up, partly addressing Luca’s points mentioned in this post: RFC: Should we launch a Thousand Cores program? New chains can be automatically picked up by the system from the Polkadot-JS App repo (thanks @Ben for the hint). One could even build a service similar to Flipside Crypto using this. (maybe we “müssen” build it? :wink: )

DotSentry V0 architecture

When putting all the elements presented above together, this is one configuration we can end up with to serve our purpose of monitoring chains in real time. I’ll illustrate the flow of data below.

  1. The Block Listener listens to the head of all Polkadot chains and pushes new block numbers and their metadata to NATS. This component automatically handles reconnects, balancing across multiple nodes, etc.

  2. The Block Fetcher listens for new block numbers on NATS (for any chain) and then queries the appropriate Substrate Sidecar for that specific block number and chain. It is a simple service that emits a REST request and doesn’t hold much logic; the heavy lifting (parsing, metadata handling etc.) is done by Sidecar

  3. The Block Fetcher pushes the full JSON data for the block to NATS

  4. Both ClickHouse and a web browser get the real-time JSON data for the block for further processing. On the ClickHouse side, materialized tables / views allow creating custom queries to watch for changes. On the browser side, we can display each new block for all the chains in one view, using a single Websocket connection instead of maintaining N connections to multiple chains in one window.
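The browser side of step 4 can derive block times with no backend at all: record the wall-clock arrival time of the previous block per chain and subtract on each new message. A minimal sketch (the function name and wiring are illustrative, not the deployed code):

```javascript
// Track per-chain inter-block times from the live stream. The clock is
// injectable so the logic is testable without waiting for real blocks.
function makeBlockTimeTracker(now = () => Date.now()) {
  const last = new Map(); // chain -> wall-clock arrival of the previous block
  return (chain) => {
    const t = now();
    const prev = last.get(chain);
    last.set(chain, t);
    // First block for a chain has no interval yet.
    return prev === undefined ? null : (t - prev) / 1000; // seconds
  };
}

// Usage with the NATS subscription shown further down:
// const blockTime = makeBlockTimeTracker();
// handleMessage = (msg) => console.log(blockTime(msg.subject.split(".")[0]));
```

Note this measures arrival times as seen by the observer, which is exactly the "chain clock vs. real-world clock" comparison described earlier.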

Here is a demo of the thing working in the browser, observing the block times for the Polkadot relay chain in real time: x.com. It’s running in a locally hosted Observable Framework notebook.

I’m using the example of ClickHouse and a web browser here to illustrate that SQL is not a sine qua non for all of this to work, but just an example. As a matter of fact, all that’s important is the data that gets pushed into NATS and how it’s made available downstream. I’ll share a few more use cases and how this system serves as a “DotSentry”.

Try it out today

So I’ve gone ahead and also deployed an instance at wss://dev.dotsentry.xyz/ws. You can access it today and play around with it; there’s no CORS on this bad boy yet, so you can have all the fun you want. The V0 is quite stable but I can’t make guarantees: it’s a dev stage and will evolve. The prod stage will need more guarantees: in fact, the monitoring system needs to be more available than the thing it’s supposed to monitor ;D. The networks that are deployed now are as follows:

  • Polkadot: block numbers & block contents
  • Polkadot AssetHub: block numbers & block contents
  • Polkadot BridgeHub: block numbers & block contents

Here’s some code to get you started in 5 minutes. Create an HTML file anywhere on your computer (dotsentry.html) and put this code in it:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>DOT Sentry</title>
</head>
<body>
    <script type="module"> 
        import {
            connect,
            StringCodec,
        } from "https://cdn.jsdelivr.net/npm/nats.ws@1.10.0/esm/nats.js";

        const sc = new StringCodec();

        // connects to DotSentry, don't mind the auth it's just to restrict publishing to topics
        const nc = await connect(
            {
                servers: ["wss://dev.dotsentry.xyz/ws"], user: "test", pass: "test",
            },
        );

        // Subscribes to the block numbers
        const sub = nc.subscribe("Polkadot.*.*.*.Blocks.*.BlockNumber");

        const handleMessage = (msg) => {
            // Decode the payload once and log it alongside its subject
            const blockNumber = sc.decode(msg.data);
            console.log(`${msg.subject}: ${blockNumber}`);
        }
                
        (async () => {
            for await (const msg of sub) handleMessage(msg)
        })();
    </script>
</body>
</html>

(don’t judge my JavaScript skills plz, thnx)
Open the file in your browser and check the console.log output. It should look something like this:

It should output a log message anytime a new block is created on any of the configured chains, including the block number. This isn’t the end of it though: for Polkadot, we can also get the full block data in the same way. All you need is to change this line const sub = nc.subscribe("Polkadot.*.*.*.Blocks.*.BlockNumber");
to const sub = nc.subscribe("Polkadot.*.*.*.Blocks.*.BlockContent");. And you’re good to go, streaming the block contents live (minus some fields) for Polkadot:

This works for all the chains (though currently I’ve activated access to just the 3 mentioned above). Without going into the details of topics, schemas and other data nerd things, the main message here is that these things work and are easy to deal with: precisely what you want for monitoring things in real time. With this we can monitor everything, from block data, to node data, to test results, and provide the flexibility downstream to mix and match the data without any constraints.

For the CLI fans: if you install the nats CLI and run the following command, you’ll have the data streaming to your terminal in no time:

nats -s wss://dev.dotsentry.xyz:443/ws --user=test --password=test subscribe "Polkadot.*.*.*.Blocks.*.BlockNumber"

This will subscribe you to the current blocks:

It should also be possible to make this example work with websocat/jq and get the data directly streamed to your terminal and transform it to your liking. This data is meant to be accessible to anyone (dotsentry emacs package wen?)

Jokes aside, in the same way that this works, you can connect anything on the receiving end: a database, another service, another chain, even Grafana or other dashboard tools. Try integrating it into https://observablehq.com/ and make a pretty visualization. Again, this same data can be made available without any overhead to a real-time SQL database like ClickHouse, to query things on the fly and, you guessed it, define alerts.

The example above runs in ClickHouse, is connected natively to NATS (no glue code) and allows me to keep track of which validators changed their commission. A similar set of services will monitor bridges. It works the same way as above, leveraging a storage function as data input in this case.
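For illustration, the same kind of commission check can be expressed as a small JavaScript consumer of the BlockContent stream (the version described above runs as a ClickHouse query, not this code). The pallet/event names used here (`staking` / `ValidatorPrefsSet`) are an assumption about the runtime at hand and would need adjusting per chain; the block shape follows Sidecar's decoded-block JSON (`extrinsics[].events[].method.{pallet,method}`):

```javascript
// Scan a Sidecar-decoded block for validator preference (commission) changes.
// Event names are an assumption for illustration; adapt to your runtime.
function findCommissionChanges(block) {
  const hits = [];
  for (const extrinsic of block.extrinsics ?? []) {
    for (const ev of extrinsic.events ?? []) {
      if (ev.method?.pallet === "staking" && ev.method?.method === "ValidatorPrefsSet") {
        hits.push({ block: block.number, data: ev.data });
      }
    }
  }
  return hits;
}
```

Subscribed to `...BlockContent`, this runs on every block with no polling, which is the whole point of the architecture.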

Neat, no?

Hosted UI

Of course if you don’t want to bother with any of that, there is a version running here:
Block Production | DOT Sentinel

Give it a few seconds and it picks up, or refresh the window if you don’t see anything. My frontend skills are so-so, so the UI can definitely be improved a lot.

PS: does this scale? I think it does, some notes on benchmarking can be found here.

Putting it all together and further use cases

This type of system could accommodate the growing needs of the ecosystem, increasingly complex “sanity checks” and flexible monitoring; it can also be linked to historical storage of blocks (like DotLake) to come up with more interesting metrics.

Additionally, it’s flexible enough that you don’t need to learn PromQL (the Prometheus query language) or anything else: use what you know to monitor things, and you’re free to use your own alerting channels as you wish. It can literally work from a single HTML file.

Architecting such a system comes with some challenges though, especially around schemas (the data shapes each service expects in order to work right). Without going into too much detail for the sake of this post, these are problems that have solutions and can be figured out along the way.
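The schema problem in miniature: each consumer declares the fields it expects and rejects messages that don't match, so a producer-side change surfaces as an explicit error instead of silent corruption. This is a deliberately naive sketch; a real deployment would use JSON Schema or a schema registry:

```javascript
// Check that a parsed message has every expected field with the expected
// primitive type ("string", "number", ...). Extra fields are tolerated.
function checkShape(msg, shape) {
  return Object.entries(shape).every(([key, type]) => typeof msg[key] === type);
}

// A consumer would guard its handler with it, e.g.:
// const blockNumberShape = { chain: "string", blockNumber: "number" };
// if (!checkShape(payload, blockNumberShape)) { /* drop + alert */ }
```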

The examples above use block data, but the same system can be used to ingest real-time data coming from tools like Nebula (a DHT scraper) and store it next to the block data, allowing a few more interesting queries. This will be added shortly after the first version is live. With this, it’s now possible in real time to view the failed extrinsics on multiple networks, define metrics for “strange activity” and put in place distributed systems for monitoring the chains. ( @xlc shared his thoughts on this here too: Generalized multichain monitoring solution )

What is important?

In order not to miss the forest for the trees, we need to first agree on what is important to monitor and what makes for a problem we want to watch out for, before scaling everything up. Defining what happens once a problem is detected is more important than how we’re detecting problems. That way, we can agree on what we should aim for in terms of SLAs/SLOs to increase reliability and improve developer experience by removing any guesswork. In fact, it’s better to have a short, impactful list of things to monitor and test than an infinite number of tests that nobody understands or cares about.

The technical solution is a detail compared to this; having clear accountability and responsibility is the goal.

Which is why I need your input. Among the categories of things to monitor, we’ve identified the following:

  • Level 1: Basic Health and Operational Continuity
  • Level 2: System Performance and Efficiency
  • Level 3: Economic and Governance Health
  • Level 4: Security and Anomaly Detection
  • Level 5 (optional for this system): User and Community Engagement

I’ve started listing a few below and need your input: what are we missing? What can we add? With the current setup, virtually anything can be monitored, but we can first focus on things that, if they change too much or too fast, would indicate a problem.

Chain health metrics

  • Block Production: Frequency and regularity of block creation.
  • Chain Finality: Time taken for blocks to be considered final.
  • POV sizes: Block utilization.
  • Transaction Throughput: Number of transactions processed per second.
  • Transaction Latency: Time taken for transactions to be confirmed.
  • Transaction Success/Failure Rates: Ratio of successful transactions to failed ones.
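As a sketch of how the first metric (Block Production) could become an alert on top of the live stream: flag a chain as stalled when no new block has arrived within a threshold (say, 3× the expected 6 s block time). The function names and threshold are illustrative, not part of the deployed system:

```javascript
// Detect chains whose block production has stalled: no block seen within
// thresholdMs of "now". Feed it from the BlockNumber stream shown earlier.
function makeStallDetector(thresholdMs) {
  const lastSeen = new Map(); // chain -> timestamp (ms) of last observed block
  return {
    record(chain, timestampMs) { lastSeen.set(chain, timestampMs); },
    stalled(nowMs) {
      return [...lastSeen]
        .filter(([, t]) => nowMs - t > thresholdMs)
        .map(([chain]) => chain);
    },
  };
}

// e.g. const detector = makeStallDetector(18_000); // 3x a 6s block time
// on each message: detector.record(chain, Date.now());
// on a timer:      for (const c of detector.stalled(Date.now())) alert(c);
```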

Network and Connectivity Metrics

  • Peer Connectivity: Number of peers each node is connected to.
  • Network Latency: Time taken for data to travel between nodes.
  • Bandwidth Usage: Amount of bandwidth consumed by the network.

Consensus and validation metrics

  • Validator Performance: Effectiveness of validators in producing blocks and participating in consensus.
  • Slashing Events: Occurrences of validators being penalized for misbehavior.
  • Stake Distribution: Distribution of staked tokens among validators.

Treasury metrics

  • Average spend

XCM metrics

  • XCM Message Success Rate: Proportion of successful cross-chain messages.
  • XCM Queue Length: Number of messages in the cross-chain message queue.
  • XCM Latency: Time taken for cross-chain messages to be processed.

API versioning (on live systems)

  • Function / logic changes: Was there an update that impacts functions that I’m depending on?
  • Regressions

From the user perspective, Oliver already started here with dotpong, check it out.

Other metrics / Anomaly detection

  • Staking Rewards: Distribution and changes in staking rewards.
  • Token Circulation: Velocity and volume of the native tokens being transacted.
  • Irregular Activity Patterns: Unusual changes in transaction volumes or block production rates.
  • Runtime Upgrades: Occurrences and impacts of runtime upgrades.

We think that this service needs to be public and visible on a single page. Ideally, the current state NOW and the past 7 days should be visible right away to get an idea of what’s going on, and it should be connected to all of the chains in the ecosystem.

What do you think? I already snatched the dotsentry.xyz domain to test something out. If you’d like access to an alpha version of the feed and want to build your own UI, you can already get started. If you want access to the NATS stream and to publish things, reach out to me.

Additional goodies / stretch goals

Given the flexibility of this system we can add a few more goodies:

  • On the same page, it should be possible to query anything, including multiple chains at the same time
  • Raise an issue directly on the page and know if an incident is being currently taken care of, and where the discussion is happening
  • A timeline to show upcoming updates to the networks
  • A historical view of end to end tests, uptime and other statistics exactly like Oliver TY describes in this post: Stabilizing Polkadot - #21 by OliverTY.
  • A historical view of our technical indicators for the different levels
  • A bounty (?) to add more tests and/or host NATS leaf nodes / monitoring services, or even make the data stream available to anyone in the ecosystem at no charge (while not completely “decentralized”, NATS can be set up to be distributed across multiple entities)

PS: if you hit any issues with the demo, please ping me or write a message here. I’ll update the article in case links change etc. Also special thanks to the Dwellir team for providing access to RPC nodes for this V0.


Conclusion

All in all, the goal of this initiative isn’t to come up with a “status page for Polkadot”. It is rather to build a power tool for the ecosystem that can provide a real-time 360° view of the network’s health, with some advanced capabilities for querying and probing the network.

Setting this up and working through the plan will also help structure our response when incidents happen with the hope that they will be as few as possible and as straightforward as possible to fix.

This initiative is part of our focus within tech, Stabilizing Polkadot, and under the guidance of @pierreaubert we will make this happen. Having a technical solution is a first step; agreeing on how we will organize and respond will be the real success.

In order to increase trust in our systems, it’s paramount we put things in place that notify us immediately if things go awry.

Of course, prevention is usually cheaper than the cure, so this isn’t the end of things. Our teams will share more in the following weeks/months around what we want to do downstream in terms of testing, code coverage and more, to make the monitoring a mere formality and not a temporary fix.

Please let us know your thoughts, ideas and input. Unlike the previous post, Pieces of a decentralized data lake, which was meant as inspiration, we’re building this one with the goal of being the first to detect any future issues (timeline ~ start of H2 2024) :alarm_clock:


That’s great. I am sure every parachain has built its own monitoring service, as Acala has, and always feels it is not good enough. Great to see we have a team working on a scalable and open source solution that can once more reduce the overhead of maintaining and operating a chain.

Can’t wait to try it and integrate it with Acala.


Thanks for the detailed breakdown. As you acknowledge, monitoring is more than a pain point.

One omission that came immediately to mind is a fit-for-purpose data structure. I don’t believe you can naively pump metrics around and hope to have an informative/useful monitoring system.

Fortunately, this is a common problem that talented people have worked hard to solve, partially, if not completely:

I’m sure things have evolved since I looked at this space. But that should get you a running start.


Thanks for the kind words @xlc. We’ll make sure to share things as fast and as well as we can, and will write and circulate how-tos. Ideally this thing would first be helpful in detecting issues quickly.

Which brings me to your point @taqtiqa-mark: it is certainly valuable and will be considered once we have settled on a definitive architecture. Right now we’re mainly pumping data; metrics will follow later. The first stage of the process is speed in implementation and exhaustiveness in requirements. I believe that shoe-horning things too early into a data model might do us a disservice and land us in a local optimum, especially given the heterogeneity of things we want to measure, and, as you put it, it’s very hard work getting it right. It’s still extremely important, just at a later stage of the process: good enough will do for the V0/MVP. I’ve made a note to ping you once we’re there; I’m sure your input will be key to help us get this right.

I’m a quant/finance refugee; the above is close to the total of my knowledge here. I’d expect your CS/Eng staff would already be all over this?

Certainly! We hope to exceed your expectations :sunglasses:

First of all, I am impressed by the soundness of the solution and how advanced it is already. Big kudos for your team and you!

Have you considered merging the block listener and the block fetcher into one step/block? For example, by using the chainHead method of the new JSON-RPC spec and decoding every new block pinned by the subscription. Also, I am sure Polkadot-API provides first-hand support for this use case of decoding the block at the tip of the chain. Bear in mind that this assumes the node implements the new JSON-RPC spec.

I am not sure if this would bring any performance improvement beyond making the tech stack/architecture simpler, but if it is worth it, I’m happy to support with the task.
