Proposal: Polkadot P2P Network Health Weekly Reports

The ProbeLab team, which spun out of Protocol Labs (the inventor and core sponsor of libp2p and IPFS), focuses on measurement and benchmarking of protocols at **the P2P layer** of Web3 platforms.

So far, the team has focused on developing metrics for IPFS (see https://probelab.io), but it has also looked into other libp2p-based networks, one of which is Polkadot. We have been running our crawler and accompanying tooling on the P2P layer of Polkadot since the beginning of 2023.

After processing the raw data, we create weekly reports such as this one: https://github.com/plprobelab/network-measurements/blob/a7641238352c46b43f09fe72bf18c66af7faeb6d/reports/2023/calendar-week-41/polkadot/README.md. The corresponding reports for IPFS can be found here: https://github.com/plprobelab/network-measurements/tree/master/reports.

What do the reports show and why is it important?

The reports show several general-purpose metrics, derived by crawling the P2P network more than 300 times a week. Apart from node counts in the network, the reports show statistics on the following:

  • connectivity of nodes, which can indicate issues with regard to unavailable, overloaded or misconfigured nodes [link to plot],


  • availability of nodes (i.e., whether they’re online or offline) together with classification of availability per parachain client [link to plot], which indicates network health,


  • agent version breakdown and protocols, from which one can infer update rates to new releases, or identify a good time to break backward compatibility and save engineering resources (e.g., when very few nodes are left on an old version) [link to plot],

  • errors encountered when connecting to nodes, which can indicate faulty software or configuration, or other issues with the network [link to plot],


  • overall peer churn, as well as churn per client, which can indicate issues with node availability or misconfiguration [link to plot],


  • geolocation info [link to plot], together with the corresponding classification of availability of nodes in each region [link to plot], as well as agent version per region [link to plot].

Ask

Running the tools, processing and storing the data, and analysing the results incurs infrastructure and engineering costs, which the team has covered up to now.

Going forward, we can only continue to run the tools and produce these reports for a service fee of $2500 per month, plus a one-off, upfront cost for the initial setup of $5000. The service fee includes:

  • running and covering the cost of the infrastructure,

  • maintaining and updating the tools and infrastructure as needed (e.g., dependency updates and database maintenance),

  • having an expert from our group inspect the results and flag any “notable” items that need further attention on a weekly basis.

We plan to submit a proposal under the “Small Spenders” track, but in the meantime, we’d love to get feedback on the content of this post, the weekly reports, as well as new metrics that are of interest to the community.

In the future we plan to integrate these reports into https://probelab.io as a central point of reference, but this will come as a separate discussion item.

Feedback Request

With this post, we would like to invite the Polkadot community to give feedback on these reports, quantify the value that they bring to the ecosystem and ultimately vote on whether maintenance and publication of the reports should continue for the cost mentioned above.

We are also very keen to receive feedback and suggestions on other metrics that would be of interest to the community so that we can plan ahead and develop them.


I do have a few questions:

Could you explain precisely how you determine whether a node is “online”, “offline”, “dangling”, etc.? In other words, what is the crawling algorithm?

Which protocol name(s) are you using when crawling the DHT? (as a heads up, /dot/kad has been deprecated for at least a year in favor of /<genesis-hash>/kad).

What is “Inter Arrival Time”?

In the crawl errors, what is the number of crawls that didn’t error? I see that there are 20 to 30 errors (per hour? per day? clearly not cumulative, since it goes down), but we have no idea whether or not that is a lot.

I can more or less guess what the various connection errors mean, except for “context deadline exceeded”. What does that mean?


Your statistics show that around 90% of peers are “offline”, which is a bit alarming to say the least.

I’m however raising an eyebrow. While I don’t have a crawler, starting a Polkadot node and looking at its logs shows that it has no trouble establishing connections to other peers (it might have troubles finding a peer that accepts opening a gossiping channel, but that’s a different story). This is purely anecdotal evidence, but I think that we would easily notice if 90% of the DHT entries pointed to unreachable nodes.


Hi @tomaka,

All of these classifications refer to 7 days of repeated network crawls. One crawl of the Polkadot network takes ~20 mins (most of the time is spent waiting for “dead” peers to respond). In that 7-day period we perform a crawl every 30 mins, which amounts to 336 crawls per week. The classifications are defined as follows:

| Classification | Description |
| --- | --- |
| offline | A peer that was found in the DHT but never seen online during the measurement period (always offline) |
| dangling | A peer that was seen going offline and online multiple times during the measurement period |
| oneoff | A peer that was seen coming online and then going offline only once during the measurement period |
| online | A peer that was not seen offline at all during the measurement period (always online) |
| left | A peer that was online at the beginning of the measurement period, went offline, and didn’t come back online |
| entered | A peer that was offline at the beginning of the measurement period but appeared within it and hasn’t gone offline since |
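The classification rules above can be sketched in a few lines of Go. This is a minimal, hypothetical reconstruction based purely on the definitions in the table, not ProbeLab's actual implementation; it assumes each crawl yields a boolean "seen online" sample per peer.

```go
package main

import "fmt"

// classify buckets a peer from a series of equally spaced crawl results,
// where seen[i] is true if the peer responded during crawl i of the window.
// The rules mirror the table above: count online/offline transitions and
// look at the state during the first crawl.
func classify(seen []bool) string {
	transitions := 0
	for i := 1; i < len(seen); i++ {
		if seen[i] != seen[i-1] {
			transitions++
		}
	}
	switch {
	case transitions == 0 && !seen[0]:
		return "offline" // never seen online
	case transitions == 0 && seen[0]:
		return "online" // never seen offline
	case transitions == 1 && seen[0]:
		return "left" // online at start, went offline, never returned
	case transitions == 1 && !seen[0]:
		return "entered" // appeared within the window and stayed online
	case transitions == 2 && !seen[0]:
		return "oneoff" // came online and went offline exactly once
	default:
		return "dangling" // flapped online/offline multiple times
	}
}

func main() {
	fmt.Println(classify([]bool{false, false, false})) // offline
	fmt.Println(classify([]bool{true, true, true}))    // online
	fmt.Println(classify([]bool{true, false, false}))  // left
	fmt.Println(classify([]bool{false, true, false}))  // oneoff
}
```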

Which protocol name(s) are you using when crawling the DHT? (as a heads up, /dot/kad has been deprecated for at least a year in favor of /<genesis-hash>/kad).

We’re using /dot/kad. Does it mean we’re missing out on peers as they form a genesis-hash namespaced DHT?

What is “Inter Arrival Time”?

This applies to dangling peers. If a peer comes online at 6am every day, stays online for 2h, and then goes offline, its inter-arrival time is 24h (the time between two consecutive arrivals in the network) and its churn is 2h. This metric can reveal periodicities.

In the crawl errors, what is the number of crawls that didn’t error? I see that there are 20 to 30 errors (per hour? per day? clearly not cumulative, since it goes down), but we have no idea whether or not that is a lot.

These are 20 to 30 errors per crawl. The reference point is the total number of dialable peers: we were semi-consistently able to connect to ~1800 peers and semi-consistently found ~50 peers that we could not crawl. Crawling here means issuing the FIND_NODE RPCs to drain their buckets.

I can more or less guess what the various connection errors mean, except for “context deadline exceeded”. What does that mean?

This is a measurement deadline on our end, set to 1 minute. If we’re not able to dial a peer within 1 min, we abort the interaction with that peer and record a “context deadline exceeded”. This error can be interpreted as another i/o timeout.

Your statistics show that around 90% of peers are “offline”, which is a bit alarming to say the least.

I’m however raising an eyebrow. While I don’t have a crawler, starting a Polkadot node and looking at its logs shows that it has no trouble establishing connections to other peers (it might have troubles finding a peer that accepts opening a gossiping channel, but that’s a different story). This is purely anecdotal evidence, but I think that we would easily notice if 90% of the DHT entries pointed to unreachable nodes.

Hard to comment on that. Since most of the errors are timeouts, are there any mechanisms in place that prevent a random peer (like our crawler) from connecting and instead let it time out? Do Polkadot nodes log all failed connections?

Cheers,
Dennis

The offline percentage looks a bit strange… :thinking:

Are you guys filtering out bogon addresses? Are you supporting ipv6 as well?

Are you trying to connect to every multiaddress of every peer? Only one multiaddress?
In other words, does “offline” mean “every multiaddress has failed”?
Does a failed multistream-select or Noise handshake count as “offline”?

Maybe! It’s unclear to me whether this would give a different result given that the behavior is hidden behind many layers of abstractions. What is sure is that you’re using a deprecated protocol name that will be removed in the future.

No

If you enable the proper logging yes. I haven’t worked on the source code of the official implementation for over a year, but I guess that it should still be --log sub-libp2p=debug.

We’re only filtering out private addresses based on:

We support IPv4/IPv6+tcp/quic/quic-v1/ws/webtransport


I just looked a little closer into our latest crawl and found that three IP addresses account for ~4.5k of the offline peers. That is, there are ~4.5k unique peer IDs in the DHT that are associated with only three different multiaddresses.

| Multiaddress | Unique PeerIDs in the DHT |
| --- | --- |
| /ip4/3.141.x.y/tcp/7001 | 1619 |
| /ip4/18.191.x.y/tcp/7001 | 1515 |
| /ip4/3.21.x.y/tcp/7001 | 1374 |

All three are hosted on AWS in North America. I’m happy to share the full IP addresses privately.

While this doesn’t explain all ~12k unreachable peers, it is a start.

All multiaddresses, and “offline” means every multiaddress has failed. A failed multistream-select or Noise handshake would also be counted as “offline”. However, if the latter occurred we would receive a “failed to negotiate security protocol” error (or something along those lines), and if this were a common error it would show up in the “Connection Errors” graph: https://github.com/plprobelab/network-measurements/blob/a7641238352c46b43f09fe72bf18c66af7faeb6d/reports/2023/calendar-week-41/polkadot/README.md#errors

Maybe! It’s unclear to me whether this would give a different result given that the behavior is hidden behind many layers of abstractions. What is sure is that you’re using a deprecated protocol name that will be removed in the future.

Alright, it would be great to clarify the implications going forward! If it isn’t more than just the protocol identifier, it’s easy to spin up another instance of our crawler with the adjusted configuration :+1:

These are the different /kad protocols we have found since February this year:

| Found At | Protocol ID |
| --- | --- |
| 2023-02-21 | /dot/kad |
| 2023-02-21 | /91b171bb158e2d3848fa23a9f1c25182fb8e20313b2c1eb49219da7a70ce90c3/kad |
| 2023-02-21 | /7d70779de39aaf0a72f0edb85a0b7b3f83d29d01ce0ab403d8989d5a8102cb77/kad |
| 2023-02-21 | /38bfbec1898c3b9a91fe23b3249a3f9c93b193a2100acc88d204f245acd1a36b/kad |
| 2023-02-21 | /e1ea3ab1d46ba8f4898b6b4b9c54ffc05282d299f89e84bd0fd08067758c9443/kad |
| 2023-02-21 | /64d25a5d58d8d330b8804103e6452be6258ebfd7c4f4c1294835130e75628401/kad |
| 2023-02-21 | /aac8ce35b070b1f483ca40368dca46e1f770c421b559cd95f6ea1d798e020158/kad |
| 2023-02-27 | /0f7417b7e7fdc2df9236caaceebdca6b49a7839affb5372a934d91c8c7753c52/kad |
| 2023-02-27 | /d53fc93d530671973b1f7815aa558eeff4233d1dff64e78a321424fb9af0130e/kad |
| 2023-03-18 | /b0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe/kad |
| 2023-03-19 | /rococo/kad |
| 2023-04-04 | /74ebb247ec01ebfc6492a0c149383c1fe97ecc7c7f80fa268b9c586a29c4a16c/kad |
| 2023-04-04 | /89c5e915adde2e409a2310213cf7418a40b593cb225b22685747d16ee3eabb99/kad |
| 2023-05-04 | /ksmcc3/kad |
| 2023-06-12 | /ffc50145fbb3313c37b71868af0e653dd9e640dfb5488d92337acad8acc5a0cf/kad |
| 2023-06-13 | /adcd81cf2dbe80cbd8b8646fb199f13139d7972777acee30435ca149a4877d38/kad |
| 2023-06-13 | /88ac0ed3add59e428bf0f89bb14fb9f54ef53ee1da4bea01626c710fc4aaa2fb/kad |
| 2023-06-23 | /unqeast/kad |
| 2023-06-26 | /zeitgeist-polkadot/kad |
| 2023-06-30 | /unqwest/kad |
| 2023-07-07 | /12180370bddbf087cf3d82b0340b28f44f68e20774c3b922584a5ce61639c8b6/kad |
| 2023-07-10 | /b79cf509e99deab8142bc2a330614d614ce3dbc03c0b7a9bd0069f01019c0239/kad |
| 2023-07-11 | /9859e8ffa46a0c300e62fc979b2163b58f4354c6183c5a13f2d6814c5f898e74/kad |
| 2023-07-19 | /619f3c844caae8899048f7bf33dc395d9b28a0761988a969db25a17b85337716/kad |
| 2023-07-20 | /ee53982a32e270750ec3ac2810fe0da41587fdd6c7204043c26f3de14944af18/kad |
| 2023-07-30 | /cf1336a8db848cf818962cc108478649aadcf8aedc14127a2222c2a992f2e01b/kad |
| 2023-08-25 | /6408de7737c59c238890533af25896a2c20608d8b380bb01029acb392781063e/kad |
| 2023-11-02 | /cadcde064285314337e7f215fa0557d8ac7dabfd70001a134d01c73a2b0fff96/kad |
| 2023-11-03 | /ae6988666bb6c966d707c924ecf7bacb13e56253e7695ca647e822f966c33e25/kad |
| 2023-11-03 | /74c25bcdcd598c741633a697891bbbed22bc5041ae364429116c4844e27670c0/kad |
| 2023-11-03 | /91bc6e169807aaa54802737e1c504b2577d4fafedd5a02c10293b1cd60e39527/kad |
| 2023-11-03 | /sup/kad |
| 2023-11-09 | /85b49211d2cd57351cf835e261b57216c99305834d307b7291acb19b99c37b08/kad |
| 2023-11-13 | /b1d06c69544593722ec7ec70d2fd40e2b87c417f8603fd5c13ab88cd1bd56d21/kad |
| 2023-11-14 | /d8ef9fec6df0d48a7066529726e0641a62d8b3bc018c637756c679244b00049d/kad |
| 2023-11-14 | /c3404385f2b017c95b4be0c44b41ec90b0efb25e4654694afb8777d4ead10c01/kad |
| 2023-11-14 | /594384ebeb9bb02de4c2f153049c0ed7524c0b8eb6e26877f69d4c8d9a9add25/kad |
| 2023-11-15 | /8f7f87157cecf312ad31023e87470dd03aebd7e79fe7fb7093a8d7288be26d0b/kad |
| 2023-11-16 | /7389b78dbb3a94bf09aa53802c338131e242b78678f5959f170e4902ccf24422/kad |
| 2023-11-16 | /e553c7f33ab7dbbd8770a82169ac9ad7f49f53204e6d72d3a75f9df5d22e1183/kad |
| 2023-11-17 | /808c7e7200147ab5aefbc8133b18b729c800d64094151d7035341efcbaac9d2d/kad |
| 2023-11-19 | /f83966d723a250017070500e0a873e8e65560a6dadf026a70e29ba94e29370ba/kad |
| 2023-11-20 | /38d4d61dd2b6dc2ae1665af1f674c1d6108e2f321dd4fda8021450d135066080/kad |
| 2023-11-22 | /d97c0d8c02a2878f817b688d3397efa2584977f0332d0ba82303498110a0836f/kad |
| 2023-11-22 | /deadb0490a8470cbaf0e0d4b25d4ed7a34fe55d30f51be88fc14d8e43e7fa000/kad |
| 2023-11-23 | /88c6f970e468157db6d8391e55208c31aa69e815edd30a4160623d16a35bca48/kad |
| 2023-11-27 | /e9a0d0807082ac578d9b0a8e0019a0321f017ddedee356a5bae59e7e3849013e/kad |
| 2023-11-30 | /5c2aba289b1028ed0a466593c116fe29d5ff5d30b09b0ae9e34cb0d5fa42cd41/kad |

Is there a mapping of the genesis hash to the network these nodes belong to?

Unfortunately no.

The protocol name contains the hash of the genesis block. So if you know of a specific chain, you can find block 0, and get its hash.
For example, Polkadot is 91b171bb158e2d3848fa23a9f1c25182fb8e20313b2c1eb49219da7a70ce90c3.

Kusama is b0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe
Westend is e143f23803ac50e8f6f8e62695d1ce9e4e1d68aa36c1cd2cfd15340213f3423e

There’s no easy way to find this mapping (one will exist in the future). You’d have to look at each chain one by one.

All the protocol names with a word in them are the deprecated version.
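Putting the explanation above into code: the non-deprecated protocol name is simply the slash-delimited combination of the hex-encoded genesis block hash and `kad`. A trivial sketch (the function name is ours, not from any Polkadot codebase):

```go
package main

import "fmt"

// kadProtocolID builds the genesis-hash-based Kademlia protocol name
// described above: "/<hex-encoded genesis block hash>/kad".
func kadProtocolID(genesisHashHex string) string {
	return fmt.Sprintf("/%s/kad", genesisHashHex)
}

func main() {
	// Polkadot's genesis hash, as given earlier in the thread:
	polkadot := "91b171bb158e2d3848fa23a9f1c25182fb8e20313b2c1eb49219da7a70ce90c3"
	fmt.Println(kadProtocolID(polkadot))
	// → /91b171bb158e2d3848fa23a9f1c25182fb8e20313b2c1eb49219da7a70ce90c3/kad
}
```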

Thanks everyone for engaging in the discussion. We’ll invite more discussion on this post and will also submit a “Small Spenders” proposal soon. There seem to be a few things to look at in more detail, as far as I can tell.

That’s definitely something interesting to keep an eye on. Rotating PeerIDs might have consequences for the performance of the network.