A deep dive into overhauling litep2p's ping protocol to support periodic pinging, spec-compliant payloads, and a new SubstreamKeepAlive mechanism for connection lifecycle management

Fixing litep2p’s Broken Ping Protocol (And the Connection Keep-Alive Redesign It Required)

litep2p is the networking library behind Polkadot SDK. It’s a Rust implementation of libp2p, the protocol suite that handles everything from peer discovery to block propagation across the Polkadot network. Every validator, collator, and light client uses it.

One of the most basic protocols in libp2p is ping. You send 32 random bytes to a peer, they send the same bytes back, and you measure how long it took. Simple. Except litep2p’s ping implementation was broken in ways that actually mattered. This article is about [PR #416](https://github.com/paritytech/litep2p/pull/416) (“ping: Conform to the spec & exclude from connection keep-alive”), where I fixed it. What started as a “simple” fix turned into a 30-commit, multi-month effort that ended up touching the connection lifecycle across every transport.

For context, here’s where ping sits in litep2p’s architecture.

Ping and Identify are “infrastructure” protocols. They exist to support the connections that the real application protocols (Kademlia, notifications, etc.) use. That distinction will become important later.

Before getting into what was broken, it’s worth clarifying two separate mechanisms that are easy to conflate: ping and keep-alive.

Keep-alive is litep2p’s built-in mechanism for detecting healthy but idle connections. Even if a connection is perfectly fine at the transport level, if no useful application data is flowing over it, litep2p should eventually close it. One reason is resource management: every open connection consumes file descriptors and memory. But there’s also a security angle. If a swarm of malicious nodes can occupy all of a validator’s connection slots by just connecting and doing nothing useful, that validator gets cut off from the network. This is a resource starvation attack.

Ping is a separate protocol that consumers of litep2p (like Polkadot) can use to implement connection health checks. It answers a different question: not “is this connection doing useful work?” but “is the remote peer actually alive and responding?” A connection can be idle (no application traffic) but healthy (peer is reachable), or active (substreams open) but dead (peer crashed and hasn’t responded in 30 seconds). Keep-alive handles the first case; ping handles the second.

The reason these two mechanisms interact, and why this PR had to touch both, comes down to what happens when ping becomes periodic. But I’ll get to that.

What Was Actually Wrong

The old ping implementation had three problems. The obvious one: it only pinged once, when a connection was first established. After that one ping, nothing. The connection could die silently and litep2p wouldn’t notice until some higher-level protocol tried to use it and failed.

But there were two more subtle issues. The payload was always 32 zero bytes instead of random data. And the response was never checked. A peer could echo back garbage and we’d still count it as a successful ping. The libp2p spec is clear about both of these: random bytes, verified on return.
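To make the spec requirement concrete, here is a minimal, dependency-free sketch of payload generation and verification. The function names (`ping_payload`, `verify_echo`) are hypothetical, not the PR's actual API, and the xorshift generator is only a stand-in for whatever real randomness source the implementation uses:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Generate a 32-byte ping payload. A real implementation would use a proper
/// RNG (e.g. the `rand` crate); this stand-in runs a xorshift64 generator
/// seeded from the clock so the example needs no dependencies.
fn ping_payload() -> [u8; 32] {
    let mut seed = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .as_nanos() as u64
        | 1; // ensure a non-zero seed (xorshift's fixed point is 0)
    let mut out = [0u8; 32];
    for chunk in out.chunks_mut(8) {
        // xorshift64 step
        seed ^= seed << 13;
        seed ^= seed >> 7;
        seed ^= seed << 17;
        chunk.copy_from_slice(&seed.to_le_bytes());
    }
    out
}

/// Spec requirement: the echoed bytes must equal exactly what we sent.
fn verify_echo(sent: &[u8; 32], echoed: &[u8]) -> bool {
    echoed == &sent[..]
}

fn main() {
    let payload = ping_payload();
    assert!(verify_echo(&payload, &payload));
    // The old behavior (always 32 zero bytes) would now fail verification
    // whenever a peer echoes the wrong thing back.
    assert!(!verify_echo(&payload, &[0u8; 32]));
}
```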

Here’s the difference: the old code sent one ping per connection, with an all-zero payload and no verification of the echo; the new code pings periodically with fresh random bytes and checks every response.

For Polkadot specifically, the one-shot ping is a real problem. Validators need to be reachable for GRANDPA finality. If a connection silently dies and the node doesn’t notice, it thinks it has peers it doesn’t actually have. Collators have the same issue with their relay chain connections. And smoldot light clients running in browsers deal with flaky connections constantly (tab suspension, network changes). In all these cases, you want to find out a connection is dead in seconds, not whenever Kademlia happens to try it.

There was also an open issue, “make connection keep-alive mechanism respect active substreams”, which flagged a separate but related problem: litep2p’s keep-alive mechanism didn’t distinguish between ping substreams and application substreams. As described above, this distinction matters. If ping activity alone can keep a connection alive, an attacker only needs to respond to pings to hold connection slots open indefinitely, without doing any useful work.

How It Went

I want to talk about the design process because the final implementation looks nothing like what I started with, and I think the reasons why are interesting.

First Attempt: Tasks Everywhere

My initial approach was to spawn separate async tasks for each peer, one for outbound pings and one for inbound responses. They communicated with the main protocol loop through channels. It worked, but it was a headache. Every peer connection meant two tasks to manage, and cleaning them up correctly during connection closure was fiddly. Too many moving parts for something that should be simple.

Second Attempt: Streams

I scrapped the task-per-peer approach and switched to Tokio’s StreamMap, a keyed collection where each peer gets one entry for outbound pings (“pinger”) and one for inbound (“responder”). The main event loop just polls the StreamMap alongside transport events. No spawning, no channels, no cleanup coordination. The StreamMap handles it.

This was the right shape. But it surfaced a new problem.

The Part I Didn’t Expect

As soon as I made ping substreams long-lived (to support periodic pinging and substream reuse), connections stopped getting cleaned up. A connection with nothing going on except ping and identify would stay open forever, because litep2p treated all substreams as reasons to keep a connection alive.

This is the paradox: the protocol meant to detect unresponsive peers was preventing idle connections from being cleaned up.

That’s when dmitry-markin got involved and designed the SubstreamKeepAlive mechanism. He focused on getting it right across all the transports (TCP, WebSocket, WebRTC), which each have their own substream lifecycle quirks. I’ll explain how it works below.

What Changed

Four things, all connected.

1. Periodic Pings with Substream Reuse

The core change. Instead of opening a substream, sending one ping, and closing it, we now keep the substream open and reuse it. The implementation creates an infinite async stream from the substream. On each iteration it sleeps for 5 seconds, generates 32 random bytes, sends them, waits for the echo (20-second timeout), checks that the echoed bytes match, and reports the measured round-trip time.

Three improvements over the old code: random payloads (spec-compliant), payload verification (catches misbehaving peers), and substream reuse (avoids re-negotiating the protocol on every ping).

The 20-second timeout is intentional. The old 10-second timeout was too aggressive for high-latency paths, but 20 seconds still catches genuinely dead peers quickly enough.
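The decision logic of one pinger iteration can be separated from the async substream I/O and sketched on its own. This is an illustrative decomposition, not the PR's actual code; `classify` and `PingOutcome` are hypothetical names, and only the 5-second interval and 20-second timeout come from the source:

```rust
use std::time::Duration;

/// Values from the PR: ping every 5 seconds, 20-second echo timeout.
const PING_TIMEOUT: Duration = Duration::from_secs(20);

/// Possible results of one pinger iteration.
enum PingOutcome {
    Rtt(Duration), // echo matched; report the round-trip time
    TimedOut,      // no echo within PING_TIMEOUT
    BadPayload,    // peer echoed bytes that don't match what we sent
}

/// Given what came back (if anything) and how long it took, classify the ping.
fn classify(sent: &[u8; 32], echoed: Option<&[u8]>, elapsed: Duration) -> PingOutcome {
    match echoed {
        None => PingOutcome::TimedOut,
        Some(_) if elapsed > PING_TIMEOUT => PingOutcome::TimedOut,
        Some(bytes) if bytes == &sent[..] => PingOutcome::Rtt(elapsed),
        Some(_) => PingOutcome::BadPayload,
    }
}

fn main() {
    let sent = [9u8; 32];
    // Matching echo within the timeout: a successful measurement.
    assert!(matches!(
        classify(&sent, Some(&sent[..]), Duration::from_millis(30)),
        PingOutcome::Rtt(_)
    ));
    // No echo at all: the peer is treated as unresponsive.
    assert!(matches!(classify(&sent, None, PING_TIMEOUT), PingOutcome::TimedOut));
    // Garbage echo: the old code would have counted this as success.
    assert!(matches!(
        classify(&sent, Some(&[0u8; 32][..]), Duration::from_millis(30)),
        PingOutcome::BadPayload
    ));
}
```

In the real event loop this classification would sit inside the infinite stream: sleep, send, await the echo with a timeout, classify, report, repeat.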

2. Inbound and Outbound Separation

The outbound side (pinger) and inbound side (responder) do different things and are handled separately.

The pinger is an infinite stream: sleep, send, wait, verify, report, repeat. The responder is a loop: receive a payload, echo it back, wait for the next one. Both live in a StreamMap keyed by peer ID, which is nice because inserting a new stream for a peer that already has one just replaces the old entry, automatically enforcing the spec’s “at most one outbound stream per peer” rule. The event loop polls transport events, responders, and pingers in a single `select!`. Nothing blocks on anything else.
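The "replacement enforces the one-stream rule" point follows from `StreamMap`'s insert semantics, which mirror `HashMap::insert`: inserting under an existing key returns (and thereby drops) the old entry. A small std-only sketch, with the hypothetical helper `register_pinger` standing in for the real insertion path:

```rust
use std::collections::HashMap;

/// Hypothetical helper mirroring what StreamMap::insert does for the pinger
/// side: registering a new outbound pinger for a peer replaces, and thereby
/// drops, any existing one, enforcing "at most one outbound stream per peer".
fn register_pinger<'a>(
    pingers: &mut HashMap<&'a str, u32>,
    peer: &'a str,
    stream_id: u32,
) -> Option<u32> {
    // HashMap::insert returns the previous value for the key;
    // tokio_stream::StreamMap has the same replace-on-insert semantics.
    pingers.insert(peer, stream_id)
}

fn main() {
    let mut pingers = HashMap::new();
    // First pinger for peer-a: nothing to replace.
    assert_eq!(register_pinger(&mut pingers, "peer-a", 1), None);
    // A fresh substream for the same peer evicts the stale pinger.
    assert_eq!(register_pinger(&mut pingers, "peer-a", 2), Some(1));
    assert_eq!(pingers.len(), 1);
}
```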

3. Error Recovery

When a ping fails (timeout, wrong payload, broken substream), the pinger stream for that peer gets removed and we open a fresh substream. The new substream is marked as a retry, which means the first ping on it waits one interval before firing. This prevents hammering a peer right after a failure.
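The retry pacing reduces to a tiny rule, sketched here under the assumption (mine, not stated in the PR) that a pinger on a brand-new connection fires immediately; `first_ping_delay` is a hypothetical name:

```rust
use std::time::Duration;

/// Ping interval from the PR.
const PING_INTERVAL: Duration = Duration::from_secs(5);

/// How long a freshly opened pinger waits before its first ping: a retry
/// after a failure waits one full interval to avoid hammering the peer,
/// while (assumed here) a pinger on a new connection fires right away.
fn first_ping_delay(is_retry: bool) -> Duration {
    if is_retry { PING_INTERVAL } else { Duration::ZERO }
}

fn main() {
    assert_eq!(first_ping_delay(true), Duration::from_secs(5));
    assert_eq!(first_ping_delay(false), Duration::ZERO);
}
```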

4. SubstreamKeepAlive

This is the piece that turned a ping fix into a connection lifecycle redesign.

The problem, again: litep2p’s keep-alive mechanism treated every open substream as a reason to keep the connection alive. So a connection that only had ping and identify running, with no actual application traffic, would never close. Multiply that by hundreds of peers and you’re leaking resources for connections nobody is using.

The fix: each protocol is now tagged at registration time as either “keep-alive: yes” (application protocols like notifications, request-response, Kademlia, Bitswap) or “keep-alive: no” (infrastructure protocols like ping and identify). Only “keep-alive: yes” substream activity resets the keep-alive timer and keeps the connection active.

The implementation uses Rust’s ownership model in a way I think is pretty clean. Connections hold either a strong reference (active, can’t be closed) or a weak reference (inactive, will close when nothing else holds it). When a “keep-alive: yes” substream opens, it gets a permit, which is a strong reference. When the substream closes, the permit drops, the strong reference goes away. If that was the last one, the connection can close. No timers to manage, no manual cleanup. The type system does the work.
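The strong/weak pattern is plain `Arc`/`Weak` from the standard library. Here is a minimal sketch of the idea, assuming (hypothetically) a connection represented by an `Arc`-held state object; the names `ConnectionState`, `Permit`, and `can_close` are illustrative, not litep2p's actual types:

```rust
use std::sync::{Arc, Weak};

/// Hypothetical stand-in for litep2p's per-connection state.
struct ConnectionState;

/// A keep-alive permit: holding it holds a strong reference to the connection.
struct Permit {
    _conn: Arc<ConnectionState>,
}

/// The closing logic only holds a Weak reference: once the last permit
/// (strong reference) drops, upgrade() fails and the connection may close.
fn can_close(conn: &Weak<ConnectionState>) -> bool {
    conn.upgrade().is_none()
}

fn main() {
    let state = Arc::new(ConnectionState);
    let weak = Arc::downgrade(&state);

    // A "keep-alive: yes" substream opens and takes the permit.
    let permit = Permit { _conn: state };
    assert!(!can_close(&weak));

    // The substream closes; the permit drops with it. Nothing else holds a
    // strong reference, so the connection is free to close. No timers, no
    // manual bookkeeping.
    drop(permit);
    assert!(can_close(&weak));
}
```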


Playing Nice With Older Nodes

Polkadot can’t upgrade every node at once. During rollout, new nodes need to interoperate with old ones that still use the one-shot ping behavior. This constraint drove several specific decisions.

Why 5 seconds? Old litep2p has a 10-second timeout on inbound ping substreams. If the first ping doesn’t arrive within 10 seconds of the substream opening, the old node kills the substream. By pinging every 5 seconds, we always make it under the wire.

The alternating failure pattern. When a new node pings an old node, the old node’s inbound handler echoes back the first ping, then does one more read to detect stream closure. That second read swallows the next ping payload without echoing it. Handler exits, nobody is reading the substream anymore. Our ping times out, the retry kicks in with a fresh substream, and the cycle repeats. Every other ping fails. That’s why the code notes that `max_failures` must be at least 1 until the network fully adopts this change.
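The alternating pattern is easy to see in a toy simulation. This is not the real code, just a model of the behavior described above: on each old-node substream, ping one is echoed, ping two is swallowed by the handler's closure-detection read, and the retry then opens a fresh substream:

```rust
/// Toy model of pinging an old node: returns, for each ping sent, whether it
/// succeeded. Ping 1 on a substream is echoed; ping 2 is swallowed by the old
/// inbound handler's extra read, fails, and the retry opens a fresh substream.
fn simulate_pings_to_old_node(pings_sent: u32) -> Vec<bool> {
    let mut results = Vec::new();
    let mut first_on_substream = true;
    for _ in 0..pings_sent {
        if first_on_substream {
            results.push(true); // echoed back normally
            first_on_substream = false;
        } else {
            results.push(false); // swallowed; handler exits, substream dead
            first_on_substream = true; // retry opens a fresh substream
        }
    }
    results
}

fn main() {
    // Every other ping fails, which is why `max_failures` must tolerate 1.
    assert_eq!(
        simulate_pings_to_old_node(6),
        vec![true, false, true, false, true, false]
    );
}
```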

Old-style inbound. When an old node pings us, it sends one zeroed payload and immediately closes the substream. Our responder handles this fine. It echoes the payload back, sees the stream close, and exits cleanly.

We tested all of this against litep2p-to-litep2p, litep2p-to-smoldot (the light client used in browser-based Polkadot access), and over WebRTC, which has its own substream lifecycle (FIN/FIN_ACK handshake).

What This Means for Polkadot

Dead peer detection actually works now. With pings every 5 seconds, a node notices a dead connection within about 25 seconds worst case (one ping interval + one 20-second timeout). Before, there was no detection at all. You’d find out when block announcements or GRANDPA messages stopped arriving. For validators, where GRANDPA needs 2/3+ participation for finality, detecting a lost peer quickly is the difference between a temporary blip and a finality stall.

Continuous RTT data. Every 5 seconds, every connected peer gets a latency measurement. This didn’t exist before. It could feed into peer selection (prefer lower-latency peers for time-sensitive work), though that’s not implemented yet.

Connections actually close when they should. The keep-alive classification means a connection that’s only being pinged will eventually time out and close. Before, it would stay open forever. In a network where nodes routinely connect to each other for one Kademlia query and then have nothing else to say, this matters for resource usage. Fewer zombie connections means more file descriptors and memory available for connections that are doing useful work.

Spec compliance. litep2p now conforms to the libp2p ping spec. This matters for interoperability with other libp2p implementations; smoldot is a concrete example we tested against. It also means any future implementation that follows the spec will interop correctly with litep2p’s ping, which wasn’t guaranteed before.

Wrapping Up

What looked like “just make ping periodic” ended up requiring me to rethink how litep2p decides whether a connection should stay open. The ping fix and the keep-alive redesign had to happen together.

The reason is subtle. Keep-alive in litep2p is triggered by *opening new substreams*, not by traffic on existing ones. In the ideal case, where current litep2p talks to current litep2p, ping opens one long-lived substream and reuses it, so after the initial open, it wouldn’t reset keep-alive at all. But during the transition period, when new nodes talk to old nodes, the alternating failure pattern (described above) means ping is constantly retrying and opening fresh substreams. Each retry resets the keep-alive timer, which means a connection doing nothing useful except failing and retrying pings would stay open indefinitely. That’s why ping had to be explicitly excluded from keep-alive. The issue is not the long-lived substream in the steady state, but the retry behavior during backwards-compatible operation.

As dmitry-markin pointed out, this classification system opens the door for further refinement. For example, eventually notification protocol payloads (not just substream opens) could also trigger keep-alive, giving a more accurate picture of whether a connection is genuinely active.

The final shape: an infinite-stream pattern for substream reuse, a StreamMap for per-peer state, a retry mechanism for transient failures, and a permit-based keep-alive system built on Rust’s ownership model. Roughly 30 commits, a lot of back-and-forth in review, and tested across three transport types and two different libp2p implementations.

I wrote the ping protocol changes: periodic pinging, substream reuse, payload randomization/verification, retry logic, and the stream-based architecture. Dmitry Markin contributed the SubstreamKeepAlive integration across TCP, WebSocket, and WebRTC transports.

Note: this is my first post on the Polkadot Forum, so I’m a bit constrained in how I can format this. The full article is available in this gist link
