Litep2p Network Backend Updates

In this post, we’ll review the latest updates to the litep2p network backend and compare its performance with libp2p. Feel free to navigate to any section of interest.

Section 1. Updates

We are pleased to announce the release of litep2p version 0.7, which brings significant new features, improvements, and fixes to the litep2p library. Highlights include enhanced error handling, configurable connection limits, and a new API for managing public addresses. For a comprehensive breakdown, please see the full litep2p release notes. This update is also integrated into Substrate via PR #5609.

Public Addresses API

A new PublicAddresses API has been introduced, enabling developers to manage the node’s public addresses. This API allows for adding, removing, and retrieving public addresses shared with peers through the Identify protocol. It aims to address or reduce long-standing connectivity issues in litep2p.
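
As a rough illustration, managing addresses might look like the sketch below. The handle and method names (public_addresses, add_address, get_addresses) are assumptions based on the release notes, not verified API.

    // Sketch: manage the addresses advertised through the Identify protocol.
    // `public_addresses()` and the method names below are assumptions.
    let public_addresses = litep2p.public_addresses();

    // Advertise an external address to peers.
    public_addresses.add_address("/ip4/198.51.100.7/tcp/30333".parse().expect("valid multiaddr"));

    // Inspect the addresses currently shared through Identify.
    for address in public_addresses.get_addresses() {
        println!("advertising {address}");
    }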

Enhanced Error Handling

The DialFailure event now includes a DialError enum for more granular error reporting when a dial attempt fails. Additionally, a ListDialFailures event has been added, which lists all dialed addresses and their corresponding errors in the case of multiple failures.

We’ve also focused on providing better error reporting for immediate dial failures and rejection reasons for request-response protocols. This marks a shift away from the general litep2p::error::Error enum, improving overall error management. For more details, see PR #206 and PR #227.
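
A hedged sketch of consuming these events follows; the event and variant names are illustrative assumptions rather than the exact litep2p enums.

    // Sketch: react to granular dial failures.
    match litep2p.next_event().await {
        Some(Litep2pEvent::DialFailure { address, error }) => match error {
            // Variant names are assumptions for illustration.
            DialError::Timeout => println!("dial timed out: {address}"),
            error => println!("dial to {address} failed: {error:?}"),
        },
        _ => {}
    }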

Configurable Connection Limits

The Connection Limits feature now lets developers control the number of inbound and outbound connections, helping optimize resource management and performance.
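
A minimal sketch of such a configuration, assuming a ConnectionLimitsConfig builder along these lines; consult the litep2p docs for the exact names.

    // Sketch: cap the node at 50 inbound and 50 outbound connections.
    // `ConnectionLimitsConfig` and its builder methods are assumed here.
    let limits = ConnectionLimitsConfig::default()
        .max_incoming_connections(Some(50))
        .max_outgoing_connections(Some(50));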

Feature Flags for Optional Transports

With Feature Flags, developers can now selectively enable or disable transport protocols. By default, only TCP is enabled, with the following optional transports available:

  • quic - Enables QUIC transport
  • websocket - Enables WebSocket transport
  • webrtc - Enables WebRTC transport

Configurable Keep-Alive Timeout

Developers can now configure the keep-alive timeout for connections, allowing more control over connection lifecycles. Example usage:


use std::time::Duration;

// Close idle connections after 30 seconds of inactivity.
let litep2p_config = Config::default()
    .with_keep_alive_timeout(Duration::from_secs(30));

Section 2. Performance Comparison

To gauge performance, we ran a side-by-side test with two Polkadot nodes — one using the litep2p backend and the other using libp2p — on the Kusama network. Both nodes were configured with the following CLI parameters: --chain kusama --pruning=1000 --in-peers 50 --out-peers 50 --sync=warp --detailed-log-output.

While network fluctuations and peer dynamics introduce some variability, this experiment offers an approximation of how the two network backends perform in real-world scenarios.

CPU Usage

One of litep2p’s key advantages is its lower CPU consumption: it averaged 0.203 CPU time compared to libp2p’s 0.568, making it roughly 2.8 times more resource-efficient.

Network Throughput

Litep2p handled 761 GiB of inbound traffic, while libp2p processed 828 GiB, giving libp2p an 8% edge in this category. However, litep2p outperformed libp2p in outbound traffic, handling 76.9 GiB versus libp2p’s 71.5 GiB, providing litep2p a 7% advantage for outbound requests.

Sync Peers

The chart below shows the number of peers each node connected with for sync purposes. Litep2p maintained more stable sync connections, whereas libp2p exhibited periodic disconnection spikes, which took longer to recover from. This may be due to litep2p’s increased network discovery via Kademlia queries.

Request Responses

Both backends completed a similar number of successful request responses, though libp2p held a slight edge in this area.

Litep2p encountered more outbound request errors, primarily due to substreams being closed before executing the request.

Preliminary CPU-constrained parachain testing resulted in worse performance for litep2p; for more details, see Issue #5035.

With recent improvements in error handling, we expect to address these issues in future releases.

Other Performance Metrics

  • Warp Sync Time

    The warp sync process completed in 526 seconds with litep2p, compared to 803 seconds with libp2p, a significant performance gain for litep2p. Warp sync time was measured using the sub-triage-logs tool; more details are available in PR #5609.

  • Kademlia Query Performance

    The Kademlia component facilitates network discoverability. In an experiment to benchmark network discoverability, litep2p located 500 peers (about 25% of the Kusama network) in 12-14 seconds, while libp2p completed the same task in 3-6 seconds.

    The experiment is still quite noisy, and we’ll take a closer look once we have a better benchmarking system. In the meantime, the subp2p-explorer tool was used for this experiment. The bench-cli tool can also spawn a local litep2p network to reproduce this experiment, providing additional opportunities for optimization.

A special thanks to Dmitry for his exceptional work on litep2p, @alexggh for testing litep2p from the parachain perspective, and @AndreiEres for his efforts in improving benchmarking systems to help drive further network optimizations :pray:


We’re excited to announce litep2p version 0.8.0, which introduces support for content provider advertisement and discovery in the Kademlia protocol, aligning with the libp2p spec. This enables nodes to publish and discover specific content providers on the network. Alongside this feature, the release brings notable improvements in stability, performance, and memory management.

For a full list of changes, refer to the litep2p changelog.

Content Provider Advertisement and Discovery

With this release, litep2p supports content provider advertisement and discovery over the Kademlia protocol: content providers can publish records to the network, and other nodes can locate and retrieve them with the GET_PROVIDERS query. This feature is crucial for storing parachain bootnodes in the relay chain DHT.

    // Start providing a record to the network.
    // This stores the record in the local provider store and starts advertising it to the network.
    kad_handle.start_providing(key.clone());

    // Wait for some condition to stop providing...

    // Stop providing a record to the network.
    // The record is removed from the local provider store and is no longer advertised.
    // Note that remote nodes drop the record only after its TTL expires.
    kad_handle.stop_providing(key.clone());

    // Retrieve providers for a record from the network.
    // This returns a query ID that later produces the result when polling the `Kademlia` instance.
    let query_id = kad_handle.get_providers(key.clone());
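
The returned query ID can then be matched against events produced while polling the Kademlia handle. The sketch below is illustrative only; the event name and its fields are assumptions, not the exact litep2p API.

    // Sketch: await the providers produced by the earlier `get_providers` call.
    // `KademliaEvent::GetProvidersSuccess` and its fields are assumptions here.
    while let Some(event) = kad_handle.next().await {
        if let KademliaEvent::GetProvidersSuccess { query_id: id, providers, .. } = event {
            if id == query_id {
                println!("found {} providers", providers.len());
                break;
            }
        }
    }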

Connection Stability

The release includes several improvements to enhance the stability of connections in the litep2p library:

  • Connection Downgrading: Inactive connections are now downgraded only after extended inactivity, reducing interruptions and improving long-term stability.

  • Enhanced Peer State Management: A refactored state machine with smoother transitions enhances the management of peer connections, preventing issues like state mismatches that could lead to rejected connections.

  • Address Store Improvements: Address tracking is now more precise, with a new eviction algorithm to manage unreachable addresses and better control memory usage.

Optimizations

  • Improved Dialing Logic: Dialing across TCP, WebSocket, and QUIC is now more resource-efficient, with canceled attempts terminating immediately to save resources.

  • Kademlia Data Handling: Kademlia messages are now handled more efficiently, with unnecessary data cloning replaced by reference-based retrieval.

  • Memory Leak Fixes: Addressed memory leaks across the TCP, WebSocket, and QUIC transports, especially for canceled connections. Stale pending operations in the ping and identify modules were also cleaned up. See the relevant PRs: #272, #271, #274, #273.

I want to extend my thanks to everyone who contributed to making this release possible. Special thanks to Dmitry @dmitry-markin for his outstanding work on implementing the content provider advertisement and discovery feature, and to Alex @alexggh for his dedicated testing efforts and for detecting high memory consumption. Thanks also to Andrei @sandreim for his valuable suggestions on investigating memory cloning, and to Andrei @AndreiEres for his ongoing commitment to enhancing benchmarking! :pray:


This v0.8.1 release includes key fixes that enhance the stability and performance of the litep2p library. The focus is on long-running stability and improvements to polling mechanisms.

For a full list of changes, refer to the litep2p changelog.

Long-Running Stability Improvements

Addressed a bug in the connection limits functionality that incorrectly tracked connections due for rejection. The bug caused an artificial increase in inbound peers, which were not being properly removed from the connection limit count, eventually leading long-running nodes to reject all incoming connections. The fix ensures more accurate tracking and management of peer connections (#286).

Polling Implementation Fixes

This release provides multiple fixes to the polling mechanism, improving how connections and events are processed:

  • Resolved an overflow issue in TransportContext’s polling index for streams, preventing potential crashes (#283).

  • Fixed a delay in the manager’s poll_next function that prevented immediate polling of newly added futures (#287).

  • Corrected an issue where the listener did not return Poll::Ready(None) when it was closed, ensuring proper signal handling (#285); see the sketch after this list.
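
For context, the sketch below illustrates the Stream contract this fix restores, using hypothetical types: a closed listener must yield Poll::Ready(None) so consumers observe end-of-stream instead of waiting forever.

    use std::pin::Pin;
    use std::task::{Context, Poll};
    use futures::Stream;

    // Hypothetical listener type; the fields are illustrative only.
    struct Listener {
        closed: bool,
    }

    impl Stream for Listener {
        // Placeholder item type; the real listener yields incoming connections.
        type Item = std::io::Result<()>;

        fn poll_next(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
            if self.closed {
                // A closed listener must report end-of-stream rather than
                // staying `Poll::Pending` forever.
                return Poll::Ready(None);
            }
            // ... otherwise poll the underlying socket for new connections.
            Poll::Pending
        }
    }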

Dashboards

This dashboard provides a comprehensive view of litep2p’s performance compared to libp2p. Here, litep2p demonstrates a remarkable speed advantage, handling notifications with various payload sizes 10x to 30x faster than libp2p.

The next dashboard highlights the improved connection stability of a long-running node with a high connection load (500 inbound and 500 outbound connections). Litep2p is represented by the green line, showcasing its stability, while libp2p is represented by the yellow line.

Finally, the CPU consumption dashboard reveals a significant reduction in CPU usage for litep2p, using half the CPU resources compared to libp2p. Here, litep2p is represented by the yellow line, and libp2p by the magenta line.

As always, thanks @dmitry-markin for in-depth reviews and suggestions, and @AndreiEres for implementing the dashboards that show a significant notification performance improvement (#6455)! :pray:


We are excited to announce that litep2p has been running for 14 days on all Parity-owned Kusama validators! This milestone marks a significant step towards making litep2p the default network backend. Deploying litep2p across multiple validators has enabled us to identify additional areas of improvement and address edge cases that previously went undetected with a single non-validator node or our testing stacks.

Ecosystem Validator Involvement

Our next phase involves gradually rolling out litep2p to ecosystem validators on Kusama. This transition will begin shortly after the next Polkadot 2412 stable release.

Further details will be shared in the Kusama Validators Room, so stay tuned for updates!

Releases

Since our last announcement, we’ve embraced a more rapid release cycle and are excited to share three new versions: v0.8.2, v0.8.3, and v0.8.4. Below is a summary of the improvements and fixes introduced in each release.

Release v0.8.2: Enhanced Security and Stability

This release ensures that the signature payload of the crypto/noise protocol is verified before processing. This critical security measure prevents potential attacks, such as impersonation of peer IDs.

The release also fixes a debug_assert! condition that made an incorrect assumption when connections were rapidly opened and closed.

  • req-resp: Fix panic on connection closed for substream open failure (#291)

  • crypto/noise: Verify crypto/noise signature payload (#278)

  • transport_service/logs: Provide less details for trace logs (#292)

Crypto/Noise Protocol

The crypto/noise protocol plays two roles: encrypting traffic and authenticating the peer’s identity during communication.

Encryption is handled by a set of Diffie-Hellman keys (public PkDH and secret SkDH) generated for each connection. These keys are distinct from the node’s PeerID keys (Pk and Sk), which identify the node in the network.

Authentication is achieved by verifying the signature of the payload sent by the peer: Alice signs a message containing the PkDH with her Sk (private PeerID key) and sends it to Bob. Bob verifies the signature using Alice’s Pk (public PeerID key) against the same message containing the PkDH, proving Alice’s ownership of the PeerID keys.
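
The following sketch illustrates this authentication step with the ed25519-dalek crate. The "noise-libp2p-static-key:" payload prefix follows the libp2p noise spec; the types here are stand-ins for litep2p’s internals, not its actual API.

    use ed25519_dalek::{Signature, Signer, SigningKey, Verifier};
    use rand::rngs::OsRng;

    // Alice's long-term PeerID identity key pair (Sk/Pk in the text above).
    let identity_key = SigningKey::generate(&mut OsRng);

    // Alice's per-connection Diffie-Hellman static public key (PkDH).
    // A placeholder here; in practice this is an X25519 public key.
    let dh_public_key = [0u8; 32];

    // Alice signs a payload binding PkDH to her identity key.
    let mut payload = b"noise-libp2p-static-key:".to_vec();
    payload.extend_from_slice(&dh_public_key);
    let signature: Signature = identity_key.sign(&payload);

    // Bob reconstructs the same payload and verifies the signature with
    // Alice's public PeerID key (Pk), proving she owns the PeerID.
    identity_key
        .verifying_key()
        .verify(&payload, &signature)
        .expect("signature payload must verify");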

Release v0.8.3: Fixing Memory Leaks

This release resolves subtle memory leaks in the Notification and RequestResponse protocols, both caused by improper handling of substream IDs for closed connections.

These issues were identified thanks to insights from the Litep2p metrics implementation (PR #294). We plan to integrate metrics to monitor the internal state of protocols and expose this data in future releases:

  • req-resp: Fix memory leak of pending substreams (#297)

  • notification: Fix memory leak of pending substreams (#296)

Release v0.8.4: Improving Resilience of mDNS

We addressed an issue where one of our five validators malfunctioned due to an mDNS component failure. The component was not resilient to failures when submitting an mDNS query on the multicast address.

This release fixes that and also improves the Identify protocol by reducing delays in processing outbound events.

  • mdns/fix: Failed to register opened substream (#301)

  • identify: Replace FuturesUnordered with FuturesStream (#302)
