Beyond the Broadcast: Simulating P2P Gossip and Visibility Lag

In my first post, I showed how I built a minimal Consensus Lab to strip away all the Substrate macros and actually see the protocol’s DNA (SCALE + Blake3) in action.

But I’ll be honest: the first version had a pretty big lie built in. My “network” was just a global broadcast. One node authored a block and every other validator saw it in the next slot. That was fine for a quick demo, but it’s completely unrealistic.

In a real Polkadot-like network, blocks propagate in hops through a neighbor mesh. So, moving on, I killed the perfect broadcast and added a real P2P gossip topology.

The Problem with Infinite Speed

The old broadcast model hid the two most interesting problems in consensus: network diameter and visibility lag. Two validators can both be online with high stake, but if they’re not direct peers, one might still be completely blind to the other’s latest block when the next slot starts.

That’s exactly when forks get born and stay alive longer.

The Refactor: Real Neighbor Mesh + Flood Control

I rewrote the NetworkSimulator to use a directed mesh. Each node now only talks to its configured neighbors.

// Only gossip to immediate peers; the message becomes visible at `arrival`.
pub fn gossip_send(&mut self, sender_id: &str, msg: Message, slot: Slot) {
    let arrival = slot + self.hop_latency;   // I set this to 1 for now
    if let Some(peers) = self.neighbors.get(sender_id) {
        for peer in peers {
            // Skip peers that have no ingress queue instead of panicking on unwrap().
            if let Some(queue) = self.nodes.get_mut(peer) {
                queue.push_back((arrival, msg.clone()));
            }
        }
    }
}

Of course, as soon as I did that, I created infinite gossip loops (A → B → A → B…). So I had to add proper flood control inside the node:

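// Returns true only the first time a block is seen, so the caller can skip
// re-gossiping duplicates.
pub fn import_block(&mut self, block: Block) -> bool {
    let hash = block.hash();
    if self.seen_blocks.contains(&hash) {
        return false; // flood protection: already imported
    }
    self.seen_blocks.insert(hash);
    self.blocks.insert(hash, block);
    true
}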
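
To see why the seen-set is enough to stop the A → B → A loop, here is a minimal, self-contained sketch. It is not the lab's code: `flood`, `line_topology`, and the `u64` stand-in for a block hash are all hypothetical names made up for this illustration. Each node re-gossips a hash only the first time it sees it, and the function counts deliveries until the flood dies out.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// Hypothetical stand-in for a block hash; not the lab's real hash type.
type Hash = u64;
type Mesh = HashMap<&'static str, Vec<&'static str>>;

/// The line topology node_0 <-> node_1 <-> node_2 as an adjacency map.
fn line_topology() -> Mesh {
    HashMap::from([
        ("node_0", vec!["node_1"]),
        ("node_1", vec!["node_0", "node_2"]),
        ("node_2", vec!["node_1"]),
    ])
}

/// Floods one block from `origin`. A node relays only on first sight of the hash.
/// Returns the total number of deliveries made before the flood stops.
fn flood(mesh: &Mesh, origin: &'static str, hash: Hash) -> usize {
    let mut seen: HashMap<&str, HashSet<Hash>> = HashMap::new();
    let mut to_relay: VecDeque<&str> = VecDeque::new();
    let mut deliveries = 0;

    // The author has trivially "seen" its own block.
    seen.entry(origin).or_default().insert(hash);
    to_relay.push_back(origin);

    while let Some(sender) = to_relay.pop_front() {
        if let Some(peers) = mesh.get(sender) {
            for &peer in peers {
                deliveries += 1;
                // HashSet::insert returns false for a duplicate, so a node
                // that already has the block never relays it a second time.
                if seen.entry(peer).or_default().insert(hash) {
                    to_relay.push_back(peer);
                }
            }
        }
    }
    deliveries
}

fn main() {
    let deliveries = flood(&line_topology(), "node_0", 42);
    println!("deliveries before the flood died out: {deliveries}");
}
```

Without the `insert` check the `while` loop would never terminate on this mesh. With it, the number of deliveries is bounded by the number of directed links, because every node relays at most once per block.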

The “Time Engine”: Discrete-Event Simulation

One thing people often miss when building simulators is how to handle time. We don’t use std::thread::sleep. Instead, we use a discrete-event queue where every message carries an “Arrival Slot.” A node only sees the message once the simulation clock reaches that slot.

// The Time Engine: Draining the ingress queue based on current_slot
pub fn poll_ingress(&mut self, node_id: &str, current_slot: Slot) -> Vec<Message> {
    let mut messages = Vec::new();
    if let Some(queue) = self.nodes.get_mut(node_id) {
        while let Some((arrival_slot, _)) = queue.front() {
            if *arrival_slot <= current_slot {
                let (_, msg) = queue.pop_front().expect("Queue front must exist");
                messages.push(msg);
            } else {
                break; // Message hasn't "arrived" yet in physical time
            }
        }
    }
    messages
}
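
One caveat worth spelling out: checking only the front of the queue works because every message gets the same hop latency, so arrival slots are already in order. If latency ever varied per link, a FIFO queue could hide an early message behind a late one. The sketch below shows one alternative, a min-heap keyed on arrival slot. It is a different technique from the queue above, and `Ingress`, `push`, and `poll` are hypothetical names, with messages simplified to strings.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

type Slot = u64;

/// Hypothetical ingress queue ordered by arrival slot, not insertion order.
#[derive(Default)]
struct Ingress {
    // Reverse turns the max-heap into a min-heap on (arrival, sequence).
    heap: BinaryHeap<Reverse<(Slot, u64, String)>>,
    // Tie-breaker so messages with the same arrival slot keep their send order.
    next_seq: u64,
}

impl Ingress {
    fn push(&mut self, arrival: Slot, msg: &str) {
        self.heap.push(Reverse((arrival, self.next_seq, msg.to_string())));
        self.next_seq += 1;
    }

    /// Drains every message whose arrival slot is <= `current_slot`.
    fn poll(&mut self, current_slot: Slot) -> Vec<String> {
        let mut delivered = Vec::new();
        while let Some(Reverse((arrival, _, _))) = self.heap.peek() {
            if *arrival > current_slot {
                break; // the earliest pending message is still in flight
            }
            if let Some(Reverse((_, _, msg))) = self.heap.pop() {
                delivered.push(msg);
            }
        }
        delivered
    }
}

/// Small scenario: a slow message is queued before two fast ones.
fn demo_queue() -> Ingress {
    let mut ingress = Ingress::default();
    ingress.push(5, "slow");
    ingress.push(2, "fast_a");
    ingress.push(2, "fast_b");
    ingress
}

fn main() {
    let mut ingress = demo_queue();
    for slot in 1..=5 {
        println!("slot {slot}: {:?}", ingress.poll(slot));
    }
}
```

With a constant `hop_latency` the original `VecDeque` is simpler and behaves the same; the heap only starts to matter once links have different delays.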

The “Line” Experiment (node_0 ↔ node_1 ↔ node_2)

To test the new layer I set up a simple line topology in the simulation:

// In main.rs
let mut simulator = NetworkSimulator::new(1); // hop_latency = 1 (Rust has no named arguments)
simulator.add_neighbor("node_0", "node_1");
simulator.add_neighbor("node_1", "node_0");
simulator.add_neighbor("node_1", "node_2");
simulator.add_neighbor("node_2", "node_1");

Then I ran the same 20-slot test as before.

Here’s what the logs looked like:

[INFO] ---------------- Slot 1 ----------------
[INFO] [node_0] ⚡ Authored block at height 1 (hash: d30c23..b9a2)

[INFO] ---------------- Slot 2 ----------------
[INFO] [node_1] ⚡ Authored block at height 2 (hash: 6518e1..1d2f)
[INFO] [node_2] ⚡ Authored block at height 1 (hash: 5f672c..c389)

[INFO] ---------------- Slot 4 ----------------
[INFO] [node_0] ⚡ Authored block at height 3 (hash: 9d5256..54ad)
[INFO] [node_2] ⚡ Authored block at height 3 (hash: 4c2d21..2fe7)

... (collisions kept happening)

[INFO] Simulation complete.
[INFO] Node node_0 canonical head: 4638fd..916f (Blocks discovered: 24)
[INFO] Node node_1 canonical head: a4f919..9797 (Blocks discovered: 25)
[INFO] Node node_2 canonical head: b79b53..56ef (Blocks discovered: 25)

The key discovery: At the end of the run, node_0 was still missing one block that node_2 had authored (the one at height 16 with hash b79b53..56ef). Because of the two-hop distance, that block simply hadn’t reached node_0 yet when the simulation ended.

That’s visibility lag in action. In the old global-broadcast version this never happened. Now it’s obvious why forks can survive much longer in real networks.

Why This Changed My Mental Model

Networking isn’t just “plumbing”; networking is consensus.

You can have perfect slot leadership and perfect longest-chain rules, but if your visibility lag is bigger than your slot time, you’re basically guaranteed high fork density. The P2P layer decides how fast the network converges, not just the consensus logic.
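
To put a number on “visibility lag”, here is a rough sketch that measures hop distance with a breadth-first search and multiplies it by the hop latency. It assumes a relay forwards a block in the same slot it receives it, which may not match the lab exactly, and `hop_distance`, `visibility_lag`, and `line_topology` are hypothetical helpers written for this example.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

type Slot = u64;
type Mesh = HashMap<&'static str, Vec<&'static str>>;

/// The line topology node_0 <-> node_1 <-> node_2 as an adjacency map.
fn line_topology() -> Mesh {
    HashMap::from([
        ("node_0", vec!["node_1"]),
        ("node_1", vec!["node_0", "node_2"]),
        ("node_2", vec!["node_1"]),
    ])
}

/// Breadth-first search for the fewest hops from `from` to `to`.
fn hop_distance(mesh: &Mesh, from: &'static str, to: &str) -> Option<u64> {
    let mut visited = HashSet::from([from]);
    let mut frontier = VecDeque::from([(from, 0u64)]);
    while let Some((node, hops)) = frontier.pop_front() {
        if node == to {
            return Some(hops);
        }
        if let Some(peers) = mesh.get(node) {
            for &peer in peers {
                if visited.insert(peer) {
                    frontier.push_back((peer, hops + 1));
                }
            }
        }
    }
    None // no path: the block never becomes visible
}

/// Slots until a block authored by `from` is visible to `to`.
/// Assumption: every relay forwards in the slot it receives the block.
fn visibility_lag(mesh: &Mesh, from: &'static str, to: &str, hop_latency: Slot) -> Option<Slot> {
    hop_distance(mesh, from, to).map(|hops| hops * hop_latency)
}

fn main() {
    let mesh = line_topology();
    match visibility_lag(&mesh, "node_0", "node_2", 1) {
        // A lag above one slot means node_2 can author without seeing node_0's block.
        Some(lag) if lag > 1 => println!("lag of {lag} slots exceeds the slot time"),
        Some(lag) => println!("lag of {lag} slot(s) fits inside the slot time"),
        None => println!("node_2 is unreachable from node_0"),
    }
}
```

On the line topology the two end nodes are two hops apart, so under these assumptions each of them authors at least one slot “blind” to the other, which is the situation the logs above show.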

The Partition Paradox: Inefficiency in Probabilistic Finality

Now that the gossip mesh was working, I moved to the really fun (and painful) part: network partitions. I wanted to literally cut the link between node_1 and node_2 mid-simulation and watch the state roots diverge until the chain breaks.

I severed the connection between node_1 and node_2 from slot 5 to slot 15. During this partition, node_0 and node_1 formed the majority side, while node_2 operated in isolation. Both sides continued to produce valid blocks.
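
One simple way to model a cut like this is a predicate that the simulator consults before it queues a message. The sketch below is only an illustration under that assumption; `Partition` and `blocks_link` are hypothetical names, and the window is treated as half-open so the link is back at slot 15.

```rust
type Slot = u64;

/// Hypothetical description of a severed link, active for slots in [from, until).
struct Partition {
    a: &'static str,
    b: &'static str,
    from: Slot,
    until: Slot,
}

impl Partition {
    /// True if a message from `sender` to `receiver` must be dropped at `slot`.
    fn blocks_link(&self, sender: &str, receiver: &str, slot: Slot) -> bool {
        // The cut is symmetric: both directions of the link are severed.
        let same_link = (sender == self.a && receiver == self.b)
            || (sender == self.b && receiver == self.a);
        same_link && slot >= self.from && slot < self.until
    }
}

/// The cut described above: node_1 <-> node_2, from slot 5 until the heal at slot 15.
fn experiment_partition() -> Partition {
    Partition { a: "node_1", b: "node_2", from: 5, until: 15 }
}

fn main() {
    let cut = experiment_partition();
    for slot in [4, 5, 14, 15] {
        let dropped = cut.blocks_link("node_1", "node_2", slot);
        println!("slot {slot}: node_1 -> node_2 dropped = {dropped}");
    }
}
```

A check like this would sit inside the gossip loop, right before the message is pushed onto the peer's queue, so the neighbor map itself never has to change mid-run.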

The “paradox” manifests upon network healing. When the connection was restored at slot 15, the import_block routine triggered a chain reorganization. Because node_0 and node_1 had produced more blocks between them and so held the longer chain, node_2’s isolated chain segment was entirely discarded.
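
Here is a small sketch of what that heal looks like under a plain longest-chain rule. It is a simplified model, not the lab's fork choice: blocks are reduced to a parent pointer, hashes are small integers, ties keep the current head (an assumption), and `height`, `best_head`, `reorg_depth`, and `healed_partition` are hypothetical names.

```rust
use std::collections::{HashMap, HashSet};

type Hash = u32; // hypothetical short id instead of a real block hash

struct Block {
    parent: Option<Hash>,
}

/// Head-first list of hashes from `head` back to genesis.
fn chain(blocks: &HashMap<Hash, Block>, head: Hash) -> Vec<Hash> {
    let mut out = vec![head];
    let mut cursor = head;
    while let Some(parent) = blocks.get(&cursor).and_then(|b| b.parent) {
        out.push(parent);
        cursor = parent;
    }
    out
}

/// Recursive height: genesis (no parent) is 0, every child adds 1.
fn height(blocks: &HashMap<Hash, Block>, hash: Hash) -> u64 {
    match blocks.get(&hash).and_then(|b| b.parent) {
        Some(parent) => 1 + height(blocks, parent),
        None => 0,
    }
}

/// Longest-chain rule; on a tie the node keeps its current head (assumption).
fn best_head(blocks: &HashMap<Hash, Block>, current: Hash, candidate: Hash) -> Hash {
    if height(blocks, candidate) > height(blocks, current) { candidate } else { current }
}

/// Number of blocks dropped from the old chain when switching to `new_head`.
fn reorg_depth(blocks: &HashMap<Hash, Block>, old_head: Hash, new_head: Hash) -> usize {
    let kept: HashSet<Hash> = chain(blocks, new_head).into_iter().collect();
    chain(blocks, old_head).into_iter().take_while(|h| !kept.contains(h)).count()
}

/// Toy block tree: shared prefix 0 <- 1 <- 2, majority branch 10..12, isolated branch 20..21.
fn healed_partition() -> HashMap<Hash, Block> {
    let links: [(Hash, Option<Hash>); 8] = [
        (0, None), (1, Some(0)), (2, Some(1)),
        (10, Some(2)), (11, Some(10)), (12, Some(11)),
        (20, Some(2)), (21, Some(20)),
    ];
    links.iter().map(|&(hash, parent)| (hash, Block { parent })).collect()
}

fn main() {
    let blocks = healed_partition();
    let new_head = best_head(&blocks, 21, 12);
    println!("isolated node switches to head {new_head}");
    println!("blocks discarded by the re-org: {}", reorg_depth(&blocks, 21, new_head));
}
```

Every block on the losing branch after the common ancestor is thrown away, and nothing in a pure longest-chain rule limits how far back that common ancestor can be.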

Here are the formal metrics from the simulation run:

========================================================
  SUBSTRATE CONSENSUS LAB: RESEARCH REPORT 
========================================================
MODEL DEFINITION:
- Slots Simulated:   20
- Validator Nodes:   3
- Model Type:        Probabilistic BABE-lite
- Fork Choice:       Recursive Longest-Chain

QUANTIFIED OBSERVATIONS:
- Total Blocks Authored:   25
- Max Chain Height:        15
- Slot Collisions (Forks): 8

PROTOCOL IMPLICATIONS:
- Chain Inefficiency:      66.67% (wasted work)
- Fork Density:            0.40 forks/slot
- State Divergence:        2 nodes at max height
========================================================

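The system produced 25 blocks over 20 slots, but the canonical chain only reached a height of 15, so 10 blocks ended up off the canonical chain. The reported Chain Inefficiency of 66.67% matches those 10 discarded blocks measured against the 15 canonical ones (10/15); as a share of all 25 authored blocks, the waste is 40%. That work was thrown away in the re-orgs caused by visibility lag and the partition heal.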
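
The numbers in the report can be reproduced with a few lines of arithmetic. The formulas below are inferred from the printed values rather than taken from the lab's source, and they assume that a chain of height 15 contains exactly 15 authored blocks. `RunStats` and its methods are hypothetical names.

```rust
/// Raw counters as printed in the report above.
struct RunStats {
    slots: u32,
    blocks_authored: u32,
    max_height: u32,
    collisions: u32,
}

impl RunStats {
    /// Blocks that did not end up on the canonical chain.
    fn wasted_blocks(&self) -> u32 {
        self.blocks_authored.saturating_sub(self.max_height)
    }

    /// Wasted blocks relative to the canonical chain (inferred "Chain Inefficiency").
    /// Not meaningful for an empty chain (max_height == 0).
    fn overhead_pct(&self) -> f64 {
        100.0 * f64::from(self.wasted_blocks()) / f64::from(self.max_height)
    }

    /// Wasted blocks as a share of everything that was authored.
    fn wasted_share_pct(&self) -> f64 {
        100.0 * f64::from(self.wasted_blocks()) / f64::from(self.blocks_authored)
    }

    /// Slot collisions per simulated slot.
    fn fork_density(&self) -> f64 {
        f64::from(self.collisions) / f64::from(self.slots)
    }
}

/// The values from the research report.
fn reported_run() -> RunStats {
    RunStats { slots: 20, blocks_authored: 25, max_height: 15, collisions: 8 }
}

fn main() {
    let run = reported_run();
    println!("wasted blocks:            {}", run.wasted_blocks());
    println!("overhead vs. canonical:   {:.2}%", run.overhead_pct());
    println!("share of authored blocks: {:.2}%", run.wasted_share_pct());
    println!("fork density:             {:.2} forks/slot", run.fork_density());
}
```

Which denominator is “right” depends on the question: the overhead figure says how much extra work the network did per useful block, while the share figure says how much of the total effort was lost.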

Implications for sc-consensus-babe:
Substrate’s BABE behaves similarly, but it is paired with GRANDPA, which finalizes blocks. Without GRANDPA, BABE will eagerly re-org deep into history upon reconnecting to a longer, heavier chain. The simulation highlights exactly why a finality gadget is required: it caps the depth of potential re-orgs caused by healed partitions, converting probabilistic assumptions into deterministic settlement.

What’s Next?

If you’re still just spinning up the node-template and calling it a day, you’re missing where the real drama of protocol engineering happens.

The full codebase is open for inspection here: Kanasjnr/substrate-consensus-lab on GitHub

Happy forking! :microscope: