The Double-Edged Sword of Finality: Simulating Runtime Upgrade Failures in Substrate

Runtime upgrade bugs are consensus’s kryptonite. If a state migration bug hits 2/3+ of validators, GRANDPA finality locks in the corruption permanently, making it mathematically irreversible. I ran two 3-validator experiments to demonstrate it: minority corruption gets quarantined, but majority corruption seals the wrong state forever. This is why Substrate relies on governance delays, staging testnets, and try-runtime validation before mainnet deployment.


The Story So Far

In Part 5, I proved that state root verification locks down the world state. Nonces prevent replay, deterministic execution prevents non-determinism, and signature checks prevent forgery. Even a malicious block author can’t sneak in false data once you commit to a state root.

Today, I’m tackling Substrate’s forkless runtime upgrade system: the ultimate test of these consensus guarantees.

The nightmare scenario: What happens if a state migration contains a bug? Does consensus catch it, or does finality lock it in forever?

Let’s find out.


The Setup – Forkless Upgrades & State Migrations

How Substrate Runtime Upgrades Work

In Substrate, the runtime is stored on-chain as a WASM blob. To upgrade it, governance dispatches an extrinsic that replaces the blob. When the state structure changes between versions, developers write migrations that run inside the State Transition Function (STF) at the start of the first block executed under the new code, before any extrinsics in that block are applied.

Adding the Upgrade Extrinsic

I started by extending the extrinsic enum in types.rs:

#[derive(Debug, Clone, Encode, Decode, TypeInfo, Serialize, Deserialize, PartialEq, Eq)]
pub enum Extrinsic {
    Transfer { from: String, to: String, amount: u64, nonce: u64, fee: u64 },
    SetState { key: Vec<u8>, value: Vec<u8> },
    UpgradeRuntime { version: u32 },  // ← NEW
}

The Migration Script (With Bug Injection)

Inside the STF in runtime.rs, I implemented two execution paths, one correct and one deliberately buggy:

Extrinsic::UpgradeRuntime { version } => {
    log::info!("Upgrading runtime to version {}", version);
    
    if !self.inject_migration_bug {
        // ✓ CORRECT: Update version, give balance bonus to all accounts
        self.state.0.insert(b"RUNTIME_VERSION".to_vec(), bincode_serialize(version as u64));
        
        let mut updates = Vec::new();
        for (k, v) in &self.state.0 {
            if k.starts_with(b"balance:") {
                if let Some(balance) = bincode_deserialize(v) {
                    updates.push((k.clone(), balance + 1000));
                }
            }
        }
        for (k, new_balance) in updates {
            self.set_write_balance(&k, new_balance);
        }
    } else {
        // ✗ BUGGY: Wrong version, missing balance updates
        log::warn!("MIGRATION BUG INJECTED! Executing invalid state transition.");
        self.state.0.insert(b"RUNTIME_VERSION".to_vec(), bincode_serialize(9999 as u64));
    }
}

The buggy migration:

  • Sets an invalid version (9999)
  • Skips the balance bonus entirely
  • Produces a divergent state root
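
To make the third bullet concrete, here is a minimal sketch of how the two migration paths end up with different roots. It is not the lab's code: `toy_state_root`, `migrate_correct`, `migrate_buggy`, and `genesis` are hypothetical names, balances are stored as plain `u64` instead of bincode-encoded bytes, and the standard library's non-cryptographic `DefaultHasher` stands in for a real Merkle trie root.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

// Assumption: values are plain u64 here; the post's state stores encoded bytes.
type State = BTreeMap<Vec<u8>, u64>;

/// Toy "state root": hash every key/value pair in sorted key order.
/// DefaultHasher is NOT cryptographic; it only stands in for a trie root.
fn toy_state_root(state: &State) -> u64 {
    let mut hasher = DefaultHasher::new();
    for (key, value) in state {
        key.hash(&mut hasher);
        value.hash(&mut hasher);
    }
    hasher.finish()
}

/// Correct path: add the 1000 bonus to every balance, then set the version.
fn migrate_correct(state: &mut State, version: u64) {
    for (key, value) in state.iter_mut() {
        if key.starts_with(b"balance:") {
            *value += 1000;
        }
    }
    state.insert(b"RUNTIME_VERSION".to_vec(), version);
}

/// Buggy path: wrong version, balance bonus skipped.
fn migrate_buggy(state: &mut State) {
    state.insert(b"RUNTIME_VERSION".to_vec(), 9999);
}

/// Hypothetical starting state shared by every node before the upgrade.
fn genesis() -> State {
    let mut state = State::new();
    state.insert(b"balance:alice".to_vec(), 5000);
    state.insert(b"balance:bob".to_vec(), 2500);
    state
}
```

Two nodes that start from `genesis()` agree on the root until one calls `migrate_correct(&mut state, 2)` and the other calls `migrate_buggy(&mut state)`. After that, every key they disagree on feeds into the hash, so the roots no longer match.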

The Experiment Loop

My main loop (in main.rs) does this:

  1. Inject the bug into the targeted validators
  2. Broadcast the upgrade extrinsic at Slot 5
  3. Validate incoming blocks and track rejections
  4. Monitor GRANDPA finalization

for slot in 1..=total_slots {
    // At slot 5, governance proposal executes
    if slot == 5 {
        let ext = Extrinsic::UpgradeRuntime { version: 2 };
        log::warn!("[Slot 5] >>> UpgradeRuntime {{ version: 2 }} broadcast from node_0");
        nodes[0].tx_pool.submit(ext.clone()).ok();
        net.gossip_send("node_0", Message::Extrinsic(ext), slot);
    }

    for (i, node) in nodes.iter_mut().enumerate() {
        let messages = net.poll_ingress(&node.id, slot);
        for msg in messages {
            match msg {
                Message::Block(b) => {
                    if let Some(reorg_depth) = node.import_block(b.clone()) {
                        metrics.record_reorg(reorg_depth);
                    } else {
                        // Block rejected due to state root mismatch
                        metrics.record_invalid_block_rejected(&node.id);
                        if !divergence_recorded {
                            metrics.record_divergence_at(slot);
                            log::warn!(
                                "[Slot {}] STATE DIVERGENCE — migration produced different state roots!",
                                slot
                            );
                            // Set the flag so the divergence is recorded only once.
                            divergence_recorded = true;
                        }
                    }
                }
                // ...
            }
        }

        // Detect if a buggy node finalizes corrupted state
        if buggy_node_indices.contains(&i) 
            && node.finalized_height > prev_finalized[i] 
            && node.finalized_height >= 4 {
            log::error!(
                "[Slot {}] *** CORRUPT STATE FINALIZED by {} at height {} — GRANDPA sealed wrong world state! ***",
                slot, node.id, node.finalized_height
            );
        }
    }
}

Network Topology

I used a simple line topology for clarity:

node_0 <-> node_1 <-> node_2

Experiment E – The Minority Bug (1/3 Validators Affected)

Scenario: Inject the migration bug into only node_2 before triggering the upgrade.

What the Logs Showed

[Slot 5] >>> UpgradeRuntime { version: 2 } broadcast from node_0
[node_0] Upgrading runtime to version 2
[node_1] Upgrading runtime to version 2
[node_2] Upgrading runtime to version 2
[node_2] MIGRATION BUG INJECTED! Executing invalid state transition.

[Slot 8] STATE DIVERGENCE CONFIRMED migration produced different state roots!
[node_1] INVALID BLOCK STATE ROOT: d722f6..b67e
[node_2] INVALID BLOCK STATE ROOT: 63a8b1..6bdc

The Quarantine Mechanism

At Slot 5, all nodes execute the upgrade:

  • node_0 & node_1 → correct migration → $R_{\text{honest}}$
  • node_2 → buggy migration → $R_{\text{corrupt}}$

Immediately after:

  • node_1 and node_0 reject node_2’s blocks: INVALID BLOCK STATE ROOT
  • node_2 rejects honest blocks for the same reason

Finality Verdict:

  • node_0 [HONEST]: finalized height 13
  • node_1 [HONEST]: finalized height 11
  • node_2 [BUGGY]: finalized height 0 (isolated: it applies the corrupt state locally, but honest nodes reject every block it produces)

Why node_2 stays at 0: node_2’s state root is corrupt, so every block it builds carries the wrong root. node_0 and node_1 refuse to import those blocks. node_2, in turn, rejects honest blocks because their state roots don’t match its own. It’s mutual incompatibility: the minority validator is mathematically exiled.
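
The mutual rejection comes down to a single comparison at import time. The sketch below assumes each node re-executes an incoming block and compares the resulting root with the one declared in the block. `ImportError` and `check_state_root` are hypothetical names, not the lab's `import_block` API, and roots are shortened to `u64` for readability.

```rust
/// Why a block was refused at import time.
#[derive(Debug, PartialEq)]
enum ImportError {
    StateRootMismatch { declared: u64, computed: u64 },
}

/// Compare the root declared by the block author with the root this node
/// computed by re-executing the block through its own STF.
fn check_state_root(declared: u64, computed: u64) -> Result<(), ImportError> {
    if declared == computed {
        Ok(())
    } else {
        Err(ImportError::StateRootMismatch { declared, computed })
    }
}
```

The check is symmetric: an honest node computing the honest root rejects a block that declares the corrupt root, and the buggy node computing the corrupt root rejects a block that declares the honest one. Neither side can tell from the comparison alone which root is the "right" one.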

Safety Analysis

| Metric | Value |
|---|---|
| Honest nodes | 2 |
| Supermajority threshold | 2 (≥2/3) |
| Corrupt state finalized? | NO |
| Outcome | Network self-heals |

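Key Insight: When the bug affects fewer validators than the supermajority needs (here 1 of 3), the Byzantine fault tolerance threshold does its job. The honest nodes still hold the 2 votes required for finality, so they automatically quarantine the buggy node and continue consensus on the clean chain.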
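
The threshold arithmetic behind the table can be sketched in a few lines. The first function encodes the "≥2/3" rule the table reports for this 3-node simulation. The second is an assumption added for comparison: the classic BFT bound (n ≥ 3f + 1), which requires strictly more than 2/3 of the votes and therefore would not let 2 of 3 validators finalize. Both function names are hypothetical.

```rust
/// Simplified rule matching the table above: votes >= 2/3 of validators.
/// Integer arithmetic avoids floating-point rounding.
fn meets_simplified_threshold(votes: u64, validators: u64) -> bool {
    3 * votes >= 2 * validators
}

/// Classic BFT rule (assumption for comparison): strictly more than 2/3.
fn meets_strict_threshold(votes: u64, validators: u64) -> bool {
    3 * votes > 2 * validators
}
```

Under the simplified rule, 2 of 3 votes pass and 1 of 3 does not, which is the outcome shown above. Under the strict rule, a 3-validator set tolerates no faulty validator at all; 4 validators are the smallest set that tolerates one.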


Experiment F – The Majority Bug (2/3 Validators Affected)

Scenario: Inject the bug into node_0 and node_1. Only node_2 runs the honest migration.

This is the disaster scenario.

What the Logs Showed

[Slot 5] >>> UpgradeRuntime { version: 2 } broadcast from node_0
[node_0] MIGRATION BUG INJECTED! Executing invalid state transition.
[node_1] MIGRATION BUG INJECTED! Executing invalid state transition.
[node_2] Upgrading runtime to version 2

[Slot 7] *** CORRUPT STATE FINALIZED by node_0 at height 4 — GRANDPA sealed wrong world state! ***
[Slot 10] *** CORRUPT STATE FINALIZED by node_1 at height 4 — GRANDPA sealed wrong world state! ***

The Trap Closes

At Slot 5:

  • node_0 & node_1 execute buggy migration → $R_{\text{corrupt}}$
  • node_2 executes correct migration → $R_{\text{honest}}$

node_0 and node_1 hold 2 of the 3 votes, which is the supermajority. They aggregate precommits and reach finality on the corrupt state.

Finality Verdict:

  • node_0 [BUGGY]: finalized height 13
  • node_1 [BUGGY]: finalized height 11
  • node_2 [HONEST]: finalized height 0 (isolated)

Safety Analysis

| Metric | Value |
|---|---|
| Honest nodes | 1 |
| Supermajority threshold | 2 (≥2/3) |
| Corrupt state finalized? | YES ⚠️ |
| Outcome | Majority seals corruption |

Key Insight: If a migration bug affects ≥2/3 of validators, GRANDPA works perfectly as designed; it just finalizes the wrong state. The immutable wall now protects corrupted data.


The Double-Edged Sword of Finality

Finality is completely indifferent to correctness.

In Part 4, I implemented GRANDPA as the “Immutable Wall”, a defense against deep re-orgs when partitions heal.

But this experiment reveals the dark side:

  1. GRANDPA Doesn’t Verify State: It only checks if 2/3 of validators signed. It has no idea if a state transition is correct.
  2. Irreversibility: Once finalized, honest nodes refuse to re-org past that point. The chain cannot self-heal.
  3. Minority Exile: The honest minority gets mathematically quarantined and cannot force a chain reorg.
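
Points 2 and 3 can be sketched as a fork-choice guard. This is a minimal illustration under stated assumptions, not the lab's implementation: `ChainView`, `ReorgDecision`, and `consider_reorg` are hypothetical names, and `fork_point` is taken to be the height of the last block shared by the current chain and the candidate fork.

```rust
/// A node's view of its own chain; only the finalized height matters here.
struct ChainView {
    finalized_height: u64,
}

#[derive(Debug, PartialEq)]
enum ReorgDecision {
    Accept,
    RejectBelowFinality,
}

impl ChainView {
    /// A fork that branches off below the finalized block would revert
    /// finalized history, so it is refused no matter what it contains.
    fn consider_reorg(&self, fork_point: u64) -> ReorgDecision {
        if fork_point < self.finalized_height {
            ReorgDecision::RejectBelowFinality
        } else {
            ReorgDecision::Accept
        }
    }
}
```

Nothing in `consider_reorg` inspects state validity. Once the corrupt block sits at or below `finalized_height`, an honest fork that branches off earlier is rejected on height alone.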

The Paradox

  • ✓ Finality protects you from attackers forking the chain
  • ✗ Finality also protects bugs from being undone

Why Substrate Has These Safeguards

This simulation shows exactly why the Polkadot SDK employs strict upgrade processes:

1. Governance Delay (Enactment Period)

Runtime upgrades don’t execute instantly. There’s a multi-day delay between a referendum passing and code deployment. This gives:

  • Node operators time to review the proposal
  • Community time to raise concerns
  • A window to coordinate a rollback if needed
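
The delay can be pictured as a scheduled action that stays cancellable until its enactment height. The sketch below is a toy model only; it does not reproduce any governance or scheduler pallet API, and `ScheduledUpgrade`, `UpgradeStatus`, and the block-height numbers used with it are hypothetical.

```rust
#[derive(Debug, PartialEq)]
enum UpgradeStatus {
    Pending,
    Cancelled,
    Enacted,
}

/// An approved upgrade waiting out its enactment delay.
struct ScheduledUpgrade {
    version: u32,
    enact_at: u64, // block height at which the upgrade takes effect
    status: UpgradeStatus,
}

impl ScheduledUpgrade {
    fn schedule(version: u32, approved_at: u64, enactment_delay: u64) -> Self {
        Self {
            version,
            enact_at: approved_at + enactment_delay,
            status: UpgradeStatus::Pending,
        }
    }

    /// Cancelling only works while the upgrade is still pending.
    fn cancel(&mut self) -> bool {
        if self.status == UpgradeStatus::Pending {
            self.status = UpgradeStatus::Cancelled;
            true
        } else {
            false
        }
    }

    /// Called once per block; returns the version to enact, if it is due.
    fn on_block(&mut self, height: u64) -> Option<u32> {
        if self.status == UpgradeStatus::Pending && height >= self.enact_at {
            self.status = UpgradeStatus::Enacted;
            Some(self.version)
        } else {
            None
        }
    }
}
```

With `ScheduledUpgrade::schedule(2, 100, 50)`, any block between 100 and 149 is a chance to call `cancel()`. From block 150 onward the upgrade is enacted and cancelling returns `false`, which is the point where the finality problem described above begins.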

2. Staging Testnets (Westend, Rococo)

Migrations must run on staging networks with production state snapshots before proposing to Mainnet (Polkadot/Kusama). This catches state root divergences under realistic conditions.

3. Try-Runtime

The try-runtime tool is Substrate’s primary defense layer:

try-runtime \
  --runtime ./path/to/runtime.wasm \
  on-runtime-upgrade \
  snap --path ./state_snapshot.bin

It executes the migration off-chain against real chain state (a snapshot file or a live node) and runs the migration’s pre- and post-upgrade checks, so a broken migration surfaces before deployment rather than after finality.
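
The same idea can be applied to the toy STF from this post. The sketch below is an analogue of a dry run, not try-runtime's actual API: `dry_run_migration` is a hypothetical helper, state values are simplified to `u64`, and the two post-conditions are the ones the correct migration above is meant to satisfy (target version set, every balance increased by exactly 1000).

```rust
use std::collections::BTreeMap;

// Assumption: values are plain u64 here; the post's state stores encoded bytes.
type State = BTreeMap<Vec<u8>, u64>;

/// Run `migration` on a copy of `snapshot` and verify post-conditions.
/// The snapshot itself is never modified.
fn dry_run_migration(
    snapshot: &State,
    target_version: u64,
    migration: impl Fn(&mut State),
) -> Result<State, String> {
    let mut after = snapshot.clone();
    migration(&mut after);

    // Post-condition 1: the stored version is the one governance approved.
    let version = after.get(&b"RUNTIME_VERSION"[..]);
    if version != Some(&target_version) {
        return Err(format!(
            "version is {:?}, expected {}",
            version, target_version
        ));
    }

    // Post-condition 2: every pre-existing balance gained exactly 1000.
    for (key, before) in snapshot.iter().filter(|(k, _)| k.starts_with(b"balance:")) {
        let expected = before + 1000;
        if after.get(key) != Some(&expected) {
            return Err(format!(
                "balance {} was not migrated",
                String::from_utf8_lossy(key)
            ));
        }
    }

    Ok(after)
}
```

Passing the buggy migration (a closure that only writes version 9999) returns an `Err` on the first post-condition, while the correct migration returns the migrated state. Run before the upgrade is proposed, a check like this catches the bug while it is still reversible.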


The Full Series – Consensus Isn’t Magic

Over six parts, I built a consensus sandbox from scratch. Here’s what the experiments proved:

| Part | Topic | Key Finding |
|---|---|---|
| 1 | Slot Collisions | Forks emerge when multiple leaders propose at the same slot |
| 2 | Network Latency | Forks stay alive when information propagates slowly |
| 3 | Partitions | Unbounded rollbacks occur when the network fragments |
| 4 | GRANDPA Finality | Finalized blocks stop re-orgs but only if honest majority exists |
| 5 | State Roots & Transactions | State verification neutralizes malicious authors |
| 6 | Runtime Upgrades | Migration bugs trap consensus if majority-infected |

Consensus is not magic. It’s a balancing act between:

  • Network topology (who talks to whom)
  • State execution (what gets computed)
  • Finality bounds (what becomes irreversible)

Mess up any one, and you either leak liveness (the chain stalls) or leak safety (wrong state gets locked in).


Full Code on GitHub

All 6 experiments with detailed metrics and logging:

👉 Kanasjnr/substrate-consensus-lab


What’s Next?

This series was a deep dive into why Substrate is designed the way it is. The experiments revealed:

  • The mathematical necessity of BFT thresholds
  • Why finality is a double-edged sword
  • How state execution and network topology are intertwined

Want me to extend this? Ideas:

  • Simulate validator slashing for equivocation
  • Measure finality latency under different network conditions
  • Test mixed-mode scenarios (some honest, some Byzantine, some lagging)
  • Implement BABE and GRANDPA in full (I simplified both here)

Questions? Corrections? Let’s discuss in the comments.