Recover corrupted Staking ledgers in Polkadot and Kusama

gpestana · August 26, 2024, 3:51pm

TLDR: This post motivates the deployment of a new migration in the fellowship runtimes to fix the corrupted ledgers in Polkadot and Kusama. It also explains why/how the ledgers got corrupted, how they can be restored and the ledger’s final state once the migrations are executed.

For more information about this issue and recovery process, check:

Double bonding corruption explained
Ledger corruption cases and recovery walkthrough
Timelines of corrupted ledgers
PR#3639 Staking ledger bonding fixes
PR#3706 Extrinsic to restore corrupt staking ledgers
Detailed plan to recover ledgers in Polkadot
Detailed plan to recover ledgers in Kusama
Generate and test referenda calls to recover ledgers in Polkadot and Kusama (note: this mechanism is not going to be used, see section Recovering the corrupted ledgers below).

Background

Note: in the context of Polkadot’s staking, a “ledger” is a data structure that keeps track of data and metadata of a staker in the system, such as the amount of stake bonded, associated stash accounts, etc.

Throughout blocks #19551000…#20181758 in Polkadot and #21570000…#22515962 in Kusama (i.e. through releases v1.1.0 to v1.1.3), the staking logic did not prevent a controller from becoming a stash of another ledger (introduced by removing this check). Given that the rest of the code expects that this never happens, bonding a ledger with a stash that is a controller of another ledger may lead to data inconsistencies and data loss in bonded ledgers. For a more detailed technical explanation of this issue, check this hackmd.

In a nutshell, when fetching a ledger with a given controller, there could be some paths where the wrong ledger was returned, which could lead to unexpected/wrong ledger states.
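The failure mode can be illustrated with a toy model. This is only a sketch under stated assumptions, not the pallet's code: it assumes staking keeps a stash -> controller map and a controller -> ledger map, and that the second bond uses the stash as its own controller. The class `ToyStaking`, its flag and its method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Ledger:
    stash: str   # stash account this ledger belongs to
    active: int  # actively bonded amount, in DOT


class ToyStaking:
    """Toy model: `bonded` maps stash -> controller, `ledger` maps controller -> Ledger."""

    def __init__(self, reject_controller_as_stash: bool):
        self.bonded: dict[str, str] = {}
        self.ledger: dict[str, Ledger] = {}
        # Hypothetical switch standing in for the check that was removed and later restored.
        self.reject_controller_as_stash = reject_controller_as_stash

    def bond(self, stash: str, controller: str, amount: int) -> None:
        if stash in self.bonded:
            raise ValueError("stash is already bonded")
        if self.reject_controller_as_stash and stash in self.ledger:
            raise ValueError("stash is already the controller of another ledger")
        self.bonded[stash] = controller
        # Without the check above, this can overwrite another stash's ledger.
        self.ledger[controller] = Ledger(stash, amount)

    def mismatched_stashes(self) -> list[str]:
        """Stashes whose controller no longer points at a ledger for that stash."""
        return sorted(
            s for s, c in self.bonded.items()
            if c not in self.ledger or self.ledger[c].stash != s
        )


buggy = ToyStaking(reject_controller_as_stash=False)
buggy.bond("13Sv", "12gm", 20_000)  # 13Sv bonds with 12gm as controller
buggy.bond("12gm", "12gm", 7_000)   # the controller bonds as a stash itself
corrupted = buggy.mismatched_stashes()  # ["13Sv"]
```

In this model, looking up the ledger through controller 12gm now returns the ledger of stash 12gm, while the ledger that belonged to 13Sv has been overwritten. With the flag set to `True`, the second `bond` call is rejected instead.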

This PR and the v1.1.3 release upgrade in Polkadot and Kusama fixed this regression and blocked the corrupted ledgers to avoid further corruption.

Recovering the corrupted ledgers

The extrinsic Staking.restoreLedger has been introduced as a mechanism to automatically i) restore and ii) unlock the ledgers that are corrupted. This extrinsic has been introduced in PR#3706 and it restores the corrupted ledger depending on the current corruption type and path (see this walkthrough to check all the potential corruption cases).

For a detailed explanation and recovery strategies, check the following docs:

The current Staking.restore_ledger is missing an important check that ensures that the final state of the restored ledger does not have more active stake than the current free balance of the stash (see this patch in the staking pallet). To avoid having to wait for the whole polkadot-sdk and fellowship-runtimes release train for the patch, there are currently 2 options that only require a release of the fellowship runtime:

1. Option A: Runtime migration

Deploy one-time migrations in the fellowship runtime which call into Staking.restore_ledger for the list of corrupted ledgers and perform the additional checks missing in the current pallet-staking. The migration consists of:

For every ledger that needs recovery:
1.1. Calls into Staking.restore_ledger to restore the ledger
1.2. Performs the remaining checks missing in the current pallet-staking

Check the PR against the fellowship runtimes with the migrations for Polkadot and Kusama.
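The two migration steps can be sketched as follows. This is a hedged illustration, not the code in the PR: the state layout, the `migrate` function and the `toy_restore` stand-in for Staking.restore_ledger are all hypothetical, and discarding a restoration that fails the check is an assumption about how a failure would be handled.

```python
import copy
from typing import Callable


def migrate(state: dict, corrupted: list[str],
            restore: Callable[[dict, str], None]) -> dict[str, bool]:
    """Restore each listed ledger, keeping the result only if the extra check passes."""
    outcome = {}
    for stash in corrupted:
        candidate = copy.deepcopy(state)
        restore(candidate, stash)  # step 1.1: restore the ledger
        # Step 1.2: active stake must not exceed the stash's free balance.
        ok = candidate["active"][stash] <= candidate["free"][stash]
        if ok:
            state.clear()
            state.update(candidate)
        outcome[stash] = ok  # on failure, `state` is left untouched
    return outcome


def toy_restore(state: dict, stash: str) -> None:
    # Hypothetical stand-in: rebuild the active stake from a recorded lock amount.
    state["active"][stash] = state["locked"][stash]


state = {
    "active": {"a": 0, "b": 0},
    "free": {"a": 100, "b": 50},
    "locked": {"a": 80, "b": 70},
}
outcome = migrate(state, ["a", "b"], toy_restore)
# outcome == {"a": True, "b": False}; only "a" ends up restored.
```

Running the restoration on a copy keeps a failed check from leaving a half-restored ledger behind, which is the property the missing check is meant to guarantee.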

2. Option B: Temporary deployment of fixer pallet in fellowship runtimes

Another option is to add a temporary pallet to the fellowship runtime that exposes an extrinsic that:

1. Performs checks of whether the ledger needs to be recovered;
2. Calls into Staking.restore_ledger to restore the ledger;
3. Performs the remaining checks missing in the current pallet-staking.

This temporary extrinsic can be called by any signed origin. The checks in the first step ensure that only ledgers that are both corrupted and whitelisted can be mutated and recovered.
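The reason an unprivileged origin is acceptable here is that the guard runs before any state changes. A minimal sketch of that gating, with hypothetical names (`try_recover`, `is_corrupted`, `recover`) and return values chosen only for illustration:

```python
from typing import Callable


def try_recover(stash: str, whitelist: set[str],
                is_corrupted: Callable[[str], bool],
                recover: Callable[[str], None]) -> str:
    """Recover `stash` only if it is whitelisted and still corrupted."""
    if stash not in whitelist:
        return "rejected: not whitelisted"
    if not is_corrupted(stash):
        return "rejected: nothing to recover"
    recover(stash)  # restore the ledger, then run the remaining checks
    return "recovered"


# Toy state: "x" is whitelisted and corrupted, "y" is whitelisted but healthy.
still_corrupted = {"x"}
whitelist = {"x", "y"}

results = [
    try_recover(s, whitelist, still_corrupted.__contains__, still_corrupted.discard)
    for s in ["x", "x", "y", "z"]
]
```

In this sketch the first call recovers "x", the repeated call is rejected because nothing is left to recover, and "y" and "z" are rejected without touching any state, so calling the extrinsic more than once or for an unrelated account is harmless.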

Rewarding the corrupted ledgers retroactively

The ledgers that have been blocked due to corruption may not have been able to take part in staking as a nominator/validator. In order to compensate the owners of the corrupted ledgers, we propose rewarding the ledger accounts with funds from the treasury. The calculation and rewarding referenda will be discussed in a separate thread.

Juan_CDe · January 10, 2025, 12:12pm

Unfortunately, one of the affected accounts, 12gmcL9eej9jRBFT26vZLF4b7aAe4P9aEYHGHFzJdmf5arPi, continues to experience balance inconsistencies that originated before the issue above was resolved. Below is a detailed explanation of the balance issue, its current status, and potential solutions:

Origin

Account 13SvkXXNbFJ74pHDrkEnUw6AE8TVkLRRkUm2CMXsQtd4ibwq (shortened as 13Sv) bonded approximately 20k DOT, using 12gmcL9eej9jRBFT26vZLF4b7aAe4P9aEYHGHFzJdmf5arPi (shortened as 12gm) as its controller.
After the runtime upgrade to v1.1.2, the issue described above allowed 12gm to bond ~7k DOT as a stash in its own right.
Because of the corrupt ledger state of these two accounts, the runtime incorrectly calculated the transferable balance of 13Sv as the free balance of 13Sv (~20k DOT) minus the frozen balance of 12gm (~7k DOT).
Then, 13Sv added ~13k DOT to its bonded balance, which resulted in an increase to the frozen balance of 12gm to ~20k DOT, again due to the corrupted ledger state.

Current Status

Because its free balance is less than its frozen balance, 12gm doesn’t have any transferable balance available to pay for fees in order to unstake and rectify the situation.
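The approximate figures above can be checked with a few lines of arithmetic. This sketch assumes a simplified rule, transferable = max(free - frozen, 0), which ignores details of the real balance logic such as the existential deposit. It also assumes 12gm holds about 7k DOT free, a figure that is not stated directly and is inferred from the ~13k DOT deposit suggested in Solution A below.

```python
def transferable(free: int, frozen: int) -> int:
    """Simplified rule (assumption): whatever exceeds the frozen amount can move."""
    return max(free - frozen, 0)


# All amounts are approximate, in DOT.
FREE_13SV = 20_000
FROZEN_12GM_BEFORE = 7_000

# The corrupted state made the runtime subtract 12gm's frozen balance from 13Sv's free balance.
miscalculated_13sv = transferable(FREE_13SV, FROZEN_12GM_BEFORE)  # 13_000

# 13Sv then bonded ~13k DOT more, and the freeze landed on 12gm instead.
FROZEN_12GM_AFTER = FROZEN_12GM_BEFORE + 13_000  # 20_000

FREE_12GM = 7_000  # assumption, inferred from Solution A
transferable_12gm = transferable(FREE_12GM, FROZEN_12GM_AFTER)  # 0
shortfall_12gm = FROZEN_12GM_AFTER - FREE_12GM  # 13_000
```

Under these assumptions 12gm is short by roughly 13k DOT, so any deposit would have to exceed that amount before fees could be paid.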

Potential Solutions

A) Increase the free balance of 12gm
Either the user or the Treasury deposits ~13k DOT into 12gm, turning the transferable balance positive and restoring account functionality.

B) Adjust the frozen balance
Set the frozen balance of 12gm back to the correct value or 0 DOT (effectively unbonding the funds).

C) Apply the extrinsic forceUnstake
Force the account to unstake, allowing the user to stake again properly in the future.

Recommendation

After consideration, the proposed solution would be to adjust the frozen balance to 0 DOT along with the right number of consumers and providers (Solution B):

It does not require root origin to execute (unlike option C).
It does not involve the Treasury or create any significant cost for the user (unlike option A).

Ultimately, the decision lies with the community and the Technical Fellowship, who can discuss and determine the most optimal approach for including it in an OpenGov proposal or runtime upgrade, ideally as part of the v1.4.0 upgrade.

bkchr · January 10, 2025, 5:09pm

I would propose to just do this as an extra proposal on the whitelisted caller track. It is not anything controversial and just fixes some old issues.
