Kusama v1.2.0 coretime migration XCM failure: a postmortem

seadanda · April 22, 2024, 4:59pm

Overview

The Kusama Relay Chain upgrade to v1.2.0 (1002000) included a migration to transition to coretime containing four XCM programs which are sent from the relay chain to the Coretime Chain. Each XCM contained Transact instructions to populate the state of the Coretime Chain. The calls in one of the messages failed to execute as the specified maximum weight was less than what was needed to run the extrinsic.

Basically it means that the message announced that it would use less weight than what it would actually use and this made the XCM message fail. This is a security measure to ensure that XCM messages can not stall a chain, but in this case was set too low for it to actually run.

The calls in this XCM were to set_lease (an extrinsic in the broker pallet on the coretime chain). This was intended to migrate the parachains who have an active lease from the relay chain to the Coretime Chain to allow the parachains who won auctions to be awarded coretime until their lease ends. The leases were thus not populated due to the failed XCM. There was no immediate downtime or any other impact for parachains, but the absence of leases in state means that parachains with existing leases would have had to buy coretime on the open market or lose their core next month.

Background

The Kusama Coretime Chain was added in polkadot-fellows/runtimes#212 and the transition to coretime consisted of three steps. These were posted as three referenda with coordinated enactment times:

Referendum 372 registered and onboarded the Coretime Chain as a parachain on Kusama.
Referendum 373 upgraded the Kusama relay chain to v1.2.0 (1002000) on 18th April, and included the migration in question.
Referendum 375 was scheduled to run 6 hours later and set up some final details on the Coretime Chain and started sales.

The start sales call takes the total number of cores, subtracts the reservations (system parachains and any pool cores) and leases (parachains who still have leases from auctions) and offers the remainder in the upcoming sale.

Detection

When the runtime upgrade enacted, it was immediately found by parity engineers that the XCM message to set the leases had been sent succesfully but failed to execute: message 0x03e9fb5a7e35307b75c5eae606dd5838a5f20cad1a2c3c21d6fb5686cd26d82f. However, there was no way to fix this in storage before the sales were started six hours later.

The failed message can be found in state in the DMP queue to the Coretime Chain (1005) at block hash 0x1cf07a92effd82b8f40d3546268aa08f1092ea7f2ea72be89a664144eb0dd40c (block 22,790,001).

This could be reproduced by re-executing the block on coretime that executed the xcm messages. With logs set to trace, it was clear that the hardcoded ref_time was too low.

Root cause

The XCM Transact instruction requires the weight limit to execute the call to be specified in require_weight_at_most by the sending runtime. The call in question refers to one in another runtime, and weights differ between runtimes. Directly importing the weights from the coretime chain runtime would create a dependency on the entire runtime. Additionally, because the communication between the relay and coretime chains is bidirectional, a circular dependency cycle would result. Thus a hardcoded value is used.

The coretime pallet used by the relay chain has a hardcoded value for the four XCMs which allows a maximum weight of Weight(ref_time: 170000000, proof_size: 20000).
This weight must be greater than or equal to the maximum weight of the three calls used by the relay chain: reserve, set_lease and notify_core_count.

It can be calculated from the weights file for the broker pallet in the coretime chain runtime where we have:

Weight::from_parts(10_941_000, 0)
	.saturating_add(Weight::from_parts(0, 1951))
	.saturating_add(T::DbWeight::get().reads(3))
	.saturating_add(T::DbWeight::get().writes(1))

from which we can get the ref_time and proof_size. A read corresponds to Weight::from_parts(25_000_000, 0) and a write corresponds to Weight::from_parts(100_000_000, 0). Therefore:

ref_time = 10941000 + 3 * 25000000 + 1 * 100000000 = 185941000
proof_size = 1951

The actual call to set_lease thus used Weight(ref_time: 185941000, proof_size: 1951), so the hardcoded value for the ref_time was set about 9% too low.

Kusama fix

Since there is a week between running the start_sales extrinsic and the sales actually beginning on the open market (the interlude), we had a short window in which to fix the state to include the information about the parachains, make a release and put it to referendum. The deciding time in a referendum on Kusama is up to 14 days, but can be faster with high enough approval and support.

There were two options to fix this on Kusama since start_sales had already run:

Adding the leases directly to storage on the coretime chain and modifying storage in several other places to account for this
Writing a migration to “replay” the calls inside the failed XCM and restarting the sale

Either option would need to be implemented, tested and put up for a referendum, and voted through all within 7 days. They would both require a referendum on the Whitelisted caller track. Neither option required a runtime upgrade for Kusama.

Option 1 would require modifying multiple storage locations by hand using a combination of kill_storage and set_storage.
Option 2 would deal only with the public API of the broker pallet, was testable before deployment using try-runtime, and additionally allowed the inclusion of the fix to a small edge case bug affecting several parachains with very short leases (ending in the first sales period) which meant that they would not be marked as allowed to renew. Without a runtime upgrade, more storage would need to be edited by hand to allow these parachains to renew.

Therefore it was decided to choose option 2:

Bump the broker pallet to fix a bug where very short leases (ending in the first 28 days after running start_sales) could not be renewed.
Write a migration which does three things:
1. Imports a hardcoded list of leases into the state (generated using a script and cross checked with the failed XCM in the state at block 22,790,001)
2. Adds three cores to ensure there are enough cores for everyone to renew while also leaving the intended 3 for the open market
3. Restarts sales
Add try-runtime checks to ensure that both the migration and the renewals fix work.

The fix was added in polkadot-fellows/runtimes#276, reviewed and then released in the polkadot-fellows/runtimes@v1.2.2 release the next day.
Kusama referendum 380 is up for voting on the whitelist track after being whitelisted by fellowship referendum 72.

Polkadot and future release mitigation

This was an avoidable issue and was due to human error/oversight and a gap in testing/tooling. However, this is particularly sensitive to error due to the hardcoded weight values required in the Transact instructions. Therefore after fixing the state on Kusama, several mitigations and improvements can be put in place to avoid a similar mistake to be made for Polkadot, or even for divergences between the hardcoded weights in other places in the CoretimeInterface and updated weights when benchmarks are rerun for either runtime.

This particular hardcoded value for set_leases has since been exposed as configurable in the relay chain runtimes, rather than being hardcoded in the pallet itself. This allows for limits to be increased based on benchmarked weights for each individual runtime. This has been implemented in paritytech/polkadot-sdk #4189.

A further improvement to cover all hardcoded weights is to add emulated tests to ensure that these weights are sane, which would protect against increases when benchmarks are rerun for the other runtime and new weights are generated.

The value should be set to the maximum benchmarked weight of the three calls in the given network, with an additional buffer of 5-10% to allow for the expected small variations in benchmarking one runtime before release. This is the way it is calculated in the Coretime Chain on Rococo and Westend, where the calls are used throughout the operation of the chain.

With the hardcoded weights as they are, the runtime upgrades of the relay chain and coretime chain are essentially coupled if the benchmarked weights have changed more than the buffer percentage.

A potential long term fix for this could be to rework the Transact instruction to decode the call at the point of execution and get the weight of that call automatically. This would remove the need for hardcoded weights in Transact instructions altogether.

seadanda · April 23, 2024, 8:40am

The runtime upgrade and migration completed successfully this morning and both issues are confirmed as fixed.
The events can be seen here.

All leases have migrated and no action is required from parachain teams.

Any lease which was due to end in the next two periods (56 days) can now be found in state in broker.AllowedRenewals, and can be renewed starting from (Kusama Relay) block 23,264,640.

The sale start has moved to (Kusama Coretime) block number 137,346 a week from today, where 3 bulk cores can be bought on the open market.

WilliamFreude · April 26, 2024, 7:02am

Thanks for the detailed write up. Are there any plans on creating a Chopsticks CI for future runtime upgrades to catch these bugs as part of the release process?

seadanda · May 2, 2024, 3:55pm

I’m not sure about (short term) fellowship plans for running Chopsticks in CI, but manual Chopsticks tests for sure. Potentially also expanding opengov-cli to make running Chopsticks tests part of the flow of creating a referendum. Improving tooling and test coverage for this class of issue (XCM weight issues) is definitely required.

Topic		Replies	Views
XCM Upgrade To v3 Ecosystem xcm	0	582	November 17, 2023
Kusama Parachain Lease Unlock Tech Talk	2	81	December 12, 2024
Multichain XCM integration test Tech Talk xcm	11	856	March 29, 2023
Polkadot Digest 29 Sep 2022 Digest	0	274	September 29, 2022
Improve XCM development and release process Tech Talk xcm	5	623	March 2, 2024