There are a few ways to brick a Substrate chain, and a panic or overweight execution in a mandatory hook is one of them. Therefore, we should avoid executing code in mandatory hooks (on_initialize / on_finalize) as much as possible.
While we also want to avoid panics and overweight execution in extrinsics, those are less harmful because collators can blacklist a bad extrinsic (after a failed attempt to bundle it). It could still be an active DoS vector against the chain, but it can be mitigated.
There are three main categories of non-extrinsic-triggered execution:
1. Periodic business logic execution
2. Incoming XCM execution
3. Delayed execution (e.g. referenda enactment) via pallet-scheduler
Note that #1 and #2 can be refactored to use pallet-scheduler for execution, so we really just need to make #3 safe.
pallet-scheduler executes the scheduled calls in the on_initialize hook, which means any panic will brick the chain. There will be no way to construct a valid block without triggering the panic path, and no way to inject other code execution before the panic path to potentially rescue the bad execution.
The guarantee of execution at a specific block can be useful for some critical logic, but it is not strictly required by most use cases. For example, the enactment of a referendum can usually be delayed for a few blocks without causing issues. Dispatch of an incoming XCM is expected to be queued and delayed anyway.
For executions that don’t strictly need to run at a particular block, we are better off offloading them to a different pallet, i.e. the safe-scheduler.
I initially came up with the idea of a safe scheduler pallet at orml#481. The core idea is that instead of executing all the non-extrinsic-triggered logic in on_initialize, we simply put it into a queue and use an offchain worker + unsigned tx to trigger it. This means any panic/overweight will only make that unsigned tx unbundlable. It will not impact block production, and therefore limits the damage.
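To make the failure mode concrete, here is a minimal sketch of the queue idea in plain Rust. It is only a model, not the orml pallet: `SafeQueue`, `BlockReport`, and `build_block` are hypothetical names, a closure stands in for an encoded call, and `std::panic::catch_unwind` stands in for the block author discarding a transaction that fails during bundling.

```rust
use std::collections::VecDeque;
use std::panic::{self, AssertUnwindSafe};

/// Stand-in for an encoded runtime call (hypothetical model).
/// The closure returns the weight it actually consumed.
pub type Call = Box<dyn Fn() -> u64>;

/// What happened to each queued task during one block-building attempt.
#[derive(Debug, Default, PartialEq)]
pub struct BlockReport {
    pub executed: Vec<u32>, // included in the block
    pub rejected: Vec<u32>, // panicked or overweight: unbundlable, dropped
    pub deferred: Vec<u32>, // did not fit, stays queued for a later block
}

#[derive(Default)]
pub struct SafeQueue {
    tasks: VecDeque<(u32, u64, Call)>, // (id, declared weight, call)
}

impl SafeQueue {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn push(&mut self, id: u32, declared_weight: u64, call: Call) {
        self.tasks.push_back((id, declared_weight, call));
    }

    /// Models a block author bundling the queued "unsigned transactions".
    /// A bad task only loses its own slot; the block is still produced.
    pub fn build_block(&mut self, weight_limit: u64) -> BlockReport {
        let mut report = BlockReport::default();
        let mut used = 0u64;
        let mut keep = VecDeque::new();
        while let Some((id, declared, call)) = self.tasks.pop_front() {
            if used.saturating_add(declared) > weight_limit {
                report.deferred.push(id);
                keep.push_back((id, declared, call));
                continue;
            }
            // The panic is contained here instead of aborting block production.
            match panic::catch_unwind(AssertUnwindSafe(|| call())) {
                Ok(actual) if actual <= declared => {
                    used += actual;
                    report.executed.push(id);
                }
                // Panicked, or used more weight than declared.
                _ => report.rejected.push(id),
            }
        }
        self.tasks = keep;
        report
    }
}
```

In this model a panicking or overweight task ends up in `rejected` and is dropped from the queue, while a task that merely does not fit stays queued for the next block. Whether a rejected task should be retried, reported, or removed is a design decision the sketch does not settle.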
This wasn’t such a big concern before, as it is relatively easy to prove that a runtime cannot panic and that the compute time of hooks is bounded. However, for parachains it is now possible to go overweight due to storage access, which can be hard to detect and rescue (it is hard to tell the size of an item without reading it, but after reading it, it could already be too late).
In general I like the idea of making the scheduler use unsigned extrinsics to schedule its work. It protects against panics and calls that use too much weight. The only problem would be that when some call always panics, we would keep trying to push the same work and it would panic every time. As long as the scheduler requires some privileged origin, I think we can assume that there would not be any kind of DoS attack using the scheduler.
You said that you are using an offchain worker to schedule the calls. Maybe we could add a function similar to inherent_extrinsics that the block builder could call to get these kinds of transactions from the runtime.
One good thing about unsigned txs is that if the block is full, they can just stay in the tx pool waiting to be included in the next block. The usual tx priority & longevity API can be used to manage them as well.
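As a rough illustration of how priority and longevity would interact with a full block, here is a toy pool in plain Rust. It is not the Substrate transaction pool: `Pool`, `PooledTx`, and `fill_block` are made-up names, and the expiry rule (valid through block `submitted_at + longevity`) is an assumption of this model.

```rust
/// Toy model of a pooled unsigned transaction (hypothetical types).
#[derive(Debug, Clone, PartialEq)]
pub struct PooledTx {
    pub id: u32,
    pub priority: u64,
    pub weight: u64,
    pub expires_at: u64, // last block number at which the tx is still valid
}

#[derive(Default)]
pub struct Pool {
    txs: Vec<PooledTx>,
}

impl Pool {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn len(&self) -> usize {
        self.txs.len()
    }

    /// Assumption of this model: a tx stays valid through block `now + longevity`.
    pub fn submit(&mut self, id: u32, priority: u64, weight: u64, now: u64, longevity: u64) {
        let expires_at = now.saturating_add(longevity);
        self.txs.push(PooledTx { id, priority, weight, expires_at });
    }

    /// Prune expired txs, then include the highest-priority ones that fit.
    /// Anything that does not fit simply waits in the pool.
    pub fn fill_block(&mut self, now: u64, weight_limit: u64) -> Vec<u32> {
        self.txs.retain(|tx| tx.expires_at >= now);
        // Stable sort: equal priorities keep their submission order.
        self.txs.sort_by(|a, b| b.priority.cmp(&a.priority));
        let mut remaining = weight_limit;
        let mut included = Vec::new();
        self.txs.retain(|tx| {
            if tx.weight <= remaining {
                remaining -= tx.weight;
                included.push(tx.id);
                false // leaves the pool
            } else {
                true // stays for a later block
            }
        });
        included
    }
}
```

A scheduled call that missed a full block is picked up in a later one as long as it has not expired, which is the behaviour described above.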
I really like this idea. We have also recently discussed scheduler with @gpestana and a few others and wished that it had more capabilities.
Some thoughts:
I have used the OCW+Unsigned code path in staking and it works well, but it does require a lot of boilerplate. We should think of wrappers around it to make it more programmable.
Alternatively, we have not really used the Task API in Substrate, which spawns a new wasm instance, so panics are less of a big deal. Imagine that the scheduler pallet’s on_initialize starts executing its scheduled tasks, each in a new wasm instance. If any of them panics, the main runtime will survive and can remove that task. This essentially gives us the property that you want to achieve via unsigned transactions, but entirely in the runtime.
Another feature that I’d like to see is something akin to scheduled_on_idle. Imagine you want to schedule a task to happen after block N, if there is space for it. This will be useful for things like automatic staking rewards. It doesn’t need to happen at a certain block, just after a certain block.
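A minimal sketch of what "after block N, if there is space" could mean, in plain Rust. `IdleScheduler`, `IdleTask`, and `schedule_after` are hypothetical names, weight is modelled as a single `u64`, and the hook is only loosely shaped like an on_idle hook that receives the block's leftover weight.

```rust
/// A task that may run in any block at or after `not_before` (hypothetical model).
#[derive(Debug, Clone, PartialEq)]
pub struct IdleTask {
    pub id: u32,
    pub not_before: u64,
    pub weight: u64,
}

#[derive(Default)]
pub struct IdleScheduler {
    pending: Vec<IdleTask>,
}

impl IdleScheduler {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn pending_len(&self) -> usize {
        self.pending.len()
    }

    /// "Run this at some point after block `not_before`, when there is room."
    pub fn schedule_after(&mut self, id: u32, not_before: u64, weight: u64) {
        self.pending.push(IdleTask { id, not_before, weight });
    }

    /// Called with whatever weight is left over in block `now`.
    /// Runs every due task that fits and returns the ids that ran, in order.
    pub fn on_idle(&mut self, now: u64, mut remaining: u64) -> Vec<u32> {
        let mut ran = Vec::new();
        self.pending.retain(|task| {
            if task.not_before <= now && task.weight <= remaining {
                remaining -= task.weight;
                ran.push(task.id);
                false // executed, remove from the pending list
            } else {
                true // not due yet, or no room in this block
            }
        });
        ran
    }
}
```

A task that is due but does not fit just stays pending, so it never competes with mandatory work. The trade-off is that nothing bounds how long it can be delayed if blocks stay full.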
To give a bit of additional information: in Moonbeam we only allow scheduling from democracy. When we schedule heavy work, our script always splits it to ensure that the extrinsic PoV and the execution time/PoV are each always < 25% of the block’s allowed limit (it creates many batches).
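The splitting step could look roughly like the sketch below. This is not Moonbeam's actual script: `split_into_batches` is a hypothetical helper, and it tracks a single weight number, whereas the real check covers extrinsic PoV and execution time/PoV separately.

```rust
/// Greedily split call weights into consecutive batches, each strictly below
/// `percent` % of `block_limit`. Returns `Err(weight)` for the first single
/// call that cannot fit under the cap on its own.
/// Hypothetical helper: one weight dimension only.
pub fn split_into_batches(
    weights: &[u64],
    block_limit: u64,
    percent: u64,
) -> Result<Vec<Vec<u64>>, u64> {
    // Widen to u128 so the multiplication cannot overflow.
    let cap = (block_limit as u128 * percent as u128 / 100) as u64;
    let mut batches = Vec::new();
    let mut current: Vec<u64> = Vec::new();
    let mut used = 0u64;
    for &w in weights {
        if w >= cap {
            return Err(w); // this call alone already reaches the cap
        }
        if used.saturating_add(w) >= cap {
            // Close the current batch and start a new one with this call.
            batches.push(std::mem::take(&mut current));
            used = 0;
        }
        current.push(w);
        used += w;
    }
    if !current.is_empty() {
        batches.push(current);
    }
    Ok(batches)
}
```

With a limit of 100 and a 25% cap, the weights 10, 10, 10, 20, 5 come out as four batches, and a single call of weight 25 or more is rejected up front rather than scheduled.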
However, we are worried about XCM execution, yes, especially because of the EVM access, which can’t easily be controlled and has a weight/PoV conversion that is not very precise.
Moving to an idle scheduler can introduce some issues, however: it could allow a block producer to control/delay the execution of an XCM (front-running, censoring, …).
We thought about having XCM (especially XCM->EVM) queued and executed through an extrinsic, but that brings the same issues.