It’s been a while since I posted about the project I’ve been into for the last several months, so it’s time to shed some light on what it’s all about and where it’s going.
TL;DR
I’ve created a Wasm executor from scratch. It is not a general-purpose executor; its only function is executing PVFs. Its architecture allows for full determinism across all 64-bit platforms (although only x64 is supported right now) at the cost of some performance. It is still at the PoC stage but is already working and validating real parachain candidates in my test environment. The source code is here.
Why one more Wasm executor?
Arguments about PVF execution determinism have a long history. Discussions on whether it’s possible to achieve that determinism while keeping the Polkadot spec open are still ongoing. My position hasn’t changed: it’s not possible to achieve execution determinism without controlling either the VM spec or the VM implementation. As the Wasm spec is not controlled by us and doesn’t provide any means to achieve determinism, it’s not achievable. It’s that simple.
Still, obliging all Polkadot implementors to use only a single reference implementation would be wrong. So, after some discussions, we see two possible ways of using it:
- To include the executor spec in the Polkadot spec. That is, declare that the Polkadot PVF host requires a Wasm executor bound by stricter rules than those defined in the Wasm spec, and state those rules clearly. Implementors can then use the reference implementation or create their own VMs that must comply with our subset of the Wasm spec. Given the full determinism that spec brings, it’s easy to create a conformance testing suite to help implementors. It’s not an easy path, but someday we’ll see Wasmtime and Wasmer implementing a Polkadot-compliant execution mode.
- Proposed by @eskimor: Leave everything as it is, but when a candidate validation fails, rerun the PVF using the reference implementation to ensure the failure is deterministic. Implementors thus remain free to choose the primary VM implementation and only fall back to the reference implementation in controversial cases; a rough sketch of that flow follows this list.
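Here’s a minimal sketch of how that fallback could look, assuming hypothetical `primary_execute` / `reference_execute` entry points (the names and types are mine, not the actual PVF host API):

```rust
// Illustration only: the types and functions below are placeholders, not the
// actual PVF host API.
enum ValidationOutcome {
    Valid(Vec<u8>),  // candidate accepted, with its validation outputs
    Invalid(String), // candidate rejected for a deterministic reason
}

// Placeholder executors; the real ones would compile and run the PVF blob.
fn primary_execute(_pvf: &[u8], _params: &[u8]) -> Result<Vec<u8>, String> {
    unimplemented!()
}
fn reference_execute(_pvf: &[u8], _params: &[u8]) -> Result<Vec<u8>, String> {
    unimplemented!()
}

fn validate_candidate(pvf: &[u8], params: &[u8]) -> ValidationOutcome {
    // Run on whatever primary executor the implementor chose.
    match primary_execute(pvf, params) {
        Ok(outputs) => ValidationOutcome::Valid(outputs),
        // The primary executor failed; before voting against the candidate,
        // rerun on the deterministic reference executor and only reject if it
        // fails there as well.
        Err(reason) => match reference_execute(pvf, params) {
            Ok(outputs) => ValidationOutcome::Valid(outputs),
            Err(_) => ValidationOutcome::Invalid(reason),
        },
    }
}
```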
How bad is the performance of this implementation?
Not great, not awful. As expected, compilation is 10+ times faster than with Wasmtime, but when it comes to execution, there’s currently nothing to brag about.
Still, it’s in the ballpark of acceptable values, and it should be kept in mind that the current PoC code performs only a very basic set of optimizations. I believe it’s entirely possible to improve the picture significantly. It will never match optimizing compilers, but it could perform much better than it does now.
Some tech background
It’s a two-pass compiler (actually three-pass right now, but that will be fixed) focusing on determinism much more than on performance. There is no register allocation, LICM, or constant loading, and all the generated functions are ABI-compliant, so no trampolines are needed. No machine-code-level optimizations at all. It’s as simple as it could be.
Some more details for those who are interested
The first pass translates from Wasm to IR (I call it “poor man’s IR”). The purpose of the IR is to uncover the implicit stack operations performed by Wasm. The IR VM has two integer registers, two floating-point registers, and a stack; everything is 64 bits wide. (Sidenote: you’ll see three integer registers in the code; that’s redundant and will be fixed.)
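To give a feel for it, a hypothetical Rust representation of such an IR might look like this (the shape and the floating-point register names are my illustration, not the actual definitions from the codebase):

```rust
/// A sketch of the "poor man's IR": two integer and two floating-point
/// registers plus a value stack, everything 64 bits wide. Illustration only.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Reg {
    SRA, // scratch integer register A
    SRB, // scratch integer register B
    FRA, // scratch floating-point register A (name assumed for this sketch)
    FRB, // scratch floating-point register B (name assumed for this sketch)
}

#[derive(Clone, Copy, Debug)]
enum IrInsn {
    MoveImm(Reg, u64), // move REG, imm
    Move(Reg, Reg),    // move DST, SRC
    Push(Reg),         // push REG onto the value stack
    Pop(Reg),          // pop the top of the value stack into REG
    Add(Reg, Reg),     // DST := DST + SRC
}
```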
So, from a simple piece of Wasm code, `(i32.add (i32.const 1) (i32.const 2))`, roughly the following IR will be generated:
move SRA, 1
push SRA
move SRA, 2
push SRA
pop SRA
pop SRB
add SRA, SRB
push SRA
At this point, an additional optimization pass is performed, which is explicit right now but will be integrated into the first pass in the future. It’s easy to see that the `push SRA` / `pop SRA` pair is redundant. The optimization pass eliminates it and also changes `push SRA` / `pop SRB` to `move SRB, SRA`.
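A rough sketch of such a peephole over adjacent instruction pairs, reusing the hypothetical `IrInsn`/`Reg` types from above (the actual pass is smarter, e.g. it can deal with non-adjacent pairs; this only illustrates the idea):

```rust
/// A minimal peephole over adjacent instruction pairs. Illustration only.
fn peephole(code: &[IrInsn]) -> Vec<IrInsn> {
    let mut out: Vec<IrInsn> = Vec::with_capacity(code.len());
    for &insn in code {
        match (out.last().copied(), insn) {
            // `push R` immediately followed by `pop R` is a no-op: drop both.
            (Some(IrInsn::Push(r1)), IrInsn::Pop(r2)) if r1 == r2 => {
                out.pop();
            }
            // `push R1` immediately followed by `pop R2` is just a register move.
            (Some(IrInsn::Push(r1)), IrInsn::Pop(r2)) => {
                out.pop();
                out.push(IrInsn::Move(r2, r1));
            }
            _ => out.push(insn),
        }
    }
    out
}
```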
The second pass is code generation, where IR instructions are translated directly to machine code instructions, and a good part of them is translated 1:1. The IR registers are just mapped to CPU registers. That is, `push SRA` is translated to `push rax` on x64. Thus, the machine code makes heavy use of the machine stack. That hits performance significantly but opens the way to full determinism, as the machine stack depth is always known at every execution point and is the same on every platform.
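To make the 1:1 mapping concrete, here’s a toy lowering of a few IR instructions to raw x64 bytes, again reusing the hypothetical `IrInsn`/`Reg` sketch from above (the encodings are hand-picked for a `SRA` → `rax`, `SRB` → `rbx` mapping; the actual mapping and emitter are different):

```rust
/// Toy lowering of a handful of IR instructions to x64 machine code bytes.
/// Illustration only; not the project's code generator.
fn emit_x64(insn: IrInsn, out: &mut Vec<u8>) {
    match insn {
        // push rax = 0x50, push rbx = 0x53
        IrInsn::Push(Reg::SRA) => out.push(0x50),
        IrInsn::Push(Reg::SRB) => out.push(0x53),
        // pop rax = 0x58, pop rbx = 0x5B
        IrInsn::Pop(Reg::SRA) => out.push(0x58),
        IrInsn::Pop(Reg::SRB) => out.push(0x5B),
        // mov rax, imm64 = REX.W + B8 io; mov rbx, imm64 = REX.W + BB io
        IrInsn::MoveImm(Reg::SRA, imm) => {
            out.extend_from_slice(&[0x48, 0xB8]);
            out.extend_from_slice(&imm.to_le_bytes());
        }
        IrInsn::MoveImm(Reg::SRB, imm) => {
            out.extend_from_slice(&[0x48, 0xBB]);
            out.extend_from_slice(&imm.to_le_bytes());
        }
        // mov rbx, rax = REX.W + 89 /r with ModRM 0xC3
        IrInsn::Move(Reg::SRB, Reg::SRA) => out.extend_from_slice(&[0x48, 0x89, 0xC3]),
        // add rax, rbx = REX.W + 01 /r with ModRM 0xD8
        IrInsn::Add(Reg::SRA, Reg::SRB) => out.extend_from_slice(&[0x48, 0x01, 0xD8]),
        // Floating-point registers and other combinations are omitted here.
        _ => unimplemented!("not covered in this sketch"),
    }
}
```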
Quick Q&A
Q: Was that fun?
A: Absolutely!
Q: What about gas metering?
A: This implementation can adopt gas metering at a minimal cost, as the execution is deterministic and basic block weights are known at compile time.
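Purely as an illustration of the idea (the weights and helper names are made up, not the project’s design): the compiler could sum a static per-instruction weight for each basic block, and the code generator would then emit a single gas-counter check per block.

```rust
/// Illustration only: made-up static weights per IR instruction, reusing the
/// hypothetical `IrInsn` type sketched above.
fn static_weight(insn: &IrInsn) -> u64 {
    match insn {
        // Stack traffic is a bit more expensive in this toy model.
        IrInsn::Push(_) | IrInsn::Pop(_) => 2,
        _ => 1,
    }
}

/// Total weight of one basic block, computable at compile time; the code
/// generator would emit one "subtract-and-check" of the gas counter per block
/// instead of metering every instruction at run time.
fn block_weight(block: &[IrInsn]) -> u64 {
    block.iter().map(static_weight).sum()
}
```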
Q: Does it feature the logical stack metering instrumentation we use now with Wasmtime?
A: It doesn’t need one. In this implementation, the value stack depth is the native stack depth. The only limit needed is the native stack limit, which the OS easily enforces.
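For instance (my illustration, not the actual host code), the limit could be enforced simply by running the compiled PVF on a worker thread with a fixed native stack size:

```rust
use std::thread;

/// Illustration only: run the compiled PVF on a worker thread with a fixed
/// native stack size. Since the value stack is the native stack here, this is
/// the only limit needed, and the OS enforces it.
/// `run_compiled_pvf` is a hypothetical entry point.
fn run_with_stack_limit(stack_bytes: usize) -> thread::Result<()> {
    thread::Builder::new()
        .stack_size(stack_bytes)
        .spawn(|| {
            // run_compiled_pvf();
        })
        .expect("failed to spawn PVF worker thread")
        .join()
}
```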
Q: Why a separate executor instead of integrating this implementation into the Substrate executor?
A: The Substrate executor aims to be a general-purpose Wasm executor for blockchains. As such, it brings in some noticeable overhead. On the other hand, the PVF execution logic is much simpler than it is for relay chain runtimes. A narrowly focused tool has more leverage than a general-purpose one; it can easily be tuned to perform its single task better.
Q: What’s next?
A: The plan is to continue working on bugfixes, stability, security, Polkadot integration, and so on, in the hope of bringing our deterministic future closer while we discuss the destiny of this implementation.
Q: What if it’s never adopted?
A: I won’t cry. Well, maybe just a little bit. It was a super interesting experience to build an ISA-to-machine-code thing, something I hadn’t done for the last 15 years, and I learned so much about Wasm and Polkadot internals along the way that I feel it’s already paid off.