Deterministic PVF executor

It’s been a while since I posted about the project I’ve been into for the last several months, so now it’s time to shed some light on what it is all about and where it goes.

TL;DR

I’ve created a Wasm executor from scratch. It is not a general-purpose executor; its only function is executing PVFs. Its architecture allows for full determinism across all the 64-bit platforms (although only x64 is supported right now) at the cost of some performance hit. It is still in the PoC stage but is already working and validating real parachain candidates in my test environment right now. The source code is here.

Why one more Wasm executor?

Arguments on PVF execution determinism have a long story. Discussions on whether it’s possible to achieve that determinism while keeping Polkadot spec open are still ongoing. My position didn’t change: it’s not possible to achieve the execution determinism controlling neither VM spec nor VM implementation. As Wasm spec is not controlled by us and doesn’t provide any means to achieve determinism, it’s not achievable. That easy.

Still, obliging all Polkadot implementors only to use a single reference implementation would be wrong. So, after some discussions, we see two possible ways of using it:

  • To include the executor spec into the Polcadot spec. That is, declare that the Polkadot PVF host requires a Wasm executor bound by stricter rules than ones defined in Wasm spec, and state those rules clearly. Thus, implementors can use the reference implementation or create their own VMs that must comply with our subset of the Wasm spec. Considering the full determinism that spec brings, it’s easy to create a conformance testing suite to help implementors. It’s not an easy way, but someday, we’ll see Wasmtime and Wasmer implementing Polkadot-complaint execution mode.

  • Proposed by @eskimor: Leave everything as it is, but when a candidate validation fails, rerun a PVF using the reference implementation to ensure the failure is deterministic. Thus, implementors are still free to choose the primary VM implementation and only use the reference implementation in controversial cases.

How bad is the performance of this implementation?

Not great, not awful. Expectedly, the compilation is 10+ times faster than Wasmtime, but when it comes to execution, there’s currently nothing to brag about:

Still, it’s in the ballpark of acceptable values, and it must be considered that the current PoC code is only performing the very basic set of optimizations. I believe it’s totally possible to improve the picture significantly. It wouldn’t perform better than optimizing compilers, or as well as optimizing compilers, but it could perform much better than now.

Some tech background

It’s a two-pass compiler (actually three-pass right now, but that will be fixed) focusing on determinism much more than performance. There is no register allocation, LICM, or constant loading, and all the functions generated are ABI-complaint, so no trampolines are needed. No machine-code-level optimizations at all. It’s as simple as it could be.

Some more details for those who's interested

The first pass translates from Wasm to IR (I call it “poor man’s IR”). The purpose of IR is to uncover implicit stack operations performed by Wasm. The IR VM has two integer registers, two floating-point registers, and a stack; everything is 64-bit wide. (Sidenote: you’ll see three integer registers in the code. It’s redundant and will be fixed).

So, from a simple Wasm code (i32.add (i32.const 1) (i32.const 2)) roughly the following IR will be generated:

move SRA, 1
push SRA
move SRA, 2
push SRA
pop SRA
pop SRB
add SRA, SRB
push SRA

At this point, an additional optimization pass is performed, which is explicit right now but will be integrated into the first pass in the future. It’s easy to see that push SRA / pop SRA pair is redundant. The optimization pass eliminates it and also changes push SRA / pop SRB to move SRB, SRA.

The second pass is code generation, where IR instructions are translated directly to machine code instructions, and a good part of them is translated 1:1. The IR registers are just mapped to CPU registers. That is, push SRA is translated to push rax on x64. Thus, the machine code makes heavy use of the machine stack. That hits performance significantly but opens the way to full determinism, as the machine stack depth is always known at every execution point and is the same on every platform.

Quick Q&A

Q: Was that fun?
A: Absolutely!

Q: What about gas metering?
A: This implementation can adopt gas metering at a minimal cost, as the execution is deterministic and basic block weights are known at compile time.

Q: Does it feature a logical stack metering instrumentation we use now with Wasmtime?
A: It doesn’t need one. In this implementation, the value stack depth is the native stack depth. The only limit needed is the native stack limit, which the OS easily enforces.

Q: Why a separate executor instead of integrating this implementation into the Substrate executor?
A: The Substrate executor tends to be a general-purpose Wasm executor for blockchains. As such, it brings in some noticeable overhead. On the other hand, the PVF execution logic is much simpler than it is for relay chain runtimes. A narrow-focused tool has more leverage than a general-purpose one; it can be easily adjusted to perform its single task better.

Q: What’s next?
A: The plan is to continue working on bugfixes, stability, security, Polkadot integration, and so on in the hope of bringing our deterministic future closer while discussing the future destiny of this implementation.

Q: What if it’s never adopted?
A: I won’t cry. Well, maybe just a little bit. It was a super interesting experience to make an ISA-to-machine-code thing, something I hadn’t been doing for the last 15 years, and I learned so much about Wasm and Polkadot internals in this way that I feel it’s already paid off.

10 Likes

I might have said this at some point, but for the record I don’t think this is a sensible idea.

1 Like

Indeed, neither do I, that was just one of the possible use cases we discussed in Rome, not all of them were sensible :grin: But I thought it’s worth mentioning, it can lead someone to a better idea.

1 Like

Nice! 10x faster compilation speed might become quite an essential feature for on-demand/instantaneous!

1 Like

This is completely reasonable and fine to do!

The point is that we need a working logical/emulated stack implementation. The problem with the current stack implementation is that it depends on the native target. This means depending on the machine/hardware, the stack layout will look differently. This is problematic and doesn’t make it deterministic. I mean we can probably use some very small stack size, for which we are sure that in the end we will never overrun this at any machine/hardware on the native side.

Not sure I follow, but I think we’ll need to discuss that in detail anyway. I’ll start preparing the documentation and architecture description soon; the code is nearly undocumented right now.

In the current implementation, I have a mixed value/call stack. Its depth is deterministic with respect to source Wasm code. You can establish a limit, and it will either overflow or not at the same execution point on every host.

When it comes to other architectures, like aarch64, due to the ABI difference, upholding that determinism is not an obvious but totally solvable task. I already have some thoughts on achieving that, and I modeled it “on paper” but not in real code yet. The point is when you’re handling stack alignment and stack parameter passing, you have to obey not only the concrete ABI rules but an intersection of all the supported ABIs. In that case, the deterministic overflow still holds. I’m not sure this explanation is clear; I think I need to create a separate write-up on that. It’s hard to explain without some drawings and pseudocode.

Besides, it’s totally possible to have separate value and call stack. That would give us a little more control over the execution, and we could handle value stack and call stack overflows separately. I haven’t implemented it because it comes with a performance hit (like all the good things in life), and I’m not sure we even need so much leverage here, but I can implement it as a feature, for example.

Interesting! I have a few questions/comments.

  1. So do you propose us getting rid of wasmtime completely for PVFs and always use a custom VM?
  2. Is your prototype sandboxed in any way? Do you have bounds checking?
  3. Are we fine with the performance hit that this will bring? How much of a performance hit would be acceptable? You could definitely get it go run faster, but getting it to run as fast as wasmtime most likely won’t be possible (if you don’t want to significantly compromise on the compilation speed and complexity fronts; you are limited by the inherent nature of WASM being a stack machine); the limit’s probably wasmer singlepass, and that alone will be tricky to achieve. (e.g. wasmer singlepass’s codegen backend alone is significantly larger than the whole of my PolkaVM)
  4. If you’re going to rewrite this in a production-ready fashion it probably doesn’t make sense to start completely from scratch. You could reuse a lot of what I’ve done for PolkaVM - e.g. my assembler crate is very nice and is essentially production ready right now, and the sandboxing is mostly too.
  5. You shouldn’t underestimate the effort required to write a new production ready VM. (: I wrote my proof-of-concept of PolkaVM which could run real-world programs in two days, but getting it into a production-ready state is taking significantly longer.
  6. One idea I was thinking about (but haven’t experimented at all with) was maybe try to “recompile” WASM programs into PolkaVM programs and run them under PolkaVM, as a way of being able to transparently migrate everything we currently have on chain. I don’t know how feasible that would be, but if we want to have a custom WASM VM anyway then maybe such an experiment could be bumped up in priority? We could then join forces and use PolkaVM as an execution engine and platform abstraction layer. (Assuming such a thing would be practical, which I’m not sure about at this point.)
  7. Any stack depth limit we have should be completely independent of the underlying CPU architecture. So I think the way to handle it is to 1) have a logical stack depth limit that we know can use at most X bytes, and 2) always have the native stack size be at least X bytes big so that we know we will never hit it.
  8. Couldn’t we limit the logical stack depth of the PVFs by just instrumenting them, and run them on wasmtime as always? We could inject the relevant stack metering code into the WASM blob, and get the code to deterministically trap if it’s exceeded. Have we tried this already?
1 Like

I also tried to grab your PoC code and run my “standard” benchmark with it to evaluate performance compared to everything else (as I’m quite curious how fast a naive WASM VM would be), but alas it looks like you’re still missing some opcodes. (:

thread 'main' panicked at 'not yet implemented: opcode F64Load { memarg: MemArg { align: 3, max_align: 3, offset: 96, memory: 0 } }', pvf-executor/src/raw.rs:643:36

Ideally, yes, but some mixed usage is also considered.

I’m yet to add red zones for linear memory and stack. Otherwise, we’re considering sandboxing as an external task wrt executor (see the corresponding project board). For sure, some sandboxing may be integrated into this implementation as well.

I remember Jeff saying that, ideally, we’d like to have a 0.5-sec backing timeout, and right now, we live with 2 sec. So, the performance fits well into the current one but is right on the border of the ideal one. Further performance improvement would fit it into the ideal timeout, although I agree that’s not an easy task. I played around with Wasmer single-pass, it looks good, but it moves the other direction, that is, they’re solving an issue completely different from ours, with a focus on performance much more than on determinism.

Thanks, I’ll have a look! I thought about using iced-x86 for code generation but ended up just emitting raw bytes (and I’m yet to convince myself that there’s something wrong with that approach; I love its simplicity). But some parts of code, sophisticated enough, like function call sequence, would definitely benefit from using an assembler layer.

Oh yes, I understand I’m not releasing it in a month. I though about step-by-step integration, first gated by an experimental feature, like we did with rocksdb/paritydb.

That’s interesting and should be evaluated. One concern is that the compiler performing conversion between stack-machine and register-machine architectures is the exact point where the determinism is lost, that’s why I decided to implement it as an almost pure stack machine in the first place. But I’m not saying it’s impossible, it requires deeper research.

What you’re describing is the exact approach we’re using right now! Our stack limiter is used for PVFs. But it only measures the value stack (which, in the native code execution stage, is a pure abstraction in the case of Wasmtime), and the machine stack is limited by the execution worker. Because of the non-deterministic nature of Wasmtime, the call stack limitation cannot be deterministic (it will overflow in different execution points on different versions of Wasmtime and different platforms).

What I tried to implement is to get rid of the necessity to have a logical limit at all. In my case, the native stack is a mixed value/call stack, and the value stack is not abstracted, those values really reside on the machine stack. Thus, its overflow is always deterministic. The only limitation imposed is the platform bitness. It only works on 64-bit platforms, but within that bitness, it should be fully deterministic.

Yes, it doesn’t support all the opcodes yet. It has some historical background :slight_smile: At the beginning of the development, I (as well as other teammates) was sure that using floating point is forbidden in parachain runtimes, as it is in relay chain runtimes, so the idea was to only implement a subset of Wasm that is needed for parachains. But when I started experimenting with real parachain runtimes, I found out that’s not the case, and parachains are still using a small subset of FP operations brought in by their library dependencies (you may probably remember my forum post on that). After that finding, I had to implement FP operations but only did that for that small subset required by parachain runtimes. I will support all of them, of course, I just don’t need them right now for experiments and testing. (NB: if/else construction is not supported yet either. It’s never generated by LLVM compilers, they always do conditional branching instead, so I postponed it as well).

If the plan is to completely replace wasmtime for PVFs then I think it’d make sense for you to e.g. just mostly copy-paste what I did for PolkaVM, because overall that’ll reduce the complexity of the whole system.

I think it just fundamentally makes sense to tightly integrate the sandboxing with the VM itself.

Yeah, indeed. That’s the tricky part with WASM - it’s hard to get determinism, good performance and fast compilation all at the same time. This is precisely one of the reasons why I think migrating to RISC-V for the new CorePlay stuff makes sense, because we can have it all at the same time. (Among other things.)

Well… considering how much of a clusterfuck x86/AMD64’s encoding is I would strongly encourage you to not just emit raw bytes. (: That makes it a lot harder to contribute if you’re not familiar with the encoding, and it’s really easy to make a mistake which might result in nasty security vulnerabilities. You don’t want to waste your time constantly auditing whether your instructions are encoded correctly!

I considered it myself but I didn’t use the iced-x86 crate in PolkaVM for a few reasons:

  1. It’s an extremely heavy dependency. Because it supports every instruction it absolutely tanks your compile times and opening some of the docs.rs pages for it are borderline impossible and hang your web browser.
  2. It has a ton of features I don’t and won’t need, and you won’t need either.
  3. It has lackluster support for labels and RIP-relative code.
  4. It’s AMD64-only. I also want to support Aarch64 in the future, so I want an assembler that will support both.

So what I did is that I made my own light DSL, and I just use iced-x86 to test that my DSL is correct. And if I need to emit a new instruction it’s very easy to add it in.

You could still make it a stack machine when targeting PolkaVM! (:

One possibility here would be that you wouldn’t emit raw AMD64 machine code, but you’d emit PolkaVM machine code (which is almost the same as normal RISC-V machine code semantically), and PolkaVM would do the rest of the work translating it further, taking care of the low level architecture-specific details, the sandboxing, etc.

If we’re going to use PolkaVM for CorePlay and make it an integral part of Polkadot’s tech stack I think it could make sense to “retarget” what we currently have on top of it, if that’s possible. That would significantly reduce the amount of work you’d have to do and would help with long-term maintainability.

Either way, I think we should run such an experiment before committing to one way or another.

Indeed, that is true! Currently we work around this problem by just making sure that effectively there’s always enough native stack available so that it will always hit some other limit of WASM/wasmtime and never be able to hit it, right?

We don’t care and will never support 32-bit platforms, so this is not a problem.

AFAIK yes, but it’s emitted by wasm-opt, so it’s necessary if you want to run those binaries.

(And if you want to compare performance you do also want to check if running wasm-opt on the blobs makes a difference, and if so then how much exactly.)

Ping me when you do! Or I could provide you with my test .wasm blob so that you could implement whatever extra instructions it needs (it most likely won’t be much, as you have most of the stuff implemented already).

Yeah that sounds very reasonable and tempting. The problem is I need a really precise control over the stack layout to preserve determinism. Not sure if it’s possible (but not saying it’s impossible) with PolkaVM. I’ll describe the problem in a separate writeup so we can figure that out together.

In principle, yes, that’s right. In reality, the limit is still not perfect. A malicious actor can still create code that would overflow the machine stack without going over the logical bound. That’s one more problem with optimizing compilers doing register allocation.

Yes, it’s possible. Under PolkaVM you have access to a bog standard stack just like on bare metal, but it’s fully encapsulated inside of guest memory and fully deterministic, and stack overflow is always safe and is just a normal trap.

@bkchr @koute I’ve created a small write-up on how I approach the deterministic stack problem in this implementation, please have a look: Achieving cross-platform stack determinism for PVF execution - HackMD

Anyone considered placing register allocator hints into the compiler wasm? You pre-run the register allocation of wasmtime at compile time to add the hints, and then a modified wasmtime would fail if the hints were not present at final artifact build time?

I guess this maybe more complex than some of the other efforts…

The problem is that the register allocation is the exact point where the determinism is lost. While you have all the values on the stack, it’s deterministic. While you have all the values in registers, it’s deterministic. But when you’re out of registers and start producing spillslots on the stack and tossing values in and out, that’s where the determinism dies. Values do not have their determined homes anymore, and those “homes” will differ between executors, their versions, and architectures. That’s why I’ve opted out of doing register allocation at all in the first place.

Still, the approach isn’t totally impossible. If the implementation could guarantee deterministic register allocation, it could result in a significant performance benefit. But that requires much more effort than just a naive stack machine, so I see that as a possible pathway to improve performance in the future. Currently, considering I’m working on that alone, if I dive into that right away, I’ll drown.

I’d meant assuming x86 but I guess M1 differs, so yeah… Alright thanks!