Announcing PolkaVM - a new RISC-V based VM for smart contracts (and possibly more!)

We can use exactly the same trick that Sergei came up with and do asynchronous gas metering, and the implementation should be fairly trivial. (Mostly because I knew I'd be doing it right from the start, so the VM is designed to make it easy.)

So essentially there are two approaches to gas metering:

  • The traditional way: at each basic block consume gas and synchronously check if we’re out of gas; if we are, then trap. This will deterministically always trap at the same place.
  • The asynchronous way that Sergei came up with: at each basic block consume gas, but do not check it right away. Instead, check it 1) periodically from a separate auxiliary thread, and 2) once the guest program finishes. This removes an extra branch at each basic block, so it's a lot faster. The major downside is that where/when exactly it will trap is not deterministic. What is deterministic, however, is whether the whole execution finishes successfully or fails due to running out of gas, so if our executions are atomic anyway (that is, partial executions don't change the state) then this approach can be used.
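For illustration, the two strategies could be sketched roughly like this (the names here are hypothetical, not PolkaVM's actual API):

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// Synchronous metering: charge and check at every basic block.
/// Deterministically traps at exactly the same block every run,
/// but costs an extra compare-and-branch per block.
fn charge_sync(gas_left: &mut i64, cost: i64) -> bool {
    *gas_left -= cost;
    *gas_left >= 0
}

/// Asynchronous metering: just decrement, no branch. A watchdog
/// thread (and a final check when the guest returns) notices
/// exhaustion later, so the exact trap point is not deterministic.
fn charge_async(gas_left: &AtomicI64, cost: i64) {
    gas_left.fetch_sub(cost, Ordering::Relaxed);
}

/// The check the auxiliary thread runs periodically,
/// and which always runs once at the end.
fn out_of_gas(gas_left: &AtomicI64) -> bool {
    gas_left.load(Ordering::Relaxed) < 0
}

fn main() {
    let mut gas = 10;
    assert!(charge_sync(&mut gas, 6));
    assert!(!charge_sync(&mut gas, 6)); // traps, always at this exact block

    let gas = AtomicI64::new(10);
    charge_async(&gas, 12); // no branch here; guest keeps running for a bit
    assert!(out_of_gas(&gas)); // but overall success/failure is deterministic
}
```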

Recently I’ve been talking with Gav about some of his ideas, and unfortunately what he wants to do is fundamentally incompatible with asynchronous gas metering due to the exact interruption not being deterministic, so for that we might be forced to use the slower synchronous gas metering.

However, I have also been thinking about whether we couldn’t come up with a third way to do it, and I have a crazy idea I want to try out.

Here are the properties we’d like to have in an ideal gas metering scheme:

  • It’s synchronous and always deterministically traps immediately when we run out of gas.
  • It’s cheap, ideally a single machine instruction.
  • It’s branchless, and the CPU will automatically notify us when we run out of gas.
  • It doesn’t waste general purpose registers.
  • It doesn’t unnecessarily pollute the CPU’s branch predictor, the TLB, the cache, etc.
  • It doesn’t compete for CPU resources (execution ports, etc.) with the guest program.

So here’s the crazy idea: maybe we could do gas metering in floating point?

Unlike the CPU's integer unit, the FPU can be configured to automatically trap on underflow/overflow/etc. On modern superscalar CPUs the FPU runs concurrently with the integer units, and we don't support floating point in our VM anyway, so this part of the CPU is essentially completely unused and wasted. So… why don't we just use it for gas metering? If it works it could be the holy grail - fast, deterministic and synchronous.
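To make the idea concrete, here's a minimal sketch of how gas could live in a float, with everything except the arithmetic being an assumption (the `FpuGas` type is purely illustrative):

```rust
/// Gas kept in a floating point register. In the real trick the FPU would be
/// configured to deliver a trap (e.g. SIGFPE via an unmasked FP exception,
/// which glibc exposes through `feenableexcept`) the moment the counter
/// misbehaves, making the hot path a single branchless subtraction. This
/// sketch only models the arithmetic; the trap setup is the open question.
struct FpuGas(f64);

impl FpuGas {
    fn new(limit: u32) -> Self {
        // An f64 represents integers up to 2^53 exactly, so counting down
        // in a double is just as deterministic as in an integer register.
        FpuGas(limit as f64)
    }

    /// One subtraction per basic block; no compare, no branch.
    fn charge(&mut self, cost: u32) {
        self.0 -= cost as f64;
    }

    /// With a trapping FPU this explicit check would disappear:
    /// the CPU itself would interrupt us.
    fn exhausted(&self) -> bool {
        self.0 < 0.0
    }
}

fn main() {
    let mut gas = FpuGas::new(100);
    gas.charge(60);
    assert!(!gas.exhausted());
    gas.charge(60);
    assert!(gas.exhausted());
}
```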

Now, don’t get too excited. I don’t know yet if this is possible. I haven’t seen anyone do this before. (Although I’ve seen a research paper where they did successfully use the FPU to speed up bounds checking of memory accesses.) It’s just a random idea I had that I will have to test out. I’m just putting it out here since you’ve asked. :stuck_out_tongue:

7 Likes

Afaik parachains should not care that asynchronous gas metering lacks a deterministic trap point. We do have time-overruns billing, but we already envision that permitting non-deterministic traps.

In principle, CPU consumption could vary oddly depending upon which txs get included, like block (A,B,C) consuming less CPU than block (A,C). I'd expect parachain developers to work around these cases; randomized algorithms like LP solvers often have witnesses which make everything much faster anyway. It need not cause problems, I think.

I suppose a smart contract might care whether you'll execute the remaining txs in the block, but why bother? If one tx runs out of gas, then we could consider the whole block invalid, no?

Now, the FPU trap sounds like a cool trick anyway.

You are right that we don’t need this for parachains. However, we will need this for Coreplay. Aka future use cases. We want to interrupt the tasks, snapshot the registers + memory and then we want to continue from this snapshot the next time. This requires that the collator and the validators validating the task stop at exactly the same instruction.

1 Like

Could the block not simply declare when it ends? You’re trying to squeeze in as much as possible I guess, but…

If the FPU trick “mostly” works, then we know even better when the block ends, which improves block utilization. This remains true even if the FPU trick does not work perfectly.

If otoh the block declares its endpoint, then we could use our other determinism tricks plus the FPU trick, even if the FPU trick does not work perfectly. This could become important, like if the FPU trick behaved slightly different on different architectures.

The good thing about RISC-V is its smaller and simpler instruction set compared with WASM. So when trust in the sanity of a new implementation is crucial, this could make a big difference.

How should it declare when it ends? For declaring when it ends we need deterministic metering :upside_down_face:

Then it wouldn’t be deterministic and we have failed :eyes:

As long as you control the execution environment (we do) and don't use instructions which have implementation-defined precision (we won't), IEEE floating point is perfectly deterministic and portable. Of course, this might not actually be practical to implement on all architectures, but we don't need to support all architectures with pedal-to-the-metal performance. As long as the semantics of how it's supposed to work are clearly defined we could always use a fallback approach on architectures where exploiting the FPU to do this is not possible.
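A tiny illustration of that determinism: the basic IEEE 754 operations (+, -, *, /, sqrt) are correctly rounded, so any conforming implementation produces bit-identical results. The constants below are facts of the standard, not platform accidents.

```rust
/// The famous "0.1 + 0.2" sum; bit-identical on every conforming target.
fn sum_bits() -> u64 {
    (0.1_f64 + 0.2_f64).to_bits()
}

/// Integers up to 2^53 are represented exactly in an f64, so a counter
/// held in a double behaves exactly like an integer counter.
fn countdown_exact() -> bool {
    let mut x = 9007199254740992.0_f64; // 2^53
    x -= 1.0;
    x == 9007199254740991.0
}

fn main() {
    assert_eq!(sum_bits(), 0x3FD3_3333_3333_3334); // i.e. 0.30000000000000004
    assert!(countdown_exact());
}
```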

And remember that AMD64 covers 99.95% of production nodes today (according to the telemetry), so while we shouldn’t design ourselves into a corner where we can only reasonably support AMD64, performance-wise this is all we care about right now.

Alright this sounds promising.

Internally, code would be responsible for its own termination, and handle its state commitments correctly, a la parachains, but then our time-overruns design enforces termination within the desired time period. It's worse than almost-free deterministic metering, of course.

You want coreplay/corejam to expose partially long running blocks to consumers? Aka not just NPoS.

I've only glanced briefly at the coreplay/corejam stuff, assuming it's only abstraction layers for internal consumption. If exposed externally then it seemingly encourages non-scalable design patterns, like dumping large amounts of data onto the relay chain. Ideally, we ship XCMP before exposing this, maybe even more general parachain-parachain state channels, because it'd suck if large outputs became a popular design pattern. This is really another conversation.

1 Like

Is an RTOS strictly necessary for handling the tasks of the “blob”? Wouldn't it be possible to detach the state machine from OS dependencies?

Sorry, I don’t understand the question. Can you rephrase and elaborate in more detail?

The state of the program execution inside of the VM is completely OS and hardware independent, and doesn’t need a real-time OS in any way. OS facilities are essentially only needed for sandboxing and for making the generated native machine code executable.

1 Like

I'm not a dev, but I've just come across the RISC Zero project, which is basically a zero-knowledge verifiable general computing platform based on zk-STARKs and the RISC-V microarchitecture. So maybe we could take their work into consideration for the design of PolkaVM. Here's the link to the GitHub repository: risc0/risc0.

1 Like

I haven't posted in a while, but there have been a ton of new changes to PolkaVM, so I thought I'd update you guys on what's been going on! Here are some of the highlights.

It can run DOOM!

Yes, we now have a DOOM port which runs fully within the VM at full speed!

(This is an actual screenshot I took of DOOM running under PolkaVM right now.)

You can play it yourself here. It even runs with full audio support, and you can finish the whole game like this!

As a bonus, I also hooked it up to our CI for testing, so now we're probably the only VM in existence which runs DOOM on its CI. Gotta get rid of all of those demons, er, bugs!

A generic sandbox

Our Linux sandbox is blazingly fast and very secure. Unfortunately it's not portable, and not everyone uses Linux. So now we also have a generic wasmtime-like sandbox which should be portable across all operating systems. It is, of course, not as secure, but it should run at almost full speed!

This is currently only supported on macOS.

A benchmarking framework

I’ve added a benchmarking framework where we can trivially add new benchmarks and new virtual machines to cross-compare the performance.

We currently have two benchmarks: Pinky, which is my own NES emulator, and a prime sieve. I will be adding more (suggestions welcome!).

We have support for benchmarking the following VMs:
- PolkaVM (obviously)
- Wasmtime
- Wasmer
- Wasmi
- Wazero
- Wasm3
- CKB VM (This is another blockchain RISC-V VM!)
- PVF Executor
- Native (runs the benchmark as native code)

Major performance improvements

I’ve also done a ton of performance improvements. I’ll let the benchmark results speak for themselves here (sorted from the fastest to the slowest; this is the Pinky benchmark that you can find in the repository, but the other benchmark gives similar results):

  • Native: 3.774ms
  • PolkaVM: 6.033ms
  • Wasmtime: 6.054ms
  • Wasmer (singlepass): 14.307ms
  • Wazero: 21.621ms
  • Wasm3: 109ms
  • Wasmi (register): 114ms
  • CKB VM: 140ms
  • Wasmi (stack): 212ms

Yes, we are as fast as wasmtime now, while still being orders of magnitude simpler, more sandboxed, faster to compile, and with guaranteed linear-time compilation. And there are still plenty of avenues left to optimize, so I believe I might be able to improve upon this even further.

Gas metering

Gas metering was implemented and is working! Here are some performance numbers:

  • PolkaVM (no gas): 6033us
  • PolkaVM (async gas): 6350us
  • PolkaVM (sync gas): 7276us
  • Wasmtime (no gas): 6054us
  • Wasmtime (sync gas): 8624us
  • Wasmtime (epoch interruption): 7633us

The asynchronous gas metering is almost free and only slows down this benchmark by ~5%, while synchronous gas metering slows it down by ~20%.

For comparison, Wasmtime's synchronous gas metering slows the benchmark down by ~42% (so twice as much as our implementation!), and even its epoch-interruption-based metering (which is closer to our asynchronous gas metering in how it works, although not exactly) slows the program down by ~26%, more than even our fully synchronous metering!
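As a sanity check, the quoted percentages fall straight out of the raw microsecond timings listed above (a throwaway snippet, not part of the benchmark suite):

```rust
/// Slowdown of a metered run relative to its unmetered baseline, in percent.
fn overhead_pct(metered: f64, baseline: f64) -> f64 {
    (metered / baseline - 1.0) * 100.0
}

fn main() {
    // Timings from the benchmark list above.
    println!("PolkaVM async:  {:.1}%", overhead_pct(6350.0, 6033.0)); // ~5%
    println!("PolkaVM sync:   {:.1}%", overhead_pct(7276.0, 6033.0)); // ~21%
    println!("Wasmtime sync:  {:.1}%", overhead_pct(8624.0, 6054.0)); // ~42%
    println!("Wasmtime epoch: {:.1}%", overhead_pct(7633.0, 6054.0)); // ~26%
}
```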

You might ask here - what’s the difference between synchronous and asynchronous gas metering? Well, there’s essentially only a single difference: when exactly the gas is checked. Let me explain.

Asynchronous gas metering only periodically and asynchronously checks whether we still have gas remaining while the program is running, so the program can run slightly longer than it otherwise would, and the exact point where it is interrupted is not deterministic; whether the computation as a whole finishes under a given gas limit, however, is still strictly enforced and deterministic. Synchronous gas metering, on the other hand, always checks the remaining gas immediately and interrupts the execution right away.

So if what you’re doing is transactional and doesn’t have any effect unless it finishes whole anyway - that’s great, now you can use the faster asynchronous metering! And in fact most use cases (like smart contracts!) don’t actually need full synchronous gas metering and can get away with the cheaper asynchronous one.

Thanks for coming to my TED talk

And that’s all of the exciting highlights!

So what happens now? Well, considering we’re now as fast as wasmtime (of course remember that this is not going to be true in every case, depending on the exact benchmark and the hardware, at least not yet) we’re now considering moving everything to PolkaVM, including the runtimes and the PVFs!

Although this is of course not going to happen tomorrow nor next month, and it still requires a ton of work, we now do have a lot of evidence that this can be done with little to no compromises. But more on that in a future Tales from the PolkaVM post! Stay tuned, same time, same channel!

20 Likes

Those are incredible results. Like a once-in-a-generation breakthrough. I would love to see the benchmarking framework extended with compilation time (I assume we eliminate startup overhead with caching).

For contracts, the incredible runtime performance gains are most useful if we don't pay for them with more compilation time than a rewriting interpreter (like wasmi). I know PolkaVM is not yet optimized for it, but I saw some numbers showing we are already much faster in compilation than wasmer. I think this would be the next step to truly prove that PolkaVM can satisfy all our use cases.

Afaik there is no reason we can't be as fast in compilation as a Wasm rewriting interpreter (correct me if I am wrong). After all, we have to do less work. It is just a mapping. Wasm interpreters have to deal with blocks and other annoyances. So I am very optimistic.

1 Like

Could we add some optimized cryptographic code to the benchmarks?

A variable base multi-scalar multiplication from curve25519-dalek maybe? I guess cargo bench for curve25519-dalek could just be run in the VMs?

An arkworks curve sounds relevant of course, but dalek is single threaded on all platforms, so it’s easier to get an honest benchmark, and tells basically the same story

Impressive results, @koute !

In all honesty, some months ago I did not expect that PolkaVM could increase its performance as significantly as it just did. The concept behind the JIT, with its two-phase compilation process for off-chain purposes and to mitigate problems with the pure RISC-V binary format, is solid and fits the intended use case for CorePlay, and also for smart contracts once startup times are optimized. I agree with @Alex that there should be no inherent problem with fixing compile times, and hopefully also with respect to startup times (spinning up cached processes).

On my wishlist for the benchmarks:

  • The Wizard Wasm research engine, which outshines its competitors with respect to startup times since it is a non-rewriting interpreter written partially in assembly to be as fast as possible. It is unrealistic that PolkaVM or any other rewriting system (such as wasmi) will compete with its startup time, but I think it is interesting to have it in the set of benchmarks to showcase the technical limits of Wasm itself.
  • As @Alex said startup (compilation + spinning up) benchmarks are especially needed for smart contract usage.
  • Realistic smart contract usage benchmarks. From what I know @Alex is already working on this.

Despite the enormous PolkaVM wins, for me personally it is a bit sad that wasmi is going to be discarded soon, given that the new register-machine engine got so close to earning the title of the fastest spec-compliant Wasm interpreter out there. (Which currently is Wasm3.) However, I support unifying efforts on a single VM, which should be PolkaVM in Parity's case.

2 Likes

Could you explain a bit? Would love to hear your thoughts on this :open_mouth:

Sure. But FYI, for now it's probably going to be slower, as we only support 32-bit arithmetic right now.

Alex already knows this but to everyone else reading this thread, yes, this is planned and will be added shortly. (:

Well, it depends on how exactly you define soon, because, point one, we’re not switching overnight, and point two, unless/until we have a migration plan figured out we still need to support what’s out there right now in a backwards compatible way.

So maybe less of “discard” and more of a “we want to finally finish it”. (:


Anyway, if anyone wants to add a new VM to the benchmarks or a new benchmark I invite you to make a PR as I’ve made it pretty easy. (:

How to add a new VM

First, here is the directory with the VM backends. So just add a new module here for your new backend (which boils down to just implementing a simple trait), and then as a second step you add it to this macro invocation and to this function and you're done; the new VM is hooked into the benchmarks.

How to add a new benchmark

As far as adding new benchmarks go, it’s also very easy. Go here and make a new bench-yourbenchmark directory (probably by just copy-pasting an existing one). The way benchmarks are defined is standardized, and is really simple; here’s a minimal benchmark:

#![no_std]
#![no_main]

include!("../../bench-common.rs");

// This is how much memory the benchmark wants to be able to allocate.
// Adjust this according to what your benchmark needs.
global_allocator!(256 * 1024);

#[polkavm_derive::polkavm_export]
#[no_mangle]
pub extern "C" fn initialize() {
    // ...put initialization here, if necessary...
}

#[polkavm_derive::polkavm_export]
#[no_mangle]
pub extern "C" fn run() {
    // ...run the benchmark...
}

Then just add it to the workspace in Cargo.toml and to the build script in build-examples.sh, and you’re done. (If built the benchmark will be automatically detected by the benchmarking tool, so nothing else needs to be done.)

I am saying this because people have been doctoring around with Wasm singlepass compilers ever since Wasm existed. A whole generation of smart contract blockchains was betting on a performance uplift coming from Wasm over EVM. It never happened. Now it seems that it can finally be delivered.

4 Likes

Wow, those are awesome results! I've been following the whole thread and didn't expect it could get this good :saluting_face:

I guess this doesn't include performance in interpreted mode? Now that we are talking even about runtimes, I was wondering how the VM would behave compiled to WASM, running in a browser light node?

1 Like

It will work. However, our interpreter is not currently optimized, so it'd probably be slow-ish. There are ways of speeding it up though, if necessary: either by making the interpreter faster (and we do have an expert - Robin - in making fast interpreters), or we could just add a simple recompiler which would emit either WASM or asm.js-like JavaScript to accelerate the execution.