Announcing PolkaVM - a new RISC-V based VM for smart contracts (and possibly more!)

(This is a continuation of my previous forum topic; I’m posting this as a new thread to start with a clean slate, and for extra visibility.)

After many weeks of work I have the pleasure to announce the release of PolkaVM 0.1, a tech preview of our new VM for smart contracts!

Disclaimer: this is not production-ready and still very heavily a work-in-progress! It’s not meant for any end use just yet, but you’re welcome to help contribute to the code or the design, or to simply follow our progress.

So without further ado, here’s the repository if you want to take a look: GitHub - koute/polkavm: A fast and secure RISC-V based virtual machine

(But please consider reading the rest of this post before diving into the code!)

Introduction

So why are we doing this, and what do we hope to gain? I have covered this in the first post of my previous topic, so I won’t repeat myself here. Please read it if you’re new! (Just the very first post.)

So what’s new?

(There are a lot of technical details here, so feel free to skip the sections you’re not interested in. I’m writing this both as a showcase, and as a laundry list of otherwise undocumented technical details which some people might want to know about.)

Better performance

Here are the performance numbers from my initial experiment: (lower frametime/higher FPS is better!)

  • wasmi: 108ms/frame (~9.2 FPS)
  • wasmer singlepass: 10.8ms/frame (~92 FPS)
  • wasmer cranelift: 4.8ms/frame (~208 FPS)
  • wasmtime: 5.3ms/frame (~188 FPS)
  • solana_rbpf (interpreted): 6930ms/frame (~0.14 FPS)
  • solana_rbpf (JIT): ~625ms/frame (~1.6 FPS)
  • My previous experimental RISC-V VM: ~25ms/frame (~40 FPS)

And here is the performance of PolkaVM 0.1 as of today, on the very same test program:

  • PolkaVM 0.1: 10.7ms/frame (~93 FPS)

Compared to my previous experimental VM the performance increased from ~25ms/frame to ~10.7ms/frame, which makes it as fast as the current gold standard of WASM singlepass VMs (Wasmer)!

Suffice it to say, this was a complete (but very pleasant) surprise to me, because I haven’t even started to deliberately optimize the performance yet. I’m not entirely sure what exactly made it faster, but the machine code emitted by the VM is now of higher overall quality, so that probably contributed to the speedup.

We’re still nowhere near where wasmtime is, but we’ve closed the gap a little bit. I will, of course, be making more performance improvements as time goes on, and I already have several ideas that I want to try to make it even faster.

Security and sandboxing

Security is probably the most important aspect of the VM to get right. The VM must be absolutely watertight, which is why security was interwoven into its design right from the start.

First, for a little bit of perspective, let me roughly describe how wasmtime does sandboxing. Essentially, what wasmtime does to make sure the code running in the VM can’t access the host’s memory and steal your private keys is simple: it just checks that all of the memory accesses made by the guest program are valid.

It does this in a clever way (mostly to make it fast): it preallocates a bunch of empty virtual address space (that is, it reserves the space, but doesn’t actually allocate any memory there) and lets the guest program access more address space than it should have access to. wasmtime can get away with this because it has a guarantee that if the program does put its hands into a cookie jar it shouldn’t touch, the CPU will automatically catch that access, stop the guest program, and let wasmtime know.
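
To illustrate that trick (this is just a minimal sketch using the libc crate, not wasmtime’s actual implementation), reserving address space without committing memory looks roughly like this on Linux:

fn main() {
    // Reserve a large span of virtual address space with no access rights.
    // The OS hands out addresses but commits no physical memory; any access
    // outside the later-committed region faults, and the host catches the
    // fault instead of doing an explicit bounds check on every access.
    const GUEST_SPAN: usize = 8 * 1024 * 1024 * 1024; // e.g. 4GB + guard space

    let base = unsafe {
        libc::mmap(
            std::ptr::null_mut(),
            GUEST_SPAN,
            libc::PROT_NONE, // no read/write/execute: touching this traps
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS | libc::MAP_NORESERVE,
            -1,
            0,
        )
    };
    assert_ne!(base, libc::MAP_FAILED);

    // Later, commit only the pages the guest is actually allowed to use:
    let ret = unsafe {
        libc::mprotect(base, 64 * 1024, libc::PROT_READ | libc::PROT_WRITE)
    };
    assert_eq!(ret, 0);
}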

In general this works very well. wasmtime is a secure piece of software. But mistakes can happen, and if they do then there’s no backup, there’s no second line of defense. It’s game over.

This is why PolkaVM does sandboxing quite a bit differently:

  • Every guest program always runs in a separate process, and doesn’t have access to the host program at all. It can only communicate with it through a small chunk of shared memory. (With wasmtime both the host and the guest run in the same process.)
  • Every guest program has the whole lower 4GB of address space fully available to it, which simplifies how we can prevent it from accessing memory that it shouldn’t have access to (essentially we can always just use a 32-bit register to access the memory instead of fiddling with offsets and/or bounds checks, and we’re automatically guaranteed that it’s safe; see the sketch after this list). The guest uses the lower 4GB of the address space, and the VM uses the upper portion. We can do this because the VM spawns a specially crafted ELF file as the zygote of the guest process, where we have full control over what goes where in memory. This also has the benefit of not requiring us to reserve a bunch of useless address space (which, while plentiful, does consume some resources and can run out; it’s rare, but I’ve seen it happen on certain platforms, in Substrate!).
  • Every guest program is automatically namespaced and runs in a new container. This is exactly the same tech that projects like Docker use. So even if an attacker got full control of the guest process, they won’t be able to access the filesystem, nor the network, nor any other processes.
  • Every guest program is sandboxed with seccomp, giving it access to only a handful of syscalls. Last time I counted, Linux on AMD64 exposes around ~360 syscalls (that’s a lot!) while the VM only needs (as of today) exactly 9 of them (~2.5% of the total!). This is a huge reduction in attack surface, especially considering that new or weird syscalls do often have security vulnerabilities (e.g. Google disabled the io_uring syscall across their entire fleet because it had more security holes than Swiss cheese).
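
To make the 4GB point concrete, here’s a tiny sketch (an illustration of the idea, not PolkaVM’s actual code) of why a zero-extended 32-bit guest pointer never needs an explicit bounds check when the guest owns the entire lower 4GB:

fn main() {
    // A guest "pointer" is a u32. Zero-extending it (which AMD64 does for
    // free whenever a 32-bit register is written) can never produce an
    // address at or above 4GB, so if the guest owns the whole lower 4GB,
    // staying in bounds is guaranteed by the type itself.
    let guest_ptr: u32 = u32::MAX; // the largest possible guest address
    let host_addr = guest_ptr as u64; // zero-extension
    assert!(host_addr < 1u64 << 32); // always true; no runtime check needed
}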

The major point here is that we don’t just have a single line of defense; we have multiple. If one of these security measures fails there’s another one to take its place, making any potential exploits significantly more difficult. (I’m not going to say “impossible” because this hasn’t been audited yet.)

And all of these security measures are handled automatically by the VM; the host program doesn’t have to do anything special to activate them. (Contrast this with our recent heroic efforts to add extra sandboxing to wasmtime for PVFs. With PolkaVM we get this entirely for free, since it was designed right from the start to be sandboxed like this.)

(Almost) zero dependencies

In my personal projects I’ve always liked to reduce the number of unnecessary dependencies and, in general, be mindful of how much I’m pulling in. This is good for compile times, and also good for security (supply chain attacks!). I applied the same philosophy to PolkaVM.

Let me demonstrate this with a hacky script that I recently wrote, which calculates the total number of crates and lines of code for a particular project (you can find it here). It’s not entirely accurate (this is not a scientific comparison!), but it should be good enough.

Here’s how many dependencies/lines of code wasmtime has in total: (the main wasmtime crate with default feature flags, not including dev dependencies)

Total crates: 134
Total lines of code: 1169225

Yes, you’ve read that right; including wasmtime in your project pulls in over 1 million lines of code by default. Now, let’s see PolkaVM for comparison:

Total crates: 6
  Local: 5
  External: 1

Total lines of code: 15222
  Local: 12300
  External: 2922

Only ~15k total lines of code. (The only external dependency is the log crate.) Quite a difference compared to 1+ million, wouldn’t you say?

Now, I don’t mean to dogpile on wasmtime here. It’s an absolutely wonderful, high quality piece of software, and this comparison is a little unfair (e.g. a significant chunk of the code that wasmtime pulls in is autogenerated bindings). But the general point still stands.

WASM-like import-export model

Here’s an example guest PolkaVM program:

#[polkavm_derive::polkavm_import]
extern "C" {
    fn get_third_number() -> u32;
}

#[polkavm_derive::polkavm_export]
#[no_mangle]
pub extern "C" fn add_numbers(a: u32, b: u32) -> u32 {
    a + b + unsafe { get_third_number() }
}

The way you access host functions here is similar to what you’d do for WASM - you just write an extern block with the prototypes. In this case, though, you also have to annotate it with our procedural macro so that our linker can find these imports.

The exports (that is, functions which you can call from outside of the VM) are similar too, and you also need to decorate them with a macro.

On the host side the API is inspired by wasmtime, but it is slightly different. Here’s what the host side for this particular example program looks like:

use polkavm::{Config, Engine, Linker, Module, ProgramBlob, Val};

fn main() {
    env_logger::init();

    let raw_blob = include_bytes!("guest.polkavm");
    let blob = ProgramBlob::parse(&raw_blob[..]).unwrap();

    let config = Config::from_env().unwrap();
    let engine = Engine::new(&config).unwrap();
    let module = Module::from_blob(&engine, &blob).unwrap();
    let mut linker = Linker::new(&engine);

    // Define a host function.
    linker.func_wrap("get_third_number", || -> u32 { 100 }).unwrap();

    // Link the host functions with the module.
    let instance_pre = linker.instantiate_pre(&module).unwrap();

    // Instantiate the module.
    let instance = instance_pre.instantiate().unwrap();

    // Grab the function and call it.
    println!("Calling into the guest program (through typed function):");
    let fn_typed = instance.get_typed_func::<(u32, u32), u32>("add_numbers").unwrap();
    let result = fn_typed.call(&mut (), (1, 10)).unwrap();
    println!("  1 + 10 + 100 = {}", result);

    println!("Calling into the guest program (through untyped function):");
    let fn_untyped = instance.get_func("add_numbers").unwrap();
    let result = fn_untyped.call(&mut (), &[Val::I32(1), Val::I32(10)]).unwrap();
    println!("  1 + 10 + 100 = {}", result.unwrap());
}

The compilation pipeline

The general architecture of how programs are compiled for the VM has also been de facto finalized. It’s essentially composed of two steps:

  1. A guest program has to be compiled with a standard compiler (e.g. rustc) and linked into a standard ELF file.
  2. Then that ELF file is relinked by an auxiliary middle-end linker supplied by us (cargo install polkatool, then call polkatool link). This runs entirely offline, and will further optimize the program, strip it down, repack it, and generate a .polkavm file blob which can be put on-chain.

The two main benefits of this approach are that we can use an off-the-shelf compiler to write our programs while still being able to explicitly control and optimize what will ultimately be uploaded on-chain, and that we significantly reduce complexity. (ELF is great, but it’s very flexible, complex, and at a smart-contract scale also somewhat bloated.)

The program blob file format

The file format in which we store program blobs is custom, and was very heavily inspired by what WASM has. It is explicitly designed so that it is simple, easy to parse, fast to process (zero-copy parsing!) and extensible in a forward and backwards-compatible fashion.

It is essentially composed of a header, plus a list of sections, each with an ID and a length. The sections are guaranteed to always be in a certain order (like in WASM), and the sections can be easily skipped over thanks to their lengths. Sections which don’t affect operational semantics of the VM (e.g. debug info) are also clearly marked, and can be added/ignored/stripped without breaking compatibility.
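
As a rough illustration of how cheaply such a format can be parsed, here’s a hedged sketch (the real section IDs and length encoding live in the repository; the 4-byte little-endian length below is an assumption made for brevity):

struct Section<'a> {
    id: u8,
    body: &'a [u8], // zero-copy: borrows straight from the blob
}

fn parse_sections(mut blob: &[u8]) -> Result<Vec<Section<'_>>, &'static str> {
    let mut sections = Vec::new();
    while !blob.is_empty() {
        let (&id, rest) = blob.split_first().ok_or("truncated id")?;
        if rest.len() < 4 {
            return Err("truncated length");
        }
        let len = u32::from_le_bytes(rest[..4].try_into().unwrap()) as usize;
        let rest = &rest[4..];
        if rest.len() < len {
            return Err("truncated body");
        }
        sections.push(Section { id, body: &rest[..len] });
        // The explicit length is what lets us skip over unknown or
        // non-semantic sections (e.g. debug info) without parsing them.
        blob = &rest[len..];
    }
    Ok(sections)
}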

The instruction encoding

The way we encode instructions is also custom. It is still very much RISC-V, but modified to better fit a virtual machine. The immediates are not limited to 12/20 bits anymore and instructions support full 32-bit immediates (because the encoding is now variable length), which means that some RISC-V instructions which previously had to be essentially split into two (e.g. loads/stores/jumps to addresses which don’t fit in a single instruction) are now encoded using a single instruction. It’s not only simpler, but also results in more efficient machine code.
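
For illustration, here’s what a LEB128-style varint encoder looks like (just a sketch of the general technique; PolkaVM’s exact on-disk encoding may differ in its details). Small immediates take a single byte, while a full 32-bit immediate still fits in at most 5 bytes:

fn encode_varint(mut value: u32, out: &mut Vec<u8>) {
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte); // high bit clear: this is the last byte
            break;
        }
        out.push(byte | 0x80); // high bit set: more bytes follow
    }
}

fn main() {
    let mut buf = Vec::new();
    encode_varint(12, &mut buf); // a small immediate: 1 byte
    encode_varint(0xdead_beef, &mut buf); // a full 32-bit immediate: 5 bytes
    assert_eq!(buf.len(), 1 + 5);
}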

The instruction set itself is also more stripped down and simpler. A few instructions were removed (there’s no more AUIPC, there’s only a single jump instruction) and the ecall instruction was replaced with an ecalli (note the extra “i” at the end) instruction. (This basically means that the hostcall numbers are encoded as an immediate, instead of fetched dynamically. This makes it easy to statically analyze/validate programs regarding their usage of hostcalls, and also avoids wasting a register at runtime when actually making a hostcall, which is important since we don’t have many registers to play with.)

The machine architecture

Unlike normal RISC-V the VM is a Harvard architecture machine (like WASM!) which means that the data and the code live in separate address spaces, or more specifically, the code cannot be read by the guest program at all. This is good for security, and also good for performance. (Because we don’t have to allocate the memory for the code and make it available to the program.)

Debug info support

I wasn’t originally planning on adding this right now, but I ended up doing so anyway as I needed the VM to be more debuggable to be able to fix some of the last remaining bugs. As of today the VM’s linker processes and outputs debug info for guest programs, including function names, source code locations, and inlined functions. However this isn’t yet used by anything except the VM’s execution tracing support.

Tracing support

The VM currently has crude support for tracing guest program execution, where it’ll print out the currently executed instruction, as well as the values of the registers, accessed memory, and the original source code location + line of where the instruction comes from. It can be enabled by setting the POLKAVM_TRACE_EXECUTION environment variable to 1 and enabling trace logs.

Guest program memory map

The guest allocates memory in pages. A single page is always 16KB. This is smaller than WASM’s 64KB pages, which makes tiny programs more efficient, yet larger than the conventional 4K pages (which normal AMD64 CPUs use), so that we can easily support Apple M1 CPUs in the future (which use 16K pages natively).

A guest program’s memory map is split into four parts, in order:

  • R/O data (always starts at 0x10000)
  • R/W data (always starts after R/O data, although I will be adding a gap here to make the interpreter easier to implement)
  • BSS (basically uninitialized memory of the program, always starts after R/W data; in the future this will be growable)
  • Stack (always starts at the top of the address space at 0xffffc000)
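
In code, the parts of the map spelled out above look roughly like this (a sketch; everything beyond the stated constants is an assumption):

const PAGE_SIZE: u32 = 16 * 1024; // 16KB pages
const RO_DATA_BASE: u32 = 0x10000;
const STACK_TOP: u32 = 0xffffc000; // exactly one page below 4GB

/// Rounds a region size up to a whole number of pages.
fn round_up_to_page(size: u32) -> u32 {
    (size + PAGE_SIZE - 1) & !(PAGE_SIZE - 1)
}

fn main() {
    // e.g. 100 bytes of R/O data still occupy one full 16KB page,
    // so the R/W data would start at 0x10000 + 0x4000 = 0x14000.
    assert_eq!(round_up_to_page(100), PAGE_SIZE);
    assert_eq!(RO_DATA_BASE + round_up_to_page(100), 0x14000);
}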

An interpreter

The VM itself has two backends: a native AMD64 backend which generates machine code, and an interpreter. If the host system doesn’t support the native compiled backend then the interpreter will automatically be used.

Caveats

It’s not all sunshine and roses. Currently the VM still has some limitations.

  • It works, but there are almost no tests, and there might be bugs lurking. I will be adding a lot more tests in the future.
  • It works on my machine, but it might not work on yours. In particular, it might break on older versions of Linux where some of the APIs it uses for sandboxing might not be supported. I will potentially add fallback codepaths to the VM in the future to handle this.
  • On systems other than Linux running on an Intel or AMD CPU the VM will run in interpreted mode. This will be slow. I have plans to at least support aarch64 (arm64) in the medium-term, and possibly macOS and Windows in the long-term. (The problem with other OSes is that, unlike Linux, they either lack the features or their kernel interfaces are not stable, so replicating our sandboxing setup on them is a challenge, short of e.g. running a full hardware-level virtual machine and just running Linux inside it anyway.)
  • It will only run guest programs compiled with an RV32E toolchain. This is not yet supported by Rust or LLVM (although there is a work-in-progress patch to LLVM to add it). In the short term I have pushed a bunch of scripts to GitHub to make it possible to automatically build such a toolchain without having to patch things yourself or figure out Rust’s build system, and I will be providing prebuilt binaries in the future. In the long term, once the LLVM patch is merged, we can make sure that the target is supported by Rust out of the box. (I suppose we could also try to fund some of the RISC-V people to get the patch pushed through faster.)
  • It currently requires that the guest programs are compiled with full debug info. This is an artificial limitation and will be removed in the future.
  • It currently requires that the guest programs are compiled with a very specific linker script. In the future I will reduce the scope of what this linker script affects, and maybe even remove it completely.
  • It currently requires that the guest programs are compiled with specific linker flags. In the future once the LLVM patch is merged I’d like to maybe try to add a PolkaVM-specific target to Rust that’d have those applied by default. (Although this isn’t really too big of a deal.)
  • The VM only supports passing at most 6 arguments (or half that if they’re all 64-bit) through its FFI boundary, because that’s how many argument registers we have. If necessary this limitation could be removed by supporting passing of extra arguments on the stack.
  • Spawning of new module instances is slow, because right now I’m always spawning a new process when a module is instantiated. This will be very easy to fix by caching the VM workers; nevertheless I wanted to make note of it here so that no one experimenting with the VM is surprised.
  • The instruction encoding is not yet fully optimized. In particular, immediates (which are serialized as varints) are always serialized as unsigned numbers, which makes e.g. small negative numbers always consume 5 bytes of space. I’ll be changing those to use a zigzag encoding to rectify this in the future. (See the sketch after this list.)
  • The VM crate contains the zygote binary, which is a prebuilt Linux binary that’s included in the crate as-is and executed at runtime. The binary needs nightly Rust to build (although this could be worked around with some extra pain) and ideally should always be manually reinspected when updated to make sure the compiler generated it the way we expect. However, considering the recent serde fiasco I know this is something that will be frowned upon, which is why I will be making this binary fully reproducible and verified on the CI.
  • The behavior of the compiled backend and the interpreter are not yet exactly the same.
  • The division remainder instructions are not yet implemented in the compiled backend and will trap instead. (These are very rarely used though.)
  • There’s no way to customize the stack size yet without hex editing the program blob.
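
Since the zigzag encoding came up above, here’s what it does (the standard zigzag transform, as used by e.g. protobuf; PolkaVM’s final encoding may differ): it interleaves negative and positive values so that small magnitudes of either sign become small unsigned numbers, which varints then encode compactly.

fn zigzag_encode(value: i32) -> u32 {
    ((value << 1) ^ (value >> 31)) as u32
}

fn zigzag_decode(encoded: u32) -> i32 {
    ((encoded >> 1) as i32) ^ -((encoded & 1) as i32)
}

fn main() {
    // -1 encodes to 1 (a single varint byte) instead of 0xffff_ffff (five bytes).
    assert_eq!(zigzag_encode(-1), 1);
    assert_eq!(zigzag_encode(1), 2);
    assert_eq!(zigzag_decode(zigzag_encode(-1234)), -1234);
}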

Future work

There’s still plenty of work to be done! You can see a full list of what I’m planning to work on if you go to the issues section in my repository (there are over 40 of those right now!). I’ve also added a “PolkaVM 1.0” milestone there where you can see what I think should be done before we release 1.0. (Although please note that this list is not necessarily final nor exhaustive, and is subject to change. Feel free to suggest changes!)

In general, the remaining work can be categorized as follows:

  • Implementing missing features. A big one here is gas metering, which is still missing and is essential for smart contracts.
  • Performance improvements. I’d like to make the VM at least as fast as wasmtime.
  • Ensuring correctness. Every permutation of every instruction, every boundary condition, every abnormal runtime behavior (e.g. out-of-bounds access, division by zero, etc.) must be tested.
  • Stabilization and standardization. Finalize the file formats, finalize the guest-observable behavior, write a formal(ish) spec.
  • Improve the dev experience. Make it easy for an average dev to get a toolchain which can target PolkaVM. Support debugging. Support time travel debugging! (You know how in a normal debugger you can only go forward? In ours you’ll be also able to go backwards! Back to the future!)
  • Integration into Substrate. Eventually we intend to use this in production for smart contracts. But I also want to experimentally add it as an alternative executor for full fat runtimes and PVFs. This doesn’t necessarily mean that we’ll switch those to PolkaVM too (that’s still a very open question, and right now I’m not suggesting that we do this), but it will be very interesting to do as an experiment and to see what happens. (Especially for performance, where that’ll give us an apples-to-apples comparison with wasmtime in the context of a real blockchain.)

Contributing

Would you like to help out with the implementation or the design work, or just chat about the VM in general? Contact me either through email (jan@parity.io) or through Element (@jan:parity.io). Let’s talk!

In general if you’d like to work on something please do make sure to ask me first! For example, not every issue in the repository is appropriate for people of all skill levels (even if it might appear simple), or some things I’d like to get done in a very particular way and I wouldn’t want anyone to waste their time on a change that ultimately won’t be accepted.

Appendix: Bonus section, or why does it smell like gas here?

I will be implementing gas metering soon, so here’s a fun little quiz for you.

Consider and compare the following three snippets of Rust code:

Snippet number 1:

let mut index = 0;
for _ in 0..64 * 1024 {
    xs[index] += 1;
    index += 1;
}

Snippet number 2:

let mut index = 0;
for _ in 0..64 * 1024 {
    xs[index] += 1;
    index += 2048;
}

Snippet number 3:

let mut index = 0;
for _ in 0..64 * 1024 {
    xs[index] += 1;
    index += 2053;
}

Assume the code will never panic when accessing the array. All of these snippets are doing exactly the same work, run exactly the same amount of times, and produce exactly the same assembly (except for the index increment). So here’s a million dollar question for you: which snippet will run the fastest? Which will run the slowest? And by how much? Or maybe they’ll run the same speed? Can you tell?

Well, here are the results; click to reveal the spoiler (the unit’s in milliseconds because I ran those multiple times):

  • Snippet 1: 6ms
  • Snippet 2: 114ms
  • Snippet 3: 10ms

So, did you expect this? The second snippet runs 19 times slower than the first one! Even though they’re essentially doing exactly the same amount of work, and executing the same code! What gives? And why am I talking about this here?

Well, you see, here’s the thing with gas metering. We want gas metering to be deterministic, but also to reasonably model the amount of time a given piece of code will actually take. And things like this make that, well, hard. In this particular example, snippet 2 strides through the array in steps of 2048 elements - a large power of two - so successive memory accesses end up competing for the same slot in the CPU cache, evicting each other; snippet 3’s stride of 2053 is not a power of two, so its accesses spread across the cache.

But the exact mechanism is not that important; there are plenty of other microarchitectural properties of a modern CPU which can significantly affect performance (this is just one example!). Rather, my point here is: what do we do here? Do we ignore this? Do we try to model this? Can we even model this? Maybe not necessarily exactly, but a little better than just benchmarking the instructions in isolation? Do we pessimize the gas calculations to assume the worst case? The average case? The best case? Do we just give up and cry?

Plenty of questions; very few answers. For now I’ll leave it up to you, dear reader, to answer them yourself. (:

38 Likes

I’m a bit confused by this point, because I thought that the RISC-V interpreter/JIT/whatever would run inside of the Wasm runtime. Has the objective changed so that it would completely replace Wasm for the runtime in the long term? If yes, we’re probably talking about a time frame of no less than a decade.

But it depends on Linux-only capabilities, right? Or does it work on Windows for example?

2 Likes

I’d say we shouldn’t because we cannot. That microarchitectural behavior may differ between different generations of CPUs, not even mentioning the Intel-AMD difference. Also, there’s a dichotomy between the amount of work the CPU does and the amount of time the CPU spends doing that work. The latter is even more volatile because of CPU frequencies, memory frequencies, etc.

I believe the best we can do is to abstract the CPU altogether and measure what we can measure easily and deterministically: the amount of work the VM does. Every VM instruction gets its weight (or gas price, whatever you call it), and that’s it, you just subtract those weights from the gas counter. There were a couple of excellent write-ups by @pepyakin on how that process can be heavily optimized, but I cannot find them right now, hope he’ll show up himself to provide links.
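
A minimal sketch of that approach (with a hypothetical instruction set and made-up weights, purely to show the shape of it):

enum Inst {
    Add,
    Mul,
    Load,
}

fn weight(inst: &Inst) -> u64 {
    match inst {
        Inst::Add => 1,
        Inst::Mul => 3,
        Inst::Load => 5,
    }
}

fn run(program: &[Inst], mut gas: u64) -> Result<u64, &'static str> {
    for inst in program {
        // Charge a fixed, deterministic weight per instruction and trap
        // when the counter is exhausted; no microarchitecture involved.
        gas = gas.checked_sub(weight(inst)).ok_or("out of gas")?;
        // ... execute the instruction here ...
    }
    Ok(gas) // gas left over
}

fn main() {
    let program = [Inst::Add, Inst::Mul, Inst::Load];
    assert_eq!(run(&program, 10), Ok(1));
    assert_eq!(run(&program, 5), Err("out of gas"));
}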

That approach may look more on the “give up and cry” side, but think about it more broadly: all CPUs are different, and you either get precise measurements for one exact CPU while all others deviate dramatically from that reference, or you get precise measurements for every CPU in existence, which doesn’t make sense because it renders the whole thing non-deterministic anyway.

2 Likes

The sandboxing implementation as currently implemented is Linux-only.

The zero dependencies claim was more of a “we don’t depend on much external Rust code”, and not a “this is completely OS independent”. We still depend on the OS, mostly for sandboxing.

As far as Windows support goes, there are a few options:

  1. Run an interpreter. This doesn’t recompile the program into native code and is very simple, so it doesn’t need any extra sandboxing. This already works and is supported today.
  2. Run the guest program in a wasmtime-like sandbox. This will also work on Windows, but it’s nowhere near as secure. This is currently not implemented, but I might add it in the future.
  3. Run the guest program under a system-level virtual machine (e.g. Hyper-V on Windows). This would be as secure as what we have under Linux, but will require highly OS-specific code. (And I’m not a Windows expert so I don’t know off the top of my head how feasible that is. I’d like to experiment with it in the future though.)

By definition a recompiler which emits native code cannot fully run inside of a WASM runtime.

The VM could run inside of a WASM runtime as an interpreter, but we want to run it outside (because we want to maximize the performance) and we’ll expose this through appropriate host functions that the runtime will be able to call to instantiate the VM.

If you’re wondering whether you’d be able to use this inside smoldot - yes. You’d just include it as a dependency, and it’d work out-of-box. If the target for which the VM is compiled is not explicitly supported by the recompiler it’ll just automatically fall back to the interpreter. (It will be slower of course, but I assume light clients are not going to run heavy computations through it, right?)

For the runtimes it’s not completely out of the question, but at this point in time - no, there are no plans. Currently we’re doing this strictly for smart contracts, however the VM itself is pretty much a general purpose VM - it’s designed so that it should be able to check all of the boxes which smart contracts need (e.g. it’s already as fast as the fastest singlepass WASM VM), but it’s not constrained by those needs and could, in theory, be used for full fat runtimes.

So what’s planned is that we will experiment with running normal runtimes on it, and depending on the results we can have a “should we also switch our runtimes to it?” conversation, but at this point it’s not clear whether it’d be worth it. It’s pointless to switch just for the fun of it, and there must be clear benefits of doing so.

Indeed! For a single model of a CPU maybe it could be possible (but extremely hard), but for multiple? It’d be just impossible, since every model will be slightly different.

That said, there are certain microarchitectural properties that essentially all modern high performance CPUs share, and will have for the foreseeable future. The example I cited in my post, cache associativity, is something that, I think, will vary between CPUs (e.g. due to different cache sizes, different N-way associativities, etc.), so it wasn’t the best example of what we could try to model.

However, for example, every modern high performance CPU is superscalar, and this is something that, I think, might be relatively easy to model. Of course the objective here is not to “model exactly what the CPU does”, but to “model it a little more accurately”. So the simplest idea we could try here: instead of looking at single instructions, look at pairs of instructions, see if they’re independent, and if so assume the CPU can run them in parallel (essentially every modern CPU will do this, with a few exceptions). We’d still get 100% deterministic results, but they would match reality a little better. (See the sketch below.)

(Of course, how much this would help is still an open question and needs to be benchmarked.)
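
Here’s a toy sketch of that pairing idea (with hypothetical three-operand instructions; whether and how PolkaVM will actually model this is, as said, an open question):

#[derive(Clone, Copy)]
struct Inst {
    dst: u8,
    src1: u8,
    src2: u8,
}

fn independent(a: Inst, b: Inst) -> bool {
    // b must not read a's result, and they must not write the same register.
    b.src1 != a.dst && b.src2 != a.dst && b.dst != a.dst
}

fn estimate_cycles(code: &[Inst]) -> u64 {
    let mut cycles = 0;
    let mut i = 0;
    while i < code.len() {
        if i + 1 < code.len() && independent(code[i], code[i + 1]) {
            i += 2; // assume the CPU dual-issues the independent pair
        } else {
            i += 1;
        }
        cycles += 1; // one modelled cycle per issued group
    }
    cycles
}

fn main() {
    // r2 = r0 + r1 and r5 = r3 + r4 are independent: one modelled cycle.
    let a = Inst { dst: 2, src1: 0, src2: 1 };
    let b = Inst { dst: 5, src1: 3, src2: 4 };
    // r6 = r5 + r2 depends on both results: a cycle of its own.
    let c = Inst { dst: 6, src1: 5, src2: 2 };
    assert_eq!(estimate_cycles(&[a, b, c]), 2);
}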

3 Likes

Hi!
Damian from Aleph Zero here.

I wanted to thank you for the amazing job you are doing with this, both on the software development side, and also on the educational side (explaining this to all us noobs :slight_smile: ). I wish I could contribute to the project – I will definitely go over the issues to see if there is anything I could tackle.

There is one question I have regarding the hypothetical replacement of wasmi by PolkaVM as the pallet contracts VM. Currently on Aleph Zero (and we expect this to continue), the vast majority of blockspace is used by contract calls. What we eventually want, obviously, is for the whole blockspace to be used, i.e., for all blocks to be filled at least, say, 50%. Then the following becomes a serious consideration:

  1. The transaction weights for contract execution will be set using benchmarks on standardized hardware running Linux, i.e. with PolkaVM running natively.
  2. The non-Linux nodes will probably be forced to run a PolkaVM interpreter (or maybe use some other tricks), so they will be significantly slower (10x?).
  3. Consequently, the non-Linux nodes will really struggle to keep up with the chain, which will basically make it infeasible to run non-Linux nodes.

From what I know, there is currently reasonable support for running Substrate nodes on both macOS and Windows. The question is whether this can be preserved, especially for macOS, as there are so many developers using it (maybe not necessarily for running validators, but for local development chains).

Correct.

In general I do plan to have the native backend work on Windows and macOS eventually, but since our telemetry shows that (at this very moment) 99.75% of the nodes on Polkadot run on Linux this has very low priority.

If someone in the ecosystem really needs this then that could change the priorities though.

At the very least it’s going to be 100% possible to support native execution on Windows and macOS with a wasmtime-like sandbox, which means that it’s going to be just as fast as on Linux, just less secure.

For local development I think the interpreter should work just fine? Do you see a situation where it would be a problem?

1 Like

I have never heard of ISAs being compared via this “FPS” benchmark. Maybe I have missed something - could you give a few comments on what exactly it is testing?

This is explained in the original post of the previous thread, so please take a look there if you’d like more detail. But in a nutshell: this is how fast the benchmark that I used runs, and in this particular case the benchmark is my cycle-accurate NES emulator running a real game, which is why the numbers are given in FPS.

Anyhow, that was just a rough non-scientific back-of-the-envelope benchmark to confirm whether we can get into the same ballpark. (Different types of programs will have different performance envelopes, but they’re not going to be that different for what we want to run.) I will be adding more comprehensive and diverse benchmarks in the future.

1 Like

Cool! That’s a very nifty piece of code you’ve written. Would love to see a smart-contract-based benchmark soon as well! : )

This is awesome. I ended up building a disassembler GUI for .polkavm files.
https://polka.run/disassembler

Yeah, Alex is currently working on porting the contracts pallet and Ink!, so we might have some smart contract tests/benches in the near future. (:

Nice!

Please just keep in mind that 1) the file format is definitely going to change (so it will need updating in the future), and 2) the disassembly is also going to be extended. (Eventually I want to make it possible to reassemble the blobs from the disassembly, and also to make it possible to write new blobs from scratch in assembly, mostly for testing purposes. Which means that a lot more info will have to be included, not only disassembled instructions.)

1 Like

@koute really good job, and such nice timing, as it was only recently that a massive zkVM project announced achieving an important milestone in implementing verifiable VMs based on converting RISC-V instructions into efficient ZK circuits. Relying only on WASM, Polkadot could not easily benefit from such technologies. Do you have that project on your radar?

1 Like

That is a very interesting project indeed, but it’s somewhat orthogonal to this, and that approach is a lot more experimental. Basically, for this VM I’m pretty sure we can get it to work how we want it to work, and there are very few truly open questions as to its feasibility**, while risc0 has a lot more of those. I wouldn’t be surprised if a “traditional” VM like this would be better for certain types of computations, and a ZK-style VM would be better for different types of computations. Maybe eventually we’ll end up using both? I don’t know.

Anyway, you might want to take a look at this thread where Sergei’s currently experimenting with risc0: Trustless wasm compilation with SNARKS?

** – essentially the only major open questions are, I think, 1) how much faster can we actually make it than it already is, and 2) how much more accurate can we make the gas metering. The rest is mostly just getting the engineering right.

There is also ZK-WASM | Delphinus Lab, so it’s not like staying with WASM would keep Polkadot from benefiting from the exploration of ZKPs for trustless compute.

I have a question due to the very promising performance offered here.

If one didn’t care for wasm’s ability to run in a browser, nor its overall toolchain/ecosystem, could RISC-V offer a more efficient bytecode for the runtime itself? I’ve heard WASM offers better security properties, but right now I’m solely curious if RISC-V interpretation/JIT may offer some performance edge over WASM. I don’t believe it matters in practice, because if you ignore security, you may as well AoT compile the WASM/RISC-V to native, but I am curious nonetheless.

Some of this may be due not to anything about the ISAs, but solely to the design decisions made by PolkaVM (such as page size).

Depends on what exactly you mean by “performance”. Execution performance? Compilation performance? End-to-end time it takes to load a program into the VM? The average case? The asymptotic case?

But in general, maybe. I don’t know yet. For execution performance at this point we’re not as fast as wasmtime is, but for other metrics (compilation performance, etc.) we should be faster.

The thing with RISC-V is that the impedance mismatch between it and the native machine code is significantly lower than for WASM (a lot of it is because RISC-V is, unlike WASM, a proper register machine), which is why you can get such high performance without adding a Cranelift-like heavyweight recompiler nor a ton of complexity to the VM itself.

In theory if we reduce this impedance mismatch even further (e.g. by preoptimizing the code offline in our linker) we could match and/or exceed even wasmtime’s performance, but whether that’s practical is still up in the air. As-is the performance is definitely good enough for smart contracts already, which is why we’re moving forward with productizing it.

I’d argue this is not true. (: The security is more of a property of the VM rather than the underlying bytecode. (Although some bytecodes are easier to secure than others due to how complex they are; e.g. RISC-V is easy, full AMD64 is hard.)

Well, the performance is mostly a consequence of the bytecode, plus the compilation architecture we’ve picked.

The page size almost certainly won’t affect the performance in a significant way, especially since this doesn’t actually change the hardware page size in any way, which is what can actually affect performance in a significant way in certain cases.

3 Likes

Thanks for the thorough reply :slight_smile:

Execution performance, interpreted/with JIT (so without LLVM/Cranelift recompilation).

The point about security properties was based on a comment I heard that wasm offers more potential for static analysis, though I can’t personally claim knowledge.

Thank you for clarifying re: performance so far.

We don’t actually care about wasm’s ability to run in a browser.
In theory, yes you could do that.
In practice, however, you need to copy a lot of data between the “host” VM (i.e. smoldot) and the runtime (for example, whenever the runtime wants to read from storage, you need to copy the storage value from the host VM to the runtime VM), plus pay for a ton of context switches, which makes it as slow as simply embedding wasmi within smoldot.

This could eventually be solved by the upcoming multi-memory proposal of the WebAssembly spec, but these proposals are moving at a snail’s pace, and it’s unlikely to be available for a few more years.

Very nice to see more RISC-V development happening!

Just finished watching the presentation for RISC-V at Sub0.

@Alex nice presentation!

One of the “todos” is to add Gas Metering to the VM.

I imagine there will be quite a large amount of overhead as a result of this… unless you do some fancy tricks like the ones I hear from @pepyakin are done with Wasm and fuel metering.

Can anyone briefly speak more on the approach(es) to adding gas metering here, and what kind of performance hit that might look like in the end? I saw the “bonus section” written by @koute, but I’m not fully satiated.

1 Like