(This is a continuation of my previous forum topic; I’m posting this as a new thread to start with a clean slate, and for extra visibility.)
After many weeks of work I have the pleasure to announce the release of PolkaVM 0.1, a tech preview of our new VM for smart contracts!
Disclaimer: this is not production-ready and is still very much a work in progress! It’s not meant for any end use just yet, but you’re welcome to help contribute to the code or the design, or to simply follow our progress.
So without further ado, here’s the repository if you want to take a look: GitHub - koute/polkavm: A fast and secure RISC-V based virtual machine
(But please consider reading the rest of this post before diving into the code!)
Introduction
So why are we doing this, and what do we hope to gain? I have covered this in the first post of my previous topic, so I won’t repeat myself here. Please read it if you’re new! (Just the very first post.)
So what’s new?
(There are a lot of technical details here, so feel free to skip the sections you’re not interested in. I’m writing this both as a showcase and as a laundry list of otherwise undocumented technical details which some people might want to know about.)
Better performance
Here are the performance numbers from my initial experiment: (lower frametime/higher FPS is better!)
- wasmi: 108ms/frame (~9.2 FPS)
- wasmer singlepass: 10.8ms/frame (~92 FPS)
- wasmer cranelift: 4.8ms/frame (~208 FPS)
- wasmtime: 5.3ms/frame (~188 FPS)
- solana_rbpf (interpreted): 6930ms/frame (~0.14 FPS)
- solana_rbpf (JIT): ~625ms/frame (~1.6 FPS)
- My previous experimental RISC-V VM: ~25ms/frame (~40 FPS)
And here is the performance of PolkaVM 0.1 as of today, on the very same test program:
- PolkaVM 0.1: 10.7ms/frame (~93 FPS)
Compared to my previous experimental VM the performance increased from ~25ms/frame to ~10.7ms/frame, which is as fast as the current gold standard of WASM singlepass VMs (Wasmer)!
Suffice it to say, this was a complete (but very pleasant) surprise to me, because I haven’t even started to deliberately optimize performance yet. I’m not entirely sure what exactly made it faster, but the machine code emitted by the VM is now of overall higher quality, so that probably contributed to the speedup.
We’re still nowhere near where `wasmtime` is, but we’ve closed the gap a little bit. I will of course be making more performance improvements as time goes on, and I already have several potential ideas that I want to try to make it even faster.
Security and sandboxing
Security is probably the most important aspect of the VM to get right. The VM must be absolutely watertight. This is why security was interwoven into its design right from the start.
First, for a little bit of perspective let me roughly describe how `wasmtime` does sandboxing. Essentially, what `wasmtime` does to make sure the code running in the VM can’t access the host memory and steal your private keys is simple: it just checks that all of the memory accesses made by the guest program are valid. It does this in a clever way (mostly to make it fast) by preallocating a bunch of empty virtual address space (that is, it reserves space for the memory allocation, but doesn’t actually allocate memory there) and lets the guest program access more than what it should have access to, because `wasmtime` has a guarantee that if the program does put its hands into a cookie jar it shouldn’t touch, then the CPU will automatically catch that access, stop the guest program, and let `wasmtime` know.
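For illustration, here’s roughly what that reserve-but-don’t-allocate trick looks like with raw mmap via the libc crate (a minimal sketch of the general technique, not wasmtime’s actual code):

```rust
// Reserve a large span of address space with no access rights at all. The
// reservation consumes no actual memory, and any guest access into it
// faults, which the host then catches via a signal handler (not shown).
fn reserve_address_space(size: usize) -> *mut u8 {
    let pointer = unsafe {
        libc::mmap(
            std::ptr::null_mut(),
            size,
            libc::PROT_NONE, // no read, write, nor execute: any access traps
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS | libc::MAP_NORESERVE,
            -1,
            0,
        )
    };
    assert_ne!(pointer, libc::MAP_FAILED, "failed to reserve address space");
    pointer.cast()
}
```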
In general this works very well. `wasmtime` is a secure piece of software. But mistakes can happen, and if they do then there’s no backup, there’s no second line of defense. It’s game over.
This is why PolkaVM does sandboxing quite a bit differently:
- Every guest program always runs in a separate process, and doesn’t have access to the host program at all. It can only communicate with it through a small chunk of shared memory. (With `wasmtime` both the host and the guest run in the same process.)
- Every guest program has the whole lower 4GB of address space fully available to it, which simplifies how we can prevent it from accessing memory that it shouldn’t have access to (essentially we can always just use a 32-bit register to access the memory instead of fiddling with offsets and/or bounds checks, and we’re automatically guaranteed that it’s safe). The guest uses the lower 4GB of the address space, and the VM uses the upper portion. We can do this because the VM spawns a specially crafted ELF file as the zygote of the guest process, where we have full control over what goes where in memory. This also has the benefit of not requiring us to reserve a bunch of useless address space (which, while plentiful, does consume some resources and we can run out of it; it’s rare, but I’ve seen it happen on certain platforms, in Substrate!).
- Every guest program is automatically namespaced and runs in a new container. This is exactly the same tech that projects like Docker use. So even if an attacker got full control of the guest process, they wouldn’t be able to access the filesystem, the network, or any other processes.
- Every guest program is sandboxed with seccomp, giving it access to only a handful of syscalls. Last time I counted, Linux on AMD64 exposes around ~360 syscalls (that’s a lot!) while the VM only needs (as of today) exactly 9 of them (~2.5% of the total!). This is a huge reduction in attack surface, especially considering that new or obscure syscalls often have security vulnerabilities (e.g. Google disabled the io_uring syscall across their entire fleet because it had more security holes than Swiss cheese). A minimal illustration of the seccomp mechanism follows right after this list.
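Here’s that illustration. It is not PolkaVM’s actual policy (PolkaVM installs a custom filter); the sketch just uses the kernel’s built-in “strict” mode, which only permits read, write, _exit and sigreturn, to show the general mechanism:

```rust
// Once this succeeds, any syscall outside of the whitelist kills the
// process with SIGKILL. (Uses the libc crate.)
fn enter_seccomp_strict_mode() -> std::io::Result<()> {
    let result = unsafe {
        libc::prctl(
            libc::PR_SET_SECCOMP,
            libc::SECCOMP_MODE_STRICT as libc::c_ulong,
            0,
            0,
            0,
        )
    };
    if result == 0 {
        Ok(())
    } else {
        Err(std::io::Error::last_os_error())
    }
}
```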
The major point here is that we don’t just have a single line of defense, we have multiple. If one of these security measures fails, there’s another one to take its place, making any potential exploits significantly more difficult. (I’m not going to say “impossible” because this hasn’t been audited yet.)
And all of these security measures are automatically handled by the VM; the host program doesn’t have to do anything special to activate them. (Contrast this with our recent heroic efforts to add extra sandboxing to `wasmtime` for PVFs. With PolkaVM we get this entirely for free, since it was designed right from the start to be sandboxed like this.)
(Almost) zero dependencies
In my personal projects I’ve always liked to reduce the number of unnecessary dependencies and in general be mindful of how much I’m pulling in. This is good for compile times, and also good for security (supply chain attacks!). I applied this philosophy to PolkaVM as well.
Let me demonstrate this with a hacky script that I recently wrote, which calculates the total number of crates and lines of code for a particular project (you can find it here). It’s not entirely accurate (this is not a scientific comparison!), but it should be good enough.
Here’s how many dependencies/lines of code `wasmtime` has in total (the main `wasmtime` crate with default feature flags, not including dev dependencies):
```
Total crates: 134
Total lines of code: 1169225
```
Yes, you’ve read that right; including `wasmtime` in your project pulls in over 1 million lines of code by default. Now, let’s see PolkaVM for comparison:
```
Total crates: 6
  Local: 5
  External: 1
Total lines of code: 15222
  Local: 12300
  External: 2922
```
Only ~15k total lines of code. (The only external dependency is the `log` crate.) Quite a difference compared to over 1 million, wouldn’t you say?
Now, I don’t mean to dogpile on `wasmtime` here. It’s an absolutely wonderful, high quality piece of software, and this comparison is a little unfair (e.g. a significant chunk of the code that `wasmtime` pulls in is autogenerated bindings). But the general point still stands.
WASM-like import-export model
Here’s an example guest PolkaVM program:
```rust
#[polkavm_derive::polkavm_import]
extern "C" {
    fn get_third_number() -> u32;
}

#[polkavm_derive::polkavm_export]
#[no_mangle]
pub extern "C" fn add_numbers(a: u32, b: u32) -> u32 {
    a + b + unsafe { get_third_number() }
}
```
The way you access host functions here is similar to what you’d do for WASM: you just write an `extern` block with the prototypes. In this case, though, you also have to annotate it with our procedural macro so that our linker can find these imports.
The exports (that is, functions which you can call from outside of the VM) are similar, and you also need to decorate them with a macro.
On the host side the API is inspired by `wasmtime`, but is slightly different. Here’s what the host side for this particular example program looks like:
```rust
use polkavm::{Config, Engine, Linker, Module, ProgramBlob, Val};

fn main() {
    env_logger::init();

    let raw_blob = include_bytes!("guest.polkavm");
    let blob = ProgramBlob::parse(&raw_blob[..]).unwrap();

    let config = Config::from_env().unwrap();
    let engine = Engine::new(&config).unwrap();
    let module = Module::from_blob(&engine, &blob).unwrap();
    let mut linker = Linker::new(&engine);

    // Define a host function.
    linker.func_wrap("get_third_number", || -> u32 { 100 }).unwrap();

    // Link the host functions with the module.
    let instance_pre = linker.instantiate_pre(&module).unwrap();

    // Instantiate the module.
    let instance = instance_pre.instantiate().unwrap();

    // Grab the function and call it.
    println!("Calling into the guest program (through typed function):");
    let fn_typed = instance.get_typed_func::<(u32, u32), u32>("add_numbers").unwrap();
    let result = fn_typed.call(&mut (), (1, 10)).unwrap();
    println!("  1 + 10 + 100 = {}", result);

    println!("Calling into the guest program (through untyped function):");
    let fn_untyped = instance.get_func("add_numbers").unwrap();
    let result = fn_untyped.call(&mut (), &[Val::I32(1), Val::I32(10)]).unwrap();
    println!("  1 + 10 + 100 = {}", result.unwrap());
}
```
The compilation pipeline
The general architecture of how programs are compiled for the VM has also been de facto finalized. It’s essentially composed of two steps:
- A guest program has to be compiled with a standard compiler (e.g. rustc) and linked into a standard ELF file.
- Then that ELF file is relinked by an auxiliary middle-end linker supplied by us (`cargo install polkatool`, then call `polkatool link`). This runs entirely offline, and will further optimize the program, strip it down, repack it, and generate a `.polkavm` file blob which can be put on-chain.
The two main benefits of this approach are that we can use an off-the-shelf compiler to write our programs while still being able to explicitly control and optimize what will ultimately be uploaded on-chain, and that it significantly reduces complexity. (ELF is great, but it’s very flexible, complex, and at a smart-contracts scale also somewhat bloated.)
The program blob file format
The file format in which we store program blobs is custom, and was very heavily inspired by what WASM has. It is explicitly designed to be simple, easy to parse, fast to process (zero-copy parsing!) and extensible in a forwards- and backwards-compatible fashion.
It is essentially composed of a header, plus a list of sections, each with an ID and a length. The sections are guaranteed to always be in a certain order (like in WASM), and the sections can be easily skipped over thanks to their lengths. Sections which don’t affect operational semantics of the VM (e.g. debug info) are also clearly marked, and can be added/ignored/stripped without breaking compatibility.
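To show why parsing such a format is cheap, here’s a hedged sketch of the section-skipping idea (the field names and sizes are illustrative, not the actual on-disk encoding; e.g. a single length byte keeps the sketch short):

```rust
/// A zero-copy view into one section of a program blob (illustrative).
struct Section<'a> {
    id: u8,
    body: &'a [u8],
}

/// Reads one section and advances the cursor past it, so unknown or
/// irrelevant sections (e.g. debug info) can simply be skipped over.
fn read_section<'a>(cursor: &mut &'a [u8]) -> Option<Section<'a>> {
    let (&id, rest) = cursor.split_first()?;
    let (&length, rest) = rest.split_first()?;
    if rest.len() < length as usize {
        return None; // truncated blob
    }
    let (body, rest) = rest.split_at(length as usize);
    *cursor = rest;
    Some(Section { id, body })
}
```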
The instruction encoding
The way we encode instructions is also custom. It is still very much RISC-V, but modified to better fit a virtual machine. The immediates are not limited to 12/20 bits anymore and instructions support full 32-bit immediates (because the encoding is now variable length), which means that some RISC-V instructions which previously had to be essentially split into two (e.g. loads/stores/jumps to addresses which don’t fit in a single instruction) are now encoded using a single instruction. It’s not only simpler, but also results in more efficient machine code.
The instruction set itself is also more stripped down and simpler. A few instructions were removed (there’s no more AUIPC, there’s only a single jump instruction) and the ecall instruction was replaced with an ecalli (note the extra “i” at the end) instruction. (This basically means that the hostcall numbers are encoded as an immediate, instead of fetched dynamically. This makes it easy to statically analyze/validate programs regarding their usage of hostcalls, and also avoids wasting a register at runtime when actually making a hostcall, which is important since we don’t have many registers to play with.)
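Here’s a tiny sketch of what that buys us: with the hostcall number baked into the instruction as an immediate, a single pass over the decoded program yields every hostcall it could ever make. (The Instruction type below is hypothetical, purely for illustration.)

```rust
use std::collections::BTreeSet;

// A hypothetical decoded-instruction type, just for this example.
enum Instruction {
    Ecalli { hostcall: u32 },
    Other,
}

/// Statically collects every hostcall the program can invoke; with a
/// dynamically fetched hostcall number (as with vanilla ecall) this
/// wouldn't be possible in general.
fn collect_hostcalls(code: &[Instruction]) -> BTreeSet<u32> {
    code.iter()
        .filter_map(|instruction| match instruction {
            Instruction::Ecalli { hostcall } => Some(*hostcall),
            Instruction::Other => None,
        })
        .collect()
}
```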
The machine architecture
Unlike normal RISC-V the VM is a Harvard architecture machine (like WASM!) which means that the data and the code live in separate address spaces, or more specifically, the code cannot be read by the guest program at all. This is good for security, and also good for performance. (Because we don’t have to allocate the memory for the code and make it available to the program.)
Debug info support
I wasn’t originally planning on adding this right now, but I ended up doing so anyway as I needed the VM to be more debuggable to be able to fix some of the last remaining bugs. As of today the VM’s linker processes and outputs debug info for guest programs, including function names, source code locations, and inlined functions. However this isn’t yet used by anything except the VM’s execution tracing support.
Tracing support
The VM currently has crude support for tracing guest program execution, where it’ll print out the currently executed instruction, as well as the values of the registers, the accessed memory, and the original source code location + line where the instruction comes from. It can be enabled by setting the `POLKAVM_TRACE_EXECUTION` environment variable to `1` and enabling trace logs.
Guest program memory map
The guest allocates memory in pages. A single page is always 16KB. This is smaller than WASM’s 64KB to make tiny programs more efficient, and larger than the conventional 4K (which normal AMD64 CPUs use) so that we can easily support M1 CPUs in the future (which natively use 16K pages).
Guest programs’ memory map is split into four parts, in order (sketched as constants right after this list):
- R/O data (always starts at 0x10000)
- R/W data (always starts after R/O data, although I will be adding a gap here to make the interpreter easier to implement)
- BSS (basically uninitialized memory of the program, always starts after R/W data; in the future this will be growable)
- Stack (always starts at the top of the address space at 0xffffc000)
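In code form, the layout above boils down to a handful of constants (the names are mine, but the numbers come straight from the list):

```rust
const PAGE_SIZE: u32 = 16 * 1024;     // every page is 16KB
const RO_DATA_ADDRESS: u32 = 0x10000; // R/O data always starts here
const STACK_TOP: u32 = 0xffffc000;    // the stack grows down from here
// R/W data follows the R/O data, and BSS follows the R/W data; their exact
// addresses depend on the sizes of the preceding segments.
```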
An interpreter
The VM itself has two backends: a native AMD64 backend which generates machine code, and an interpreter. If the host system doesn’t support the native compiled backend then the interpreter will be used automatically.
Caveats
It’s not all sunshine and roses. Currently the VM still has some limitations.
- It works, but there are almost no tests, and there might be bugs lurking. I will be adding a lot more tests in the future.
- It works on my machine, but it might not work on yours. In particular, it might break on older versions of Linux where some of the APIs it uses for sandboxing might not be supported. I will potentially add fallback codepaths to the VM in the future to handle this.
- On systems other than Linux running on an Intel or AMD CPU, the VM will run in interpreted mode. This will be slow. I have plans to at least support aarch64 (arm64) in the medium term, and possibly macOS and Windows in the long term. (The problem with other OSes is that, unlike Linux, they either lack the features or their kernel interfaces are not stable, so replicating our sandboxing setup on them is a challenge, short of running a full hardware-level virtual machine and just running Linux in it anyway.)
- It will only run guest programs compiled with an RV32E toolchain. This is not supported by Rust nor LLVM just yet (although there is a work-in-progress patch to LLVM to add it). In the short term I have pushed a bunch of scripts to GitHub to make it possible to automatically build such a toolchain without having to patch things yourself or figure out Rust’s build system, and I will be providing prebuilt binaries in the future. In the long term, once the LLVM patch is merged, we can make sure that the target is supported by Rust out of the box. (I suppose we could also try to fund some of the RISC-V people to get the patch pushed through faster.)
- It currently requires that the guest programs are compiled with full debug info. This is an artificial limitation and will be removed in the future.
- It currently requires that the guest programs are compiled with a very specific linker script. In the future I will reduce the scope of what this linker script affects, and maybe even remove it completely.
- It currently requires that the guest programs are compiled with specific linker flags. In the future once the LLVM patch is merged I’d like to maybe try to add a PolkaVM-specific target to Rust that’d have those applied by default. (Although this isn’t really too big of a deal.)
- The VM only supports passing at most 6 arguments through its FFI boundary (or half that if they’re all 64-bit), because that’s how many argument registers we have. If necessary, this limitation could be lifted by supporting passing extra arguments on the stack.
- Spawning of new module instances is slow, because right now I’m always spawning a new process when a module is instantiated. This will be very easy to fix by caching the VM workers; nevertheless I wanted to make note of it here so that no one experimenting with the VM is surprised.
- The instruction encoding is not yet fully optimized. In particular, immediates (which are serialized as varints) are always serialized as unsigned numbers, which makes e.g. small negative numbers always consume 5 bytes of space. I’ll be changing those to use a zigzag encoding to rectify this in the future (see the sketch right after this list).
- The VM crate contains the zygote binary, which is a prebuilt Linux binary that’s included in the crate as-is and executed at runtime. The binary needs nightly Rust to build (although this could be worked around with some extra pain) and ideally should always be manually re-inspected when updated, to make sure the compiler generated it the way we expect. However, considering the recent serde fiasco I know this is something that will be frowned upon, which is why I will be making this binary fully reproducible and verified on CI.
- The behavior of the compiled backend and the interpreter are not yet exactly the same.
- The division remainder instructions are not yet implemented in the compiled backend and will trap instead. (These are very rarely used though.)
- There’s no way to customize the stack size yet without hex editing the program blob.
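For the varint caveat above, here’s the standard zigzag trick (the same one protobuf uses), which maps small negative numbers to small unsigned values so that they stay compact when encoded as varints:

```rust
/// Maps 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ... so small magnitudes stay small.
fn zigzag_encode(value: i32) -> u32 {
    ((value << 1) ^ (value >> 31)) as u32
}

/// The inverse mapping.
fn zigzag_decode(value: u32) -> i32 {
    ((value >> 1) as i32) ^ -((value & 1) as i32)
}
```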
Future work
There’s still plenty of work to be done! You can see a full list of what I’m planning to work on if you go to the issues section in my repository (there are over 40 of those right now!). I’ve also added a “PolkaVM 1.0” milestone there where you can see what I think should be done before we release 1.0. (Although please note that this list is not necessarily final nor exhaustive, and is subject to change. Feel free to suggest changes!)
In general, the remaining work can be categorized as follows:
- Implementing missing features. A big one here is gas metering, which is still missing and is essential for smart contracts.
- Performance improvements. I’d like to make the VM at least as fast as `wasmtime`.
- Ensuring correctness. Every permutation of every instruction, every boundary condition, every abnormal runtime behavior (e.g. out of bounds access, division by zero, etc.) must be tested.
- Stabilization and standardization. Finalize the file formats, finalize the guest-observable behavior, write a formal(ish) spec.
- Improve the dev experience. Make it easy for an average dev to get a toolchain which can target PolkaVM. Support debugging. Support time travel debugging! (You know how in a normal debugger you can only go forward? In ours you’ll also be able to go backwards! Back to the future!)
- Integration into Substrate. Eventually we intend to use this in production for smart contracts. But I also want to experimentally add it as an alternative executor for full fat runtimes and PVFs. This doesn’t necessarily mean that we’ll switch those to PolkaVM too (that’s still a very open question, and right now I’m not suggesting that we do this), but it will be very interesting to do as an experiment and to see what happens. (Especially for performance, where it’ll give us an apples-to-apples comparison with `wasmtime` in the context of a real blockchain.)
Contributing
Would you like to help out with the implementation or the design work, or just chat about the VM in general? Contact me either through email (jan@parity.io) or through Element (@jan:parity.io). Let’s talk!
In general, if you’d like to work on something, please do make sure to ask me first! For example, not every issue in the repository is appropriate for people of all skill levels (even if it might appear simple), and some things I’d like to get done in a very particular way, so I wouldn’t want anyone to waste their time on a change that ultimately won’t be accepted.
Appendix: Bonus section, or why does it smell like gas here?
I will be implementing gas metering soon, so here’s a fun little quiz for you.
Consider and compare the following three snippets of Rust code:
Snippet number 1:

```rust
let mut index = 0;
for _ in 0..64 * 1024 {
    xs[index] += 1;
    index += 1;
}
```
Snippet number 2:

```rust
let mut index = 0;
for _ in 0..64 * 1024 {
    xs[index] += 1;
    index += 2048;
}
```
Snippet number 3:

```rust
let mut index = 0;
for _ in 0..64 * 1024 {
    xs[index] += 1;
    index += 2053;
}
```
Assume the code will never panic when accessing the array. All of these snippets do exactly the same work, run exactly the same number of times, and produce exactly the same assembly (except for the index increment). So here’s a million dollar question for you: which snippet will run the fastest? Which will run the slowest? And by how much? Or maybe they’ll all run at the same speed? Can you tell?
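(If you want to try this at home, a rough harness might look like the sketch below. The element type, the array size, and the timing approach are my assumptions rather than the original benchmark’s, and passing the stride as a runtime value instead of a constant will slightly change the generated code.)

```rust
use std::time::Instant;

fn bench(xs: &mut [u32], stride: usize) -> std::time::Duration {
    let start = Instant::now();
    let mut index = 0;
    for _ in 0..64 * 1024 {
        xs[index] += 1;
        index += stride;
    }
    start.elapsed()
}

fn main() {
    // Large enough that even the biggest stride stays in bounds:
    // (64 * 1024 - 1) * 2053 < 135_000_000.
    let mut xs = vec![0u32; 135_000_000];
    for stride in [1, 2048, 2053] {
        println!("stride {:>4}: {:?}", stride, bench(&mut xs, stride));
    }
}
```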
Well, here are the results; click to reveal the spoiler (the unit’s in milliseconds because I ran those multiple times):
- Snippet 1: 6ms
- Snippet 2: 114ms
- Snippet 3: 10ms
So, did you expect this? The second snippet runs 19 times slower than the first one, even though they’re essentially doing exactly the same amount of work and executing the same code! What gives? And why am I talking about this here?
Well, you see, here’s the thing with gas metering. We want gas metering to be deterministic, but also to reasonably model the amount of time a given piece of code will actually take. And things like this make that, well, hard. In this particular example, snippet 2 accesses the array in such a way that the memory accesses end up competing for the same slot in the CPU cache, evicting each other. (Assuming 4-byte elements, a stride of 2048 elements is exactly 8KB, so every access maps into the same small set of cache slots, while snippet 3’s odd stride of 2053 spreads the accesses across the whole cache.)
But the exact mechanism is not that important; there are plenty of other microarchitectural properties of a modern CPU which can significantly affect performance (this is just one example!). Rather, my point here is: what do we do here? Do we ignore this? Do we try to model this? Can we even model this? Maybe not necessarily exactly, but a little better than just benchmarking the instructions in isolation? Do we pessimize the gas calculations to assume the worst case? The average case? The best case? Do we just give up and cry?
Plenty of questions; very few answers. For now I’ll leave it up to you, dear reader, to answer them yourself. (: