Exploring alternatives to WASM for smart contracts

Background

A few months ago @Alex and @pepyakin entertained the idea of using eBPF for our smart contracts instead of WASM, the details of which are available in the following thread:

I won’t repeat the arguments already made in that thread (read it if you’re interested!). Suffice it to say, I do agree with the motivations presented within as to why we might consider something other than WASM for smart contracts.

So recently I got approached by @Alex regarding this and we’ve discussed it at length. Long story short, one of the ideas floated around was that maybe we could use RISC-V as the instruction set architecture of choice for our smart contract platform. I was intrigued by the idea, so I decided to run a little experiment to see how viable that would be in practice, and in this post I’d like to share my experience doing just that.

Alternatives to WASM, or what do we actually need?

First, the requirements. What do we actually want from our ISA? Here’s a non-exhaustive laundry list of what I think ideally we’d like to have, in no particular order:

  • Simple.
  • Easy to secure.
  • Easy to write a singlepass JIT compiler for (which still generates fast code).
  • Fast to execute.
  • Fast to JIT-compile (with the assumption that the generated code is fast).
  • Compact.
  • Portable.
  • Well defined, standardized and has an existing ecosystem.
  • Already supported by rustc and LLVM.
  • Guaranteed to be supported by rustc and LLVM into the future.
  • Has enough features to compile existing programs without much trouble.
  • Doesn’t have features (or has them but they’re optional) which we don’t need.

So, considering these requirements, the way I see it we have the following potential options available to us:

  • WebAssembly (which is what we’re currently using)
  • eBPF (used by Solana)
  • RISC-V (what I’m proposing here)
  • Our own custom ISA

Let’s quickly go over each of them and see which of the requirements each one fails to satisfy, either fully or partially.

WebAssembly

WebAssembly is:

  • Not simple. While it is laughably simple compared to something like x86, it’s definitely not simple enough considering what we need. (Again, see pepyakin’s post for some of the details; the link’s at the beginning of this post.)
  • Not easy to write a singlepass JIT compiler for (which still generates fast code). WASM is stack-based while hardware ISAs are register-based, so there’s a disconnect here: you need a register allocator, and writing a good register allocator is a really hard problem.
  • Not fast to JIT-compile. See previous point. wasmtime is notorious for JIT-compiling relatively slowly; something like Wasmer Singlepass makes it better, but it’s still not ideal.
  • Not guaranteed to keep its MVP variant (the one without a lot of extra features we don’t need) supported by LLVM forever, as web browsers move on and no one actually uses the MVP target anymore. Even if it remains technically supported, it might accumulate bugs if no one actually exercises it.
  • Has features we don’t need (e.g. floating point support).

eBPF

eBPF:

  • Can’t compile arbitrary programs out of the box, although Solana has a fork of rustc and LLVM where they’ve made it work.
  • Is only supported by upstream rustc and LLVM in a limited form; we’d need something like Solana’s variant, which is not upstreamed.
  • Is not guaranteed to be supported by rustc and LLVM into the future, assuming the Solana-like variant of eBPF.
  • Might not be fast enough to execute even when JITed (see my experience with it later in this post).
  • Can be too simple and actually lack features we’d want; for example, internal function calls and host function calls use the same instruction, which complicates things.
  • Has registers that are all 64-bit, which we don’t really need and which e.g. makes wasmtime-like sandboxing of memory impossible.

Our own custom ISA

This doesn’t currently exist, but if it did I think it’d be safe to assume that we could fulfill all of the functional requirements; so what would be left are things like:

  • Not supported by rustc and LLVM. We’d have to write our own LLVM backend and get it upstreamed. I can’t overstate the amount of work this would require! Solana maintains their own fork of rustc and LLVM and they do want to upstream it, but several years later it still hasn’t happened.
  • Not standardized and has no existing ecosystem. We’d be the only users of this ISA (at least initially).
  • Not guaranteed to be supported by rustc and LLVM into the future. Us being the only users we’d have to maintain the LLVM backend ourselves indefinitely, even if we’d upstream it.

This option would be ideal from a functional perspective, but would require a truly massive amount of work and effort. I’d really rather avoid it if we can help it.

RISC-V

None…? From a cursory look it seems like RISC-V might actually tick all of the boxes we’d need! But does it really? That’s what we want to find out here!

So what is RISC-V?

(If you already know feel free to skip this section.)

RISC-V is an open and free instruction set architecture developed and maintained by RISC-V International (formerly the RISC-V Foundation). In a nutshell, it has the following killer features which (in combination) make it unique as far as ISAs go:

  • It’s free and open-source. Anyone can use it without any license fees.
  • It’s very simple and orthogonal. At its bare minimum all of the required userspace RISC-V instructions fit on a single page.
  • It’s modular and extensible. Parts of the ISA (e.g. multiplication, floating point support, atomics, SIMD, etc.) are entirely optional and can be enabled/disabled at will.
  • It’s efficient and scalable. It is designed to scale from tiny low-power microcontrollers up to supercomputers with hundreds of CPUs.
  • It is already well supported in rustc and LLVM and is gradually picking up steam in the industry.

The unique properties of RISC-V make it possible to tailor it to our very specific requirements while still benefiting from all of the work being done around it in the wider ecosystem.

Take a look at Wikipedia if you’re interested in more details. (I’d put a link but Polkadot forum barfs an error if there are more than 2 links in a post.)

The experiment

So what we’d like to know here is simple: would RISC-V actually be a good fit for smart contracts? And can it actually fulfill all of our requirements, not just on paper but also in practice? This is what I’ve decided to try and find out.

As the first step we need a benchmark on which we can evaluate the merits and performance of RISC-V (and the other alternatives). Ideally something that is at least somewhat smart-contract-like in the type of work it does, but scaled up to the very extreme of what we’d reasonably run. And it just so happens I already had a good candidate that fits the bill!

You see, 7 years ago I wrote a cycle-accurate NES emulator in Rust called Pinky. So I thought, hey, let’s quickly make that bad boy no_std and use it as a benchmark! At first glance this might sound ridiculous - obviously no one’s going to be deranged enough to put an NES emulator on chain into a smart contract and play Super Mario Brothers with it. But I still think it’s a reasonably good pick, for the following reasons:

  • It does actual useful (well, if you define “playing games” as useful) work instead of being a microbenchmark.
  • It’s relatively big, most likely on the upper end of what anyone would even attempt to compile into a smart contract, so it illustrates a sort of a worst case scenario.
  • The type of work it does is - in a way - similar to what a smart contract would do: it does almost no floating point math, it’s not memory bound and doesn’t shuffle a lot of data around, and it’s essentially mostly a bunch of logic and ifs and jumps.
  • The performance is easily interpretable: how close to 60 FPS can we get?

Now that we have our benchmark - a “smart contract” which generates frames of Super Mario Brothers (or any other NES game) - it’s time to try to run it. So, to get myself more familiar with RISC-V on a practical level, I wrote a RISC-V interpreter. This took me less than one day.

Now, let me interject here for a bit and reiterate what just happened. I wrote an interpreter in less than a single day, completely from scratch, that can run real software compiled for a real ISA by a real compiler. This is a big deal and a testament to RISC-V’s simplicity! If you tried to do that for any other real ISA - say, x86 - you’d probably spend a whole week or month just trying to decode the instructions, never mind writing a fully functional interpreter in a day! Even the MOS 6502 interpreter that I wrote for Pinky (and the 6502 is a CPU from the 70s, almost 50 years old!) is a lot more complex than my RISC-V interpreter!
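To give you a feel for why this was so quick, here’s a minimal sketch of what the core of such an interpreter can look like (illustrative only - the structure and the instruction subset here are simplified for the example, and my actual interpreter is organized differently):

```rust
// A minimal RV32I interpreter core; a sketch, not my actual code.
struct Vm {
    pc: u32,
    regs: [u32; 32], // x0..x31; x0 is hardwired to zero
    memory: Vec<u8>,
}

impl Vm {
    fn step(&mut self) {
        let inst = u32::from_le_bytes(
            self.memory[self.pc as usize..][..4].try_into().unwrap(),
        );
        let opcode = inst & 0b1111111;
        let rd = ((inst >> 7) & 0b11111) as usize;
        let funct3 = (inst >> 12) & 0b111;
        let rs1 = ((inst >> 15) & 0b11111) as usize;
        let rs2 = ((inst >> 20) & 0b11111) as usize;
        match (opcode, funct3) {
            // R-type ALU ops: same encoding, just a different operation.
            (0b0110011, 0b000) => self.regs[rd] = self.regs[rs1].wrapping_add(self.regs[rs2]), // ADD (funct7 = 0)
            (0b0110011, 0b100) => self.regs[rd] = self.regs[rs1] ^ self.regs[rs2],             // XOR
            (0b0110011, 0b111) => self.regs[rd] = self.regs[rs1] & self.regs[rs2],             // AND
            // I-type: ADDI with a sign-extended 12-bit immediate.
            (0b0010011, 0b000) => {
                let imm = ((inst as i32) >> 20) as u32;
                self.regs[rd] = self.regs[rs1].wrapping_add(imm);
            }
            _ => unimplemented!("the rest of the (small) instruction set"),
        }
        self.regs[0] = 0; // writes to x0 are discarded
        self.pc = self.pc.wrapping_add(4);
    }
}
```

The whole instruction set is basically just more arms in that match.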

So I have an interpreter, what now? I could compare it to the other alternatives as-is, but that’s a little bit boring. The fact that I wrote it in less than a day was promising, so I decided to take it a step further: let’s write an actual JIT compiler for it!

So that’s what I did. It took me two days of work to write a RISC-V JIT compiler. Completely from scratch. And did I mention it can technically take any arbitrary program which rustc generates and run it? But enough rambling; let’s look at the numbers! (Lower times are better.)

  • wasmi: 108ms/frame (~9.2 FPS)
  • wasmer singlepass: 10.8ms/frame (~92 FPS)
  • wasmer cranelift: 4.8ms/frame (~208 FPS)
  • wasmtime: 5.3ms/frame (~188 FPS)
  • solana_rbpf (interpreted): 6930ms/frame (~0.14 FPS)
  • solana_rbpf (JIT): ~625ms/frame (~1.6 FPS)
  • My RISC-V interpreter: ~800ms/frame (~1.25 FPS)
  • My RISC-V JIT: ~25ms/frame (~40 FPS)

These results are… very interesting. My RISC-V JIT - a simple single-pass recompiler with very little in the way of optimizations, probably under 1k lines of code, written in two days - generates code that is only 2.5x slower than Wasmer Singlepass, which in total has over 150k lines of code (not all of those are relevant, but still) and up until this point most likely has had man-years’ worth of effort invested into it. Saying that these results are promising would be a gross understatement!

Another interesting result here is Solana’s eBPF JIT, which really shocked me. It ended up being almost as slow as my RISC-V interpreter, and nearly six times slower than wasmi (also an interpreter)! Something went really wrong here. (And before you ask, I did disable metering in Solana’s JIT to make things fair.) This could be because the JIT itself doesn’t generate good code, or because LLVM doesn’t generate good code for eBPF, or both. What we need to remember here is that eBPF (and LLVM’s eBPF backend) was never originally meant to compile something like this, and it’s only due to Solana’s LLVM fork that it can do it in the first place. So it is entirely possible that their eBPF backend simply generates bad code, which would translate to equally bad code after JIT compiling it. Either way, this result makes using eBPF as the ISA of choice for smart contracts even more unappealing.

Code size

I’ve also done a comparison of the code size. All of the builds here are with lto = true, strip = true and codegen-units = 1 in Cargo.toml:

  • eBPF (-O3): 150k
  • eBPF (-Os): 140k
  • eBPF (-Oz): 117k
  • WASM (-O3): 80k
  • WASM (-Os): 73k
  • WASM (-Oz): 59k
  • WASM (-O3) + wasm-opt: 74k
  • WASM (-Os) + wasm-opt: 67k
  • WASM (-Oz) + wasm-opt: 54k
  • RISC-V (-O3): 92k
  • RISC-V (-Os): 83k
  • RISC-V (-Oz): 71k
  • RISC-V + C (-O3): 73k
  • RISC-V + C (-Os): 66k
  • RISC-V + C (-Oz): 57k

The “RISC-V + C” builds use the compressed-instruction extension, which adds alternative 2-byte encodings (instead of the usual 4 bytes) for the most commonly used instructions.

Using the C extension makes RISC-V competitive with WASM, but considering RISC-V’s simplicity I think we could do better! What I mean is: we don’t actually need to store raw RISC-V bytecode; we could define our own custom encoding and store that instead. You can think of this as a simple compression scheme for RISC-V bytecode. I haven’t explored this yet, but it would most likely let us cut down the size even more, at essentially no cost.
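Just to illustrate the idea (this is purely hypothetical - no such format is designed yet, and the opcode numbers and layout here are made up), even something naive like “one opcode byte, register bytes, variable-length immediate” would shrink the common case, since small immediates dominate in practice:

```rust
// Hypothetical storage-only re-encoding of a single ADDI instruction;
// nothing here is a real format, it just sketches the idea.
fn encode_addi(out: &mut Vec<u8>, rd: u8, rs1: u8, imm: i32) {
    const OP_ADDI: u8 = 0x01; // made-up opcode number
    out.push(OP_ADDI);
    out.push(rd);
    out.push(rs1); // with only 16 registers these two could share a byte
    // Signed LEB128 immediate: the very common small values take one byte
    // instead of being baked into a fixed 4-byte instruction word.
    let mut value = imm;
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7; // arithmetic shift, so the sign is preserved
        let done = (value == 0 && byte & 0x40 == 0) || (value == -1 && byte & 0x40 != 0);
        out.push(if done { byte } else { byte | 0x80 });
        if done { break; }
    }
}
```

The VM would decode this back into whatever form the JIT consumes, so the on-chain format and the execution format could evolve independently.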

RISC-V: the good parts

So what were the good parts of RISC-V based on my experience writing my JIT recompiler?

  • Really simple. There are not many instructions - only 55 if I counted them right (if I didn’t - sorry, counting is hard!) - and even that number is a little misleading. The instructions can be grouped into roughly ~11 categories, and the handling of all of the instructions within a category is essentially the same. For example, AND and XOR are encoded very similarly and have the same semantics, just performing a different bitwise operation.
  • The most basic RISC-V target with the M extension (for multiplication/division) is the bare minimum of what an ISA should have, and is pretty much exactly what we want functionality-wise. Floating point support, atomics, SIMD, etc. are in their separate extensions which we can completely ignore.
  • Has a dedicated instruction for making syscalls/hostcalls. (This is worth mentioning because eBPF doesn’t have one.)
  • Is 32-bit (well, it has both 32-bit and 64-bit targets) so we could use the same trick wasmtime uses to sandbox its memory accesses through clever use of virtual memory.
  • Is really easy to decode; the instruction encoding is mostly sane (although some of the immediate encodings are a little crazy - see the sketch after this list - but it’s nothing too bad) and instructions are always constant length.
  • Could most likely support a Harvard-style machine like WASM (which is nice for smart contracts as we wouldn’t have to copy the RISC-V code itself to memory and make it accessible to the smart contract; mentioning this because, again, eBPF doesn’t have this from what I can see)
  • The support for RISC-V in rustc seems very good, and is only going to get better as RISC-V gains adoption.
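To make the “immediate encodings are a little crazy” remark concrete, here’s roughly what reassembling a conditional branch’s offset looks like (a sketch, not lifted from my actual decoder). The bits are shuffled around so that the register fields stay in fixed positions across instruction formats - great for hardware, mildly annoying for us:

```rust
// Decoding a B-type (conditional branch) immediate, whose 13 bits are
// scattered across the 32-bit instruction word.
fn b_type_immediate(inst: u32) -> i32 {
    let imm12   = (inst >> 31) & 1;        // inst[31]    -> imm[12]
    let imm10_5 = (inst >> 25) & 0b111111; // inst[30:25] -> imm[10:5]
    let imm4_1  = (inst >> 8)  & 0b1111;   // inst[11:8]  -> imm[4:1]
    let imm11   = (inst >> 7)  & 1;        // inst[7]     -> imm[11]
    let imm = (imm12 << 12) | (imm11 << 11) | (imm10_5 << 5) | (imm4_1 << 1);
    ((imm as i32) << 19) >> 19 // sign-extend from bit 12; imm[0] is always 0
}
```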

RISC-V: the bad parts

The biggest wart of RISC-V in the context of writing a JIT is that it has 32 general purpose registers (well, actually, only 31; one of those is the zero register which doesn’t really count), which does complicate things.

For reference, AMD64 (also called x86_64), which most of us are running, has only 16 registers. So how do you map 32 registers onto only 16? Well, you don’t. You need to spill things into memory. Empirically I’ve found that as long as you pin the most frequently used registers to physical registers and only spill the rarely used ones, performance doesn’t suffer too much. Initially I spilled every register to memory on every access, and as I gradually pinned more and more RISC-V registers to actual AMD64 registers the performance improved, but only up to a point, with diminishing returns.
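Conceptually the scheme I converged on looks like this (a sketch; which guest registers get pinned, and to which host registers, is illustrative - my JIT’s actual assignment differs):

```rust
// Sketch of the register-pinning scheme on an AMD64 host.
enum HostLocation {
    Pinned(&'static str), // lives permanently in this AMD64 register
    Spilled(u32),         // byte offset into an in-memory register file
}

fn locate(guest_reg: u8) -> HostLocation {
    match guest_reg {
        // Pin the registers which rustc-generated code touches constantly...
        2  => HostLocation::Pinned("rbx"), // x2 (sp)
        10 => HostLocation::Pinned("rsi"), // x10 (a0: first argument/return value)
        11 => HostLocation::Pinned("rdi"), // x11 (a1)
        // ...and spill the rarely-used rest; each access to these costs
        // an extra memory load/store.
        _ => HostLocation::Spilled(guest_reg as u32 * 4),
    }
}
```

Every additional pinned register removes loads/stores from the hot path, which is exactly where the diminishing returns I mentioned come from.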

There is a way around this though. RISC-V officially defines a subtarget called RV32E which only uses 16 registers, which would be almost perfect for us! Unfortunately this isn’t currently implemented in LLVM, but there’s a patch in progress to add it. In the worst case we could help out and get it over the finish line (either directly or by funding it). There’s also apparently support for it in GCC already, so using rustc’s GCC backend could also be an option. Writing a postprocessor which would convert full RISC-V code into RV32E and use that until LLVM supports it natively is also a possibility. We could solve this.

Future work

So what are the next steps? What still needs to be done? More experiments!

  • Follow up on RV32E: take the in-progress LLVM patch, make it work with rustc, and see how RV32E affects the generated code and how easy it is to JIT. Does the code get larger? By how much? Does it get slower? How much simpler does the JIT get if we can guarantee that only 16 registers are used?
  • See whether it’d be feasible to write a postprocessor that’d take full blown RISC-V code and transform it into RV32E. Is that easy to do? And how would that affect the performance of the resulting code?
  • Experimentally integrate it into Substrate and Ink!, and run some actual smart contracts on-chain.
  • Investigate a more compact encoding of RISC-V instructions and see how small we can make it.

Conclusions

After looking into eBPF and RISC-V in more detail and experimenting with them my conclusions are as follows:

  • I don’t think eBPF is a good fit. Yes, it’s simple, but it’s too simple, and it’s just too problematic in practice.
  • RISC-V exceeded my expectations. We should seriously consider it and investigate further.
  • I wouldn’t go as far as saying “we should switch to RISC-V” yet, but I’m close.
  • Considering RISC-V’s simplicity I could probably write a secure, production-ready JIT for it in a few months, possibly weeks.
38 Likes

Excellent write-up. This indeed seems like a great fit for our needs. :thinking:

Have you researched prior art for RISC-V JIT compilers? I am aware of this project, though it’s not intended to be production-ready. However, the talk it comes from is very informative, and it might be good to pick the author’s brain for techniques and previous work in this area.

1 Like

Only briefly; my main objective was to get into the weeds and really dig down into the details, to build an intuitive understanding of the problem space from the lowest level possible.

One thing I want to point out is that the technique I’ve used for my JIT is quite different from how JITs are usually written; you could probably think of it as a “static recompiler” rather than a true “just-in-time recompiler”. Full-blown JITs usually support things like self-modifying code and actually compile just in time; mine doesn’t! My “JIT” just translates everything in a single pass before executing it. This significantly cuts down on complexity and makes it really cheap - think potentially gigabytes of code translated per second if actually optimized to the bone.
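To illustrate the shape of it (a sketch, not my real code - branch fixups, the register map, etc. are all omitted), the whole thing is conceptually just one linear loop that lowers every guest instruction to a fixed snippet of host code before anything runs:

```rust
// The overall shape of a single-pass, ahead-of-time translator.
enum Inst { Add, Ecall, Other }

fn decode(inst: u32) -> Inst {
    match inst & 0x7f {
        0x33 => Inst::Add, // R-type ALU group (sketch: ignoring funct3/funct7)
        0x73 => Inst::Ecall,
        _ => Inst::Other,
    }
}

fn translate(code: &[u32]) -> Vec<u8> {
    let mut host = Vec::new();
    for &raw in code {
        // Each guest instruction becomes a small, fixed byte pattern;
        // the actual bytes here are placeholders.
        match decode(raw) {
            Inst::Add => host.extend_from_slice(&[0x01, 0xf8]),   // add eax, edi
            Inst::Ecall => host.extend_from_slice(&[0xff, 0xd3]), // call rbx (hostcall trampoline)
            Inst::Other => unimplemented!(),
        }
    }
    host // then mark it executable and jump into it
}
```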

We can do this precisely because we don’t need a full-blown general-purpose JIT: we can tailor the JIT’s input (e.g. use specific compiler flags, a specific linker script, etc.), and we can decide what our runtime environment will look like.

In the end we might end up needing something more complex, I don’t know, but as an experiment I wanted to see how simple we can get and how that’d actually perform. And as it turns out, it performs pretty well!

Yeah, that project is more JIT-ty than my JIT, I’d say. It’s a fun little experiment; especially the part where it calls into rustc to generate the code. (: But yes, we might end up using some similar techniques (e.g. recompiling on a per-function basis and applying some light/cheap optimizations); that’s still up in the air and would need to be tried out experimentally to see how it affects performance.

In the end our main objective here is to absolutely minimize the complexity while retaining great performance, so it’s always going to be somewhat of a tradeoff; we just need to figure out what the simplicity/performance curve looks like and where we want to be on it.

2 Likes

Most JITs are JITs because they have to be: it is unclear which parts of the input are data and which are code. However, since we are defining the VM, we will define it in a way that lets us statically recompile it. We should probably stop using the word JIT. That’s mostly wasmtime’s fault, which is also not a JIT.

RV32E sounds awesome.

Is it sensible to have 64-bit registers but a 32-bit usize for your sandboxing? I think Rust supports some targets like that. On-chain custom cryptography likely runs faster with 64-bit registers; maybe contracts don’t care too much, but it might be nice for parachains.

64 instructions already take 6 bits, and you might eventually want more, like SIMD, but you could maybe just have 1-byte instructions, no?

Could we make the architecture’s alignment the worst case of all accepted host architectures? If so, this could improve some hostcalls, no?

Hmm… this is an interesting idea. I guess since we can define how our execution environment works, we could technically define that registers are 64-bit but only the lower 32 bits are used for addressing purposes? One problem I see here though is that something like RV64E (a 64-bit variant of RISC-V with only 16 registers) doesn’t exist, and the spec explicitly says (I quote):

The E variant is only standardized for the 32-bit address space width.

There might be some other unforeseen problems here that I’m not seeing.

Do we know how often people use on-chain custom crypto? I’m guessing it’s probably not that often?

In general if you’re aiming for pedal-to-the-metal speed you’d just use SIMD for this, right? So if we’re talking about maximizing the speed of on-chain custom crypto then the conversation we need to have is not “we should have our registers be 64-bit” but “we should expose SIMD and have our registers be 64-bit”. Nevertheless I think for smart contracts specifically this is out of scope.

Using this for PVFs would be an entirely different conversation. (:

For now we’re only looking at this from smart contracts’ perspective and we have no plans whatsoever of even attempting to use this for PVFs. However, if it goes really well for smart contracts then it might be a conversation worth having. Unlike the relay chain runtime (which is trusted) the PVFs (which are untrusted) do share some similarities with smart contracts where we really care about things like Alcatraz-level security sandboxing, O(n) JIT compilation and in general just keeping things simple to make the whole thing predictable and secure.

Yeah, that’s roughly what I was thinking to try for the custom storage-only bytecode. RISC-V is constant length to simplify its execution semantics, but to store it we could just trivially use a variable-length encoding to cut down on the space it takes on-chain.

Not entirely sure what you mean by “architecture alignment” here but in general, yes, we’d like to design this so that it’d be easy to JIT at least for AMD64 and AArch64. (I don’t really see any other architectures becoming relevant in the next several decades, except for RISC-V.)

2 Likes

Hi,

wasmi author/maintainer here.

Thanks a lot for this cool write-up of your RISC-V experiment. I share your opinion that unlike eBPF RISC-V might be actually valuable and interesting for our use case.

I wanted to add some more problems of Wasm from my experience working on it concerning our smart contracts use case.

  • Wasm MVP was simple back in the early days, when its whole spec could fit on a single page. However, even back then it already had roughly 170 different instructions, and Wasm’s functionality is being extended at a very rapid pace. For example, nowadays the Wasm spec already defines well over 500 different instructions, of which roughly 250 come from the SIMD proposal. Other proposals are going to extend the simple Wasm type system, adding struct types for example (GC proposal), new control structures (exception handling), more complex function types, etc. Note though that not all Wasm proposals are bad; some are actually useful to us. However, Wasm is not very modular with respect to its proposals: plans for Wasm 2.0 mandatorily include most (if not all) standardized proposals.
  • Generally people think of Wasm as an assembly language. In my opinion this is the wrong angle to look at it from. Wasm behaves more like a strongly typed high-level language, where the so-called Wasm validation is similar to the type checking of other strongly typed languages.
  • Since Mozilla’s massive bail-out from Wasm, the Wasm ecosystem has shifted towards cloud computing. More and more proposals are concerned with this topic, and at recent Wasm conferences it can be seen that cloud computing dominates over other use cases such as our embedded niche.
  • Not everything is bad though: For example, the Wasm CG is talking about profiles which could be used to restrict Wasm to a subset that is more suited towards a certain niche use case. Currently the discussion is mostly concerned about embedded profiles and deterministic executions - so very interesting for our use case.
  • As you already mentioned in your post, the rapid development of Wasm is a problem for all Wasm runtime implementers. For example, it would be very problematic to just support Wasm MVP and call it a day, since your runtime would likely fall apart over the years as the Wasm ecosystem around it develops, and there is a chance that Wasm MVP support and usage will rot. Therefore Wasm runtime authors kind of have to keep up with the pace, implementing most if not all of the standardized features, or at least have fallback mechanisms. In wasmi I already foresee problems if the Wasm SIMD proposal becomes mandatory with Wasm 2.0. There are discussions about exactly this issue.

If this experiment is successful and we find a good way to transition Wasm smart contracts to RISC-V, I actually do not see much reason to support both Wasm and RISC-V in the contracts pallet, as there would be no real argument for using Wasm. The only niche use case I can think of is using wasmi as a fallback for environments where the RISC-V JIT is not supported, since interpreters are more universal than JITs. Another reason is upgradability: wasmi lives in the Substrate runtime, whereas the RISC-V JIT would need to reside in the Substrate client, since Wasm does not allow for JIT-like execution.

I also have a few questions about your RISC-V JIT plans:

  • You were talking about further improving the encoding of RISC-V in order to shrink it. Do you already have concrete plans for what to do and what concretely could be improved? From my very basic knowledge of RISC-V, it already is quite compressed.
  • What platforms would you target for initial support for the RISC-V JIT? ARM? RISC-V itself? x86 probably?
  • Wasm was designed with all those safety checks in mind. I lack the knowledge about RISC-V to understand how some of those safety properties are guaranteed in RISC-V - or are they simply not? E.g. memory accesses, uninitialized memory, etc. How would a RISC-V JIT handle those cases in theory?

Furthermore, for benchmarks I want to stress that we are always interested in the whole execution cycle: parsing, (validation,) translation of RISC-V/Wasm to the executed bytecode or machine code, as well as the execution itself. Measuring only the raw execution speed is not enough and would yield misleading numbers, since smart contracts usually execute only small fragments of their entire body, so execution time is usually dwarfed by translation time. This is also the reason why our Wasmer Singlepass experiment was unsuccessful even though its raw execution was roughly 10 times faster than wasmi: including the translation time, wasmi was faster on the smart contract samples. My hope with a successful RISC-V JIT is that we could reduce fuel costs dramatically and thus enable more compute-intensive smart contract use cases, shifting the entire ecosystem of smart contracts since more ways of expressing intent become affordable.

As a side note: you mentioned that eBPF using the same instruction for host calls and internal calls is bad. Wasm does the same as eBPF. Due to the encoding of Wasm it is in theory possible to differentiate those calls during translation; however, this is obviously not possible for indirect calls. In wasmi we do not yet profit from this optimization, so I cannot tell how much of a performance improvement it would bring for call-intensive workloads.

7 Likes

@RobinF Thank you for sharing your experience with WASM!

Even for that use case we could probably write an optimized RISC-V interpreter that could live inside the runtime, although the question here is: how fast would it be? I think WASM could have the upper hand here - because of its higher-level nature it might be easier to write a fast interpreter for it (I was really impressed with wasmi’s performance!). But I’m not sure. My RISC-V interpreter had absolutely zero optimizations done to it, so I could definitely speed it up, I just don’t know by how much.

I did not explore this in detail and don’t have any concrete plans yet, but I think it should still be possible to improve on it (RISC-V’s encoding is, after all, a compromise between compactness and being easy to decode/execute in hardware). It’s just a matter of empirically finding out how big of an improvement it’s going to be exactly.

Initially AMD64 (x86_64) since that’s what the majority is running, and then AArch64 for all of the M1 people.

A JIT which generates RISC-V would be fun to have, but RISC-V hardware availability is not there yet at this point in time, so having it wouldn’t really be practically useful for anything.

RISC-V inherently doesn’t deal with the issue of safety at all, just like any other hardware ISA. However, I believe we can make it safe, especially since we don’t need to run arbitrary existing RISC-V binaries and have full control over our execution environment (so e.g. we could use similar tricks which wasmtime uses to make WASM safe). In the end a RISC-V program will see a blob of linear memory (just like in WASM) which it will be able to access, and the VM/JIT will make sure that it can’t access anything outside of it.
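For the memory accesses specifically, the wasmtime-style trick looks roughly like this on a 64-bit Unix host (a sketch assuming the libc crate; real code would also map in the actually-accessible pages, install a fault handler, and handle errors):

```rust
// Reserve the guest's entire 4 GiB address space up front. A 32-bit
// guest address, zero-extended and added to the base, physically cannot
// reach outside this reservation, so loads/stores need no bounds checks;
// anything not explicitly mapped in just faults.
unsafe fn reserve_guest_memory() -> *mut u8 {
    let base = libc::mmap(
        std::ptr::null_mut(),
        1_usize << 32,   // the whole 32-bit guest address space
        libc::PROT_NONE, // everything faults until explicitly mapped
        libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
        -1,
        0,
    );
    assert_ne!(base, libc::MAP_FAILED);
    base as *mut u8
}

// Every guest load/store compiles down to this address computation:
fn guest_to_host(base: *mut u8, guest_addr: u32) -> *mut u8 {
    base.wrapping_add(guest_addr as usize) // the zero-extension is the sandbox
}
```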

And of course, since we’re operating on the level of assembly there isn’t really such a thing as undefined behavior, e.g. reading uninitialized memory on the assembly level is normal and can be done “safely”, although you’ll obviously read back whatever stale value just so happened to be there. (Though we won’t have any uninitialized memory; for determinism’s sake we’ll of course always clear the memory before execution.)

Yes, indeed! This is precisely why I wrote my JIT the way I did, and assuming the production-ready version of it is going to be similar then the compilation/translation step should be very cheap.

Yep. What works in WASM’s favor here however is that it is higher level (and that it’s quite easy to differentiate the host calls and internal calls based on their index, since host functions always come first), while eBPF really feels like a low level ISA in this regard and has just a call $address instruction with really no standard way of knowing whether $address points to a function defined within the VM or outside of it. (Or in other words, it pretends that both host calls and internal calls live inside of the same address space, while WASM doesn’t even have a concept of address space for code.)

1 Like

If you can provide me with a link to your interpreter I can take a look at it and see if there are some quick ways to improve its performance.

Indeed it seems that RISC-V with the 16 register restriction should be fairly easy to translate to machine code. It could be that even a simple Wasm interpreter might have more overhead than a singlepass RISC-V JIT. I am looking forward to future results!

Makes me dislike eBPF even more … thanks for the info!

Another question that just crossed my mind: is there an (official) RISC-V spec testsuite? In wasmi we profit greatly from the existing Wasm spec testsuite with its thousands of test cases, not having to write our own very extensive testsuite in order to verify wasmi’s correctness.

The risc0 project and their approach to optimizing their RISC-V-based zkVM is worth looking into. Their optimization problem is tangential to ours.

The concept of on-chain crypto is a bit like their need for fast provable computation, which makes recursive zero-knowledge proofs (the proof that you have proven something) possible.

They have a SHA-specific accelerator for example.

2 Likes

Thanks for the explanation! I was curious whether we could leverage existing technology/approaches. In particular I think there is a very real concern here: how secure would a from-scratch project be compared to battle-tested WASM compilers? I am not familiar with the threats in smart contract environments, but in the realm of PVFs (which I know you said you’re not targeting yet) security is critical. Right now we get to leverage the efforts of the WASM community in identifying bugs and fixing them fast, as well as the large existing test suites (as mentioned in an above post).

But your compiler aims to be very simple which is a great approach to minimizing bugs. Still, at a minimum it would have to be audited and fuzzed. :slight_smile:

Also, was determinism a goal or something you considered? It would be great to have for PVFs at least. If so, how would your approach compare to the current situation for determinism with WASM pre-compilation (summarized here by @dmitry.sinyavin)?

Overall I think this is a great idea. There is an opportunity here to be leaders in very impactful work for the ecosystem, which we can leverage in our marketing. We can also say that we’re 25+x faster than Solana. (Though I guess we already are, with WASM. Does Marketing know about this?)

1 Like

It’s more a parachain concern I think. As you say, we should focus on RV32E with your changes, likely this holds for parachains too. Cool :slight_smile:

As an aside… We’ll largely address this for EC ZKPs on parachains by exposing host calls for MSMs and pairings on several popular elliptic curves. We’ll maybe need more for MPCs, but I’m not sure yet.

I haven’t pushed my code anywhere yet as it’s really messy and I need to clean it up first. It’s basically a naive interpreter that first decodes the instructions into an enum, and then just matches on them and executes them. But this isn’t my first rodeo, so I do know how to optimize it, I just don’t know how fast it’d get after it gets optimized. (:

Yes, there are some official tests, although I haven’t looked into them in much detail yet. Here: GitHub - riscv-software-src/riscv-tests

Nevertheless, if this goes anywhere near a production state I do plan to have exhaustive tests which would basically test every instruction with every possible combination of registers.
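As a hypothetical sketch of what I mean (the interp/jit closures here stand in for the real execution backends), such a test would enumerate every register combination for every instruction and cross-check the two implementations:

```rust
// Hypothetical exhaustive cross-check of interpreter vs JIT; both
// closures take an instruction plus an initial register file and
// return the resulting register file.
fn exhaustive_check(
    interp: impl Fn(u32, [u32; 32]) -> [u32; 32],
    jit: impl Fn(u32, [u32; 32]) -> [u32; 32],
) {
    const VALUES: [u32; 4] = [0, 1, 0x7fff_ffff, 0xffff_ffff];
    for rd in 0..32u32 {
        for rs1 in 0..32u32 {
            for rs2 in 0..32u32 {
                // Encode an R-type ADD with this register combination.
                let inst = 0x33 | (rd << 7) | (rs1 << 15) | (rs2 << 20);
                for &value in &VALUES {
                    let mut regs = [value; 32];
                    regs[0] = 0; // x0 is always zero
                    assert_eq!(interp(inst, regs), jit(inst, regs));
                }
            }
        }
    }
}
```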

Fundamentally I think the crux of the security issue is that complexity is a problem. Even something as widely used and battle tested as wasmtime recently had an extremely nasty bug which can lead to remote code execution: Guest-controlled out-of-bounds read/write on x86_64 · Advisory · bytecodealliance/wasmtime · GitHub

So my strategy for keeping everything secure is three-fold (well, it will be, once this gets out of the experimental phase):

  1. Keep things as simple as possible. Simple things are easier to secure.
  2. Exhaustive and extensive testing of everything.
  3. Alcatraz-level of sandboxing; even if someone would break out of the VM and gain remote execution they still shouldn’t be able to do anything besides maybe waste the host’s CPU cycles.

Yes, determinism is also a goal. The machine code generated by the JIT will be the same regardless of which machine it is generated on (of course assuming the CPU architecture is the same), and I also want for things like e.g. stack space used to be deterministic. And since we’d control the whole execution environment we can always explicitly version whatever aspect of it we want, which makes it possible to change it later without breaking determinism.

3 Likes

I love the idea of starting to experiment with RISC-V! Thought about that myself some time ago. Curious to see how it goes.

Some food for thought: recently, I caught up with Luke Wagner, one of the guys behind the WebAssembly spec, and talked with him about the architecture, and of course, my question was, “why the stack machine architecture which is harder to translate to register machine architecture of a real CPU”. And I found his point quite logical.

  1. It doesn’t matter whether the source abstract VM is a stack or a register machine. As it never matches a real CPU architecture 1:1, a compiler must still convert the code to SSA form and then do register allocation. At that point, the source architecture is lost anyway.

  2. If you want to do register allocation efficiently, you have to follow values’ live ranges to be able to optimize values’ distribution to registers. And the stack machine architecture is much more convenient here: when a value is pushed to the stack, its live range starts, and when it’s popped from the stack, its live range is unconditionally over. So you don’t have to keep an eye on when it dies, you just know when it dies. So it simplifies the compiler’s design.

But that’s just one of the possible angles of view to the problem. So now we have a RISC-V compiler experiment and a WASM transpiler experiment, and I’m really curious about where it’s going to take us :slight_smile:

Yep, this is true! But it does make one big assumption - that the goal is a compiler which is essentially an optimizing compiler, able to generate pedal-to-the-metal fast code at the expense of putting in more work up front when compiling. I think this fundamentally makes sense in the context of WebAssembly and the environment it was originally meant to run in. (Which is also part of the disconnect we feel when we say that WASM is not a great fit for something like smart contracts, whose base requirements are different.) We’re fine with some execution-time performance loss if it means we can get the other things we want.

1 Like

I’ve cleaned up my experimental JIT (well, “cleaned up” as much as I could; the code’s still pretty bad) and pushed it to GitHub, so here it is in case anyone wants to take a look:

5 Likes

For sandboxing we could consider using hardware virtualization support (VT-d, …), but I guess it would make our JIT much more complicated.

Additionally, RV64E would probably mean going down the post-processing route of re-allocating registers in the rustc output. Not a small project by itself, I guess: it needs to transform to SSA, re-allocate registers, and emit code.

I suggest starting with an MVP (RV32E) and then iterate. We can always add features later (64bit support). We just need to get the API properly versioned so that we can actually iterate.

I would even say determinism is one of the reasons why we want to get away from wasm. All those high-level constructs make determinism hard, while in RISC-V everything is just memory. We just need to make the environment identical, and then everything is deterministic by default.

The obvious example is a stack overflow: in wasm the executor needs to handle this. In RISC-V we don’t even know about a stack. It is just memory to us, and if a guest program wants to corrupt some of its own data structures it is well within its rights to do so.

It might not even be a great fit for PVFs. While we can get away with an optimizing compiler there, it is not ideal. I suspect we would trade away some execution performance if we could get rid of pre-checking in return. With parathreads it gets even more important that we don’t spend too much time compiling. It all comes down to how much slower the generated code is compared to wasmtime’s.

1 Like

Continuing with this experiment, here’s another update!

I’ve grabbed the work-in-progress LLVM patch implementing the RV32E ABI and I’ve integrated it into rustc. There were some merge conflicts when applying the patch to rustc’s LLVM fork, but the whole process was mostly straightforward and I successfully got the newest nightly Rust toolchain working which can emit RV32E code. (I can push this to GitHub later if there’s interest.)

So I recompiled my test program for RV32E, and here are the results:

  • Original: ~25ms/frame, 92 kilobytes
  • RV32E (unmodified VM): ~32ms/frame, 88 kilobytes
  • RV32E (optimized VM): ~27.5ms/frame

What’s interesting here is that the benchmark got slower! I’m not entirely sure why this happened, but considering that the resulting code also got smaller, I’m guessing that with more registers the compiler generates code which exploits more instruction-level parallelism - something modern superscalar CPUs really like - while the 16-register version can’t do this as much because it needs to reuse registers.

Nevertheless, after a few extra optimizations and tweaks to my recompiler’s codegen I’ve mostly clawed that back, so now it’s only slightly slower.

What is nice about RV32E is that, indeed, it only requires 14 general-purpose registers, which is perfect: on AMD64 it is possible to keep every RISC-V register in a host register without spilling to memory!

And the RV32E-enabled rustc seems to work pretty well; if push comes to shove we could just offer prebuilt binary artifacts which could easily be installed through rustup until the LLVM patch gets upstreamed, so I think “RV32E is not supported by the compiler” is not a problem at this point. (And there are other people who want RV32E support too, so I think its inclusion in LLVM should just be a matter of time.)

On the more negative side, I’m not sure how much more performance we can extract from the current dead-simple compilation model. I really like how simple and blazingly fast it is, but because it is essentially just a linear loop over all of the opcodes which directly emits host machine code, there isn’t much room for optimization besides being optimal at the per-instruction level (which isn’t that hard, because most RISC-V instructions map to a single AMD64 instruction, or at most a few) and maybe some macro-fusing of RISC-V instructions (which, from my experiments, doesn’t actually give much of a speedup, at least on my benchmark). So I think we have two options here:

  1. Just accept the current performance as-is in exchange for the extra simplicity and super fast compilation times. (*cough* it’s still an order of magnitude faster than solana_rbpf *cough*, and let’s not forget that this benchmark is meant to be an extreme example, so in practice it perhaps won’t matter that much.)
  2. Write a preprocessor/precompiler/optimizer for the bytecode (something like wasm-opt) which could be automatically invoked by cargo-contract and which would apply additional expensive/complex/O(n^2) optimizations offline, possibly emitting custom instructions/macro-ops for our JIT compiler, which could then stay simple and stupid. (This would be a good option, except that without actually doing it I can’t tell exactly how much performance we’d gain.)

What would also be fun is an in-depth comparison of the code emitted by my JIT against the native AMD64 code emitted by rustc, to see if we could glean some ideas from it (e.g. see which parts need the most improvement and concentrate on those). But at this stage that’s somewhat out of scope.

So, possible next steps? (I’m continuing from the points from my original post in this thread)

  • Integrate into Substrate and Ink! (@Alex has already started on this)
  • Investigate how compact we can encode the bytecode.
  • Start to clean up the code, remove jank, add error handling, add tests, prototype sandboxing.
  • Optimize the speed of the JIT compilation itself and maybe benchmark it end-to-end. (I’m not expecting surprises though, since the JIT compilation is essentially a simple O(n) process, so comprehensive benchmarks are probably unnecessary; it’s just that my current code is not very optimal in this regard.)
  • Add an AArch64 backend (should be relatively simple after the crate’s cleaned up and refactored)
4 Likes

I wouldn’t have expected that. I assumed it would become bigger but faster. But I think the performance hit is a great tradeoff, since we gain simplicity and predictability for our JIT. Maybe the hit is just due to the less mature LLVM backend.

I think this is the way to go for now. It is faster than wasmi, with probably the same or even faster startup speed. Even as-is it is a huge improvement, and that’s before even considering the simplification of the system.

Would that really help? The RISC-V code is already optimized by LLVM. Isn’t wasm-opt only a thing because LLVM has a hard time optimizing stack bytecode?

1 Like

Yeah, and one extra thing worth highlighting here is that, compared with wasmi as we currently use it, this is faster than the numbers in my original post may imply: I was running wasmi natively for my benchmarks, while we currently run wasmi under wasmtime, where it can be up to twice as slow as on bare metal (see here for some numbers: https://github.com/paritytech/substrate/pull/12173#issuecomment-1240466579).

Yes, but that’s code optimized for running on RISC-V hardware, not necessarily for recompiling to something else. (Although you’re right that it might also be an issue of a less mature LLVM backend.)

It’s possible we could still improve the performance if we extended it with e.g. some higher-level meta-instructions that’d be just as easy to JIT (or even easier) but would make it possible to emit more optimal host code. (But again, I don’t know by how much; I’d have to investigate why the JITted RISC-V code runs slower than native code. In theory it should be possible to bring them closer performance-wise.)

1 Like