[Revive] Deterministic Verifiable Contracts

Motivation

When implementing a smart contract that handles a lot of economic value, you really want to make sure it isn’t vulnerable. A smart contract is immutable, and once a vulnerability is exploited there is no rollback. It is acceptable to roll back state changes caused by a bug in the protocol itself, like Bitcoin’s Overflow Incident, but it is not acceptable to roll back due to a user’s fault, such as losing funds or deploying a vulnerable smart contract. Ethereum learned this during the controversial DAO Hard Fork, where a single vulnerable smart contract caused Ethereum and Ethereum Classic to split into two competing systems.
Since then, the industry has adopted new practices. One is to hire renowned audit companies to review the code before publishing it; for example, the contract used in Polkadot’s ICO, polkadot-claims, was audited by Chain Security.
Usually what is audited is the high-level source code; however, only the compiled binary is stored on-chain. How can you be sure that binary was generated from the same audited high-level code?

3 Likes

The challenge

Maybe your first thought is: “Check out the source code locally, compile it, then check if the local binary matches the on-chain binary. Easy peasy!”
Essentially this is what revive does using the `source.hash`, but it isn’t that simple; very strict requirements must be followed:
1. You must somehow have access to the off-chain source code (obviously).
2. You must reproduce the exact same steps used to compile the original binary, which means the exact same compiler version, settings and tools.
3. You must guarantee all steps are “pure”, or deterministic, meaning a given source always generates the exact same binary. While this is straightforward with solc, it is not simple in Rust: you must use the same compiler version and the right Rust toolchain version, there are global dependencies like cargo-contract you may need to replace, etc. That’s why ink! suggests using a Docker image.
4. Assuming we have a fully deterministic compilation pipeline, are we done? Not yet. While we guaranteed that the same code always generates the same binary, we haven’t guaranteed that this is the ONLY possible code that generates that exact binary. It is a subtle difference with big implications: without it, anyone can produce multiple sources for the same binary. I can pretend a given contract contains some arbitrary code, yet still match the binary by using Rust shenanigans to keep that arbitrary code out of the final binary, and so on. This makes tools like Etherscan’s Verify Contract unviable in the ink! ecosystem, because Etherscan assumes there is only one valid source+metadata per binary; that’s why it can allow anyone to trustlessly publish the source and metadata of any on-chain binary.
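
The comparison itself is the easy part once the build is reproduced. A minimal sketch, assuming the binary was rebuilt with the pinned toolchain; SHA-256 here is only a placeholder for whatever hash function the chain actually uses for its code hash:

```rust
// Minimal sketch of the final verification step. The build reproduction
// (steps 2 and 3 above) is assumed to have happened already; SHA-256 stands
// in for the chain's real code-hash function.
use sha2::{Digest, Sha256};

/// Returns true if the locally compiled binary matches the on-chain code hash.
fn binary_matches(local_binary: &[u8], onchain_code_hash: &[u8; 32]) -> bool {
    let local_hash: [u8; 32] = Sha256::digest(local_binary).into();
    &local_hash == onchain_code_hash
}
```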

Polkadot and ink! approach

Polkadot and ink! mainly rely on Docker images to compile the source deterministically; Docker-based tools like srtool and cargo-contract are used to compile Wasm and RISC-V binaries deterministically.

Advantage:

  • The cargo-contract Docker image contains more than just the compiler; it also includes all the development tools needed.

Downsides:

  • Incomplete: The ink! metadata doesn’t contain sufficient information to fully deterministically regenerate the same binary; for example, it includes the compiler version, but not the hash of the Docker image used to compile it, if any. Personally I don’t think Docker images should be the only possible way to create deterministic binaries; it was OK for runtime code, but it is inconvenient for smart contracts.
  • Disk Space: Docker images are big and not easy to distribute; contracts-verifiable:6.0.0-beta.1 is 500 MB compressed and 1.5 GB after you instantiate the actual container on aarch64-macos.
  • Slow: The official image was built for amd64; running it on aarch64 machines requires emulation, which is very slow even on high-end Apple Silicon machines.
  • Not “always” reproducible: If Docker’s registry goes offline, simply rebuilding the same Dockerfile locally doesn’t guarantee you get the same image, because fetching external dependencies isn’t a pure step. If you try to build an old ubuntu:14.04 Dockerfile today it doesn’t work: the repositories referenced in /etc/apt/sources.list no longer exist, apt-get update no longer works, etc.

Solidity approach

The Solidity compiler appends a CBOR-encoded metadata hash at the end of every contract’s bytecode; this allows tools like Etherscan to verify and index deployed contracts.

  • The Solidity compiler is deterministic and self-contained; every release is compiled for all major targets, including Wasm (solc-js).
  • Instead of the compiled code’s code hash, Solidity uses the contract metadata hash, which covers the compiler settings, the compiled bytecode and, believe it or not, the plain-text source and its dependencies. This means any change in the plain-text source results in a different hash. This has downsides too, but it is how they guarantee a one-source-to-one-binary relationship, which allows “full verification”, as well as pinning the files publicly on IPFS so they can be retrieved by the metadata hash.
  • The solc-js compiler allows tools like Remix IDE to exist: it provides a complete developer experience directly in the browser, which is very appealing for beginners.
  • Compared to Docker images, solc runs natively and the binaries are small (less than 10 MB); they are easily distributed via GitHub releases or package managers like the Solidity Version Manager (SVM).
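
As a side note on the appended metadata mentioned above: by convention the last two bytes of the deployed bytecode encode the length of the CBOR payload that precedes them, so verifiers can strip it off. A rough Rust illustration (not any particular tool’s code):

```rust
// Sketch of how tools separate Solidity's appended metadata from the code:
// the final two bytes are a big-endian length of the CBOR blob in front of them.
fn split_solidity_metadata(bytecode: &[u8]) -> Option<(&[u8], &[u8])> {
    if bytecode.len() < 2 {
        return None;
    }
    let len_bytes = [bytecode[bytecode.len() - 2], bytecode[bytecode.len() - 1]];
    let cbor_len = u16::from_be_bytes(len_bytes) as usize;
    let trailer = cbor_len + 2; // CBOR metadata plus the 2-byte length itself
    if trailer > bytecode.len() {
        return None;
    }
    let split = bytecode.len() - trailer;
    // (runtime code, CBOR-encoded metadata)
    Some((&bytecode[..split], &bytecode[split..bytecode.len() - 2]))
}
```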

In summary, there are two issues to solve:

#1 Verifiable Binary

OK, we need some information on-chain that allows verifying the plain-text code, the compiler version and the settings. For the source code this can be straightforward if we simply take a fingerprint of the plain-text code and the compiler settings; let’s call it the metadata hash.

Rust is more powerful and more complicated than Solidity; things like cfg and proc-macros can drastically change the final binary based on arbitrary local settings. One way to guarantee the code is pure is to expand all macros using tools like cargo-expand, but the code can get quite ugly…
Open question: should we fingerprint Rust’s expanded code or the plain-text code? I’m not aware of any publicly audited Rust smart contract; I would love to know what tools are used to audit code generated by complex macros.

I think fingerprinting the original code is OK, as long as the metadata also contains all the settings and enabled features needed to reproduce the exact same macro expansion and binary anywhere; the developer must then guarantee that all macros used in the contract are pure.
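
As a rough sketch of what such a fingerprint could cover (nothing standardized here; the field names, the JSON encoding and the choice of SHA-256 are all placeholders):

```rust
// Hypothetical "metadata hash": a digest over the plain-text sources plus
// everything needed to reproduce the same macro expansion and binary.
use serde::Serialize;
use sha2::{Digest, Sha256};

#[derive(Serialize)]
struct BuildMetadata {
    compiler: String,               // e.g. "rustc 1.81.0"
    tooling: String,                // e.g. "cargo-contract 6.0.0-beta.1"
    profile: String,                // e.g. "release"
    features: Vec<String>,          // enabled cargo features
    sources: Vec<(String, String)>, // (path, file contents), sorted by path
}

fn metadata_hash(meta: &BuildMetadata) -> [u8; 32] {
    // Serialize deterministically (sorted sources, fixed field order) and hash.
    let encoded = serde_json::to_vec(meta).expect("serialization cannot fail");
    Sha256::digest(&encoded).into()
}
```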

Where do we store the metadata hash?

Solidity simply appends the metadata hash at the end of the actual binary, but for pallet-revive this doesn’t work:

  • Wasm and PVM have a well-defined structure, so we can’t simply append arbitrary bytes to them. In EVM, for example, those bytes are parsed as valid code, so the Solidity language itself has to be aware of them and guarantee they are unreachable.
  • To verify the contract, tools need to split the binary from the metadata hash, because the metadata hash is also derived from the binary.
  • pallet-revive and pallet-contracts split the contract bytecode from the contract instance, which allows many contract instances to share the same bytecode; appending a unique digest to every binary makes this less useful.

Solutions:

  • Native: Create a dedicated storage item in the pallet to store the metadata hash.
  • Contract: It isn’t possible to include the metadata hash statically in the code, because it is derived from the code itself, but technically it can be provided later, when instantiating the contract; the way it is stored and retrieved must be standardized.
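
For the Contract option, a minimal ink! sketch could look like the following (purely hypothetical and not a standard: the constructor argument, the storage field name and the 32-byte digest size are all assumptions):

```rust
// Hypothetical ink! contract that stores the metadata hash supplied at
// instantiation and exposes it for verifiers to query.
#[ink::contract]
mod verifiable {
    #[ink(storage)]
    pub struct Verifiable {
        /// Fingerprint of the off-chain metadata (source + compiler settings).
        metadata_hash: [u8; 32],
    }

    impl Verifiable {
        #[ink(constructor)]
        pub fn new(metadata_hash: [u8; 32]) -> Self {
            Self { metadata_hash }
        }

        /// A verifier compares this value against the hash it computes
        /// locally from the published source and settings.
        #[ink(message)]
        pub fn metadata_hash(&self) -> [u8; 32] {
            self.metadata_hash
        }
    }
}
```

Of course, nothing forces the deployer to pass the correct hash, so tooling would still need to recompute it from the published source and settings and check it against the stored value.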

1 Like

Hey @Lohann. It’s definitely important to have deterministically verifiable contracts on chain, similarly to how it works with Solidity.

I haven’t looked at the details yet of how to achieve this for Rust contracts, but it all sounds like there is no show stopper and all the challenges can be solved (they are just not yet solved today).

Thanks for all your suggestions; I will come back to them in the future.

2 Likes

For verification you can use the plain-text code. Verification is something different from auditing. For audits, you need to look behind the macros. I’m not sure whether auditors look at the expanded code or just check what the macro code is doing.

For Rust, right now we will not get around using a fixed build environment, a.k.a. Docker, to achieve reproducible builds. We have been doing the same for the runtimes for years and it is not such a big problem. For sure, this is not a perfect solution. Daniel from Virto recently brought up the idea of having a compiler running inside a JAM service. This would be the ultimate decentralized way to verify that a contract matches the actual source code, where you don’t need to trust some random block explorer. But yeah, I’m not sure if this is anywhere near doable. (It’s worth some experimentation once we have a compiler that can be compiled to RISC-V.)
To get rid of Docker, this is something that needs to be fixed fundamentally by Rust itself; it is tracked here: Tracking Issue for Reproducible Build bugs and challenges · Issue #129080 · rust-lang/rust · GitHub

1 Like

I never liked the idea of appending the metadata to begin with. It’s a weird workaround which sort of creates two bytecode hashes per blob. The clean solution is what you proposed: the contract deployer can store whatever extra data is required to reproduce the build directly in the pallet.

The problem with the clean approach is that it will break the incumbent workflow. The lesson I learnt: for everything we try to do better than on Ethereum, there will be someone screaming loudly enough (i.e. people are not at all willing to adapt, and we then get blamed that our tech stack “isn’t compatible enough”).

So what I suggest is to leverage the fact that we actually own the PVM blob container format (unlike Wasm or EVM). Seemingly, the only thing we need is a small change to the PVM loader and disassembler to accept blobs whose actual size exceeds the length found in the header. PVM can simply ignore the extra data, and the loader just always uses the length specified in the header rather than the actual byte count (this completely solves your concern about intermingling metadata with code bytes), which allows us to follow the Ethereum model. @koute I can implement the change if this looks good from your side.
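
Roughly, the proposed loader behaviour would be something like this (just a sketch of the idea, not an actual implementation):

```rust
// Sketch of the proposed behaviour: the loader trusts the program length from
// the blob header and treats any trailing bytes as opaque metadata.
fn split_by_header_length(blob: &[u8], program_len_from_header: usize) -> Option<(&[u8], &[u8])> {
    if program_len_from_header > blob.len() {
        return None; // malformed blob: the header claims more bytes than are present
    }
    // (program as seen by PVM, trailing metadata ignored by the loader)
    Some(blob.split_at(program_len_from_header))
}
```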

1 Like

I think that as a JAM service this is much more doable than with ZK VMs, mainly because it’s much more resource-efficient, but also thanks to CoreVM making it look more like a traditional user-land. FWIW, regarding resources, our own Solidity compiler (resolc), which is an LLVM-based compiler, does run in the browser (which I consider a similarly constrained environment). It just doesn’t work for larger contracts (it was reported to use too much memory).

My take is that we are a long way from this being doable in the general sense, but it may already work for dApps, because contract code is generally small.

(I also smell a really good business opportunity for JAM here if we can make LLVM work beyond contract-sized code bases. Verifiable computation is a concern beyond blockchain; the security industry may actually be very interested in such a service!)

2 Likes

To be fair, even with the WASM blob format we could do it because it supports custom sections.

But, there’s no need to add any new code to ignore any extra data. This code already exists.

The PVM blob is designed in a way that you can have custom sections with custom data, and even the already deployed pallet-revive will ignore these new sections as long as you give them correct IDs and put them at the end of the blob. The only change needed would be to add this to the linker/toolchain so that your custom section is actually emitted.

There is more. Although this isn’t yet implemented (but it’s trivial to implement, if you’re interested), the PVM blob format was also designed so that you can calculate a hash of the program and be able to trivially skip all of the extra stuff.

Because, you see, metadata isn’t the only thing that can live inside a PVM blob. You can also have things like debug info, for example. So the format was designed like this: everything which affects the visible runtime semantics (the program) goes first, and everything which doesn’t goes last. So if you want to calculate a fingerprint of the program, that’s easy: you read the first byte of the blob, you skip 8 bytes**, then you read all of the bytes until you either encounter a section ID that is 0 (i.e. end of file) or one with its most significant bit set (i.e. the first optional section), and you hash all of that together. There you go: a fingerprint of the program which ignores metadata.

** - The “skip 8 bytes” part is somewhat unfortunate; those 8 bytes there contain the length of the whole blob, and in fact this wasn’t in my original design, but I’ve added it on request to make it possible to skip the whole blob in O(1) rather than in at most O(~130) operations. So we need to skip it since the “length of the whole blob” obviously also includes the metadata/extra stuff.
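
Schematically, that fingerprint could look like the sketch below. The section encoding assumed here (a one-byte section ID followed by a varint length and the payload), the varint format and the use of SHA-256 are illustrative assumptions rather than the normative polkavm encoding, so treat it as pseudocode with types:

```rust
use sha2::{Digest, Sha256};

/// Stand-in ULEB128-style varint decoder; the real PVM encoding may differ.
fn read_varint(blob: &[u8], mut offset: usize) -> Option<(u64, usize)> {
    let mut value = 0u64;
    let mut shift = 0u32;
    loop {
        let byte = *blob.get(offset)?;
        offset += 1;
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some((value, offset));
        }
        shift += 7;
        if shift >= 64 {
            return None; // malformed varint
        }
    }
}

/// Hashes the first byte plus every mandatory section, skipping the 8-byte
/// "length of the whole blob" field and stopping at the end-of-file marker (0)
/// or the first optional section (most significant bit set).
fn program_fingerprint(blob: &[u8]) -> Option<[u8; 32]> {
    if blob.len() < 9 {
        return None;
    }
    let mut hasher = Sha256::new();
    hasher.update(&blob[..1]); // the first byte is part of the program identity
    let program_start = 1 + 8; // skip the blob-length field
    let mut offset = program_start;
    loop {
        let id = *blob.get(offset)?;
        if id == 0 || id & 0x80 != 0 {
            break; // everything from here on doesn't affect runtime semantics
        }
        let (len, after_len) = read_varint(blob, offset + 1)?;
        offset = after_len.checked_add(len as usize)?;
        if offset > blob.len() {
            return None; // malformed: section claims more bytes than are present
        }
    }
    hasher.update(&blob[program_start..offset]);
    Some(hasher.finalize().into())
}
```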

1 Like

Just to drop an idea here: I have the intuition that a paradigm shift along the lines of https://www.unison-lang.org/, backed by a decentralized computing network, could be a cosmic banger.

1 Like