We should ban floating point operations from parachain runtimes! Or not?

It turned out that many parachain runtimes use floating-point math inside. None of them do so intentionally, though; instead, it’s the dependencies (sometimes not even first-level ones) that introduce the floating-point code.

Floating-point operations are a known source of non-determinism, and for that reason they are banned from relay chain runtimes, but not from parachain ones.

The non-determinism in question mostly concerns NaN values, and in the current implementation it is mitigated by enabling “NaN canonicalization”, a Wasmtime feature. Another concern is that the Wasmtime authors may change the logic that handles or optimizes floating-point operations, which would introduce floating-point jitter and thus inter-version non-determinism. Finally, determinism may also break across node implementations, which may use different Wasm engines that behave differently with respect to floating-point operations.
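
For reference, a minimal sketch of turning that Wasmtime knob on; the actual embedding in the node is more involved, this only shows the feature itself:

```rust
use wasmtime::{Config, Engine};

fn main() {
    let mut config = Config::new();
    // Rewrite every NaN produced by a float instruction into a single
    // canonical bit pattern, removing NaN-payload non-determinism.
    config.cranelift_nan_canonicalization(true);
    let _engine = Engine::new(&config).expect("failed to create engine");
}
```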

The options for handling that are obvious.

Option 1: Impose a ban on floating-point operations for parachain runtimes. Enforce it during PVF pre-checking, and create tooling that helps parachain authors check that their code is compliant before they submit it on chain (a sketch of such a check follows the pros and cons below).

Pros:

  • Agrees well with the relay chain logic: if we ban floating point from the relay chain, why shouldn’t we ban it from parachains?
  • Eliminates that kind of non-determinism for good, across all versions and implementations, with no future surprises.

Cons:

  • Requires a fair amount of engineering and organizational work from our side, as well as the cooperation of parachain developers.
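
As a rough illustration of what the compliance tooling could look like, here is a hypothetical sketch (not the actual PVF pre-checking code) that scans a compiled Wasm blob for floating-point instructions using the wasmparser crate; a real checker would cover the full f32/f64 opcode space, only a few representative opcodes are matched here:

```rust
use wasmparser::{Operator, Parser, Payload};

/// Hypothetical helper: collect the floating-point instructions found in
/// the code section of a Wasm module.
fn find_float_ops(wasm: &[u8]) -> wasmparser::Result<Vec<String>> {
    let mut found = Vec::new();
    for payload in Parser::new(0).parse_all(wasm) {
        if let Payload::CodeSectionEntry(body) = payload? {
            let mut ops = body.get_operators_reader()?;
            while !ops.eof() {
                match ops.read()? {
                    // A few representative float opcodes; a real tool
                    // would match every f32.* and f64.* instruction.
                    op @ (Operator::F32Const { .. }
                    | Operator::F64Const { .. }
                    | Operator::F32Add
                    | Operator::F64Add
                    | Operator::F32Sqrt
                    | Operator::F64Sqrt) => found.push(format!("{op:?}")),
                    _ => {}
                }
            }
        }
    }
    Ok(found)
}
```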

Option 2: Ensure and prove deterministic floating-point behavior in the current system, and create an action plan in case inter-version or inter-implementation non-determinism ever shows up.

Pros:

  • A modest amount of R&D solely on our side, with no changes forced on parachain authors.

Cons:

  • Does not guarantee that determinism won’t break at some point, in some implementation or version.

There are probably other options I’ve missed. Every thought is appreciated.

CC @gavofyork @rphmeier @eskimor @koute

Yeah, it’s pretty error-prone to allow float operations inside the Wasm runtime. Our team is aware of that, so whenever we want to run calculations on real numbers, we use a fixed-point library. However, a developer who is not careful enough may end up using floating point without even noticing it, which is a big hidden danger.
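
For illustration, a minimal sketch of that approach using Substrate’s sp_arithmetic fixed-point types (the particular library is my assumption here; the standalone fixed crate is another common choice):

```rust
use sp_arithmetic::{FixedPointNumber, FixedU128};

fn main() {
    // 1/3 as a fixed-point number (18 decimal places of precision),
    // computed with deterministic integer math: no float instructions.
    let third = FixedU128::saturating_from_rational(1u128, 3u128);
    let result = third.saturating_mul_int(300u128);
    assert_eq!(result, 99); // truncates instead of rounding
    println!("{result}");
}
```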

On the other hand, I have sometimes heard people from the WASM community say that the WASM spec itself actually REQUIRES a deterministic floating-point implementation, so as long as an interpreter follows the WASM standard, floating point should be allowed. I couldn’t find any specific discussion of this debate online, so I just want to raise awareness of it. If it turns out that floating-point operations in WASM are indeed deterministic, this problem can be solved elegantly.

@dmitry.sinyavin As a rule of thumb, we try to stay away from relying on specific engine implementation details; however, in this case it was deemed fine because the actual implementation of NaN canonicalization is pretty straightforward. So the thinking was that it was OK to rely upon, since we could reimplement this pass ourselves and put it into the preparation stage as Wasm-level instrumentation. That instrumentation algorithm could then be followed by the other teams as well.
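
For clarity, a sketch of the semantics such Wasm-level instrumentation would enforce (a hypothetical helper, not the actual Wasmtime pass): after every float instruction that can produce a NaN, the result is rewritten to a single canonical bit pattern, so NaN payloads can never differ between engines:

```rust
fn canonicalize_nan_f32(x: f32) -> f32 {
    // Canonical quiet NaN: sign 0, all exponent bits set, top mantissa bit set.
    if x.is_nan() { f32::from_bits(0x7fc0_0000) } else { x }
}

fn canonicalize_nan_f64(x: f64) -> f64 {
    if x.is_nan() { f64::from_bits(0x7ff8_0000_0000_0000) } else { x }
}

fn main() {
    let odd_nan = f32::from_bits(0xffc0_0001); // a NaN with a non-canonical payload
    assert_eq!(canonicalize_nan_f32(odd_nan).to_bits(), 0x7fc0_0000);
}
```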

On the other hand, I have sometimes heard people from the WASM community say that the WASM spec itself actually REQUIRES a deterministic floating-point implementation.

@hangyin it’s actually the opposite: the specification explicitly permits non-deterministic outputs for some floating-point operations. You can convince yourself by checking out the spec here and here. Hence there is no debate around that.

But do you believe that NaN canonicalization is the only problem to solve to make things deterministic?
I’m thinking about different implementations handling constants differently, e.g. one doing constant folding and another not (and the same goes for LICM, code reordering, etc.). Could that be another source of FP non-determinism?

Oh, yes, I think we can avoid that line of reasoning. The standard is precise about where it allows non-determinism, and obviously, if an implementation exhibits non-determinism where it is not supposed to, then that implementation is not compliant.

Also, I can’t recommend studying the specification closely enough, if you haven’t done so already. It might seem intimidating, but it is actually rather simple.

In general, IEEE floating-point numbers can be entirely deterministic, and the spec guarantees determinism as long as certain rules are followed. There’s even a short chapter in the spec on this issue (look up “754-2019 - IEEE Standard for Floating-Point Arithmetic” on SciHub and go to chapter 11 if you’re interested). On the other hand, if those rules are not followed, the results can differ between machines/architectures.

Historically, floating-point numbers got a bad reputation for non-determinism for a few reasons. For example, the original x86 FPU actually uses 80-bit floating-point numbers for calculations (and on 32-bit x86 you can’t avoid it, because the calling convention requires it!), which will produce different results compared to an architecture that uses normal 64-bit floating-point numbers. Or a rogue library changes the per-thread floating-point rounding mode and doesn’t change it back, or the compiler decides to emit x87 FPU code rather than SSE code for calculations, or denormals are disabled, or the compiler uses FMA instructions (but it shouldn’t do this without -ffast-math by default), or you use one of the operations which are explicitly implementation-defined in the IEEE spec, etc. I’m not a floating-point expert, and even I can list a bunch of these things off the top of my head.

Suffice it to say, I’d sleep better if we didn’t have any floating-point operations in the runtime. Even if the WASM VM guarantees determinism, there might still be bugs, or it might not sanitize the global FP state before execution and something might accidentally change it, etc. But in practical terms that train has already left the station, and it would be tricky to actually enforce this kind of ban. Floating-point operations can be hidden in e.g. a transitive dependency, so it’d be a horrible developer experience to now tell everyone that they need to hunt down all of those places and get rid of them. And it’s not like FP math can be disabled in WASM (it’s part of the baseline spec), nor can it be disabled in rustc (it doesn’t allow enabling softfloat for WASM, AFAIK).

So practically I don’t think we have much of a choice here. As @pepyakin said, the WASM spec is clear about where non-determinism is allowed, so we need to make sure that we’re on top of it and don’t let any potential non-determinism bleed through.

One extra thing we could do is compile some floating-point-heavy code into WASM, change the floating-point settings (see e.g. https://doc.rust-lang.org/beta/core/arch/x86_64/fn._mm_setcsr.html - this is per-thread and persistent!), see if that affects the results that wasmtime produces, and add that as a test. I don’t know whether it affects wasmtime’s floating-point calculations, but my guess is that it probably does (I don’t think they sanitize the FP state before execution?), and if so, a stray third-party crate/library calling this could screw up our determinism guarantees (probably unlikely, but considering we have 1000+ dependencies it could maybe happen).
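
A sketch of what that test could look like (x86_64-only, and whether wasmtime is affected is exactly the open question): flip the FTZ/DAZ bits in the per-thread MXCSR register and check whether a subnormal-producing computation still yields the same bits. In the real test the computation would run inside wasmtime rather than natively:

```rust
#[cfg(target_arch = "x86_64")]
fn main() {
    use std::arch::x86_64::{_mm_getcsr, _mm_setcsr};
    use std::hint::black_box;

    const FTZ: u32 = 1 << 15; // flush-to-zero
    const DAZ: u32 = 1 << 6; // denormals-are-zero

    // Halving the smallest normal f32 produces a subnormal result.
    let compute = || (black_box(f32::MIN_POSITIVE) / 2.0).to_bits();

    let before = compute();
    unsafe { _mm_setcsr(_mm_getcsr() | FTZ | DAZ) };
    let after = compute();

    // Natively these differ (a subnormal vs. zero); a deterministic Wasm
    // engine must not let the host MXCSR state leak through like this.
    println!("before: {before:#010x}, after: {after:#010x}");
}

#[cfg(not(target_arch = "x86_64"))]
fn main() {}
```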

Do you mean the Wasm spec or the IEEE floating-point one? Anyway, I’ve read both, of course. The Wasm spec is not very helpful in that sense, as it only declares everything to be IEEE-compliant, with some limitations (no exceptions observed, etc.). The IEEE spec says much more about reproducibility, but it lists a bunch of guidelines for achieving that reproducibility, and we cannot be sure that implementation authors follow those guidelines.

But I get your point: if things turn out to be non-deterministic here, it’s a bug, and it’s not ours.

Wow, there’s a 2019 update! I think the last one I read was 2008. I wonder what has changed.

That’s a great idea, thank you! I’ll give it a try.

The Wasm spec is not very helpful in that sense, as it only declares everything to be IEEE-compliant

I have to disagree. The Wasm spec goes beyond that and explicitly defines the possible results. Under the spoiler, you can see how you could have used the spec to convince yourself.

First, how does the spec describe non-determinism in execution? Well, from here we see that the execution semantics are described through reduction rules taking one configuration to another.

A configuration is a syntactic description of a program state. Each rule specifies one step of execution. As long as there is at most one reduction rule applicable to a given configuration, reduction – and thereby execution – is deterministic. WebAssembly has only very few exceptions to this, which are noted explicitly in this specification.

So, I claim that the instruction f32.sqrt can be non-deterministic. It’s a numeric instruction and is defined in terms of numeric operators. We are then interested in these statements:

Where the underlying operators are partial, the corresponding instruction will trap when the result is not defined.

and

Where the underlying operators are non-deterministic, because they may return one of multiple possible NaN values, so are the corresponding instructions.

We know that f32.sqrt is a unary operation and is described by those semantics.

At step 3, there is a branch conditioned on whether the underlying operator is defined. This relates to what we saw above regarding the partiality of the operator.

Now, we go to the numerics. There we see the following statements:

Some operators are non-deterministic, because they can return one of several possible results (such as different NaN values). Technically, each operator thus returns a set of allowed values. For convenience, deterministic results are expressed as plain values, which are assumed to be identified with a respective singleton set.

and

Some operators are partial, because they are not defined on certain inputs. Technically, an empty set of results is returned for these inputs.

Let’s then see how the numeric operator for f32.sqrt is defined: it’s fsqrt with N = 32.
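
Paraphrasing that definition from memory (the linked spec text is authoritative), fsqrt_N(z) behaves roughly as follows:

  • if z is a NaN, return an element of nans_N{z};
  • if z is −∞, return an element of nans_N{};
  • if z is +∞, return +∞;
  • if z is ±0, return that same zero;
  • if z is negative, return an element of nans_N{};
  • otherwise, return the square root of z, rounded to N bits.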

So, is the operator partial? No, it doesn’t seem so, because it’s defined over the complete domain.

Then, is it non-deterministic? It seems so. For example, in case the input is negative, it returns an element of nans_N{}, which is defined here. Even in the “most deterministic” case, when the input set to that function is empty, the result set consists of at least +nan(n) and -nan(n) with n = canon_N, meaning the sign is not deterministic. With non-canonical inputs, the set is larger still.
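
As a hypothetical illustration of that last point, you can inspect the NaN bits that a negative square root produces on your host; since the spec allows any element of nans_N{}, without canonicalization even the sign bit is not guaranteed to match across engines or platforms:

```rust
fn main() {
    // black_box prevents the compiler from folding sqrt(-1.0) into a
    // constant NaN of its own choosing.
    let bits = std::hint::black_box(-1.0f32).sqrt().to_bits();
    // The canonical quiet NaN for f32 is 0x7fc00000; on x86 this typically
    // prints 0xffc00000, i.e. the canonical NaN with the sign bit set.
    println!("sqrt(-1.0) bits: {bits:#010x}");
}
```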

Just FYI, Falcon is the nicest post-quantum signature for consensus & blockchains. AFAIK, there is no floating point in Falcon verifiers, but Falcon signers employ floating-point arithmetic for sampling a Gaussian distribution. It’s plausible off-chain workers would do this, although alternatives exist: