I like magic tricks. Watching the performer do the show and attempting to catch the “slight-of-hand” that makes the impossible seem possible. It’s fun and makes me believe in magic for a few moments. So I’m going to give a shot at doing some floating point magic tricks.

Throughout my performance I’ll use two 32-bit formats: IEEE binary32 (aka singles or float in a number of languages) and posits (with es=2).

Goldberg’s long thin triangle performance LIVE on stage

We are going totally compute the area $A$ of a thin triangle using the classic form of Heron’s equation. Life’s way too short to think about computation!

\[\begin{align*} s & = \frac{a+b+c}{2} \\ A & = \sqrt{s\left(s-a\right)\left(s-b\right)\left(s-c\right)} \end{align*}\]

With a wave of my hand I declare the lengths to be:

\[\begin{align*} a & = 7 \\ b & = c= \frac{1}{2}\left(a+3~\text{ulp}\left(a\right)\right) \end{align*}\]

and let’s pull out of my magic computationalizing box the exact, the super-mojo-exponent and tapering-precision-just-exactly-where-you-need-it posits and poor old binary32 results (wrong digits in red):

\[\begin{align*} \text{exact} & = 7.831550 \times 10^{-3} \\ \text{posit} & = 7.831550 \times 10^{-3} \\ \text{IEEE} & = \color{red}{9.043097} \times 10^{-3} \end{align*}\]

Booyah! In your face ya old retired, don’t publish papers anymore, stuck in the 80s people! Posits are, like, exact and you got ZERO digits correct!

Ahhh…okay. Let’s look again:

\[\begin{align*} \text{exact} & = 1000000001001111111001110 \times 2^{-31} \\ \text{posit} & = 100000000100111111100111\color{red}{1} \times 2^{-31} \\ \text{IEEE} & = 100\color{red}{1010000101001011111110} \times 2^{-31} \end{align*}\]

CAT-A-STROPIC! only 3 out of 24 binary digits correct vs. 23!

I’m going to do this trick again, but first I want to tell a story. Way back in 1985..raise your hand if you were even born yet. Yeah? A couple of people. Ok, well, back in the day integer adders were super cheap and integer multiplier super expensive SO the IEEE committee was all like “Okay..let’s make the exponent part bigger (since that needs adders) and the fractional bits part smaller(that the multipliers)!”. I suppose that was fine then, but man-oh-man have things changed. Like we just saw…singles only have 24 bit of precision (AT MOST!) and totally fail at a simple computation. But at the same time can represent really stupid big numbers! Get this: the largest finite number a single can represent is almost $2^{128}$. I think mathematicians call numbers like this googleperplexing…probably because even searching the web won’t help you understand them. See what I mean. Way too big. That’s why I’m using posits with es=2. With that we still represent numbers up to $ {2^{120}} $ AND have more precision. Maybe a better choice would be to use es=3..that would slightly hurt our precision, but we’d be able to drop-in-replace IEEE doubles with 32-bit posits (how dope is that?)…but I’m going to stick with my choice for now.

Let’s do the same thing again, but this time with a reasonably big, but not googleperplexing, number: $a=3670016$.

\[\begin{align*} \text{exact} & = 100000000100111111100111 \times 2^{8} & \approx 2.1527203 \times 10^{9} \\ \text{posit} & = \color{red}{0} \\ \text{IEEE} & = 100\color{red}{101000010100101111111} \times 2^{8} & \approx 2.4857475 \times 10^{9} \end{align*}\]

Yeah, yeah okay…millions is silly, let’s use a more realistic number: $a=917504$

\[\begin{align*} \text{exact} & = 100000000100111111100111 \times 2^{-6} & \approx 1.3454502 \times 10^{9} \\ \text{posit} & = \color{red}{0} \\ \text{IEEE} & = 100\color{red}{101000010100101111111} \times 2^{-6} & \approx 1.5535922 \times 10^{9} \end{align*}\]

Shut! Up! $a=3584$

\[\begin{align*} \text{exact} & = 10000000010011111110011100 \times 2^{-14} & \approx 2.0529940 \times 10^{3} \\ \text{posit} & = 100\color{red}{10100001010010111111101} \times 2^{-14} & \approx 2.3705936 \times 10^{3} \\ \text{IEEE} & = 100\color{red}{10100001010010111111100} \times 2^{-14} & \approx 2.3705935 \times 10^{3} \end{align*}\]

Okay let’s not lose sight of what we’ve seen so far. Which is that posits can be much more accurate than binary32. What I’ve ignored is the common consensus that the most important interval is $\left[-1, 1\right]$. Now, for once, IEEE got this right and HALF of the values are on this interval. Of course posits get it right too so let’s try values less than one…


I’m going to start with some full disclosure. The reason for this blog post is stupid. I made a shitpost poll on tweeter which I was expecting to be ignored but it wasn’t. I offered to make a critique and my (very cruel) followers voted yes so here we are. So here’s what I’m going to do. I’m going to keep it to the bare minimum.

  • I’m only going to address general purpose computations. This eliminates bit sizes smaller than 32 and usage in embedded or otherwise special purpose hardware. Talking about application specific usages requires knowing and addressing the specific needs of that application.
  • No meaningful analysis will be performed. Although more than others I could point to.
  • If I perform a “trick” that shows posits in a negative light I will explain how it works.
  • I am not an expert in scientific computing/NA nor hardware design, so I’ll only be pointing out “obvious” things. On this note I’m not being a jerk to people for which this stuff isn’t “obvious”. Hell, ya’ll is my target audience! I will be tossing out concepts without details or which I’ll sketch later. At least you’ll have some keywords to search on…assuming you care.
  • I’m also going to completely ignore some details. This is too long as it is and keywords are provided. Examples include mostly ignoring overflow/underflow issses and working with denormals (for IEEE case).
  • Yeah this stuff is hard but if you’re getting by without any of it then you’re officially doing it right. If you’re running into trouble then there’s no magic pill and you’re going to need to learn some basics. Luckily there’s a wealth of resources just a web search away and people you can hit up for questions (yeah it’s not the 80s anymore). And of course there’s always the option of grabbing a library written by people that live-eat-and-breath this material.
  • Counting correct digits isn’t my thing. I think it’s a poor measure (more later). I’m partially running with that to be the same.

Another bookkeeping item is the tone that I’ve chosen for this post. I’m sure some readers will find it on somewhere on the range of objectionable to “toxic”. My reasoning is simple. The authors allow themselves to be jerks and provide egregiously disingenuous examples and statements. In their defense I suppose I should point out the generous sprinkly of weasel words which make many statements not complete alt-facts.

Back to the future

Since I’ve broken my big post up into parts we need Bill and Ted to write up some posits..err..post-its to skim some points that don’t yet exist yet.

  • We have two different finite sets designed to model reals (extended reals for IEEE and projectively extended reals for posits). For a fixed $b$ bit package we can represent up to $2^b$ distinct elements
  • They both have a sign bit and are therefore symmetric about zero.
  • They both store a power-of-two exponent and some number of fraction bits and so are both binary floating point models. The fractional part means that both are piece-wise linear (fixed point) on power of two intervals.
    • IEEE uses a fixed number of bits for the exponent (8 for binary32) and the remaining are fraction bits. Those on the normal range have an additional implied bit and denormals do not.
    • Posits choose to encode the exponent using a variant of the variable length Rice encoding and the remaining bits are the fraction. All non-zero finite posit values have an additional implied bit.

So ignoring the “devil’s in the details” parts the only important difference between the two is how finite values are distributed. IEEE’s fixed precision on the normal range is based on the commonly accepted notion of “scale-invariance”. Posits are based on the notation that a tight neighborhood (in a log sense) of one need the lion’s share of points. In this neighborhood posits have a few more bits of precision and they “taper” off as we move away.

One possible argument for tapered precision might be that “gosh there’s a bunch of important constants in this range!” I’ll quote someone that’s well respected and lives-eat-and-breaths this stuff:

Trefethen1: At present, almost nothing in physics is known to more than 12 or 13 digits of accuracy. Thus IEEE numbers (doubles) are orders of magnitude more precise than any number in science. (Of course, purely mathematical quantities like $\pi$ are another matter.)

There’s some specific examples such as the gravitational constant $G$ is known to 4 decimal digits. Note wikipedia lists 6 digits, but then again this 2018 paper says 4, we measure 4. A second point raised is that doubles on a power-of-two interval can represent almost $10^{16}$ values and in physics when we hit about ${10^8}$ discrete things we’re all..this is a continous system! He continues with:

In two senses, then, floating-point arithmetic is far closer to its ideal than is physics. It is a curious phenomenon that, nevertheless, it is floating-point arithmetic rather than the laws of physics that is widely regarded as an ugly and dangerous compromise.

And (partly for fun) the section I’m quoting ends with the Trefethen quote I see most frequently (bolded):

…leading to the current widespread impression that the main business of numerical analysis is coping with rounding errors. In fact, the main business of numerical analysis is designing algorithms that converge quickly; rounding-error analysis, while often a part of the discussion, is rarely the central issue. If rounding errors vanished, 90% of numerical analysis would remain.

To briefly run with the bolded quote the biggest error for practitioners to worry for about is from modeling error. Not the floating point model, but the model used in the computation. A real world example here is the millennium bridge or more simply expecting high school textbook physics equations to hold. The next biggest problem is truncation error which is we can only perform a finite number of steps. Notice that both of these are pure math problems and not computer math ones. (Some truncation error can be removed in pure math when we know a closed-form solution, but this isn’t generally the case). Additionally there’s input data uncertainty but this is neither pure math nor computation.

So one of my points for giving these Trefethen quotes is: Don’t get lost in the numbers. By necessity we’re going to be looking at rounding errors, troublesome cases for floating point (and recall both models are floating point) and measuring\comparing errors. It all to easy to think: “Fuck! Floating point computations are totally broken!”

Triangle trick REVEALED or let’s not break the cardinal rule

Let’s break down the first example of the magician. It’s simply Gustafson’s 128-bit example converted to 32-bit. For both posits & IEEE we have the upper 3 bits set for all three inputs with exponent of $b$ and $c$ smaller by 1. For IEEE’s $b$ and $c$ values the bottom two bits are set. However posits values in this range have two more bits available, so the same numeric value has the bottom two clear. Let’s look what happens when we start to compute $s$. First we perform $a+b$:

         IEEE                                 POSITS

t0=a+b:    111.000000000000000000000            111.00000000000000000000000
         +  11.1000000000000000000011         +  11.100000000000000000001100
         ----------------------------         ------------------------------
          1010.1000000000000000000011          1010.100000000000000000001100
round:    1010.10000000000000000001            1010.10000000000000000000110

The upper bits cause a carry (both increase exp by one) and that trailing bit of $b$ means we need one more bit (now 25) to represent exactly. IEEE has to round to 24 bits and the posits version still has one zero bit at the bottom. Let’s complete the computation of $s$.

t1=t0+c:  1010.10000000000000000001            1010.10000000000000000000110
         +  11.1000000000000000000011         +  11.10000000000000000000110
         ----------------------------         -----------------------------
          1110.0000000000000000000111          1110.00000000000000000001100
round:    1110.00000000000000000010            1110.00000000000000000001100

.5f*t1:   111.000000000000000000010            111.000000000000000000001100

s:        7.000000954                          7.000000715

This time we still need 25 bits to be exact since t0 had to adjust the exp and IEEE must round again. Posits are still good with the padding bits we gave them. The multiply by half introduces no error for either. Also shown is the decimal values of each to 10 digits and the binary32 relative error is a tiny $~3.40598 \times 10^{-8}$.

Now for the $\left(s-a\right)$ and $\left(s-b\right)$ terms:

s-a:      111.000000000000000000010            111.000000000000000000001100
         -111.000000000000000000000           -111.000000000000000000000000
         ----------------------------         -----------------------------
            0.000000000000000000010              0.000000000000000000001100
s-b:      111.000000000000000000010            111.000000000000000000001100
          -11.1000000000000000000011           -11.100000000000000000001100
         ----------------------------         -----------------------------
           11.100000000000000000001             11.100000000000000000000000

Again posits don’t have any rounding error. The binary32 $\left(s-b\right)$ has a tiny relative error of $6.81196 \times 10^{-8}$ and the performed $\left(s-a\right)$ subtraction was exact, but the total relative error is a massive $0.333\overline{3}$. The tiny error in $s$ was magnified by a subtraction of a nearby number. This is an example of catastrophic cancellation or loss of significance. John D. Cook’s version of a common rule of thumb:

Nothing interesting happens in the remaining operations. All of the error is from a contrived set of numbers where IEEE is in a catastrophic cancellation case and posits are not. So this example tells us nothing. Well actually there is a noteworthy thing: the final relative error of the IEEE result is $~0.154701$. This is an example of the incorrect notion that errors grow without bound as a computation progresses.

Background references on the thin triangle problem: as previously mentioned this problem was original introduced by Kahan2 (as early as 1986) followed by Goldberg3 performing a pen-and-paper analysis and most recently Boldo4 provides a formal proof and a tighter error bound.

Posits thin triangle trick: the “fair” fight

As we saw the reason the posit version appears to give a better answer is because the deck was stacked. The point of looking at the thin triangle problem is to see the impact of catastrophic cancellation in what appears to be a simple real-world problem. The stacking of the deck was to choose numbers where loss of significance is occurring for IEEE and not for posits. Let’s look at what happens when we use the correct $\text{ulp}$ value for posits ($\text{ulp}$ stands for “unit in the last place”. Brief hand-waving def is the size of the right most digit..so $\text{ulp}\left(1\right)$ is $2^{-23}$ for IEEE and $2^{-27}$ for posits.

\[\begin{align*} a & = 7 \\ b & = c =\frac{1}{2}\left(a+3~\text{ulp}\left(a\right)\right) = \frac{1}{2}\left(7 + 3 \cdot 2^{-25}\right) \end{align*}\]

The correct answer is $1.9578876 \times 10^{-3}$ and posits give $\color{red}{2.2607739} \times 10^{-3}$ and the binary digits are:

\[\begin{align*} 10000000010011111110011100 \\ 100\color{red}{10100001010010111110111} \end{align*}\]

So posit don’t have any black magic to prevent loss of significance. There is no magic to be found. This is simply a property of modeling reals with a single element of a finite set. Let’s look again at the binary32 digits in this configuration:


Authors: That is ample for protecting against this particular inaccuracy-amplifying disaster

No. Tossing more bits at the problem doesn’t change a thing. The only choice is to change how the computation is performed. However this does include locally moving to a higher precision model or effectively working at a higher precision (more on these later).

Let’s just beat a dead horse with posits128 (es=4):

\[\begin{array}{l} 100\color{red}{101000010100101111101101001001100010011001110011001000000010011100101001101001011101001010110111100101100001010111} \end{array}\]

Authors: It is interesting to note that the posit answer would be more accurate than the float answer even if it were converted to a 16-bit posit at the end.

It is interesting to note that the 128-bit (or of any length) posit answer would not lose any correct digits if converted to the 8-bit IEEE format of the authors’ design at the end.

In the real world when we hit computational problems with inputs we care about we must change how the computation is performed. Here our options are very limited and we have two choices:

  1. Promote to higher precision, perform computation and lower back to original. So with 32-bit input we could promote to a 64-bit format.
  2. Effectively work in a higher precision (more details in a later post. Floating point expansion is one keyword for the impatient).

The Boldo paper4 details Kahan’s solution (for double input) which is an example of using option two. This is going to be left as a black box for now and it cost about one more issue vs. Heron’s (godbolt):

// requires: a >= b >= c && a <= b+c && a <= 0x1.0p255
double tri_area(double a, double b, double c) 
  return 0.25*sqrt((a+(b+c))*(a+(b-c))*(c+(a-b))*(c-(a-b)));

This list of requirements simply are: sorted largest first, valid triangle (including degenerates to line). Taking the original set of inputs and using Kahan’s method with 32-bit operations gives:

\[\begin{align*} \text{exact} & \approx 0.007831550660 & = 1.000000001001111111001110111110 \times 2^{-7} \\ \text{posit} & \approx 0.007831550553 & = 1.00000000100111111100111011 \color{white}{0000} \times 2^{-7}\\ \text{IEEE} & \approx 0.007831551135 & = 1.00000000100111111101000 \color{white}{0000000} \times 2^{-7} \end{align*}\]

An interesting question is then: Does the error bound of Kahan’s method hold for posits? Well we’ll have to de-black-box it at some point I guess. An aside here: the complexity of Kahan’s method as shown is about the same as Heron’s (godbolt). The real cost is the ordering requirement in the cases where it’s not known nor otherwise required.

Thin triangle: reality check

I’d guess that the author’s might counter-argue the above by saying I’m attempting to meet the wrong “esthetic” by being “non-rigorous, but cheap and fast”. That instead of “guessing” I should be “rigorous and mathematical”. Let’s go back to the original rigged example for IEEE.

\[\begin{align*} a & = 7 \\ b & = c=\frac{1}{2}\left(a+3~\text{ulp}\left(a\right)\right) = \frac{1}{2}\left(a+3\cdot2^{-21}\right) \end{align*}\]

and let’s assume that $a$ is a given (taken to be exact) and that $b$ was computed with uber extreme care and we know that it’s within 1 ulp of correct:

\[\begin{align*} a & = \left[7,~7\right] \\ b & = c =\left[\frac{7340033}{2097152},~\frac{3670017}{1048576}\right] \end{align*}\]

performing exact arithmetic with these gives:

\[A = \left[\frac{133\sqrt{40665}}{4194304},~\frac{7\sqrt{7340033}}{2097152}\right] \approx \left[6.394434228\times 10^{-3},~9.043095918\times 10^{-3} \right]\]

and once again the original IEEE result computed by Herons and hitting catastrophic cancellation was:

\[9.043097 \times 10^{-3}\]

This is exactly the upper end of the interval rounded to IEEE. Blindly using interval arithmetic got us nowhere (we’d have to observe a small interval yielded a relatively wide one) and didn’t they claim we could just copy equations from textbooks? Let’s do some of that (roughly) define stuff shit:

This is a pure mathematical property and not one of computation. A formal measure of conditioning is the condition number. The quick non-exact & non-wikipedia def is:

\[\frac{\text{change in output}}{\text{change in input}}\]

and given a function $f$ it can be approximated by:


for more than one variable you gotta get all Jabocian.

We also have the solely numeric property of stability which is frequently defined in a way that sounds identical to conditioning. Let me give two quick & dirty defs of stable:

The second definition is in terms of backward error analysis which I may mention in a future post (Actually there’re two flavors: forward and backward stability).

Both the conditioning of a problem and the stability of an algorithm can depend on the range (or domain) of the inputs. And to beat a dead horse all four combinations of well-conditioned/ill-conditioned and stable/unstable are possible.

This brings us back to authors’ arguments of just copying equations from textbooks (since algebra is hard) and if you run into trouble use intervals. Great. We find an interval that looks too big. What’s the problem? Are we doing something unstable? Is our problem ill-conditioned? Point being intervals are a tool whos usefulness depends on the user knowledge of which the authors are showing disdain.

Rah rah ah-ah-ah! I want the log! log-log-log! A log-log plot!

The authors propose a warm up exercise to compare the two formats which I’ll address in the next section. For my warm-up comparison I’ll run with triangle areas. Since the authors have some interesting ideas about error measures (more in a later post), I’ll quickly stub it out forward relative error:

\[\frac{a-e}{e} = \frac{a}{e} - 1\]

where $e$ is exact and $a$ is the approximation (computed result). Care must be taken for $e$ approaching zero and the right side side is useless for computing unless using extended or arbitrary precision. We can rewrite $a$ in a Wilkinson style:

\[a = e\left(1+n~2^{-p}\right)\]

where $p$ is bits of precision of $a$ which is fixed in the normal range of IEEE and varies for posits. Then the error can be expressed as:


where $n$ is the number of ULPs which can indeed be any real value. As an example a function which is “correctly rounded” means that it’s error is within one half ULP $\left(n \le \frac{1}{2}\right)$ and $p$ for binary32 is fixed to $24$ in the normal range so on that range we’d see a relative error of less that $2^{-25} \approx 2.9802322 \times 10^{-8}$.

Let’s take the magician’s stage show and look at a plot of relative errors:

\[a=7 \cdot 2^x \qquad b=c=\frac{1}{2}\left(7+3\cdot2^{-21}\right)\cdot 2^x\]

The IEEE results are garbage but it’s dead consistent across the range with a fixed relative error of $~0.154701$. It only changes when values start hitting denormals or overflow. For this section I want to show how the error plots behave so I’m going to show a log plot instead (actually manually performed and base-2 of course). Taking the log of the relative error gives us:

\[\log_{2}\left(n~2^{-p}\right) = \log_{2}\left(n\right) - p\]

If the error is zero then this shoots off to negative infinity. Repeating the previous we have:

There’s a couple of things to note. First the little hiccup posits had around $x=-22$ is less noticeable. We don’t really care here since we want to examine behavior when the error is reasonably low. The second thing is the posit plot looks nice and flat on $x \in \left[-15,~8\right]$ in the linear plot but is rather a mess when we microscope in.

Now let’s change to using Kahan’s method:

The IEEE result is still flat and the error has radically dropped. Posits have generally improved but is error is even more erratic. However up to this point the comparison isn’t fair. So I’m going to change the method. We have IEEE with the lower 3-bits of $b$ set and posit has two extra clear (as in the trick revealed section). And once posit have less than 24-bits of precision it can no longer even represent the input, so it’s screwed. Each model will now have it’s own set of inputs in the same configuration. Showing the above will help make it clear that this isn’t cheating.

\[a=\text{RN}\left(1.75 \cdot 2^x\right) \qquad b=c=\text{RN}\left( 2^{x-1}\left(a + 3~ \text{ulp}\left(a\right)\right)\right)\]

Where $\text{RN}$ and $\text{ulp}$ are the rounding and size of ulp function in each model. IEEE is the same as before and posit now always has representable inputs. Additionally there’s the minor tweak to $a$ (divided by four from previous) so the graphs are centered around posits maximal precision range. Also the plot now have two more traces which are the exact area $A$ correctly rounded in each model. Ignore these for the moment.

In addition I’ll add RMS and average error number for each on the range $x \in \left[-20,~20\right]$. In linear space this is the “please don’t hurt ‘em” range of $\left[\frac{1}{1048576},~1048576\right]$.

  IEEE IEEE RN(A) posit posit RN(A)
RMS 1.99108e-8 1.99108e-8 0.003084440 1.62931e-6
AVE 1.99108e-8 1.99108e-8 0.000714024 6.21741e-7

Let’s jump to the opposite end of the conditioning spectrum and look at a well-conditioned triangle:

\[a = b = c = 2^x\]
  IEEE IEEE RN(A) posit posit RN(A)
RMS 1.79482e-8 1.79482e-8 6.09276e-7 3.50695e-7
AVE 1.79482e-8 1.79482e-8 2.61015e-7 1.64948e-7

It’s gonna be admitted. Posits are kicking IEEE’s ass for equalateral triangles with side measures on $\left[\frac{1}{32}, 32\right]$. Recall this is a base-2 log-log plot so a decrease in Y by one is halving and increase by one is doubling of relative error respectively.

However significands with a population count of one is very improbable, so let’s change to the very probable half filled using $\alpha = \left(\sqrt{5}-1\right)/2$:

\[a = b = c = \text{RN}\left(\alpha~2^x\right)\]
  IEEE IEEE RN(A) posit posit RN(A)
RMS 2.61353e-8 2.61353-8 0.0000794538 5.79264e-7
AVE 2.61353e-8 2.61353e-8 0.0000342251 2.40965e-7

How about a rescaled Pythagorean triple:

\[\left(a,b,c\right) = \frac{2^x}{17} ~ \left(8,15,17\right)\]

Just a side thought here. I’m getting the impression that when working with posits that choice of unit of measure might be a bit important.

  IEEE IEEE RN(A) posit posit RN(A)
RMS 2.06133e-8 2.06133e-8 0.000206543 4.55979e-7
AVE 2.06133e-8 2.06133e-8 0.0000562866 1.75246e-7

Okay that’s enough log-log plot examples. Going back to equilateral triangles let’s look at relative error on $\left[1,~1000\right]$:

and a static image on wider range $\left[1,~2^{24}\right]$. The range choice doesn’t matter since (probably counter to expectations) I don’t have much to say that directly compares the IEEE and posit results.


What we’re seeing above is the effect of tapered precision. Here’s a quick sketch:

  • The sawtooth sections are the caused by the ULP size on the power of two intervals.
  • For both models when we go up a POT interval the absolute size of a ULP doubles which keeps the relative size fixed.
    • Additionally each time posits pass a $2^{\text{es}}$ boundary we lose a bit of precision which results in an additional doubling of ULP size and the relative error is now doubled.
  • Our example model is $\text{es}=2$ so we have groups of $2^{2} = 4$.

Let’s note that mixture of operations that form the area computation isn’t very involved. There are 7 addition/subtractions, 4 multiplications and a square root. A plot (pun intended) theme of posit presentations is that IEEE dropped the ball and the dynamic range is too large. An example comment is roughly “even astrophysicist don’t know what to do with it”. Well our simple little sequence of operations on some well-conditioned problems and using a stable method divided our useable range by a factor of ~4 in the IEEE case. Posits it depends on how much error you’re willing to tolerate.

In the IEEE case the relative error is flat and the pair of plots (Kahan’s method and exact correctly rounded) are virtually identical excluding the ends of the usable range.

For posits the relative error (if you flip it upside down) looks like rolling a ball down a ziggurat and worse there’s generally a large gap (remember log scale) between the exact correctly rounded and Kahan’s method results.

Lies, damned lies and benchmarks or if a trick worked once…

The title here is more a general comment of the usefulness of drawing conclusions from benchmarks than the specific example presented in the posits paper. For the authors’ “warm-up exercise” comparison they propose a bang for your bit budget measure by computing following meaningless expression:


where the rules are: we start with the integers and correctly round versions (in each model) of $\pi$ and $e$, perform the computations and then look at how accurate the results are. In other words it isn’t representative of a meaningful computation. If it was and we needed it to be accurate then we’d simply precompute the correctly rounded value and use that. Let’s run with it anyway. First note that the values do give an air of credibility. We have some small integers and a couple of common mathematical constants. Let’s take a quick peek at what’s going on starting with the numerator:

\[\begin{array}{cccc} \frac{27}{10} & & 10.101100110011001100110011001 \\ e & \approx 2.718281828 & \underbrace{10.10110}1111110000101010001011 \\ \end{array}\]

We’re jumping-up-and-down on the cardinal rule and subtracting two numbers that agree in the first seven digits. The properly rounded posits version of $e$ has 27 bits of precision so it comes out less effected. Gosh what’s gonna happen in the denominator:

\[\begin{array}{cccc} \sqrt{2} + \sqrt{3} & \approx 3.146264370 & 11.001001010111000110010100111 \\ \pi & \approx 3.141592654 & \underbrace{11.0010010}00011111101101010100 \\ \end{array}\]

Nine! They agree in the first nine digits and again the posit version will be less effected since it has more digits in this range. Let’s run with the farce and give the results:

\[\begin{align*} \text{exact} & \approx 302.8827197 & = 100101110.1110000111111001111010 \\ \text{posit (es=3)} & \approx 302.882\color{red}{3166} & = 100101110.1110000111\color{red}{0111111}\color{white}{00000} \\ \text{posit (es=2)} & \approx 302.88\color{red}{12790} & = 100101110.111000011\color{red}{00110111}\color{white}{00000} \\ \text{IEEE} & \approx 302.\color{red}{9124146} & = 100101110.1110\color{red}{10011001010}\color{white}{0000000} \end{align*}\]

But really the raise to a power doesn’t do anything interesting. The exponent computation is exact for both and we’re taking the result to be correctly rounded so without that we’d get:

\[\begin{align*} \text{exact} & \approx 3.913300184 & = 11.111010011100111000001010011 \\ \text{posit (es=3)} & \approx 3.913\color{red}{298935} & = 11.11101001110011\color{red}{01111101011} \color{white}{00} \\ \text{posit (es=2)} & \approx 3.913\color{red}{295731} & = 11.11101001110011\color{red}{011011111111} \color{white}{0} \\ \text{IEEE} & \approx 3.9133\color{red}{91829} & = 11.11101001110\color{red}{10100000011} \color{white}{00000} \end{align*}\]

Notice how good “count correct decimal digit” measure is here. Now instead of counting correct digits let’s count how many are wrong since the point of precise computation is to prevent errors from creeping into the result. Both binary32 and posits (es=3) have lost 11 digits and the better choice of es=2 has lost 12 and we got there using seven basic and correctly rounded operations. In the mathematica nookbook he adds insult to injury and uses a wide accumulator, oh sorry, “quire” for the denominator (yeah..that’s in later post as well).

I’d imaging that running across something like this in production code might cause some to start yelling “Everyone’s fired!” in frustration.

Authors: Therefore, the extra accuracy in the result was not attained at the expense of dynamic range.

No it was attained by a stacked deck which we can do for either model, but it doesn’t tell us anything about either model. Besides why are you mentioning dynamic range? You don’t visit any of it. I kindly direct the reader to my warmup exercise.

Oops…He Did It Again! Yeah, yeah, yeah, yeah, yeah, yeah I think he did it again

The notebook presents a solve linear equation example of a near singular matrix problem presented by D.H. Bailey:

\[\left( \begin{array}{cccc} 0.25510582 & 0.52746197 \\ 0.80143857 & 1.65707065 \end{array} \right) \left( \begin{array}{cccc} x \\ y \end{array} \right) = \left( \begin{array}{cccc} 0.79981812 \\ 2.51270273 \end{array} \right)\]

Which Gustafson transforms into values which can be exactly represented in IEEE doubles:

\[2^{-26} \left( \begin{array}{cccc} 25510582 & 52746197 \\ 80143857 & 165707065 \end{array} \right) \left( \begin{array}{cccc} x \\ y \end{array} \right) = \left( \begin{array}{cccc} 79981812 \\ 251270273 \end{array} \right) 2^{-26}\]

Gustafson: Since the system is only 2-by-2, we can write the solution in compact form using Cramer’s rule

Seriously? Cramer’s rule?

wikipedia: Cramer’s rule can also be numerically unstable even for $2\times2$ systems.

Okay..no prob, I can roll with that. Let’s rewrite the system using symbols:

\[\left( \begin{array}{cccc} a & b \\ c & d \end{array} \right) \left( \begin{array}{cccc} x \\ y \end{array} \right) = \left( \begin{array}{cccc} u \\ v \end{array} \right)\]

For Cramer’s rule we compute the determinate ‘$s$’:

\[s = a d - b c\]

and scaled versions of the result:

\[\begin{align*} x_0 & = u d - b v \\ y_0 & = a v - u c \end{align*}\]

and remove the scale factor (divide by the determinate):

\[\begin{align*} x & = \frac{x_0}{s} \\ y & = \frac{y_0}{s} \end{align*}\]

Both IEEE and posits compute the determinate exactly ($s = 2^{-52}$), posits (64-bit, es=3) and (59-bit, es=3) get the correct result $\left(-1,~2\right)^T$ and IEEE returns $\left(0,~2\right)^T$. Well since $s$ is correct that must mean the problem of getting a totally mysterious $0$ result has to have come from the computation of $x_0$. Well let’s attempt to deduce what could have gone wrong:

\[\begin{array}{lml} ud & = & 10.1111000101100000100011100010011011110110011001010100 \\ bv & = & 10.111100010110000010001110001001101111011001100101010\underline{1} \end{array}\]

Bullet points!

  • Yeap that little lonely underlined bit at the end gets rounded away in IEEE.
  • And yes this is in the small interval where posits have more bits.
  • And yes this is yet another example of catastrophic cancellation.
  • And yes Gustafson claims these extra bits somehow give you some protection.
  • And yes I’m getting really bored of typing these things.

No matter. The solution is just a web-search away, but I’ll save you the trouble and give a detailed reference5 and a quote from the introduction:

Expressions of the form $ad \pm bc$ with $a,b,c,d$ some floating-point numbers arise naturally in many numerical computations. Examples include complex multiplication and division; discriminant of quadratic equations; cross-products and 2D determinants. … Unfortunately, the naive way of computing $ad \pm bc$ may lead to very inaccurate results, due to catastrophic cancellations. Hence it is of interest to compute $ad \pm bc$ accurately.

Awesome. So the Gustafson is encouraging people to use a method that evaluates this troublesome expression not once, or twice but three times. Nice. Well the reference provides us with Kahan’s method (wow his name is showing up alot) to evaluate said expression (code on godolt):

// computes: ad-bc within +/- 3/2 ulp of exact
double discriminant(double a, double b, double c, double d)
  double w = b*c;
  double e = fma(-b,c,w);
  double f = fma(a,d,-w);
  return f+e;

Huh? I wonder if the error bound on this expression holds for posits? Well we’ll have to de-black-box it at some point I guess.

There’s no point in running any numbers, we’ve effectively looked at this with thin triangle. (NOTE: It’s actually general worse here because the posit point distribution is better suited to square roots than IEEE…details in later post).

The Catastrophic Cancellation Kid STRIKES AGAIN!

Seriously people. YOU CANNOT MAKE THIS SHIT UP. Solving the real quadratic equation $ax^2+bx+c = 0$ is the (wanted) poster child example used to demonstrate catastrophic cancellation. The “textbook” equation is:

\[\frac{-b \pm \sqrt{b^2-4ac} }{2a}\]

and they tells us the equation can be rewritten using a “classic” trick:

\[\frac{2c}{-b \mp \sqrt{b^2-4ac}}\]

Authors: Like most numerical tricks, it is far from obvious, has to be worked out with tedious algebra, and is prone to errors in coding it.

John D. Cook has a blog post6 with the derivation of this second equation, but that doesn’t matter since the second equation is no better than the first. A bottom line here is if you want to improve performance and/or accuracy of some computations…well if someone hasn’t already done the work for you then some math (algebra or otherwise which you may or may not find tedious) is required and writing any code is an error prone process (but somebody’s gotta do it).

The first problem is when $b^2$ is significantly greater than $\left| 4ac \right|$ since then $\sqrt{b^2-4ac} \approx$ $ \left| b \right|$ so for one choices of the $\pm$ the computation goes boom. The second problem is computing the discriminant $b^2-4ac$ when $ b^2 \approx 4ac $, (near equal roots) but we just talked about that solution. So this is an interesting problem since we need two different techniques to solve: algebraic manipulation to solve the first (see below) and effectively extending the precision for the second. (Recall I’m ignoring overflow/underflow issues)

To find the larger magnitude root we can side step the problem:

\[r_0 = \frac{-\left(b + \text{sgn}\left(b\right)\sqrt{b^2-4ac}\right) }{2a}\]

where $\text{sgn}$ is the signum like function:

\[\text{sgn}\left(x\right) = \begin{cases} 1 & x \geq 0 \\[2ex] -1 & x < 0 \end{cases}\]

which in “C” can be implemented as:

double sgn(double x)  { return copysign(1.0, x);  }
float  sgnf(float  x) { return copysignf(1.f, x); }

The second root can be computed via:

\[r_0r_1 = \frac{c}{a} \implies r_1 = \frac{c}{a~r_0}\]

An addtional advantage of the solution I’m showing is that you know which is the larger and smaller magnitude without inspection.

Authors: Instead of expecting programmers to memorize a panoply of arcane tricks for every formula, perhaps posits can make it a bit safer to simply apply the textbook form of the formula.

Nope still no magic protection and there’s still no need to memorize it. They can repeat stuff like this as much as they want. What I really should be doing is showing IEEE using stable methods and posits with their suggestions. But that’d be cruel. Sigh. Another missed opportunity.

For form sake the given example was: $\left(a,b,c\right) = \left(3,100,2\right) $ and computing as I suggest gives:

  $r_0$ $r_1$ $r_0$ $r_1$
exact -33.313321 -0.020012014 $100001.010100000011010111010$ $1.010001111110000001111000111 \times 2^{-6}$
IEEE -33.313320 -0.020012015 $100001.010100000011010111$ $1.01000111111000000111101 \times 2^{-6}$
posit -33.313322 -0.020012014 $100001.010100000011010111011$ $1.01000111111000000111100011\times 2^{-6}$

Surely you’re smoking, Mr. Gustafson!

Gustafson in a talk7 presented at Stanford runs a theme with the follow dot product:

\[\begin{array}{r} \mathbf{a} & = \left(32000000,~1,~-1,~\color{white}{+}80000000\right) \\ \mathbf{b} & = \left(40000000,~1,~-1,~-16000000\right) \end{array}\]

and in binary:

\[\begin{array}{l} \mathbf{a} & = \left(1.1110100001001000 \times 2^{24},~1,~-1,~\color{white}{+}1.0011000100101101 \times 2^{26} \right) \\ \mathbf{b} & = \left(1.0011000100101101 \times 2^{25},~1,~-1,~-1.1110100001001000 \times 2^{23} \right) \\ \end{array}\]

Note the form of the vectors are: $ a = \left(-2x,~1,~-1,~2y \right) $ and $ b = \left(y,~1,~-1,~x \right) $ so the outer elements cancel and we’re left with the sum of the inner elements and a result of $2$. We’re informed that both binary32 and binary64 results are zero, binary64 computed via binary sum collapse returns $1$ and (of course) posits get the correct answer.

Gustafson: Most linear algebra is unstable with floats!

Let’s step back and consider how bad this error is. If we measure using either absolute or relative error then, yeah, it’s pretty bad. However the dot product is a scalar projection which can be define geometrically (doesn’t matter if the vectors don’t have explicit geometric meaning) as:

\[\mathbf{a} \cdot \mathbf{b} = \lVert \mathbf{a} \rVert~ \lVert \mathbf{b} \rVert~ \cos\left(\theta\right)\]

and we can compute the relative projection:

\[\frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert~ \lVert \mathbf{b} \rVert} \approx 5.38793103448 \times 10^{-16} \approx 1.0011011010010111111 \times 2^{-51}\]

and solving for $\theta$:

\[\theta \approx \frac{\pi}{2} - 2^{-51}\]

So if a model returns zero instead of one we’re off about approximately $2^{-51}$ whichever way you want to look at it.

Gustafson: IEEE floats require 80-bit precision to get it right. Posits (es=3) need only 25-bit precision to get it right.

Nope, nope, nope:

  • 32-bit IEEE & posits return zero.
  • 64-bit posits (es=3) returns the correct answer.
  • 25-bit posit (es=3) can represent the INPUT. It cannnot compute the correct answer without a wide-accumulator..wait..sorry: “quire”.
  • 64-bit IEEE returns the CORRECT answer.

There’s some more crazy talk about the quire with respect to dot products, but I’ll leave that to the next post.

Look ma I can count! or much ado about a ULP


Here is an example that has disappeared from the stage act. This is a mock-up of a slide from the debate8:

From my book, to show why round-to-nearest might not be random and how unums can self-manage accuracy:

float sumtester()
  float sum; int i;
  sum = 0.0;
  for (i = 0; i < 1000000000; i++) {
    sum = sum + 1.0;
  printf("%f\n", sum);

In trying to count to a billion, IEEE floats (32-bit) produce 16777216.

Let’s walk through this:

  • We start $sum$ at zero and add one each iteration.
  • Each time $sum$ hits a power-of-two integer the exponent is increased by one and the size of a ULP is doubled.
  • When we reach $sum = 2^{23} = 8388608$ then the size of a ULP is 1. (no rounding yet)
  • When we reach $sum = 2^{24} = 16777216$ then the size of a ULP is two. (didn’t round)
  • All remaining iterations attempt to add 1 which is 1/2 a ulp and $sum$ is left unchanged (no rounding ever occurs)
  • IEEE-754 doesn’t specify a random rounding mode (although researchers have played with the concept).

If computed with a 32-bit posit (es=2) then the result is $8388608$ and es=3 would be same as IEEE. So this absurd example had to go.

With a bit of imagination we can rework the example into a much more impressive one anyway. Taking a post9 by Nick Higham derived from a note10 by David Malone let’s look at computing the $n^{th}$ harmonic number directly from the series:

\[H_n = \sum_{i=1}^{n} \frac{1}{i}\]

If we directly convert the summation into a loop that terminates when the value doesn’t change then we get:

model $n$ computed $H_n$ posit advantage multiplier!
bfloat16 $65$ 5.0625 0.126706x
binary16 $513$ 7.0859 -
posit16 es=1 $1024$ 7.7773 1.9961x
posit16 es=2 $1025$ 7.7813 1.99805x
binary32 $2097152$ 15.404 -
posit32 es=2 $8388609$ 16.813 4x

Look how much better posits are than IEEE: the 32-bit version can compute 4x as many! This is same mechanism as counting by one. We have a very slowly growing sum where in IEEE we hit a value too small to contribute earlier than posits. It’s trivial to contrive an example where IEEE would win but simple or convoluted examples which boil down to measuring the size of a ULP in the two formats doesn’t tell us anything we shouldn’t already know.


The provided examples are the computational equivalent of playing three card monte on the street corner. The “trick” is obvious to those that know it and hopefully I’ve done a sufficient job of explaining if loss of significance and ULP size issues were unfamiliar to you. These examples don’t measure anything about either model beyond what is already known. Specifically the size of a ULP at a given number.

We’ve already covered the majority of the “evidence” that they provide to back-up the claim of posits being a “uniformly superior” model than IEEE. What else is there?

  1. A “scaled model” comparison. Here they compare how a made up 8-bit IEEE format performs vs. posits on basic operations.
  2. The LINPACK benchmark is used as a “method of manufactured solutions”.

In the next post we’ll look at more absurd things and I’ll mention some uber nerdy math and CS shit!

Postscriptum 20220930

Well obviously I lost all motivation to complete the follow-up. Since originally writing this there have been some formal papers on the topic written by well respected researchers (to put it mildly) who never “game” their findings. In fact they bend over backwards attempting to be as positive on posits as possible. Some examples:

  • “Comparing posit and IEEE-754 hardware cost”, 2001. A TL;DR snippet from abstract “This library improves the state of the art of posit hardware arithmetic, and still, IEEE-754 operators remain between 30% and 60% faster and smaller than their posit counterparts”. One of the co-authors (Florent de Dinechin) has even written a paper with Gustafson even though Gustafson has previous attacked his integrity as a researcher. Like I said: “bending-over-backwards”.
  • “Posits: the good, the bad and the ugly”, 2019. Looks at some of issues of using posits. Note this paper was written for a posit conference so that fact certainly effected its content.

Also this write-up was discussed in a post. The TL;DR version is that Gustafson was very polite about it and he doesn’t really deny my claims about the posit examples being contrived.

And FWIW the format now has a specification.

References and Footnotes

  1. “Branches of Mathematics: Numerical Analysis”, N. Trefethen (PDF

  2. “Miscalculating Area and Angles of a Needle-like Triangle”, W. Kahan, 2014 (PDF

  3. “What Every Computer Scientist Should Know About Floating-Point Arithmetic”, David Goldberg, 1991 (page

  4. “How to Compute the Area of a Triangle: a Formal Revisit”, Sylvie Boldo, 2013 (PDF 2

  5. “Further analysis of Kahan’s algorithm for the accurate computation of 2x2 determinants”, Jeannerod, Louvet & Muller, 2013 (page

  6. “The quadratic formula and low-precision arithmetic”, John D. Cook (page

  7. “Beyond Floating Point: Next-Generation Computer Arithmetic”, J. Gustafson, 2017 (PDF


  9. “Half Precision Arithmetic: fp16 Versus bfloat16”, Nick Higham, 2018 (page

  10. “To what does the harmonic series converge?”, David Malone, 2013 (PDF


© 2023 Marc B. Reynolds
all original content is public domain under UNLICENSE