This is a quick note on a pair of small things that you either already know or will probably never need. OK, enough of my hard sell. As usual I’ll use single precision as the working example and I’ll only use *binary32* or *float* to talk about the FP format so I can use *single* to mean **one** of something.

We have a floating point value $x$ to which we want to either add or multiply by some constant $K$ to get a floating point value in the same format as $x$. But $K$ doesn’t fit in that format and we want to bump up the accuracy of the result. As an example: binary32 $x$ and multiplying by $\pi$ the way normal people do:

but we’re nerding out and instead want a drop-in replacement that is a little more accurate. Note this example takes the double constant `M_PI` and casts it to a *float* (which rounds it to a binary32). This works out in this case, but in general a constant rounded to a double and then to a float is not the same as the constant directly rounded to a float.

We’re going to need some extra tools to work with, and I’ll be using the following freely available ones:

- Sollya is a free library and DSL designed for building math functions.
- The WolframAlpha website to perform prime factorization of integers.

Recall that the *fused multiply add* (FMA) operation (exposed in C/C++ as `fma` and `fmaf`) computes `a*b+c` with a single rounding step:

\[\text{fma}\left(a,b,c\right) = \text{RN}\left(a \cdot b + c\right)\]

where $\text{RN} \left(\cdot\right)$ is simply the hardware *round to nearest (ties to even)* for the given format.

**WARNING:** Only recent updates of Visual C (2019) treat `fma` and `fmaf` as intrinsics; older versions will instead call some obnoxiously expensive functions.

## Multiply by a constant: 1 FMA + 1 product

We can represent a number $\left(v\right)$ using more than one floating point value. I will call using the logical sum of two $\left(v = h+l\right)$ an *unevaluated pair*. We can build all kinds of tools out of them, but we need just one: multiply a pair with a single value and return a single value. A previous (and related) blog post (*“Floating point division with known divisor”*) uses this and has a few more details.

So the multiply by constant of this post is just a special case of this. We need to factor our constant $\left(K\right)$ into a pair:

\[H = \text{RN}\left(K\right), \quad L = \text{RN}\left(K-H\right)\]

So $H$ is just the constant rounded to a single, $L$ is the error rounded to a single and $K \approx H+L$. Then we compute as:

\[Kx \approx \text{RN}\left(Hx + \text{RN}\left(Lx\right)\right)\]

Let’s fire up sollya (`rlwrap sollya --nocolor`) and use it as a REPL to factor $\pi$:

```
> single(Pi);
3.1415927410125732421875
> single(Pi-single(Pi));
-8.74227765734758577309548854827880859375e-8
> quit;
```

(the function `single` rounds to *binary32* to nearest) and shoving the output into the above defs:

So we have our drop-in replacement for the function from the introduction. What’s interesting is that our replacement returns:

\[\text{RN}\left(\pi \cdot x\right)\]

which in English is: multiplies $x$ by $\pi$ as if in infinite precision and then rounds to nearest. The original version returns this ideal result only about 66.8053% of the time. For a given constant and floating point format precision the FMA implementation will either: always return the correctly rounded result (like this $\pi$ example), always be correctly rounded except for one input bit pattern (per power-of-two interval, like the “division by constant”), or be incorrect for multiple bit patterns. The methods to determine what happens for a given constant and FP format are described in *“Correctly rounded multiplication by arbitrary precision constants”*.

Here’s a sollya def to split out these constants as per this post:

```
// computes a non overlapping pair out of input v
procedure f32_mul_k(name,v)
{
  var h,l;
  h = single(v);
  l = single(v-h);
  print("const f32_pair_t f32_mul_k_" @ name @
        " = {.h = " @ h @ "f, .l=" @ l @ "f};");
};
```

For your amusement here’s a small table of some constants $\left(K \approx H+L\right)$. $CR$ indicates if the result is always correctly rounded (? = I don’t know, since I haven’t implemented the testing methods) and ‘%’ is the percentage of inputs not correctly rounded when computed the normal way.

K | H | L | CR | % |
---|---|---|---|---|
$\pi$ | 0x1.921fb6p1f | -0x1.777a5cp-24f | Y | 33.194710 |
$1/\pi$ | 0x1.45f306p-2f | 0x1.b93910p-27f | Y | 48.123135 |
$\log(2)$ | 0x1.62e430p-1f | -0x1.05c610p-29f | Y | 3.260410 |
$1/\log(2)$ | 0x1.715476p0f | 0x1.4ae0c0p-26f | Y | 15.840387 |
$\log(10)$ | 0x1.26bb1cp1f | -0x1.12aabap-25f | Y | 16.824018 |
$1/\log(10)$ | 0x1.bcb7b2p-2f | -0x1.5b235ep-27f | Y | 28.183519 |
$e$ | 0x1.5bf0a8p1f | 0x1.628aeep-24f | ? | 36.054657? |
$1/e$ | 0x1.78b564p-2f | -0x1.3a621ap-27f | ? | 29.529118? |

Something I haven’t gotten around to explicitly mentioning: since scaling by a power of two is error free, these precomputed unevaluated pairs can simply be scaled as well.
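For example (a sketch; the helper name `f32_pair_scale_pot` is mine) scaling the $\pi$ pair from the table by two gives a $2\pi$ pair with no new sollya run:

```c
typedef struct { float h, l; } f32_pair_t;

// pi as a non-overlapping pair (values from the table above)
static const f32_pair_t f32_mul_k_pi = { 0x1.921fb6p1f, -0x1.777a5cp-24f };

// power-of-two scaling of both halves is exact (no rounding occurs)
static inline f32_pair_t f32_pair_scale_pot(f32_pair_t k, float pot)
{
  f32_pair_t r = { k.h * pot, k.l * pot };
  return r;
}
```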

## Carry it through… added: 20221022

**WARNING:** All of these are just examples. A good factorization depends on the range of input(s) and relative magnitudes of the individual quantities.

The previous by itself is mostly useless and just tightens up the final bit. However the real intent is to carry it through to the rest of the computation. Let’s look at adding the subresult $c$ to it (the interesting case is when $c$ is significantly smaller than $Kx$):

\[\begin{align*} Kx + c &= \left(H+L\right)x+c \\ &= Hx + Lx + c \\ &= \left(Hx + \left(Lx + c\right)\right) \end{align*}\]

which is simply a nested FMA structure:

A more practical example is when $c$ is a function of $x$ and we’re computing a polynomial which is a minimax approximation of a function on some interval. Let’s look at an example polynomial in Horner’s form (and without a constant term):

\[x\left(c_1 + x\left(c_2 + c_3\,x\right)\right)\]

where $c_1$ is the unevaluated pair $H+L$. Distribute the outer $x$, nest $L$ and denote the *tail* term $r = c_2 + c_3\,x$. So we have:

\[Hx + x\left(L + x\,r\right)\]

which translated into code gives:

so this expansion has the same cost as adding one more term to our approximation.

If our polynomial is *even* then we’ve precomputed $s=x^2$ and could just compute as the previous (with $s$ in place of $x$), but let’s consider another option. Replacing the leading $s$ with $x \cdot x$ and reworking:

\[\left(Hx\right)x + s\left(L + s\,r\right)\]

and in code:

so choosing this second form adds one more product. And let’s rinse and repeat for *odd* polynomials:

\[Hx + x\left(L + s\,r\right)\]

and again in code:

which works out to be either an additional constant load (counting fma & mul as equivalent in cost) or the cost of an extra term.

## Overlapping pairs added: 20221022

The constant factorization method given above produces a *non overlapping* pair which maximizes the precision of the constant. It can be desirable to produce a less accurate factorization. Here’s a sollya script that produces a pair where both parts have the same sign.

```
// produces same signed constants. generally
// loses one bit of accuracy
procedure f32_mul_samesign_k(name,v)
{
  var ha,la;
  ha = round(v,24,RZ);
  la = single(v-ha);
  print("const f32_pair_t f32_mul_k_" @ name @
        " = {.h = " @ ha @ "f, .l=" @ la @ "f};");
};
```

The motivation here is wanting to handle returning a signed zero for “free”. Given `fma(a,b,c)` where `a*b` is zero and `c` is zero, the result is `-0` if both are `-0` and otherwise positive zero.

Another example is producing an $H$ with less than the maximum number of bits. FWIW: I’ve never used this… but I’ve never really tried either. Anyway, this shrinks the magnitude gap between $H$ and $L$ (the magnitude of $L$ roughly doubles for each bit dropped) and the product $Hx$ requires fewer bits to be represented exactly. Why I think this could be interesting: if we’re bothering with extended precision for a polynomial approximation then our coefficients are more than sufficient to hit a correctly rounded result, and the problem becomes managing the errors in performing the computation. The downside is we’re losing some effectiveness of the *fma* with the $H$ term.

```
// produces constants where H is a (24-g) bit value
procedure f32_mul_p_k(name,v,g)
{
  var ha,la;
  ha = round(v,24-g,RN);
  la = single(v-ha);
  print("const f32_pair_t f32_mul_k_" @ name @
        " = {.h = " @ ha @ "f, .l=" @ la @ "f};");
};
```

## Add a constant: 1 FMA

Let me first note that this method **does not** help with catastrophic cancellation since the source of that is the rounding error already present in $x$.

Like with the extended precision product this can be used as part of function approximations. Here’s a motivational example implementation of *atan* written by Norbert Juffa which needs to add $\frac{\pi}{2}$.

We can add by an extended precision constant as follows:

\[K + x \approx \text{RN}\left(A \cdot B + x\right)\]

So we need to again factor $K$ but this time into a multiplicative pair $\left(K \approx A \cdot B \right)$ instead of a sum. Here’s some sollya defs to get started:

Breaking down `add_k`:

- `f` : the constant rounded to nearest at twice the precision of the format (48/106 for float/double)
- `d` : denotes if `f` was rounded up (0) or rounded down (1). This could be needed to break a ‘tie’.
- `m` : the significand (or mantissa) of `f` as an integer
- `e` : the power-of-two exponent of `f`
- `p` : the actual number of bits in `m`. This can be less than requested since any number of the low bits can be zero; `m` is adjusted by the number of trailing zeroes and `e` is likewise adjusted.

and if you copied that into a file and ran it (`sollya whateverFileName`) you’d get:

```
[|3.1415926535897966687116422690451145172119140625, 221069929750889, -46, 0|]
[|3.1415926535897932384626433832794812270636909610957, 63719069007931157819013617823235, -104, 1|]
[|0.61803398874989312616889947094023227691650390625, 173961102589770, -48, 1|]
```

Taking the last one we have:

\[K \approx 173961102589770 \cdot 2^{-48}\]

Finally we have the information we need to attack the factoring. If we take the integer and plug it into WolframAlpha it gives the prime factorization:

We can set aside any powers of two for the moment (this is our trailing zeroes) and want to take the remaining terms and attempt to partition them into two sets where the product of each is small enough to fit in our floating point format without rounding. For floats that means they need to be $\le 2^{24}-1$. For this example we have two choices:

Since we were successful with this integer we could have used the simpler sollya routine `add_k_q` to get the same results. There’s no reason I can see to prefer one over the other. Toy example code:

Now let’s look at $\pi$ for *binary32* (an example usage is as part of range reduction for one and two parameter arc-tangents). The initial integer isn’t going to work out this time so we’re going to look at a range of values. Our target integer rounded up, so we want to inspect: $ \left\{I,I-1,I+1,I-2,I+2,\ldots\right\} $. If it had rounded down we’d flip the sign on the ordering.

 | $2^t$ | prime factorization |
---|---|---|
I | 0 | $ 5867 \cdot 44279 \cdot 850973 $ |
I-1 | 3 | $ 317 \cdot \color{red}{87172685233} $ |
I+1 | 1 | $ 3 \cdot 5 \cdot 11 \cdot 1061 \cdot \color{red}{631393853} $ |
I-2 | 0 | $ 3 \cdot 7 \cdot 19 \cdot \color{red}{554059974313} $ |
I+2 | 0 | $ 13 \cdot 61 \cdot 73 \cdot 14879 \cdot 256661 $ |
I-3 | 1 | $ 17 \cdot 4253 \cdot \color{red}{1528816543} $ |
I+3 | 2 | $ \color{red}{55267482437723} $ |

The $2^t$ column is the extra powers of two and the red integers are too large, so we have two candidates in this window: $I$ and $I+2$. I already spoiled that $I$ doesn’t work out, and that’s because the smallest factor times the largest is too big. But we’re good with $I+2$:

\[I+2 = \left(61 \cdot 256661\right)\left(13 \cdot 73 \cdot 14879\right) = 15656321 \cdot 14120171\]

And some toy code.