Floating point division with known divisor

March 12th, 2019

A quick note that’s mostly regurgitating the paper “Accelerating Correctly Rounded Floating-Point Division when the Divisor Is Known in Advance” with some comments, code and extra numbers.

The problem is computing $\frac{x}{y}$ when $y$ is known in advance or a compile time constant. I’m going to ignore hardware reciprocal approximation instructions and related techniques. I’ll use single precision throughout.

NOTE: The working assumption throughout this is that the inputs are limited to floating point values on the normal range and that no overflow or underflow (absolute value is not smaller than smallest normal number) occurs. Otherwise error bounds are not as quoted.

BEWARE: Also division latency and throughput are not as bad as in days of yore.

The “naive” method multiply by reciprocal

Let’s run though this well known method. First we precompute the reciprocal:

\[z_h = \text{RN} \left( \frac{1}{y} \right)\]

where $\text{RN} \left(\cdot\right)$ is simply the hardware round to nearest (ties to even). Then the approximate quotient $q$ is computed by:

\[q = \text{RN} \left( x\cdot z_h \right)\]

Performing the division (the correctly rounded result) has an error bound of $0.5~\text{ulp}$ and this approximation’s bound is $1.5~\text{ulp}$. A surprise from the paper is that this approximation returns a correctly rounded result more often than I would have expected at 72.9% of the time (this is a conjecture back by some empirical testing and a partial proof).

Of course the authors dismiss this as being too awful to contemplate but for those of us willing to increase the error bound for a performance boost it’s a very useful tool. I’m including this method because:

It’s seemly common to run across code which is performing some $n$ divides to avoid the increased error bound and at the same time $y$ and/or the $n$ values of $x$ are computed with a very wide bound. Improving the latter and using this “naive” method will be faster and lower total error in some cases.
The other methods require hardware FMA.
As noted in intro the latency of division isn’t the murder it used to be and this might be the only viable option.
And might as well mention that computing the reciprocal as a double, promoting the $x$ values to doubles and demoting is another potential option.

The product of unevaluated pair method 1 FMA + 1 product

The main proposed method of the paper precomputes an unevaluated pair $\left(z_h,~z_l\right)$ approximation of the reciprocal by:

\[\begin{array}{lll} z_h & = & \text{RN} \left( \frac{1}{y} \right) \\ t & = & \text{RN}\left(1-y \cdot z_h\right) \\ z_l & = & \text{RN}\left(\frac{t}{y}\right) \end{array}\]

One possible way to translate that into C code would be:

typedef struct { float h,l; } f32_pair_t;

// compute the unevaluated pair (h,l) approximation of 1/a
void f32_up_recip(f32_pair_t* p, float a)
{
 p->h = 1.f/a;
 p->l = -fmaf(p->h, a, -1.f)/a;
}

To approximate the division we simply multiply by the unevaluated pair:

\[q = \text{RN} \left(x\cdot h + \text{RN}\left(x\cdot l\right)\right)\]

// multiply 'x' by the value represented by the unevaluated pair p=(h,l)
static inline float f32_up_mul(f32_pair_t* p, float x) 
{
  return fmaf(x, p->h, x*p->l);
}

As an aside there’s a related paper “Correctly rounded multiplication by arbitrary precision constants” which describes how to determine if some constant (stored as an unevaluated pair) is correctly rounded for all x. Some examples include: $\pi,~ 1/\pi, ~\log\left(2\right), ~1/\log\left(2\right)$, etc.

So how good is this approximation? The short version is that it’s excellent:

For $98.7273\%$ of choices of $y$ the result is exact (meaning correctly rounded, aka bit identical to performing the division) for all input.
The remain $1.2727\%$ of $y$ values only return an inexact result for exactly one mantissa bit pattern per power-of-two interval which in single precision means $1$ in $2^{23}$ or approximately $1.19209\times 10^{-5}\%$
The maximum error of these inexact results is $0.990934~\text{ulp}$, so they are all faithfully rounded. The average error is a much tighter $0.605071$ and the RMS is $0.611434$.

A minor note is that the sign of a zero output is flushed to positive if the signs of the two elements of the ordered pair are different.

Aside: unevaluated pairs are a special case of (non-overlapping) floating-point expansions. We have:

\[\text{RN} \left(h+l\right) = h \implies l \le 0.5~\text{ulp}\left(h\right)\]

Corrected “naive” method 2 FMA + 1 product

Up front: this method isn’t very promising since it requires two FMAs and a product. Minimum sketch: we compute the approximate $q$ as in the “naive” method, then the approximate remainder $r$ and finally the correctly rounded quotient $q’$ by:

\[\begin{array}{lll} q & = & \text{RN} \left( x\cdot z_h \right) \\ r & = & \text{RN} \left(x-q \cdot y\right) \\ q' & = & \text{RN} \left(q+r\cdot z_h\right) \end{array}\]

and a code version:

// it up to the reader to come up with a good name
inline float algorithm_1(float x, float y, float zh) 
{
  float q = x*zh;
  float r = -fmaf(q,y,-x);
  return fmaf(r,zh,q);
}

Oh! For form sake!

For the single FMA method the paper provides a recipe to determine if a given divisor $y$ is correctly rounded for all $x$ and if not the mantissa bit pattern $X$ for which it is not. The method is very expensive since it requires a fair number of branches and to add insult to injury with near equal probability (noted in code comments). It’s bordering on useless for runtime usage IMHO.

My code is slightly different than the method presented in the paper. First I computed the smallest $Y$ value which is not correctly rounded which is: 0x9f0237. That’s folded with the paper’s initial test of all even $Y$ are correctly rounded into a rotate right operation. This increases the accepted values from $0.5$ to $~0.62$. Secondly I completely drop the test on the magnitude of $z_l$. This is a fair number of operations to cut half the cases at that point. Next the computation of the candidate $X$ value is slightly restructured to drop the number of branches. This block of code can be to convert into branchfree but I couldn’t be bothered. Lastly if we have a candidate $X$ we test if it’s correctly rounded.

// computes the unevaluated pair C which is approximates 1/y and returns the mantissa
// bit pattern X for which a result will be inexact (zero if always correctly rounded)
uint32_t fp_up_recip_x(f32_pair_t* C, float y)
{
  f32_up_recip(C, y);                           // compute 1/y approximation C=(h,l)
  
  uint32_t S = f32_to_bits(y);                  // IEEE bit pattern of 'y'
  uint32_t Y = S & 0x7FFFFF;                    // mantissa as an integer (without implied bit)
  
  // folded test: even or smaller than smallest that fails
  if (u32_rotr(Y,1) < 0x800f811b) return 0;     // p taken = ~.62113
  
  // 37.89% reach here
  Y |= 0x800000;                                // add in the implied bit
  
  // compute: P- and X- (see supplement)
  uint32_t p = u32_mod_inverse(Y) & 0x1FFFFFF;
  uint32_t q = p >> 1;
  uint32_t X = 0;

  // this block could be made branch-free
  if (q >= (1<<23)) {                           // p taken = ~.5
    X = (uint32_t)(((uint64_t)p*Y-1)>>25);
  }
  else {
    // compute: P+ and X+
    p = 0x2000000-p;
    X = (uint32_t)(((uint64_t)p*Y+1)>>25);
  }
  // ------------------------------------

  if (X >= (1<<23)) {                           // p taken = ~0.742677
    // we have a potential X so test if it
    // is correctly rounded or not.
    float x  = (float)X;
    float r0 = x/(float)Y;
    float r1 = f32_up_mul(C,x);
    
    if (r0 != r1)                               // p taken = ~0.0452307
      return X;
  }
  return 0;
}

The above needs to the following helper functions:

// multiplicative modulo inverse of odd 'a'
static inline uint32_t u32_mod_inverse(uint32_t a)
{
  uint32_t x;
  x = (a*a)+a-1;
  x *= 2-a*x;
  x *= 2-a*x;
  x *= 2-a*x;
  return x;
}

// warning: 'i' on [1,31] otherwise undefined
static inline uint32_t u32_rotr(uint32_t x, uint32_t i)
{
  return (x>>i)|(x<<((0-i) & 31));
}

static inline uint32_t f32_to_bits(float x) 
{
  uint32_t u; memcpy(&u, &x, 4); return u;
}

static inline float f32_from_bits(uint32_t x)
{
  float f; memcpy(&f, &x, 4); return f;
}

Comments

math (34)