This is a mostly code post with some supporting comments. I’ll assume you’re familiar with IEEE bit formats and will use binary32 as the working example. The quick skim version is we have a sign bit followed by a biased exponent and then the explict bits of the significand (fractional part). For all of this we’re taking the inputs to be finite (including denormals).

We’ll start with the fact that the bit pattern of finite IEEE values are equivalent to a signed magnitude integer. So to compute the ULP distance between two floats we can simply:

  1. get the bit patterns for both
  2. convert both from signed magnitude to two’s complement
  3. subtract to get the difference
  4. compute the absolute value to get the distance

Tossing together some helper functions and pasting them together gives:


// get the bit pattern of 'x'
inline uint32_t f32_to_bits(float x)
{
  uint32_t u; memcpy(&u, &x, 4); return u;
}

// returns -1 if negative and otherwise zero
inline uint32_t u32_sgn_mask(uint32_t x)
{
  return (uint32_t)((int32_t)x >> 31);
}

// convert signed magnitude to two's complement. 
// * If negative flip all bits except the sign and add one.
inline uint32_t sm_to_tc(uint32_t x)
{
  uint32_t s  = u32_sgn_mask(x);
  return (x ^ (s>>1))-s;
}

// okay...the name's goofy, but you get what I mean
inline uint32_t u32_abs(uint32_t x)
{
  return (int32_t)x >= 0 ? x : -x;
}

// first pass: return ULP distance between finite a & b
uint32_t f32_ulp_dist_orig(float a, float b)
{
  uint32_t ua = sm_to_tc(f32_to_bits(a));
  uint32_t ub = sm_to_tc(f32_to_bits(b));
  return u32_abs(ua-ub);
}


The resulting function is relatively expensive for what it does. Let’s create a special case version which only properly handles inputs with the same sign. If both are positive then the subtraction can be directly performed on the signed magnitude integers and likewise for both negative (the sign bits cancel). This eliminiates the integer representation conversion which is the bulk of our original version.

uint32_t f32_ulp_dist_ss(float a, float b)
{
  uint32_t ua = f32_to_bits(a);
  uint32_t ub = f32_to_bits(b);
  uint32_t r  = ub-ua;
  return  u32_abs(r);
}


The previous is far from useless since it’s common to only care about small distances and incorrectly computed large wouldn’t matter. So it’s usable if we know that one value isn’t in the neighborhood of zero (at least as far from zero as the maximum distance we’re interested in).

Let’s now think about when the signs are different. If we ignore the sign bit than rest of the signed magnitude integer is the distance from zero to the float so we can then simply add the two integers together and zero out the sign bit of the result to compute the distance. Mashing that up with a same signed special case takes us back to a general case function:

uint32_t f32_ulp_dist(float a, float b)
{
  uint32_t ua = f32_to_bits(a);
  uint32_t ub = f32_to_bits(b);
  uint32_t s  = ub^ua;

  if ((int32_t)s >= 0)
    return u32_abs(ua-ub);
  
  return ua+ub+0x80000000;
}



Comments





© 2024 Marc B. Reynolds
all original content is public domain under UNLICENSE