I’m going to do a quick run-through of a design and implementation (in binary32, aka single precision) of a method to simultaneously compute sin(πx) and cos(πx), which is going to contain both minimal details and validation. The method is:
- non-rigorous: I don’t care about how floating-point sticky flags are affected nor about how the signs of zeroes, NaNs and infinities pan out.
- highly accurate: target “faithfully rounded” (error less than 1 ULP)
- no limit on input range (OK, with caveats; see below)
- high throughput computation
- branch-free: both so input can be effectively random (not burdened by branch mispredictions) and so it can be directly extended to SIMD.
I’m going to assume basic familiarity with polynomial (and in particular minimax) approximations. Robin Green has some very accessible introductory material1 on this topic. I’m going to use Sollya2, which is a freely available library and DSL, to build the core approximations. An additional assumption is that the target hardware has fused multiply-accumulate3 (FMA) hardware and that the language & compiler in use support generating them.
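To make that last assumption concrete, here’s a tiny illustrative snippet (not from the original post): fmaf from <math.h> computes a*b + c with a single rounding, and you generally need to compile for a target that actually has the instruction or it may lower to a slow library call.

#include <math.h>

// a*b + c with one rounding instead of two. Compile for an FMA-capable
// target (e.g. gcc/clang -mfma or an appropriate -march on x86; AArch64
// always has it) so this maps to a single instruction.
static inline float mul_add(float a, float b, float c)
{
  return fmaf(a, b, c);
}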
Given that we’re not working with a limited input range we can’t simply brute force the approximation and must:
- Perform a range reduction
- Compute the approximation on the reduced range
- Perform the required “fix-up” to undo the range reduction
So we have our first trade-off knob to turn: a smaller range makes the function easier to approximate, and the flip side is the amount of work required to get to that smaller range. The great news is both sine and cosine are pretty easy functions to approximate so we don’t need to reduce the range by much. Also, for these π-scaled functions the range reduction itself is cheap: it boils down to finding the nearest multiple of 1/2.
Throughout this post I’m going to be structuring the code in an awkward way which is intended to help visualize operations which can be performed in parallel.
Approximate the primary range
Let’s run with the reduced range of [-1/4, 1/4]. Since sine is odd and cosine is even we only need to perform the fits on [0, 1/4], and the core approximations fall out of two short Sollya statements:
S = fpminimax(sin(pi*x), [|1,3,5,7|], [|24...|], [0;1/4], floating, relative);
C = fpminimax(cos(pi*x), [|0,2,4,6,8|], [|24...|], [0;1/4], floating, relative);
The actual script used is slightly more complex and can be found (along with generated output) here. Additionally there’s a brief summary of the output at the end of this post.
The inputs are: the function to approximate, an explicit list of monomials to use, the number of significand bits per coefficient (24 for all), the range, floating-point (instead of fixed-point) coefficients, and optimize for relative error. A perhaps noteworthy item is the monomial list, which means we don’t need to rewrite the function and work out a modified range (which is needed for some software tools).
The sine and cosine approximations generated by the mini script above have supnorm errors small enough for our target (the exact bounds are in the Sollya output summary at the end of this post). The generated coefficients and evaluation:
const float S[] = { 0x1.921fb6p1f, -0x1.4abbecp2f, 0x1.466b2p1f, -0x1.2f5992p-1f };
const float C[] = {-0x1.3bd3ccp2f, 0x1.03c1b8p2f, -0x1.55b7cep0f, 0x1.d684aap-3f };
a2 = a*a;
c = fmaf(C[3], a2, C[2]); s = fmaf(S[3], a2, S[2]);
c = fmaf(c, a2, C[1]); s = fmaf(s, a2, S[1]);
c = fmaf(c, a2, C[0]); s = fmaf(s, a2, S[0]);
c = fmaf(c, a2, 1.f); s = a*s;
If we test the output of cosine we’d find 94.9% to be correctly rounded5 and the rest are faithfully rounded (target met). Notice that the coefficient table doesn’t include the constant term. This is so we can simply comment out the final FMA and swap in the degree 6 coefficient table (which does carry the constant term):
//const float C[] = {-0x1.3bd3ccp2f, 0x1.03c1b8p2f, -0x1.55b7cep0f, 0x1.d684aap-3f };
const float C[] = {0x1.fffffep-1f, -0x1.3bd37ep2f, 0x1.03acccp2f, -0x1.4dfd3ap0f};
// all the other terms as above
/*c = fmaf(c, a2, 1.f);*/ s = a*s;
In that case we get 63.2% correctly rounded and the rest faithfully. Although with this choice we significantly increase our average and RMS errors, we’d still be meeting our target error bound (maximum error)... which is arguably the only really interesting one. Moving on to the sine approximation we’d get 73.4692% correctly, 26.5289% faithfully and (sigh) 0.00183582% within 2 ULP. We have a polynomial that can meet our error bound but how we’re computing it isn’t sufficient. As a polynomial expression (written as evaluated) we currently have:

sin(πx) ≈ x·(S0 + x²·(S1 + x²·(S2 + x²·S3)))
Let’s plot these values:
Now let’s look at what happens if we rewrite the computation by distributing the outer multiply by x: the leading term x·S0 is split off and the remainder becomes x³·(S1 + x²·(S2 + x²·S3)),
and the updated code looks like:
a2 = a*a;
c = fmaf(C[3], a2, C[2]); s = fmaf(S[3], a2, S[2]); a3 = a2*a;
c = fmaf(c, a2, C[1]); s = fmaf(s, a2, S[1]);
c = fmaf(c, a2, C[0]); s = a3 * s;
c = fmaf(c, a2, 1.f); s = fmaf(a, S[0], s);
So we’re adding an operation (computing a3 = a2*a): the old fmaf(s, a2, S[0]) followed by a multiply by a becomes a multiply by a3 followed by fmaf(a, S[0], s), and in exchange the sine results now all land in the correctly or faithfully rounded buckets, so the target is met.
Range reduction
Our chosen primary range handles the quarter of the unit circle centered around zero, so the reduction needs to find the nearest integer multiple of 1/2, subtract it from the input, and remember which multiple it was:
r = nearbyintf(a+a);
a = fmaf(r,-.5f,a);
q = (uint32_t)((int32_t)r);
WARNING: As written the computation of q is undefined behavior once r no longer fits in an int32_t (roughly |a| >= 2^30), and the a+a can spuriously overflow to infinity when |a| >= 2^127. Compliant alternatives are covered further down.
Note the use of FMA to produce the reduced a. The reconstruction follows from the angle-addition identities6: writing the original input as a + r/2 (with the reduced a and the integer r we just removed),

cos(π(a + r/2)) = cos(πa)·cos(πr/2) - sin(πa)·sin(πr/2)

and likewise for sin. From the period we can see we only need the bottom two bits of q, and a direct (branchy) reconstruction is:
if (q & 2) { s = -s; c = -c; }
if (q & 1) { t = -s; s = c; c = t; }
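As a quick sanity check, take a = 0.8: r = nearbyintf(1.6) = 2, the reduced argument is 0.8 - 1.0 = -0.2 and q = 2. The core then produces cos(-0.2π) and sin(-0.2π), and the q & 2 case negates both, giving -cos(0.2π) = cos(0.8π) and -sin(-0.2π) = sin(0.2π) = sin(0.8π), as expected.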
To nuke these branches let’s define some helper functions:
// (assumes <stdint.h> and <string.h> are included)
inline uint32_t f32_to_bits(float x)      { uint32_t u; memcpy(&u,&x,4); return u; }
inline float    f32_from_bits(uint32_t x) { float f;    memcpy(&f,&x,4); return f; }
inline float    f32_mulsign(float v, uint32_t s) { return f32_from_bits(f32_to_bits(v)^s); }
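A couple of illustrative calls (not in the original):

float p = f32_mulsign(3.0f, 0x80000000u); // -> -3.0f (top bit set flips the sign)
float q = f32_mulsign(3.0f, 0);           // ->  3.0f (zero leaves the value alone)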
The first two simply let us move the floating-point bits to an integer and back again. The last simply XORs the two inputs, where the intended usage is to pass a float as the first parameter and, as the second, an integer which is either zero or has only the top bit set. This allows us to effectively multiply by either 1 (integer is zero) or -1 (only top bit set). So we can eliminate the sign-flip portions with:
uint32_t sy = (q << 31);
uint32_t sx = (q >> 1) << 31;
sy ^= sx;
// core approximation here
s = f32_mulsign(s,sy); c = f32_mulsign(c,sx);
if (q & 1) { float t = s; s = c; c = t; }
Now all that’s left is swapping the values. For a SIMD version we’d have to use a branch-free select method (a scalar sketch follows the snippet below); here I’m going to punt and have the function take an array pointer and write out the final cosine and sine values to the first and second elements respectively:
q &= 1;
// ....
dst[q ] = c;
dst[q^1] = s;
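For completeness, here’s one possible scalar branch-free swap using the bit helpers above (a sketch only; it’s not what the assembled code below does):

uint32_t m = (uint32_t)(-(int32_t)(q & 1));          // 0 or 0xffffffff
uint32_t d = (f32_to_bits(c) ^ f32_to_bits(s)) & m;  // difference, only when swapping
c = f32_from_bits(f32_to_bits(c) ^ d);
s = f32_from_bits(f32_to_bits(s) ^ d);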
Paste the parts together
Let’s take a look at all the parts put together:
// constants for sin(pi x) and cos(pi x) for x on [-1/4,1/4]
const float f32_sinpi_7_k[] = { 0x1.921fb6p1f, -0x1.4abbecp2f, 0x1.466b2p1f, -0x1.2f5992p-1f };
const float f32_cospi_8_k[] = {-0x1.3bd3ccp2f, 0x1.03c1b8p2f, -0x1.55b7cep0f, 0x1.d684aap-3f };
// dst[0] = cos(pi a), dst[1] = sin(pi a)
void f32_sincospi(float* dst, float a)
{
static const float* S = f32_sinpi_7_k;
static const float* C = f32_cospi_8_k;
float c,s,a2,a3,r;
uint32_t q,sx,sy;
// range reduce
r = nearbyintf(a+a);
a = fmaf(r,-0.5f,a);
// setting up for reconstruct
q = (uint32_t)((int32_t)r);
sy = (q<<31); sx = (q>>1)<<31; sy ^= sx; q &= 1;
// compute the approximations
a2 = a*a;
c = fmaf(C[3], a2, C[2]); s = fmaf(S[3], a2, S[2]); a3 = a2*a;
c = fmaf(c, a2, C[1]); s = fmaf(s, a2, S[1]);
c = fmaf(c, a2, C[0]); s = a3 * s;
c = fmaf(c, a2, 1.f); s = fmaf(a, S[0], s);
c = f32_mulsign(c,sx); s = f32_mulsign(s,sy);
dst[q ] = c;
dst[q^1] = s;
}
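A quick usage example (illustrative):

float cs[2];
f32_sincospi(cs, 0.25f); // cs[0] = cos(pi/4) ~ 0.70710678, cs[1] = sin(pi/4) ~ 0.70710678
f32_sincospi(cs, 1.0f);  // cs[0] = cos(pi) = -1, cs[1] = sin(pi) = 0 (sign of zero not guaranteed)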
There are still some minor inefficiencies I’m not going to bother to remove. The most obvious is that we can fold the two f32_mulsign computations into the core approximations; the rest is “I’m desperate, so let’s think about the ISA and uArch and look at what the compiler’s spitting out” territory. For the former: after we’ve computed a2 we could do a = f32_mulsign(a,sy), which covers adding that sign to a3, and we’re done for the core sign of sine. For cosine we’d need to add its sign sx to the constant one and to a copy of a2 used in the final FMA.
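A sketch of that fold (untested, just to make the description concrete):

// after a2 = a*a and the reconstruction signs sx,sy are available:
a = f32_mulsign(a, sy);             // a (and hence a3 = a2*a) now carry sine's sign
float one = f32_mulsign(1.0f, sx);  // signed constant term for cosine
float a2c = f32_mulsign(a2, sx);    // signed copy of a2 for cosine's final FMA
// ...same Horner chains as before, then:
c = fmaf(c, a2c, one);              // replaces fmaf(c, a2, 1.f) + f32_mulsign(c, sx)
s = fmaf(a, S[0], s);               // no trailing f32_mulsign(s, sy) needed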
Note that I’ve kept the cosine computation as the degree 8 version in the listing above. Here’s the variant using the degree 6 approximation instead, which drops the final FMA:
WARNING: An important consideration here is that the 6th degree cos approximation never returns 1. You get 1-ulp(1) instead.
// constants for sin(pi x) and cos(pi x) for x on [-1/4,1/4]
const float f32_sinpi_7_k[] = { 0x1.921fb6p1f, -0x1.4abbecp2f, 0x1.466b2p1f, -0x1.2f5992p-1f };
const float f32_cospi_6_k[] = { 0x1.fffffep-1f, -0x1.3bd37ep2f, 0x1.03acccp2f, -0x1.4dfd3ap0f };
void f32_sincospi(float* dst, float a)
{
static const float* S = f32_sinpi_7_k;
static const float* C = f32_cospi_6_k;
float c,s,a2,a3,r;
uint32_t q,sx,sy;
r = nearbyintf(a+a);
a = fmaf(r,-0.5f,a);
q = (uint32_t)((int32_t)r);
a2 = a*a;
sy = (q<<31); sx = (q>>1)<<31; sy ^= sx; q &= 1;
c = fmaf(C[3], a2, C[2]); s = fmaf(S[3], a2, S[2]); a3 = a2*a;
c = fmaf(c, a2, C[1]); s = fmaf(s, a2, S[1]);
c = fmaf(c, a2, C[0]); s = a3 * s;
c = f32_mulsign(c,sx); s = fmaf(a, S[0], s);
s = f32_mulsign(s,sy);
dst[q ] = c;
dst[q^1] = s;
}
Improved sinpi kernel added: 20221031
At the cost of an additional constant load we can rework the sinpi approximation to have a much lower average error and transform the majority of faithfully rounded outputs into correctly rounded ones. In the original version we reworked the polynomial evaluation to better incorporate the small tail portion. In this updated version we’ll instead represent the leading coefficient (the one near π) in extended precision as a pair of floats.
The following Sollya statement produces a new approximating polynomial:
S = fpminimax(sin(pi*x), [|1,3,5,7|], [|48,24...|], [0;1/4], floating, relative);
The only change is adding the ‘48’ to the list of the number of precision bits per coefficient. Recall single precision can represent 24 bits in the normal range and our extended version is twice that (so double precision would be: 106 and 53).
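The kernel below assumes a simple pair type, say typedef struct { float h, l; } f32_pair_t. For illustration only (the post’s constants come straight out of the Sollya output), splitting a wider value into such a pair looks like:

typedef struct { float h, l; } f32_pair_t;

// split a double into an unevaluated float pair: v ~ p.h + p.l
static f32_pair_t f32_pair_from_double(double v)
{
  f32_pair_t p;
  p.h = (float)v;                  // leading ~24 bits
  p.l = (float)(v - (double)p.h);  // the remainder, rounded to float
  return p;
}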
We also need to be able to multiply by our new two-part constant, which reshapes the tail of the evaluation:
// sin(pi*a) on [-1/4,1/4]
float f32_sinpi_k7a(float a)
{
const float C[] = {-0x1.4abbbap2f, 0x1.465edcp1f, -0x1.2d9e5ap-1f};
const f32_pair_t K = {.h = 0x1.921fb6p1f, .l=-0x1.9e5ee4p-24f};
float s,r;
s = a*a;
r = fmaf(s, C[2],C[1]);
r = fmaf(s, r, C[0]);
r = fmaf(a, K.h, a*fmaf(s,r, K.l));
return r;
}
Here are the numbers of faithfully (instead of correctly) rounded results for each of the kernels:
core | denormal | normal | other |
---|---|---|---|
f32_cospi_k6 | n/a | 1029659223 | 3085346 |
f32_cospi_k8 | n/a | 1015279 | 428668 |
f32_sinpi_k7 | 311624 | 342485931 | 1179023 |
f32_sinpi_k7a | 311624 | 37382902 | 252983 |
where the denormal column covers the range of inputs whose correctly rounded result is denormal (so it only applies to sinpi), normal covers the remaining inputs, and other gives the counts that will contribute to the other function (so cospi for the sinpi entries and vice versa) via the range reduction.
From the total number of inputs and the counts above for f32_cospi_k8 and f32_sinpi_k7a we can compute the percentage of faithfully (rather than correctly) rounded results for cospi and sinpi, and these figures hold across the input range handled by the range reduction.
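As an aside, here’s a minimal sketch of how such counts can be gathered, assuming a correctly rounded reference value is available from somewhere (say an arbitrary precision library):

#include <stdint.h>
#include <string.h>

// map float bits to a monotone unsigned integer so ordering matches the reals
static uint32_t f32_ordered(float x)
{
  uint32_t u; memcpy(&u, &x, 4);
  return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
}

// distance in ULPs from the correctly rounded reference: 0 = correctly rounded.
// (strictly classifying "faithfully rounded" needs the exact value, so treat a
// distance of 1 as a coarse stand-in for that bucket)
static uint32_t f32_ulp_dist(float computed, float reference)
{
  int64_t d = (int64_t)f32_ordered(computed) - (int64_t)f32_ordered(reference);
  return (uint32_t)(d < 0 ? -d : d);
}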
Less of an eye-sore and other stuff added: 20221031
For better or worse the original code above is attempting to visually spitball the instruction level parallelism possible in the computations. Let’s just repeat the original cores as individual functions.
// cos(pi*a) on [-1/4,1/4]
float f32_cospi_k8(float a)
{
static const float* C = f32_cospi_8_k;
float r,s;
s = a*a;
r = fmaf(C[3], s, C[2]);
r = fmaf(r, s, C[1]);
r = fmaf(r, s, C[0]);
r = fmaf(r, s, 1.f);
return r;
}
// sin(pi*a) on [-1/4,1/4]
float f32_sinpi_k7(float a)
{
static const float* C = f32_sinpi_7_k;
float r,a2,a3;
a2 = a*a; a3 = a2*a;
r = fmaf(C[3], a2, C[2]);
r = fmaf(r, a2, C[1]);
r = a3 * r;
r = fmaf(C[0], a, r);
return r;
}
and the original7 front-end range reduction:
// compile time valid provided: |a| <= 0x1.fffffep29 otherwise BOOM!
// runtime pins to infinity: |a| >= 2^127 which yields: -sign(a)*inf
float f32_pi_func_reduce(float a, uint32_t* q)
{
float r = nearbyintf(a+a); // <-- spurious overflow here when |a| >=2^127
float t = fmaf(-0.5f, r, a);
*q = (uint32_t)((int32_t)r); // <-- recall UB here if overflows
return t;
}
As previously mentioned there are some limitations with this reduction (see the code comments). I’m leaving it in place because I think it’s a useful version for my intended target audience, where almost all calls will be on inputs of modest magnitude (well inside the valid range noted in the comments).
But first: there is a compliant range reduction (and one without direct bit manipulation) in Norbert Juffa’s sincospi, which is written up on StackOverflow. Let’s take a look at the front-end part of his reduction:
// for |a| > 2**24, cospi(a) = 1.0f, but cospi(Inf) = NaN
az = a * 0.0f; // must be evaluated with IEEE-754 semantics
a = (fabsf (a) < 0x1.0p24f) ? a : az; // <-- comment is for here
r = nearbyintf (a + a); // must use IEEE-754 "to nearest" rounding
i = (int32_t)r;
t = fmaf (-0.5f, r, a);
OK: so az is just a zero with the same sign as a (and a NaN when a is infinite or NaN, which is how cospi(Inf) ends up as NaN), and then he limits a in the denoted line. In addition to preventing both the UB in the cast and the spurious overflow in the nearbyint call, it also prevents the overflow sticky bit/flag (as per so-called floating-point exceptions) from being set by the a + a feeding nearbyintf.
What I’d prefer to avoid (for a non-standard version) is the extra latency introduced by limiting a before computing r, since r (and then the reduced a) is what’s needed to start the polynomial evaluations. So instead let’s just remove the UB:
// runtime pins to infinity if |a| >= 2^127 which yields: -sign(a)*inf
float f32_pi_func_reduce(float a, uint32_t* q)
{
float r = nearbyintf(a+a);
float t = fmaf(-0.5f, r, a);
*q = (fabsf(a) < 0x1.0p24f) ? (uint32_t)((int32_t)r) : 0;
//*q = (r*r < 0x1.0p50f) ? (uint32_t)((int32_t)r) : 0;  // alternate guard (|r| < 2^25), avoids the fabs
return t;
}
The commented-out statement is a possible alternative (ARM has a hardware fabs, Intel has to mask off the sign bit; a one-liner for that follows the listing below). So anyway: a slightly less ugly version:
void f32_sincospi(float* dst, float a)
{
uint32_t q,sx,sy;
a = f32_pi_func_reduce(a, &q);
// whatever choice of core approximations
float c = f32_cospi_k8(a);
float s = f32_sinpi_k7a(a);
// or however you want the handle the reconstruction
sy = (q<<31); sx = (q>>1)<<31; sy ^= sx; q &= 1;
c = f32_mulsign(c,sx);
s = f32_mulsign(s,sy);
dst[q ] = c;
dst[q^1] = s;
}
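And regarding the “mask off the sign bit” remark above, that one-liner (using the earlier bit helpers) would be something like:

// |x| by clearing the sign bit (what fabsf typically compiles to anyway)
static inline float f32_abs(float x) { return f32_from_bits(f32_to_bits(x) & 0x7fffffffu); }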
Tau or full turn
Since recently there’s seemingly been an interest spike in using full turns (tau, where τ = 2π) as the angle measure, here’s the one-liner wrapper:
inline void f32_sincos_tau(float* dst, float a) { f32_sincospi(dst, 2.f*a); }
Provided your language and compiler are mildly aggressive with inlining then this is “free”; otherwise tweak the statement (or its equivalent) nearbyintf(a+a) to nearbyintf(4.f*a) in the range reduction (adjusting the rest of the front end to match) and you’re done. Well, other than documenting valid ranges given the choice of range reduction implementation. (Notice no rounding errors are introduced by this modification: the multiply by two is exact.)
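For example (illustrative): a quarter turn is a = 0.25, so

float cs[2];
f32_sincos_tau(cs, 0.25f); // cs[0] = cos(pi/2) = 0 (signed zero), cs[1] = sin(pi/2) = 1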
Sollya output summary
This section summarizes the output of the Sollya script, giving the supnorm and coefficients of various degree approximations for both single and double precision. The full output also includes tables of inputs where each approximation has zero and maximum errors on the primary range.
single precision sine:
degree | supnorm | supnorm |
---|---|---|
3 | | |
5 | | |
7 | | |
float f32_sin_pi_3_k[] = {0x1.91f5aap1f, -0x1.408cf2p2f};
float f32_sin_pi_5_k[] = {0x1.921f8ep1f, -0x1.4aa618p2f, 0x1.3f3f3cp1f};
float f32_sin_pi_7_k[] = {0x1.921fb6p1f, -0x1.4abbecp2f, 0x1.466b2p1f, -0x1.2f5992p-1f};
single precision cosine:
degree | supnorm | supnorm |
---|---|---|
4 | | |
6 | | |
8 | | |
float f32_cos_pi_4_k[] = {0x1.fffe74p-1f, -0x1.3ba0ecp2f, 0x1.f73ff8p1f};
float f32_cos_pi_6_k[] = {0x1.fffffep-1f, -0x1.3bd37ep2f, 0x1.03acccp2f, -0x1.4dfd3ap0f};
float f32_cos_pi_8_k[] = {0x1p0f, -0x1.3bd3ccp2f, 0x1.03c1b8p2f, -0x1.55b7cep0f, 0x1.d684aap-3f};
double precision sine:
degree | supnorm | supnorm |
---|---|---|
7 | | |
9 | | |
11 | | |
13 | | |
double f64_sin_pi_7_k[] = {0x1.921fb52e6a183p1, -0x1.4abbb9003982p2, 0x1.465e91ee1939cp1, -0x1.2d93021f38404p-1};
double f64_sin_pi_9_k[] = {0x1.921fb5443af5fp1, -0x1.4abbce564cd85p2, 0x1.466bba8bfcp1, -0x1.32ca854cad10ep-1, 0x1.4bc25574ce357p-4};
double f64_sin_pi_11_k[] = {0x1.921fb54442cf8p1, -0x1.4abbce6257778p2, 0x1.466bc670fa464p1, -0x1.32d2c627f47dap-1, 0x1.5071be0a2d3dap-4, -0x1.dd4e0ae5b9fcdp-8};
double f64_sin_pi_13_k[] = {0x1.921fb54442d18p1, -0x1.4abbce625be09p2, 0x1.466bc67754fffp1, -0x1.32d2ccdfe9424p-1, 0x1.50782d5f14825p-4, -0x1.e2fe76fdffd2bp-8, 0x1.e357ef99eb0bbp-12};
double precision cosine:
degree | supnorm | supnorm |
---|---|---|
6 | | |
8 | | |
10 | | |
12 | | |
14 | | |
double f64_cos_pi_6_k[] = {0x1.fffffee7d9c45p-1, -0x1.3bd38b6200606p2, 0x1.03ae64c9b1678p2, -0x1.4e35da5eabfc3p0};
double f64_cos_pi_8_k[] = {0x1.ffffffff84b78p-1, -0x1.3bd3cc6e7cf9dp2, 0x1.03c1daad03632p2, -0x1.55c4e9c28184p0, 0x1.d99f538fa683dp-3};
double f64_cos_pi_10_k[] = {0x1.ffffffffffdb1p-1, -0x1.3bd3cc9bd0994p2, 0x1.03c1f073cd5eap2, -0x1.55d3b9726d2d5p0, 0x1.e1e761359aeb5p-3, -0x1.a0d8acfd174acp-6};
double f64_cos_pi_12_k[] = {0x1.fffffffffffffp-1, -0x1.3bd3cc9be4565p2, 0x1.03c1f081af262p2, -0x1.55d3c7dac6352p0, 0x1.e1f4fa4f3bb22p-3, -0x1.a6c9501cb8cdap-6, 0x1.f3b7b6aafa2abp-10};
double f64_cos_pi_14_k[] = {0x1p0, -0x1.3bd3cc9be45dep2, 0x1.03c1f081b5a67p2, -0x1.55d3c7e3c325bp0, 0x1.e1f5067b90b37p-3, -0x1.a6d1e7294bffap-6, 0x1.f9c89ca1d5187p-10, -0x1.b167302e37198p-14};
References and footnotes
- “Sollya: an environment for the development of numerical codes” (project page)
- “Fused multiply accumulate” (Wikipedia)
- “Horner’s method” (Wikipedia)
- Correctly rounded: the floating point result is the nearest one to the exact value (the error is less than 1/2 ULP).
- “List of trigonometric identities” (Wikipedia)
- Actually there are a couple of modifications here as well. I was originally using rintf, which rounds according to the currently set rounding mode; this has been changed to using nearbyintf. On Intel and ARM this switch is free as both support it in hardware. The second change is that the original cast of q was directly to uint32_t instead of through int32_t first. Again, on Intel and ARM this is free.