I’m going do a quick run through of a design and implementation (in binary32 aka single precision) of a method to simultaneously compute:

\begin{align*} \sin\left(\pi a\right) \\ \cos\left(\pi a\right) \end{align*}

which is going to contain both minimal details and validation.

• non rigorous: don’t care about how float-point sticky flags are effected nor about how the how sign of zeroes, NaNs and infinities pan out.
• highly accurate: target “faithfully rounded” (error less than 1 ULP)
• no limit on input range (ok..caveats. see below)
• high throughput computation
• branch-free: both so input can be effectively random (not burdened by branch mispreditions) and can be directly extended to SIMD.

I’m going to assume basic familiarity with polynomial (and in particular minimax) approximations. Robin Green has some very accessible introductory material1 on this topic. I’m going to use Sollya2 which is a freely available library and DSL to build the core approximations. An additional assumption is that the target hardware has fused multiply accumulate3 (FMA) hardware and the used language & compiler support generating them.

Given that we’re not working with a limited input range we can’t simply brute force the approximation and must:

1. Perform a range reduction
2. Compute the approximation on the reduce range
3. Perform the required “fixed-up” to undo the range reduction

So we have our first trade-off knob to turn: a smaller range makes the function easier to approximate and the flip side is the amount of work required to get that smaller range. The great news is both sine and cosine are pretty easy functions to approximate so we don’t need to reduce the range by much. Also the range reduction for the $\pi$ variants isn’t the PITA of the standard versions.

Throughout this post I’m going to be structuring the code in an awkward way which is intended to help visualize operations which can be performed in parallel.

Approximate the primary range

Let’s run with the reduced range of $a \in \left[-\frac{1}{4},~\frac{1}{4}\right]$. A minimal sollya script to generate the approximations:

  S = fpminimax(sin(pi*x), [|1,3,5,7|],   [|24...|], [0;1/4], floating, relative);
C = fpminimax(cos(pi*x), [|0,2,4,6,8|], [|24...|], [0;1/4], floating, relative);


The actual script used is slightly more complex and can be found (along with generated output) here. Additionally there’s a brief summary of the output at the end of this post.

The inputs are: function to approximate, explicit list of mononomials to use, number of significand bits per component (24 for all), range, floating point (instead of fixed point) coefficients and optimize for relative error. A perhaps noteworthy item is the monomial list which means we don’t need rewrite the function and work out a modified range (which is needed for some software tools).

The sine and cosine approximations generated by the mini script above have errors less than $2^{-27}$ and $2^{-30}$ respectively so both of these should be good for our target. Let’s take the output and shove it into a code fragment which evaluates each using Horner’s method4:

If we test the output of cosine we’d find 94.9% to be correctly rounded5 and the rest are faithfully rounded (target met). Notice that the coefficient table doesn’t include constant term. This is so we can simply comment out the final FMA and replace the $8^{\text{th}}$ with the $6^{\text{th}}$ order polynomial coefficients as follows:

In that case we get 63.2% correctly rounded and rest faithfully. Although with this choice we significantly increase our average and RMS errors we’d still be meeting our target error bound (maximum error)..which is arguably only really interesting one. Moving on to the sine approximation we’d get 73.4692% correctly, 26.5289% faithfully and (sigh) 0.00183582% within 2 ULP. We have a polynomial that can meet our error bound but how we’re computing it isn’t sufficient. As a polynomial expression (written as evaluated) we currently have:

$a (s_0 + \underbrace{a^2 (s_1 + a^2 (s_2 + a^2 s_3))}_{\text{tail}})$

Let’s plot these values:

Now let’s look at what happens if we rewrite the computation by distributing the outer $a$:

$\sin\left(\pi~a\right) \approx \underbrace{a~s_0}_{\text{head}} + \underbrace{a^3 (s_1 + a^2 (s_2 + a^2 s_3))}_{\text{tail}}$

and the updated code looks like:

So we’re adding an operation (computing $a^3$) but it can be computed in parallel. With this change we get 85.95% correctly and the remaining faithfully rounded. Let’s spitball this. In the original expression the ‘tail’ is being added to $s_0$ (where $s_0 \approx \pi$) in the C expression: fmaf(s, a2, S[0]) and $s_0$ is significantly larger than the magnitude of ‘tail’. The reworked form computes its subexpression marked ‘tail’ and rounds, then the final FMA computes ‘head’ $({a~s_0})$ and adds its ‘tail’ in a single rounding step.

Range reduction

Our chosen primary range handles the quarter of the unit circle centered around $a=0$. We now need the code to narrow to our target range on the front end and handle the reverse mappings at the end.

Note the use of FMA to produce the reduced $a$ isn’t needed for accuracy (the result is exact computed with or without) and is being used in case the language and/or compiler disallows promoting (and I’m assuming that an explicit FMA is faster than a multiply add sequence). Now that we have the input factored into reduced $a$ and $q$ we can write the reconstruction as:

$\cos\left(\pi~a + \pi \frac{q}{2}\right)$

and likewise for sin. From the period we can see we only need to bottom two bits of $q$ and from the phase shift identities6 we get:

To nuke these branches let’s define some helper functions:

The first two simply let us move the floating point bits to an integer and back again. The last simply XORs the two inputs where the intended usage is to pass in a float as the first parameter and the second an integer which is either zero or only has the top bit set. This allows us to multiply by either 1 (integer is zero) or by -1 (only top bit set). So we can eliminate the sign flip portions with:

Now all that’s left is swapping the values. For a SIMD version we have to use a branch-free select method, here I’m going to punt and have the function pass in an array pointer and write out the final cosine and sine values to the first and second elements respectively:

Paste the parts together

Let’s take a look at all the parts put together:

There’s still some minor inefficiencies I’m not going to bother to remove. The most obvious is we can fold the two f32_mulsign computations into the core approximations and the rest is “I’m desperate so let’s think about the ISA and uArch and look at what the compiler’s spitting out”. For the former after we’ve computed a2 we could do a = f32_mulsign(a,sy) which covers adding that sign to a3 and we’re done of the core sign of sine. For cosine we’d need to add its sign sx to the one and to a copy of a2 for the final FMA.

Note that I’ve kept the cosine computation as the $8^{\text{th}}$-order polynomial. Dropping a term is a much bigger win then the stuff in the previous paragraph (and recall we’re still faithfully rounded). A cut-n-paste version for ya:

At the cost of an additional constant load we can rework the sinpi approximation to have a much lower average error and transform the majority of faithfully rounded outputs to correctly rounded. In the original version we reworked the polynomial evaluation to better incorporate the small tail portion. In this updated version we’ll instead represent the leading coefficent $s_0$ in extended precision.

The following sollya statement produces a new approximating polynomial:

      S = fpminimax(sin(pi*x), [|1,3,5,7|],   [|48,24...|], [0;1/4], floating, relative);


the only change is adding the ‘48’ to the list of the number of precision bits per coefficient. Recall single precision can represent 24-bits in the normal range and our extended version is twice that (so double precision would be: 106 and 53).

We also need to be able to multiply by our new $s_0$ which is covered in another post which also covers some more details. The updated core approximation then looks like (K is the extended precision coefficient and C the remaining):

Here are the number of faithfully instead of correctly rounded results on $\left[0,\frac{1}{4}\right]$ for each kernel:

core denormal normal other
f32_cospi_k6 n/a 1029659223 3085346
f32_cospi_k8 n/a 1015279 428668
f32_sinpi_k7 311624 342485931 1179023
f32_sinpi_k7a 311624 37382902 252983

where the denormal is limited to range of correctly rounded denormal results (so only applies to sinpi), normal for the remaining and other are the counts that will contribute to the other function (so cospi for sinpi entries and vice versa) via range reduction.

The total number of inputs on $\left[0,\frac{1}{2}\right]$ is 1056964608 so for the pair f32_cospi_k8 and f32_sinpi_k7a we can compute the percentage of faithfully rounded for cospi/sinpi by:

\begin{align*} 100 (1015279 + 252983)/1056964608 & \approx 0.119991\% \\ 100 (311624 + 37382902 + 428668)/1056964608 & \approx 3.606860\% \end{align*}

and these are valid for input on the range $\left[-1,1\right]$. These numbers improve as we grow the range since we have less input precision (one less bit each time we move up a power of two boundary). But anyway this is the really interesting range IMHO.

Less of an eye-sore and other stuff added: 20221031

For better or worse the original code above is attempting to visually spitball the instruction level parallelism possible in the computations. Let’s just repeat the original cores as individual functions.

and the original7 front-end range reduction:

As previously mentioned there’s some limitations with this reduction (see code comments). I’m leaving this in place because I think it’s a useful version for my intended target audience where almost all calls will be on $\left[-1,1\right]$ and still works up to $\left| a \right| \lt 2^{30}$. However I’ll toss out a modification to handle the UB case as well.

But first there is a compliant range reduction in Norbert Juffa’s sincospi (& w/o direct bit manipulation) which written up on StackOverflow. Let’s take a look the front-end part of his reduction:

OK: so az is just the zero with the same sign as a, then he limits a in the denoted line. In addition to preventing both UB in the cast and the spurious overflow in the nearbyint call it also prevents the overflow stick bit/flag (as per so-called floating point exceptions) from being set by nearbyint.

What I’d prefer to avoid (for a non-standard version) is the extra latency introduced limiting a before computing r which is needed to start the polynomial evaluations and just remove the UB:

the commented out statement is a possible alternate (ARM has fabs, Intel has to mask off bit). So anyway: a slightly less ugly version:

Tau or full turn

Since recently there’s seemingly been an interest spike in $\tau$ or “full turn” variants (perhaps from Casey Muratori’s tweet) we might as well cover that case as well:

provided your language and compiler are mildly aggressive with inlining then this is “free” otherwise tweaking the statement (or equivalent) nearbyintf(a+a) to nearbyintf(4.f*a) in the range reduction and you’re done. Well other than documenting valid ranges given the choice of range reduction implementation. (notice no rounding errors are introduced by this modification)

Sollya output summary

This section summarizes the output of the sollya script giving the supnorm and coefficents of various degree approximations for both single and double precision. The full output also include tables of inputs where each has zero and maximum errors on the primary range.

single precision sine:

degree supnorm supnorm
3 $\text{0x1.2ed540} \cdot 2^{-12}$ $\approx 2.8880 \cdot 10^{-4}$
5 $\text{0x1.1ea062} \cdot 2^{-20}$ $\approx 1.0678 \cdot 10^{-6}$
7 $\text{0x1.71dde6} \cdot 2^{-28}$ $\approx 5.3824 \cdot 10^{-9}$

single precision cosine:

degree supnorm supnorm
4 $\text{0x1.8c } \cdot 2^{-17}$ $\approx 1.1802\cdot 10^{-5}$
6 $\text{0x1.0 } \cdot 2^{-24}$ $\approx 5.9605\cdot 10^{-8}$
8 $\text{0x1.1093e2} \cdot 2^{-31}$ $\approx 4.9582\cdot 10^{-10}$

double precision sine:

degree supnorm supnorm
7 $\text{0x1.3ab388049b2fb} \cdot 2^{-29}$ $\approx 2.2898\cdot 10^{-9}$
9 $\text{0x1.c4c3c95e213b2} \cdot 2^{-39}$ $\approx 3.2171\cdot 10^{-12}$
11 $\text{0x1.ca8315e19366f} \cdot 2^{-49}$ $\approx 3.1816\cdot 10^{-15}$
13 $\text{0x1.3ebb2d89a070c} \cdot 2^{-59}$ $\approx 2.1598\cdot 10^{-18}$

double precision cosine:

degree supnorm supnorm
6 $\text{0x1.18263bb } \cdot 2^{-25}$ $\approx 3.2614 \cdot 10^{-8}$
8 $\text{0x1.ed22 } \cdot 2^{-35}$ $\approx 5.6063 \cdot 10^{-11}$
10 $\text{0x1.278 } \cdot 2^{-44}$ $\approx 6.5614 \cdot 10^{-14}$
12 $\text{0x1.0 } \cdot 2^{-53}$ $\approx 1.1102 \cdot 10^{-16}$
14 $\text{0x1.c8900a3e39182} \cdot 2^{-62}$ $\approx 3.8672 \cdot 10^{-19}$

References and footnotes

1. “Faster math function”, Robin Green (link)

2. “Sollya: an environment for the development of numerical codes”, (project page)

3. “Fused multiply accumulate” (Wikipedia)

4. “Horner’s method” (Wikipedia)

5. Correctly rounded: the floating point result is the nearest one to the exact (the error is less that 1/2 ULP).

6. “List of trigonometric identities” (Wikipedia)

7. Actually there’s a couple of modification here as well. I was originally using rintf which rounds according to the currently set rounding mode. This has been changed to using nearbyintf. On Intel and ARM this switch is free as both support in hardware. The second change was the original cast of q was directly to uint32_t instead of through int32_t first. Again on Intel and ARM this is free.