I’m going do a quick run through of a design and implementation (in binary32 aka single precision) of a method to simultaneously compute:

\begin{align*} \sin\left(\pi a\right) \\ \cos\left(\pi a\right) \end{align*}

which is going to contain both minimal details and validation.

• non rigorous: don’t care about how float-point sticky flags are effected nor about how the how sign of zeroes, NaNs and infinities pan out.
• highly accurate: target “faithfully rounded” (error less than 1 ULP)
• no limit on input range
• high throughput computation
• branch-free: both so input can be effectively random (not burdened by branch mispreditions) and can be directly extended to SIMD.

I’m going to assume basic familiarity with polynomial (and in particular minimax) approximations. Robin Green has some very accessible introductory material1 on this topic. I’m going to use Sollya2 which is a freely available library and DSL to build the core approximations. An additional assumption is that the target hardware has fused multiply accumulate3 (FMA) hardware and the used language & compiler support generating them.

Given that we’re not working with a limited input range we can’t simply brute force the approximation and must:

1. Perform a range reduction
2. Compute the approximation on the reduce range
3. Perform the required “fixed-up” to undo the range reduction

So we have our first trade-off knob to turn: a smaller range makes the function easier to approximate and the flip side is the amount of work required to get that smaller range. The great news is both sine and cosine are pretty easy functions to approximate so we don’t need to reduce the range by much. Also the range reduction for the $\pi$ variants isn’t the PITA of the standard versions.

Throughout this post I’m going to be structuring the code in an awkward way which is intended to help visualize operations which can be performed in parallel.

## Approximate the primary range

Let’s run with the reduced range of $a \in \left[-\frac{1}{4},~\frac{1}{4}\right]$. A minimal sollya script to generate the approximations:

  S = fpminimax(sin(pi*x), [|1,3,5,7|],   [|24...|], [0;1/4], floating, relative);
C = fpminimax(cos(pi*x), [|0,2,4,6,8|], [|24...|], [0;1/4], floating, relative);


The actual script used is slightly more complex and can be found (along with generated output) here. Additionally there’s a brief summary of the output at the end of this post.

The inputs are: function to approximate, explicit list of mononomials to use, number of significand bits per component (24 for all), range, floating point (instead of fixed point) coefficients and optimize for relative error. A perhaps noteworthy item is the monomial list which means we don’t need rewrite the function and work out a modified range (which is needed for some software tools).

The sine and cosine approximations generated by the mini script above have errors less than $2^{-27}$ and $2^{-30}$ respectively so both of these should be good for our target. Let’s take the output and shove it into a code fragment which evaluates each using Horner’s method4:

If we test the output of cosine we’d find 94.9% to be correctly rounded5 and the rest are faithfully rounded (target met). Notice that the coefficient table doesn’t include constant term. This is so we can simply comment out the final FMA and replace the $8^{\text{th}}$ with the $6^{\text{th}}$ order polynomial coefficients as follows:

In that case we get 63.2% correctly rounded and rest faithfully. Although with this choice we significantly increase our average and RMS errors we’d still be meeting our target error bound (maximum error)..which is arguably only really interesting one. Moving on to the sine approximation we’d get 73.4692% correctly, 26.5289% faithfully and (sigh) 0.00183582% within 2 ULP. We have a polynomial that can meet our error bound but how we’re computing it isn’t sufficient. As a polynomial expression (written as evaluated) we currently have:

$a (s_0 + \underbrace{a^2 (s_1 + a^2 (s_2 + a^2 s_3))}_{\text{tail}})$

Let’s plot these values:

Now let’s look at what happens if we rewrite the computation by distributing the outer $a$:

$\sin\left(\pi~a\right) \approx \underbrace{a~s_0}_{\text{head}} + \underbrace{a^3 (s_1 + a^2 (s_2 + a^2 s_3))}_{\text{tail}}$

and the updated code looks like:

So we’re adding an operation (computing $a^3$) but it can be computed in parallel. With this change we get 85.95% correctly and the remaining faithfully rounded. Let’s spitball this. In the original expression the ‘tail’ is being added to $s_0$ (where $s_0 \approx \pi$) in the C expression: fmaf(s, a2, S[0]) and $s_0$ is significantly larger than the magnitude of ‘tail’. The reworked form computes its subexpression marked ‘tail’ and rounds, then the final FMA computes ‘head’ $({a~s_0})$ and adds its ‘tail’ in a single rounding step.

## A brief aside about factorization

The polynomials are too short to get any real mileage out of considering alternate factorizations, but just for fun I’ll show how using a 2nd-order Horner’s and Estrin’s method6 play out wrt error bound for the cosine.

method degree CR 1 ULP 2 ULP
Horner 8 94.8916 5.10845
Horner 2nd 8 93.6111 6.38889
Estrin 8 93.6038 6.39621
Horner 6 63.1888 36.81120
Horner 2nd 6 43.9763 51.68299 4.340697
Estrin 6 58.2155 40.86100 0.923443

And to appease the math library gods I’ll note that Estrin’s method not generally the best factorization wrt throughput. This is a big topic and under active research. If you’re reading this much later than the publication date, then the Code Generation for Polynomial Evaluation7 project is suppose to get an update (including floating-point support) in the near-ish future. Actually forget I talked about any of this.

(Note to self: delete this section prior to pushing)

## Range reduction

Our chosen primary range handles the quarter of the unit circle centered around $a=0$. We now need the code to narrow to our target range on the front end and handle the reverse mappings at the end.

Note the use of FMA to produce the reduced $a$ isn’t needed for accuracy (the result is exact computed with or without) and is being used in case the language and/or compiler disallows promoting (and I’m assuming that an explicit FMA is faster than a multiply add sequence). Now that we have the input factored into reduced $a$ and $q$ we can write the reconstruction as:

$\cos\left(\pi~a + \pi \frac{q}{2}\right)$

and likewise for sin. From the period we can see we only need to bottom two bits of $q$ and from the phase shift identities8 we get:

To nuke these branches let’s define some helper functions:

The first two simply let us move the floating point bits to an integer and back again. The last simply XORs the two inputs where the intended usage is to pass in a float as the first parameter and the second an integer which is either zero or only has the top bit set. This allows us to multiply by either 1 (integer is zero) or by -1 (only top bit set). So we can eliminate the sign flip portions with:

Now all that’s left is swapping the values. For a SIMD version we have to use a branch-free select method, here I’m going to punt and have the function pass in an array pointer and write out the final cosine and sine values to the first and second elements respectively:

## Paste the parts together

Let’s take a look at all the parts put together:

There’s still some minor inefficiencies I’m not going to bother to remove. The most obvious is we can fold the two f32_mulsign computations into the core approximations and the rest is “I’m desperate so let’s think about the ISA and uArch and look at what the compiler’s spitting out”. For the former after we’ve computed a2 we could do a = f32_mulsign(a,sy) which covers adding that sign to a3 and we’re done of the core sign of sine. For cosine we’d need to add its sign sx to the one and to a copy of a2 for the final FMA.

Note that I’ve kept the cosine computation as the $8^{\text{th}}$-order polynomial. Dropping a term is a much bigger win then the stuff in the previous paragraph (and recall we’re still faithfully rounded). A cut-n-paste version for ya:

## Sollya output summary

This section summarizes the output of the sollya script giving the supnorm and coefficents of various degree approximations for both single and double precision. The full output also include tables of inputs where each has zero and maximum errors on the primary range.

single precision sine:

degree supnorm supnorm
3 $\text{0x1.2ed540} \cdot 2^{-12}$ $\approx 2.8880 \cdot 10^{-4}$
5 $\text{0x1.1ea062} \cdot 2^{-20}$ $\approx 1.0678 \cdot 10^{-6}$
7 $\text{0x1.71dde6} \cdot 2^{-28}$ $\approx 5.3824 \cdot 10^{-9}$

single precision cosine:

degree supnorm supnorm
4 $\text{0x1.8c } \cdot 2^{-17}$ $\approx 1.1802\cdot 10^{-5}$
6 $\text{0x1.0 } \cdot 2^{-24}$ $\approx 5.9605\cdot 10^{-8}$
8 $\text{0x1.1093e2} \cdot 2^{-31}$ $\approx 4.9582\cdot 10^{-10}$

double precision sine:

degree supnorm supnorm
7 $\text{0x1.3ab388049b2fb} \cdot 2^{-29}$ $\approx 2.2898\cdot 10^{-9}$
9 $\text{0x1.c4c3c95e213b2} \cdot 2^{-39}$ $\approx 3.2171\cdot 10^{-12}$
11 $\text{0x1.ca8315e19366f} \cdot 2^{-49}$ $\approx 3.1816\cdot 10^{-15}$
13 $\text{0x1.3ebb2d89a070c} \cdot 2^{-59}$ $\approx 2.1598\cdot 10^{-18}$

double precision cosine:

degree supnorm supnorm
6 $\text{0x1.18263bb } \cdot 2^{-25}$ $\approx 3.2614 \cdot 10^{-8}$
8 $\text{0x1.ed22 } \cdot 2^{-35}$ $\approx 5.6063 \cdot 10^{-11}$
10 $\text{0x1.278 } \cdot 2^{-44}$ $\approx 6.5614 \cdot 10^{-14}$
12 $\text{0x1.0 } \cdot 2^{-53}$ $\approx 1.1102 \cdot 10^{-16}$
14 $\text{0x1.c8900a3e39182} \cdot 2^{-62}$ $\approx 3.8672 \cdot 10^{-19}$

## References and footnotes

1. “Faster math function”, Robin Green (link)

2. “Sollya: an environment for the development of numerical codes”, (project page)

3. “Fused multiply accumulate” (Wikipedia)

4. “Horner’s method” (Wikipedia)

5. Correctly rounded: the floating point result is the nearest one to the exact (the error is less that 1/2 ULP).

6. “Estrin’s scheme” (Wikipedia)

7. “CGPE: Code Generation for Polynomial Evaluation” (project page)

8. “List of trigonometric identities” (Wikipedia)