December 10th, 2019

Okay. So I made a time-killing post on twitter that read:

> A silly little approximation of x^2.2
>
> `(0.5f*x*x)*(sqrtf(x)+1.f)`

which was intended to be mostly a joke and just a little bit serious. Since a lot of people actually ended up looking at the thread I figure I should be a good net citizen and clarify what's a joke and what's not.

The "approximation" above can be rewritten as:

\[a~x^2 + \left(1-a\right)~x^2\sqrt{x}\]

with $a=1/2$, so it's saying: Hey! It's just about the average of $x^2$ and $x^{2.5}$! This was a pure joke to have a very short and unlikely-looking approximation. Let's ignore whether it's a good idea or not and look at some better constant choices.

But first a little bookkeeping. Throughout this post I'm optimizing for absolute error, and most often relative error is the way to go. All given constants are for single precision and are representable. I'm going to be more-or-less mainstream-CPU specific in my commentary..my GPU knowledge is (let's say) a bit rusty.

Okay, optimizing $a$ (method below) gives:

\[\begin{align*} a & = 0.564024388790130615234375 \\ 1-a & = 0.435975611209869384765625 \end{align*}\]

Well, that's pretty close to $0.5$. This gives us a version that's exact at both $x=0$ and $x=1$. We could also relax our implied exact-at-1 requirement by optimizing $a$ and $b$ for the following:

\[a~x^2 + b~x^2\sqrt{x}\]

which gives:

\[\begin{align*} a & = 0.556629836559295654296875 \\ b & = 0.444691956043243408203125 \end{align*}\]

and we could carry on by relaxing the exact-at-0 requirement with:

\[a~x^2 + b~x^2\sqrt{x} + c\]

but AFAIK nobody really cares about 2.2 ramps (directly computed that way), so let's stop and look at error plots:

*(interactive error plot of the $x^{2.2}$ approximations)*

So we can see the tweet joke version is total garbage. We don't even need the other traces to see this, and here's how if you don't know. We're looking at an error plot which is showing the kind of error we're interested in. If the approximation is close to being optimal (for that error) then it should "wiggle" between two peak values $\pm e$ and "kiss" them before heading off in the other direction. Roughly speaking, for an approximation to be (near) optimal it should minimize the maximum error (minimax), which gives error plots as described. The other two approximations do have this property.

## The $x^{2.4}$ ramp

*no this isn't a complete sRGB conversion*

Repeating the same process with $x^{2.4}$ for exact at both $(a)$, relaxed at 1 $(ab)$, and relaxed at both endpoints $(abc)$:

\[\begin{align*}
& \text{method: } a \\
a & = 0.17758834362030029296875 \\
1-a & = 0.82241165637969970703125 \\[4pt]
& \text{method: } ab \\
a & = 0.17309470474720001220703125 \\
b & = 0.827698051929473876953125 \\[4pt]
& \text{method: } abc \\
a & = 0.181508541107177734375 \\
b & = 0.81964015960693359375 \\
c & = -5.7434104382991790771484375 \times 10^{-4}
\end{align*}\]

Also we could consider what would happen if we reworked the square root into a reciprocal square root and just used the hardware approximate opcode (where available and without any NR step). Using method $ab$ as the example:

\[a~x^2 + b~x^2\sqrt{x} = \left(a~x^2\right) + \left(b~x^3\right) \left(\sqrt{x}\right)^{-1}\]

This introduces a gotcha for $x=0$ since $1/\sqrt{\pm0} = \pm\infty$, which we'd end up multiplying by zero and getting a $\text{NaN}$ (but if we were talking about sRGB then a zero input doesn't occur).
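To make the forms concrete, here's a minimal C sketch (the function names are mine, not from the post's promised source): the tweet's joke version, the $ab$ sqrt versions for $x^{2.2}$ and $x^{2.4}$ with the constants given above, and the rsqrt rework. The rsqrt variant assumes an x86/SSE target and uses the approximate `_mm_rsqrt_ss` op with no Newton-Raphson step, so it inherits that op's roughly 12-bit relative error plus the $x=0$ NaN gotcha just described.

```c
// a sketch under the above assumptions, not the post's actual source
#include <math.h>
#include <immintrin.h>   // _mm_rsqrt_ss / _mm_set_ss / _mm_cvtss_f32 (x86 SSE)

// the tweet's joke version of x^2.2
static inline float pow_2_2_joke(float x) { return (0.5f*x*x)*(sqrtf(x)+1.f); }

// x^2.2 ~ a*x^2 + b*x^2*sqrt(x), method 'ab' constants from above
static inline float pow_2_2_ab(float x)
{
  float x2 = x*x;
  return 0.556629836559295654296875f*x2 + 0.444691956043243408203125f*(x2*sqrtf(x));
}

// x^2.4 ~ a*x^2 + b*x^2*sqrt(x), method 'ab' constants from above
static inline float pow_2_4_ab(float x)
{
  float x2 = x*x;
  return 0.17309470474720001220703125f*x2 + 0.827698051929473876953125f*(x2*sqrtf(x));
}

// x^2.4 reworked as a*x^2 + (b*x^3)*rsqrt(x) using the approximate hardware op
// (no NR step), so it also carries that op's ~2^-12 relative error.
// NOTE: x == 0 gives 0*inf = NaN, as discussed above.
static inline float pow_2_4_ab_rsqrt(float x)
{
  float r  = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
  float x2 = x*x;
  return 0.17309470474720001220703125f*x2 + 0.827698051929473876953125f*((x2*x)*r);
}
```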
Now that we have some approximations we need to measure if they have any merit, and there are two obvious questions:

1. Is the error sufficiently small to be usable for the given problem?
2. Does it have higher throughput than our other known choices? (well, assuming performance is important..we'll pretend like it is)

To have an error bound target let's just say the result is going to be truncated to a 10-bit integer (just to randomly make up a problem); then any approximation whose peak error is less than $2^{-10}$ is sufficient. Moreover any approximation that spends additional cycles pushing the error bound lower than this is performing useless work (since it doesn't contribute to the result).

For an alternate method we can generate a minimax polynomial of high enough degree to meet the error bound requirement (method and code given below). Just to add an equation, a fourth degree polynomial looks like this:

\[a x^4 + b x^3 + c x^2 + d x + e\]

Let's cut to the error plot (you can hide traces by clicking on their names in the legend):

*(interactive error plot of the $x^{2.4}$ approximations and the degree 3/4 minimax polynomials)*

The Y axis has ticks/grid-lines at $\pm2^{-10}$ and $\pm2^{-11}$, and we can immediately observe that all the approximations meet the minimal error requirement except the degree 3 polynomial (remez 3). The fourth degree has a low enough error for 11 bits, so we could find a new polynomial with the constant term $e$ fixed to zero (exercise for the reader). Likewise $abc$ requires an extra constant so it's not worth considering (aside: in the twitter thread I joked about $ab$ not being exact at 1 being a sadface..totally unimportant).

Using the computer math library/DSL Sollya I computed the peak errors of each:

| method  | peak error (decimal) | peak error (hex float)  |
|---------|----------------------|-------------------------|
| remez 3 | 0.00113705781122     | $1.2a12b \cdot 2^{-10}$ |
| remez 4 | 0.00026079002477     | $1.11755 \cdot 2^{-12}$ |
| a       | 0.00092451750243     | $1.e4b6a \cdot 2^{-11}$ |
| ab      | 0.00079275667667     | $1.9fa20 \cdot 2^{-11}$ |
| abc     | 0.00057435967028     | $1.2d214 \cdot 2^{-11}$ |
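If you don't have Sollya handy, a brute-force scan is enough to sanity-check numbers of this size. This is not how the table was produced (that was Sollya); it's just a dense-sampling sketch against the $ab$ method for $x^{2.4}$, and uniform sampling can slightly under-report the true peak.

```c
// brute-force sanity check of the peak absolute error, 'ab' method for x^2.4
#include <math.h>
#include <stdio.h>

int main(void)
{
  double max_err = 0.0;

  for (int i = 0; i <= 1000000; i++) {
    float x = (float)(i * 1e-6);

    // method 'ab' for x^2.4, constants as given above
    float x2 = x*x;
    float ap = 0.17309470474720001220703125f*x2
             + 0.827698051929473876953125f*(x2*sqrtf(x));

    double err = fabs((double)ap - pow((double)x, 2.4));
    if (err > max_err) max_err = err;
  }

  // expect something close to the 'ab' table entry (~0.000793)
  printf("peak abs error ~= %.11e\n", max_err);
  return 0;
}
```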
So what about the second question? What about performance? Well, to spitball it: using any of the above will be roughly $5\times$ faster than a standard power function! Oh, compared to each other? Well…it depends. We roughly have the same number of ISA ops for each. This godbolt has (completely useless) scalar versions as functions. The mix of ops and port usages are somewhat similar. The potentially interesting differences are: the sqrt and rsqrt versions need fewer constants, which might lead to fewer register spills, and the throughput numbers on sqrt (scalar/SIMD) & rsqrt (scalar/SIMD) are on the move in recent-ish microarchitectures. So we'd really need to look at them in flagrante (or is it in situ, I always mix them up)…whatever…the surrounding work and given microarch are quite important to the answer.

## whY U No TaLk sRGB???

Because I'm lazy, grumpy and have issues and questions. (whispers: I started a follow-up that does. We'll see if it has anything interesting or not)

## Method (and lack-thereof) and the worst code snippets on the planet

The various constants for equations with the square root were produced in Mathematica like this:

```mathematica
t = Table[{x, x^2.4}, {x, 0, 1, 1/100000}];
f = x x (a + b Sqrt[x]);
p = FindFit[t, f, {a, b}, x, NormFunction -> (Norm[#, Infinity] &)]
f /. p
```

Which does the following: $t$ is just an array of $\left(x,~x^{2.4}\right)$ values sampled on $\left[0,~1\right]$. $f$ defines the expression we want to optimize. $p$ is the result of the optimization: take the table $t$ and the expression $f$, where $a$ and $b$ are the values to optimize, $x$ is the function input, and use the supremum norm as the error measure. This generates sub-optimal coefficients, but it doesn't matter here since improving the error bound is useless for the presented toy problem.

For the polynomials I used Sollya. As an example, the degree 4 was generated by:

```
fpminimax(x^2.4, 4, [|24...|], [0;1], floating, absolute);
```

Full Mathematica, Sollya and C source will be linked in a follow-up post.
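Until then, here's a hedged sketch of how a degree-4 fpminimax result would typically be dropped into scalar C: a Horner form built from FMAs. The coefficient values below are placeholders (not actual fpminimax output); paste in whatever the call above prints.

```c
// a sketch only: Horner evaluation of the degree-4 minimax polynomial
// a*x^4 + b*x^3 + c*x^2 + d*x + e. The constants are PLACEHOLDERS --
// replace them with the single-precision coefficients fpminimax prints.
#include <math.h>

static const float C4 = 0.0f;  // x^4 coefficient (placeholder)
static const float C3 = 0.0f;  // x^3 coefficient (placeholder)
static const float C2 = 0.0f;  // x^2 coefficient (placeholder)
static const float C1 = 0.0f;  // x^1 coefficient (placeholder)
static const float C0 = 0.0f;  // constant term   (placeholder)

static inline float pow_2_4_remez4(float x)
{
  float r = fmaf(C4, x, C3);   // C4*x + C3
  r = fmaf(r, x, C2);          // (C4*x + C3)*x + C2
  r = fmaf(r, x, C1);
  r = fmaf(r, x, C0);
  return r;
}
```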