Just about a year ago I tossed out a standard normal distribution approximation in a tweet just for the fun factor. I ran across it recently and noticed I could make a couple of improvements. Per Vognsen thoughfully broke down the “how it works” in a pair of tweets:

## Baseline version

Before jumping into the improvements let’s take a look at an alternate approximation for comparison purposes. The super old skool method of approximating a normal distriubtion is to simply sum up a bunch of uniform distributions (so also central limit theorem based). Since our population count version draws two 64-bit uniform numbers let’s stick with that to keep the uniform draw cost the same.

As for the scaling value $K$ we need to include the factor to take our fixed-point integer to floating point which is $2^{-32}$. We can look up the variance for the sum of $n$ uniform values which is $\frac{n}{12}$ and the standard deviation is the square root of the variance. Since the standard normal distribution has a standard deviation of one we need an additional factor of the reciprocal of our sum’s standard deviation: $\sqrt{3} \approx 1.732051$. The value of $K$ in the code is different from this because I ran an optimization to minimize the maximum error (aka minimax for absolute error).

## Improved popcount version

The modified population count version:

There are three changes (marked as side comments):

1. The triangle distribution uses a subtraction instead of addition. This makes it centered around zero which removes the final subtraction (-0.25f) in the original version. This reduces the dependency chain length by one.
2. The original was (foolishly) designed to have no rounding when converting the integer to floating point. That’s a non feature here so the left shift has been adjusted up and the right shift is eliminated.
3. In addition to nuking the subtraction (as per item 1) the scaling constant has been minimax optimized like the summation method above.

So the cost of the transform is now ~9 primitive ops (dropping two) on 64-bit hardware with native popcount.

## 32-bit popcount version

In the case of hardware that either has 32-bit popcount (but not 64-bit) or the population count must be performed in software it seems worthwhile to toss out a variant for those cases. Adjust the centering of the triangle distribution and perform a search for a new scaling constant…done:

A sadface about the previous version is we’re tossing away 32 perfectly good random bits. It’ll cost us a couple of operations but we can instead use them as another convolution step:

Won Chun tweeted an interesting alternate. Instead of using the spare 32-bits as an additional uniform instead create a second binomial distribution.

EDIT: Saying a second binomial distribution is rather awkward on my part since summing any $n$ bits is a single binomial distribution.

Let me first state that the summation method is probably sufficient for most “horseshoes and handgrenades” usages. The Encyclopedia of Mathematics goes so far to say: “the approximation for $n=3$ is already satisfactory for many practical purposes”.

The most commonly used method to generate a normal distribution is to use the Box-Muller transform which generates two uniform random numbers and transforms with a logarithm, square root and a sin/cos pair to produce two results.

Although the scaling constants have been optimized they are not optimal. My search/testing method was a fast hack. They are, however, nearly optimal (famous last words) because the peak errors of each swing nearly to the same peak values.

The plots of the produced distributions (The plots are interactive. Notiably click on name in legend to toggle being displayed):

and the absolute error plots:

method ops range error
box-muller $\pm5.768108 / \pm8.571674$
sum 8 $\pm3.4172022$ $8.898866 \times 10^{-3}$
pop 9 $\pm8.1768640$ $9.249441 \times 10^{-4}$
pop32wc 12 $\pm6.455824$ $1.022137\times 10^{-3}$
pop32x 12 $\pm4.611270$ $1.391753 \times 10^{-3}$
pop32 10 $\pm6.079518$ $2.213490 \times 10^{-3}$

• ops is the number of CPU operations ops needed for the transform on x64 (for spitballing only)
• range is the range of the output values. The Box-Muller numbers are for single/double precision respectively.
• error is peak absolute error