March 18th, 2021

Just about a year ago I tossed out a standard normal distribution approximation in a tweet just for the fun factor. I ran across it recently and noticed I could make a couple of improvements.

Per Vognsen thoughtfully broke down the "how it works" in a pair of tweets:

> BTW, an intuitive way to think about this is to remember that addition of random variables corresponds to convolution of their probability densities. So adding the two uniform variates is equivalent to box filtering the stair-stepped distribution twice.
>
> — Per Vognsen, March 20, 2020

Baseline version

Before jumping into the improvements let's take a look at an alternate approximation for comparison purposes. The super old skool method of approximating a normal distribution is to simply sum up a bunch of uniform distributions (so also central limit theorem based). Since our population count version draws two 64-bit uniform numbers, let's stick with that to keep the uniform draw cost the same.

```c
float dist_normal_approx_sum()
{
  // Generate two 64-bit uniform random integers. These could
  // be passed in as parameters and/or the two values could
  // be drawn from independent generators to break the serial
  // dependency.
  uint64_t u0 = rng_u64();
  uint64_t u1 = rng_u64();

  // partition both into two 32-bit uniform integers each
  int64_t a = (int64_t)(u0 & 0xffffffff);
  int64_t b = (int64_t)(u0 >> 32);
  int64_t c = (int64_t)(u1 & 0xffffffff);
  int64_t d = (int64_t)(u1 >> 32);

  // magic constant (not really)
  const float K = 0x1.b566e2p-32f;  // ~1.70860*2^-32

  // perform the sum and scale. The pairings can be in any
  // order if that helps with throughput. Flipping the two
  // adds to subs and the one sub to add is equivalent.
  return K*((a+b)-(c+d));
}
```

As for the scaling value $K$: we need to include the factor that takes our fixed-point integer to floating point, which is $2^{-32}$. The variance of the sum of $n$ uniform values is $\frac{n}{12}$, and the standard deviation is the square root of the variance. Since the standard normal distribution has a standard deviation of one, we need an additional factor of the reciprocal of our sum's standard deviation: $\sqrt{3} \approx 1.732051$ for our $n=4$. The value of $K$ in the code is different from this because I ran an optimization to minimize the maximum error (aka minimax for absolute error).
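Spelling that derivation out (this is just the arithmetic behind the prose above, before the minimax step):

$$ \sigma = \sqrt{\frac{4}{12}} = \frac{1}{\sqrt{3}}, \qquad K_{\text{naive}} = \frac{2^{-32}}{\sigma} = \sqrt{3} \cdot 2^{-32} \approx 1.732051 \times 2^{-32} $$

and the minimax search then nudges the $\sqrt{3}$ down to the $\approx 1.70860$ seen in the code.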
Improved popcount version

The modified population count version:

```c
float dist_normal_approx()
{
  // Generate two 64-bit uniform random integers. These could
  // be passed in as parameters and/or the two values could
  // be drawn from independent generators. That would break
  // the serial dependency between u0 and u1.
  uint64_t u0 = rng_u64();

  // Compute the population count (aka hamming weight) of u0.
  // This gives us a binomial distribution (p=1/2, n=64). The
  // subtraction centers the distribution on [-32,32].
  int64_t bd = (int64_t)__builtin_popcountll(u0)-32;

  // draw & partition u1 into two 32-bit integers
  uint64_t u1 = rng_u64();
  int64_t a = (int64_t)(u1 & 0xffffffff);
  int64_t b = (int64_t)(u1 >> 32);

  // triangle distribution
  int64_t td = a-b;                   // <-- changed add to sub

  // sum the binomial and triangle distributions
  float r = (float)((bd<<32) + td);   // <-- nuked a shift here

  // scale the result
  return r * 0x1.fb760cp-35f;         // <-- nuked a constant add here & magic constant!
}
```

There are three changes (marked as side comments):

1. The triangle distribution uses a subtraction instead of an addition. This makes it centered around zero, which removes the final subtraction (-0.25f) in the original version and reduces the dependency chain length by one.
2. The original was (foolishly) designed to have no rounding when converting the integer to floating point. That's a non-feature here, so the left shift has been adjusted up and the right shift is eliminated.
3. In addition to nuking the subtraction (as per item 1), the scaling constant has been minimax optimized like the summation method above.

So the cost of the transform is now ~9 primitive ops (dropping two) on 64-bit hardware with native popcount.

32-bit popcount version

For hardware that has a 32-bit popcount (but not a 64-bit one), or where the population count must be performed in software, it seems worthwhile to toss out a variant. Adjust the centering of the triangle distribution and perform a search for a new scaling constant…done:

```c
// in graphs: pop32
float dist_normal_approx()
{
  uint64_t u0 = rng_u64();
  int64_t  bd = (int64_t)__builtin_popcount((uint32_t)u0)-16;
  uint64_t u1 = rng_u64();
  int64_t  a  = (int64_t)(u1 & 0xffffffff);
  int64_t  b  = (int64_t)(u1 >> 32);
  int64_t  td = a-b;
  float    r  = (float)((bd<<31) + td);

  return r * 0x1.59db68p-33f;
}
```

A sadface about the previous version is that we're tossing away 32 perfectly good random bits. It'll cost us a couple of operations, but we can instead use them as another convolution step:

```c
// in graphs: pop32x
float dist_normal_approx()
{
  uint64_t u0 = rng_u64();
  int64_t  bd = (int64_t)__builtin_popcount((uint32_t)u0)-16;
  uint64_t u1 = rng_u64();
  int64_t  a  = (int64_t)(u1 & 0xffffffff);
  int64_t  b  = (int64_t)(u1 >> 32);

  // fold in the signed high 32 bits of u0 as a third uniform
  int64_t  td = a-b+((int64_t)u0>>32);
  float    r  = (float)((bd<<31) + td);

  return r * 0x1.540aep-33f;
}
```

Won Chun tweeted an interesting alternative: instead of using the spare 32 bits as an additional uniform, create a second binomial distribution.

```c
// in graphs: pop32wc
float dist_normal_approx()
{
  uint64_t u0 = rng_u64();
  int64_t  bd = (int64_t)(__builtin_popcount((uint32_t)u0)-__builtin_popcount((uint32_t)(u0>>32)));
  uint64_t u1 = rng_u64();
  int64_t  a  = (int64_t)(u1 & 0xffffffff);
  int64_t  b  = (int64_t)(u1 >> 32);
  int64_t  td = a-b;
  float    r  = (float)((bd<<30) + td);

  return r * 0x1.d8328ap-33f;
}
```

EDIT: Saying "a second binomial distribution" is rather awkward on my part since summing any $n$ bits is a single binomial distribution.
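None of the snippets above pin down `rng_u64`, so here's a minimal sanity-check sketch that stubs one in with a splitmix64-style mixer (an assumption on my part; any decent 64-bit uniform source works) and estimates the sample mean and variance of the improved version, which should land near 0 and 1 respectively:

```c
#include <stdint.h>
#include <stdio.h>

// stand-in for rng_u64: a splitmix64-style mixer, purely for
// illustration
static uint64_t rng_state = 0x123456789abcdef0;

uint64_t rng_u64(void)
{
  uint64_t z = (rng_state += 0x9e3779b97f4a7c15);
  z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9;
  z = (z ^ (z >> 27)) * 0x94d049bb133111eb;
  return z ^ (z >> 31);
}

// the improved 64-bit popcount version from above
static float dist_normal_approx(void)
{
  uint64_t u0 = rng_u64();
  int64_t  bd = (int64_t)__builtin_popcountll(u0)-32;
  uint64_t u1 = rng_u64();
  int64_t  a  = (int64_t)(u1 & 0xffffffff);
  int64_t  b  = (int64_t)(u1 >> 32);
  int64_t  td = a-b;
  float    r  = (float)((bd<<32) + td);
  return r * 0x1.fb760cp-35f;
}

int main(void)
{
  // accumulate the raw first and second moments
  double  sum = 0.0, sum2 = 0.0;
  int32_t n   = 1<<24;

  for (int32_t i=0; i<n; i++) {
    double x = (double)dist_normal_approx();
    sum  += x;
    sum2 += x*x;
  }

  double mean = sum/n;
  printf("mean=%f variance=%f\n", mean, sum2/n - mean*mean);
  return 0;
}
```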
Comparisons and comments

Let me first state that the summation method is probably sufficient for most "horseshoes and hand grenades" usages. The "Encyclopedia of Mathematics" goes so far as to say: "the approximation for $n=3$ is already satisfactory for many practical purposes".

The most commonly used method of generating a normal distribution is the Box-Muller transform, which takes two uniform random numbers and transforms them with a logarithm, a square root and a sin/cos pair to produce two results (a minimal sketch follows the table below).

Although the scaling constants have been optimized they are not optimal; my search/testing method was a fast hack. They are, however, nearly optimal (famous last words) because the error swings between peaks of nearly equal magnitude.

The plots of the produced distributions (the plots are interactive; notably, click on a name in the legend to toggle whether it is displayed):

[interactive plot: produced distributions]

and the absolute error plots:

[interactive plot: absolute error]

| method     | ops | range                         | error                     |
|:-----------|:---:|:-----------------------------:|:-------------------------:|
| box-muller | –   | $\pm5.768108$ / $\pm8.571674$ | –                         |
| sum        | 8   | $\pm3.4172022$                | $8.898866 \times 10^{-3}$ |
| pop        | 9   | $\pm8.1768640$                | $9.249441 \times 10^{-4}$ |
| pop32wc    | 12  | $\pm6.455824$                 | $1.022137 \times 10^{-3}$ |
| pop32x     | 12  | $\pm4.611270$                 | $1.391753 \times 10^{-3}$ |
| pop32      | 10  | $\pm6.079518$                 | $2.213490 \times 10^{-3}$ |

- ops is the number of CPU operations needed for the transform on x64 (for spitballing only)
- range is the range of the output values. The Box-Muller numbers are for single/double precision respectively.
- error is the peak absolute error
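For reference, Box-Muller looks something like this minimal single-precision sketch. It reuses the `rng_u64` stub from the harness above, and the uniform-to-float mapping (24 explicit bits, offset so the log input is never zero) is one choice among several:

```c
#include <stdint.h>
#include <math.h>

uint64_t rng_u64(void);  // 64-bit uniform generator, e.g. the stub above

// map the high 24 bits of a 64-bit draw to a float in (0,1].
// The +1 keeps the input to logf strictly positive.
static float rng_f32(void)
{
  return (float)((rng_u64() >> 40)+1) * 0x1.0p-24f;
}

// basic single-precision Box-Muller: two uniforms in, two
// normals out, at the cost of a log, a sqrt and a sin/cos pair
static void dist_normal_bm(float* n0, float* n1)
{
  float r = sqrtf(-2.f*logf(rng_f32()));
  float t = 0x1.921fb6p+2f * rng_f32();  // 2*pi
  *n0 = r*cosf(t);
  *n1 = r*sinf(t);
}
```

Assuming uniforms generated this way, the single-precision range in the table falls out directly: the smallest input to the log is $2^{-24}$, giving $\sqrt{-2\ln\left(2^{-24}\right)} = \sqrt{48 \ln 2} \approx 5.768108$.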