I wanted to mention computing the multiplicative inverse via Newton’s iteration in a post and to my suprise I couldn’t find a simple overview which is publically available. (We interrupt this post in final edit with just such a thing from Daniel Lemire1). For a formal presentation this 2012 paper2 by Jean-Guillaume Dumas is pretty concise and easy to understand. Note I’ll use mod as an operator instead of for congruence and will only talk about power-of-two modulus (linked paper covers generalization). First let’s define the function. Given an odd integer $a$ we want to find integer $x$ such that:
For reals and floating point we can compute the inverse of $a$ given a good inital guess $x_0$ and applying the following recurrence (Newton’s method3) a sufficient number of times:
Very roughly speaking Hensel’s (lifting) lemma4 extends Newton’s method to modular arithmetic where (for our case) if we have the value $x_n$ such that:
then after applying the recurrence $\eqref{rr}$ we’d have:
Which each application the number of bits double. We need to compute an initial value $x_0$ that’s correct to some $k$ bits. It happens that $a$ is its own inverse up to three bits $\left(k=3\right)$ so one option is to simply set $x_0 = a$. Starting with 3-bits and repeated application of the recurrence would give: $\left(6,12,24,48,96,\ldots\right)$. So assuming working in 32-bit register with any odd 32-bit input we’d need four iterations:
static inline uint32_t mod_inverse_1(uint32_t a)
{
uint32_t x,t;
x = a; // 3 bits
x *= 2-a*x; // 6
x *= 2-a*x; // 12
x *= 2-a*x; // 24
x *= 2-a*x; // 48 / retaining bottom 32
return x;
}
Notice if we instead had 4 bits for our initial value we’d have after each successive step: $\left(8,16,32,64,128,\ldots\right)$ bits so we’d need only three iterations. So if we can get an extra bit for cheaper than the cost of one step then we’re ahead. There are two quadradic solutions5 :
Using the first one we now have:
static inline uint32_t mod_inverse_2(uint32_t a)
{
uint32_t x;
x = (a*a)+a-1; // 4 bits (For serial comment below: a*a & a-1 are independent)
x *= 2-a*x; // 8
x *= 2-a*x; // 16
x *= 2-a*x; // 32
return x;
}
Another paper6 mentions a method for initial value good to 5 bits:
uint32_t mod_inverse_3(uint32_t a)
{
uint32_t x;
x = 3*a ^ 2; // 5 bits
x *= 2-a*x; // 10
x *= 2-a*x; // 20
x *= 2-a*x; // 40 -- 32 low bits
return x;
}
Dumas carries through with the derivation (SEE: Section 3.3) to produce algorithm 3, which reworked looks like:
static inline uint32_t mod_inverse_4(uint32_t a)
{
uint32_t u = 2-a;
uint32_t i = a-1;
i *= i; u *= i+1;
i *= i; u *= i+1;
i *= i; u *= i+1;
i *= i; u *= i+1;
return u;
}
The first three variants are almost identical in structure so it looks7 like an architectural toss up on performance. The last I find quite interesting. The other two require every operation to be performed in serial order…not so for the third. Renamed and with ops that can happen at the same time on the same line.
1: u = 2-a i0 = a-1
2: i1 = i0*i0
3: i2 = i1*i1 t1 = i1+1
4: i3 = i2*i2 t2 = i2+1 u *= t1
5: i4 = i3*i3 t3 = i3+1 u *= t2
6: t4 = i4+1 u *= t3
7: u *= t4
(done)
The Dumas version cannot drop any of the iteration steps, but all of the other can. This is potentially interesting if there’s a known bit-bound on input values. We have choices between 3, 4 and 5 inital bits and must perform at least one step.
References and Footnotes
-
“Computing the inverse of odd integers”, Daniel Lemire 2017 (page) ↩
-
“On Newton-Raphson iteration for multiplicative inverses modulo prime powers”, Jean-Guillaume Dumas 2012 (PDF) ↩
-
I just brute forced this in Mathematica. ↩
-
“Efficient long division via Montgomery multiply”, Ernst W. Mayer, 2016 (PDF) ↩
-
AKA I’ve spent zero time really thinking about performance here. ↩