I wanted to mention computing the multiplicative inverse via Newton’s iteration in a post and to my surprise I couldn’t find a simple overview which is publicly available. (We interrupt this post in final edit with just such a thing from Daniel Lemire^{1}). For a formal presentation this 2012 paper^{2} by Jean-Guillaume Dumas is pretty concise and easy to understand. Note I’ll use mod as an operator instead of for congruence and will only talk about power-of-two moduli (the linked paper covers the generalization). First let’s define the function. Given an odd integer $a$ we want to find an integer $x$ such that:

$$ a\,x \bmod 2^w = 1 $$

where $w$ is the bit width we’re working in.
For reals and floating point we can compute the inverse of $a$ given a good initial guess $x_0$ by applying the following recurrence (Newton’s method^{3}) a sufficient number of times:

$$ x_{n+1} = x_n\left(2 - a\,x_n\right) \label{rr}\tag{1} $$
Very roughly speaking Hensel’s (lifting) lemma^{4} extends Newton’s method to modular arithmetic where (for our case) if we have a value $x_n$ such that:

$$ a\,x_n \bmod 2^k = 1 $$
then after applying the recurrence $\eqref{rr}$ we’d have:

$$ a\,x_{n+1} \bmod 2^{2k} = 1 $$
With each application the number of correct bits doubles. We need to compute an initial value $x_0$ that’s correct to some $k$ bits. It happens that $a$ is its own inverse up to three bits $\left(k=3\right)$ — every odd $a$ satisfies $a^2 \bmod 8 = 1$ — so one option is to simply set $x_0 = a$. Starting with 3 bits, repeated application of the recurrence would give: $\left(6,12,24,48,96,\ldots\right)$ bits. So assuming we’re working in a 32-bit register with any odd 32-bit input we’d need four iterations:
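The listing itself didn’t survive in this copy; a minimal sketch of the classic iteration (the name `mod_inverse_3` is mine, after the 3-bit start value):

```c
#include <stdint.h>

// classic Newton/Hensel iteration with x0 = a, which is good to 3 bits
// (a*a mod 8 = 1 for any odd a). Four steps: 3 -> 6 -> 12 -> 24 -> 48 bits.
uint32_t mod_inverse_3(uint32_t a)
{
  uint32_t x = a;           // 3 bits
  x = x*(2 - a*x);          // 6 bits
  x = x*(2 - a*x);          // 12 bits
  x = x*(2 - a*x);          // 24 bits
  x = x*(2 - a*x);          // 48 >= 32 bits
  return x;
}
```

All arithmetic is unsigned 32-bit, so the reductions mod $2^{32}$ happen for free.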
Notice that if we instead had 4 bits for our initial value we’d have after each successive step: $\left(8,16,32,64,128,\ldots\right)$ bits, so we’d need only three iterations. So if we can get an extra bit for cheaper than the cost of one step then we’re ahead. There are two quadratic solutions^{5}:
Using the first one we now have:
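The two quadratics themselves were dropped from this copy. One quadratic that checks out to 4 bits for every odd $a$ is $x_0 = a\left(a+1\right) - 1$; that it’s the “first one” meant here is my assumption. A sketch under that assumption:

```c
#include <stdint.h>

// assumes x0 = a*(a+1)-1, which is correct to 4 bits for odd a
// (my guess at the post's "first" quadratic), so only three
// doubling steps are needed: 4 -> 8 -> 16 -> 32 bits
uint32_t mod_inverse_4(uint32_t a)
{
  uint32_t x = a*(a+1) - 1; // 4 bits
  x = x*(2 - a*x);          // 8 bits
  x = x*(2 - a*x);          // 16 bits
  x = x*(2 - a*x);          // 32 bits
  return x;
}
```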
Another paper^{6} mentions a method for initial value good to 5 bits:
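The expression was also lost in transit; the start value commonly credited to Montgomery is $x_0 = \left(3a\right) \oplus 2$, correct to 5 bits (straightforward to verify over the 16 odd residues mod 32), giving $\left(10,20,40,\ldots\right)$ bits per step. A sketch, named to match the `mod_inverse_5` used in the performance table below:

```c
#include <stdint.h>

// x0 = (3*a) ^ 2 is correct to 5 bits for any odd a,
// so three steps give: 5 -> 10 -> 20 -> 40 >= 32 bits
uint32_t mod_inverse_5(uint32_t a)
{
  uint32_t x = (a*3) ^ 2;   // 5 bits
  x = x*(2 - a*x);          // 10 bits
  x = x*(2 - a*x);          // 20 bits
  x = x*(2 - a*x);          // 40 bits
  return x;
}
```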
Dumas carries through with the derivation (SEE: Section 3.3) to produce algorithm 3, which reworked looks like:
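The reworked listing is missing from this copy; reconstructed from the dependency listing that follows (same `u`/`i` naming), it should look like:

```c
#include <stdint.h>

// Dumas: with b = a-1 (even), the product
//   (1-b)(1+b^2)(1+b^4)(1+b^8)(1+b^16)
// telescopes to (1 - b^32)/(1+b), and b^32 = 0 mod 2^32,
// so the product is 1/a mod 2^32. Note 1-b = 2-a.
uint32_t mod_inverse_d(uint32_t a)
{
  uint32_t u = 2 - a;       // 1 - b
  uint32_t i = a - 1;       // b
  i *= i; u *= i + 1;       // * (1 + b^2)
  i *= i; u *= i + 1;       // * (1 + b^4)
  i *= i; u *= i + 1;       // * (1 + b^8)
  i *= i; u *= i + 1;       // * (1 + b^16)
  return u;
}
```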
The first three variants are almost identical in structure so it looks^{7} like an architectural toss-up on performance. The last one I find quite interesting: the first three require every operation to be performed in serial order…not so for Dumas. Renamed, and with ops that can happen at the same time placed on the same line:
```
1: u = 2-a    i0 = a-1
2:            i1 = i0*i0
3:            i2 = i1*i1   t1 = i1+1
4:            i3 = i2*i2   t2 = i2+1   u *= t1
5:            i4 = i3*i3   t3 = i3+1   u *= t2
6:                         t4 = i4+1   u *= t3
7:                                     u *= t4
(done)
```
The Dumas version cannot drop any of the iteration steps, but the previous methods can. This is potentially interesting if there’s a known bit-bound on input values. We have choices between 3, 4 and 5 initial bits and must perform at least one step.
Jeffrey Hurchalla’s method (ADDED: 2022-04-07)
Let’s take stock of where we are. The classic version has the downside of being a serial computation but we can choose the method of initial value, which in turn controls how many iterations we need to perform. As far as I know the best (in terms of cost) “computed” initial value is the 5-bit one of Montgomery. For Dumas we lose the initial value feature and in trade get some instruction level parallelism.
In April of 2022 Jeffrey Hurchalla released a paper^{8} which gives us both: a chosen initial value and broken dependency chains (see paper for details). Here’s an example implementation for 32-bit integers using the 5-bit initial value:
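The implementation didn’t survive extraction; a sketch along the lines of the paper (named `mod_inverse_h` to match the table below). The trick is that the error term $y = 1 - a\,x_0$ can be squared independently of the running product, breaking the all-serial chain of the classic version:

```c
#include <stdint.h>

// Hurchalla-style: with y = 1 - a*x0 we have a*x0 = 1-y, and
// x0*(1+y)(1+y^2)(1+y^4) gives a*x = 1 - y^8. Since y = 0 mod 2^5,
// y^8 = 0 mod 2^40, i.e. exact in 32 bits. The y-squaring chain and
// the product updates can execute in parallel.
uint32_t mod_inverse_h(uint32_t a)
{
  uint32_t x = (a*3) ^ 2;   // 5-bit initial value
  uint32_t y = 1 - a*x;     // error term, divisible by 2^5
  x = x*(1 + y); y *= y;    // 10 bits; y now divisible by 2^10
  x = x*(1 + y); y *= y;    // 20 bits; y now divisible by 2^20
  x = x*(1 + y);            // 40 >= 32 bits
  return x;
}
```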
The original version of this post didn’t talk about performance at all. All of these are going to be very close for a general purpose implementation and the “best” is going to depend on the exact microarchitecture and surrounding code. As a rule of thumb the Hurchalla variant is the most promising.
Spitballing performance (ADDED: 2022-04-07)
All of these functions are small and fast in an absolute sense. So the only reason to look closely at various alternate choices is the case of computing tons of them when there is very little other work happening at the same time. So it’s probably a waste of time to read any further.
But if you’re still here, let’s look at why I’m suggesting Hurchalla as the general purpose choice. I’m using the Intel family of processors for the following:
In terms of high-level opcodes they are all close:
|           | mod_inverse_5 | mod_inverse_h | mod_inverse_d | notes |
|-----------|:-------------:|:-------------:|:-------------:|-------|
| total ops | 17 | 14 | 15 | micro-op count is the same |
| LEA       | 1  | 2  | 3  | latency of 1, ports {1,5} |
| IMUL      | 6  | 6  | 8  | latency of 3, port {1} |
| other     | 9  | 6  | 4  | latency of 1, ports {0,1,5,6} |
OK. I really should have used 64-bit (each needs one extra round) here but I’m too lazy to go back and modify. But anyway Hurchalla has the fewest number of uOps to issue. Compared to Dumas (only one more uOp) we’re eliminating 2 multiplies (latency 3, tied to port 1) and a LEA. Few of Dumas’ ops can go through ports {0,1,5,6}, adding to the probability of competition with surrounding code. Compared with the “classic” serial method: the classic needs to execute 3 more uOps, has one fewer LEA than Hurchalla and 3 more of its uOps can go through any of {0,1,5,6} (the same number as the extra uOps to be issued). Bottom line is that any of these could end up being “fastest” at some specific site but Hurchalla on average will win out.
Now back to the hypothetical problem of computing many inverses with little other work happening at the same time. In that case we can very likely afford to use a small lookup table. With a 256-byte table we can drop one round, giving:
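The table-based listing is also missing here; a sketch of the idea under my own naming (`inv8` and `mod_inverse_t` are hypothetical). With an 8-bit initial value the rounds go 8 → 16 → 32:

```c
#include <stdint.h>

// 256-byte table: inv8[a] = inverse of a mod 2^8 for odd a
// (even slots unused, traded for a shift-free index)
static uint8_t inv8[256];

static void inv8_init(void)
{
  for (uint32_t a = 1; a < 256; a += 2) {
    uint32_t x = a;               // 3 bits
    x = x*(2 - a*x);              // 6 bits
    x = x*(2 - a*x);              // 12 >= 8 bits
    inv8[a] = (uint8_t)x;
  }
}

// one fewer round than the 5-bit computed start: 8 -> 16 -> 32 bits
uint32_t mod_inverse_t(uint32_t a)
{
  uint32_t x = inv8[a & 0xff];    // 8 bits (data dependent load)
  x = x*(2 - a*x);                // 16 bits
  x = x*(2 - a*x);                // 32 bits
  return x;
}
```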
No clear winner here between the two. We drop one round for a data dependent load (5 cycle latency & issued to port 2 or 3). Either should outperform the computed initial value variants.
References and Footnotes

“Computing the inverse of odd integers”, Daniel Lemire, 2017 (page) ↩

“On Newton–Raphson iteration for multiplicative inverses modulo prime powers”, Jean-Guillaume Dumas, 2012 (PDF) ↩

I just brute forced this in Mathematica. ↩

“Efficient long division via Montgomery multiply”, Ernst W. Mayer, 2016 (PDF) ↩

AKA I’ve spent zero time really thinking about performance here. ↩

“Speeding up the Integer Multiplicative Inverse (modulo 2^{w})”, Jeffrey Hurchalla, 2022 (PDF) ↩