The problem is to perform a perfect shuffle when $n \le 64$ and we’re going to attack it with the standard method: The Fisher-Yates shuffle. This is sufficient to shuffle a standard deck of card or to generate a $64\times64$ permutation matrix.

The important assumption is that the target hardware has a bit scatter opcode (and not microcoded: AMD prior to Zen3 I’m looking at you) and is exposed in the target language (PDEP in Intel land). Let’s jump straight in:

The first step of Fisher-Yates is to initialize an array to: $0$ through $n-1$. Instead of an explicit array we’re initializing a bit-set (cards) to represent our (logically) unshuffled deck. We’re storing $n$ in pop to mentally note that this is redundant data which is maintained to be the population count of cards.

For the rest we’re ignoring modern variants and are running with the original pencil-and-paper method (except for the counting from zero instead of one part):

So the first two helper functions are just intrinsic wrappers for the scatter op (bist_scatter_64) and trailing zero count (ctz_64). The next function (bit_clear_nth_set_64) is lynchpin of the method that performs the strike the nth card operation. The m constant is just all ones, o is just one and t is all ones except the nth bit is clear. The result of the function is to apply the bit-scatter op. Notice the ordering of the parameters: the input x is being used as the selector and the t is value that is being scattered. This transforms our nth bit clear t into clearing the nth set bit of x.

Finally foo_next just snaps all the pieces together:

• i uniformly selects a remain card in our unshuffled deck
• c is the updated cards (selected one striked)
• r isolates which card was striken
• update the data structure with new cards and decrease the popcount
• return the number of the drawn card which is the trailing zero count of its isolated position

Gluing the initialization code and looping over the next function to fill an array gives a standardish kind of thing:

And that’s it! “Obviously” this can be extended to supporting larger $n$. As examples for some $64 \cdot m$ max value with $m$ very small then a linear search to find the correct 64-bit window of the bitset to modifiy is pretty straight forward and one could push to $64 \cdot 2^p$ which can be performed in $p$ steps to find the window and update the hierarchical popcounts. Generally using a standard method instead would probably be the better choice however YMMV.

Closing note: For the presented toy code to be “correct” requires that rng_range returns a uniform result. There are recent solutions which are quite performant like the one by Daniel Lemire which is talked about in this blog post (paper linked at end).

On review the example code is rather ugly so here’s generating a binary $64\times64$ permutation matrix. The real version doesn’t use a global PRNG…but you get the idea. Otherwise this is the example code minus the zero-counting which isn’t needed here (row result is the choosen bit position).