I'm implementing RSA 1024 in hardware (xilinx ZYNQ FPGA), and am unable to figure out a few curious issues. Most notably, I am finding that my implementation only works for certain base/exponent/modulus combinations, but have not found any reason why this is the case.

*Note: I am implementing the algorithm using Xilinx HLS (essentially C code that is synthesized into hardware). For the sake of this post, treat it just like a standard C implementation, except that I can have variables up to 4096 bits wide. I haven't yet parallelized it, so it should behave just like standard C code.*

## The Problem

**My problem is that I am able to get the correct answer for certain modular exponentiation test problems, but only if the values for the base, exponent, and modulus can be written in much fewer bits than the actual 1024 bit operand width (i.e. they are zero padded).**

When I use actual 1024-bit values generated from SSH-keygen, I no longer get the correct results.

For example, if my input arguments are

```
uint1024_t base = 1570
uint1024_t exponent = 1019
uint1024_t modulus = 3337
```

I correctly get a result of **1570^1029 mod(3337) = 688**

However, when I actually use values that occupy all (or approximately all) 1024 bits for the inputs...

```
uint1024_t base = 0x00be5416af9696937b7234421f7256f78dba8001c80a5fdecdb4ed761f2b7f955946ec920399f23ce9627f66286239d3f20e7a46df185946c6c8482e227b9ce172dd518202381706ed0f91b53c5436f233dec27e8cb46c4478f0398d2c254021a7c21596b30f77e9886e2fd2a081cadd3faf83c86bfdd6e9daad12559f8d2747
uint1024_t exponent = 0x6f1e6ab386677cdc86a18f24f42073b328847724fbbd293eee9cdec29ac4dfe953a4256d7e6b9abee426db3b4ddc367a9fcf68ff168a7000d3a7fa8b9d9064ef4f271865045925660fab620fad0aeb58f946e33bdff6968f4c29ac62bd08cf53cb8be2116f2c339465a64fd02517f2bafca72c9f3ca5bbf96b24c1345eb936d1
uint1024_t modulus = 0xb4d92132b03210f62e52129ae31ef25e03c2dd734a7235efd36bad80c28885f3a9ee1ab626c30072bb3fd9906bf89a259ffd9d5fd75f87a30d75178b9579b257b5dca13ca7546866ad9f2db0072d59335fb128b7295412dd5c43df2c4f2d2f9c1d59d2bb444e6dac1d9cef27190a97aae7030c5c004c5aea3cf99afe89b86d6d
```

**I incorrectly get a massive number, rather than the correct answer of 29 (0x1D)**

I've checked both algorithms a million times over, and have experimented with different initial values and loop bounds, but nothing seems to work.

## My Implementation

I am using the standard square and multiply method for the modular exponentiation, and I chose to use the Tenca-Koc radix-2 algorithm for the montgomery multiplication, detailed in pseudocode below...

```
/* Tenca-Koc radix2 montgomery multiplication */
Z = 0
for i = 0 to n-1
Z = Z + X[i]*Y
if Z is odd then Z = Z + M
Z = Z/2 // left shift in radix2
if (S >= M) then S = S - M
```

My Montgomery multiplication implementation is as follows:

```
void montMult(uint1024_t X, uint1024_t Y, uint1024_t M, uint1024_t* outData)
{
ap_uint<2*NUM_BITS> S = 0;
for (int i=0; i<NUM_BITS; i++)
{
// add product of X.get_bit(i) and Y to partial sum
S += X[i]*Y;
// if S is even, add modulus to partial sum
if (S.test(0))
S += M;
// rightshift 1 bit (divide by 2)
S = S >> 1;
}
// bring back to under 1024 bits by subtracting modulus
if (S >= M)
S -= M;
// write output data
*outData = S.range(NUM_BITS-1,0);
```

}

and my top-level modular exponentiation is as follows, where (switching notation!) ...

```
// k: number of bits
// r = 2^k (radix)
// M: base
// e: exponent
// n: modulus
// Mbar: (precomputed residue) M*r mod(n)
// xbar: (precomputed initial residue) 1*r mod(n)
void ModExp(uint1024_t M, uint1024_t e, uint1024_t n,
uint1024_t Mbar, uint1024_t xbar, uint1024_t* out)
{
for (int i=NUM_BITS-1; i>=0; i--)
{
// square
montMult(xbar,xbar,n,&xbar);
// multiply
if (e.test(i)) // if (e.bit(i) == 1)
montMult(Mbar,xbar,n,&xbar);
}
// undo montgomery residue transformation
montMult(xbar,1,n,out);
}
```

I can't for the life of me figure out why this works for everything except an actual 1024 bit value. Any help would be much appreciated

6more comments