
A Fast Method for Generating Primes

Revision 42
© 2013-2024 by Zack Smith. All rights reserved.

Introduction

After considering the problem of how to most efficiently generate prime numbers in assembly language, I hit upon an idea for optimizing the process on standard 32- and 64-bit Intel and AMD CPUs. I've since extended this to 32- and 64-bit ARM and 64-bit RISC-V CPUs.

Sieve of Smith (my sieve)

My sieve takes some inspiration from the Sieve of Eratosthenes, but is more memory-efficient.

Instead of eliminating every single multiple of every prime using a bit array, as Eratosthenes does, I keep counters for multiples of the lower primes in the registers.

As the program proceeds from lower into higher candidate numbers, the counters permit skipping over multiples of primes without maintaining a bit array.

For instance, my counter3 cycles from 0 to 1 to 2, and when it hits 3 I set it back to 0. Every time it hits 3, we know that the current candidate number is a multiple of 3 and therefore not prime.

Similarly, my counter7 cycles from 0 to 6, and when it hits 7 it too is set back to 0; hitting 7 indicates the candidate is a multiple of 7.

Using these counters lets me avoid checking about 87% of all candidate numbers when the counters are located in the CPU's main registers.

The exact percentage varies a little depending on the implementation and on how many CPU registers are free to be used for counters; for example, there are more free registers on x86_64 than on i386.
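
To put a rough number on it (this arithmetic is mine, for illustration): with the evenness test plus counters for the primes 3 through 23, the fraction of candidates eliminated is 1 - (1/2)(2/3)(4/5)(6/7)(10/11)(12/13)(16/17)(18/19)(22/23) ≈ 83.6%. Each additional counter for the next prime p removes a further 1/p of the survivors, which is how the figure approaches 87% when more registers are available.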

The end result of this approach is that, by eliminating multiples of the lowest primes, the vast majority of numbers are removed from consideration and never need to be checked for primeness.

The primeness check is expensive, since it requires repeated unsigned integer divisions.

Checking a number for primeness entails dividing known primes into it in an attempt to factor it. One's goal should therefore always be to avoid doing those divisions. Using counters lets me do that.

A further optimization is to divide only by primes up to 1+sqrt(number), because if a larger factor exists, a smaller one must as well, and it would already have been found.

For example:

  • 3 times 7 is 21; sqrt(21) is between 4 and 5, so we only check divisors up to 5 and find that 21 is divisible by 3, hence not prime.

My algorithm is also efficient because the critical data used to avoid checking 87% of the candidate numbers is held in the fastest storage available: the processor's registers.

The remaining necessary work consists mainly of divisions performed to determine the primeness of a candidate number: each known prime in the growing array of primes is divided into the candidate until either a zero remainder is found or the primes up to 1+sqrt(candidate) are exhausted.

As program execution continues, that array of primes is always increasing in size because it is, after all, the output array -- the set of discovered primes.
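
For illustration, here is a minimal C sketch of that primality check by trial division. The function name and details are my own, not taken from the assembly sources, but the structure matches the description above.

  // Minimal sketch of the trial-division primality check described above.
  // Each known prime up to 1+sqrt(n) is divided into the candidate;
  // a zero remainder means the candidate is composite.
  #include <stdint.h>
  #include <stdbool.h>

  static bool is_prime (uint64_t n, const uint64_t *primes, uint64_t n_primes)
  {
      if (n < 2)
          return false;
      for (uint64_t i = 0; i < n_primes; i++) {
          uint64_t p = primes[i];
          if (p * p > n)          // past 1+sqrt(n): no factor can remain
              break;
          if (n % p == 0)         // the expensive unsigned division
              return false;
      }
      return true;
  }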

Here is my pseudocode for the Sieve of Smith, checking against the first 9 primes (2 via an evenness test, counters for 3 through 23):

 function generate_primes
 {
  array primes[] = { 2,3,5,7,11,13,17,19,23 }
  array of bytes prime_counters[8] = { 0,0,0,0,0,0,0,0 }   // one counter per prime from 3 to 23
  int n_primes = 9
  int n = 0
  while (always) {
   ++n
    for i = 0 to 7 {
    prime_counters[i]++
   }
   bool skip = false
   if prime_counters[0] >= 3 {
    prime_counters[0] = 0
    skip = true
   }
   if prime_counters[1] >= 5 {
    prime_counters[1] = 0
    skip = true
   }
   if prime_counters[2] >= 7 {
    prime_counters[2] = 0
    skip = true
   }
   if prime_counters[3] >= 11 {
    prime_counters[3] = 0
    skip = true
   }
   if prime_counters[4] >= 13 {
    prime_counters[4] = 0
    skip = true
   }
   if prime_counters[5] >= 17 {
    prime_counters[5] = 0
    skip = true
   }
   if prime_counters[6] >= 19 {
    prime_counters[6] = 0
    skip = true
   }
   if prime_counters[7] >= 23 {
    prime_counters[7] = 0
    skip = true
   }
    if (skip) continue
    if (n is even) then continue
    if (n == 1) continue   // 1 is not prime; the primes up to 23 are already seeded
   if (is_prime (n, primes, n_primes)) {
    primes[n_primes++] = n
    println ("%llu is prime", n)
   }
  }
  return primes
 }

Results

All results are from systems running Linux unless otherwise noted.

100 million primes

CPU frequency | CPU name and RAM type | Total primes generated | Data size | Implementation | Duration | Efficiency*
4.06 GHz | Apple MBP-14 M3, MacOS, LPDDR5-6400 | 100 million primes | 64-bit | aarch64 asm, 28div | 3.43 minutes | 7.18 M
4.06 GHz | Apple MBP-14 M3, MacOS, LPDDR5-6400 | 100 million primes | 64-bit | aarch64 asm, 4div | 3.45 minutes | 7.14 M
3.5 GHz | Apple MBA-15 M2, Linux, LPDDR5-6400 | 100 million primes | 64-bit | aarch64 asm, 4div | 4.00 minutes | 7.1 M
3.2 GHz | Apple MBA-13 M1, MacOS, LPDDR4X-4266 | 100 million primes | 64-bit | aarch64 asm | 4.93 minutes | 6.3 M
3.5 GHz | Apple MBA-15 M2, MacOS, LPDDR5-6400 | 100 million primes | 64-bit | aarch64 asm | 5.13 minutes | 5.6 M
3.5 GHz | Apple MBP-13 M2, Linux, LPDDR5-6400 | 100 million primes | 64-bit | aarch64 asm | 5.17 minutes | 5.5 M
5.1 GHz | Ryzen 7 7840U, Linux, LPDDR5-6400 | 100 million primes | 64-bit | x86_64 asm, SIMD 50div | 7.93 minutes | 2.5 M
5.1 GHz | Ryzen 7 7840U, Linux, LPDDR5-6400 | 100 million primes | 64-bit | x86_64 asm, SIMD | 7.98 minutes | 2.5 M
5.1 GHz | Ryzen 7 7840U, Linux, LPDDR5-6400 | 100 million primes | 64-bit | x86_64 asm, 50div | 8.05 minutes | 2.4 M
5.1 GHz | Ryzen 7 7840U, Linux, LPDDR5-6400 | 100 million primes | 64-bit | x86_64 asm | 8.25 minutes | 2.4 M
4.7 GHz | Ryzen 7 6850U, Linux, LPDDR5-6400 | 100 million primes | 64-bit | x86_64 asm | 9.0 minutes | 2.4 M
4.7 GHz | Intel Core i7 1165G7, Linux, DDR4-3200 | 100 million primes | 64-bit | x86_64 asm | 9.0 minutes | 2.4 M
4.4 GHz | Intel Core i5 1240P, Linux, LPDDR4-4267 | 100 million primes | 64-bit | x86_64 asm | 9.3 minutes | 2.4 M
4.5 GHz | Ryzen 5 6650U, Linux, LPDDR5-6400 | 100 million primes | 64-bit | x86_64 asm | 9.42 minutes | 2.4 M
4.2 GHz | Intel Core i5 1135G7, Linux, LPDDR4X-4266 | 100 million primes | 64-bit | x86_64 asm | 9.8 minutes | 2.4 M
4.2 GHz | Ryzen 5 5650U, Linux, LPDDR4-4267 | 100 million primes | 64-bit | x86_64 asm | 9.83 minutes | 2.4 M
3.1 GHz | Intel Core i5-4278U, Linux, DDR3L-1600 | 100 million primes | 64-bit | x86_64 asm, AVX-50div | 19.8 minutes | 1.6 M
3.1 GHz | Intel Core i5-4278U, Linux, DDR3L-1600 | 100 million primes | 64-bit | i386 asm, 28div | 19.8 minutes | 1.6 M
2.8 GHz | Snapdragon 888, Android, Samsung S21 FE | 100 million primes | 64-bit | aarch64 asm | 22.7 minutes | 1.6 M
2.2 GHz | Snapdragon 750G, Android, Samsung A52 5G | 100 million primes | 64-bit | aarch64 asm | 29.3 minutes | 1.6 M
2.8 GHz | Snapdragon 8 Gen 1, Android, Samsung S23 FE | 100 million primes | 64-bit | aarch64 asm | 29.5 minutes | 1.6 M
2.4 GHz | AWS Xeon E5-2676, Linux | 100 million primes | 64-bit | x86_64 asm | 32.3 minutes | 1.3 M
1.8 GHz | Raspberry Pi 4, Linux, LPDDR4-3200 | 100 million primes | 64-bit | aarch64 asm | 36.1 minutes | 1.5 M
1.8 GHz | Raspberry Pi 4, Linux, LPDDR4-3200 | 100 million primes | 32-bit | aarch32 asm | 39.0 minutes | 1.4 M
2.8 GHz | Intel Celeron circa 2008 | 105 million primes | 32-bit | C | 80 minutes | 0.47 M
1.5 GHz | Pine64 Star64 RISC-V StarFive JH-7110, Linux, LPDDR4-1867 | 100 million primes | 64-bit | riscv64 asm | 126 minutes | 0.53 M

  • MBA means Macbook Air.
  • MBP means Macbook Pro.
  • Linux on Apple Silicon was Fedora Asahi Linux.
  • The Core i5-4278U is an Intel Macbook Pro, tested while running MacOS.
  • 4div (aarch64) performs a sequence of 4 divisions to parallelize modulo operations using speculative execution and minimize loop branch penalties.
  • 28div (aarch64) performs a sequence of 28 divisions to parallelize modulo operations using speculative execution and minimize loop branch penalties.
  • 50div (x86_64) performs a sequence of 50 divisions to parallelize divisions using speculative execution and minimize loop branch penalties.

The last column represents the efficiency of the CPU in generating primes, measured in primes per minute per GHz, with the Apple Silicon chips trumping every other CPU.

Looking at the efficiency, the Apple M2 running the 4div variant of my code matched the Apple M3 (not Pro) also running 4div.

These beat all the recent x86 CPUs running non-parallelized divisions by a factor of roughly 3.

In general, the efficiency of my parallelized-division is_prime routines was substantially better than doing one division per loop.

All recent x86 processors, both Intel and AMD, 2021, 2022 and 2023 models, curiously computed 2.4 million primes per minute per gigahertz with non-parallelized division.

The Raspberry Pi 4 model B was 3X more efficient than the StarFive RISC-V chip.

The Snapdragon 8 Gen 1, Snapdragon 888 and Snapdragon 750G scores were obtained by installing Termux on my Samsung S23 FE, Samsung S21 FE and A52 5G phones. It provides a Linux-like terminal environment, which allows installation of the Clang compiler and numerous other packages.

aarch64 implementation variants

Two basic ways of doing the time-consuming divisions:

  • Unsigned division followed by multiply-subtract (UDIV + MSUB), one pair per loop (see the sketch below).
  • Parallelizing a sequence of UDIV+MSUB pairs using speculative execution, 4 or 28 per loop.
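
For reference, the UDIV + MSUB pair is how the remainder is obtained, since aarch64 has no modulo instruction. A one-line C model of the pair (the helper name is mine):

  #include <stdint.h>

  // r = n - (n / p) * p: UDIV produces the quotient, then MSUB multiplies
  // it back and subtracts, leaving the remainder.
  static inline uint64_t mod_udiv_msub (uint64_t n, uint64_t p)
  {
      uint64_t q = n / p;   // UDIV  q, n, p
      return n - q * p;     // MSUB  r, q, p, n
  }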

  1. 100 million primes on M2 without parallelization takes 5.3 minutes.
  2. 100 million primes on M2 with parallelization takes 4.0 minutes.

The latter was thus 1.3X faster.

x86_64 implementation variants

I've implemented many variants of the x86_64 code in pursuit of better performance.

Varying where the prime counters on x86_64 are located and how they're used:

  • Putting counters in main registers, increasing counters, using compare-branch.
  • Putting counters in main registers, decreasing counters, using compare-branch.
  • Putting counters in main registers, stuffing 8 counters into each register, rotating to access the counters (ROR), increasing counters, compare-branch (see the sketch after this list).
  • Putting counters in AVX registers, maintaining 31 7-bit counters, decreasing counters.
  • Putting counters in main registers, increasing counters, using conditional-move (very slow).
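
As a rough C model of the packed-counter variant above (the helper names and per-byte layout are illustrative, not taken from the assembly source), eight byte-wide counters live in one 64-bit word and are rotated through the low byte:

  #include <stdint.h>
  #include <stdbool.h>

  static inline uint64_t rotr64 (uint64_t x, unsigned k)
  {
      return (x >> k) | (x << (64 - k));
  }

  // Thresholds for the primes 3,5,7,11,13,17,19,23, one per byte slot.
  static const uint8_t thresholds[8] = { 3, 5, 7, 11, 13, 17, 19, 23 };

  // Advance all 8 packed counters by one candidate; return true if the
  // candidate is a multiple of one of the tracked primes (i.e. skip it).
  static bool advance_packed_counters (uint64_t *packed)
  {
      bool skip = false;
      uint64_t v = *packed;
      for (int i = 0; i < 8; i++) {
          uint8_t c = (uint8_t) v + 1;     // increment the low-byte counter
          if (c >= thresholds[i]) {        // counter hit its prime: reset and skip
              c = 0;
              skip = true;
          }
          v = (v & ~0xFFull) | c;          // write the counter back
          v = rotr64 (v, 8);               // bring the next counter to the low byte
      }
      *packed = v;                          // 8 rotations restore the original order
      return skip;
  }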

Two basic ways of doing the time-consuming divisions:

  • Unsigned integer DIV instructions, one per loop.
  • Parallelizing a sequence of unsigned integer DIV instructions using speculative execution, up to 57 per loop.

Parallelizing improves the performance of both the 64:32 DIV and the 128:64 DIV. For instance, I did the experiment of using the 128:64 DIV alone on the Intel Core i5-4278U:

  1. 100 million primes without parallelization takes 60.6 minutes.
  2. 100 million primes with parallelization takes 46.9 minutes.
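
The gist of the parallelized variants can be shown in C as a batch of independent remainders per loop iteration; the batch size of 8 and the function name are mine for illustration, whereas the assembly versions issue up to 50 or 57 DIVs per iteration:

  #include <stdint.h>
  #include <stdbool.h>

  // Trial-divide candidate n by the known primes in batches of 8
  // independent remainders, so the divisions can overlap in flight
  // and the loop branch is taken an eighth as often.
  static bool is_prime_batched (uint64_t n, const uint64_t *primes, uint64_t n_primes)
  {
      uint64_t i = 1;                            // primes[0] == 2; even candidates are filtered earlier
      for (; i + 8 <= n_primes; i += 8) {
          if (primes[i] * primes[i] > n)         // past 1+sqrt(n): stop batching
              break;
          uint64_t r0 = n % primes[i],     r1 = n % primes[i + 1];
          uint64_t r2 = n % primes[i + 2], r3 = n % primes[i + 3];
          uint64_t r4 = n % primes[i + 4], r5 = n % primes[i + 5];
          uint64_t r6 = n % primes[i + 6], r7 = n % primes[i + 7];
          if (!(r0 && r1 && r2 && r3 && r4 && r5 && r6 && r7))
              return false;                      // some prime divides n exactly
      }
      for (; i < n_primes; i++) {                // leftover divisors, one at a time
          if (primes[i] * primes[i] > n)
              break;
          if (n % primes[i] == 0)
              return false;
      }
      return true;
  }

Checking a few divisors past 1+sqrt(n) inside a batch is harmless: every entry in the array is smaller than the candidate, so a zero remainder still means the candidate is composite.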

Two types of division:

Finally, on x86_64, for divisions where the dividend fits in 32 bits I use the 64:32-bit DIV instruction, because it is 3 times as fast as the 128:64-bit DIV.
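
A C-level sketch of that selection (the helper name is mine); compilers typically emit the narrow DIV form for 32-bit operands:

  #include <stdint.h>

  // Use the faster 64:32-bit DIV while the candidate fits in 32 bits;
  // the divisor is a prime <= sqrt(n), so it fits whenever n does.
  static inline uint64_t mod_fast (uint64_t n, uint64_t p)
  {
      if (n <= UINT32_MAX)
          return (uint32_t) n % (uint32_t) p;   // typically assembles to the 64:32 DIV
      return n % p;                              // falls back to the 128:64 DIV (RDX:RAX)
  }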

Results for a Core i5:

CPU frequency | CPU name and RAM type | Total primes generated | Data size | Implementation | Duration | Efficiency*
3.1 GHz | Intel Core i5-4278U, DDR3L-1600 | 100 million primes | 64-bit | x86_64 asm, AVX-50div | 19.8 minutes | 1.63 M
3.1 GHz | Intel Core i5-4278U, DDR3L-1600 | 100 million primes | 64-bit | x86_64 asm, CMP 50div | 20.3 minutes | 1.59 M
3.1 GHz | Intel Core i5-4278U, DDR3L-1600 | 100 million primes | 64-bit | x86_64 asm, ROR2 50div | 20.3 minutes | 1.59 M
3.1 GHz | Intel Core i5-4278U, DDR3L-1600 | 100 million primes | 64-bit | x86_64 asm, CMP 50div | 20.5 minutes | 1.58 M
3.1 GHz | Intel Core i5-4278U, DDR3L-1600 | 100 million primes | 64-bit | x86_64 asm, 7div | 21.7 minutes | 1.5 M
3.1 GHz | Intel Core i5-4278U, DDR3L-1600 | 100 million primes | 64-bit | x86_64 asm, 3div | 23.7 minutes | 1.4 M
3.1 GHz | Intel Core i5-4278U, DDR3L-1600 | 100 million primes | 64-bit | x86_64 asm, CMP | 27.9 minutes | 1.2 M
3.1 GHz | Intel Core i5-4278U, DDR3L-1600 | 100 million primes | 64-bit | x86_64 asm, ROR | 27.9 minutes | 1.2 M
3.1 GHz | Intel Core i5-4278U, DDR3L-1600 | 100 million primes | 64-bit | x86_64 asm, ROR2 | 28.0 minutes | 1.2 M
3.1 GHz | Intel Core i5-4278U, DDR3L-1600 | 100 million primes | 64-bit | x86_64 asm, AVX | 29.2 minutes | 1.1 M
3.1 GHz | Intel Core i5-4278U, DDR3L-1600 | 100 million primes | 64-bit | x86_64 asm, CMOV | 55.5 minutes | 0.58 M

What can be learned from this is that the slowness of the divisions has the most negative impact on performance.

The maintenance of the counters is secondary but still impactful if it is abnormally slow, as when implemented with conditional-move (CMOV) instructions.

i386 implementation variants

There are two, based on the implementation of is_prime:

  1. Compare-branch with only one division per loop.
  2. Compare-branch with 28 divisions per loop.

I found that 28 was the sweet spot on an i5-4278U, performing better than 16, 24 or 32 divisions per loop.

203.3 million 32-bit primes

There are 203280221 primes that can be expressed in 32 bits or less, beginning with 2. Generating all of them on a 32-bit system running my 32-bit prime sieve gives the following results:

CPU, RAM | Total primes generated | Data size | Implementation | Duration | Efficiency*
4.2 GHz AMD Ryzen 5 5650U, 16GB LPDDR4-4267 | 203.3 million primes | 32-bit | i386 asm | 26.5 minutes | 1.8 M
3.1 GHz Intel Core i5-4278U, 8GB DDR3L-1600 | 203.3 million primes | 32-bit | i386 asm, 28div | 55.6 minutes | 1.2 M
3.1 GHz Intel Core i5-4278U, 8GB DDR3L-1600 | 203.3 million primes | 32-bit | i386 asm | 89 minutes | 0.74 M
1.8 GHz Raspberry Pi 4b, 8GB LPDDR4-3200 | 203.3 million primes | 32-bit | aarch32 asm | 105 minutes | 1.1 M

Notes:

  • Efficiency is measured in primes/minute/GHz.
  • The i5-4278U was running Raspberry Pi OS for x86, which uses 32-bit userland and a 64-bit kernel.

1 billion primes

Finding a billion primes on a 64-bit system requires allocating 7.63 GB of RAM to hold 1,000,000,000 64-bit prime numbers. If your computer has only 8GB of RAM, you may need to shut down the windowing system (e.g. the X Window System) to run this at maximum speed without paging.

CPU and RAM type | Total primes generated | Data size | Implementation | Duration | Efficiency*
4.06 GHz Apple MBP-14 M3, MacOS, 8GB LPDDR5-6400 | 1 billion primes | 64-bit | aarch64 asm, 28div | 96.5 minutes | 2.55 M
4.06 GHz Apple MBP-14 M3, MacOS, 8GB LPDDR5-6400 | 1 billion primes | 64-bit | aarch64 asm, 4div | 100 minutes | 2.46 M
3.5 GHz Apple MBA M2 (MacOS), 16GB LPDDR5-6400 | 1 billion primes | 64-bit | aarch64 asm | 112 minutes | 2.55 M
3.2 GHz Apple MBA M1 (MacOS), 8GB LPDDR4X-4266 | 1 billion primes | 64-bit | aarch64 asm | 125.9 minutes | 2.48 M
3.5 GHz Apple MBA M2 (MacOS), 8GB LPDDR5-6400 | 1 billion primes | 64-bit | aarch64 asm | 134.0 minutes | 2.1 M
3.5 GHz Apple MBP M2 (Linux), 8GB LPDDR5-6400 | 1 billion primes | 64-bit | aarch64 asm | 137.0 minutes | 2.1 M
5.1 GHz AMD Ryzen 7 7840U (Linux), 32GB LPDDR5-6400 | 1 billion primes | 64-bit | x86_64 asm | 259.1 minutes | 0.76 M
4.7 GHz AMD Ryzen 7 6850U (Linux), 32GB LPDDR5-6400 | 1 billion primes | 64-bit | x86_64 asm | 284.5 minutes | 0.75 M
4.2 GHz AMD Ryzen 5 5650U (Linux), 16GB LPDDR4-4267 | 1 billion primes | 64-bit | x86_64 asm | 313.5 minutes | 0.76 M
4.4 GHz Intel Core i5 1240P (Linux), 16GB LPDDR4-4267 | 1 billion primes | 64-bit | x86_64 asm | 418.7 minutes | 0.54 M
1.8 GHz Raspberry Pi 4b (Linux), 8GB LPDDR4-3200 | 1 billion primes | 64-bit | aarch64 asm | 1117 minutes | 0.50 M
1.5 GHz StarFive JH-7110 RISC-V (Linux), 8GB LPDDR4-1867 | 1 billion primes | 64-bit | riscv64 asm | 1845 minutes | 0.36 M

Notes:

  • Efficiency is measured in primes/minute/GHz.
  • The RISC-V board was the Pine64 Star64.

The last column represents the efficiency of the CPU in generating primes, with the Apple Silicon chips still beating the nearest x86 contender by a factor of more than 3.

One conclusion one could reach is that RAM speed matters when generating 1 billion primes.

  • The Apple Silicon processors have tightly coupled RAM that runs faster than the separate chips you see in other processors.
  • At the other extreme, the StarFive JH-7110 is coupled with slow RAM, which hampers its ability to load divisors with sufficient speed.

It is possible that the Apple Silicon CPUs have built-in divider circuits that are 3X faster than microcode-based division.

All systems were running Linux except the machines marked MacOS in the table. The M2 Air might be slower than the M1 because the M1 Air is known to have a better heat spreader on the CPU and therefore better cooling.

In the presence of super-fast RAM, as in the Macs or the Ryzens, CPU speed also matters, as evidenced by the Ryzen 5 and 7 having about the same efficiency scores and differing only in CPU speed.

An interesting observation is the similarity of the efficiency scores between the Raspberry Pi 4b and the Core i5 1240P, which one might naively assume ought to be very different.

If the CPUs are inherently equally efficient per clock but have different clock speeds, the ratio of their clock speeds should roughly correspond to the inverse of the ratio of their run times:

  | Core i5-1240P | Raspberry Pi 4b | Ratio
Clock speed | 4.4 GHz | 1.8 GHz | 2.5
Run time | 419 minutes | 1120 minutes | 1/2.7

2 billion primes

Finding two billion primes on a 64-bit system requires allocating 15258 MB of RAM to hold 2,000,000,000 64-bit prime numbers.

Device | Total primes generated | Data size | Implementation | Duration | Efficiency*
5.1 GHz AMD Ryzen 7 7840U (Linux), 32GB LPDDR5-6400 | 2 billion primes | 64-bit | x86_64 asm | 719.9 minutes (12 hours) | 0.55 M

  • Efficiency is measured in primes/minute/GHz.

CPU speeds

To determine the Apple Silicon CPU speed, one has to run the powermetrics utility, which is part of MacOS and displays core speeds in real time. The primes64 program always migrates to a performance core and runs at the highest speed.

For x86, the speed given is the Turbo Boost speed of the core running primes64/32, because the most demanding process will typically run in that mode, if not immediately then within a second or two.

Analysis

The goal of my program is to find all prime numbers up to a reasonable limit, not just Mersenne primes or some other subset. Since most CPUs and operating systems are 64-bit these days, I'm currently looking for 64-bit primes and using the CPUs' built-in unsigned division instruction to that end.

My algorithm has several advantages which explain its speed.

  1. By using counters to identify multiples of the small prime numbers, it avoids testing roughly 87% of candidate numbers for primeness (the exact figure depends on the architecture).
    • It avoids divisions for numbers that are already covered by the prime number counters.
    • The more counters it uses, the greater the percentage of candidate numbers that it can avoid checking for primality.
  2. The bookkeeping data that it uses to avoid checking numbers is entirely in the fastest memory available: CPU registers.
  3. Its memory use is proportional to the number of primes found, not the number of candidate numbers as you see with the Sieve of Eratosthenes.
  4. It minimizes divisions by checking prime divisors only up to 1+sqrt(number).
  5. The newest incarnations use speculative execution to perform parallel divisions on a single core, even on x86.

Download

The source code:

Changes

  • 1.4 optimizes aarch64 single-core parallel division for Linux and MacOS.
  • 1.3 adds i386 single-core parallel division for Linux.
  • 1.2 adds aarch64 single-core parallel division for Linux and MacOS.
  • 1.1 optimizes x86_64, adds single-core parallel division to AVX and ROR2.
  • 1.0 adds x86_64 single-core parallel division using speculative execution.
  • 0.13 adds support for x86_64 MacOS.
  • 0.12 adds an x64 AVX SIMD experiment.
  • 0.11 adds various experiments to speed up x64.
  • 0.10 adds improvements to i386 and aarch32.
  • 0.9 adds support for RISC-V 64.
  • 0.8 adds support for i386.
  • 0.7 adds support for both the Raspberry Pi's aarch32 and Apple Silicon's armv8.
