© 2026 by Zack Smith. All rights reserved.
TL;DR
This is my benchmark that generates prime numbers without using unsigned integer division instructions and it instead uses the Eratosthenes algorithm, which is based on a bit array.
It maintains a bit array in RAM in which each bit indicates the presence or absence of a prime number.
Eratosthenes works by setting all multiples of every known prime throughout the bit array, then finding the first 0 bit after the last discovered prime number, which always will be a prime.
For every new prime that is discovered, I add it to the output buffer and then set all bits in the array that represent its multiples. Then repeat... until I reach the end of the array.
In my code the bit array represents only odd numbers, for efficiency's sake, because even numbers are never prime except the value 2. Thus:
- Bit 3 in the first byte of the bit array represents the number 7, which is prime.
- Bit 7 in the first byte of the bit array represents the number 15, which is not prime but is a multiple of the primes 3 and 5.
Because the entire bit array has to be updated for every new prime number that is discovered, this algorithm is susceptible to slowdown due to paging when the bit array's size is large enough to use all of the available RAM.
To ensure the bit array is located and stays entirely in RAM, there are a few measures that one can take: ,,
by first ensuring the bit array is as small as possible, and second by turning off swapping.
Findings
- Even implementing this in C, it is much faster than my unsigned division algorithm.
- Eratosthenes involves some overhead: When seeking X million primes, the bit array should be about 1.4X megabytes in size, where a megabyte is 1,048,576 bytes.
- The run is always slower at the beginning because it has to set more bits for lower primes than than for larger primes.
- As the run proceeds, the lower parts of the bit array can be ignored because earlier bits represent multiples of already-found primes.
- My assembly language optimizations are only very slightly more effective than GCC's optimized code.
Results
I generated all of these results on 64-bit systems.
The program can also run on 32-bit systems; I developed that mode in a container using podman.
100 million primes
I used a bit array of size 122 MB.
| CPU frequency | CPU name and RAM type | Implementation | Duration |
|---|---|---|---|
| 4.464 GHz | Apple MBA-13 M5, MacOS, 24GB LPDDR5 | C 0.9 | 0.1 minutes |
| 3.1 GHz | Intel Core i5-4278U, Linux, 8GB DDR3L-1600 | C 0.9 | 0.4 minutes |
| 2.2 GHz | Snapdragon 750G, Android Samsung A52 5G, 6GB | C 0.9 | 0.4 minutes |
| 0.8 GHz | Intel Core i5-4278U, Linux, 8GB DDR3L-1600 | C+asm 0.10 | 0.9 minutes |
| 1.5 GHz | Pine64 Star64 RISC-V StarFive JH-7110, Linux, 8GB LPDDR4-1867 | C 0.9 | 4.4 minutes |
The resulting primes.dat file can be found
here.
1 billion primes
I used a bit array of size 1.328 GB.
| CPU frequency | CPU name and RAM type | Implementation | Duration |
|---|---|---|---|
| 4.464 GHz | Apple MBA-13 M5, MacOS, 24GB LPDDR5 | C 0.9 | 1.1 minutes |
| 3.1 GHz | Intel Core i5-4278U, Linux, 8GB DDR3L-1600 | C 0.9 | 3.6 minutes |
| 2.2 GHz | Snapdragon 750G, Android Samsung A52 5G, 6GB | C 0.9 | 7.1 minutes |
| 0.8 GHz | Intel Core i5-4278U, Linux, 8GB DDR3L-1600 | C+asm 0.10 | 8.8 minutes |
| 1.5 GHz | Pine64 Star64 RISC-V StarFive JH-7110, Linux, 8GB LPDDR4-1867 | C 0.9 | 51.1 minutes |
The resulting primes.dat file can be found
here.
2 billion primes
I used a bit array of size 2.74 GB.
| CPU frequency | CPU name and RAM type | Implementation | Duration |
|---|---|---|---|
| 4.464 GHz | Apple MBA-13 M5, MacOS, 24GB LPDDR5 | C 0.9 | 2.6 minutes |
| 3.1 GHz | Intel Core i5-4278U, Linux, 8GB DDR3L-1600 | C 0.7 | 8.6 minutes |
| 2.2 GHz | Snapdragon 750G, Android Samsung A52 5G, 6GB | C 0.9 | 18.8 minutes |
| 1.5 GHz | Pine64 Star64 RISC-V StarFive JH-7110, Linux, 8GB LPDDR4-1867 | C 0.9 | 113.5 minutes |
The resulting primes.dat file can be found
here.
3 billion primes
I used a bit array of size 4.19 GB.
| CPU frequency | CPU name and RAM type | Implementation | Duration |
|---|---|---|---|
| 4.464 GHz | Apple MBA-13 M5, MacOS, 24GB LPDDR5 | C+asm 0.10 | 3.9 minutes |
| 4.464 GHz | Apple MBA-13 M5, MacOS, 24GB LPDDR5 | C 0.9 | 4.0 minutes |
| 3.1 GHz | Intel Core i5-4278U, Linux, 8GB DDR3L-1600 | C 0.7 | 12.4 minutes |
| 1.5 GHz | Pine64 Star64 RISC-V StarFive JH-7110, Linux, 8GB LPDDR4-1867 | C 0.9 | 178.9 minutes |
The resulting primes.dat file can be found
here.
4 billion primes
I used a bit array of size 5.65 GB.
| CPU frequency | CPU name and RAM type | Implementation | Duration |
|---|---|---|---|
| 4.464 GHz | Apple MBA-13 M5, MacOS, 24GB LPDDR5 | C 0.9 | 5.5 minutes |
| 3.1 GHz | Intel Core i5-4278U, Linux, 8GB DDR3L-1600 | C | 18.8 minutes |
| 1.5 GHz | Pine64 Star64 RISC-V StarFive JH-7110, Linux, 8GB LPDDR4-1867 | C 0.9 | 244.4 minutes |
The resulting primes.dat file can be found
here.
5 billion primes
I used a bit array of size 7.13 GB.
| CPU frequency | CPU name and RAM type | Implementation | Duration |
|---|---|---|---|
| 4.464 GHz | Apple MBA-13 M5, MacOS, 24GB LPDDR5 | C+asm 0.10 | 6.8 minutes |
| 4.464 GHz | Apple MBA-13 M5, MacOS, 24GB LPDDR5 | C 0.9 | 6.9 minutes |
| 3.1 GHz | Intel Core i5-4278U, Linux, 8GB DDR3L-1600 | C (2 arrays) | 24.0 minutes |
| 1.5 GHz | Pine64 Star64 RISC-V StarFive JH-7110, Linux, 8GB LPDDR4-1867 | C 0.9 | 309.5 minutes |
* The i5-4278U required the RAM pressure workaround, described below.
The resulting primes.dat file can be found
here.
6 billion primes
I used a bit array of size 8.619 GB.
| CPU frequency | CPU name and RAM type | Implementation | Duration |
|---|---|---|---|
| 4.464 GHz | Apple MBA-13 M5, MacOS, 24GB LPDDR5 | C 0.9 | 8.4 minutes |
The resulting primes.dat file can be found
here.
8 billion primes
I used a bit array of size 11.64 GB.
| CPU frequency | CPU name and RAM type | Implementation | Duration |
|---|---|---|---|
| 4.464 GHz | Apple MBA-13 M5, MacOS, 24GB LPDDR5 | C+asm 0.10 | 11.5 minutes |
The resulting primes.dat file can be found
here.
10 billion primes
I used a bit array of size 14.675 GB.
| CPU frequency | CPU name and RAM type | Implementation | Duration |
|---|---|---|---|
| 4.464 GHz | Apple MBA-13 M5, MacOS, 24GB LPDDR5 | C+asm 0.10 | 14.6 minutes |
| 4.464 GHz | Apple MBA-13 M5, MacOS, 24GB LPDDR5 | C 0.9 | 14.8 minutes |
The resulting primes.dat file can be found
here.
12 billion primes
I used a bit array of size 17.741671 GB.
| CPU frequency | CPU name and RAM type | Implementation | Duration |
|---|---|---|---|
| 4.464 GHz | Apple MBA-13 M5, MacOS, 24GB LPDDR5 | C 0.9 | 19.1 minutes |
The resulting primes.dat file can be found
here.
Download
The source code:
RAM pressure workaround
When testing a Macbook Pro with Intel Core i5-4278U and 8GB of RAM, which was running Debian Linux, I had to use two bit arrays totalling 7.13 GB, with includes a lower bit array of 1024 MB that was deallocated as early as possible during the run, because I found that Linux was eventually killing the program due to excessive RAM pressure, even in multi-user (non-graphical) mode.
The root cause was that Debian still uses enough of the available 8GB RAM by itself
to prevent running zEratosthenes for more than a few minutes without it being killed.
This may be attributable to systemd, because
if I compare this to the Star64 RISC-V single-board computer,
which in contrast uses a non-systemd distro,
I find that the Star64 has no trouble providing enough of
that board's 8GB RAM to run to zEratosthenes to completion without the two-array hack.
The RAM pressure mitigations that I used on the i5-4278U machine included:
- Boot Linux into multi-user (non-GUI) mode so that there is no GUI, which frees up RAM.
- On Linux,
sudo systemctl set-default multi-user
- On Linux,
- Tell
systemdto suspend unnecessary services like CUPS, Docker and pipewire.- On Linux, e.g.
sudo systemctl stop docker
- On Linux, e.g.
- Reduce the output array size to only a few megabytes. This negatively impacts performance but is necessary.
- Use the two-array option to relieve memory pressure during the run.
To ensure that no paging is performed, which significantly slows down the program, it is helpful to turn off swap partitions/files.
- On Linux,
sudo swapoff /dev/sdX
Useful links