© 2012-2024 by Zack Smith. All rights reserved.
Introduction
My program, called simply bandwidth, is an artificial benchmark for measuring memory bandwidth on computers of several architectures.
It is useful for identifying weaknesses in a computer's memory subsystem, in the bus architecture, in the cache architecture and in the processor itself.
This program is open source and covered by the GPL license. Although I wrote it mainly for my own benefit, I am also providing it pro bono, i.e. for the public good.
The latest bandwidth supports:
- 32- and 64-bit x86 running GNU/Linux
- 64-bit RISC-V running Linux
- 32- and 64-bit ARM running Linux
- 64-bit ARM computers running MacOS
- 64-bit x86 running MacOS
- 64-bit x86 running Windows
Thus it supports these instruction set architectures (ISAs):
- i386
- x86_64
- ARM 64-bit (aarch64)
- ARM 32-bit (aarch32)
- RISC-V 64-bit (riscv64)
Why write the core routines in assembly language?
From the start, I've implemented optimized core routines in assembly language for each architecture.
The exact same core assembly language routines run on all computers of a given instruction set architecture.
This is analogous to using the exact same physical 12-inch ruler to measure multiple items in different contexts.
This approach is vital, yet some benchmarks forgo it and instead implement their core routines in a high-level language like C.
If the core routines were written in C or C++, the final code that is executed would vary between systems and over time.
That would be like trying to measure something with multiple rulers of different lengths, each being somewhat more or less than 12 inches, which would lead to unreliable measurements.
Measurement of anything requires a standard, a specific known thing to compare everything to that is invariant in every context.
Compiled code varies because:
- Different people might use different compilers, e.g. GCC versus Clang.
- Each version of a compiler might produce different code.
- Each compilation might use different optimization options, resulting in different code, e.g. -O3 versus -O2.
- Some compilers (e.g. C++ compilers) might remove entire sections of code, incorrectly marking the code as unnecessary, in order to optimize it.
- Various implementations of standard libraries will likely perform differently; for instance, I recently discovered a bzero function that only cleared a maximum of maybe a megabyte, even if asked to clear more.
For these reasons, core benchmark routines should never be written in a high-level language nor use standard libraries.
They must always be coded in assembly to ensure reliable comparisons.
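To make the ruler analogy concrete: the measurement itself is conceptually simple, roughly a timed loop over a block of memory. Below is a minimal sketch of such a harness. Note that read_loop is a hypothetical stand-in written in C only to keep the example self-contained; in bandwidth the timed routine is fixed assembly, precisely so that it cannot vary between compilers.

    /* Sketch of a C-side timing harness. read_loop is a hypothetical
       stand-in; the real benchmark times hand-written assembly so the
       measured code never varies. */
    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    static uint64_t read_loop(volatile uint64_t *p, size_t words, size_t loops)
    {
        uint64_t sum = 0;
        for (size_t l = 0; l < loops; l++)
            for (size_t i = 0; i < words; i++)
                sum += p[i];        /* sequential reads of the same block */
        return sum;
    }

    int main(void)
    {
        size_t bytes = 32768;       /* an L1-sized block */
        size_t loops = 100000;
        uint64_t *buf = aligned_alloc(64, bytes);
        if (!buf) return 1;
        memset(buf, 0, bytes);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        uint64_t sum = read_loop(buf, bytes / 8, loops);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double seconds = (t1.tv_sec - t0.tv_sec)
                       + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%.1f MB/second (checksum %llu)\n",
               (double)bytes * loops / seconds / 1e6,
               (unsigned long long)sum);
        free(buf);
        return 0;
    }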
Comparing speeds of different memory types
How fast is each type of memory on a typical system? This is the kind of detail that students taking Computer Architecture are asked on exams.
Computers have registers, caches (typically 3 levels), dynamic RAM, and of course slow mass storage.
Here are results from a 2.4 GHz Core i5 520M and 1066 MHz RAM, from slowest to fastest.
- Reading from the Crucial m4 SSD: 250 MB/second.
- Reading from main memory (1066 MHz): 7 GB/second = 28 times faster.
- Reading from L3 cache: maximum 21 GB/second = 3 times faster than main memory or 86 times faster than SSD.
- Reading from L2 cache: maximum 29.5 GB/second = 1.4 times faster than L3; 4.2 times faster than main memory; or 120 times faster than SSD.
- Reading from L1 cache: maximum 44.5 GB/second = 1.5 times faster than L2; 2.1 times faster than L3; 6.4 times faster than main memory; or 178 times faster than SSD.
Graphs generated by bandwidth
Snapdragon 888 in Samsung S21 FE
This was obtained by running bandwidth in Termux, which establishes a Unix-like environment within Android.
Pine's Star64 board with StarFive JH7110 RISC-V SoC
Apple M3, Macbook Pro 14 (2023), burst 4.1 GHz
Apple M2, Macbook Air 15 (2023), burst 3.5 GHz
Apple M1, Macbook Air 13 (2020), burst 3.2 GHz
Raspberry Pi 4 b 8GB, running 64-bit Raspbian, 1.8 GHz
Intel Core i5-1135G7, 2.4 GHz, burst 4.2 GHz
Ryzen 5 5500U, 2.1 GHz, burst 4 GHz
Intel Core i5-540M at 2.53 to 3.07 GHz with 3MB L3 cache, running 64-bit GNU/Linux:
Intel Core 2 Duo P8600 at 2.4 GHz with 3MB L2 cache, running Mac OS/X Snow Leopard, 64-bit routines:
Download
Click here to download 1.14.10
Running bandwidth on Linux

I initially wrote bandwidth to run on x86 Linux, so the most preparation needed is to install gcc or clang, and, for x86, to install nasm. For ARM and RISC-V I use the native assembler, as.
If your processor has multiple types of cores, e.g. performance and efficiency cores, it might help to use the taskset command to ensure that bandwidth isn't moved between cores during its run. In addition, you can use nice -n -20 to ensure that the kernel gives it priority.
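If you prefer to do the same thing from code rather than the shell, the equivalent Linux calls are sched_setaffinity and setpriority. This is just an illustrative sketch, not something bandwidth itself requires:

    /* Pin the current process to one core and raise its priority:
       the programmatic equivalents of `taskset` and `nice`. Linux-specific. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);       /* core 0; on a hybrid CPU pick a fast core */
        if (sched_setaffinity(0, sizeof set, &set) != 0)
            perror("sched_setaffinity");

        /* -20 is the highest priority; requires root or CAP_SYS_NICE. */
        if (setpriority(PRIO_PROCESS, 0, -20) != 0)
            perror("setpriority");

        /* ... run the measurement here ... */
        return 0;
    }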
Running bandwidth on MacOS

To run bandwidth on MacOS, if you don't have Xcode already installed, you'll at least need to install the command-line utilities, which include gcc, make, etc. To trigger that installation, go to Terminal and type gcc or git.
Running bandwidth on x86 Windows

To run bandwidth on Windows, I typically install Cygwin, which provides a Unix-like environment within Windows. You'll want to install the packages gcc, make, and nasm. You might also be able to run bandwidth within Microsoft's Windows Subsystem for Linux (WSL).
Running bandwidth on Android

To run bandwidth on Android, you should install the Termux app, which provides a Unix-like terminal environment within Android in which programs can be edited and compiled and packages can be installed. You'll want to install the packages clang, make, and binutils. While Termux establishes an oasis for programmers, make sure you choose the package server wisely to avoid putting state-sponsored malware on your device. You will need to run the command termux-setup-storage to enable access to /sdcard.

Since Android phones typically have multiple core types, e.g. my Snapdragon 888 has slow, medium, and fast cores, you should run bandwidth with the taskset command to specify that you want it to run on the faster core(s). Failing to specify a core could result in Android's Linux kernel switching bandwidth between fast and slow cores many times during execution, resulting in a useless, jaggy output graph. For accurate results you should also kill all other apps running on Android and turn off the Internet connection(s).

Finally, if you want to run bandwidth with the --slow option, note that Android will try to kill it because it runs longer than Android wants.
Observations of running one instance of bandwidth
The first interesting thing to notice is the difference in performance between 32-, 64-, and 128-bit transfers on the same processor. These differences show that when programmers go to the trouble of revising software to use 64- or 128-bit transfers where appropriate, especially aligning them to 8- or 16-byte boundaries and performing the transfers sequentially, great speed-ups can be achieved.
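As a simplified illustration in C (bandwidth's real routines are in assembly), the two loops below copy the same data one byte at a time versus eight aligned bytes at a time; compiled without auto-vectorization, the second typically runs several times faster:

    /* Simplified illustration: 8-bit versus aligned 64-bit copies.
       n is a byte count; for copy_words it must be a multiple of 8
       and the pointers must be 8-byte aligned. */
    #include <stdint.h>
    #include <stddef.h>

    void copy_bytes(uint8_t *dst, const uint8_t *src, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i];        /* one byte per iteration */
    }

    void copy_words(uint64_t *dst, const uint64_t *src, size_t n)
    {
        for (size_t i = 0; i < n / 8; i++)
            dst[i] = src[i];        /* eight bytes per iteration */
    }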
A second observation is the importance of having fast DRAM. The case of Apple Silicon shows that the closer the DRAM is to the CPU, the faster it can be. Soldered RAM is also generally faster than RAM in a SO-DIMM.
A third observation is the remarkable difference in speeds between memory types. In some cases the L1 cache is more than twice as fast as L2, and L1 is perhaps 9 times faster than main memory, whereas L2 is often 3 times faster than DRAM.
Click here for a table of sample output values
Running multiple instances of bandwidth simultaneously
Is bandwidth actually showing the maximum bandwidth to and from main memory? There is an easy way to test this. We can run one instance of bandwidth on each core of a multi-core CPU and add up the bandwidths to/from main memory for all instances, to see whether they approach the published limits for our main memory system.
On my former Core i5-5257U dual-core Macbook Pro, with DDR3 (PC3-8500) memory in two SO-DIMMs and therefore dual-channel, the maximum peak memory bandwidth ought to have been 8533 MB/second per channel, and therefore 17 GB/second total given that it had two channels. See Crucial's page. What bandwidth measures, however, is the sustained rate, not the peak. Intel's spec sheet for the i5-5257U says the maximum memory bandwidth for this CPU is 25.6 GB/second.
Running on just one core:
- Reading, it maxed out at ~7050 MB/second from main memory.
- Writing through the caches, it maxed out at ~5120 MB/second to main memory.
- Writing and bypassing the caches, it maxed out at ~5520 MB/second to main memory.
When I had two instances of bandwidth running at the same time, one on each core, the picture was a little different, but not much.
- Reading, the total bandwidth from main memory was ~8000 MB/second, or 14% faster than running just one instance of bandwidth.
- Writing without bypassing the caches, the total bandwidth to main memory was ~5650 MB/second, or 10% faster than one instance.
- Writing with the cache bypass (non-temporal writes), the total bandwidth to main memory was ~6050 MB/second, or 10% faster than one instance.
In short, even when running two instances of bandwidth in order to push the system to its limits, this Core i5 system behaved more like it had single-channel RAM (just one SO-DIMM) than dual-channel.
Change log
Release 1.14: Support for running on Android in the Termux Unix-like environment.
Release 1.13: More AVX-512 support. ARM64 nontemporal transfers.
Release 1.12: RISC-V 64-bit support.
Release 1.11: AVX-512 support. Improved fonts. Fixed Win64/Cygwin support.
Release 1.10: ARM64 (AArch64) support and improved ARM32 (AArch32) support.
Release 1.9: More object-oriented improvements.
Release 1.8: More object-oriented improvements. Windows 64-bit support.
Release 1.7: Isolated Object-Oriented C library.
Release 1.6: Updated to use object-oriented C. Fixed Raspberry Pi support.
Release 1.5: Improved 256-bit routines. Added --nice switch.
Release 1.4: I added randomized 256-bit routines for 64-bit Intel CPUs.
Release 1.3: I added CSV output. I updated the ARM code for the Raspberry Pi 3 (AArch32).
Release 1.2: I put my old 32-bit ARM code back in for generic ARM systems.
Release 1.1: This release adds a second, larger font.
Release 1.0: This update separates out the graphing functionality. It also adds tests for the LODS[BWDQ] instructions, because while it is common knowledge that these instructions are slow and useless, sometimes widely-held beliefs are wrong, so I added this test, which proves just how dramatically slow LODS instructions are.
Release 0.32: A little support for AVX.
Release 0.31: This release adds printing of cache information for Intel processors in 32-bit mode.
Release 0.30: This release adds printing of cache information for Intel processors in 64-bit mode.
Release 0.29: Further improved granularity with the addition of 128-byte tests. Removed ARM support.
Release 0.28: Added proper feature checks using the CPUID instruction.
Release 0.27: Added 128-byte chunk size tests to x86 processors to improve granularity, especially around the 512-byte dip seen on Intel CPUs.
Commentary
Intel's Max Memory Bandwidth number
of e.g. 68 GB/sec from your 18-core processor, what they mean is the upper combined limit for all cores. To test this, you can run multiple copies of my bandwidth utility simultaneously, then add up the bandwidth values from each core accessing main memory. Each individual core may achieve quite a bit less bandwidth going to main memory. That's OK.
This larger number may at first seem like a marketing gimmick from Intel, but it's a good number to know because when your system is extremely busy, this is the upper limit that will constrain all the cores' combined activity. What Intel should also do is give the per-core maximum alongside the collective maximum.
Why are 256-bit register transfers to/from main memory not faster than 128-bit?
The path to main memory is usually 128 bits wide, at least on a laptop, since it will typically have two SO-DIMMs. This is referred to as dual-channel, as there are two 64-bit memory buses going from the RAM cards to the CPU.
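To see this effect yourself, it is enough to compare loads of the two widths over a buffer much larger than the caches. The sketch below is a hypothetical illustration using compiler intrinsics, not bandwidth's assembly routines; once the working set exceeds the caches, both versions tend toward the same DRAM-limited rate:

    /* Hypothetical sketch: summing a buffer with 128-bit versus 256-bit
       loads. Compile with -mavx2 on x86_64; both pointers must be
       32-byte aligned. */
    #include <immintrin.h>
    #include <stddef.h>

    __m128i sum128(const __m128i *p, size_t bytes)
    {
        __m128i acc = _mm_setzero_si128();
        for (size_t i = 0; i < bytes / 16; i++)
            acc = _mm_add_epi64(acc, _mm_load_si128(&p[i]));
        return acc;                 /* 16 bytes per load */
    }

    __m256i sum256(const __m256i *p, size_t bytes)
    {
        __m256i acc = _mm256_setzero_si256();
        for (size_t i = 0; i < bytes / 32; i++)
            acc = _mm256_add_epi64(acc, _mm256_load_si256(&p[i]));
        return acc;                 /* 32 bytes per load */
    }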
The impact of an L4 cache
Level 4 caches are ostensibly for improving graphics performance, the idea being that the GPU shares the cache with the CPU. But does it impact CPU performance? A bandwidth user, Michael V., provided a graph showing that it does for the Intel Core i7-4750HQ. The 128MB L4 cache appears to be roughly twice as fast as main memory.
Sequential versus random memory access
Modern processor technology is optimized for predictable memory access behaviors, and sequential accesses are of course that. As the graphs above show, out-of-order accesses disrupt the cache contents, as well as the translation lookaside buffer (TLB) which translates virtual memory page addresses to physical memory page addresses, resulting in lower memory bandwidth. The random result is more like real-world memory performance.
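A minimal way to reproduce this effect is to touch the same words through an index array that is either left in ascending order or shuffled, as in this hypothetical C sketch:

    /* Hypothetical sketch: reading the same words sequentially versus in
       a shuffled order. The shuffled walk defeats the prefetchers and
       strains the TLB, so it runs at a much lower bandwidth. */
    #include <stdint.h>
    #include <stdlib.h>

    /* Sum buf[] in the order given by order[], which holds the indices
       0..n-1 either ascending (sequential) or shuffled (random). */
    uint64_t walk(const uint64_t *buf, const size_t *order, size_t n)
    {
        uint64_t sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += buf[order[i]];
        return sum;
    }

    /* Fisher-Yates shuffle, to build the random ordering. */
    void shuffle(size_t *order, size_t n)
    {
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            size_t tmp = order[i];
            order[i] = order[j];
            order[j] = tmp;
        }
    }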
Generalizations about memory and register performance
One has certain expectations about the performance of different memory subsystems in a computer. My program confirms these.
- Reading is usually faster than writing.
- Register-to-register transfers are the fastest possible transfers.
- Register-to-stack and stack-to-register transfers are often half as fast as register-to-register transfers, because the stack typically finds itself in the L1 data cache.
- L1 cache accesses are significantly faster than L2 accesses e.g. by a factor of 2.
- L1 and L2 cache accesses are much faster than main memory accesses.
- L2 cache writing is usually slower than L2 reading. This is because, if the data at the address being written to is not already in L2, existing data in the cache must often be flushed out to main memory before it can be replaced.
- If the L2 cache is in write-through mode then L2 writing will be very slow, more on par with main memory write speeds. The modern form of write-through is cache-bypassing writes (nontemporal is Intel's term for it); see the sketch after this list.
- Some CPUs buffer writes, giving a false appearance of fast writes. A memory barrier instruction can flush that buffer to give a more realistic measurement of write speed.
- Main memory is slower to write than to read. This is just the nature of DRAM. It takes time to charge or discharge the capacitor that is in each DRAM memory cell whereas reading it is faster.
- The more memory channels that your system has, the faster your system should perform. Laptops are typically dual-channel.
- If memory sticks are mismatched sizes e.g. 8GB with 32GB, they will not be as fast as when they are matched.
- The closer physically that DRAM is to the CPU, the faster it can perform because there is less electrical noise introduced in the shorter wires.
- Linux framebuffer accesses are usually slower than main memory.
- However framebuffer writing is usually faster than framebuffer reading.
- C library memcpy and memset are often pretty slow; perhaps this is due to unaligned loads and stores and/or insufficient optimization.
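For reference, the cache-bypassing writes and the memory barrier mentioned in the list above look roughly like this on x86 when written with compiler intrinsics; bandwidth itself uses hand-written assembly:

    /* Hypothetical sketch of cache-bypassing writes on x86, using the
       intrinsic forms of MOVNTDQ and SFENCE. dst must be 16-byte aligned;
       compile with -msse2 (the default on x86_64). */
    #include <emmintrin.h>
    #include <stddef.h>

    void fill_nontemporal(__m128i *dst, __m128i value, size_t bytes)
    {
        for (size_t i = 0; i < bytes / 16; i++)
            _mm_stream_si128(&dst[i], value);   /* write around the caches */
        _mm_sfence();   /* drain write-combining buffers before the clock stops */
    }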
Architectural advantages
As you can see from the above graphs, architecture matters too.
- Apple's unified memory architecture puts the DRAM closer to the CPU, allowing for high bandwidth and low latency. In the case of the M1, sequential reads from main memory hover around 50 GB/second and writes are around 30 GB/second.
- Intel's AVX512 allows very fast reads from L1 and L2 caches.
A historical addendum
One factor that reduces a computer's bandwidth is a write-through cache, be it L2 or L1. These were used in early Pentium-based computers but were quickly replaced with more efficient write-back caches.
Today's equivalent is nontemporal or cache-bypassing accesses, which are needed for data that don't belong in the caches, such as streamed video data.
SSE4 vector-to/from-register transfers
While transfers between the main registers and the XMM vector registers using the MOVD and MOVQ instructions perform well, transfers involving the PINSR* and PEXTR* instructions are slower than expected. In general, to move a 64-bit value into or out of an XMM register using MOVQ is twice as fast as using PINSRQ or PEXTRQ, suggesting a lack of optimization on Intel's part of the latter instructions.
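For readers who want to experiment, these transfers correspond to well-known compiler intrinsics, shown in the hypothetical sketch below; timing the two paths in a loop should reproduce roughly the 2x difference:

    /* Hypothetical sketch of the two transfer paths, using the intrinsic
       forms of the instructions named above. Compile with -msse4.1 on
       x86_64. */
    #include <smmintrin.h>
    #include <stdint.h>

    __m128i in_movq(int64_t x)
    {
        return _mm_cvtsi64_si128(x);        /* MOVQ: integer register -> XMM */
    }

    int64_t out_movq(__m128i v)
    {
        return _mm_cvtsi128_si64(v);        /* MOVQ: XMM -> integer register */
    }

    __m128i in_pinsrq(__m128i v, int64_t x)
    {
        return _mm_insert_epi64(v, x, 0);   /* PINSRQ: the slower path */
    }

    int64_t out_pextrq(__m128i v)
    {
        return _mm_extract_epi64(v, 0);     /* PEXTRQ: the slower path */
    }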
What is dual-channel asynchronous mode?
If you have two SO-DIMMs that are not the same capacity e.g. 8GB and 16GB, they will operate in dual-channel asynchronous mode, which is faster than single-channel but slower than synchronous dual-channel mode. The latter mode is what you get when the two SO-DIMMs' sizes are the same.
My Xeon has a 20 MB shared L3. Will it be fast?
A shared L3 means that if you have n cores and each is running a program that is very memory-intensive, like bandwidth or the genetics program BLAST, the amount of L3 that each core can effectively use will be its fraction of the total: if Y is the number of megabytes of L3, that is Y/n.
This was proven when a person with a dual-CPU Xeon 2690 system (20 MB L3, 8 cores, and 4 channels per CPU) ran an instance of bandwidth on each of the 8 cores, resulting in each core effectively having only 5 MB of L3.