© 2012-2024 by Zack Smith. All rights reserved.
Introduction
My program, called simply bandwidth, is an artificial benchmark for measuring memory bandwidth on computers of several architectures.
It is useful for identifying weaknesses in a computer's memory subsystem, in the bus architecture, in the cache architecture and in the processor itself.
This program is open source and covered by the GPL license. Although I wrote it mainly for my own benefit, I am also providing it pro bono, i.e. for the public good.
The latest bandwidth supports:
- 32- and 64-bit x86 running GNU/Linux
- 64-bit RISC-V running Linux
- 32- and 64-bit ARM running Linux
- 64-bit ARM computers running MacOS
- 64-bit x86 running MacOS
- 64-bit x86 running Windows
Thus it supports these instruction set architectures (ISAs):
- i386
- x86_64
- ARM 64-bit (aarch64)
- ARM 32-bit (aarch32)
- RISC-V 64-bit (riscv64)
Why write the core routines in assembly language?
From the start, I've implemented optimized core routines in assembly language for each architecture.
The exact same core assembly language routines run on all computers of a given instruction set architecture.
This is analogous to using the exact same physical 12-inch ruler to measure multiple items in different contexts.
This approach is vital, yet some benchmarks forgo it and instead implement their core routines in a high-level language like C.
If the core routines were written in C or C++, the final code that is executed would vary between systems and over time.
That would be like trying to measure something with multiple rulers of different lengths, each being somewhat more or less than 12 inches, which would lead to unreliable measurements.
Measurement of anything requires a standard, a specific known thing to compare everything to that is invariant in every context.
Compiled code varies because:
- Different people might use different compilers, e.g. GCC versus Clang.
- Each version of a compiler might produce different code.
- Each compilation might use different optimization options, resulting in different code, e.g. -O3 versus -O2.
- Some compilers (e.g. C++ compilers) might remove entire sections of code, incorrectly marking the code as unnecessary, in order to optimize it.
- Various implementations of standard libraries will likely perform differently; for instance, I recently discovered a bzero function that only cleared a maximum of maybe a megabyte, even if asked to clear more.
For these reasons, core benchmark routines should never be written in a high-level language nor use standard libraries.
They must always be coded in assembly to ensure reliable comparisons.
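To make the ruler analogy concrete: the measurement itself is conceptually simple, roughly a timed loop over a block of memory. Below is a minimal sketch of such a harness. Note that read_loop is a hypothetical stand-in written in C only to keep the example self-contained; in bandwidth the timed routine is fixed assembly, precisely so that it cannot vary between compilers.

    /* Sketch of a C-side timing harness. read_loop is a hypothetical
       stand-in; the real benchmark times hand-written assembly so the
       measured code never varies. */
    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    static uint64_t read_loop(volatile uint64_t *p, size_t words, size_t loops)
    {
        uint64_t sum = 0;
        for (size_t l = 0; l < loops; l++)
            for (size_t i = 0; i < words; i++)
                sum += p[i];        /* sequential reads of the same block */
        return sum;
    }

    int main(void)
    {
        size_t bytes = 32768;       /* an L1-sized block */
        size_t loops = 100000;
        uint64_t *buf = aligned_alloc(64, bytes);
        if (!buf) return 1;
        memset(buf, 0, bytes);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        uint64_t sum = read_loop(buf, bytes / 8, loops);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double seconds = (t1.tv_sec - t0.tv_sec)
                       + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%.1f MB/second (checksum %llu)\n",
               (double)bytes * loops / seconds / 1e6,
               (unsigned long long)sum);
        free(buf);
        return 0;
    }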
Comparing speeds of different memory types
How fast is each type of memory on a typical system? This is the kind of detail that students taking Computer Architecture are asked on exams.
Computers have registers, caches (typically 3 levels), dynamic RAM, and of course slow mass storage.
Here are results from a 2.4 GHz Core i5 520M and 1066 MHz RAM, from slowest to fastest.
- Reading from the Crucial m4 SSD: 250 MB/second.
- Reading from main memory (1066 MHz): 7 GB/second = 28 times faster.
- Reading from L3 cache: maximum 21 GB/second = 3 times faster than main memory or 86 times faster than SSD.
- Reading from L2 cache: maximum 29.5 GB/second = 1.4 times faster than L3; 4.2 times faster than main memory; or 120 times faster than SSD.
- Reading from L1 cache: maximum 44.5 GB/second = 1.5 times faster than L2; 2.1 times faster than L3; 6.4 times faster than main memory; or 178 times faster than SSD.
Graphs generated by bandwidth
Snapdragon 888 in Samsung S21 FE
This was obtained by running bandwidth in Termux, which establishes a Unix-like environment within Android.
Pine's Star64 board with StarFive JH7110 RISC-V SoC
Apple M3, Macbook Pro 14 (2023), burst 4.1 GHz
Apple M2, Macbook Air 15 (2023), burst 3.5 GHz
Apple M1, Macbook Air 13 (2020), burst 3.2 GHz
Raspberry Pi 4 b 8GB, running 64-bit Raspbian, 1.8 GHz
Intel Core i5-1135G7, 2.4 GHz, burst 4.2 GHz
Ryzen 5 5500U, 2.1 GHz, burst 4 GHz
Intel Core i5-540M at 2.53 to 3.07 GHz with 3MB L3 cache, running 64-bit GNU/Linux:
Intel Core 2 Duo P8600 at 2.4 GHz with 3MB L2 cache, running Mac OS/X Snow Leopard, 64-bit routines:
Download
Click here to download 1.14.10
Running bandwidth on Linux

I initially wrote bandwidth to run on x86 Linux, so the most preparation needed is to install gcc or clang, and, for x86, to install nasm. For ARM and RISC-V I use the native assembler, as.
If your processor has multiple types of cores, e.g. performance and efficiency cores, it might help to use the taskset command to ensure that bandwidth isn't moved between cores during its run. In addition, you can use nice -n -20 to ensure that the kernel gives it priority.
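If you prefer to do the same thing from code rather than the shell, the equivalent Linux calls are sched_setaffinity and setpriority. This is just an illustrative sketch, not something bandwidth itself requires:

    /* Pin the current process to one core and raise its priority:
       the programmatic equivalents of `taskset` and `nice`. Linux-specific. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);       /* core 0; on a hybrid CPU pick a fast core */
        if (sched_setaffinity(0, sizeof set, &set) != 0)
            perror("sched_setaffinity");

        /* -20 is the highest priority; requires root or CAP_SYS_NICE. */
        if (setpriority(PRIO_PROCESS, 0, -20) != 0)
            perror("setpriority");

        /* ... run the measurement here ... */
        return 0;
    }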
Running bandwidth on MacOS

To run bandwidth on MacOS, if you don't have Xcode already installed, you'll at least need to install the command-line utilities, which include gcc, make, etc. To trigger that installation, go to Terminal and type gcc or git.
Running bandwidth on x86 Windows

To run bandwidth on Windows, I typically install Cygwin, which provides a Unix-like environment within Windows. You'll want to install the packages gcc, make, and nasm. You might also be able to run bandwidth within Microsoft's Windows Subsystem for Linux (WSL).
Running bandwidth on Android

To run bandwidth on Android, you should install the Termux app, which provides a Unix-like terminal environment within Android in which programs can be edited and compiled and packages can be installed. You'll want to install the packages clang, make, and binutils. While Termux establishes an oasis for programmers, make sure you choose the package server wisely to avoid putting state-sponsored malware on your device. You will need to run the command termux-setup-storage to enable access to /sdcard.

Since Android phones typically have multiple core types, e.g. my Snapdragon 888 has slow, medium, and fast cores, you should run bandwidth with the taskset command to specify that you want it to run on the faster core(s). Failing to specify a core could result in Android's Linux kernel switching bandwidth between fast and slow cores many times during execution, resulting in a useless, jaggy output graph. For accurate results you should also kill all other apps running on Android and turn off the Internet connection(s).

Finally, if you want to run bandwidth with the --slow option, note that Android will try to kill it because it runs longer than Android wants.
Observations of running one instance of bandwidth
The first interesting thing to notice is the difference in performance between 32-, 64-, and 128-bit transfers on the same processor. These differences show that when programmers go to the trouble of revising software to use 64- or 128-bit transfers where appropriate, especially aligning them to 8- or 16-byte boundaries and performing the transfers sequentially, great speed-ups can be achieved.
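As a simplified illustration in C (bandwidth's real routines are in assembly), the two loops below copy the same data one byte at a time versus eight aligned bytes at a time; compiled without auto-vectorization, the second typically runs several times faster:

    /* Simplified illustration: 8-bit versus aligned 64-bit copies.
       n is a byte count; for copy_words it must be a multiple of 8
       and the pointers must be 8-byte aligned. */
    #include <stdint.h>
    #include <stddef.h>

    void copy_bytes(uint8_t *dst, const uint8_t *src, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i];        /* one byte per iteration */
    }

    void copy_words(uint64_t *dst, const uint64_t *src, size_t n)
    {
        for (size_t i = 0; i < n / 8; i++)
            dst[i] = src[i];        /* eight bytes per iteration */
    }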
A second observation is the importance of having fast DRAM. The case of Apple Silicon shows that the closer the DRAM is to the CPU, the faster it can be. Soldered RAM is also generally faster than RAM in a SO-DIMM.
A third observation is the remarkable difference in speeds between memory types. In some cases the L1 cache is more than twice as fast as L2, and L1 is perhaps 9 times faster than main memory, whereas L2 is often 3 times faster than DRAM.
Click here for a table of sample output values
Running multiple instances of bandwidth simultaneously
Is bandwidth actually showing the maximum bandwidth to and from main memory? There is an easy way to test this. We can run one instance of bandwidth on each core of a multi-core CPU and add up the bandwidths to/from main memory for all instances, to see whether they approach the published limits for our main memory system.
On my former Core i5-5257U dual-core Macbook Pro, with DDR3 (PC3-8500) memory in two SO-DIMMs and therefore dual-channel, the maximum peak memory bandwidth ought to have been 8533 MB/second per channel, and therefore 17 GB/second total given that it had two channels. See Crucial's page. What bandwidth measures, however, is the sustained rate, not the peak. Intel's spec sheet for the i5-5257U says the maximum memory bandwidth for this CPU is 25.6 GB/second.
Running on just one core:
- Reading, it maxed out at ~7050 MB/second from main memory.
- Writing through the caches, it maxed out at ~5120 MB/second to main memory.
- Writing and bypassing the caches, it maxed out at ~5520 MB/second to main memory.
When I had two instances of bandwidth running at the same time, one on each core, the picture was a little different, but not much.
- Reading, the total bandwidth from main memory was ~8000 MB/second, or 14% faster than running just one instance of bandwidth.
- Writing without bypassing the caches, the total bandwidth to main memory was ~5650 MB/second, or 10% faster than one instance.
- Writing with the cache bypass (non-temporal writes), the total bandwidth to main memory was ~6050 MB/second, or 10% faster than one instance.
In short, even when running two instances of bandwidth in order to push the system to its limits, this Core i5 system behaved more like it had single-channel RAM (just one SO-DIMM) than dual-channel.
Change log
Release 1.14: Support for running on Android in the Termux Unix-like environment.
Release 1.13: More AVX-512 support. ARM64 nontemporal transfers.
Release 1.12: RISC-V 64-bit support.
Release 1.11: AVX-512 support. Improved fonts. Fixed Win64/Cygwin support.
Release 1.10: ARM64 (AArch64) support and improved ARM32 (AArch32) support.
Release 1.9: More object-oriented improvements.
Release 1.8: More object-oriented improvements. Windows 64-bit support.
Release 1.7: Isolated Object-Oriented C library.
Release 1.6: Updated to use object-oriented C. Fixed Raspberry Pi support.
Release 1.5: Improved 256-bit routines. Added --nice switch.
Release 1.4: I added randomized 256-bit routines for 64-bit Intel CPUs.
Release 1.3: I added CSV output. I updated the ARM code for the Raspberry Pi 3 (AArch32).
Release 1.2: I put my old 32-bit ARM code back in for generic ARM systems.
Release 1.1: This release adds a second, larger font.
Release 1.0: This update separates out the graphing functionality. It also adds tests for the LODS[BWDQ] instructions, because while it is common knowledge that these instructions are slow and useless, sometimes widely-held beliefs are wrong, so I added this test, which proves just how dramatically slow LODS instructions are.
Release 0.32: A little support for AVX.
Release 0.31: This release adds printing of cache information for Intel processors in 32-bit mode.
Release 0.30: This release adds printing of cache information for Intel processors in 64-bit mode.
Release 0.29: Further improved granularity with the addition of 128-byte tests. Removed ARM support.
Release 0.28: Added proper feature checks using the CPUID instruction.
Release 0.27: Added 128-byte chunk size tests to x86 processors to improve granularity, especially around the 512-byte dip seen on Intel CPUs.
Commentary
Intel's Max Memory Bandwidth number
of e.g. 68 GB/sec from your 18-core processor, what they mean is the upper combined limit for all cores. To test this, you can run multiple copies of my bandwidth utility simultaneously, then add up the bandwidth values from each core accessing main memory. Each individual core may achieve quite a bit less bandwidth going to main memory. That's OK.
This larger number may at first seem like a marketing gimmick from Intel, but it's a good number to know because when your system is extremely busy, this is the upper limit that will constrain all the cores' combined activity. What Intel should also do is give the per-core maximum alongside the collective maximum.
Why are 256-bit register transfers to/from main memory not faster than 128-bit?
The path to main memory is usually 128 bits wide, at least on a laptop, since it will typically have two SO-DIMMs. This is referred to as dual-channel, as there are two 64-bit memory buses going from the RAM cards to the CPU.
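To see this effect yourself, it is enough to compare loads of the two widths over a buffer much larger than the caches. The sketch below is a hypothetical illustration using compiler intrinsics, not bandwidth's assembly routines; once the working set exceeds the caches, both versions tend toward the same DRAM-limited rate:

    /* Hypothetical sketch: summing a buffer with 128-bit versus 256-bit
       loads. Compile with -mavx2 on x86_64; both pointers must be
       32-byte aligned. */
    #include <immintrin.h>
    #include <stddef.h>

    __m128i sum128(const __m128i *p, size_t bytes)
    {
        __m128i acc = _mm_setzero_si128();
        for (size_t i = 0; i < bytes / 16; i++)
            acc = _mm_add_epi64(acc, _mm_load_si128(&p[i]));
        return acc;                 /* 16 bytes per load */
    }

    __m256i sum256(const __m256i *p, size_t bytes)
    {
        __m256i acc = _mm256_setzero_si256();
        for (size_t i = 0; i < bytes / 32; i++)
            acc = _mm256_add_epi64(acc, _mm256_load_si256(&p[i]));
        return acc;                 /* 32 bytes per load */
    }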
The impact of an L4 cache
Level 4 caches are ostensibly for improving graphics performance, the idea being that the GPU shares the cache with the CPU. But does it impact CPU performance? A bandwidth user, Michael V., provided a graph showing that it does for the Intel Core i7-4750HQ. The 128MB L4 cache appears to be roughly twice as fast as main memory.
Sequential versus random memory access
Modern processor technology is optimized for predictable memory access behaviors, and sequential accesses are of course that. As the graphs above show, out-of-order accesses disrupt the cache contents, as well as the translation lookaside buffer (TLB) which translates virtual memory page addresses to physical memory page addresses, resulting in lower memory bandwidth. The random result is more like real-world memory performance.
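A minimal way to reproduce this effect is to touch the same words through an index array that is either left in ascending order or shuffled, as in this hypothetical C sketch:

    /* Hypothetical sketch: reading the same words sequentially versus in
       a shuffled order. The shuffled walk defeats the prefetchers and
       strains the TLB, so it runs at a much lower bandwidth. */
    #include <stdint.h>
    #include <stdlib.h>

    /* Sum buf[] in the order given by order[], which holds the indices
       0..n-1 either ascending (sequential) or shuffled (random). */
    uint64_t walk(const uint64_t *buf, const size_t *order, size_t n)
    {
        uint64_t sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += buf[order[i]];
        return sum;
    }

    /* Fisher-Yates shuffle, to build the random ordering. */
    void shuffle(size_t *order, size_t n)
    {
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            size_t tmp = order[i];
            order[i] = order[j];
            order[j] = tmp;
        }
    }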
Generalizations about memory and register performance
One has certain expectations about the performance of different memory subsystems in a computer. My program confirms these.
- Reading is usually faster than writing.
- Register-to-register transfers are the fastest possible transfers.
- Register-to-stack and stack-to-register transfers are often half as fast as register-to-register transfers, because the stack typically finds itself in the L1 data cache.
- L1 cache accesses are significantly faster than L2 accesses e.g. by a factor of 2.
- L1 and L2 cache accesses are much faster than main memory accesses.
- L2 cache writing is usually slower than L2 reading. This is because, if the data at the address being written to is not already in L2, existing data in the cache must often be flushed out to main memory before it can be replaced.
- If the L2 cache is in write-through mode then L2 writing will be very slow, more on par with main memory write speeds. The modern form of write-through is cache-bypassing writes (nontemporal is Intel's term for it); see the sketch after this list.
- Some CPUs buffer writes, giving a false appearance of fast writes. A memory barrier instruction can flush that buffer to give a more realistic measurement of write speed.
- Main memory is slower to write than to read. This is just the nature of DRAM. It takes time to charge or discharge the capacitor that is in each DRAM memory cell whereas reading it is faster.
- The more memory channels that your system has, the faster your system should perform. Laptops are typically dual-channel.
- If memory sticks are mismatched sizes e.g. 8GB with 32GB, they will not be as fast as when they are matched.
- The closer physically that DRAM is to the CPU, the faster it can perform because there is less electrical noise introduced in the shorter wires.
- Linux framebuffer accesses are usually slower than main memory.
- However framebuffer writing is usually faster than framebuffer reading.
- C library memcpy and memset are often pretty slow; perhaps this is due to unaligned loads and stores and/or insufficient optimization.
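For reference, the cache-bypassing writes and the memory barrier mentioned in the list above look roughly like this on x86 when written with compiler intrinsics; bandwidth itself uses hand-written assembly:

    /* Hypothetical sketch of cache-bypassing writes on x86, using the
       intrinsic forms of MOVNTDQ and SFENCE. dst must be 16-byte aligned;
       compile with -msse2 (the default on x86_64). */
    #include <emmintrin.h>
    #include <stddef.h>

    void fill_nontemporal(__m128i *dst, __m128i value, size_t bytes)
    {
        for (size_t i = 0; i < bytes / 16; i++)
            _mm_stream_si128(&dst[i], value);   /* write around the caches */
        _mm_sfence();   /* drain write-combining buffers before the clock stops */
    }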
Architectural advantages
As you can see from the above graphs, architecture matters too.
- Apple's unified memory architecture puts the DRAM closer to the CPU, allowing for high bandwidth and low latency. In the case of the M1, sequential reads from main memory hover around 50 GB/second and writes are around 30 GB/second.
- Intel's AVX512 allows very fast reads from L1 and L2 caches.
A historical addendum
One factor that reduces a computer's bandwidth is a write-through cache, be it L2 or L1. These were used in early Pentium-based computers but were quickly replaced with more efficient write-back caches.
Today's equivalent is nontemporal or cache-bypassing accesses, which are needed for data that don't belong in the caches, such as streamed video data.
SSE4 vector-to/from-register transfers
While transfers between the main registers and the XMM vector registers using the MOVD and MOVQ instructions perform well, transfers involving the PINSR* and PEXTR* instructions are slower than expected. In general, to move a 64-bit value into or out of an XMM register using MOVQ is twice as fast as using PINSRQ or PEXTRQ, suggesting a lack of optimization on Intel's part of the latter instructions.
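For readers who want to experiment, these transfers correspond to well-known compiler intrinsics, shown in the hypothetical sketch below; timing the two paths in a loop should reproduce roughly the 2x difference:

    /* Hypothetical sketch of the two transfer paths, using the intrinsic
       forms of the instructions named above. Compile with -msse4.1 on
       x86_64. */
    #include <smmintrin.h>
    #include <stdint.h>

    __m128i in_movq(int64_t x)
    {
        return _mm_cvtsi64_si128(x);        /* MOVQ: integer register -> XMM */
    }

    int64_t out_movq(__m128i v)
    {
        return _mm_cvtsi128_si64(v);        /* MOVQ: XMM -> integer register */
    }

    __m128i in_pinsrq(__m128i v, int64_t x)
    {
        return _mm_insert_epi64(v, x, 0);   /* PINSRQ: the slower path */
    }

    int64_t out_pextrq(__m128i v)
    {
        return _mm_extract_epi64(v, 0);     /* PEXTRQ: the slower path */
    }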
What is dual-channel asynchronous mode?
If you have two SO-DIMMs that are not the same capacity e.g. 8GB and 16GB, they will operate in dual-channel asynchronous mode, which is faster than single-channel but slower than synchronous dual-channel mode. The latter mode is what you get when the two SO-DIMMs' sizes are the same.
My Xeon has a 20 MB shared L3. Will it be fast?
A shared L3 means that if you have n cores and each is running a program that is very memory-intensive, like bandwidth or the genetics program BLAST, the amount of L3 that each core can effectively use will be its fraction of the total: if Y is the number of megabytes of L3, that is Y/n.
This was proven when a person with a dual-CPU Xeon 2690 system (20 MB L3, 8 cores, and 4 channels per CPU) ran an instance of bandwidth on each of the 8 cores, resulting in each core effectively having only 5 MB of L3.