
Every 7.8μs your computer’s memory has a hiccup

2018-11-23

5 min read

Modern DDR3 SDRAM. Source: BY-SA/4.0 by Kjerish

During my recent visit to the Computer History Museum in Mountain View, I found myself staring at some ancient magnetic core memory.


Source: BY-SA/3.0 by Konstantin Lanzet

I promptly concluded I had absolutely no idea how these things could ever work. I wondered whether the rings rotate (they don't), and why each ring has three wires woven through it (I still don't understand exactly how they work). More importantly, I realized I have very little understanding of how modern computer memory - dynamic RAM - actually works!


Source: Ulrich Drepper's series about memory

I was particularly interested in one of the consequences of how dynamic RAM works. You see, each bit of data is stored by the charge (or lack of it) on a tiny capacitor within the RAM chip. But these capacitors gradually lose their charge over time. To avoid losing the stored data, they must regularly get refreshed to restore the charge (if present) to its original level. This refresh process involves reading the value of every bit and then writing it back. During this "refresh" time, the memory is busy and it can't perform normal operations like loading or storing bits.

This has bothered me for quite some time and I wondered... is it possible to notice the refresh delay in software?

Dynamic RAM refresh primer

Each DIMM module is composed of "cells" organized into "rows", "columns", "sides" and/or "ranks". This presentation from the University of Utah explains the nomenclature. You can check the configuration of the memory in your computer with the decode-dimms command. Here's an example:

$ decode-dimms
Size                                       4096 MB
Banks x Rows x Columns x Bits              8 x 15 x 10 x 64
Ranks                                      2
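
As a sanity check, that geometry multiplies out to the reported module size. Here is a quick back-of-the-envelope sketch (I'm reading "Rows" and "Columns" as address-bit counts and "Bits" as the data width):

# Does 8 banks x 2^15 rows x 2^10 columns x 64 bits x 2 ranks
# really add up to the reported 4096 MB?
banks, row_bits, col_bits, width_bits, ranks = 8, 15, 10, 64, 2

size_bits = banks * (2 ** row_bits) * (2 ** col_bits) * width_bits * ranks
print(size_bits // 8 // 1024 // 1024, "MB")  # prints: 4096 MB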

For today we don't need to get into the whole DDR DIMM layout; we just need to understand a single memory cell storing one bit of information. Specifically, we are only interested in how the refresh process is performed.

Let's review two sources:

Each bit stored in dynamic memory must be refreshed, typically every 64ms (called the Static Refresh). This is a rather costly operation. To avoid one major stall every 64ms, the process is divided into 8192 smaller refresh operations. In each operation, the computer's memory controller sends refresh commands to the DRAM chips. After receiving the instruction, a chip refreshes 1/8192 of its cells. Doing the math - 64ms / 8192 = 7812.5 ns, or about 7.81 μs. This means:

  • A refresh command is issued every 7812.5 ns. This is called tREFI.

  • It takes some time for the chip to perform the refresh and recover so it can perform normal read and write operations again. This time, called tRFC, is either 75ns or 120ns (as per the mentioned Micron datasheet).

When the memory is hot (above 85°C) the memory retention time drops and the static refresh time halves to 32ms, with tREFI falling to 3906.25 ns.

A typical memory chip is busy with refreshes for a significant fraction of its running time - between 0.4% and 5%. Furthermore, memory chips are responsible for a nontrivial share of a typical computer's power draw, and a large chunk of that power is spent on performing the refreshes.
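
To get a feel for these numbers, here is a quick back-of-the-envelope calculation, just plugging in the tREFI and tRFC values quoted above:

# How much of the time is the chip busy refreshing? Roughly tRFC / tREFI.
tREFI_normal = 64_000_000 / 8192   # 7812.5 ns between refresh commands
tREFI_hot    = 32_000_000 / 8192   # 3906.25 ns when the chip runs above 85°C
for tRFC in (75, 120):             # per-refresh recovery time, in ns
    print("tRFC=%dns: %.1f%% (normal), %.1f%% (hot)" % (
        tRFC, 100 * tRFC / tREFI_normal, 100 * tRFC / tREFI_hot))

This lands at roughly 1% to 3%, comfortably within the range quoted above.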

For the duration of the refresh action, the whole memory chip is blocked. This means each and every bit in memory is blocked for more than 75ns every 7812ns. Let's measure this.

Preparing an experiment

To measure operations with nanosecond granularity we must write a tight loop, perhaps in C. It looks like this:

    for (i = 0; i < ...; i++) {
            // Perform memory load. Any load instruction will do
            *(volatile int *) one_global_var;

            // Flush CPU cache. This is relatively slow
            _mm_clflush(one_global_var);

            // mfence is needed, otherwise sometimes the loop
            // takes very short time (25ns instead of like 160). I
            // blame reordering.
            asm volatile("mfence");

            // Measure and record time
            clock_gettime(CLOCK_MONOTONIC, &ts);
    }

Full code is available on GitHub.

The code is really straightforward. Perform a memory read. Flush data from CPU caches. Measure time.

(Note: in a second experiment I attempted to use MOVNTDQA to perform the data load, but that requires a special uncacheable memory page, which needs root access.)

On my computer it generates data like this:

# timestamp, loop duration
3101895733,     134
3101895865,     132
3101896002,     137
3101896134,     132
3101896268,     134
3101896403,     135
3101896762,     359
3101896901,     139
3101897038,     137

Typically I get ~140ns per loop, but periodically the loop duration jumps to ~360ns. Sometimes I get odd readings longer than 3200ns.

Unfortunately, the data turns out to be very noisy. It's very hard to see if there is a noticeable delay related to the refresh cycles.

Fast Fourier Transform

At some point it clicked: since we want to find a fixed-interval event, we can feed the data into the FFT (Fast Fourier Transform) algorithm, which deciphers the underlying frequencies.

I'm not the first one to think about this - Mark Seaborn of Rowhammer fame implemented this very technique back in 2015. Even after peeking at Mark's code, getting FFT to work turned out to be harder than I anticipated. But finally I got all the pieces together.

First we need to prepare the data. FFT requires the input data to be sampled at a constant interval. We also want to crop the data to reduce noise. By trial and error I found that the best results come when the data is preprocessed as follows:

  • Small loop durations (smaller than 1.8 × the average) are cut out, ignored, and replaced with readings of "0". We really don't want to feed the noise into the algorithm.

  • All the remaining readings are replaced with "1", since we really don't care about the amplitude of the delay.

  • I settled on a sampling interval of 100ns, but any value that keeps the sampling frequency above the Nyquist rate (double the expected frequency) works fine.

  • The data needs to be resampled at fixed intervals before feeding it to the FFT. Any reasonable resampling method works fine; I ended up doing basic linear interpolation.

The algorithm is roughly:

UNIT = 100ns
A = [(timestamp, loop_duration), ...]  # raw measurements
B = []                                 # resampled output
p = 1
for curr_ts in frange(first_ts, last_ts, UNIT):
    while not (A[p-1].timestamp <= curr_ts < A[p].timestamp):
        p += 1
    v1 = 1 if avg*1.8 <= A[p-1].duration <= avg*4 else 0
    v2 = 1 if avg*1.8 <= A[p].duration <= avg*4 else 0
    v = estimate_linear(v1, v2, A[p-1].timestamp, curr_ts, A[p].timestamp)
    B.append(v)
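
For reference, here is a compact runnable sketch of the same preprocessing using numpy (the data.csv filename is just a placeholder for the two-column output shown earlier):

import numpy

# Load the "timestamp, loop duration" pairs written out by the measurement
# loop. Lines starting with '#' are skipped automatically.
ts, dur = numpy.loadtxt("data.csv", delimiter=",", unpack=True)
avg = dur.mean()

# Keep only the "slow" iterations, flattened to 1; everything else becomes 0.
spikes = ((dur >= avg * 1.8) & (dur <= avg * 4)).astype(float)

# Resample onto a fixed UNIT=100ns grid with linear interpolation.
UNIT = 100
grid = numpy.arange(ts[0], ts[-1], UNIT)
B = numpy.interp(grid, ts, spikes)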

On my data this produces a fairly boring vector like this:

[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]

The vector is pretty long though, typically about 200k data points. With data prepared like this, we are ready to feed it into the FFT!

C = numpy.fft.fft(B)
C = numpy.abs(C)
F = numpy.fft.fftfreq(len(B)) * (1000000000/UNIT)

Pretty simple, right? This produces two vectors:

  • C contains the complex-valued frequency components. We are not interested in the complex numbers themselves, so we can flatten them into magnitudes by calling abs().

  • F contains labels describing which frequency bin lies at which position in vector C. To normalize it to Hz, we multiply it by the sampling frequency of the input vector.

The result can be charted:

[chart: FFT output across the full frequency range]

The Y axis is unit-less since we normalized the delay times. Even so, the chart clearly shows spikes at fixed frequency intervals. Let's zoom in:

[chart: FFT output zoomed in on the first few spikes]

We can clearly see the first three spikes. After a bit of wishy-washy arithmetic, involving keeping only the readings at least 10 times the size of the average, we can extract the underlying frequencies:

127850.0
127900.0
127950.0
255700.0
255750.0
255800.0
255850.0
255900.0
255950.0
383600.0
383650.0

Doing the math: 1000000000 (ns/s) / 127900 (Hz) = 7818.6 ns

Hurray! The first frequency spike is indeed what we were looking for, and it does correlate with the refresh times.

The other spikes, at 256kHz, 384kHz, 512kHz and so on, are multiples of our base frequency of 128kHz, called harmonics. They are a side effect of performing an FFT on something resembling a square wave and are totally expected.
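
That spike extraction boils down to a few lines, reusing the C and F vectors from the FFT above (the 10×-the-average threshold comes from the wishy-washy arithmetic mentioned earlier, and the 2 kHz floor skips the DC component and low-frequency noise, as the command line tool below does):

# Keep the frequency bins whose magnitude is at least 10x the average,
# skipping DC and everything below 2kHz.
mask = (F > 2000) & (C > numpy.mean(C) * 10)
for freq in sorted(F[mask]):
    print("%9.1f Hz -> period %.1f ns" % (freq, 1e9 / freq))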

For easier experimentation, we prepared a command line version of this tool. You can run the code yourself. Here is a sample run on my server:

~/2018-11-memory-refresh$ make
gcc -msse4.1 -ggdb -O3 -Wall -Wextra measure-dram.c -o measure-dram
./measure-dram | python3 ./analyze-dram.py
[*] Verifying ASLR: main=0x555555554890 stack=0x7fffffefe2ec
[ ] Fun fact. I did 40663553 clock_gettime()'s per second
[*] Measuring MOVQ + CLFLUSH time. Running 131072 iterations.
[*] Writing out data
[*] Input data: min=117 avg=176 med=167 max=8172 items=131072
[*] Cutoff range 212-inf
[ ] 127849 items below cutoff, 0 items above cutoff, 3223 items non-zero
[*] Running FFT
[*] Top frequency above 2kHz below 250kHz has magnitude of 7716
[+] Top frequency spikes above 2kHZ are at:
127906Hz    7716
255813Hz    7947
383720Hz    7460
511626Hz    7141

I must admit the code is not perfectly stable. In case of trouble, consider disabling Turbo Boost and CPU frequency scaling, and tuning the machine for performance.

Finale

There are two major takeaways from this work.

We've seen that the low-level data is hard to analyze and seems pretty noisy. Instead of trying to figure something out with the naked eye, we can always employ the good old FFT. Some wishful thinking is needed when preparing the data.

Most importantly, we showed it's often possible to measure subtle hardware behaviours from a simple userspace process. This kind of thinking led to the discovery of the original Rowhammer vulnerability, was used in the Meltdown and Spectre attacks, and showed up again in the recent ECC-defeating reincarnation of Rowhammer.

There is so much more to be said. We barely scratched the surface of the inner workings of the memory subsystem. I recommend further reading:

Finally, here is some good reading about the old magnetic core memory:


