

Understanding CPU caching and performance

An introduction to the concepts of CPU caching and performance.

Introduction

Back when Ars first started, Intel had just released the first Celeron processor aimed at the low-end market, and since it lacked the off-die backside L2 cache of its cousin the PII, it turned out to be extremely overclockable. The Celeron pushed the overclocking craze into the mainstream in a big way, and Ars got its start by providing seats on the bandwagon to anyone with a web browser and a desire to learn the hows and whys of Celeron overclocking. Frank Monroe's Celeron Overclocking FAQ was one of the most relentlessly popular articles on Ars for what seemed like forever, and "Celeron" and "overclocking" were the two main search terms that brought people in from Yahoo, which at the time was our number one referrer. (In fact, some time ago my girlfriend mentioned that she had come across Ars via Yahoo and read the OC FAQ years before she and I ever met.)

Along with its overclockability, there was one peculiar feature of the "cacheless wonder," as the Celeron was then called, that blew everyone's mind: it performed almost as well on Quake benchmarks as the cache-endowed PII. What became evident in the ensuing round of newsgroup and BBS speculation over this phenomenon was that few people actually understood how caching works to improve performance. My suspicion is that this situation hasn't changed a whole lot since the Celeron's heyday. What has changed since then, however, is the relative importance of caching in system design. Despite the introduction of RAMBUS, DDR, and other next-gen memory technologies, CPU clockspeed and performance have grown significantly faster than main memory performance. As a result, L1, L2, and even L3 caches have become a major factor in preventing relatively slow RAM from holding back overall system performance by failing to feed code and data to the CPU at a high enough rate.

This article is intended as a general introduction to CPU caching and performance. It covers fundamental cache concepts like spatial and temporal locality, set associativity, how different types of applications use the cache, the general layout and function of the memory hierarchy, et cetera. Building on all of this, the next installment will address real-world examples of caching and memory subsystems in Intel's P4 and Motorola's G4e-based systems. (I hope to include some discussion of the XServe hardware, so stay tuned.) If you've ever wondered why cache size matters more in some applications than in others, or what people mean when they talk about "tag RAM," then this article is for you.

Caching basics

In order to really understand the role of caching in system design, it helps to think of the CPU and memory subsystem as operating on a producer-consumer (or client-server) model: the hard disk and RAM act as producers of code and data, and the CPU consumes what they provide. Driven by innovations in process technology and processor design, CPUs have increased their ability to consume at a significantly higher rate than the memory subsystem has increased its ability to produce. The problem is that CPU clock cycles have gotten shorter at a faster rate than memory and bus clock cycles, so the number of CPU clock cycles that the processor has to wait before main memory can fulfill its requests for data has increased. With each CPU clockspeed increase, then, memory gets further and further away from the CPU as measured in CPU clock cycles.

[Figure: memory access latency in CPU clock cycles, with a slower vs. a faster CPU clock]
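
To put rough numbers on the picture above, here's a minimal sketch in C. The figures are made-up round numbers for illustration only (a fixed 60 ns trip to main memory, and clockspeeds running from the Celeron era up through the P4 era), not measurements of any real system:

```c
#include <stdio.h>

int main(void)
{
    /* Assumed, illustrative figures: a fixed 60 ns main-memory access
       time, measured against CPUs of increasing clockspeed. */
    const double mem_latency_ns = 60.0;
    const double cpu_mhz[] = { 300.0, 1000.0, 2000.0 };

    for (int i = 0; i < 3; i++) {
        double cycle_ns = 1000.0 / cpu_mhz[i];   /* length of one CPU cycle */
        printf("%4.0f MHz: %4.2f ns/cycle -> ~%.0f cycles per memory access\n",
               cpu_mhz[i], cycle_ns, mem_latency_ns / cycle_ns);
    }
    return 0;
}
```

The same 60 ns access costs the 300 MHz CPU about 18 cycles, but it costs the 2 GHz CPU about 120, even though the memory itself hasn't gotten one bit slower.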

To visualize the effect that this widening speed gap has on overall system performance, imagine the CPU as a downtown furniture maker's workshop and the main memory as a lumberyard that keeps getting moved further and further out into the suburbs. Even if we start using bigger trucks to cart all the wood, it's still going to take longer from the time the workshop places an order to the time that order gets filled. (The bigger trucks raise the bandwidth of the connection, but they do nothing about its latency.)

Note: I'm not the first person to use a workshop and warehouse analogy to explain caching. The most famous example of such an analogy is the Thing King game, which I first saw in Peter van der Linden's Expert C Programming.

Sticking with our furniture workshop analogy, one solution to this problem would be to rent a smaller warehouse in town and store the most recently requested types of lumber there. This smaller, closer warehouse would act as a cache for the workshop, and we could keep a driver on hand who could run out at a moment's notice and quickly pick up whatever we need from it. Of course, the bigger our warehouse the better, because it allows us to store more types of wood, thereby increasing the likelihood that the raw materials for any particular order will be on hand when we need them. In the event that we need a type of wood that isn't in the closer warehouse, we'll have to drive all the way out of town to get it from our big, suburban warehouse. This is bad news, because unless our furniture workers have another task to work on while they're waiting for our driver to return with the lumber, they're going to sit around in the break room smoking and watching Oprah. And we hate paying people to watch Oprah.
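
The warehouse analogy maps directly onto code. Below is a minimal sketch in C (the 64 MB array size, the 64-byte cache line, and the timing method are all illustrative assumptions about the machine, not guarantees) that reads the same data two ways: sequentially, so that each trip to "the suburbs" brings back a whole cache line's worth of useful lumber, and at a line-sized stride, so that nearly every access is a separate long trip:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1L << 24)  /* 16M ints, 64 MB: much larger than any cache (assumed) */
#define STRIDE 16     /* 16 ints * 4 bytes = 64 bytes, one cache line (assumed) */

int main(void)
{
    int *a = malloc(N * sizeof *a);
    if (a == NULL)
        return 1;
    for (long i = 0; i < N; i++)
        a[i] = 1;

    volatile long sum = 0;  /* volatile keeps the loops from being optimized away */

    /* Sequential pass: spatial locality means each cache miss brings in
       a line that also satisfies the next 15 accesses. */
    clock_t t0 = clock();
    for (long i = 0; i < N; i++)
        sum += a[i];
    clock_t t1 = clock();

    /* Strided pass: the same number of accesses overall, but each one
       lands on a different cache line, so almost every access is a miss. */
    for (long s = 0; s < STRIDE; s++)
        for (long i = s; i < N; i += STRIDE)
            sum += a[i];
    clock_t t2 = clock();

    printf("sequential: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("strided:    %.3f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);

    free(a);
    return 0;
}
```

On most machines the sequential pass runs several times faster even though both versions perform exactly the same number of additions; the exact ratio depends on the cache hierarchy and the hardware prefetcher.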
