Midterm Recap
(in class)
Computer organization
The computer, as we’ll deal with it, will still be a bit of an abstraction. Even when working in assembly language, there are still several layers of “magic” between what we write and what really happens!
The von Neumann Architecture
Most modern computers have an organization that falls under the banner of a “von Neumann architecture”. The essential elements of a von Neumann architecture are:
A processor (the formal description divides this into a processing unit and a control unit, but we can treat them as one) connected to
A single memory which stores both program and data.
I.e., the processor is separate from the memory, and that single memory holds both program and data. This distinguishes it from machines like the Connection Machine, where every memory unit also contains a small processor, and from Harvard architecture machines, which have two separate memories, one for programs and one for data.
Measuring memory access speed
Because the processor and memory are distinct functional units, it takes some time for data to get from memory to the CPU, and vice versa. The CPU and memory are connected via a bus (which also connects to other devices on the system).
The usual way to measure the speed of memory accesses is as throughput, bytes (or bits) per second. This measure, however, can be deceptive. Which of the following is the “best” way to transfer data from SoCal to New York, in terms of throughput:
1. Send it over the Internet.
2. Send it over a dedicated connection.
3. Fill a minivan with USB drives and drive cross-country.
In terms of throughput, option (3) is the “best”! This, despite the fact that it will take two days to drive straight through. Throughput, as a measure, is deceptive because it conflates two distinct notions:
1. How much data is being transferred
2. How long it will take to transfer data
The first is called bandwidth, while the second is called latency. Throughput is simply bandwidth divided by latency. The minivan-of-USBs has terrible latency (two days), but this is masked by its enormous bandwidth.
Bandwidth is the largest amount of data that can be transferred “in one go”. E.g., if we have a 64-bit (8-byte) bus, then we can transfer 8 bytes to/from the CPU in one shot. This also implies that if we transfer fewer than 8 bytes, it takes the same amount of time as if we had transferred the full 8.

Example: suppose we have a 64-bit bus, with a latency of 1ns. How long will it take to transfer 39 bytes? 39/8 rounded up is 5, so 5ns. We have to round up because the partial transfer of the last 7 bytes, whether it is done at the beginning or the end, still requires a full 1ns. The worst case would be something like 41 bytes, which rounds up to 6ns: the 1 extra byte still requires a full 1ns cycle to transfer.
Consider two memory architectures:
| | Arch. 1 | Arch. 2 |
|---|---|---|
| Latency | 100ms | 10ms |
| Bandwidth | 100KB | 1KB |
Which architecture is “faster”? If we look only at latency, it would appear that arch. 2 is faster, as it can begin each memory access 10x sooner than arch. 1. But arch. 1 can load 100x more data per request!
The von Neumann Bottleneck
In the earliest computers, clock speeds were slow enough that the time taken to access memory was roughly comparable to the time needed to perform operations on the CPU. Hence, “memory access” was just another kind of CPU operation, equal in performance to every other. However, CPU clock rates have increased dramatically, and while the sizes of memories available to us have increased, the latency of memory accesses has not kept up. It is now many times slower to access main memory than it is to perform an operation on the CPU. This speed difference is known as the von Neumann Bottleneck.
As an example of the consequences of the bottleneck, consider the task of performing trigonometric operations in real time (e.g., for rotation in a game). Historically, the fastest way to do this was to precompute a table of `sin` values and store it in memory; when you needed to compute `sin(x)`, you’d simply look up `x` in the table. Now, however, this approach is much too slow: it’s actually faster to compute the first dozen or so terms of the Taylor series approximation of `sin` on the CPU than it is to make a round trip to memory!
The main trade-off we face with memory is this:
If we make the memory small, then (because it doesn’t take up much physical space) we can place it close to the CPU, perhaps even on the CPU itself. This decreases latency.
If we want a large memory, then (because it will be physically large, too), we must place it physically farther away from the CPU, increasing latency.
In other words, small memory = fast, big memory = slow.
To try and get the best of both worlds, modern computer architectures use caches. A cache is a smaller memory that lies between the CPU and main memory, and stores a frequently-used subset of the main memory. E.g., while main memory might be 8GB in size, a cache might be 64KB in size. Because the cache is smaller, it can be physically closer to the CPU and thus memory accesses which are “in” the cache (“cache hits”) are faster. Memory accesses which are not “in” the cache (“cache misses”) are just as slow, but when a miss occurs we take that as a hint that the address, or others near it, will be accessed again in the near future and load it into the cache. Thus, the first access to a memory address will be a miss (and slow), but subsequent accesses will be hits (and fast).
Caching is based on the idea that programs typically do not access memory unpredictably, but rather access memory in predictable patterns. In particular, if we were to observe the pattern of memory accesses over a single run of a program, we would notice two things:
- Spatial locality: If a program accesses memory address `x`, it will probably also access addresses `x+1` or `x-1`. In other words, an access at one address suggests that other addresses near it will also be accessed.
- Temporal locality: If a program accesses memory address `x`, it will probably access `x` again, in the near future.
For example, consider this program, which multiplies together the values in a pair of 1024-element `int` arrays and then sums up all the products (this is called a “sum of products” and is used quite frequently in geometric and signal-processing applications):
```c
int add_arrays(int* a, int* b)
{
    int sum = 0;
    for (int i = 0; i < 1024; ++i)
        sum += a[i] * b[i];
    return sum;
}
```
This program has both spatial and temporal locality:
- We first access `a[0]` and `b[0]`, but the next time through the loop, we will access `a[1]` and `b[1]`. This holds throughout the loop: accessing `a[i]` and `b[i]` means that we will soon also access `a[i+1]` and `b[i+1]`. (Spatial locality.)
- The variable `sum` must live somewhere in memory, and as it is not an array, its address does not change in the loop. So accessing `sum` at a given time (loop iteration) suggests that we will need to access `sum` again, in the near future. (Temporal locality.)
(You might wonder how it is that we can load any significant amount of data from main memory into the cache without that operation taking a significant amount of time. E.g., if we want to load in 1024 bytes into the cache, won’t that take 1024 times as much time as a single memory access, thus eliminating any benefits of using the cache? The answer to this comes from the other main memory performance variable, bandwidth.)
This answers the question of how we can afford to load a reasonable amount of data into the cache: thanks to the wide memory bus, although each memory request is slow, it can carry a relatively large payload.
Modern memory architectures will usually have not just one, but two or three caches, of increasing size (and increasing slowness). If we include disk-based storage as a kind of “memory” (even slower than main memory), we get the memory hierarchy: registers, then the L1, L2, … caches, then main memory, then disk. The farther from the CPU we get, the larger the storage devices become, but the slower they are to access.
Cache organization
A cache is organized into lines and sets. For example, a 16KB cache might have 16 sets, each with one line, meaning each line is 1KB. Each line can store a cached region of main memory. Caches use multiple sets/lines so that independent memory accesses can be cached. E.g., consider a program reading from two arrays in a loop. If the cache could only hold one memory region at once, then the cache would constantly be switching between the two arrays, and every memory access would be a cache miss. If the cache has at least two lines in total, then after the first two accesses, both arrays will hopefully be loaded into the cache, and all subsequent accesses will be cache hits. E.g., in the above example, arrays `a` and `b` would be loaded into separate cache lines, assuming they are big enough, or far enough apart in memory, that they do not both fit in a single line.
The address of a line is the starting address of the memory region which it contains. This is always a multiple of the cache line size. For example, if the line size is 1KB, then a line with a starting address of 0x000400 will contain memory addresses 0x000400 through 0x0007ff. Any memory access in this range will be a cache hit.
If a cache has multiple sets, how do we know which set to place a memory access into? The cache set is chosen automatically, based on the address, divided by the line size, modulo the number of sets. For example, if the cache has 16 sets with 1KB lines, then the first KB of memory (addresses 0x0 through 0x3ff) will go into set 0, addresses 0x400 through 0x7ff will go into set 1, etc. Eventually, as the addresses get high enough (at 16KB, i.e., 0x4000), they will wrap around back to set 0.
A cache with only one line per set does not have very good performance, because it replaces lines too frequently. A cache needs to be a bit smarter about which lines it replaces. To accomplish this, we put more than one line in each set, and allow the cache some flexibility as to which line in the set it replaces. While the set within the cache is still chosen automatically (address divided by line size, modulo the number of sets), the line within the set can be chosen in a more intelligent way, by the cache replacement policy:
If we could see the future, we could choose the line which will not be used for the longest amount of time going forward (we obviously don’t want to replace a line that the very next memory instruction is going to use). This is the ideal cache replacement policy, but of course it’s not actually possible to implement, because we can’t see the future.
A reasonable alternative is to use the past as a proxy for the future: choose the cache line which has not been recently used. This is the least recently used (LRU) cache replacement policy. It requires every line to store an age along with its address; every cache hit increments the age of all other lines, and a cache miss (load) sets the age of the new line to 0. LRU produces good performance for relatively little cost.
The number of lines per set is called the associativity of a cache. The simple cache above, with 1 line per set, is called a set associative cache (elsewhere often called a direct-mapped cache). If there are 4 lines per set then we say the cache is 4-way associative. A cache with 1 set is called fully associative; the cache replacement policy determines the performance of the cache. The assignment asks you to implement a cache which is fully associative: there is only one set, so the cache replacement policy is always used to select a line.
Typically, a mixture of sets and lines is used: set selection, because it’s a simple modulo, can be done very quickly. Line selection is slower, but allows the cache to be used more efficiently.
Example
Suppose we have a 2-way associative cache; this means that every set has 2 lines in it. Our cache has a line size of 1KB (1024 bytes), and has 2 sets. Thus, the total cache size is

1024 bytes/line × 2 sets × 2 lines/set = 4096 bytes = 4KB
Suppose we access the following memory addresses (remember that 1024 is 0x400 in hexadecimal):
0x400 = 1024
0x800 = 2048
0xc07 = 3079
0x401 = 1025
0x801 = 2049
0xc07 = 3079
0x402 = 1026
0x802 = 2050
0xc07 = 3079
(This is basically the memory access pattern from the example above: array `a` starts at address 0x400, `b` at address 0x800, and `sum` is located at address 0xc07.)
In order to “simulate” the cache, we don’t actually need to store the values from memory, we just need to keep track of two pieces of information for each line:
What range of memory addresses is it storing? (Because all lines have the same size, we can just store the starting address of each line.)
How old is it? The oldest line is the “least recently used” and thus will be chosen for replacement, when needed.
| | Set 0, Line 0 | Set 0, Line 1 | Set 1, Line 0 | Set 1, Line 1 |
|---|---|---|---|---|
| Address | | | | |
| Age | ∞ | ∞ | ∞ | ∞ |
(To mark lines as empty, I’ve given them an age of infinity, so that they will be the first to be replaced.)
When we access the first address, 0x400, we divide it by the line size (1024), giving 1, and then take the result mod 2 (the number of sets), giving 1. This tells us that the first access should go in set 1.
In set 1, we use the LRU policy to choose which line to load into. Because both lines are empty, we can choose either (it won’t matter). Arbitrarily, we choose the left line. Cache miss. This results in the following cache:
| | Set 0, Line 0 | Set 0, Line 1 | Set 1, Line 0 | Set 1, Line 1 |
|---|---|---|---|---|
| Address | | | 0x400 | |
| Age | ∞ | ∞ | 0 | ∞ |
When we access address 0x800, we again divide by 1024 (= 2) and then take mod 2, giving 0. So this access goes in set 0.
Again, we arbitrarily place it in the left line. At the same time, we increment the age of every other line in the cache. (New lines get an age of 0.) Cache miss.
| | Set 0, Line 0 | Set 0, Line 1 | Set 1, Line 0 | Set 1, Line 1 |
|---|---|---|---|---|
| Address | 0x800 | | 0x400 | |
| Age | 0 | ∞ | 1 | ∞ |
Address 0xc07, divided by 1024, gives 3; 3 mod 2 is 1, so this access goes in set 1. (Note that the starting address for this line is 3072, i.e., 0xc00: 0xc07 rounded down to the nearest multiple of 1024.)
In set 1, we have two lines: one with an age of 1, and one with an age of ∞. We pick the older line, and increment the age of all other lines. Cache miss.
| | Set 0, Line 0 | Set 0, Line 1 | Set 1, Line 0 | Set 1, Line 1 |
|---|---|---|---|---|
| Address | 0x800 | | 0x400 | 0xc00 |
| Age | 1 | ∞ | 2 | 0 |
The next address is 0x401. Again, divide by 1024 (gives 1), mod 2, gives 1. So this access goes in set 1. Starting address is 0x400.
But in set 1, we find that a line with that starting address already exists, so we have a cache hit! All we do is increment the age of all other lines.
| | Set 0, Line 0 | Set 0, Line 1 | Set 1, Line 0 | Set 1, Line 1 |
|---|---|---|---|---|
| Address | 0x800 | | 0x400 | 0xc00 |
| Age | 2 | ∞ | 2 | 1 |
You can continue the example through all the addresses if you like; all of them should be cache hits. As we’ll see below, cache misses and cache hits have different latencies, so we could go through this process, adding up the amount of time taken by each access, to get the total amount of time spent accessing memory.
Cache misses
A “cache miss” occurs when we look for an address which is not in any of the lines in the cache. Why do cache misses occur? For a number of different reasons:
Obviously, the first memory access will be a miss. Even an infinitely large cache would take some time to be populated with the lines being used.
If the program is using more memory than the size of the cache, then it simply isn’t possible for all the active memory to be in the cache. Some accesses will have to be misses.
If a cache with no associativity is used, then some lines will be forced out just by the chance of two addresses modulo-ing to the same set. Even with associativity, some addresses will map to the same set, and none of the lines in that set might match the address.
Write strategies
It’s not just memory reads that access the cache; memory writes also modify the cache. Two strategies are available for dealing with writes:
Write-back: memory writes modify the cache. When a line is selected for replacement, we first check to see if it is dirty (has had any writes to it). If it is, then we first write it back to memory before replacing it.
Write-through: memory writes modify the cache and, in the background, are simultaneously sent to the next lower level of memory. There is no need to keep track of which lines are dirty, as main memory is always in sync with the contents of the cache.
Cache Examples
Caches are organized as a number of sets, each of which contains a number of lines. When a request for a memory access is received, a cache set is selected by taking the address of the request, dividing by the line size, and then taking the result modulo the number of sets. Example:
Suppose each line is 256B (i.e., 0x100 bytes) and there are 8 sets
Memory addresses 0x0 through 0xff will go into set 0
Memory addresses 0x100 through 0x1ff will go into set 1
…
Memory addresses 0x700 through 0x7ff will go into set 7
Addresses 0x800 through 0x8ff will wrap around (modulo) back into set 0.
Each set contains some number of lines. While the set is selected based purely on the address, the line within that set can be selected based on other, more intelligent criteria (typically involving things like the “age” of the line, how long it has been since a memory access within that line was received).
A cache in which each set has only a single line is called set associative. It does not use any intelligence in keeping or replacing cache lines, just the modulo calculation. This is fast, but tends to throw out lines that should really be kept in the cache. (Remember that part of the purpose of the cache is to keep frequently used data close to the CPU.)
A cache with a single set containing some number of lines is fully associative: it chooses every line replacement “intelligently”. This is probably the best from the perspective of keeping frequently-used lines in the cache, but maintaining the associativity information requires some time (the contents of each set are effectively little hash tables, with the address range of each line serving as the keys).
If there are n lines per set, then we say that the cache is n-way associative.
Because the lines within a set are not organized in any particular order, this implies that each line must store information about the range of addresses that it contains. Because each line has a fixed size, it’s usually sufficient to just store the starting address of the range saved by that line.
If a memory request is found within some set and line, then that is a cache hit. We update whatever aging information is saved within the cache line(s) and then return the result of the request to the CPU. If a memory request is not found, then we have a cache miss; we have to request a line’s worth of data from main memory (or, in a multi-level cache, the next lower cache in the system), and then we have to pick a line to put it in (remember that its set is determined by the address of the request). If there are no unused lines available in the set, then an in-use line must be selected to be replaced. How the cache chooses which line to replace is known as its cache replacement policy.
One of the easiest effective replacement policies is least-recently used: the line is selected which has not been used (had a cache hit within it) in the longest time. In order to implement this, the cache stores with each line an age. Whenever a memory request is received which is a hit, all of the other lines in that set have their ages incremented. When a line needs to be selected for replacement, the line with the largest age value is picked. (Replacement also zeros the age, as the new line coming in to the cache was “just born”.)
Example: Two-level cache
Here we are going to simulate a system with a two-level cache: a smaller L1 cache (closest to the CPU) and then a larger L2 cache.
| | L1 Cache | L2 Cache |
|---|---|---|
| Line size | 256B | 1024B |
| Sets | 1 | 4 |
| Lines/set | 4 | 1 |
Here the L1 cache is fully associative, while L2 is set-associative.
Furthermore, we’re going to calculate the total time needed to perform these memory accesses, by giving the latencies of each cache and main memory:
| | L1 Cache | L2 Cache | Main memory |
|---|---|---|---|
| Latency | 1ns | 10ns | 1000ns |
If we have a cache hit, then we add the cache’s latency to the cumulative time. If we have a cache miss, then we add the total time necessary to access the next lower cache. (E.g., if the L1 cache misses, then we pay at least 10ns to access the L2 cache, and if the L2 cache misses, another 1000ns on top of that to access main memory!)
We will simulate the following series of memory accesses:
| Address | (in decimal) | Latency |
|---|---|---|
| 0x01 | = 1 | |
| 0x100 | = 256 | |
| 0x807 | = 2055 | |
| 0x02 | = 2 | |
| 0x110 | = 272 | |
| 0x907 | = 2311 | |
| 0x03 | = 3 | |
| 0x120 | = 288 | |
| 0x1007 | = 4103 | |
| 0x04 | = 4 | |
Estimating cache performance
Instead of calculating exactly how much time a given sequence of memory accesses will take, we can approximate the average memory-access latency, if we know the probability that each cache in the system will hit.
We will use the following equation:

avg. access time = P(hit) × (hit latency) + (1 − P(hit)) × (miss latency)
For example, suppose that 90% of our memory accesses are cache hits, and cache hits have a latency of 1ms, while cache misses have a latency of 100ms. Then our average memory access time is

0.9 × 1ms + 0.1 × 100ms = 0.9ms + 10ms = 10.9ms

Note that this is just the average memory access time, not the exact memory access time, for which we would need to know the exact sequence of memory accesses.
Multi-level caching
As mentioned above, it’s common for computer systems to have more than one cache. Each “level” of cache is larger, but slower. For example, we might have
| Level | Latency | Hit percentage |
|---|---|---|
| L1 | 1ms | 80% |
| L2 | 10ms | 90% |
| Main memory | 100ms | 100% (obviously) |
(Note that hit percentages are given from the viewpoint of that level, not overall. Thus, 90% of the accesses that reach the L2 cache are hits.)
If we know the cache hit percentages for both cache levels, we can estimate the average memory access time for this multi-level cache. An access is served by L1 with probability 0.8, by L2 with probability 0.2 × 0.9, and by main memory with probability 0.2 × 0.1. Using the numbers above, we find

0.8 × 1ms + (0.2 × 0.9) × 10ms + (0.2 × 0.1) × 100ms = 0.8ms + 1.8ms + 2ms = 4.6ms
But note that this is only an average over many memory accesses. If we knew the exact pattern of memory accesses, we could simulate the state of the caches (which sets/lines were loaded) and determine the exact amount of time. The first assignment asks you to do just this.
Program execution performance
The instruction cache
The normal L1, L2, … caches are used for data; a separate cache is maintained for program code, called the instruction cache (or I-cache for short). There are two main reasons why a separate cache is used for code instead of just using the normal data cache(s):
On x86 systems, data and code normally live in different memory sections, so if we used one cache for both, it would constantly be switching between loading in part of the program, and loading in the data being used, and we would lose much of the advantages of caching.
The I-cache is small, typically only large enough to hold a short loop. If an entire loop can fit in the I-cache, then it will run much, much faster. The I-cache is effectively a level-1 cache, as close to the CPU as possible, so loops that fit in the I-cache, after the initial load, take almost no time to bring in each instruction.
A common rule of thumb is that “a program spends 90% of its time executing the same 10% of the program”; if true, optimizing for this common case by caching that 10% can produce dramatic performance improvements.
Some CPUs go even further and cache the “decoded” instructions (i.e., the internal microcode that the instructions we write are translated into), so that instructions in the I-cache take no time to load or to decode.
Instruction performance
Over a reasonably-sized program, we can estimate the runtime using a simple performance equation:

total cycles = (number of instructions) × (average clock cycles per instruction)
The term “average clock cycles per instruction” is often shortened to “CPI”. Of course, different instructions take different numbers of cycles, but over all the instructions (or perhaps just those used in our program), we can consider the average number of cycles per instruction. The simplest way to make a program run faster is to run it on a processor which can execute the instructions in fewer cycles.
The above equation measures cycles, not actual time. Time depends on the clock period (seconds per cycle) of the processor:

time = (number of instructions) × CPI × (seconds per cycle)
You might expect that the smallest CPI (cycles per instruction) would be 1, but as we’ll see, it is actually possible to have a CPI < 1.
Amdahl’s law
Amdahl’s law is an equation which describes how much of a performance improvement we can expect by improving part of a system:

improved time = (time affected by the improvement) ÷ (amount of improvement) + (time unaffected)
For example, suppose we have a program which spends 70 seconds doing multiplication, and 30 seconds doing other things. Currently, this program runs in 100 seconds. If we double the speed of multiplication, how much faster will it run?
With the improvement, multiplication takes 70/2 = 35 seconds, so the whole program runs in 35 + 30 = 65 seconds. The program would run 100/65 ≈ 1.5 times faster!
Suppose we want the program to run twice as fast: that is, our target improved time is 50 seconds (100/50 = 2x). How much faster would multiplication need to run in this case? We need 70/k + 30 = 50, i.e., 70/k = 20, so k = 70/20 = 3.5: multiplication would need to run 3.5 times faster.
Suppose we want the program to run five times as fast: our target improved time is 20 seconds (100/20 = 5x). Now we would need 70/k + 30 = 20, i.e., 70/k = -10, which no speedup k can achieve.
It’s not actually possible to make the program run 5x faster by only improving multiplication: even if multiplication took no time at all, the 30 seconds of “other things” would still exceed the 20-second target. At some point, the “other things” start to dominate the program’s runtime, and improving multiplication gives diminishing returns.