Midterm Recap

(in class)

Computer organization

The computer, as we’ll deal with it, will still be a bit of an abstraction. Even when working in assembly language, there are still several layers of “magic” between what we write and what really happens!

The von Neumann Architecture

Most modern computers have an organization that falls under the banner of a “von Neumann architecture”. The essential elements of a von Neumann architecture are:

  1. A processing unit (the CPU), which executes instructions.

  2. A single memory, which holds both program instructions and data.

  3. A bus connecting the two.

I.e., the processor is separate from the memory (distinguishing it from things like the Connection Machine, where every memory unit also contains a small processor), and there is a single memory for both programs and data (distinguishing it from Harvard architecture machines, which have two separate memories, one for programs and one for data). There are also architectures in which both of these memories exist on the processor itself, rather than being separate.

Measuring memory access speed

Because the processor and memory are distinct functional units, it takes some time for data to get from memory to the CPU, and vice versa. The CPU and memory are connected via a bus (which also connects to other devices on the system).

The usual way to measure the speed of memory accesses is as throughput, bytes (or bits) per second. This measure, however, can be deceptive. Which of the following is the “best” way to transfer data from SoCal to New York, in terms of throughput:

  1. Send it over the Internet.

  2. Send it over a dedicated connection.

  3. Fill a minivan with USB drives and drive cross-country.

In terms of throughput, option (3) is the “best”! This, despite the fact that it will take two days to drive straight through. Throughput, as a measure, is deceptive because it conflates two distinct notions:

  1. How much data can be moved in a single transfer.

  2. How long a single transfer takes.

The first is called bandwidth, while the second is called latency. Throughput is simply bandwidth divided by latency. The minivan-of-USBs has terrible latency (two days), but this is masked by its enormous bandwidth.

Bandwidth is the largest amount of data that can be transferred “in one go”. E.g., if we have a 64-bit (8-byte) bus, then we can transfer 8 bytes to/from the CPU in one shot. This also implies that if we transfer fewer than 8 bytes, it takes the same amount of time as if we had transferred the full 8. Example: suppose we have a 64-bit bus, with a latency of 1ns per transfer. How long will it take to transfer 39 bytes? 39/8 rounded up is 5, so 5ns. We have to round up because the partial transfer of 7 bytes, whether it is done at the beginning or the end, still requires a full 1ns. The worst case for wasted bandwidth is something like 41 bytes, which rounds up to 6ns: the 1 extra byte still requires a full 1ns cycle to transfer.
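
As a quick sanity check of this rounding rule, here is a minimal sketch in C (the function name and parameters are purely illustrative, and it assumes one fixed latency per bus transfer):

int transfer_time_ns(int nbytes, int bus_width_bytes, int latency_ns)
{
  /* Round the number of bus transfers up: a partial transfer still
     costs one full bus cycle. */
  int transfers = (nbytes + bus_width_bytes - 1) / bus_width_bytes;
  return transfers * latency_ns;
}

For the examples above, transfer_time_ns(39, 8, 1) gives 5 and transfer_time_ns(41, 8, 1) gives 6.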

Consider two memory architectures:

            Arch. 1   Arch. 2
Latency     100ms     10ms
Bandwidth   100KB     1KB

Which computer is “faster”? If we look only at latency, it would appear that arch. 2 is faster, as it can access memory 10x faster than arch. 1. But arch. 1 can load 100x more data per request!
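
To see why neither number alone tells the story, we can compute the raw throughput of each (using the throughput = bandwidth / latency relationship from above, and assuming each request moves a full bandwidth’s worth of data):

$$\text{Throughput}_1 = \frac{100\text{KB}}{100\text{ms}} = 1000\text{KB/s} \qquad \text{Throughput}_2 = \frac{1\text{KB}}{10\text{ms}} = 100\text{KB/s}$$

So arch. 1 moves 10x more data per second, even though each individual request takes 10x longer. Which one is “faster” depends on whether the program makes many small, scattered accesses (favoring arch. 2) or large, sequential ones (favoring arch. 1).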

The von Neumann Bottleneck

In the earliest computers, clock speeds were slow enough that the time taken to access memory was roughly comparable to the time needed to perform operations on the CPU. Hence, “memory access” was just another kind of CPU operation, equal in performance to every other. However, CPU clock rates have increased dramatically, and while the sizes of memories available to us have increased, the latency of memory accesses has not kept up. It is now many times slower to access main memory than it is to perform an operation on the CPU. This speed difference is known as the von Neumann Bottleneck.

As an example of the consequences of the bottleneck, consider the task of performing trigonometric operations in real time (e.g., for rotation in a game). Historically, the fastest way to do this was to precompute a table of sin values and store them in memory; when you needed to compute sin(x) you’d simply look up x in the table. Now, however, this approach is much too slow: it’s actually faster to compute the first dozen or so terms in the Taylor series approximation of sin on the CPU than it is to make a round-trip to memory!
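
As a rough illustration of the “compute it on the CPU” approach, here is a minimal sketch of a truncated Taylor series for sin (the cutoff of 8 terms is an arbitrary choice for illustration, not a tuned value):

double sin_taylor(double x)
{
  double term = x;   /* current term: x^(2k+1) / (2k+1)! */
  double sum  = x;
  for (int k = 1; k < 8; ++k) {
    /* Each term is the previous one times -x^2 / ((2k)(2k+1)). */
    term *= -x * x / ((2.0 * k) * (2.0 * k + 1.0));
    sum  += term;
  }
  return sum;
}

The whole computation touches only a handful of values that stay in registers, so it never has to wait on main memory.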

The main trade-off we face with memory is this: the larger a memory is, the slower it is to access; the smaller it is, the faster it can be made.

In other words, small memory = fast, big memory = slow.

To try and get the best of both worlds, modern computer architectures use caches. A cache is a smaller memory that lies between the CPU and main memory, and stores a frequently-used subset of the main memory. E.g., while main memory might be 8GB in size, a cache might be 64KB in size. Because the cache is smaller, it can be physically closer to the CPU and thus memory accesses which are “in” the cache (“cache hits”) are faster. Memory accesses which are not “in” the cache (“cache misses”) are just as slow, but when a miss occurs we take that as a hint that the address, or others near it, will be accessed again in the near future and load it into the cache. Thus, the first access to a memory address will be a miss (and slow), but subsequent accesses will be hits (and fast).

Caching is based on the idea that programs typically do not access memory unpredictably, but rather access memory in predictable patterns. In particular, if we were to observe the pattern of memory accesses over a single run of a program, we would notice two things:

  1. Temporal locality: an address that has been accessed recently is likely to be accessed again soon.

  2. Spatial locality: addresses near a recently-accessed address are likely to be accessed soon.

For example, consider this program which multiplies together the values in a pair of 1024-int arrays, and then sums up all the multiples (this is called a “sum-of-products” and is used in geometric and signal-processing applications quite frequently):

/* Computes the sum of products (dot product) of two 1024-int arrays. */
int add_arrays(int* a, int* b)
{
  int sum = 0;
  for(int i = 0; i < 1024; ++i)
    sum += a[i] * b[i];

  return sum;
}

This program has both spatial and temporal locality:

  1. Spatial locality: the elements of a and b are accessed sequentially, so each access is adjacent in memory to the previous one.

  2. Temporal locality: sum, i, and the pointers a and b are accessed on every iteration of the loop.

(You might wonder how it is that we can load any significant amount of data from main memory into the cache without that operation taking a significant amount of time. E.g., if we want to load in 1024 bytes into the cache, won’t that take 1024 times as much time as a single memory access, thus eliminating any benefits of using the cache? The answer to this comes from the other main memory performance variable, bandwidth.)

This answers the question as to how we can afford to load a reasonable amount of data into the cache: thanks to wide memory bandwidths, although the memory request is slow, it can have a relatively large payload.

Modern memory architectures will usually have not just one, but two or three caches, of increasing size (and increasingly slow). If we include disk-based storage as a kind of “memory” (even slower than main memory), we get the memory hierarchy, from smallest/fastest on the left, to biggest/slowest on the right:

Image of the memory hierarchy

The farther from the CPU we get, the larger the storage devices become, but the slower they are to access.

Cache organization

A cache is organized into lines and sets. For example, a 16KB cache might have 16 sets, each with one line, meaning each line is 1KB. Each line can store a cached region of main memory. Caches use multiple sets/lines so that independent memory accesses can be cached. E.g., consider a program reading from two arrays in a loop. If the cache could only hold one memory region at once, then the cache would constantly be switching between the two arrays, and every memory access would be a cache miss. If the cache has at least two lines in total, then after the first two accesses, both arrays will hopefully be loaded into the cache, and all subsequent accesses will be cache hits. E.g., in the above example, arrays a and b would be loaded into separate cache lines, assuming they are big enough/far enough apart in memory that they do not both fit in a single line.

The address of a line is the starting address of the memory region which it contains. This is always a multiple of the cache line size. For example, if the line size is 1KB, then a line with a starting address of 0x000400 will contain memory addresses 0x000400 through 0x0007ff. Any memory access in this range will be a cache hit.

If a cache has multiple sets, how do we know which set to place a memory access into? The cache set is chosen automatically, based on the address, divided by the line size, modulo the number of sets. For example, if the cache has 16 sets with 1KB lines, then the first KB of memory (addresses 0x0 through 0x3ff) will go into set 0, addresses 0x400 through 0x7ff will go into set 1, etc. Eventually, as the addresses get high enough, they will wrap around back to set 0.
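
The set-selection rule (and the rounding of an address down to its line’s starting address, described above) can be written directly; a minimal sketch, where line_size and num_sets are whatever parameters the cache has:

unsigned set_index(unsigned long addr, unsigned long line_size, unsigned long num_sets)
{
  return (addr / line_size) % num_sets;
}

unsigned long line_start(unsigned long addr, unsigned long line_size)
{
  /* Round addr down to the nearest multiple of the line size. */
  return addr - (addr % line_size);
}

For the 16-set, 1KB-line cache above, set_index(0x401, 1024, 16) is 1 and line_start(0x401, 1024) is 0x400.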

A cache with only one line per set does not have very good performance, because it is forced to replace lines too frequently. A cache needs to be a bit smarter about which lines it replaces. To accomplish this, we put more than one line in each set, and allow the cache some flexibility as to which line in the set it replaces. While the set within the cache is still chosen automatically (by address, divided by line size, modulo the number of sets), the line within the set can be chosen in a more intelligent way, according to the cache replacement policy. The most common example is least-recently used (LRU): replace the line in the set that has gone the longest without being accessed.

The number of lines per set is called the associativity of a cache. The simple cache above, with 1 line per set, is called a direct-mapped cache. If there are 4 lines per set then we say the cache is 4-way associative. A cache with only 1 set is called fully associative; in that case, the replacement policy alone determines the performance of the cache. The assignment asks you to implement a cache which is fully associative: there is only one set, so the cache replacement policy is always used to select a line.

Typically, a mixture of sets and lines is used: set selection, because it’s a simple modulo, can be done very quickly. Line selection is slower, but allows the cache to be used more efficiently.

Example

Suppose we have a 2-way associative cache; this means that every set has 2 lines in it. Our cache has a line-size of 1KB (1024 bytes), and has 2 sets. Thus, the total cache size is

$$\underbrace{1024}_{\text{line size}} \times \underbrace{2}_{\text{sets}} \times \underbrace{2}_{\text{assoc.}} = 4096\ \text{bytes} = 4\text{KB}$$

Suppose we access the following memory addresses (remember that 1024 is 0x400 in hexadecimal):

0x400    = 1024
0x800    = 2048
0xc07    = 3079
0x401    = 1025
0x801    = 2049
0xc07    = 3079
0x402    = 1026
0x802    = 2050
0xc07    = 3079

(This is basically the memory access pattern from the example above: array a starts at address 0x400, b at address 0x800, and sum is located at address 0xc07.)

In order to “simulate” the cache, we don’t actually need to store the values from memory; we just need to keep track of two pieces of information for each line:

  1. The starting address of the memory region the line holds.

  2. The line’s age (for the replacement policy).

Initially, the cache is empty:

            Set 0             Set 1
            Line    Line      Line    Line
Address     -       -         -       -
Age         ∞       ∞         ∞       ∞

(To mark lines as empty, I’ve given them an age of infinity, so that they will be the first to be replaced.)

  1. When we access the first address, 0x400, we divide it by the line size (1024) and then take the result mod 2 (the number of sets). This tells us that the first access should go in set 1.

  2. In set 1, we use the LRU policy to choose which line to load into. Because both lines are empty, we can choose either (it won’t matter). Arbitrarily, we choose the left line. Cache miss. This results in the following cache:

            Set 0             Set 1
            Line    Line      Line    Line
Address     -       -         0x400   -
Age         ∞       ∞         0       ∞

  1. When we access address 0x800, we again divide by 1024 (= 2) and then take mod 2, giving 0. So this access goes in set 0.

  2. Again, we arbitrarily place it in the left line. At the same time, we increment the age of every other line in the cache. (New lines get an age of 0.) Cache miss.

            Set 0             Set 1
            Line    Line      Line    Line
Address     0x800   -         0x400   -
Age         0       ∞         1       ∞

  1. Address 0xc07, divided by 1024, gives 3, mod 2 is 1, so this access goes in set 1. (Note that the starting address for this line is 3072 = 0xc00, which is 0xc07 rounded down to the nearest multiple of 1024.)

  2. In set 1, we have two lines: one with an age of 1, and one with an age of ∞. We pick the older line, and increment the age of all other lines. Cache miss.

            Set 0             Set 1
            Line    Line      Line    Line
Address     0x800   -         0x400   0xc00
Age         1       ∞         2       0

  1. The next address is 0x401. Again, divide by 1024 (gives 1), mod 2, gives 1. So this access goes in set 1. Starting address is 0x400.

  2. But in set 1, we find that a line with that starting address already exists, so we have a cache hit! All we do is increment the age of all other lines.

            Set 0             Set 1
            Line    Line      Line    Line
Address     0x800   -         0x400   0xc00
Age         2       ∞         2       1

You can continue the example through all the addresses if you like; all of them should be cache hits. As we’ll see below, cache misses and cache hits have different latencies, so we could go through this process, adding up the amount of time taken by each access, to get the total amount of time spent accessing memory.

Cache misses

A “cache miss” occurs when we look for an address which is not in any of the lines in the cache. Why do cache misses occur? For a number of different reasons:

  1. Compulsory (or “cold”) misses: the very first access to an address must be a miss, because its line has never been loaded before.

  2. Capacity misses: the cache is too small to hold all of the memory the program is actively using, so lines get evicted and later re-loaded.

  3. Conflict misses: too many active addresses map to the same set, so they keep evicting each other even though other sets have room.

Write strategies

It’s not just memory reads that access the cache; memory writes also modify the cache. Two strategies are available for dealing with writes:

  1. Write-through: every write updates both the cache and main memory immediately. Simple, but every write pays the full memory latency.

  2. Write-back: a write updates only the cache line, which is marked as modified (“dirty”); the line is written back to main memory when it is eventually evicted. Faster, but the cache and memory can temporarily disagree.

Cache Examples

Caches are organized as a number of sets, each of which contains a number of lines. When a request for a memory access is received, a cache set is selected by taking the address of the request, dividing by the line size, and then taking the result modulo the number of sets. Example: with a line size of 1KB and 4 sets, address 0x1234 (4660) selects set (4660 / 1024) mod 4 = 4 mod 4 = 0.

Each set contains some number of lines. While the set is selected based purely on the address, the line within that set can be selected based on other, more intelligent criteria (typically involving things like the “age” of the line, how long it has been since a memory access within that line was received).

Because the lines within a set are not organized in any particular order, this implies that each line must store information about the range of addresses that it contains. Because each line has a fixed size, it’s usually sufficient to just store the starting address of the range saved by that line.

If a memory request is found within some set and line, then that is a cache hit. We update whatever aging information is saved within the cache line(s) and then return the result of the request to the CPU. If a memory request is not found, then we have a cache miss; we have to request a line’s worth of data from main memory (or, in a multi-level cache, the next lower cache in the system), and then we have to pick a line to put it in (remember that its set is determined by the address of the request). If there are no unused lines available in the set, then an in-use line must be selected to be replaced. How the cache chooses which line to replace is known as its cache replacement policy.

One of the easiest effective replacement policies is least-recently used: the line is selected which has not been used (had a cache hit within it) in the longest time. In order to implement this, the cache stores with each line an age. Whenever a memory request is received which is a hit, all of the other lines in that set have their ages incremented. When a line needs to be selected for replacement, the line with the largest age value is picked. (Replacement also zeros the age, as the new line coming in to the cache was “just born”.)
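
Here is a minimal sketch of this aging scheme for a single set, in C. The struct layout, NUM_LINES, and the use of ULONG_MAX to stand for the “age of infinity” of an empty line are illustrative assumptions, not a required interface:

#include <limits.h>

#define NUM_LINES 4                /* lines per set (assumed) */
#define LINE_SIZE 1024             /* bytes per line (assumed) */

struct line {
  unsigned long start;             /* starting address of the cached region */
  unsigned long age;               /* ULONG_MAX marks an empty line */
};

/* Returns 1 on a hit, 0 on a miss, updating the ages either way. */
int access_set(struct line set[NUM_LINES], unsigned long addr)
{
  unsigned long start = addr - (addr % LINE_SIZE);
  int hit = -1, victim = 0;

  for (int i = 0; i < NUM_LINES; ++i) {
    if (set[i].age != ULONG_MAX && set[i].start == start)
      hit = i;                     /* found the line: cache hit */
    if (set[i].age > set[victim].age)
      victim = i;                  /* oldest (or still-empty) line so far */
  }

  /* Age every other line in the set, as described above. */
  for (int i = 0; i < NUM_LINES; ++i)
    if (i != (hit >= 0 ? hit : victim) && set[i].age != ULONG_MAX)
      ++set[i].age;

  if (hit >= 0)
    return 1;

  /* Miss: replace the oldest line and zero its age ("just born"). */
  set[victim].start = start;
  set[victim].age = 0;
  return 0;
}

In a fully associative cache (as in the assignment), there is only one set, so every access goes through this routine.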

Example: Two-level cache

Here we are going to simulate a system with a two-level caches, a smaller L1 cache (closest to the CPU) and then a larger L2 cache.

            L1 Cache   L2 Cache
Line size   256        1024
Sets        1          4
Lines/set   4          1

Here the L1 cache is fully associative (one set), while the L2 cache is direct-mapped (one line per set).

Furthermore, we’re going to calculate the total time needed to perform these memory accesses, by giving the latencies of each cache and main memory:

          L1 Cache   L2 Cache   Main memory
Latency   1ns        10ns       1000ns

If we have a cache hit, then we add the cache’s latency to the cumulative time. If we have a cache miss, then we add the total time necessary to access the next lower cache. (E.g., if the L1 cache misses, then we pay at least 10ns to access the L2 cache, and if the L2 cache also misses, another 1000ns on top of that to access main memory!)

We will simulate the following series of memory accesses:

Address   (in decimal)   Latency
0x01      1
0x100     256
0x807     2055
0x02      2
0x110     272
0x907     2311
0x03      3
0x120     288
0x1007    4103
0x04      4

Estimating cache performance

Instead of calculating exactly how much time a given sequence of memory accesses will take, we can approximate the average memory-access latency, if we know the probability that each cache in the system will hit.

We will use the following equation:

$$\text{Avg. memory access time} = \text{Fraction of cache hits} \times \text{Cache hit latency} + (1 - \text{Fraction of cache hits}) \times \text{Cache miss latency}$$

For example, suppose that 90% of our memory accesses are cache hits, and cache hits have a latency of 1ms, while cache misses have a latency of 100ms. Then our average memory access time is

$$\text{Avg. memory access time} = 0.9 \times 1 + 0.1 \times 100 = 10.9 \text{ms}$$

Note that this is just the average memory access time, not the exact memory access time, for which we would need to know the exact sequence of memory accesses.

Multi-level caching

As mentioned above, it’s common for computer systems to have more than one cache. Each “level” of cache is larger, but slower. For example, we might have

Level         Latency   Hit percentage
L1            1ms       80%
L2            10ms      90%
Main memory   100ms     100% (obviously)

(Note that hit percentages are given from the viewpoint of that level, not overall. Thus, 90% of the accesses that reach the L2 cache are hits.)

If we know the cache hit percentages for both cache levels, we can estimate the average memory access time for this multi-level cache. Using the numbers above, we find

$$\text{Avg. memory access time} = 0.8 \times 1 + 0.2 \times (0.9 \times 10 + 0.1 \times 100) = 4.6\text{ms}$$

But note that this is only an average over many memory accesses. If we knew the exact pattern of memory accesses, we could simulate the state of the caches (which sets/lines were loaded) and determine the exact amount of time. The first assignment asks you to do just this.
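
The same estimate can be computed for any number of levels; a minimal sketch in C, where hit_rate[] and latency[] are assumed inputs given per level (ordered from L1 outward, with the last level always hitting):

double avg_access_time(const double hit_rate[], const double latency[], int levels)
{
  double time  = 0.0;
  double reach = 1.0;   /* fraction of accesses that get this far down the hierarchy */
  for (int i = 0; i < levels; ++i) {
    time  += reach * hit_rate[i] * latency[i];
    reach *= 1.0 - hit_rate[i];
  }
  return time;
}

With hit_rate = {0.8, 0.9, 1.0} and latency = {1, 10, 100}, this returns 4.6, matching the calculation above.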

Program execution performance

The instruction cache

While the normal L1, L2, … caches are used for data, a separate cache is maintained for program code, called the instruction cache (or I-cache for short). There are two main reasons why a separate cache is used for code instead of just using the normal data cache(s):

  1. The CPU fetches instructions constantly, so with a single shared cache, instruction fetches and data accesses would compete with each other.

  2. Code has its own access pattern, with extremely high locality (most of the time is spent in small, hot loops), which a dedicated cache can exploit.

A common rule of thumb is that “a program spends 90% of its time executing the same 10% of the program”; if true, optimizing for this common case by caching that 10% can produce dramatic performance improvements.

Some CPUs go even further and cache the “decoded” instructions (i.e., the internal microcode that the instructions we write are translated into), so that instructions in the I-cache take no time to load or to decode.

Instruction performance

Over a reasonably-sized program, we can estimate the runtime using a simple performance equation:

$$\text{Clock cycles} = \text{Total instructions} \times \text{Avg. cycles / instruction}$$

The term “average clock cycles per instruction” is often shortened to “CPI”. Of course, different instructions take different numbers of cycles, but over all the instructions (or perhaps just those used in our program), we can consider the average number of cycles per instruction. The simplest way to make a program run faster is to run it on a processor which can execute the instructions in fewer cycles.

The above equation measures cycles, not actual time. Time depends on the clock rate (cycles per second) of the processor:

$$\text{CPU time} = \frac{\text{Total instructions} \times \text{CPI}}{\text{Clock rate}} $$
Thus, the other way to speed up a program is to increase the clock rate.
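
As a quick worked example (the numbers are purely illustrative): a program of one billion instructions with a CPI of 2, running on a 2GHz processor (2 billion cycles per second), takes

$$\text{CPU time} = \frac{10^9 \times 2}{2 \times 10^9} = 1\ \text{second}$$

Halving the CPI or doubling the clock rate would each cut this to 0.5 seconds.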

You might expect that the smallest CPI (cycles per instruction) would be 1, but as we’ll see, it is actually possible to have a CPI < 1.

Amdahl’s law

Amdahl’s law is an equation which describes how much of a performance improvement we can expect by improving part of a system:

$$\text{Improved time} = \frac{\text{Time affected by improvement}}{\text{Amount of improvement}} + \text{Time unaffected}$$

For example, suppose we have a program which spends 70 seconds doing multiplication, and 30 seconds doing other things. Currently, this program runs in 100 seconds. If we double the speed of multiplication, how much faster will it run?

$$\text{Improved time} = \frac{70}{2} + 30 = 65 \text{s}$$

The program would run 100/65 ≈ 1.5 times faster!

Suppose we want the program to run twice as fast: that is, our target improved time is 50 (100/50 = 2x). How much faster would multiplication need to run in this case?

$$50 = \frac{70}{n} + 30$$ $$20 = \frac{70}{n}$$ $$n = \frac{70}{20} = 3.5$$

Suppose we want the program to run five times as fast: our target improved time is 20 (100/20 = 5x).

$$20 = \frac{70}{n} + 30$$ $$-10 = \frac{70}{n}$$ $$n = -7$$

It’s not actually possible to make the program run 5x faster by only improving multiplication; at some point, the “other things” will start to dominate the program’s runtime, and improving multiplication will give us diminishing returns.
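
A quick way to see the limit, using the same numbers: even if multiplication were made infinitely fast, the 30 seconds of “other things” remain, so

$$\text{Best possible time} = \lim_{n \to \infty}\left(\frac{70}{n} + 30\right) = 30\ \text{s}, \qquad \text{Max speedup} = \frac{100}{30} \approx 3.3\times$$

This is the essential content of Amdahl’s law: the un-improved portion of the program puts a hard ceiling on the overall speedup.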