In this assignment you’ll build a “simple” simulation of a multi-level memory:
a CPU with one register, three levels of cache, and a main memory. You’ll run
your simulation on a small program of load
instructions, moving values from
main memory into the register, and determine the total number of cycles needed
to execute the program.
You can complete this assignment in C++, C, or any other programming language you are familiar with (Java, Python, R, etc.).
Memory structure
The system you will be simulating is intentionally simplified. Time measurements are in CPU cycles rather than nanoseconds. It offers
A single CPU register
A 1KB level-1 cache fully associative cache, consisting of 1 set, 4 lines, of 256 bytes per line. Accessing the L1 cache (i.e., moving data from L1 into the CPU) takes 1 cycle.
An 64KB level-2 cache, fully associative, consisting of 1 set containing 64 lines of 1k bytes. Accessing the L2 cache (moving data from L2 into the CPU and L1) takes 10 cycles. Note that this can be done in parallel with loading the data into the L1 cache, so there isn’t a separate cycle cost for that.
A 1MB level-3 cache, consisting of 1 set containing 256 lines of 4k bytes each. Accessing the L3 cache takes 100 cycles. This can be done in parallel with loading the data into L2 and L1.
All caches use LRU as the cache replacement policy, and all are fully associative (1 set).
A 1GB main memory. Accessing main memory takes 1000 cycles.
Here’s a summary:
Line size | Sets | Lines/set | Latency | |
---|---|---|---|---|
L1 | 256 | 1 | 4 | 1 |
L2 | 1024 | 1 | 64 | 10 |
L3 | 4096 | 1 | 256 | 100 |
Memory | n/a | 1000 |
CPU Instruction
Our simulated CPU only supports a single instruction: load
address
which
loads
the value stored in the specified memory address into the CPU. If the
memory address requested is available in the L1 cache, then it will be loaded
from there, otherwise, a L1 cache miss occurs, and it will search the L2 cache,
and so forth, until either the address is found in one of the caches, or
we have to retrieve it from memory. Thus, the fastest memory access (L1 hit)
takes only 1 CPU cycle, while the slowest (misses on all caches) takes
1 + 10 + 100 + 1000 = 1111 CPU cycles.
Cache lines and loading
When a cache miss occurs, a cache will attempt to pull in an entire line of addresses. For L1, this means 256 addresses. As mentioned in class, the requested address is “rounded off” to the nearest lower multiple of 256 (line size) and then the next 256 addresses starting from that address are loaded into the “oldest” line in the cache (i.e., we are using LRU replacement policy). Newly-loaded lines have their age set to 0.
Each cache line needs to store its starting address (always a multiple of the line size), and its “age”. Somewhat unrealistically, we store a line’s age as an integer (normally, a small number of bits would be dedicated to age). When a line is accessed, the age of all other lines is incremented. The oldest line is the one with the greatest age. Of course, when a line is replaced, its age is reset to 0. (Note that cache lines and memory do not need to store their actual contents!) When a cache is accessed, whether it is a hit or a miss, all other lines have their ages incremented. (If it is a hit, then the hit line has its age left the same; if a miss, the newly-loaded line is set to 0.)
Simulated program
Here is the program which you should run your simulation on; addresses are given in hexadecimal.
0x1000
0x1024
0x1599
0x100
0x10245
0x10246
0x1027
0x10247
0x1600
0x1601
0x1700
If you have simulated the caches perfectly, you should get a total of 3,471 cycles for this sequence of addresses.
Submission
Submit your program by uploading it onto the server into the directory
~/cs241/assign1/
.