Within our process, our address space starts at address 0x400000 and extends up to the process memory limit (at least 2GB). But our computer does not physically have enough memory for every process to be given its own ~2GB, and obviously processes cannot overlap in the memory they use, so there is a translation between the logical view of memory, as seen by our processes, and the physical layout of memory. This translation is handled partly by the processor and partly by the operating system.
Logical address: An address within our process’s address space. Sometimes logical addresses are called virtual addresses.
Physical address: An address as seen by the memory unit of the computer, ranging from 0 up to the amount of RAM available.
Goals of memory management:
Allow each process to have its own uniform address space
Protection of processes’ memory from each other: one process should not be able to access another process’s memory space.
Flexibility of address space allocation, allowing the operating system to make optimal use of the system’s limited memory.
Enable extensions such as shared memory, virtual memory, etc.
The mapping from logical to physical addresses is accomplished by a CPU component called a memory management unit or MMU. Since nearly every memory access will require translation, it is important that the MMU be tightly integrated into the CPU, and that translation be as fast as possible. On the other hand, the MMU must be programmable, as the operating system needs to have control over how memory is allocated to processes.
Memory management schemes
Flat memory: one process can be running at a time
Relocation register: multiple processes, no protection or sharing
Relocation with limit: multiple processes, better protection
Partitioning
Paging
Multilevel paging
Paging with segmentation
Flat memory scheme
The simplest memory model is the flat scheme: virtual addresses are mapped directly to physical addresses with no translation. This often means that only a single process can be running at one time. If multiple processes are running, then they each get a separate “slice” of physical memory (e.g., one process’s address space starts at 0x10000, another at 0x110000, etc.) and there is no protection stopping one process from accessing another process’s memory. This method was used by early home computers: DOS, early versions of Mac OS and Windows, etc. These days it’s only suitable for single-purpose systems: embedded systems and microcontrollers where there is no operating system, just a single process.
Note that if multiple processes are allowed to run, they must be compiled so that each will be loaded into a separate region of memory (or they must be compiled as “position independent” and the loader must take care of placing them in different memory regions).
Relocation register
The simplest memory model which supports some flexibility relies on a special CPU register called a relocation register. The value in this register is added to every memory address, automatically, to translate logical to physical. Every process has a different relocation base, hence while every process sees its own address space starting at 0x0, these are mapped to different parts of physical memory.
For example: a process might have a relocation base of 0x10000. A memory access for logical address 0x123 by this process will be mapped to physical address 0x10123 by the CPU. Note that the process never sees real physical addresses. Ideally, a non-privileged process cannot even examine the value in the relocation register, so it has no way of knowing where in physical memory it is mapped.
This scheme allows each process to have an identical logical address space, but allows the operating system to map them to different regions of physical memory. However, it requires a process’s entire logical address space to be mapped to a contiguous region of physical memory. If a process has a 2GB logical address space, then a contiguous chunk of 2GB physical memory must be found somewhere for that process; it cannot be “split up”.
Furthermore, unless other systems are added, nothing prevents a process from accessing addresses outside its own address space, which, by necessity, will be mapped to physical addresses inside some other process’s logical space. Hence, there is no real memory protection.
Despite the simplicity of this model, it emphasizes the key element of memory management: translation between logical addresses and physical addresses.
When an Intel CPU is running in 16-bit mode, it uses four relocation registers, called segment registers. The sections of the executable (code, data, stack, heap) are mapped to the four segments, and each segment can have a different relocation base, and thus be located in a different region of memory. This is somewhat more flexible than a single relocation register, as it allows the process’s address space to be broken up and not translated to a single contiguous block of physical memory.
This can be extended to a segment table, in which each address is broken into two parts: segment and offset. The segment is looked up in a segment table, which defines the physical base address of that segment, as well as its size. (The primary difference between segmentation and paging, discussed later, is that there are typically very few segments, and their size can vary.)
Relocation with limit
An easy extension to the relocation register scheme is to add another per-process register containing a limit, which specifies the size of a process’s memory space. The translation from logical to physical address now involves two steps:
Compare address to limit register. If ≥ limit, STOP, raise a memory addressing error.
Add relocation register to address. The result is the physical address.
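These two steps can be sketched in Python (the register values below are illustrative, not taken from any particular machine):

```python
def translate(logical_addr, reloc_base, limit):
    """Relocation-with-limit translation: bounds check, then relocate."""
    if logical_addr >= limit:           # step 1: compare against the limit register
        raise MemoryError(f"address {logical_addr:#x} is outside the address space")
    return reloc_base + logical_addr    # step 2: add the relocation register

# A process with relocation base 0x10000 and a 64KB address space:
print(hex(translate(0x123, 0x10000, 0x10000)))  # → 0x10123
```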
This simple extension allows us to prevent a process from accessing addresses outside its address space, and hence, protects processes from each other. This also protects the operating system from unwanted memory accesses by the processes.
Partitioning
Partitioning is essentially the heap-allocation strategy we discussed under the name of a free list, applied to allocating physical memory to processes. Within a process, the limited resource of the heap must be allocated to fulfill requests for additional memory resources (i.e., malloc or new), and also to allow memory to be released when it is no longer needed (free or delete). Similarly, the operating system has to allocate the limited resource of physical memory to processes, as processes start (request memory) and end (release memory).
In partitioning, the operating system maintains a list of partitions (regions of physical memory in use by processes) and holes (unused regions).
When a process starts, we search the list of holes for a chunk of available space of sufficient size, and then add it to the list of partitions. The starting address of this hole becomes the process’s relocation base.
As with heap allocation, we have several choices of algorithms when searching for a hole: we can pick the first hole that fits, or the best fit (least leftover space), or the worst fit (most leftover space). Simulations have shown that first-fit and best-fit are better than worst-fit in terms of memory utilization, and first-fit is faster, so it is typically used.
When a process ends, we add its partition to the list of holes, coalescing any adjacent holes so that every hole is always as large as possible.
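The bookkeeping just described can be sketched as a minimal first-fit allocator with coalescing on release (the hole list and addresses below are invented for illustration):

```python
def first_fit(holes, size):
    """Pick the first hole big enough; split off the remainder.
    `holes` is a list of (start, length) pairs sorted by start."""
    for i, (start, length) in enumerate(holes):
        if length >= size:
            if length == size:
                del holes[i]                          # hole consumed entirely
            else:
                holes[i] = (start + size, length - size)
            return start    # becomes the process's relocation base
    return None             # no hole fits: fail (or compact first)

def release(holes, start, size):
    """Return a partition to the hole list, coalescing adjacent holes."""
    holes.append((start, size))
    holes.sort()
    merged = [holes[0]]
    for s, l in holes[1:]:
        ps, pl = merged[-1]
        if ps + pl == s:            # adjacent: coalesce into one hole
            merged[-1] = (ps, pl + l)
        else:
            merged.append((s, l))
    holes[:] = merged

holes = [(0x10000, 0x8000)]
a = first_fit(holes, 0x2000)   # a = 0x10000
b = first_fit(holes, 0x3000)   # b = 0x12000
release(holes, a, 0x2000)
release(holes, b, 0x3000)      # coalesces back into one large hole
```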
As with the heap, external fragmentation (many small holes but no large ones) is a problem for the partitioning scheme as well. Simulations have shown that after N bytes have been allocated using first-fit, on average another 0.5N bytes will be “lost” due to fragmentation. One third of memory may be effectively unusable, due to being in too-small blocks to be useful to any process.
The operating system has one additional tool available which the heap allocator did not: it can move processes within the physical address space, by copying their (physical) memory contents to a new location and then updating their relocation register values. The physical copy is obviously slow, but if the memory space is becoming fragmented (many small holes between partitions), it may be worthwhile to compact the partitions, moving processes around in memory so that the free space is all together in a single large hole. Compaction is expensive, and requires some kind of memory translation to be workable.
Partitioning also allows processes to change the size of their address space during runtime: a process can request more or less address space. If it requests more, then the partition assigned to it can be expanded (if there is a hole after it), or it can be relocated. Shrinking a partition may create a hole, which may require coalescing with adjacent holes.
Paging
Paging tries to avoid the fragmentation problem by allowing each process’s address space to be mapped to a non-contiguous region of physical memory. E.g., logical address 0x100 might be mapped to physical 0x400, but logical address 0x101 might be mapped to physical 0x2301, a completely different part of physical memory.
Physical memory is divided into fixed-size regions called frames. Logical memory, for each process, is similarly divided into pages, where the page size is equal to the frame size, and the page size is fixed and the same for every process. (Often the page size is fixed by the CPU architecture and unchangeable.) Each logical address is broken into two parts: page and offset (where the offset is in the range of 0 to page size - 1). Instead of a relocation register, each process has its own page table which specifies, for each of its pages, which physical frame it is mapped to. The process of mapping logical to physical is thus:
Extract the page portion of the address.
Lookup the page in the page table to get the frame.
Replace the page portion of the address with the frame. The offset is not changed. The result is the physical address.
Note that this translation, like any other, must be done by the MMU on the CPU, and must be invisible to the process itself. The operating system sets up the page table for a process when it schedules the process to run. Because all pages are the same size, allocation of pages to frames is much simpler than in partitioning: the operating system only needs to maintain, for each frame, whether it is free or in-use, and then when a request for a frame is received, it can be satisfied by the first frame which is free.
Page sizes are usually a power of 2, so that the portion of the address which corresponds to the page/frame is simply some number of high bits. For example, suppose we have a 1KB address space (10-bit addresses), and the page size is 256. This implies that each process has 4 pages of 256 bytes each, so the page table has 4 entries. The high 2 bits of each address specify the page/frame, while the low 8 bits give the offset:
| Logical memory | |
|---|---|
| 0-255 | page 0 |
| 256-511 | page 1 |
| 512-767 | page 2 |
| 768-1023 | page 3 |

| Page table | |
|---|---|
| 0 | 1 |
| 1 | 4 |
| 2 | 3 |
| 3 | 7 |

| Frame number | Physical memory |
|---|---|
| 0 | |
| 1 | page 0 |
| 2 | |
| 3 | page 2 |
| 4 | page 1 |
| 5 | |
| 6 | |
| 7 | page 3 |
This also demonstrates another feature of paging: the physical address space can be larger than the logical address space. Logical addresses are 10 bit, but physical addresses must be at least 11 bits, to be able to hold frames 0 through 7. Each process can have a different address space size, in terms of the numbers of pages allocated to it, and processes have no access to other processes’ pages (frames). Thus, each process “sees” the system as if it was the only process running, with complete freedom to use its (logically contiguous) memory space as desired.
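The lookup can be sketched in Python, using the page table from this example:

```python
PAGE_BITS = 8                            # 256-byte pages: low 8 bits are the offset
page_table = {0: 1, 1: 4, 2: 3, 3: 7}    # the example's page -> frame mapping

def translate(logical):
    page = logical >> PAGE_BITS                # extract the page portion
    offset = logical & ((1 << PAGE_BITS) - 1)  # the offset is unchanged
    frame = page_table[page]                   # page table lookup
    return (frame << PAGE_BITS) | offset       # splice the frame onto the offset

print(translate(300))   # offset 44 in page 1 → frame 4 → 1068
```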
Paging eliminates external fragmentation: any free frame can be given to a process which needs it. The only form of fragmentation which can occur is internal fragmentation: if a process requests an amount of memory which is not a multiple of the page size, then a portion of some page will go unused. Thus, the choice of the page size influences how much (internal) memory will be unused. For optimal memory utilization, small page sizes are best, but small page sizes result in very large page tables.
The operating system, of course, must maintain information about the complete set of frames: which are free, and which are allocated to processes (and what process/page they are allocated to). This is maintained in an operating system structure called a frame table. The operating system may have to manually perform address translation when dealing with addresses from processes. For example, when we issue a write syscall, we pass it the address of a string. Because this address is within our process’s logical address space, and not within the kernel’s address space, the kernel must manually translate this address into its physical address, by examining our process’s page table.
If the number of pages is small enough, the page table may be stored on the CPU for fast translation of addresses. Usually, though, the entire page table is not stored on the CPU; rather, it is stored in a separate (operating system) portion of memory, and a page table base register is set to point to it. Thus, to switch between processes, the operating system simply updates the value of this register. The disadvantage to this method is that it dramatically increases the amount of time a memory access takes, as a single logical memory access now requires two physical accesses: one to look up the entry in the page table, and then the “real” access after translation.
The solution to the 2x increase in memory access time is a specialized cache specifically for page-table lookups called a translation lookaside buffer (TLB). The TLB stores a small number of page table entries. When a memory request is received, its page is first checked in the TLB; if found, the TLB entry is used instead of going to the full page table. If not found, only then do we access the full page table and we add the entry to the TLB so that future accesses to that page will be faster. Because the TLB, like any other cache, resides on the CPU, accesses to it are much faster than accesses to main memory. Because the TLB is process-specific, it must be flushed when we switch from one process to another, which makes the first few memory accesses after a process-switch slower, as they have to reload the TLB.
A TLB flush is expensive enough that some TLBs attach a tag to each entry, indicating which process owns that entry. Thus, the TLB does not need to be flushed when processes are switched, as a process will only see TLB entries which match its tag.
As with a traditional cache, we can ask how the TLB affects memory access time by looking at its hit ratio and hit/miss access times. Suppose a TLB has an 80% hit ratio (80% of memory accesses are found within the TLB) and the memory access time for a TLB hit is 20ns, while for a main memory access it is 100ns. Then we have an effective access time of
0.8 * (20 + 100) + 0.2 * (20 + 100 + 100) = 140 ns
A “hit” access takes 20 ns to search the TLB and then 100 ns to do the “real” memory access. A miss takes 20 ns to search the TLB and fail, 100 ns to access the page table, and then 100 ns to do the “real” access.
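The same computation, written out:

```python
hit_ratio = 0.80
tlb_ns, mem_ns = 20, 100

hit_time = tlb_ns + mem_ns             # search the TLB, then the real access
miss_time = tlb_ns + mem_ns + mem_ns   # search TLB, read page table, real access
eat = hit_ratio * hit_time + (1 - hit_ratio) * miss_time
print(eat)  # → 140.0
```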
The differences between paging and segmentation are:
The number of pages is typically large, in the thousands, while the number of segments is typically small, 4-8.
Every page has the same size, while segments are intended to have different sizes, depending on their uses. Because of this, allocation of segments to physical memory is more complex, closer to partitioning.
Pages are a low-level translation from logical to physical memory, while segments are intended to reflect the semantic organization of a process’s memory space: there is a segment for the heap, a segment for executable code, a segment for the stack, etc.
Memory protection
Attached to each page table entry can be a set of protection bits, describing what kinds of memory accesses are allowed within that page: read/write or read-only. Some systems allow pages to be marked as no-execute, meaning that the CPU will generate a fault if the instruction pointer ever points into such a page. Similarly, writing to a read-only page will generate a fault, which the operating system handles. Pages can also be marked as valid/invalid, with invalid pages “not existing”; this effectively allows for “holes” in a process’s logical address space.
If we think of a process as “owning” its logical memory then it does not make much sense for the process to be unable to write to its own pages, but read-only pages allow for a great deal of flexibility in memory management by allowing page sharing. A single physical frame can be mapped into more than one process’s address space. This allows for libraries to be shared between processes, or for processes to communicate via shared memory.
In order for a library to be shared between processes, we need to divide the library’s address space into the portion which never changes (code, read-only data), and the portion which might change (global variables). The unchanging portion can be shared between processes, but each process must get its own unique copy of the changeable portions. Again, the intent is for each process to view itself as the only process executing on the system; if one process could affect other processes via a library, that would break this abstraction.
The protection bits for each page can be used to enforce these restrictions, so that a library cannot modify its code or read-only data.
Shared paging can also be used to implement copy-on-write pages. A copy-on-write page is initially shared between multiple processes. Initially, the page is marked as read-only. When a process tries to write to the page, the operating system intercepts the write, makes a copy of the page which is now owned by the process, rather than shared, and then allows the write to go through. Thus, pages can be initially shared, reducing memory usage, until they are written to, and of course, all of this is transparent to user processes.
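A toy model of copy-on-write may help; the classes and reference-counting scheme below are invented for illustration (a real kernel tracks sharing via the frame table and page protection bits, with the copy triggered by a write fault):

```python
class COWPage:
    """A page body shared by several address spaces until someone writes it."""
    def __init__(self, data):
        self.data = data
        self.refs = 1          # number of address spaces mapping this page

class AddressSpace:
    def __init__(self, pages):
        self.pages = pages

    def fork(self):
        """Share every page instead of copying it up front."""
        for p in self.pages:
            p.refs += 1
        return AddressSpace(list(self.pages))

    def write(self, idx, data):
        page = self.pages[idx]
        if page.refs > 1:                # shared: copy before the write goes through
            page.refs -= 1
            page = COWPage(page.data)    # private copy owned by this space
            self.pages[idx] = page
        page.data = data

parent = AddressSpace([COWPage(b"hello")])
child = parent.fork()
child.write(0, b"world")        # child transparently gets a private copy
print(parent.pages[0].data)     # → b'hello' (parent is unaffected)
```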
Multilevel paging
Given the large physical address spaces of modern machines, a single page table with any reasonable page size might have millions of entries, which is too large to be manageable. Instead, modern systems use multilevel or hierarchical paging, using a set of page tables instead of just one. We’ll consider a two-level scheme, but the ideas generalize to more levels. Multilevel paging works by paging the page table.
Suppose we have 16-bit addresses, with a page size of 4KB (i.e., 12 bits for page offset). The remaining 4 bits are divided in two, 2 bits for page table 1, and two bits for page table 2:
| table 1 | table 2 | d (page offset) |
|---|---|---|
| 2 bits | 2 bits | 12 bits |

(The table 1 and table 2 fields together form the page number.)
There are four level-1 entries in table 1. Each entry is a pointer to a level-2 table, and each level 2 table contains 4 entries:
(The example’s level-1 and level-2 page tables did not survive formatting. The entries used below are: table 1, entry 2, which points to the level-2 table at address 0x300; and that table’s entry 1, which contains frame 0x1.)
(Note that frame addresses in the level-2 page tables are all multiples of the page size: 4KB = 0x1000.)
To perform a memory translation on an address, e.g., 1001101101001110b = 39758, we split the address into its table 1 index, table 2 index, and offset:

| Table 1 | Table 2 | Offset |
|---|---|---|
| 10 | 01 | 101101001110 |

In table 1, we look in entry 10b = 2. Entry 2 refers us to the level-2 page table at 0x300. In this table we look at entry 01b = 1. Entry 1 contains 0x1 = 0001b. We replace both page table portions of the original address with 0001b, yielding the physical address 0001101101001110b = 6990.
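This two-level lookup can be sketched as follows; the dictionary stands in for physical memory, and only the entries the example actually touches are filled in (the level-1 table’s address of 0x0 is an assumption):

```python
# Level-1 table assumed at address 0x0; its entry 2 points to the
# level-2 table at 0x300, whose entry 1 holds frame 0x1.
memory = {0x0 + 2: 0x300,
          0x300 + 1: 0x1}

def translate(addr):
    t1 = (addr >> 14) & 0x3      # top 2 bits: table 1 index
    t2 = (addr >> 12) & 0x3      # next 2 bits: table 2 index
    offset = addr & 0xFFF        # low 12 bits: page offset
    table2_base = memory[0x0 + t1]      # level-1 lookup
    frame = memory[table2_base + t2]    # level-2 lookup
    return (frame << 12) | offset       # frame replaces both table indices

print(translate(0b1001101101001110))  # → 6990
```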
Note that each table 1 entry effectively manages a “huge page” of size 16KB. If we wanted to, we could allow level 1 entries to map, not just to level 2 tables, but directly to physical frame addresses of this “huge” size. This would give us a paging system with two page sizes: normal pages of size 4KB, and “huge” pages of size 16KB. The x86-64 memory architecture supports this at multiple levels of its multilevel paging scheme, giving it a range of page sizes that can be supported and mixed.
Although this extra lookup might seem to make memory accesses even slower, all levels of the page table are cached in the TLB, so memory access is not much worse. A TLB is essential for making performance acceptable, however.
Inverted page table
Instead of storing a page table per-process, we can “invert” the system and store a table of frames, recording for each frame what process and page it is mapped from. When a memory access is received, the MMU must search this “frame table” for the entry matching the requesting process and page, and then perform the translation. Searching the frame table linearly would be much too slow, so a hash table, keyed by process ID and page number, is used to speed searches.
The advantage to an inverted page table is that there is only one page table for the entire system, and the number of entries in it is fixed by the amount of physical memory. With a normal page table, the total size of all page tables is roughly proportional to the number of processes running, not to the total amount of physical memory.
| Frame # | Process ID | Page # |
|---|---|---|
| 0 | 12 | 3 |
| 1 | 147 | 5 |
| 2 | 6 | 2 |
| 3 | 135 | 0 |
| … | … | … |
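A hash-based inverted lookup, mirroring the table above (the 4KB page size is an assumption for the sketch):

```python
# (pid, page) -> frame, as in the frame table above
frame_of = {(12, 3): 0, (147, 5): 1, (6, 2): 2, (135, 0): 3}

PAGE_BITS = 12    # assume 4KB pages

def translate(pid, logical):
    page = logical >> PAGE_BITS
    offset = logical & ((1 << PAGE_BITS) - 1)
    frame = frame_of[(pid, page)]     # hash lookup replaces a linear scan
    return (frame << PAGE_BITS) | offset

print(hex(translate(147, 0x5ABC)))  # page 5 of process 147 → frame 1 → 0x1abc
```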
Segmentation with paging
It’s possible, and not uncommon, to combine segmentation with paging. This gives the semantic benefits of segmentation, together with the efficiency and flexibility benefits of paging.
A logical address consists of a segment and a segment offset, where the segment offset is further broken into a page and a page offset. The segment table stores, for each segment, the length of that segment and a page table base address; each segment effectively gets its own page table (though pages can be shared between segments if needed). The translation procedure is
Extract the segment from the address.
Lookup the segment in the segment table; check the segment offset against the segment length. If the offset is out of range, STOP and signal a memory error.
Use the segment table entry’s page table address to find the page table for this segment.
Lookup the page in the page table to find the frame.
Replace both the segment, and the page, in the address with the frame; the page offset portion of the address remains unchanged. This is the physical address.
As an example, consider a 16-bit address space with pages of 256 bytes, and 4 segments. Each segment has a maximum length of 16384 bytes (2^14). Addresses are arranged as
| 2 bits | 6 bits | 8 bits |
|---|---|---|
| Segment | Page | Page offset |

(The segment offset consists of the page and page offset together: the low 14 bits.)
The segment table contains 4 entries:
| Index | Length | Page table addr |
|---|---|---|
| 0 | 1024 | |
| 1 | 300 | |
| 2 | 2000 | |
| 3 | 4000 | |
(Note that the size of each segment need not be a multiple of the page size, although an integer number of pages will be allocated to each segment.)
The page table for segment 1 contains two entries:
| Page index | Frame |
|---|---|
| 0 | 3 (= 00000011b) |
| 1 | 6 (= 00000110b) |
We want to translate the address 0x410c = 0100000100001100b.

Split the address into segment and segment offset: 01 | 00000100001100. Segment = 01b = 1.
Look up segment 1 in the segment table. Check the segment offset (00000100001100b = 268) against the segment length (300). The offset is within the segment, so continue.
Use the segment table’s page table base to find the page table for segment 1.
Split the segment offset into page and page offset: 000001 | 00001100. Page = 1, page offset = 12.
Look up page 1 in the page table to get its frame: 00000110b.
Replace the segment and page in the logical address (i.e., the high 8 bits) with the frame, yielding the physical address 00000110 00001100b = 0x60c.
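The whole procedure, with the example’s segment and page tables plugged in, might look like:

```python
# Segment table: only segment 1 is filled in, as in the example.
segments = {1: {"length": 300, "page_table": {0: 0b00000011, 1: 0b00000110}}}

def translate(addr):
    seg = addr >> 14                   # high 2 bits: segment
    seg_off = addr & 0x3FFF            # low 14 bits: segment offset
    entry = segments[seg]
    if seg_off >= entry["length"]:     # limit check against segment length
        raise MemoryError("segment overrun")
    page = seg_off >> 8                # high 6 bits of the segment offset
    page_off = seg_off & 0xFF          # low 8 bits
    frame = entry["page_table"][page]  # per-segment page table lookup
    return (frame << 8) | page_off     # frame replaces the segment+page bits

print(hex(translate(0x410C)))  # → 0x60c
```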
Virtual memory
Not every page/segment of a process’s address space is used with equal frequency. If pages are used infrequently it is possible to page them out to disk, allowing the frame to be used by another process. When a page is written to disk, its page table entry is marked as invalid. When a memory access for that page is received, we first reload the page from disk, and then try the memory access again; because the page is now valid, the memory access will succeed.
When paging a page out to disk, we want to avoid writing it to disk if it is not needed. If a page has not been modified since it was last loaded in, then we don’t need to copy it to disk. Hence, the page table stores, for each page, a dirty bit, indicating whether or not the page has been modified since last load. This bit is cleared when a page is loaded, and set whenever a write is made to an address within that page.
When we need to page out to disk, if the dirty bit is set, the contents of the page are written to disk before reallocating the frame. If the dirty bit is unset, nothing needs to be written to disk: the frame is available for immediate use.
x86-64 Linux memory management
x86-64 distinguishes three kinds of addresses, at various levels of abstraction:
Logical address: addresses as seen by processes, divided into segment and offset. The segment selector is always 16 bits, while the offset is 32 bits (64 bits in 64-bit mode).
Linear address: a 64-bit unsigned integer which can be used to address the entire address space. The MMU translates logical addresses into linear addresses using the segmentation scheme described above.
Physical address: addresses as the memory chips see them. The MMU translates linear addresses into physical addresses by applying the paging scheme.
Thus, x86-64 uses segmentation with paging, although by configuring the MMU one or both can be disabled. It is possible to use pure paging (1 segment per process), or pure segments (one page per segment), or to map a process’s memory space directly to physical addresses with no translation. Linux prefers to use paging without segmentation when it is available (not every architecture supports paging without segmentation), and hence on x86-64 uses pure paging. Segmentation is considered “legacy” in x86-64, although the support is still there.
x86-64 supports multi-level paging with up to five levels, with a default page size of 4KB, but individual pages can be sized larger at 2MB, and even 1GB (4MB large pages exist only in legacy 32-bit paging). The ability to mix different page sizes brings the paging scheme a bit closer to partitioning.
The address of the top-level page directory is always pointed to by control register 3 (CR3). Paging is enabled if the relevant bits (“paging” and “protection”) of CR0 are set; otherwise the value of CR3 is ignored. The CPU will automatically flush the TLB whenever the value of CR3 is changed (as this corresponds to a change in translation).
Each page table entry consists of a page address (either the physical address of the page, or the address of the next level’s page table) and set of bits:
Page size: if set, the page uses the larger page size (2MB or 1GB in 64-bit mode; 4MB in legacy 32-bit paging), otherwise it is 4KB.
Accessed: set by the processor whenever a page is accessed.
Disable cache: if set, the processor will never load memory from this page into the cache.
Write-through: enables cache write-through if set
User/supervisor: determines the privilege level of this page. OS kernel-owned pages should have the supervisor bit set; only processes running in privileged mode are allowed to access supervisor-tagged pages.
Read-only/Read-write: if set, the page is read-write; if clear, it is read-only.
Present: if set, the page is actually in physical memory. If unset, accessing the page will cause a page fault which an OS can catch and use to (e.g.) load in the page from disk.
Three bits are left for operating system use.
The lowest level of the page table, where the entries store actual physical frame addresses, stores a slightly different set of bits, adding:
Dirty: set if the page has been modified.
Multi-level paging (without segmentation)
x86-64 uses multi-level paging, typically with four levels. The top-level pages are quite large, often multiple GB. Top-level pages can be mapped either to lower-level pages, or to an entire contiguous region of memory. Linux disables the segmentation system, preferring to use paging for all memory organization.
A 64-bit address in Linux on x86-64 is divided up as
| 9 bits | 9 bits | 9 bits | 9 bits | 12 bits |
|---|---|---|---|---|
| Page global directory | Page upper dir. | Page middle dir. | Page table | Offset |
(Originally Linux used a three-level scheme, omitting the “page upper directory” portion. The four-level scheme was merged into the kernel in 2007. In 2017 an optional five level paging scheme was merged in, supporting address spaces of up to 128 petabytes.)
Note that only 48 bits of the address are used; the remaining 16 bits are currently unused and must be a sign-extension of bit 47. After translation, the high 36 bits of the linear address are replaced with a 40-bit frame number, yielding physical addresses up to 52 bits wide.
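A quick way to see the 9/9/9/12 split is to extract each index by shifting and masking (the field names follow the Linux terms above; the sample address is arbitrary):

```python
def split_linear(vaddr):
    """Split a 48-bit linear address into its four 9-bit indices and offset."""
    offset = vaddr & 0xFFF          # low 12 bits: offset within the page
    pt  = (vaddr >> 12) & 0x1FF     # page table index
    pmd = (vaddr >> 21) & 0x1FF     # page middle directory index
    pud = (vaddr >> 30) & 0x1FF     # page upper directory index
    pgd = (vaddr >> 39) & 0x1FF     # page global directory index
    return pgd, pud, pmd, pt, offset

pgd, pud, pmd, pt, off = split_linear(0x7FFF_DEAD_BEEF)
# Recombining the fields always recovers the original 48-bit address:
assert (pgd << 39) | (pud << 30) | (pmd << 21) | (pt << 12) | off == 0x7FFF_DEAD_BEEF
```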
Each page index is an index into a table which contains the address of the next lower table to search for the following index. Any table entry can also be null, indicating that that address is not valid. The location of the page global directory for a process is stored within the CR3 register (thus, a process switch only needs to update the value of this register).
Page table entries
Each page table entry is defined as a struct of type pte_t, pmd_t, pud_t, or pgd_t. These structures are defined identically, but are given different names for future extensibility. Aside from the location of the following page table, each entry stores a set of status and protection bits (most of which correspond to the bits described above):
Present: is the page loaded into memory (as opposed to swapped to disk).
protnone: the page is loaded but not accessible for some reason.
Read/write: the page may be written to.
User: page is accessible from user space. (The kernel maps some of its pages into process address spaces for its own use, but the process is not allowed to access these.)
Dirty: has the page been modified since last load.
Accessed: has the page been accessed (read or written) since last load.
The kernel also maintains an inverted page table, mapping frames to processes/pages, for fast inverse translations.
Reserved frames
The kernel maintains a list of reserved frames, frames that should never be allocated to pages. These include frames used by the kernel itself, as well as frames used for things that must exist at fixed addresses (e.g., memory-mapped IO devices). For example, in certain video modes, video RAM is mapped directly to a range of physical addresses; obviously we don’t want any process’s pages allocated into this range, so the frames containing these addresses are marked as reserved.
“Huge” pages
x86-64 defaults to a page size of 4KB, but can use pages of size 2MB or 1GB (4MB pages exist only in legacy 32-bit paging). Rather than forcing all pages to use the larger size, the Linux kernel can allocate some of a process’s pages as “huge pages” using the larger sizes. The difficulty with huge pages is that each huge page must be mapped to a contiguous region of physical memory, and must be aligned to the larger frame size. A 2MB huge page corresponds to 512 ordinary pages, all of which must be contiguous, and must be placed at a multiple of 2MB. If the system has been running for any length of time, there is unlikely to be such a big chunk of unused frames available.
Instead, huge pages are typically pre-allocated at system startup, for specific processes (processes which, because the system is still starting up, have not yet started!). This marks a certain number of physical frames as reserved, and later requests for huge pages will draw from this pool of frames.