The List abstract data type
We’re going to look at a singly linked list. You’ll see what this means.
A linked list tries to solve the problem of linear-time inserts (at any position) by adding a layer of indirection. Because an array arranges all of its elements linearly in memory, inserting one necessarily requires moving the existing things around. If we allowed elements to be stored in different locations, then we could do this without moving existing elements.
A (singly) linked list consists of a collection of nodes. Each node is allocated independently of all the others; in memory, there’s no organization to where nodes are placed. Each node contains a value (whatever it stores) and a pointer to the next node. The list class itself maintains pointers to the first (required) and last (for convenience) nodes of the list. Traditionally, these are called the “head” and “tail” of the list.
Linked lists give up the ability to randomly access any element in constant time. The fastest way to access a list element is if you already have the previous one. Whereas a loop over an array looks like this:
for(int i = 0; i < size; ++i)
// do something with arr[i]
for a linked list, the loop looks like this, following the pointers to the end:
for(node* i = hd; i != nullptr; i = i->next)
// do something with i->value
As we’ll see, if you write an array-style loop on a list, you’re going to have a bad time…
The basic linked list class looks something like this:
template<typename T>
class list {
public:
    struct node {
        T value;
        node* next;
    };
    list() : hd(nullptr), tl(nullptr) {}
    node* head() { return hd; }
    node* tail() { return tl; }
    void insert(node* prior, T value);
    node* at(int i);
    void erase(node* prior);
    void clear();
    void push_back(T value);
    void pop_back();
    void push_front(T value);
    void pop_front();
    int size();
private:
    node* hd;
    node* tl; // Declared separately: "node* hd, tl;" would make tl a node, not a pointer
};
(Draw picture)
(Demonstrate all operations, as we go)
Note that if we want to talk about nodes outside the list class we must use its fully-qualified name: list<T>::node.
The insert method
To insert a new value into a list, we need a pointer to the node before it, because we have to modify that node’s next pointer. Hence, insert takes two parameters: a pointer to a node, and a value.
As a special case, we allow the prior node to be nullptr; if this is the case, we assume we are inserting at the beginning of the list (which has no node before it, hence prior == nullptr).
There are several cases to consider:
If prior == nullptr and the list is empty then we are inserting the first node of the list, so it becomes both the head and tail.
If prior == nullptr but the list is non-empty then the new node becomes the new head, followed by the old head.
If prior != nullptr then the sequence of nodes is prior, new node, prior->next.
If prior == tl then the new node becomes the new tail.
template<typename T>
void list<T>::insert(node* prior, T value) {
    if(prior == nullptr) {
        if(hd == nullptr)
            hd = tl = new node{value, nullptr}; // Empty
        else
            hd = new node{value, hd}; // At least one node
    }
    else {
        prior->next = new node{value, prior->next};
        // Update tail if it changed
        if(prior == tl)
            tl = prior->next;
    }
}
How long does insert take? Well, there’s no loop in its body, so it can’t take more than constant time.
Because we wrote insert first, and made it sufficiently flexible, other operations which add nodes to the list can be implemented in terms of insert, making them much simpler.
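To see the four insert cases in action, here is a minimal sketch. It assumes a stripped-down, int-only list (the names int_list, to_vector and insert_demo are made up for this example), and its insert returns the new node purely so the demo can refer to it later; the version in these notes returns void.

```cpp
#include <cassert>
#include <vector>

struct node {
    int value;
    node* next;
};

// Hypothetical int-only list with just enough machinery to exercise insert().
struct int_list {
    node* hd = nullptr;
    node* tl = nullptr;

    ~int_list() {
        while (hd != nullptr) { node* t = hd; hd = hd->next; delete t; }
    }

    // Insert `value` after `prior`; prior == nullptr means "at the front".
    // Returns the new node (an assumption of this sketch, for convenience).
    node* insert(node* prior, int value) {
        if (prior == nullptr) {
            hd = new node{value, hd};        // new head, followed by the old head (if any)
            if (tl == nullptr) tl = hd;      // list was empty: new node is also the tail
            return hd;
        }
        prior->next = new node{value, prior->next};
        if (prior == tl) tl = prior->next;   // inserted after the old tail
        return prior->next;
    }

    // Copy the values out, front to back, for inspection.
    std::vector<int> to_vector() const {
        std::vector<int> out;
        for (const node* n = hd; n != nullptr; n = n->next) out.push_back(n->value);
        return out;
    }
};

// Exercises each case: empty list, new head, new tail, and middle.
std::vector<int> insert_demo() {
    int_list l;
    node* two = l.insert(nullptr, 2);    // empty list:  2
    l.insert(nullptr, 1);                // new head:    1 2
    node* four = l.insert(two, 4);       // after tail:  1 2 4
    l.insert(two, 3);                    // middle:      1 2 3 4
    assert(l.tl == four);                // a middle insert leaves the tail alone
    return l.to_vector();
}
```

Every call does a fixed amount of pointer work, no matter how long the list already is.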
push_front()
With our version of insert, adding a new node to the front of the list is just an insert with prior == nullptr:
void push_front(T value) {
insert(nullptr, value);
}
This takes constant time, same as insert.
push_back()
Similarly, push_back is just an insert at the tail:
void push_back(T value) {
insert(tl, value);
}
and takes constant time, like insert.
The .at() method, to find a specific node by position
template<typename T>
typename list<T>::node* list<T>::at(int i) {
    node* current = hd;
    // Stop early if we run off the end of the list
    while(i != 0 && current != nullptr) {
        current = current->next;
        --i;
    }
    return current;
}
.at has to walk down the list until it finds the element you’re looking for. The loop depends on i, so this method takes \(O(i)\) time to execute. (A far cry from the \(O(1)\) of an array or vector!)
If you give an i that is greater than or equal to the number of elements in the list, at will return nullptr.
erase the node after the one given
Erase is allowed the same special case as insert: if prior is nullptr, then we erase the first node of the list (which has nothing before it).
template<typename T>
void list<T>::erase(node* prior) {
    if(prior == nullptr) {
        if(hd == tl) {
            // One or zero nodes
            delete hd;
            hd = tl = nullptr;
        }
        else {
            node* t = hd;
            hd = hd->next;
            delete t;
        }
    }
    else if(prior == tl)
        return; // Nothing after the tail!
    else {
        node* t = prior->next;
        prior->next = t->next;
        // Update tail
        if(t == tl)
            tl = prior;
        delete t;
    }
}
Again, there’s nothing here but pointer shuffling and if-elses, so constant time. Also like insert, because we wrote erase first and made it sufficiently flexible, other operations that remove nodes can be implemented in terms of erase.
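As a quick trace of erase, here is a sketch on a hypothetical int-only list (int_list, push_back, to_vector and erase_demo are names invented for this example). The erase below folds the empty and one-node cases together a little differently from the version above, but it is meant to behave the same way.

```cpp
#include <cassert>
#include <vector>

struct node {
    int value;
    node* next;
};

struct int_list {
    node* hd = nullptr;
    node* tl = nullptr;

    ~int_list() { while (hd != nullptr) erase(nullptr); }

    // Append helper, only here so the demo has something to erase.
    void push_back(int v) {
        node* n = new node{v, nullptr};
        if (tl != nullptr) tl->next = n; else hd = n;
        tl = n;
    }

    // Erase the node after `prior`; prior == nullptr erases the head.
    void erase(node* prior) {
        if (prior == nullptr) {
            if (hd == nullptr) return;       // empty list: nothing to do
            node* t = hd;
            hd = hd->next;
            if (t == tl) tl = nullptr;       // erased the only node
            delete t;
        } else if (prior != tl) {            // nothing comes after the tail
            node* t = prior->next;
            prior->next = t->next;
            if (t == tl) tl = prior;         // erased the tail
            delete t;
        }
    }

    std::vector<int> to_vector() const {
        std::vector<int> out;
        for (const node* n = hd; n != nullptr; n = n->next) out.push_back(n->value);
        return out;
    }
};

std::vector<int> erase_demo() {
    int_list l;
    for (int v : {1, 2, 3, 4}) l.push_back(v);
    l.erase(nullptr);        // drop the head:            2 3 4
    l.erase(l.hd);           // drop the node after head: 2 4
    l.erase(l.hd);           // drop the tail:            2
    assert(l.tl == l.hd);    // one node left, so head and tail coincide
    l.erase(l.tl);           // nothing after the tail: no change
    return l.to_vector();
}
```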
pop_back
In order to pop_back, we have to find the element before tl:
template<typename T>
void list<T>::pop_back() {
    // One node, no prior
    if(hd == tl)
        erase(nullptr);
    else {
        // Find node before tl
        node* prior = hd;
        while(prior->next != tl)
            prior = prior->next;
        erase(prior);
    }
}
This amounts to finding the second-to-last element, which requires walking (almost) the entire list, thus, \(O(n)\). (We could find the second-to-last element by doing at(size()-2) but that would require traversing the list twice; still \(O(n)\), but doing more work than necessary.)
Could we make pop_back a constant time operation by storing a pointer to the element before the tail? We’d have to update it in all the other operations (e.g., when we insert or splice at the end of the list). But notice that we’d have to update it during pop_back as well (since it changes the end of the list). How are we going to update it? By scanning through the list to find the new next-to-last element. So we haven’t really gained anything.
pop_front
Popping the front element is much easier:
void pop_front() {
    erase(nullptr);
}
Constant time.
clear, deletes all nodes
template<typename T>
void list<T>::clear() {
    while(hd != nullptr)
        pop_front();
}
Because a vector was a single allocation, we can clear it just by doing a delete and updating size and capacity. Here, each node is its own allocation, so if we want to properly clean up after ourselves, we have to delete them all, taking \(O(n)\) time.
Notice the interesting parallel: in a vector, if you want to delete an arbitrary element, you must shift everything after it down, making it \(O(n)\), but you can clear the whole vector for \(O(1)\). Here, deleting a single element is \(O(1)\), but deleting the entire list is \(O(n)\)!
size(), giving the number of nodes in the list
Computing the size requires walking the list, and thus is \(O(n)\):
template<typename T>
int list<T>::size() {
    // Count every node; an empty list (hd == nullptr) gives 0
    int s = 0;
    node* current = hd;
    while(current != nullptr) {
        current = current->next;
        s++;
    }
    return s;
}
This could be made constant time by simply storing the size of the list in a data member and then updating it in insert, erase and clear.
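A rough sketch of the stored-size idea, under the assumption of a cut-down list that only supports push_front, pop_front and clear (counted_list, count and counted_demo are hypothetical names, not part of the class above):

```cpp
#include <cassert>
#include <cstddef>

struct node {
    int value;
    node* next;
};

// Hypothetical list variant that carries its node count along.
class counted_list {
public:
    ~counted_list() { clear(); }

    void push_front(int v) {
        hd = new node{v, hd};
        ++count;                    // every node added bumps the counter
    }

    void pop_front() {
        if (hd == nullptr) return;
        node* t = hd;
        hd = hd->next;
        delete t;
        --count;                    // every node removed decrements it
    }

    void clear() {
        while (hd != nullptr) pop_front();
        assert(count == 0);         // counter must agree with an empty list
    }

    std::size_t size() const { return count; }   // O(1): no traversal

private:
    node* hd = nullptr;
    std::size_t count = 0;
};

std::size_t counted_demo() {
    counted_list l;
    l.push_front(1);
    l.push_front(2);
    l.push_front(3);
    l.pop_front();
    return l.size();
}
```

The price is one extra integer per list, plus the discipline of touching the counter in every operation that adds or removes a node.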
Reversing a list
Suppose we want to reverse a list in place, without constructing a new list (if you are allowed to construct a new list, then reversing is easy: just loop through the original and push_front all the elements onto a new list). That is, we want to do it purely by switching pointers around.
void reverse() {
...
}
List operation complexity summary
Operation | Complexity class |
---|---|
a.head(), a.tail() | \(O(1)\) |
a.insert() | \(O(1)\) |
a.at(n) | \(O(n)\) |
a.erase() | \(O(1)\) |
a.clear() | \(O(n)\) (deletes everything) |
a.push_back() | \(O(1)\) |
a.pop_back() | \(O(n)\) |
a.push_front() | \(O(1)\) |
a.pop_front() | \(O(1)\) |
a.size() | \(O(n)\) (but could be \(O(1)\) easily) |
Let’s take another look at our sorting algorithm and see how fast it would run on a list, instead of an array-like structure:
void sort(list<int>& data) {
    for(int i = 0; i < data.size() - 1; ++i) {
        // Find smallest in i..data.size()-1
        int smallest = data.at(i)->value;
        int smallest_index = i; // A position, not a node pointer
        for(int j = i; j < data.size(); ++j)
            if(data.at(j)->value < smallest) {
                smallest = data.at(j)->value;
                smallest_index = j;
            }
        // Swap it into place
        std::swap(data.at(smallest_index)->value, data.at(i)->value);
    }
}
It still performs roughly \(O(n^2)\) “operations”, however, now each operation is itself \(O(n)\), making the total runtime \(O(n^3)\)! This emphasizes how the choice of data structure can have a significant effect on the performance of your program.
Note that it is possible to rewrite a selection sort to have \(O(n^2)\) runtime on a list. Indeed, it’s actually a bit easier, because instead of doing the swap, we can just directly move the node containing the smallest remaining element to the end of the “sorted” part of the list, by doing an erase, followed by an insert. It’s \(O(1)\) to shuffle list elements around, pulling them out of one location and storing them in another. We also don’t need to access the size; just looking for the nullptr at the end is enough.
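Here is one way the \(O(n^2)\) version might look. This is a sketch of the simpler variant: it walks node pointers instead of calling at(), but still swaps values rather than relinking nodes with erase and insert. It works on bare nodes, and the helpers from_vector, to_vector_and_free and sort_demo are invented for the example.

```cpp
#include <cassert>
#include <utility>
#include <vector>

struct node {
    int value;
    node* next;
};

// Selection sort that follows next pointers; no at(i), no size().
void selection_sort(node* head) {
    for (node* i = head; i != nullptr; i = i->next) {
        // Find the smallest value from i to the end of the list.
        node* smallest = i;
        for (node* j = i->next; j != nullptr; j = j->next)
            if (j->value < smallest->value) smallest = j;
        // Swap it into place.
        std::swap(i->value, smallest->value);
    }
}

// Build a chain of nodes holding the same values as v, in the same order.
node* from_vector(const std::vector<int>& v) {
    node* head = nullptr;
    for (auto it = v.rbegin(); it != v.rend(); ++it) head = new node{*it, head};
    return head;
}

// Copy the values out and delete the nodes as we go.
std::vector<int> to_vector_and_free(node* head) {
    std::vector<int> out;
    while (head != nullptr) {
        out.push_back(head->value);
        node* t = head;
        head = head->next;
        delete t;
    }
    return out;
}

std::vector<int> sort_demo(const std::vector<int>& input) {
    node* head = from_vector(input);
    selection_sort(head);
    return to_vector_and_free(head);
}
```

Each step of either loop is a single pointer hop, so the two nested loops give \(O(n^2)\) in total.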
List variations
I’ve used null pointers to indicate the end of the list. This actually requires a few checks (which I’ve omitted) to make sure we don’t dereference a null pointer. Some people prefer a sentinel node implementation, where we create a special empty node to mark the end of the list. That is, we have a member
node sentinel; // Not a pointer!
and then the last node has its next pointing to sentinel, and the sentinel node has its next pointing to the first node of the list. This means that every node always has a ->next, so it makes the loops somewhat simpler. Similarly, every node has a node before it, so inserting at the beginning does not need a special case.
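A minimal sketch of what a sentinel-based list could look like. The class name sentinel_list, the member last and the demo function are assumptions made for illustration; this is one possible arrangement, not the only one.

```cpp
#include <cassert>
#include <vector>

struct node {
    int value;
    node* next;
};

// Hypothetical circular list: the sentinel sits "before the first" and "after the last".
class sentinel_list {
public:
    sentinel_list() { sentinel.next = &sentinel; }   // empty: sentinel points to itself
    sentinel_list(const sentinel_list&) = delete;    // copying would break the self-pointer
    sentinel_list& operator=(const sentinel_list&) = delete;
    ~sentinel_list() { while (sentinel.next != &sentinel) erase(&sentinel); }

    // Insert after `prior`. There is no nullptr special case:
    // to insert at the front, pass the sentinel itself.
    void insert(node* prior, int value) {
        prior->next = new node{value, prior->next};
        if (prior == last) last = prior->next;
    }

    // Erase the node after `prior`, unless that node is the sentinel.
    void erase(node* prior) {
        node* t = prior->next;
        if (t == &sentinel) return;
        prior->next = t->next;
        if (t == last) last = prior;
        delete t;
    }

    void push_front(int v) { insert(&sentinel, v); }
    void push_back(int v) { insert(last, v); }

    std::vector<int> to_vector() const {
        std::vector<int> out;
        // The loop ends when we come back around to the sentinel.
        for (const node* n = sentinel.next; n != &sentinel; n = n->next)
            out.push_back(n->value);
        return out;
    }

private:
    node sentinel{0, nullptr};   // not a pointer; its value is never read
    node* last = &sentinel;      // when the list is empty, the sentinel doubles as the tail
};

std::vector<int> sentinel_demo() {
    sentinel_list l;
    l.push_back(2);
    l.push_back(3);
    l.push_front(1);
    return l.to_vector();
}
```

Notice that insert has no branch for an empty list or for inserting at the front; both fall out of the general case.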
Another clever way of implementing a singly-linked list is like this:
struct list { // struct, so that value and rest are accessible to free functions
    int value;
    list *rest;
};
This is an inductive definition: here a list is defined to be a list*, which can either be:
nullptr, representing the empty list, or
a pointer to a list, consisting of a value and another list.
This formulation makes recursive functions on lists much more natural:
int length(list* l) {
if(l == nullptr)
return 0;
else
return 1 + length(l->rest);
}
Finally, a doubly-linked list gives each node two pointers, next and prev. This allows you to navigate through the list in either direction. It makes pop_back constant time, because we can now find the next-to-last node easily: it’s just tl->prev. Managing the extra pointers takes a bit of work, but not in a way that would make any operations more complex. It’s more of a time-space tradeoff. You can accelerate walking backwards through the list, at the cost of doubling the number of pointers you must store.
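To make the pop_back claim concrete, here is a small sketch of a doubly-linked list with only the operations needed for the demonstration. The names dnode, dlist, reversed and dlist_demo are hypothetical.

```cpp
#include <cassert>
#include <vector>

struct dnode {
    int value;
    dnode* prev;
    dnode* next;
};

class dlist {
public:
    ~dlist() { while (tl != nullptr) pop_back(); }

    void push_back(int v) {
        dnode* n = new dnode{v, tl, nullptr};   // new node's prev is the old tail
        if (tl != nullptr) tl->next = n; else hd = n;
        tl = n;
    }

    // O(1): the new tail is simply tl->prev, so no scan is needed.
    void pop_back() {
        if (tl == nullptr) return;
        dnode* t = tl;
        tl = tl->prev;
        if (tl != nullptr) tl->next = nullptr; else hd = nullptr;
        delete t;
    }

    // Walk the list backwards, which a singly linked list cannot do.
    std::vector<int> reversed() const {
        std::vector<int> out;
        for (const dnode* n = tl; n != nullptr; n = n->prev) out.push_back(n->value);
        return out;
    }

private:
    dnode* hd = nullptr;
    dnode* tl = nullptr;
};

std::vector<int> dlist_demo() {
    dlist l;
    for (int v : {1, 2, 3, 4}) l.push_back(v);
    l.pop_back();            // removes 4 without walking the list
    return l.reversed();     // 3 2 1
}
```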
A puzzle
Suppose I give you a linked-list, like the above, except that someone has been messing with the pointers. It’s possible that there is a cycle in this list, a point where some node’s next pointer actually points to a node that is before it in the list. Sketch an algorithm that can detect whether or not a list has a cycle in it.
Floyd’s cycle detection trick: the idea is to keep two “current nodes”. One we advance using ->next while the other we advance using ->next->next. That is, one steps down the list in steps of 1, while the other uses steps of 2. Consider what happens when both enter a cycle.
Once both pointers are inside the cycle, the two-step pointer gains one node on the one-step pointer with every iteration, so the gap between them shrinks by one each time. Whatever the length of the cycle (even or odd), the gap must eventually reach zero, and this happens before the one-step pointer has completed a full circuit. The meeting point is somewhere inside the cycle, not necessarily the node where they entered it.
On the other hand, if there is no cycle, then the two-step pointer will reach the end of the list.
So our “cycle detection” is just p1 == p2 at any point.
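A sketch of the two-pointer idea in code, working on bare nodes (has_cycle and cycle_demo are names chosen for this example; slow and fast play the roles of p1 and p2):

```cpp
#include <cassert>

struct node {
    int value;
    node* next;
};

// Returns true if following next pointers from head eventually loops.
bool has_cycle(const node* head) {
    const node* slow = head;   // advances one node per iteration
    const node* fast = head;   // advances two nodes per iteration
    while (fast != nullptr && fast->next != nullptr) {
        slow = slow->next;
        fast = fast->next->next;
        if (slow == fast) return true;   // the fast pointer caught up: cycle
    }
    return false;                        // the fast pointer fell off the end: no cycle
}

// Builds a five-node list on the stack, optionally linking the last node back to the second.
bool cycle_demo(bool make_cycle) {
    node n[5];
    for (int i = 0; i < 5; ++i) {
        n[i].value = i;
        n[i].next = (i < 4) ? &n[i + 1] : nullptr;
    }
    if (make_cycle) n[4].next = &n[1];
    return has_cycle(&n[0]);
}
```

The check uses no extra memory beyond the two pointers, and the fast pointer makes at most a constant multiple of \(n\) steps before either falling off the end or meeting the slow one.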
Cons list
Finally, we’ll look at a very old variation on singly-linked lists, called a cons list. Technically, a cons cell has no larger class around it; there are no head/tail pointers to keep track of (of course, this means that finding the end of the list requires scanning down it). To understand the motivation behind a cons list, think about the problem of representing a list of sublists. If we have a templated list type, we could make a “list of lists of something” (e.g., list<list<int>> for a list-of-lists-of-ints), but the number of “levels” of nesting is fixed. We can’t create a list where one element is an int, and the next a list of ints, and the next a list of lists of ints.
The key innovation of a cons-style list is that the “value” part of each node can be either a value (of the value type) or a pointer to another node. That is, each cell can contain zero (in the case of the last node), one (in the case of a normal linked list node) or two node pointers. This latter option is usually interpreted as being a “sublist”. E.g., the cons list
(1 2 (3 4) 5)
is represented as
1-->2-->#-->5-->nullptr
        |
        V
        3-->4-->nullptr
The empty cons list is just nullptr; a single element list looks like
1-->nullptr
The two parts of a cons cell are called the car and cdr (for historical reasons); pronounced “CAR” and “COODER” (rhymes with “could her”). These are also the names of the operations that extract them. E.g., given the list (1 2 3) the car of this is 1, while the cdr of this is (2 3). Similarly, the cons list ((1 2) (2 3)) has car = (1 2) and cdr = ((2 3)). Note that the latter is not equivalent to (2 3). ((2 3)) is a one element list, while (2 3) is a two-element list.
To implement a cons cell, we have a few choices:
We can include both the car pointer, and car value, along with a bool to switch between them:
struct cons {
    int car_value;
    cons* car_p;
    bool car_is_value;
    cons* cdr;
};
This, however, wastes space, because we are storing the int and the cons* separately, even though they will never both be required at once.
We can use a union. In C++, a union is like a struct that only ever holds one of its data members, because they all overlap in memory:
struct cons {
    cons(int v, cons* c = nullptr) {
        car.value = v;
        car_is_value = true;
        cdr = c;
    }
    cons(cons* ca, cons* cd) {
        car.p = ca;
        car_is_value = false;
        cdr = cd;
    }
    union {
        int value;
        cons* p;
    } car;
    bool car_is_value;
    cons* cdr;
};
We still need the bool to tell us which “half” is currently in use, but we save some space, because value and p are allocated “on top of each other”.
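To get a feel for the union-based cell, here is a sketch that builds the list (1 2 (3 4) 5) from earlier and prints it back out in parenthesized form. The helpers show and make_example are made up for this example, and the cells are deliberately never freed, to keep the sketch short.

```cpp
#include <cassert>
#include <string>

struct cons {
    cons(int v, cons* c = nullptr) { car.value = v; car_is_value = true; cdr = c; }
    cons(cons* ca, cons* cd) { car.p = ca; car_is_value = false; cdr = cd; }
    union { int value; cons* p; } car;
    bool car_is_value;
    cons* cdr;
};

// Render a cons list in the parenthesized notation used in these notes.
std::string show(const cons* list) {
    std::string out = "(";
    for (const cons* c = list; c != nullptr; c = c->cdr) {
        if (c != list) out += ' ';
        if (c->car_is_value)
            out += std::to_string(c->car.value);   // plain value
        else
            out += show(c->car.p);                 // sublist: recurse down the car
    }
    return out + ")";
}

// Builds (1 2 (3 4) 5). Cells are leaked on purpose for brevity.
cons* make_example() {
    cons* sub = new cons(3, new cons(4));                          // (3 4)
    return new cons(1, new cons(2, new cons(sub, new cons(5))));   // (1 2 (3 4) 5)
}
```

The first constructor makes an ordinary value cell; the second makes a cell whose car is itself a list, which is the # in the diagram above.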
Traditionally, cons cells are immutable, meaning they cannot be altered after they are created. Thus, all cons list operations work by constructing new lists, never by modifying an existing list. Although this might appear more expensive (because you have to copy existing lists all the time), it actually makes some operations more efficient, by allowing us to reuse an existing list’s elements.
For example, consider the task of list concatenation: a + b, which appends the elements of b onto the end of a. For mutable lists, we have to copy both lists to do this, because otherwise the new list might magically change, if either a or b were modified. With cons cells, this looks like this:
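cons* append(cons* a, cons* b) {
    // Nothing to copy: the result is just b
    if(a == nullptr)
        return b;
    cons* first = nullptr; // First cell of the copy (the result)
    cons* last = nullptr;  // Most recently created cell
    while(a) {
        // Make a copy of *a into a new cell
        cons* c = new cons{*a};
        // Update the cdr of the last cell
        // to point to c
        if(last)
            last->cdr = c;
        else
            first = c;
        // Advance
        a = a->cdr;
        last = c;
    }
    last->cdr = b; // Link to remainder of the list
    return first;
}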
Some things to note:
If any of the cons cells in a have sublists, those are reused, not copied.
We reuse all of b, by simply making the cdr of the last cell in the copy of a point to it.
I.e., the only copy we have to make is of a!
The above version is a bit hard to follow, because we have to store the last cons cell we constructed, in order to update its cdr after we create the following cell. Another version is the recursive definition:
cons* append(cons* a, cons* b) {
if(a == nullptr)
return b;
else {
cons* c = new cons{*a};
c->cdr = append(a->cdr, b);
return c;
}
}
In order to write a recursive version, we only need to ask what to do for the empty list, and for the non-empty list:
If the list a is empty, then a appended to b is just b.
If a is not empty, then we make a new cons cell containing a copy of a’s car. We then append the rest of a onto b, and set the cdr of the new cell to that, and return the result.
The recursive version is often hard to think about (as you have to define how to append lists in terms of how to append lists!) but once you grasp it, it is often easier and shorter than the equivalent iterative version.
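The sharing behaviour of append can be checked directly by comparing pointers. The sketch below repeats the union-based cons cell and the recursive append so that it stands alone; nth and append_shares_b are hypothetical helpers, and the cells are leaked for brevity.

```cpp
#include <cassert>

struct cons {
    cons(int v, cons* c = nullptr) { car.value = v; car_is_value = true; cdr = c; }
    cons(cons* ca, cons* cd) { car.p = ca; car_is_value = false; cdr = cd; }
    union { int value; cons* p; } car;
    bool car_is_value;
    cons* cdr;
};

cons* append(cons* a, cons* b) {
    if (a == nullptr) return b;        // empty a: the result is b itself
    cons* c = new cons{*a};            // copy the first cell of a
    c->cdr = append(a->cdr, b);        // the rest is (rest of a) appended to b
    return c;
}

// Follow n cdr links from the start of the list.
cons* nth(cons* list, int n) {
    while (n-- > 0 && list != nullptr) list = list->cdr;
    return list;
}

// True if append copied a's cells, left a untouched, and reused b as-is.
bool append_shares_b() {
    cons* a = new cons(1, new cons(2));
    cons* b = new cons(3, new cons(4));
    cons* ab = append(a, b);
    return ab != a                          // first cell is a fresh copy
        && nth(ab, 1) != nth(a, 1)          // so is the second
        && nth(ab, 2) == b                  // the third cell IS b, not a copy
        && nth(a, 1)->cdr == nullptr;       // a itself still ends after 2
}
```

Because cells are never modified after construction, nobody can observe that ab and b share their last two cells.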
Cons lists can be used to imitate all kinds of fancy data structures. E.g., a matrix might look like
((1 2 3)
(4 5 6)
(3 4 5))
A binary tree might be
(6 (3 (1 4)) (8 (6 10)))
A mathematical expression can be represented by putting the operator in the car, and the operands in the cdr:
(+ (* 2 3) 4 5)
represents the expression 2 * 3 + 4 + 5.
Many operations on cons lists are recursive; they are defined in terms of functions that call themselves. For example, here is the definition of a length function that computes the length of the top-level of a cons-list:
int length(cons* head) {
if(head == nullptr)
return 0;
else
return 1 + length(head->cdr);
}
Breaking this down:
If the list is empty, then its length is 0
If the list is non-empty, then its length is 1 (for the car) plus the length of the cdr.
Suppose we want to count the number of elements in a cons list, including those in sublists. I.e., we want the “length” at all levels, not just the top level. In order to do this, we have to not only look down the cdr of a list, we may also have to look down its car-list as well. A loop can’t “branch out” like this: we need recursion:
int elements(cons* head) {
if(head == nullptr)
return 0;
else if(head->car_is_value)
return 1 + elements(head->cdr);
else
return elements(head->car.p) + elements(head->cdr);
}
Breaking this down:
If the list is empty, the number of elements is 0.
If the car is a value, then the number of elements is 1 plus however many are in the cdr.
If the car is a sublist, then the number of elements is the number of elements in the car, plus the number of elements in the cdr.
The last case is the one that a while loop cannot handle.
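To compare the two counts side by side, here is a self-contained sketch that runs both functions on the list (1 2 (3 4) 5). It repeats the union-based cons cell; make_example is a made-up helper, and the cells are not freed.

```cpp
#include <cassert>

struct cons {
    cons(int v, cons* c = nullptr) { car.value = v; car_is_value = true; cdr = c; }
    cons(cons* ca, cons* cd) { car.p = ca; car_is_value = false; cdr = cd; }
    union { int value; cons* p; } car;
    bool car_is_value;
    cons* cdr;
};

// Top-level length: a sublist counts as a single element.
int length(const cons* head) {
    if (head == nullptr) return 0;
    return 1 + length(head->cdr);
}

// Number of values at every level of nesting.
int elements(const cons* head) {
    if (head == nullptr) return 0;
    if (head->car_is_value) return 1 + elements(head->cdr);
    return elements(head->car.p) + elements(head->cdr);   // branch into the sublist
}

// Builds (1 2 (3 4) 5).
cons* make_example() {
    cons* sub = new cons(3, new cons(4));
    return new cons(1, new cons(2, new cons(sub, new cons(5))));
}
```

For this list length gives 4 (the sublist counts once) while elements gives 5 (the values 1, 2, 3, 4 and 5).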
As a final example of recursive cons-list processing, let’s consider the problem of flattening a cons list. This means taking a list like this:
((1 2) 3 (4 5 (6 7)) 8)
(Draw diagram)
and turning it into
(1 2 3 4 5 6 7 8)
(Draw diagram)
As usual, we’ll construct a new list as we go
cons* flatten(cons* head) {
    if(head == nullptr)
        return nullptr;
    if(head->car_is_value) {
        cons* c = new cons{*head}; // Copy
        c->cdr = flatten(head->cdr); // Flatten remainder
        return c;
    }
    else {
        cons* ca = flatten(head->car.p); // Flatten car
        cons* cd = flatten(head->cdr); // Flatten cdr
        return append(ca, cd); // Combine
    }
}
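Putting the pieces together, the sketch below flattens the example list from above and renders the result as text. It repeats the union-based cons cell and the recursive append so that it stands alone; show and flatten_demo are hypothetical helpers, and no cells are freed.

```cpp
#include <cassert>
#include <string>

struct cons {
    cons(int v, cons* c = nullptr) { car.value = v; car_is_value = true; cdr = c; }
    cons(cons* ca, cons* cd) { car.p = ca; car_is_value = false; cdr = cd; }
    union { int value; cons* p; } car;
    bool car_is_value;
    cons* cdr;
};

cons* append(cons* a, cons* b) {
    if (a == nullptr) return b;
    cons* c = new cons{*a};
    c->cdr = append(a->cdr, b);
    return c;
}

cons* flatten(cons* head) {
    if (head == nullptr) return nullptr;
    if (head->car_is_value) {
        cons* c = new cons{*head};          // copy the value cell
        c->cdr = flatten(head->cdr);        // flatten the remainder
        return c;
    }
    // The car is a sublist: flatten both halves and join them.
    return append(flatten(head->car.p), flatten(head->cdr));
}

// Render a cons list in parenthesized notation.
std::string show(const cons* list) {
    std::string out = "(";
    for (const cons* c = list; c != nullptr; c = c->cdr) {
        if (c != list) out += ' ';
        if (c->car_is_value) out += std::to_string(c->car.value);
        else out += show(c->car.p);
    }
    return out + ")";
}

// Builds ((1 2) 3 (4 5 (6 7)) 8) and returns its flattened form as text.
std::string flatten_demo() {
    cons* l12 = new cons(1, new cons(2));
    cons* l67 = new cons(6, new cons(7));
    cons* l45 = new cons(4, new cons(5, new cons(l67, nullptr)));
    cons* nested = new cons(l12, new cons(3, new cons(l45, new cons(8))));
    return show(flatten(nested));
}
```

One thing to be aware of: append copies its first argument, so the sublist case copies the already-flattened car a second time. That keeps the code short, at the cost of some extra allocation.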