The List abstract data type

We’re going to look at a singly linked list. You’ll see what this means.

A linked list tries to solve the problem of linear-time inserts (at any position) by adding a layer of indirection. Because an array arranges all of its elements contiguously in memory, inserting a new element necessarily requires shifting the existing elements around. If elements could instead be stored at unrelated locations in memory, then we could insert without moving any existing elements.

A (singly) linked list consists of a collection of nodes. Each node is allocated independently of all the others; in memory, there’s no organization to where nodes are placed. Each node contains a value (whatever it stores) and a pointer to the next node. The list class itself maintains pointers to the first (required) and last (for convenience) nodes of the list. Traditionally, these are called the “head” and “tail” of the list.

Linked lists give up the ability to randomly access any element in constant time. The fastest way to access a list element is if you already have the previous one. Whereas a loop over an array looks like this:

for(int i = 0; i < size; ++i)
    // do something with arr[i]

for a linked list, the loop looks like this, following the pointers to the end:

for(node* i = hd; i != nullptr; i = i->next)
    // do something with i->value

As we’ll see, if you write an array-style loop on a list, you’re going to have a bad time…

The basic linked list class looks something like this:

template<typename T>
class list {
  public:

    struct node {
        T value;
        node* next;
    };

    list() : hd(nullptr), tl(nullptr) {}

    node* head() { return hd; }
    node* tail() { return tl; }

    void insert(node* prior, T value);

    node* at(int i);    

    void erase(node* prior);
    void clear();

    void push_back(T value);
    void pop_back();

    void push_front(T value);
    void pop_front();

    int size();    

  private:
    node *hd, *tl;
};

(Draw picture)

(Demonstrate all operations, as we go)

Note that if we want to talk about nodes outside the list class we must use its fully-qualified name: list<T>::node
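
For example, outside the class, a pointer to a node of a list of ints would be declared like this (my_list here is a hypothetical list<int>):

list<int>::node* n = my_list.head();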

The insert method

To insert a new value into a list, we need a pointer to the node before it, because we have to modify that node’s next pointer. Hence, insert takes two parameters: a pointer to a node, and a value.

As a special case, we allow the prior node to be nullptr; if this is the case, we assume we are inserting at the beginning of the list (which has no node before it, hence prior == nullptr).

There are several cases to consider:

void list::insert(node* prior, T value) {
    if(prior == nullptr) {
        if(hd == nullptr)
            hd = tl = new node{value, nullptr}; // Empty list
        else
            hd = new node{value, hd}; // At least one node
    }
    else {
        prior->next = new node{value, prior->next};

        // Update tail if it changed
        if(prior == tl)
            tl = prior->next; 
    }
}

How long does insert take? Well, there’s no loop in its body, so it can’t take more than constant time.

Because we wrote insert first, and made it sufficiently flexible, other operations which add nodes to the list can be implemented in terms of insert, making them much simpler.

push_front()

With our version of insert, adding a new node to the front of the list is just an insert with prior == nullptr:

void push_front(T value) {
    insert(nullptr, value);
}

This takes constant time, same as insert.

push_back()

Similarly, push_back is just an insert at the tail:

void push_back(T value) {
    insert(tl, value);
}

and takes constant time, like insert.

The .at() method, to find a specific node by position

node* list::at(int i) {
    node* current = hd;
    while(i != 0 && current != nullptr) {
        current = current->next;
        --i;
    }
    return current;
}

.at has to walk down the list until it finds the element you’re looking for. The loop depends on i, so this method takes \(O(i)\) time to execute. (A far cry from the \(O(1)\) of an array or vector!)

If you give an i that is past the end of the list (i.e., greater than or equal to the number of elements), at will return nullptr.

erase the node after the one given

Erase is allowed the same special case as insert: if prior is nullptr, then we erase the first node of the list (which has nothing before it).

void list::erase(node* prior) {
    if(prior == nullptr) {
        if(hd == tl) {
            // One or zero nodes
            delete hd;
            hd = tl = nullptr; 
        }
        else {
            node* t = hd;
            hd = hd->next;
            delete t;
        }
    }
    else if(prior == tl)
        return; // Nothing after the tail!
    else {
        node* t = prior->next;        
        prior->next = t->next;

        // Update tail
        if(t == tl)
            tl = prior;

        delete t;
    }
}

Again, there’s nothing here but pointer shuffling and if-elses, so constant time. Also like insert, because we wrote erase first and made it sufficiently flexible, other operations that remove nodes can be implemented in terms of erase.

pop_back

In order to pop_back, we have to find the node before tl:

void list::pop_back() {
    // One node, no prior
    if(hd == tl) 
        erase(nullptr);
    else {
        // Find node before tl
        node* prior = hd;
        while(prior->next != tl)
            prior = prior->next;

        erase(prior);
    }
}

This amounts to finding the second-to-last element, which requires walking (almost) the entire list, thus, \(O(n)\). (We could find the second-to-last element by doing at(size()-2) but that would require traversing the list twice; still \(O(n)\), but doing more work than necessary.)

Could we make pop_back a constant time operation by storing a pointer to the element before it? We’d have to update it in all the other operations (e.g., when we insert or splice at the end of the list). But notice that we’d have to update it during pop_back as well (since it changes the end of the list). How are we going to update it? By scanning through the list to find the new next-to-last element. So we haven’t really gained anything.

pop_front

Popping the front element is much easier

void pop_front() {
    erase(nullptr);
}

Constant time.

clear, deletes all nodes

void list::clear() {
    while(hd != nullptr) 
        pop_front();    
}

Because a vector is a single allocation, we can clear it just by doing a single delete and resetting its size and capacity. Here, each node is its own allocation, so if we want to properly clean up after ourselves, we have to delete every one of them, taking \(O(n)\) time.

Notice the interesting parallel: in a vector, if you want to delete an arbitrary element, you must shift everything after it down, making it \(O(n)\), but you can clear the whole vector for \(O(1)\). Here, deleting a single element is \(O(1)\), but deleting the entire list is \(O(n)\)!

size(), giving the number of nodes in the list

Computing the size requires walking the list, and thus is \(O(n)\):

int list::size() {
    int s = 0;
    node* current = hd;
    while(current != nullptr) {
        current = current->next;
        ++s;
    }

    return s;
}

This could be made constant time by simply storing the size of the list in a data member and then updating it in insert, erase and clear.
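
A minimal sketch of that variant (the member name sz is an assumption, not part of the class shown above):

int sz = 0;               // new private data member
                          // (++sz in insert, --sz in erase, sz = 0 in clear)

int size() { return sz; } // now constant time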

Reversing a list

Suppose we want to reverse a list in place, without constructing a new list (if you are allowed to construct a new list, then reversing is easy: just loop through the original and push_front all the elements onto a new list). That is, we want to do it purely by switching pointers around.

void reverse() {
    ...
}
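
One way the body could be filled in (a sketch: we walk the list once, flipping each node's next pointer to point at the node before it):

void reverse() {
    node* prev = nullptr;
    node* current = hd;
    tl = hd;                        // the old head becomes the new tail
    while(current != nullptr) {
        node* next = current->next; // remember the rest of the list
        current->next = prev;       // flip this node's pointer backwards
        prev = current;
        current = next;
    }
    hd = prev;                      // the old last node becomes the new head
}

This takes \(O(n)\) time and uses only a constant amount of extra space.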

List operation complexity summary

Operation          Complexity class
a.head(), a.tail() \(O(1)\)
a.insert()         \(O(1)\)
a.at(n)            \(O(n)\)
a.erase()          \(O(1)\)
a.clear()          \(O(n)\) (deletes everything)
a.push_back()      \(O(1)\)
a.pop_back()       \(O(n)\)
a.push_front()     \(O(1)\)
a.pop_front()      \(O(1)\)
a.size()           \(O(n)\) (but could easily be made \(O(1)\))

Let’s take another look at our sorting algorithm and see how fast it would run on a list, instead of an array-like structure:

void sort(list<int>& data) {
  for(int i = 0; i < data.size() - 1; ++i) {
    // Find smallest in i..data.size-1

    int smallest = data.at(i)->value;
    int smallest_index = i;
    for(int j = i; j < data.size(); ++j) 
      if(data.at(j)->value < smallest) {
        smallest = data.at(j)->value;
        smallest_index = j;
      }

    // Swap it into place
    std::swap(data.at(smallest_index)->value, data.at(i)->value);
  }
}

It still performs roughly \(O(n^2)\) “operations”; however, each operation is now itself \(O(n)\), making the total runtime \(O(n^3)\)! This emphasizes how the choice of data structure can have a significant effect on the performance of your program.

Note that it is possible to rewrite a selection sort to have \(O(n^2)\) runtime on a list. Indeed, it’s actually a bit easier, because instead of doing the swap, we can just directly move the node containing the smallest element to the end of the “sorted” part of the list, by doing an erase, followed by an insert. It’s \(O(1)\) to shuffle list elements around, pulling them out of one location and storing them in another. We also don’t need to access the size; just looking for the nullptr at the end is enough.
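
Here is one way that might look, as a sketch using the list<int> interface above (the name selection_sort and the bookkeeping variables are illustrative, not from the notes):

void selection_sort(list<int>& data) {
    // boundary is the last node of the already-sorted prefix
    // (nullptr means nothing has been sorted yet)
    list<int>::node* boundary = nullptr;

    while(true) {
        // First node of the unsorted part
        list<int>::node* start = boundary ? boundary->next : data.head();
        if(start == nullptr || start->next == nullptr)
            break; // zero or one unsorted nodes left: done

        // Find the node *before* the smallest unsorted node
        // (nullptr means start itself is the smallest)
        list<int>::node* before_min = nullptr;
        int min_value = start->value;
        list<int>::node* prev = start;
        for(list<int>::node* cur = start->next; cur != nullptr; cur = cur->next) {
            if(cur->value < min_value) {
                min_value = cur->value;
                before_min = prev;
            }
            prev = cur;
        }

        // Move the smallest value to the end of the sorted prefix:
        // erase its node, then re-insert the value after boundary
        if(before_min != nullptr) {
            data.erase(before_min);
            data.insert(boundary, min_value);
        }

        // The sorted prefix has grown by one node
        boundary = boundary ? boundary->next : data.head();
    }
}

The outer loop runs \(O(n)\) times and the inner scan is \(O(n)\), but each erase and insert is \(O(1)\), so the whole thing is \(O(n^2)\).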

List variations

I’ve used null pointers to indicate the end of the list. This actually requires a few checks (which I’ve omitted) to make sure we don’t dereference a null pointer. Some people prefer a sentinel node implementation, where we create a special empty node to mark the end of the list. That is, we have a member

node sentinel; // Not a pointer!

and then the last node has its next pointing to the sentinel, and the sentinel node has its next pointing to the first node of the list. This means that every node always has a valid next, which makes the loops somewhat simpler. Similarly, every node has a node before it, so inserting at the beginning does not need a special case.
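
With a sentinel, the traversal loop might look like this (a sketch, assuming an empty list is represented by the sentinel pointing at itself):

for(node* i = sentinel.next; i != &sentinel; i = i->next)
    // do something with i->value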

Another clever way of implementing a singly-linked list is like this:

struct list {
  int value;
  list* rest;
};

This is an inductive definition: a list is represented by a list*, which can either be nullptr (the empty list) or a pointer to a node containing a value and the rest of the list.

This formulation makes recursive functions on lists much more natural:

int length(list* l) {
  if(l == nullptr)
      return 0;
  else
      return 1 + length(l->rest);
}

Finally, a doubly-linked list gives each node two pointers, next and prev. This allows you to navigate through the list in either direction. It makes pop_back constant time, because we can now find the next-to-last node easily: it’s just tl->prev. Managing the extra pointers takes a bit of work, but not in a way that would make any operations more complex. It’s more of a time-space tradeoff: you can accelerate walking backwards through the list, at the cost of doubling the number of pointers you must store.
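
A doubly-linked node might look something like this (a sketch; dnode is a hypothetical name, not part of the class above):

template<typename T>
struct dnode {
    T value;
    dnode* next; // towards the tail
    dnode* prev; // towards the head
};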

A puzzle

Suppose I give you a linked list, like the above, except that someone has been messing with the pointers. It’s possible that there is a cycle in this list: a point where some node’s next pointer actually points back to a node that comes before it in the list. Sketch an algorithm that can detect whether or not a list has a cycle in it.

Floyd’s cycle detection trick: the idea is to keep two “current nodes”. One we advance using ->next while the other we advance using ->next->next. That is, one steps down the list in steps of 1, while the other uses steps of 2. Consider what happens when both enter a cycle: the faster pointer gains one node on the slower pointer with every step, so once both are inside the cycle they must eventually point to the same node.

So our “cycle detection” is just p1 == p2 at any point.
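
A sketch of the idea, using the list<T>::node type from above:

bool has_cycle(list<int>::node* head) {
    list<int>::node* p1 = head; // advances one node per step
    list<int>::node* p2 = head; // advances two nodes per step
    while(p2 != nullptr && p2->next != nullptr) {
        p1 = p1->next;
        p2 = p2->next->next;
        if(p1 == p2)
            return true; // the pointers met: there must be a cycle
    }
    return false;        // p2 fell off the end: no cycle
}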

Cons list

Finally, we’ll look at a very old variation on singly-linked lists, called a cons list. Technically, a cons list has no larger class around it; there are no head/tail pointers to keep track of (of course, this means that finding the end of the list requires scanning down it). To understand the motivation behind a cons list, think about the problem of representing a list of sublists. If we have a templated list type, we could make a “list of lists of something” (e.g., list<list<int>> for a list-of-lists-of-ints), but the number of “levels” of nesting is fixed. We can’t create a list where one element is an int, the next a list of ints, and the next a list of lists of ints.

The key innovation of a cons-style list is that the “value” part of each node can either be a value (of the value type) or a pointer to another node. That is, each cell can contain zero (in the case of the last node), one (in the case of a normal linked list node), or two node pointers. This latter option is usually interpreted as being a “sublist”. E.g., the cons list

(1 2 (3 4) 5)

is represented as

1-->2-->#-->5-->nullptr
        |
        V
        3-->4-->nullptr

The empty cons list is just nullptr; a single element list looks like

1-->nullptr

The two parts of a cons cell are called the car and cdr (for historical reasons); pronounced “CAR” and “COODER” (rhymes with “could her”). These are also the names of the operations that extract them. E.g., given the list (1 2 3) the car of this is 1, while the cdr of this is (2 3). Similarly, for the cons list ((1 2) (2 3)), with car = (1 2) and cdr = ((2 3)). Note that the latter is not equivalent to (2 3). ((2 3)) is a one element list, while (2 3) is a two-element list.

To implement a cons cell, we have a few choices:
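
One possibility is a tagged union; here is a sketch consistent with the member names (car_is_value, car.p, cdr) used in the code below, though the exact layout is an assumption:

struct cons {
    bool car_is_value;   // true: car holds a value; false: car holds a sublist
    union {
        int   v;         // the value, when car_is_value is true
        cons* p;         // the sublist, when car_is_value is false
    } car;
    cons* cdr;           // rest of the list (nullptr at the end)
};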

Traditionally, cons cells are immutable, meaning they cannot be altered after they are created. Thus, all cons list operations work by constructing new lists, never by modifying an existing list. Although this might appear more expensive (because you have to copy existing lists all the time), it actually makes some operations more efficient, by allowing us to reuse an existing list’s elements.

For example, consider the task of list concatenation: a + b, which appends the elements of b onto the end of a. For mutable lists, we have to copy both lists to do this, because otherwise the new list might magically change if either a or b were modified. With cons cells, append looks like this:

cons* append(cons* a, cons* b) {
    if(a == nullptr)
        return b; // Nothing to copy

    // Copy a
    cons* head = nullptr; // First cell of the copy
    cons* last = nullptr; // Most recently copied cell
    while(a) {
        // Make a copy of *a into a new cell
        cons* c = new cons{*a};

        // Update the cdr of the last cell
        // to point to c
        if(last)
            last->cdr = c;
        else
            head = c; // This is the first cell of the copy

        // Advance
        a = a->cdr;
        last = c;
    }
    last->cdr = b; // Link to remainder of the list
    return head;
}

Some things to note: we never copy b; the result simply links to b’s existing cells, which is safe precisely because cons cells are immutable. I.e., the only copy we have to make is of a!

The above version is a bit hard to follow, because we have to store the last cons cell we constructed, in order to update its cdr after we create the following cell. Another version is the recursive definition:

cons* append(cons* a, cons* b) {
    if(a == nullptr)
        return b;
    else {
        cons* c = new cons{*a};
        c->cdr = append(a->cdr, b);
        return c;
    }
}

In order to write a recursive version, we only need to ask what to do for the empty list and for the non-empty list: if a is empty, the result is just b; if a is non-empty, the result is a copy of a’s first cell whose cdr is the result of appending the rest of a onto b.

The recursive version is often hard to think about (as you have to define how to append lists in terms of how to append lists!) but once you grasp it, it is often easier and shorter than the equivalent iterative version.

Cons lists can be used to imitate all kinds of fancy data structures. E.g., a matrix might look like

((1 2 3)
 (4 5 6)
 (3 4 5))

A binary tree might be

(6 (3 (1 4)) (8 (6 10)))

A mathematical expression can be represented by putting the operator in the car, and the operands in the cdr:

(+ (* 2 3) 4 5)

represents the expression 2 * 3 + 4 + 5.

Many operations on cons lists are recursive; they are defined in terms of functions that call themselves. For example, here is the definition of a length function that computes the length of the top-level of a cons-list:

int length(cons* head) {
    if(head == nullptr)
        return 0;
    else
        return 1 + length(head->cdr);
}

Breaking this down: the empty list (nullptr) has length zero, and a non-empty list is one element longer than its cdr.

Suppose we want to count the number of elements in a cons list, including those in sublists. I.e., we want the “length” at all levels, not just the top level. In order to do this, we have to not only look down the cdr of a list, we may also have to look down its car-list as well. A loop can’t “branch out” like this: we need recursion:

int elements(cons* head) {
    if(head == nullptr)
        return 0;
    else if(head->car_is_value)
        return 1 + elements(head->cdr); 
    else
        return elements(head->car.p) + elements(head->cdr);
}

Breaking this down: the empty list contains zero elements; if the car is a value, it contributes one element, plus however many are in the cdr; if the car is a sublist, we count the elements in that sublist and add the elements in the cdr. The last case is the one that a while loop cannot handle.

As a final example of recursive cons-list processing, let’s consider the problem of flattening a cons list. This means taking a list like this:

((1 2) 3 (4 5 (6 7)) 8)

(Draw diagram)

and turning it into

(1 2 3 4 5 6 7 8)

(Draw diagram)

As usual, we’ll construct a new list as we go:

cons* flatten(cons* head) {
  if(head == nullptr)
      return nullptr;
  if(head->car_is_value) {
      cons* c = new cons{*head};   // Copy this cell
      c->cdr = flatten(head->cdr); // Flatten remainder
      return c;
  }
  else {
      cons* ca = flatten(head->car.p); // Flatten the car (sublist)
      cons* cd = flatten(head->cdr);   // Flatten the cdr
      return append(ca, cd);           // Combine
  }
}