Last time: Mergesort and quicksort

Mergesort: the merge operation

void merge(int* in, int size, int* out) {
    int i = 0, j = size/2, k = 0;

    while(i < size/2 && j < size) {
      if(in[i] < in[j])
        out[k++] = in[i++];
      else
        out[k++] = in[j++];
    }

    // Copy any remaining element
    while(i < size/2) out[k++] = in[i++];
    while(j < size)   out[k++] = in[j++];
}

An alternate definition makes the main loop over k (i.e., the out array), rather than the input arrays:

void merge(int* in, int size, int* out) {
    int i = 0, j = size/2, k = 0;

    while(k < size) {
      if(j == size || in[i] < in[j])
        out[k++] = in[i++];
      else if(i == size/2 || in[i] >= in[j])
        out[k++] = in[j++];
    }
}

The conditions on each if are the situations in which we copy from i‘s or j’s portion of the array: we copy from the left half either if v[i] < v[j] or if the right have is exhausted. Similarly for the right half. (It may take some effort to convince yourself that this code never reads outside the input array!)

Note that the only reason why Mergesort cannot operate in place is only because the merge operation has to copy from one location to another. What would it take to write an in-place merge, and how would that affect the runtime complexity?

An in-place merge takes in a single array, split into two (already sorted) halves:

template<typename It>
void merge(It start, It finish, It mid) {
  ...
}

As before, we will keep two pointers, one to the left half (start to mid-1) and one to the right half (mid to finish-1):

void merge(int* arr, int size) {
  int mid = size/2;
  int i = 0, j = mid;

  while(i < mid && j < size) {
    if(arr[i] < arr[j])
      ++i; // "Add" arr[i] to output
    else {
      // Need to add arr[j], at i, by *inserting* it
      int x = arr[j];

      // Shift up
      for(int k = j; k > i; --k)
        arr[k] = arr[k-1];

      arr[i] = x;

      // Adjust counters because of the shift
      ++i; ++j; ++mid;
    }
  }
}

Analysis: best and worst case

Best case: O(n), same as the copying merge.
Worst case: O(n^2)

This means that the in-place mergesort has a worst-case runtime complexity of $O(n^2 \log n)$.

Quicksort: the partition operation

template<typename It>
It partition(It start, It finish) {
    It p = ...; // Choose pivot

    It i = start - 1;
    It j = finish;

    while(true) {
        do 
            i++;
        while(*i < *p);

        do
            j--;
        while(*j > *p);

        if(i >= j)
            return j + 1;

        std::swap(*i, *j);
    }
}

(I might ask you to write either or both of these on a test.)

Searching and search trees

Binary tree math

Some tree terminology

A tree is technically defined as a graph where there is at most one path from any node to any other node.
We’ll be dealing with binary trees, trees where each node can have at most 2 children. Usually, we distinguish between the two children, calling one left and the other right. This implies that there are technically two kinds of 1-child nodes: one with the child on the left, and another with it on the right.
The root node is the node at the top. If the root is nullptr then the tree is considered to be empty.
Every node has both left and right children, although one or both of these may be null.
Every node has both left and right subtrees, although one or both of these may be empty (if the corresponding child is).
If we consider all the nodes in the left or right subtrees, we call these the left/right descendants of a node.
A node with no children is called a leaf.
A node with at least one child is called an internal node.
The height of the tree is defined as the height of its deepest leaf. An empty tree has no height, a tree with only a root node has a height of 0, and so forth.
If we look at all the nodes that are on a given horizontal plane, this is called a “level” of the tree.
A binary tree where every node has 0 or 2 children is called full.
A binary tree where every node has 0 or 2 children, and all levels except possibly the last are full, then this is called a complete binary tree.
A balanced tree is one in which the left and right subtrees, at all levels, fulfill some “balancing” criteria, intended to keep the tree from leaning too heavily to the right or left, and closer to being complete.
A degenerate binary tree is one where every node has 0 or 1 children; this is equivalent to a linked list.

Given a binary tree of height $h$, how many nodes are potentially in its last (highest) level?

The empty tree has 1 (root)
The height 1 tree has 2
The height 2 tree has 4.
In general, the last level can have up to $2^h$ nodes.

(Inductive proof)

Given a binary tree with $n$ nodes, what are the maximum and minimum heights it could have?

The most “sparse” tree, in terms of height, is a degenerate one, where every level of the tree contains exactly one node. So then, by our definition of height, the tallest tree with $n$ nodes is

$$h = n - 1$$
The most “compact” binary tree (in terms of height) is a complete one. This will take a bit of work to figure out how its height relates to the number of nodes in it.
```
 a

   a
  / \
 b   c
```
The sequence of complete trees, where the last row is full, goes 1, 3, 7, etc. In other words

$$n = 2^{h+1} - 1$$

If the last row is not full, then it must have at least 1 node (if it had 0, we would have a tree of height $h-1$). So then it is

$$n = 2^{h+1-1} - 1 + 1 = 2^h$$

If we take the log of both sides, we get

$$h = \lfloor \log n \rfloor$$

So in the “best” (shortest) case, the height of the tree is $O(\log n)$ in the number of nodes, while in the worst case it is $O(n)$.

Proof that the number of nodes in a totally-complete tree of height $h$ is $2^{h+1} - 1$, by induction on $h$

Base case: $h = 0$. Trivial: $2^1 - 1 = 1$, and a tree of height 0 has just a root node.
Inductive case: $h > 0$. Inductive hypothesis: for all $h’ < h$, $n = 2^{h’ + 1} - 1$.

A tree of height \h’ = h - 1\ will have $2^{h-1}$ nodes in its last level, and $2^{h’ + 1} - 1 = 2^{h} - 1$ nodes in total (by the IH). Increasing its height by 1 will require giving each of the nodes in the last level 2 children, adding an additional $2 \cdot 2^{h-1} = 2^h$ nodes. The total number of nodes will thus be

$$n = 2^{h} - 1 + 2^{h} = 2 \cdot 2^h - 1 = 2^{h+1} - 1$$

QED.

We’re going to need this result later.

This also gives us another summation identity:

$$\sum_{i=0}^n 2^i = 2^{n+1} - 1$$

Search trees

We’ve looked at binary search before, but let’s look at it again: The idea is that if I have a sorted vector of data, and I’m looking at some particular element e, I look at the element in the middle of the vector. If it is equal to e, then I’m done. If it’s less than e, then we know e lies in the portion of the vector below the midpoint. If it’s greater than e, then we know e lies in the portion above the midpoint. Either way, we can redo our search in the newly restricted region.

If we write this using a loop, it looks like this:

template<typename T>
int binary_search(const vector<T>& data, T e) {
    int start = 0, end = data.size()-1;

    while(start <= end) {
        int mid = start + (end - start) / 2;

        if(data.at(mid) == e)
            return mid;
        else if(data.at(mid) > e) {
            // Search left
            end = mid - 1;
        }
        else  // data.at(mid) < e
            start = mid + 1;
    }

    return -1; // Not found
}

Recursively, it looks like this:

template<typename T>
int binary_search(const vector<T>& data, T e) {
    return binary_search(data,0,data.size()-1);
}

template<typename T>
int binary_search(const vector<T>& data, T e, int start, int end) {
    if(start > end)
        return -1;
    else {
        int mid = start + (end - start) / 2;
        if(data.at(mid) == e)
            return mid;
        else if(data.at(mid) > e)
            return binary_search(data,e,start,mid - 1);
        else
            return binary_search(data,e,mid + 1, end);
    }
}

To re-analyze this, in each time through the loop (or in each recursive call) we cut the search space in half. If the size of the vector is $n$ then it will take $\log n$ iterations/recursive calls in the worst case (if the element does not exist), making this algorithm $O(\log n)$.

A sorted array with binary search allows us to find a given element in $O(\log n)$. But what if we wanted to modify the array, adding or removing elements on the fly? In order to preserve the sortedness of the array, we would have to find the location of the element to be added/removed, but then what do we do? To insert an element, we have to shift every element after it up, assuming we have some extra room at the end of the array in which to do this. Similarly, to delete an element, we have to shift everything down. Both of these things take roughly $O(n)$ time to move the elements around.

We could use a sorted linked list to make the insertion/deletion fast, but the problem is that binary search requires random access. The linear access supported by linked lists doesn’t work with binary search. Let’s think about what kind of data structure we’d need to implement in order to be able to perform a binary search, but also be able to insert and remove elements.

Does binary search really “require” random access? Technically, no: observe the pattern of accesses in binary search. We examine the element at $n/2$; depending on the value of that, we might inspect the element at $n/4$ or $3n/4$, followed by one of $n/8$, $3n/8$, $5n/8$ or $7n/8$, and so forth.

At the same time, if we want to avoid $O(n)$ insertion/deletion, we’ll need to use indirection, as we did with linked-lists, so that changing the structure is just a matter of switching out some pointers. This implies that the elements will be allocated in nodes, which will be not be located adjacently in memory.

So we’ll have something like

struct node {
  int key; // Current value
  ...        // Pointers...
};

We want the pointers to encode all the decisions we might make. So from the node that represents the element at $n/2$, we need pointers to elements at $n/4$ and $3n/4$:

struct node {
  int key;
  node* left 
  node* right;
  // node* parent;
};

We can continue this structure recursively. Each node has pointers to the nodes that are to the left and right of it, the nodes that would be searched next in a binary search. This gives us a binary search tree.

A binary search tree is a binary tree in which, at any node, all the nodes in the left subtree are less than the node, and all nodes in the right subtree are greater than the node.

For example, here’s a binary search tree:

        5
      /   \
     3     7
    / \   / \
    1  4  6  9
   /        / \
 -2        8   11

Note that not all nodes need to be present. Depending on how the tree is constructed, some nodes will have zero, one, or two children.

Binary tree implementation

Some binary trees, like ours, only give nodes pointers to their children. There are some operations for which it can be useful to also give nodes pointers to their parent node. We’ll note when this would be useful.

We may also find it useful to write a function that returns the height of a binary tree:

int height(node* root) {
  if(!root)
    return 0; // Empty tree
  else {
    return 1 + max(height(root->left), height(root->right));
  }
}

This definition is different from the mathematical one: it defines the empty tree to have a height of 0, the 1-node tree to have a height of 1, etc. The relationship between n and h for a complete tree is thus

$$n = 2^h - 1$$

Constructing a binary search tree

How do we build a binary search tree? If we have an already-sorted array, we can build it directly, by pretending to do a binary search and creating nodes as we go:

node* binary_tree(const vector<int>& data, int low, int high) {

  if(low > high)
    return nullptr;

  int mid = low + (high - low) / 2;

  // Recursively construct left and right subtrees
  node* left =  binary_tree(data, low, mid-1);
  node* right = binary_tree(data, mid+1, high);

  // Return new root node
  return new node{data.at(mid), left, right};
}

node* binary_tree(const vector<int>& data) {
  return binary_tree(data, 0, data.size()-1);
}

The difference here is that instead of going only down the left or right path, here we go down both. What is the complexity of this operation? Well, we have to visit every element in the vector, so it has to take at least $O(n)$ time. We’re not going any extra work per-element, so linear time it is.

Checking a BST for validity

Suppose we have a binary tree, and we want to check it to see if it really is a proper search tree. That is, we want to make sure the search order property is maintained throughout the tree. A first stab at this might be to check the property at each node (i.e., make sure that the node lies between its left and right children, if they exist) and then recursively check the children, however, this is not sufficient. Consider the tree

This tree has the “local” property (every node is greater than its right child and less than its left child) but fails the “global” order property (every node is less than all its right descendants).

In order to properly check a tree, we need to keep track, for each subtree, what the bounds are on its value. E.g., when we go into the right subtree of 8, we need to remember that 8 is the lower bound. When we then go into the left subtree of 12, we will know that all values should be in the range $(8,12)$. Since 7 is not, we fail.

This looks like this:

#include<limits>

bool is_bst(node* root) {
  return is_bst(root, 
                numeric_limits<int>::min(), 
                numeric_limits<int>::max());
}

bool is_bst(node* root, int low, int high) {
  if(!root)
    return true;
  else if(root->key <= low || root->key >= high)
    return false; // Out of bounds
  else
    return is_bst(root->left,  low, root->key) &&
           is_bst(root->right, root->key, high);
}

(numeric_limits<int>::min() is the C++ equivalent to INT_MIN.)

Constructing a tree, the general case

The above assumes that we already have a sorted vector to start with. What if we don’t?

To construct a binary tree from “scratch”, we can simply take all the elements and insert the into the tree. insert is an operation that adds an element to the tree, while preserving the tree ordering structure. To do an insert, we simply find the location in the tree where the element would go (if it doesn’t already exist) and then create it at that point. E.g., suppose we wanted to insert 9 into the tree above. We proceed as if we wanted to find 9 (i.e., as if 9 was already in the tree), by going right, right, left, right from the root. When we see that 9 does not exist, we simply add it as the right child of 8.

If we do this for every element in the input, we’ll end up with a binary search tree.

node* insert(node* root, int key) {
    if(root == nullptr)
        return new node{key, nullptr, nullptr};
    else if(root->key == key)
        return root;
    else if(key < root->key)
        root->left = insert(root->left, key);
    else // key > root->key
        root->right = insert(root->right, key);
    return root;
}

insert is interesting in that it returns a “new” tree into which the given key has been inserted. In most cases, the pointer it returns is exactly the same as the one it has been given, except when we insert into the empty tree: in that case, it constructs a new node and returns it. Note that when we insert into the left/right subtrees, we replace the existing left/right pointer with the result of the insert, in case it has changed.

Structurally, this is equivalent to doing a find operation (below) looking for the point where the key should be in the tree. This is, in the worst case, proportional to the height of the tree and thus, in a balanced tree, should be $O(\log n)$.

Let’s try constructing a tree from some inputs

5 2 9 5 8 7 10 -3

Insert these in order from first to last.

1 2 3 4 5 6 7 8 9

Insert these in order from first to last. The result doesn’t look much like a “tree”. We’ll see later than inputs like these will cause us problems unless we’re careful.

Loop-based insert: It’s possible to build a loop-based insert, although it’s a bit more tricky. We have to keep two pointers, one to the “current” node, and one to the “previous” node, which will be the parent of the current node. This is because when we reach the point where the new node should exist, the current node will be nullptr, and changing it won’t do anything! We have to update the left/right pointer in its parent in order to update the structure of the tree.

node* insert(node* root, int key) {
  if(root == nullptr)
    return new node{ key, nullptr, nullptr };

  node* n = root;    // Current node
  node* p = nullptr; // Previous node

  while(n != nullptr) {
    p = n;
    if(key == n->key)
      break;
    else if(key < n->key)
      n = n->left;
    else // key > n->key
      n = n->right;
  }

  if(n != nullptr)
    return root; // Already exists
  else if(key < p->key)
    p->left = new node{ key, nullptr, nullptr };
  else // key > p->key
    p->right = new node{ key, nullptr, nullptr };

  return root;
}

Since we’ve mentioned it, let’s look at…

Finding an element

To find an element, we follow the pointers in the same order we would if we were doing a binary search. If the value at the current node is less than the search target, then we need to search to the right, so we recursively proceed down the right subtree, else if it’s greater than, we go down the left subtree. (If the value at the current node is the value we’re looking for, then obviously we’re done.) As mentioned above, find returns a reference to the pointer that points to either the target element, or where it would be, if it does not exist. Modifying this reference thus modifies the tree.

node* find(node* root, int key) {
  if(!root)
    return root; // Empty tree, or not found
  else if(root->key == key)
    return root;
  else if(root->key < key)
    return find(root->right, key);
  else if(root->key > key)
    return find(root->left, key);
}

Note that you can still test the return value to see if it is == nullptr to determine whether or not the element exists.

Analysis of this operation: ideally, this operation takes the same amount of time as a binary search, $O(\log n)$. But there are circumstances when it can take much worse time…

Let’s try finding some elements in the trees we constructed above. Find 2. Find 10. Find -2 (does not exist). Try finding some nodes in the “unbalanced” tree; we can get a feel for why it’s bad.

Loop-based find: It’s also possible to build a loop-based (non-recursive) version of find. This version may be slightly faster than the recursive implementation.

node* find(node* root, int key) {
  node* n = root;  
  while(n != nullptr) {
    if(key == n->key)
      break;
    else if(key < n->key)
      n = n->left;
    else // key > n->key
      n = n->right;
  }

  return n;
}

Deleting a node

Deleting a node is easy if it’s a leaf node: just remove the node and set the pointer to it in its parent to nullptr. But what if it’s an internal node? E.g.

Suppose we want to delete 9? We need to preserve the search tree ordering property. If a node has only one child, then we can simply replace the deleted node with that child (note that 7 could have children of its own; they get copied with it). If a node has more than one child, then the process is more complex:

This case is tricky, because both 7 and 10 (the obvious choices to replace 9) might have children of their own. The problem is that we need a value that can take 9’s place, but we need it to not have two children of its own. What does the search tree ordering property tell us about the values that could replace 9? In order for the search property to be maintained, the replacement for 9 must be greater than all the values in its left subtree, and less than all the values in its right subtree. (Note that if we choose a replacement from either subtree, it’s OK as long as we remove it from the subtree.)

In order to find a value in the left subtree that could replace 9, we need to find the value that is

a) greater than all the other values in the left subtree but also

b) less than 9 (so that, transitively, it is less than all the values in the right subtree)

This value has a name: the predecessor. It is the value that comes right before 9, if we were to list them out in order. (We could also use the right subtree, and look for the successor.) Note that the predecessor is guaranteed to have zero or one children, never two (if it had two, it would have a right subtree, which would contain values greater than it, and we have asserted that the predecessor is the greatest value which is less than 9).

We’ll implement predecessor later, for now, we’ll just assume that it returns a reference to the node pointer, like all the other functions.

node* remove(node*& root, int key) {

  if(root == nullptr)
      return; // Empty tree
  else if(root->key == key)
      return remove(root); // Remove root node
  else {

      node* parent = find_parent(root,key);

      if(key < parent->key)
          parent->left = remove(parent->left);
      else
          parent->right = remove(parent->right); 
  }
}

node* remove(node* n) {
    assert(n != nullptr);

    if(n->left == nullptr && n->right == nullptr) {
        // No children
        delete n;
        return nullptr;
    }
    else if(n->left == nullptr) {
        // Right child
        node* r = n->right;
        delete n;
        return r;
    }
    else if(n->right == nullptr) {
        // Left child
        node* l = n->left;
        delete n;
        return l;
    }
    else {
        // Two children
        return remove_smallest(n->right, n);
    }
}

node* remove_smallest(node* n, node* origin) {
    assert(n != nullptr);

    if(n->left == nullptr) {
        // Move this node's key to origin
        origin->key = n->key; 

        // This will always be an "easy" remove
        return remove(n);
    }
    else
        return remove_smallest(n->left, origin);
}

What is the runtime complexity of remove? It doesn’t search through the tree at all, and even the recursive call will only ever be run once, so it takes $O(1)$ time, assuming we already have a pointer to the node to be removed. The only tricky part is the complexity of finding the predecessor, which is still $\log n$ in the height of the tree. So delete is $O(\log n)$ in the height of the tree (it would be anyway, because we have to find the node to be deleted.)

Sets and maps

The above tree implements a “set” abstract data type: we can tell whether a key is in or out of the tree, but keys have no other information attached to them. We can also construct a map ADT: a map associates keys with values, so that we insert a key/value pair together, and then given a key, we can lookup its value.

Creating a map is as simple as adding

struct node {
  int key;
  string value;
  node* left;
  node* right;
};

Or whatever value type you want. A totally generic map would support different types for both keys and values: the requirement for keys is that they support comparisons; for values, that you can copy them:

template<typename Key, typename Val>
struct node {
  Key key;
  Val value;
  node<Key,Val>* left;
  node<Key,Val>* right;
};

(Note that we have to specify that both the left and right pointers point to the same kind of tree; one having the same key and value types.)

Finishing up `remove`

Finding the predecessor for remove is relatively easy, because we only ever do it in the case where the node is known to have two children. To find the predecessor, we simply look for the largest value in the node’s left subtree (that is, the largest value which is still less than the target). In general, finding the predecessor/successor when those child nodes may not exist is more difficult. Here’s the remove-specific predecessor operation:

node*& pred(node*& target) {
  return largest(target->left);
}

node*& largest(node*& root) {
  if(!root)
    return root;
  else if(!root->right)
    return root;
  else
    return largest(root->right);
}

largest finds the largest value in a (sub)tree, by simply going right as far as possible. A similar operation, smallest can easily be constructed:

node*& smallest(node*& root) {
  if(!root)
    return root;
  else if(!root->left)
    return root;
  else
    return smallest(root->right);
}

Predecessor and successor

To find the predecessor/successor in general, when a node is not guaranteed to have a right/left subtree, we have to be able to search the entire tree, so we need the root node, in addition to the target node. Predecessor and successor are both much simpler and faster if we have parent pointers. If a node does not have a left subtree, then its successor will be one of its ancestors, one of the nodes that lies on a path from the root to the node. E.g., consider finding the successor of 1 in this tree:

We have to go all the way up to 2 to find it. Similarly, the predecessor of 4 is its parent. To find the successor in general, we may have to search the entire tree.

node*& succ(node* root, node*& target) {
  if(target->right)
    return smallest(target->right);

  // Start from root and search for successor down the tree
  node** succ = nullptr;
  while (root)
      if (target->key < root->key) {
          succ = &root;
          root = root->left;
      }
      else if (target->key > root->key)
          root = root->right;
      else
         break;

  return *succ;
}

pred is analogous.

The complexity of these operations depends on the height of the tree, which in turn should be about $O(\log n)$.

Tree traversal

If we want to, we can build a function which will visit all the nodes in the tree, in ascending order. (We could also build a version that would visit them in descending order.) To do this, we build a recursive function that first visits the left subtree, then the root itself, then the right subtree. This forces all the values less than the root to be visited before it, and all the values greater than the root to be visited after it.

#include<functional>

void inorder(node* root, function<void(node*&)> visit) {
  if(!root) 
    return;

  if(root->left)
    inorder(root->left, visit);

  visit(root);

  if(root->right)
    inorder(root->right, visit);
}

The type function<void(node*&)> is the C++11 version of a function pointer. It can hold any kind of “functional object” that takes a node*& and returns void. This includes function pointers, classes that overload operator(), and lambda functions. E.g., if we do

inorder(tree, [](node*& n) { cout << n->key << " "; });

this will print all the values in the tree, in order.

This kind of binary tree traversal is called an inorder traversal, because we process a node in between its left and right subtrees. The other types of traversals aren’t meaningful for BSTs, but they are

Preorder traversal – Process the node before either of its left or right subtrees.

 void preorder(node* root, function<void(node*&)> visit) {
   if(!root) 
     return;

   visit(root);  

   if(root->left)
     preorder(root->left, visit);

   if(root->right)
     preorder(root->right, visit);
 }

Postorder traversal – Process the node after both its left and right subtrees.

 void postorder(node* root, function<void(node*&)> visit) {
   if(!root) 
     return;

   if(root->left)
     postorder(root->left, visit);

   if(root->right)
     postorder(root->right, visit);

   visit(root);  
 }