Balanced Binary Search Trees

AVL trees

An AVL tree is a binary search tree that is height-balanced: for any node, the heights of its left and right children differ by no more than 1. That is,

abs(height(node->left) - height(node->right)) <= 1

The height of a node is defined, once again, recursively: -1 for the empty tree, and one more than the taller child’s height otherwise. But we don’t want to be constantly recalculating the heights from scratch, so we store them in the nodes themselves:

struct node {
  int key;
  node* left;
  node* right;
  node* parent;
  int height;  // height of the subtree rooted at this node
};

int height(node* root) {
  return root == nullptr ? -1 : root->height;
}

(As before, the empty tree, nullptr, has a height of -1.)

Note, in particular, that if you start at a leaf and go upward, the height of the nodes you visit always increases.

In order to build an AVL tree, we have to store the height of each node in the node (technically we can get away with just storing the differences in heights between the left and right subtrees, which will always be -1, 0, or +1).

When we insert a node, we may break the height balance property, but we can use rotations to move things around to restore it. After an insert, we walk back up the path to the root, updating height values and doing rotations as necessary to fix things.

To understand how the different tree operations work on an AVL tree, let’s examine how rotations affect the heights of the subtrees by looking at an example:

Here we have an imbalance at p, caused by c having a height of \(h+2\) while Z has a height of \(h\). This could be the result of an insertion into X, increasing its height, or maybe a removal from Z. (We’ll see in the next example that a removal from Y would result in a different situation.)

To fix the imbalance we simply rotate c with its parent p. This method is used whenever the imbalance is on the “outside”:

The other possibility is when Y, rather than X, is the taller of c’s subtrees. This is an “inside” imbalance, and a single rotation is not enough to fix it:

Instead, we have to perform a double rotation: first rotate c with b, then rotate c with a:

Note that it doesn’t make any difference whether X or Y is taller: the double rotation balances things out even if Y is 1 taller than X. Thus, we don’t actually care about X vs. Y; we never need to look at them or their heights, as they are handled automatically by the rotation of c.

(The right-inside case is just the mirror image of this.)
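Concretely, here is what a single rotation might look like in code (a sketch using the node struct above; the names rotate_right and rotate_left are mine, and the caller is responsible for attaching the returned node to p’s old parent, as the insertion code below does):

node* rotate_right(node* p) {
  node* c = p->left;

  // c's right subtree (Y in the diagrams) moves over to become p's left
  p->left = c->right;
  if(c->right) c->right->parent = p;

  // c moves up into p's place; p becomes c's right child
  c->right = p;
  c->parent = p->parent;
  p->parent = c;

  // Only p's and c's heights can have changed; p is now lower, so update it first
  p->height = 1 + max(height(p->left), height(p->right));
  c->height = 1 + max(height(c->left), height(c->right));
  return c;
}

// Mirror image of rotate_right
node* rotate_left(node* p) {
  node* c = p->right;
  p->right = c->left;
  if(c->left) c->left->parent = p;
  c->left = p;
  c->parent = p->parent;
  p->parent = c;
  p->height = 1 + max(height(p->left), height(p->right));
  c->height = 1 + max(height(c->left), height(c->right));
  return c;
}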

Insertion

We begin by inserting the new node as usual. Any violations of the height-balance property must be on the path from the root of the tree to the new node (because that is the only path that has been changed). So we walk up the path to the root (following parent pointers), checking heights as we go:

// node n has just been inserted into the tree
node* p = n->parent;

while(p) {
  // Update this node's height
  p->height = 1 + max(height(p->left), height(p->right));

  if(abs(height(p->left) - height(p->right)) > 1) {
    // Unbalanced height, fix (root here is the tree's root pointer)
    if(p->parent == nullptr)
        root = fix_height(p);
    else if(p == p->parent->left)
        p->parent->left = fix_height(p);
    else
        p->parent->right = fix_height(p);
  }

  p = p->parent;
}

The function fix_height(node*) (taking a pointer to the node at which the imbalance occurs) should check for the above situations and apply a rotation or double-rotation as needed to repair the height.
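For instance, fix_height might look something like this (a sketch, using the rotate_right/rotate_left helpers sketched earlier; it distinguishes the outside and inside cases by comparing the heights of the taller child’s own subtrees):

node* fix_height(node* p) {
  if(height(p->left) > height(p->right)) {
    node* c = p->left;
    if(height(c->left) >= height(c->right)) {
      return rotate_right(p);     // left-outside: single rotation
    } else {
      p->left = rotate_left(c);   // left-inside: double rotation
      return rotate_right(p);
    }
  } else {
    node* c = p->right;
    if(height(c->right) >= height(c->left)) {
      return rotate_left(p);      // right-outside: single rotation
    } else {
      p->right = rotate_right(c); // right-inside: double rotation
      return rotate_left(p);
    }
  }
}

Note that the >= (rather than >) matters for deletion, where the taller child’s subtrees may have equal heights; a single rotation handles that case correctly.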

The code

    if(p == p->parent->left)
        p->parent->left = fix_height(p);
    else
        p->parent->right = fix_height(p);

is a fairly common pattern: in order to modify the tree in-place, we have to figure out how p is related to its parent: is it the left or right child? We then replace that child specifically.

(It’s possible to build AVL trees without parent pointers, by exploiting the fact that the same recursion that goes down into the tree must go back up, as it returns. We can do the fixup operation during the recursion, although getting the order right is tricky.)
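A sketch of that recursive approach, assuming fix_height is as above (the parent fields go unused here, and would be dropped entirely in a real parent-free design):

node* insert(node* t, int key) {
  if(t == nullptr)
    return new node{key, nullptr, nullptr, nullptr, 0};

  if(key < t->key)
    t->left = insert(t->left, key);
  else
    t->right = insert(t->right, key);

  // The fixup happens as the recursion unwinds, i.e., from the new
  // node back up toward the root
  t->height = 1 + max(height(t->left), height(t->right));
  if(abs(height(t->left) - height(t->right)) > 1)
    t = fix_height(t);
  return t;
}

which would be called as root = insert(root, key).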

Insertion example: insert 1 2 3 4 5 6 7 8 into an empty tree.

Deletion

Deletion is handled the same way; the only difference is conceptual: instead of thinking of the heights of the subtrees as \(h\) and \(h+1\), think of them as \(h-1\) and \(h\). The actual method for repairing the tree is the same: start at the modified node, and walk up to the root, fixing imbalanced nodes as you go. Note that in the two-child case, where we swap the target node with its predecessor/successor and then (recursively) delete that one-child node, we start the “fixing” at the one-child node’s former position, not at the original location.

Deletion example: delete 4 from the above tree. Delete 8, 1, 6, 2.

Splay trees

A splay tree is a BST which uses amortized rebalancing. After an insert, a delete, or even a find, we perform a number of splay operations, which tend to make the tree more balanced. Note that while AVL trees only rebalance the tree when its structure is modified, splay trees rebalance even when we are just find-ing a node; they take every opportunity to improve the structure of the tree, but they tend not to do very much at once.

A splay is divided into three cases, depending on where the node p being splayed sits relative to its parent and grandparent:

Zig: p’s parent is the root. Rotate p with its parent.

Zig-Zig: p and its parent are both left children (or both right children). Rotate the parent with the grandparent first, then rotate p with its parent.

Zig-Zag: p is a left child and its parent is a right child (or vice versa). Rotate p with its parent, then rotate p with its new parent.

To perform a full splay operation, we simply repeat the above until p is the root of the tree (has no parent). A full splay is performed after each find, and thus after every insert and remove, too. Each splay tends to make the tree slightly more balanced, while at the same time moving the node p up to the root (so that subsequent finds for it will take O(1) time). If p was a deep left child, then the full splay will have moved a good number of nodes over to the right side.
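Here is a minimal sketch of a full splay in code, assuming the same node struct with parent pointers (splay trees don’t need the height field), with a rotate_up helper (the name is mine) that rotates a node with its parent:

// Rotate p with its parent, fixing all child/parent links
void rotate_up(node* p) {
  node* q = p->parent;
  node* g = q->parent;
  if(p == q->left) {            // right rotation
    q->left = p->right;
    if(p->right) p->right->parent = q;
    p->right = q;
  } else {                      // left rotation
    q->right = p->left;
    if(p->left) p->left->parent = q;
    p->left = q;
  }
  q->parent = p;
  p->parent = g;
  if(g)
    (q == g->left ? g->left : g->right) = p;
}

// Splay p all the way up to the root
void splay(node*& root, node* p) {
  while(p->parent != nullptr) {
    node* q = p->parent;
    node* g = q->parent;
    if(g == nullptr)
      rotate_up(p);             // Zig
    else if((p == q->left) == (q == g->left)) {
      rotate_up(q);             // Zig-Zig: rotate the parent first,
      rotate_up(p);             // then p
    } else {
      rotate_up(p);             // Zig-Zag: rotate p twice
      rotate_up(p);
    }
  }
  root = p;
}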

Deletion for splay trees splays the parent of the removed node, moving it all the way to the root. This is based on the assumption that if you delete x from the tree, you will probably delete or otherwise access other values close to x in the future, so it will be good to move them close to the root. (And, of course, it helps to balance the tree.)

Splay trees support a number of other interesting operations, all based on the fact that a splay lets us move an arbitrary value to the root of the tree.

Amortized analysis of Splay Trees

Because a single splay operation does not completely balance the tree, but merely makes it “more balanced”, we need amortized analysis. We intend to show that the amortized cost of a find or insert operation is \(O(\log n)\); that is, that the tree is balanced “on average” over a series of operations. To do this, we will use a different style of amortized analysis: the potential method.

Like the accounting method, the potential method associates with a data structure an extra amount of “resources”, called potential. An operation can have an amortized cost higher than its real cost, adding the extra to the structure’s potential, or it can spend some potential and have its amortized cost be lower than its real cost. As with the accounting method, the potential method serves as a way of smoothing out the differences in real cost over a series of operations.

The difference between the potential method and the accounting method is that while the accounting method focuses on the amount of “credit” accumulated in the structure — or, more often, in particular parts of the structure — in the potential method, the emphasis is usually on the change in potential caused by an operation. Similarly, potential is usually associated with the entire structure, rather than individual elements of it.
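Concretely, if \(P\) is the potential before an operation and \(P'\) the potential after it, then

$$\text{amortized cost} = \text{real cost} + \Delta P = \text{real cost} + (P' - P)$$

Summed over a sequence of operations, the \(\Delta P\) terms telescope, so as long as the potential never drops below its starting value, the total amortized cost is an upper bound on the total real cost.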

We define the potential of a tree to be a measurement of its “balanced-ness”: define

$$\text{size}(t) = \text{number of nodes in } t$$

$$\text{rank}(t) = \log_2(\text{size}(t))$$

$$\text{Potential: } P(t) = \sum_{n \in t} \text{rank}(n)$$

P(t) will tend to be high for poorly-balanced trees, and low for well-balanced trees. To see why this is the case, note that for a well-balanced subtree, the rank will be roughly equal to its height. For an unbalanced tree, the height will be greater than the rank, closer to \(2^{\text{rank}(t)}\).
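As a small worked check, compare two three-node trees: a chain (1 → 2 → 3) and a perfectly balanced tree (2 at the root, 1 and 3 as leaves). Summing \(\log_2\) of each subtree’s size:

$$P(\text{chain}) = \log_2 3 + \log_2 2 + \log_2 1 \approx 1.58 + 1 + 0 = 2.58$$

$$P(\text{balanced}) = \log_2 3 + \log_2 1 + \log_2 1 \approx 1.58$$

Same nodes, but the less-balanced shape carries the higher potential.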

An important property of these definitions is that if c is a subtree of p, then \(\text{size}(c) \le \text{size}(p)\) and similarly \(\text{rank}(c) \le \text{rank}(p)\). It’s not possible for a subtree to have more nodes than the tree which contains it.


A perfectly-balanced tree will have a height roughly equal to its rank, while a badly unbalanced tree of the same size will have a height much larger than its rank.

To use the potential method, we calculate the change in potential ΔP for each of the cases in the splay operation. We are not yet looking at all the rotations done over the entire splay, just the change caused by a single Zig, Zig-Zig, or Zig-Zag step. If x is the target node and x' is the same node after the step, then working through each of the three cases shows that the change in potential is bounded by

$$\Delta P \le 2\,(\text{rank}(x') - \text{rank}(x))$$

The total change over a full splay operation is the sum over all of its steps. Each step’s bound has the form \(2\,(\text{ending rank of } x - \text{starting rank of } x)\), and the ending rank for one step is the starting rank for the next, so the sum telescopes: all the interior terms cancel, and the total reduces to just

$$\Delta P \le 2\,(\text{rank}(\text{root}) - \text{rank}(x))$$

which is \(O(\log n)\). The total amortized cost of \(m\) operations is therefore \(O(m \log n)\), and each individual operation has an amortized cost of \(O(\log n)\).
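To see the telescoping explicitly, write \(\text{rank}_i(x)\) for the rank of x after the i-th of k steps (notation introduced here for clarity):

$$\sum_{i=1}^{k} 2\,(\text{rank}_i(x) - \text{rank}_{i-1}(x)) = 2\,(\text{rank}_k(x) - \text{rank}_0(x)) = 2\,(\text{rank}(\text{root}) - \text{rank}(x))$$

since after the final step x is the root, and \(\text{rank}(\text{root}) = \log_2 n\).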

Tree traversals

Preorder, postorder, inorder — All of these are implicitly based on a stack (the function call stack used to implement recursion).
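For instance, inorder is just a short recursion; the “stack” is the chain of pending recursive calls (a sketch using the node struct from earlier):

void print_inorder(node* root) {
  if(root == nullptr)
    return;
  print_inorder(root->left);    // everything smaller, first
  cout << root->key << ' ';     // then this node
  print_inorder(root->right);   // then everything larger
}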

Levelorder — Lists the nodes of the tree level by level, left-to-right within each level. E.g., in a tree whose nodes are numbered top-to-bottom, left-to-right, the nodes would be processed in numerical order. In order to write a levelorder traversal, we need to somehow visit the nodes in order of increasing distance from the root.

In order to do this, we’ll use a queue to process nodes, instead of a stack:

void print_levelorder(node* root) {
  queue<node*> q;   // std::queue: push at the back, pop from the front
  q.push(root);

  while(!q.empty()) {
    node* n = q.front();
    q.pop();
    if(n != nullptr) {
      cout << n->key << ' ';
      q.push(n->left);    // children wait in line behind everything
      q.push(n->right);   // already in the queue
    }
  }
}

Let’s try this out and see if it works.

It works because queues make everybody “take turns”. So although we enqueue all the nodes of the next level while processing the current one, none of them get to start until all the nodes of the current level have been processed. We’ll use this later when we traverse graphs to visit nodes in order of increasing distance as well.