Module 1: C++ review and UML
For C++ I might ask you to implement some simple loops, or build some simple class. Hopefully nothing too tricky.
Module 2: Big-O Analysis, Vectors, Linked Lists, Stacks, and Queues
To start off with the Big-O stuff, we are interested in roughly describing how the runtime (or sometimes memory usage) of a function increases as the size of the input increases. For example, given this function:
bool is_subset(vector<int> a, vector<int> b) {
    for(int i : a) {
        bool found = false;
        for(int j : b) {
            if(i == j) {
                found = true;
                break;
            }
        }
        if(!found)
            return false;
    }
    return true;
}
what is the worst-case performance? (Usually we are interested in the worst-case, or sometimes best-case, runtime complexity.) In the worst case, the outer loop will run over all the elements of `a` (it will not return false early), and the inner loop will run over all the elements of `b` (it will not break early), so the runtime will be proportional to the size of `a` times the size of `b`. If we let `m = a.size()` and `n = b.size()`, then we say that the runtime complexity is \(O(mn)\).
What about this algorithm?
vector<int> append(vector<int> a, vector<int> b) {
    // Note: parentheses, not braces! vector<int> output{n} would create a
    // one-element vector containing n, not a vector of n elements.
    vector<int> output(a.size() + b.size());
    int k = 0;
    for(int i : a)
        output.at(k++) = i;
    for(int j : b)
        output.at(k++) = j;
    return output;
}
The first loop takes time proportional to the size of `a`, while the second loop takes time proportional to the size of `b`; but because the loops are not nested, but one after another, the runtime is \(O(m + n)\).
Sometimes the runtime is more tricky to figure out. Consider this useless loop:
vector<int> a = ...;
for(int i = 0; i < a.size(); i++)
    for(int j = i; j < a.size(); j++)
        a.at(j)++;
What is the runtime of this? The outer loop runs n times, but the time taken by the inner loop depends on the outer loop: on iteration i, the inner loop runs \(n - i\) times. In fact, we have a sum of the form
\[ S = n + (n-1) + (n-2) + \cdots + 2 + 1 \]
We showed that we could rearrange this sum as
\[ S = \frac{n(n+1)}{2} = \frac{1}{2}n^2 + \frac{1}{2}n \]
If we Big-O-ify the final term, we have time proportional to \(O(n^2)\). Remember that when we put a polynomial inside a big-O, everything but the highest-degree term disappears.
On the other hand, suppose we have this loop:
vector<int> a = ...;
for(int i = 1; i < a.size(); i *= 2)
    for(int j = 0; j < i; j++)
        a.at(j)++;
This one is easier to analyze if we assume that `a.size()` is a power of 2. E.g., if `a.size() == 256` then the outer loop will run exactly 8 times, for the values i = 1, 2, 4, 8, 16, 32, 64, 128. The inner loop runs exactly i times, so now we have a sum of the form
\[ S = 1 + 2 + 4 + \cdots + 2^l \]
where \(2^l = n/2\) is the largest value that i takes (128 in our example).
This is known as a geometric sum. We can solve it by multiplying by 2, and then subtracting:
\[ 2S - S = (2 + 4 + \cdots + 2^{l+1}) - (1 + 2 + \cdots + 2^l) = 2^{l+1} - 1 \]
because all of the middle terms cancel out. But remember that the largest value of i is \(2^l = n/2\), so \(2^{l+1} = n\) and we really have \(S = n - 1\), that is, the runtime is still \(O(n)\)!
Finally, as an example of a logarithmic algorithm, the venerable binary search:
int bin_search(vector<int> v, int t) {
    int begin = 0, end = v.size()-1;
    while(begin <= end) { // <=, so a one-element range is still checked
        int mid = (begin + end) / 2;
        if(t < v.at(mid))
            end = mid - 1;
        else if(t > v.at(mid))
            begin = mid + 1;
        else
            return mid;
    }
    return -1;
}
In the worst case, every time through the loop will roughly cut the search space in half, so the loop will execute roughly \(\log_2 n\) times, giving a runtime of \(O(\log n)\).
The “order” of Big-O classes is roughly (from best to worst)
\[ O(1) < O(\log n) < O(n) < O(n \log n) < O(n^2) < O(n^3) < O(2^n) \]
In general, if we add together two different classes, the worse class dominates: e.g., \(O(n^2) + O(n) = O(n^2)\).
Vectors
For vectors we also looked at amortized analysis, where we average the complexity over many operations. This allows for a few slower operations to be canceled out by many faster operations.
We tried to implement vector in such a way that `push_back` was amortized constant time. We found that adding a fixed number of elements to the vector was not sufficient; we had to double the size of the vector, so that as copy operations became more expensive, they became proportionally less frequent.
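That doubling strategy can be sketched as a toy growable array (a minimal illustration, not the real `std::vector`; the name `my_vec` is just for this sketch):

```cpp
#include <cstddef>

// Toy growable array demonstrating amortized-constant push_back.
struct my_vec {
    int* data = nullptr;
    std::size_t size = 0;      // number of elements in use
    std::size_t capacity = 0;  // allocated slots

    void push_back(int v) {
        if (size == capacity) {
            // Double the capacity (1 on the first push), copy over, free the old buffer.
            std::size_t new_cap = capacity == 0 ? 1 : capacity * 2;
            int* new_data = new int[new_cap];
            for (std::size_t i = 0; i < size; i++)
                new_data[i] = data[i];
            delete[] data;
            data = new_data;
            capacity = new_cap;
        }
        data[size++] = v;
    }

    ~my_vec() { delete[] data; }
};
```

Pushing n elements triggers copies of total size \(1 + 2 + 4 + \cdots < 2n\), so the average cost per push is constant.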
An interesting puzzle: suppose we implement vector as a linked list of arrays. When one array fills up, we create a new one and add it to the end of the list.
How big should the arrays be? Should their size be constant, or should it increase as they get later in the list? If so, how should they increase: by a constant amount (e.g., + 100), by doubling, or something else?
How does our choice for the previous question affect how long the list is? That is, for a size-n vector, how many nodes will be in the list?
Linked lists
class list {
public:
    struct node {
        int value;
        node* next;
    };

    node* head = nullptr;
    node* tail = nullptr;
};
This is the basis of a singly-linked list of ints. To add a new element to the end, we have to handle the special case where the list is empty:
void add_end(list& l, int v) {
    if(!l.head) // note: head == tail is also true for a one-element list!
        l.head = l.tail = new list::node{v, nullptr};
    else {
        l.tail->next = new list::node{v, nullptr};
        l.tail = l.tail->next;
    }
}
Similarly for adding to the beginning:
void add_begin(list& l, int v) {
    if(!l.head)
        l.head = l.tail = new list::node{v, nullptr};
    else
        l.head = new list::node{v, l.head}; // new node points at the old head
}
Removing the first element is easy:
void remove_begin(list& l) {
    list::node* n = l.head;
    l.head = l.head->next;
    if(!l.head)
        l.tail = nullptr; // the list is now empty
    delete n;
}
To do any processing on the whole list, we have to walk down it, following `next` pointers until we hit `nullptr`:
int length(list& l) {
    int len = 0;
    list::node* c = l.head;
    while(c) {
        c = c->next;
        len++;
    }
    return len;
}
The tail pointer is just a convenience for adding elements to the end; we can also add elements to the end without using it, just by walking down the list:
// Add to the end, if we don't have tail pointers
void add_end(list& l, int v) {
    if(!l.head)
        l.head = new list::node{v, nullptr};
    else {
        list::node* cur = l.head;
        while(cur->next)
            cur = cur->next;
        cur->next = new list::node{v, nullptr};
    }
}
Runtime complexity: adding new nodes is \(O(1)\) if we have both head and tail pointers. Adding to the end without tail pointers is \(O(n)\). Accessing the i-th element takes \(O(i)\) time. Inserting after a given node takes \(O(1)\) time:
void insert_after(list::node* n, int v) {
    n->next = new list::node{v, n->next};
}
In a doubly linked list every node has two pointers, one next and one prev. Inserting before, moving to the previous node, etc. all become constant-time operations.
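A minimal sketch of a doubly-linked node and a constant-time insert-before (the names `dnode` and `insert_before` are hypothetical, not from the lecture code):

```cpp
// Doubly-linked node: each node knows both of its neighbors.
struct dnode {
    int value;
    dnode* prev;
    dnode* next;
};

// Insert a new value immediately before an existing node, in O(1).
// Returns the newly created node.
dnode* insert_before(dnode* n, int v) {
    dnode* m = new dnode{v, n->prev, n};
    if (n->prev)
        n->prev->next = m; // splice into the left neighbor, if any
    n->prev = m;
    return m;
}
```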
Stacks
A stack is a data structure like a list where elements can only be inserted and removed at one end. It is easily implemented on top of a linked list, by having `push` add an element to the front of the list, and `pop` remove from the front of the list.
Stacks are useful for simulating function calls, and also for matching bracketing symbols (like parentheses or HTML tags). We push any opening symbol onto the stack, and when we see a closing symbol, we check it against the top of the stack and pop if it matches.
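The bracket-matching idea can be sketched with `std::stack` (a minimal version handling only `()`, `[]`, and `{}`):

```cpp
#include <stack>
#include <string>

// Returns true if every (, [, { in s is properly matched and nested.
bool balanced(const std::string& s) {
    std::stack<char> st;
    for (char c : s) {
        if (c == '(' || c == '[' || c == '{') {
            st.push(c); // opening symbol: remember it
        } else if (c == ')' || c == ']' || c == '}') {
            char open = c == ')' ? '(' : c == ']' ? '[' : '{';
            if (st.empty() || st.top() != open)
                return false; // closing symbol with no matching opener
            st.pop();
        }
    }
    return st.empty(); // any leftover openers are unmatched
}
```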
Queues
A queue is a list-like structure where elements can only be inserted at one end, and removed from the other end. Queues are useful when we want to process a number of things “fairly”, with everyone taking turns. Queues are easily implemented on top of singly-linked lists, by having `enqueue` add new elements onto the end of the list, and `dequeue` remove them from the beginning (remember that removing elements from the end of a singly-linked list takes \(O(n)\) time, so we don’t want to do that).
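A minimal sketch of a queue built this way, with head/tail pointers as in the list above (the struct name `queue` is just for illustration; no empty-check on `dequeue` for brevity):

```cpp
// Minimal queue on a singly-linked list: enqueue at the tail, dequeue at the head.
struct queue {
    struct node { int value; node* next; };
    node* head = nullptr;
    node* tail = nullptr;

    void enqueue(int v) {            // O(1) thanks to the tail pointer
        node* n = new node{v, nullptr};
        if (!head)
            head = tail = n;
        else {
            tail->next = n;
            tail = n;
        }
    }

    int dequeue() {                  // O(1): remove from the head
        node* n = head;
        int v = n->value;
        head = n->next;
        if (!head) tail = nullptr;   // queue became empty
        delete n;
        return v;
    }
};
```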
Module 3: Sorting and Searching
Searching
If we want to find an element in an unordered sequence, then the best we can do is to check every element, one by one.
If we know that the sequence is already sorted, then we can do better using binary search: at each step, compare the target to the “middle” element of the current range. If the target is less than it, then we can throw out the upper half of the range; if greater, the lower half. Thus, at each step we halve the size of the search space.
An interesting optimization to consider is if we can not only compare the elements, but also subtract them: we could use this to get an approximation of the “derivative” of the sequence at a position, an estimate of how fast it is increasing, and use this to maybe accelerate our search? (This is basically the search equivalent to Newton’s root-finding method.)
Puzzle: can you build a two-dimensional version of binary search, that will try to find a location in a 2D grid which is “sorted” (increases both from left-to-right and from top-to-bottom) that contains a given value?
Sorting
If binary search requires sorted sequences, how do we sort them?
Quadratic sorting
Insertion sort: keep two sequences, unsorted (initially full) and sorted (initially empty). For every element in the unsorted sequence, remove it and insert it into the sorted sequence in the proper (sorted) location.
Bubble sort: compare-and-swap adjacent elements, starting at the beginning and proceeding to the end. After one “pass”, the largest element will be in the last position. For the next pass, you can stop at position n-1, then at n-2, and so forth until there’s only one element left (a one-element sequence is always sorted).
Selection sort: maintain a sorted and unsorted sequence. Find the smallest element in the unsorted sequence and place it at the end of the sorted sequence.
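As a concrete example of a quadratic sort, here is a sketch of insertion sort operating in place on a vector: the “sorted sequence” is just the prefix `v[0..i)`, and each pass inserts `v[i]` into it.

```cpp
#include <cstddef>
#include <vector>

// In-place insertion sort: v[0..i) is the sorted part; each pass inserts
// v[i] into its proper position within that prefix.
void insertion_sort(std::vector<int>& v) {
    for (std::size_t i = 1; i < v.size(); i++) {
        int x = v[i];
        std::size_t j = i;
        while (j > 0 && v[j - 1] > x) { // shift larger elements one slot right
            v[j] = v[j - 1];
            j--;
        }
        v[j] = x; // drop x into its sorted position
    }
}
```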
Sub-quadratic sorting
Merge-sort: divide the input sequence in half, recursively sort both halves, and then merge them together. The merge step takes \(O(n)\) time, giving this a runtime of \(O(n \log n)\). Note that this is also the worst-case runtime: merge sort always takes \(O(n \log n)\) time, regardless of what the input looks like.
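A sketch of merge sort under this description (copying into new vectors for clarity, rather than sorting in place):

```cpp
#include <cstddef>
#include <vector>

// Merge two already-sorted vectors; the merge itself is O(n).
static std::vector<int> merge(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size())
        out.push_back(a[i] <= b[j] ? a[i++] : b[j++]); // take the smaller front
    while (i < a.size()) out.push_back(a[i++]);        // drain leftovers
    while (j < b.size()) out.push_back(b[j++]);
    return out;
}

std::vector<int> merge_sort(const std::vector<int>& v) {
    if (v.size() <= 1)
        return v; // base case: already sorted
    std::size_t mid = v.size() / 2;
    std::vector<int> left(v.begin(), v.begin() + mid);
    std::vector<int> right(v.begin() + mid, v.end());
    return merge(merge_sort(left), merge_sort(right));
}
```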
Quick-sort: works by picking an element of the sequence to be the pivot. The sequence is then partitioned into two sections: the elements less than the pivot, and the elements greater than the pivot (with the pivot between the two partitions). The choice of the pivot affects how “balanced” the two partitions are, which is important because we then recursively quicksort the two partitions. If they are not balanced, the performance may degrade toward \(O(n^2)\).
Sub-sub-quadratic sorting
We can exploit extra information to sort in better than \(O(n \log n)\), even linear, time. E.g., to sort an array of n distinct integers in the range \(a\) to \(a + n - 1\), just do

for(int i = a; i < a + n; i++)
    arr[i - a] = i;
Usually this means knowing something about the distribution of values.
Module 4: Binary trees
Binary search trees
BSTs encode the search structure of binary search, so that we can insert/delete elements while maintaining an \(O(\log n)\) lookup time. Inserting a new element requires finding the correct location in the tree for it, which is equivalent to a lookup, and likewise deletion requires finding the element; hence both insert and delete also take \(O(\log n)\) time, provided the tree stays balanced.
A BST is a binary tree with the BST condition: all the elements of the left child must be less than the root, and all the elements of the right child must be greater than the root.
struct node {
    node* left;
    node* right;
    node* parent; // Optional
    int key;
};
To find a specific node, we use a procedure very similar to binary search:
node* find(node* root, int key) {
    if(root == nullptr)
        return nullptr;
    else if(root->key == key)
        return root;
    else if(key < root->key)
        return find(root->left, key);
    else
        return find(root->right, key);
}
(I’ve simplified this by removing the reference-magic.)
Inserting a node is easy: just find the empty space where it belongs, and then put it there.
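A sketch of that insert, mirroring the structure of `find` (self-contained here, so the node struct is repeated with the parent pointer omitted for brevity):

```cpp
// BST node, as above but without the optional parent pointer.
struct node {
    node* left;
    node* right;
    int key;
};

// Walk down exactly as in find; attach the new node at the null link
// where the search falls off the tree. Returns the (possibly new) root.
node* insert(node* root, int key) {
    if (root == nullptr)
        return new node{nullptr, nullptr, key}; // found the empty spot
    if (key < root->key)
        root->left = insert(root->left, key);
    else if (key > root->key)
        root->right = insert(root->right, key);
    // key == root->key: already present, do nothing
    return root;
}
```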
Deleting a node is more difficult: there are two easy-ish cases and one hard case:
If the node has no children, then we can just remove it (setting the pointer in its parent to `nullptr`):

if(target->parent->left == target)
    target->parent->left = nullptr;
else
    target->parent->right = nullptr;
delete target;
If the node has only one child (left or right), then we replace the node with its child:

node* child = target->left ? target->left : target->right;
if(target->parent->left == target)
    target->parent->left = child;
else
    target->parent->right = child;
delete target;
If the node has two children then we have to find a node to replace it, which is still in between its left and right children. We can choose either the predecessor (largest value in left subtree) or successor (smallest value in right subtree). Whichever we choose, this value is between the left and right children (if we were to arrange the elements of the tree in order, the predecessor and successor would be the values immediately before/after it), and thus it can “adopt” them as its own children. It also is guaranteed to only have one child (being the rightmost/leftmost node in a subtree), and thus deleting it is easy.
Hence, we swap the target with its pred/succ, and then delete the pred/succ (which, in turn, just moves its child up).
To find the largest element in a binary tree, just go right as far as possible. To find the smallest, go left.
Tree traversals:
If we walk through the tree, printing the root, then its left subtree, then its right subtree, then we have a preorder traversal.
If we walk through the tree, printing the left subtree, then the root, then the right subtree, then we have an inorder traversal. For a BST, this will print all values in the tree in ascending order.
If we walk through the tree, printing the left, right subtrees and then finally the root, we have a postorder traversal.
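For example, an inorder traversal can be sketched like this (collecting keys into a vector instead of printing; the node struct is repeated so the sketch is self-contained):

```cpp
#include <vector>

struct node {
    node* left;
    node* right;
    int key;
};

// Inorder traversal: left subtree, then root, then right subtree.
// On a BST this visits the keys in ascending order.
void inorder(node* root, std::vector<int>& out) {
    if (!root) return;
    inorder(root->left, out);
    out.push_back(root->key);
    inorder(root->right, out);
}
```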
Tree balancing
A BST only offers log-time performance if the tree remains relatively balanced. If the tree becomes heavily skewed toward the left or right, then it starts to act more like a linked list than a tree, and performance approaches linear. To solve this, various kinds of self-balancing BSTs have been proposed. We looked at three methods:
AVL trees maintain a height condition: at any node, the heights of its left/right subtrees can differ by at most one. After insertion/deletion we walk back up the tree to the root, “fixing” nodes which no longer meet the condition.
Splay trees have no explicit balance requirement: instead, after every operation (including `find`!) they walk up to the root performing a splay. The splay has the effect of moving the target node all the way to the root.
All three methods rely on tree rotations as their building blocks. A tree rotation is a way of exchanging a node with its parent that preserves the tree order property. There are two kinds of rotations, depending on whether the node in question is a left or right child of its parent:
        b                      a
       / \                    / \
      /   \                  /   \
     a     z                x     b
    / \                          / \
   /   \                        /   \
  x     y                      y     z
Note that the relative order of x, a, y, b, z is maintained (i.e., both versions have the same inorder traversal).
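The right rotation from the diagram (left picture to right picture) can be sketched with child pointers only; the name `rotate_right` is just for illustration, and the function returns the new subtree root so the caller can re-link it:

```cpp
struct node {
    node* left;
    node* right;
    int key;
};

// Right rotation: b's left child a becomes the new root of this subtree,
// and a's old right subtree y is adopted as b's new left subtree.
// The inorder sequence x, a, y, b, z is unchanged.
node* rotate_right(node* b) {
    node* a = b->left;
    b->left = a->right; // y moves over to b
    a->right = b;       // b becomes a's right child
    return a;           // a is the new subtree root
}
```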
Binary heaps
A binary heap is a complete binary tree which has the heap order condition: The root must be greater than both its children. (This is called a max heap; a min heap requires that the root be less than both its children.)
By complete, we mean that every row except maybe the last one is full, and the last one fills in from left to right, with no gaps.
The main advantage of a binary heap is that it always gives us easy access to the largest (smallest) element in the heap: just look at the root of the tree. The two key operations are
Extract: removes the maximum from the heap. We replace the root with the “last” (rightmost) node in the bottom row of the tree, and then shift it down, swapping it with one of its children, until the heap order property is restored.
Insert: add a new value as the rightmost node in the bottom row, and then shift-swap it up until the heap order property is restored.
Both of these, in the worst case, walk all the way up/down the tree, and because the tree is complete it is always balanced, thus they take \(O(\log n)\) time.
Because of the requirement that the heap be a complete bintree, we can store it quite simply in an array: just store the successive rows (top to bottom) left to right in the array. This makes moving up/down the tree a simple matter of index arithmetic: roughly multiplying or dividing the index by two.
Interestingly, we can construct a heap from an unordered array in \(O(n)\) time. If we just repeatedly did Insert, it would take \(O(n \log n)\) time. Instead, we take the array as-is and then run the shift-down operation on every element of it, starting from the end (the bottom of the heap). Because elements toward the end can only shift down a few levels, this results in a minimal number of swaps.
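Both the shift-down operation and the \(O(n)\) construction can be sketched over an array-backed max heap. This assumes 0-based indexing, so the children of index i live at 2i+1 and 2i+2; the names `sift_down` and `build_heap` are assumptions, not the course's code.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Restore the max-heap order below index i by repeatedly swapping h[i]
// with its larger child until neither child is bigger.
void sift_down(std::vector<int>& h, std::size_t i) {
    std::size_t n = h.size();
    while (true) {
        std::size_t largest = i, l = 2 * i + 1, r = 2 * i + 2;
        if (l < n && h[l] > h[largest]) largest = l;
        if (r < n && h[r] > h[largest]) largest = r;
        if (largest == i) break;     // heap order restored
        std::swap(h[i], h[largest]); // move the larger child up
        i = largest;
    }
}

// O(n) heap construction: sift down every internal node, bottom-up.
// (Leaves need no sifting, so we start at the last internal node.)
void build_heap(std::vector<int>& h) {
    for (std::size_t i = h.size() / 2; i-- > 0; )
        sift_down(h, i);
}
```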
Heaps can be used to implement priority queues, where elements are enqueued with a priority and dequeuing always gives the elements with the highest priority.
Module 5: Hash functions and Hash tables
Hash function: transforms a key into an integer in the range \([0,m)\). Should be deterministic, fast, have a good distribution over the output range, and small differences in keys should be amplified into large differences in hashes.
Remainder treats the input string (byte sequence) like a base-256 number, and then takes that modulo m. m should be a prime, not too close to a power of 2. Note that if the keys are already numbers, then you just take them mod m and you’re done.
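A sketch of the remainder hash, reducing mod m at every step so the intermediate value never overflows (the function name `remainder_hash` is just for illustration):

```cpp
#include <cstdint>
#include <string>

// Remainder hash: treat the bytes of the key as the digits of a base-256
// number, and compute that number mod m. Reducing inside the loop keeps
// the running value smaller than 256 * m, so it never overflows.
std::uint64_t remainder_hash(const std::string& key, std::uint64_t m) {
    std::uint64_t h = 0;
    for (unsigned char c : key)
        h = (h * 256 + c) % m;
    return h;
}
```

Note how deterministic the scheme is: the same key and table size always produce the same slot.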
Multiplicative multiplies the return of the remainder hash by a floating-point constant, takes the fractional portion of the result, and then multiplies that by m and rounds it down (resulting in a value between 0 and \(m-1\)). The multiplier \(A\) has an effect on how well the algorithm performs. (I don’t expect you to memorize the multiplier we used.)
If we have two different hash values then we know that their keys must be different. If we have two identical hash values then we don’t know anything, because collisions (different keys that hash to the same value) are possible.
Handling collisions:
Chaining: keep a linked list in every table entry, just add colliding entries to the list.
Open addressing: if an entry is full, look for an open entry somewhere else in the table, using a probe sequence. Linear probing looks in the next entry, quadratic probing looks in the \(a_0 i + a_1 i^2\) entry, double-hashing looks in the \(\mathrm{hash}_0(k) + i \mathrm{hash}_1(k)\) entry.
The load factor \(\alpha\) of a hash table is the ratio of the number of elements stored \(n\) to \(m\), the size of the table. With open addressing, the maximum load factor is 1.0.
Hash functions should be
Deterministic: always hash a particular key to the same value
Uniformly-distributed over the output range: no hash value should be dramatically more common than any other.
Amplify small differences in keys (avalanche effect)
Low probability of collision
You can remember these as DUAL.
Module 6: String searching and matching
Tokenization/lexical analysis: the process of splitting a string into tokens. Can be easy or hard, depending on the types of tokens.
Recognition/parsing/semantic analysis: taking a token sequence and determining if it is a valid sentence in a given grammar. This involves constructing a “derivation” (parse tree) from the start symbol of the grammar to the actual tokens. E.g., here’s a grammar for regular expressions:
re -> CHAR
re -> "."
re -> re "*"
re -> re re
re -> re "|" re
re -> "(" re ")"
Is `(ab)|(ab)*` a valid `re`? Construct a derivation to find out:
…
We can write a recognizer for a grammar to test this. The only tricky rules are the ones that are recursive, and particularly those that involve multiple recursion. We don’t know where the “split” might occur, so we have to test all the possibilities.
Parsing: building an expression tree of some sort from a token sequence. This follows the same model as for recognition, except that instead of just returning a true/false answer, we return a tree (if the parse was successful) or false (if it was not).
Module 7: Graphs and graph theory
Graph: edges and vertices. Directed vs. undirected edges. Weighted vs. unweighted edges.
Graph representations:
Adj. list: store an array of lists, one list for each vertex. The list stores pointers to the vertices that are adjacent (connected by outbound edges) to that one.
Adj. matrix: store an n-by-n `bool` matrix (where n is the number of vertices). To determine whether there is an edge from x to y, look in `matrix[x][y]`.
Breadth-first search: finds shortest paths (in terms of path length) from a source node to as many other nodes as are reachable from it, by adding newly discovered nodes to a queue.
Depth-first search: finds paths (not necessarily shortest) from a source node to all reachable nodes. DFS has various useful properties that can expose the structure of the graph (e.g., whether or not it has cycles) while we are exploring it. Example…
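The BFS description above can be sketched over an adjacency list, recording the shortest path length (in edges) from the source to every reachable node:

```cpp
#include <queue>
#include <vector>

// BFS over an adjacency list. Returns dist, where dist[v] is the shortest
// path length (number of edges) from src to v, or -1 if v is unreachable.
std::vector<int> bfs(const std::vector<std::vector<int>>& adj, int src) {
    std::vector<int> dist(adj.size(), -1);
    std::queue<int> q;
    dist[src] = 0;
    q.push(src);
    while (!q.empty()) {
        int u = q.front();
        q.pop();
        for (int v : adj[u])
            if (dist[v] == -1) {      // newly discovered node
                dist[v] = dist[u] + 1;
                q.push(v);
            }
    }
    return dist;
}
```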