Review of last time

Heaps, giving fast, easy access to the largest/smallest element of a collection.

Binary search trees, which can be used to implement maps (more on those today).

The Map abstract data type

A map is an abstract data type that associates keys (of some type) with values (of some type). The simplest map implementation is perhaps the array, which associates integer keys with values of whatever the array type is. Generally, when we talk about maps, we mean a structure with sparse key allocation. In an array, by contrast, if we have key 1 and key 1000, we also have to allocate storage for keys 2, 3, …, 998, 999; arrays are dense in their storage. A map is ideally sparse, meaning that it only stores the keys that the user has explicitly placed in it. A binary search tree has sparse storage, because it only stores tree nodes for the values that we have insert-ed. Another way of putting this is that the space complexity of the data structure should be at most \(O(n)\), where n is the number of elements stored. (An array is \(O(k)\), where k is the largest key stored.)

We can use binary search trees to represent maps, provided that the key type supports an ordering (i.e., keys can be compared for <, > or ==). To write a completely generic binary search tree, where we can specify the type of the keys and values, we’d do something like this:

template<typename Key, typename Value>
class binary_tree {
  public:

    bool has_key(Key key) {
        if(!_root)
            return false;
        else {
            node* current = _root;
            while(current) {
                if(current->key <= key)
                    if(key <= current->key)
                        return true; 
                    else // current->key < key
                        current = current->right; 
                else
                    current = current->left;
            }
            return false;
        }
    }

    // Other operations: find, insert, remove

  private:
    struct node {
        Key key;
        Value value;
        node* left;
        node* right;
    };

    node* _root = nullptr;
};

Question: why have I written has_key to use the rather obscure if-else chain, instead of the more straightforward

if(current->key < key)
    ...
else if(current->key > key)
    ...
else // current->key == key
    ...

The answer is because this way, the Key type only needs to support one overloaded comparison operator: <=. If you are writing your own key class, you’ll appreciate only having to overload one operator, instead of two. If we wanted to really make our class generic, we could provide a third template parameter that you could use to specify a comparison function. You could then use this in situations where the key class does not include <=, and you don’t have access to the class to modify it.
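
As a minimal sketch of what that might look like (the parameter name Compare and the std::less_equal default are my choices, not part of the class above):

#include <functional>

template<typename Key, typename Value, typename Compare = std::less_equal<Key>>
class binary_tree {
  public:
    bool has_key(Key key) {
        node* current = _root;
        while(current) {
            if(!cmp(current->key, key))        // key < current->key, go left
                current = current->left;
            else if(!cmp(key, current->key))   // current->key < key, go right
                current = current->right;
            else                               // both <= hold, so the keys are equal
                return true;
        }
        return false;
    }

  private:
    struct node {
        Key key;
        Value value;
        node* left = nullptr;
        node* right = nullptr;
    };

    node* _root = nullptr;
    Compare cmp;   // cmp(a, b) should behave like a <= b
};

A caller that can't modify its key class can then pass a comparison object as the third template argument.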

This approach will work provided that the key supports ordering. But not every key type does. Furthermore, using a binary search tree implies that map lookup will be \(O(\log n)\) (assuming the tree stays balanced). On the other hand, using a binary search tree, with the ordering that it implies, gives us easy access to some other things: the largest and smallest elements, the ability (using an inorder traversal) to list the elements in order, etc. It turns out that we can actually get better runtime complexity, if we are willing to trade the ordering/comparison requirement on keys for something else: hashing.

Hashing

A hash of an object x, hash(x), is, at its most basic, a smaller representation of x, such that for any two objects x and y, if x == y then hash(x) == hash(y). (Equivalently: if the hashes differ, the objects must differ.)

A function that produces hashes is called a hash function. As an example of what we mean, consider the problem of comparing strings for equality. We could compare them character-by-character, but if we are going to be comparing the same set of strings over and over, maybe that’s too slow. Instead, we hash every string (where the hash of each string is smaller than the string itself) and compare the hashes.
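
As a sketch of the idea (the struct and the use of std::hash are my own illustration, not part of the notes): compute each string's hash once, and only fall back to the slow character-by-character comparison when the hashes agree.

#include <string>
#include <functional>

// Compare precomputed hashes first; only if they match do we do the
// slower character-by-character comparison.
struct hashed_string {
    std::string text;
    std::size_t hash;

    hashed_string(std::string s)
        : text(std::move(s)), hash(std::hash<std::string>{}(text)) { }
};

bool equal(const hashed_string& a, const hashed_string& b) {
    if(a.hash != b.hash)
        return false;           // different hashes: definitely different strings
    return a.text == b.text;    // same hash: could still be a coincidence (collision)
}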

For strings, we’re going to divide hashing into a two-step process: first, convert the string (an arbitrary-length key) into a fixed-size integer; then, apply a hash function to that integer.

How can we map strings to ints? We need a function that combines the character values of the string into a single integer.

The best approach is to treat each character as a digit in a base-\(2^8\) number, and then take the result modulo \(2^{32}\) (or whatever the size of an int is).

To compute this, we would calculate the sum

$$(256^0\, \mathtt{s[0]} + 256^1\, \mathtt{s[1]} + \cdots) \bmod 2^{32}$$

(Note that instead of waiting until the end to compute the mod, we can perform it after each addition and get the same result.)

To implement this, we can use Horner’s rule to avoid having to compute the powers of 256. Horner’s rule says that we can compute

$$a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n$$

at some \(x\) by rearranging this into

$$x(x(\ldots(x(x(a_n) + a_{n-1}) + a_{n-2}) \ldots) + a_1) + a_0$$

We start with \(a_n\) and then at each step we multiply by \(x\) and then add in the next lower coefficient.

In other words, we run the loop

auto result = 0;
for(int i = n; i >= 0; i--)              // a[0..n] are the coefficients
    result = (result * x + a[i]) % m;    // evaluate at x, keeping everything mod m

For our purposes, this means that we can compute the “value” of a string as a base-256 number, mod \(m\) via

unsigned int result = 0;
for(int i = s.size()-1; i >= 0; i--)
    result = (result * 256 + s.at(i)) % m;

(Bear in mind that you might need to adjust the value of s.at(i) if extended characters are in play!)

Note that this conversion is not a hash function; it is just a way of turning an arbitrary-length key (i.e., a string) into a fixed-length key, because many hash methods work only with fixed-length keys.

In terms of hash functions, we have a couple of common choices: the remainder (division) method, \(\mathtt{hash}(k) = k \bmod m\) with \(m\) a (large, prime) table size, and the multiplicative method, \(\mathtt{hash}(k) = \lfloor m \cdot \text{frac}(kA) \rfloor\) for some constant \(0 < A < 1\).
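
As a quick sketch of both on an integer key (the constant A below is an arbitrary choice of mine, not a required value):

#include <cmath>

// Remainder (division) method: m should be a large prime.
unsigned remainder_hash(unsigned k, unsigned m) {
    return k % m;
}

// Multiplicative method: take the fractional part of kA, then scale to the table size.
unsigned multiplicative_hash(unsigned k, unsigned m) {
    const double A = 0.6180339887;          // any constant 0 < A < 1
    double f = std::fmod(k * A, 1.0);       // frac(kA)
    return static_cast<unsigned>(f * m);    // floor(m * frac(kA))
}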

With a little finesse, it’s possible to rewrite the multiplicative method in a way such that the intermediate results never exceed the range of a double: notice that inside the multiplicative method, we have \(kA\) where \(k\) is the result of the Horner’s method sum:

$$s[0] 256^0 + s[1] 256^1 + \cdots + s[n] 256^n$$

We can distribute \(A\) over this giving

$$A s[0] 256^0 + A s[1] 256^1 + \cdots + A s[n] 256^n$$

which turns Horner’s method into

hash = 256 * hash + A * s[i] ;

(hash must now be a float or double.)

The next step is to take the fractional part of this, but just as we can move the normal modulo inside the loop, we can move the fractional part inside the loop as well, because of the identity:

$$\text{frac}(a + b) = \text{frac}(\text{frac}(a) + \text{frac}(b))$$

In C++, the frac function is actually fmod from <cmath>:

$$\text{frac}(a) = \mathtt{fmod}(a,1)$$

Moving this inside gives us

hash = fmod(256 * hash + A * s[i], 1);

and the values of hash are always in the range 0 to 1. After the loop is finished, we can multiply by \(m\) and then round down. This method ensures that we get the correct hash value, even if the intermediate results would have been big enough to overflow an int.
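
Putting the pieces together, a sketch of the whole computation might look like this (the function name, the choice of A, and the unsigned-char cast for extended characters are my own):

#include <cmath>
#include <string>

// Multiplicative hash of a string: Horner's rule over the string as a
// base-256 number, with the fractional part taken at every step so the
// intermediate values stay in [0, 1) and never overflow.
unsigned multiplicative_string_hash(const std::string& s, unsigned m) {
    const double A = 0.6180339887;   // any constant 0 < A < 1
    double hash = 0.0;

    for(int i = static_cast<int>(s.size()) - 1; i >= 0; i--)
        hash = std::fmod(256.0 * hash + A * static_cast<unsigned char>(s.at(i)), 1.0);

    return static_cast<unsigned>(hash * m);   // multiply by m and round down
}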

Hash function properties

A good hash function should be fast to compute and, as we’ll see below, should make collisions as rare as possible; we’ll collect the full list of desirable properties once we’ve looked at hash tables.

Applications of hash functions

So now we’ve got our hash function hash(k), what can we do with it? There are a few interesting options: we can use hashes as a cheap first check for equality (as in the string example above), and, most importantly for us, we can use them to build hash tables (described below).

Hash functions are interesting in that comparing hashes gives us partial information: if two hashes are different, then we know something (the values were different), but if they are the same, we don’t know anything. This would suggest that a desirable property for a hash function to have is that it makes collisions as rare as possible (a collision is two different values that hash to the same result).

Hash tables

Hash tables are probably the main application of hashing functions that most people think of. The idea of a hash table is to create an array (or vector) big enough to hold all possible hashes (i.e., the size should be \(m\)). Then, to store a key/value pair in the table, we compute the key’s hash and place the value in the array element indicated by that hash. To look up a value, we hash the key and find its array element, and then we have its value.

To be more precise, a hash table is an unordered map ADT implemented using an array (or vector) of values. It’s unordered, unlike a search tree-based map, because the structure does not contain or use any information about the order of the values (the order of the hashes is unrelated to the order of the original values). So with a hash table map we cannot ask for the smallest/largest value, or the predecessor/successor of a value, and we cannot enumerate the values “in order” (we can enumerate them in an unpredictable order, the order of the hashes). A sketch of a hash table class might look something like

template<typename Key, typename Value>
class hash_table {
  public:
    void insert(Key k, Value v);
    Value& find(Key k);

  private:
    std::vector<Value> table;

    // Hash function, result in the range 0...table.size()-1
    unsigned hash(Key k); 
};

Given a pair (key, value) we store it in the hash table by simply doing

void hash_table::insert(Key key, Value value) {
    table.at(hash(key)) = value;
}

(I am assuming that hash(x) automatically restricts its results to the size of the table, table.size().)

Similarly, to lookup the value associated with a key we simply do

Value& hash_table::find(Key key) {
    return table.at(hash(key));
}

Of course, the details are more complex than this; in particular: the hash function needs to spread keys evenly over the table, we need a way to tell empty cells from full ones, we need to handle two keys hashing to the same cell, and we need to deal with the table filling up. These are the subjects of the following sections.

Hash functions

For hash tables, we have an additional requirement on the hash function: ideally, we want our hash values to be uniformly distributed over the size of the table. This means that for some set of input keys, the hashes produced have an equal chance of being placed in any of the locations in the table. If the distribution is not uniform, then some regions of the table will always be empty, while other regions will have many collisions. If we want to make efficient use of the table space, this is bad. We want the table to “fill up” regularly, everywhere at the same rate.

Another desirable property is called the avalanche effect: changing a small part of a value results in a totally different hash. This implies that very similar inputs will have very different hashes, so that values that would otherwise end up close to each other (or possibly even in the same cell) in the table get “spread out”.

Sometimes the hash function imposes requirements on the size of the hash table: e.g., the remainder hash function requires a large prime modulus, which will also be the size of the hash table. The multiplicative method has no such requirement.

Taken together, we have these desirable properties for hash functions: they should make collisions rare, distribute their results uniformly over the table, exhibit the avalanche effect, and place as few requirements on the table size as possible.

Empty/full cells

The cells of the hash table hold values, and thus the choice of a representation for “emptiness” depends on the Value type. If Value is a pointer type, then nullptr is an obvious choice for “empty” (provided we never need to store nullptr in the table).

If no such unused Value is available, then we can also create an auxiliary vector<bool> to keep track of whether a given cell is empty or full.

If chaining (see below) is used, then we have a natural representation for empty cells: the empty list.

Collision handling

If there are more possible keys than there are hash values, then collisions (two keys hashing to the same value) are inevitable. To see why, imagine that we have only 3 keys and our hash function produces only 2 distinct hash values. It should be obvious that one of these hash values will correspond to (at least!) 2 of the three keys; there simply aren’t enough hash values to support all the keys.

If two keys hash to the same result, they will end up in the same cell in the table. What do we do now? There are several possibilities, which can broadly be divided into two methods: chaining vs. open addressing.

With chaining, each entry of the hash table is actually a linked list of values. When we have a collision, we simply add the additional value to the list.

The value \(\alpha = n / m\), where \(n\) is the number of elements stored and \(m\) is the size of the table, is called the load factor of the table, and indicates roughly how full the table is.

The worst case for chaining is that all the elements hash to the exact same value, and thus are stored in a single list. In this case, finding and deleting are both \(O(n)\). A good hash function will usually mitigate this possibility.

To illustrate chaining, assume we have a hash table with size \(m = 10\). We have the following sequence of keys to be inserted

$$20 \quad 39 \quad 15 \quad 11 \quad 19 \quad 3 \quad 50 \quad 41$$

and we’ll assume simple remainder hashing (\(\mathtt{hash}(x) = x \mod m\)).

Initially, all the lists in the table are empty. As we insert elements, we add them to the head of each list. (This is both easy to do without storing a tail pointer, and also means that recently-added elements are close to the front of the list, where they will be faster to find.) This gives us the hash table

Index  List
0      50, 20
1      41, 11
2
3      3
4
5      15
6
7
8
9      19, 39

(This isn’t a very uniform distribution, but with so few keys pretty much no distribution will look uniform. Only when we’ve inserted many keys will the table begin to look uniformly filled.)
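
To make the representation concrete, here is a sketch of a chained table for integer keys (the class and member names are mine; it mirrors the example above, with remainder hashing and insertion at the head of each list):

#include <list>
#include <utility>
#include <vector>

// Each cell holds a (possibly empty) list of (key, value) pairs; an empty
// list is the natural representation of an empty cell.
class chained_table {
  public:
    chained_table(std::size_t m) : table(m) { }

    void insert(int key, int value) {
        table[hash(key)].push_front({key, value});   // insert at the head
    }

    bool find(int key, int& value_out) {
        for(const auto& kv : table[hash(key)])       // walk the chain
            if(kv.first == key) {
                value_out = kv.second;
                return true;
            }
        return false;
    }

  private:
    std::size_t hash(int key) { return key % table.size(); }   // assumes key >= 0

    std::vector<std::list<std::pair<int,int>>> table;
};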

In open addressing we don’t use any linked lists. Instead, if, after computing h0 = hash(k) we find that table[h0] is occupied, we use some probe function to generate h1 = p(h0) and check to see if table[h1] is free. If it is, we use that location, otherwise we generate h2 = p(h1) and continue. As long as the probe function is deterministic (always generates the same sequence of indexes), then we can reliably insert, find, and delete key/values from the table. (If you want, you can think of this as a linked list, where the “links” are implicitly defined by the probe function and the table indices it generates.)

Probe functions: common choices include linear probing, where \(p(h) = (h + 1) \bmod m\), quadratic probing, where the offset grows quadratically with the number of probes, and double hashing, where the step size comes from a second hash function. A sketch of linear probing appears below.
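
This sketch is my own illustration; it assumes C++17 for std::optional (used to mark empty cells) and omits deletion, which needs extra care with open addressing:

#include <optional>
#include <utility>
#include <vector>

class probing_table {
  public:
    probing_table(std::size_t m) : table(m) { }

    bool insert(int key, int value) {
        std::size_t h = hash(key);
        for(std::size_t probes = 0; probes < table.size(); probes++) {
            if(!table[h] || table[h]->first == key) {   // empty, or key already here
                table[h] = std::make_pair(key, value);
                return true;
            }
            h = (h + 1) % table.size();                 // probe the next cell
        }
        return false;                                   // table is completely full
    }

    bool find(int key, int& value_out) {
        std::size_t h = hash(key);
        for(std::size_t probes = 0; probes < table.size(); probes++) {
            if(!table[h])
                return false;                           // empty cell: key not present
            if(table[h]->first == key) {
                value_out = table[h]->second;
                return true;
            }
            h = (h + 1) % table.size();
        }
        return false;
    }

  private:
    std::size_t hash(int key) { return key % table.size(); }   // assumes key >= 0

    std::vector<std::optional<std::pair<int,int>>> table;
};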

If we want to analyze open addressing (assuming that clustering is not an issue), we do so by looking at the load factor \(\alpha = n / m\).

  1. We always perform one initial probe (to look at the entry for hash(x))

  2. The entry table[hash(x)] is full with probability \(\alpha\). That is, the “proportion” of fullness at any particular cell is, for open addressing, the probability that any particular cell is full (again, ignoring the effects of clustering).

  3. If table[hash(x)] is full, then the next cell probed, table[p(hash(x))], is also full with probability \(\alpha\).

  4. And so forth.

For open addressing, we know that \(n \le m\), which implies that \(0 \le \alpha \le 1\). Thus, assuming \(\alpha < 1\), the expected number of probes in a search will be at most

$$1 + \alpha + \alpha^2 + \alpha^3 + \cdots = \frac{1}{1 - \alpha}$$

(The idea is that we always make the first probe; with probability \(\alpha\) that cell is full and we make a second probe; with probability \(\alpha^2\) the first two cells are full and we make a third; and so on. Summing these gives the expected number of probes.)

This is for an “unsuccessful” probe sequence, one where we are searching for an empty space in which to place a new element. A “successful” search, where we are looking for a particular key which is already in the table, is more difficult to analyze.
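
To get a feel for the formula: at \(\alpha = 0.5\) an unsuccessful search takes about \(1/(1-0.5) = 2\) probes on average, at \(\alpha = 0.9\) about \(10\), and as \(\alpha\) approaches \(1\) the expected cost grows without bound, which is why we rehash (below) well before the table is actually full.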

Rehashing: full hash tables

Regardless of the scheme we use for collision handling, the performance of a hash table will degrade as the load factor increases. It is generally unwise to allow a hash table to even approach truly being “full” (\(n = m\)), as by that time performance will have severely degraded. Instead, we set a threshold \(\beta\) and if \(\alpha \ge \beta\) then we rehash. This means increasing the size of the hash table, rehashing all the keys/values for the new \(m\) and placing them into the new table. Obviously this is an expensive operation, so we want to grow the table by a sufficient amount so as to make it infrequent. It turns out that the same scheme that we used for vectors works here as well: when we need to grow the hash table, we double its size.
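
As a sketch of the mechanics, using the chained representation from earlier (the type alias and function name are mine):

#include <list>
#include <utility>
#include <vector>

using chain_table = std::vector<std::list<std::pair<int,int>>>;

// Build a table twice the size and re-insert every (key, value) pair,
// hashing with the new table size.
chain_table rehash(const chain_table& old_table) {
    chain_table new_table(old_table.size() * 2);

    for(const auto& chain : old_table)
        for(const auto& kv : chain)
            new_table[kv.first % new_table.size()].push_front(kv);

    return new_table;
}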

With a chaining-based table, instead of triggering a rehash based on \(\alpha\), we usually trigger it based on the length of the longest chain. Thus, if any list gets “too long”, we rehash all lists.

Applications of hash tables

Hash tables are useful in any situation where we want to have near-\(O(1)\) lookups of key/value pairs. A common application is binding variable names to values in programming languages (compilers and interpreters). To illustrate this, we’re going to build a very simple interpreter for a minimal programming language. The language only has arithmetic and variable assignment (and only single-letter variables and single-digit integers at that) but it should serve as a useful introduction.

We’re going to build a stack language, one that works off a stack. Our input will consist of a sequence of tokens (characters) separated by spaces. Each token is either a value, in which case it is pushed onto the stack, or an operation, in which case it can pop some values off the stack, manipulate them, and push some back on. For example, 1 is an integer value and x is a variable value; + is the addition operation: it pops the top two entries off the stack, adds them, and then pushes the result back onto the stack.

In this language, the computation

x = 1
y = 2
x * y + 3

looks like this

1 x =   2 y =   x y *  3 +

The possible operations are

Operation    Description
v1 v2 =      Sets the variable v2 to the value v1; pushes nothing
v1 v2 +      Pops v1 and v2, adds them, and pushes the result
v1 v2 -      Subtraction (v1 - v2); * and / work the same way for multiplication and division
v ~          Unary minus: pops v, negates it, and pushes the result

The complete interpreter looks like this:

#include<iostream>
#include<stack>
#include<unordered_map>

class stack_elem {
  public:
    stack_elem(char c) {
        if(c >= '0' && c <= '9') {
            is_var = false;
            value = c - '0';
        }
        else {
            is_var = true;
            name = c;
        }
    }

    stack_elem(int x) {
        is_var = false;
        value = x;
    }

    bool is_var = false;
    int value = 0;
    char name = '#';
};

std::stack<stack_elem> st;
std::unordered_map<char,int> variables;

void process(char c);
void execute(char c);
int evaluate(stack_elem s);

// ---------------------------------------------------------------------------

int main() {
    char c;
    while(std::cin >> c) {
        process(c);

        if(!st.empty() && !st.top().is_var)
            std::cout << st.top().value << std::endl;
    }

    return 0;
}

void process(char c) {
    if(c >= '0' && c <= '9')
        st.push(c); // Integer
    else if((c >= 'a' && c <= 'z') ||
            (c >= 'A' && c <= 'Z'))
        st.push(c); // Variable
    else
        execute(c); // Operation
}

void execute(char c) {
    int v1, v2, v, val;
    char var;
    switch(c) {
        case '=':
            var = st.top().name; st.pop();        // variable (pushed second)
            val = evaluate(st.top()); st.pop();   // value (pushed first)

            variables[var] = val;
            break;

        case '+':
            v2 = evaluate(st.top()); st.pop();    // second operand is on top
            v1 = evaluate(st.top()); st.pop();

            st.push(v1 + v2);
            break;

        case '-':
            v2 = evaluate(st.top()); st.pop();    // "a b -" computes a - b
            v1 = evaluate(st.top()); st.pop();

            st.push(v1 - v2);
            break;

        case '*':
            v2 = evaluate(st.top()); st.pop();
            v1 = evaluate(st.top()); st.pop();

            st.push(v1 * v2);
            break;

        case '/':
            v2 = evaluate(st.top()); st.pop();    // "a b /" computes a / b
            v1 = evaluate(st.top()); st.pop();

            st.push(v1 / v2);
            break;

        case '~':
            v = evaluate(st.top()); st.pop();

            st.push(-v);
            break;

        default:
            // ignore this character
            return;
    }
}

int evaluate(stack_elem s) {
    if(!s.is_var)
        return s.value;
    else
        return variables.at(s.name); 
}