Review of last time
- Heaps, which give fast, easy access to the largest/smallest element of a collection.
- Binary search trees, which can be used to implement maps (more on those today).
The Map abstract data type
A map is an abstract data type that associates keys (of some type) with
values of some type. The simplest map implementation is perhaps the array,
which associates integer keys with values of whatever the array type is.
Generally, when we talk about maps, we mean a structure which has
sparse key allocation. This means that, in an array, if we have key 1 and
key 1000, we also have to allocate storage for keys 2, 3, …, 998, 999.
Arrays are dense in their storage. A map is ideally sparse, meaning that
it only stores the keys that the user has explicitly placed in it. A binary
search tree has sparse storage, because it only stores tree nodes for the
values that we have inserted. Another way of putting this is that the
space complexity of the data structure should be at most \(O(n)\) where n
is the number of elements stored. (An array is \(O(k)\) where k is the
largest key stored.)
We can use binary search trees to represent maps, provided that the key type supports an ordering (i.e., keys can be compared for <, >, or ==).
To write a completely generic binary search tree, where we can specify the
type of the keys and values, we’d do something like this:
```cpp
template<typename Key, typename Value>
class binary_tree {
  public:
    bool has_key(Key key) {
        node* current = _root;
        while(current) {
            if(current->key <= key) {
                if(key <= current->key)
                    return true;              // current->key == key
                else                          // current->key < key
                    current = current->right;
            }
            else                              // key < current->key
                current = current->left;
        }
        return false;
    }

    // Other operations: find, insert, remove

  private:
    struct node {
        Key key;
        Value value;
        node* left;
        node* right;
    };

    node* _root = nullptr;
};
```
Question: why have I written has_key to use the rather obscure if-else chain, instead of the more straightforward

```cpp
if(current->key < key)
    ...
else if(current->key > key)
    ...
else // current->key == key
    ...
```
The answer is that this way, the Key type only needs to support one overloaded comparison operator: <=. If you are writing your own key class, you'll appreciate only having to overload one operator, instead of two.
If we wanted to make our class truly generic, we could provide a third template parameter that specifies a comparison function. You could then use this in situations where the key class does not include <=, and you don't have access to the class to modify it, as in the sketch below.
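For instance, here is a minimal sketch of that variant (the parameter name Compare, the member name _leq, and the default of std::less_equal<Key> from <functional> are my own choices here, not fixed by anything above):

```cpp
#include <functional>  // std::less_equal

// Sketch: Compare is any callable implementing a "<=" style comparison.
// By default we use std::less_equal<Key>, which applies key1 <= key2.
template<typename Key, typename Value, typename Compare = std::less_equal<Key>>
class binary_tree {
  public:
    bool has_key(Key key) {
        node* current = _root;
        while(current) {
            if(_leq(current->key, key)) {
                if(_leq(key, current->key))
                    return true;              // keys are equal
                else
                    current = current->right; // current->key < key
            }
            else
                current = current->left;      // key < current->key
        }
        return false;
    }

  private:
    struct node {
        Key key;
        Value value;
        node* left;
        node* right;
    };

    node* _root = nullptr;
    Compare _leq; // comparison object, defaults to <=
};
```

A caller with an awkward key class could then supply its own comparison function object as the third template argument, without touching the key class itself.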
This method will work provided that the key supports ordering. But not every key type does. Furthermore, using a binary search tree implies that map lookup will be \(O(\log n)\) at best (and only if the tree stays balanced). On the other hand, using a binary search tree, with the ordering that it implies, gives us easy access to some other things: the largest and smallest elements, the ability (using an inorder traversal) to list the elements in order, etc. It turns out that we can actually get better runtime complexity, if we are willing to trade the ordering/comparison requirement on keys for something else: hashing.
Hashing
A hash of an object x, hash(x), is, at its most basic, a smaller representation of x, such that for any two objects x and y:

- If x == y then hash(x) == hash(y).
- If x != y then it is unlikely that hash(x) == hash(y).

A case where hash(x) == hash(y) but x != y is called a hash collision. Ideally, collisions are rare, though if hash(x) is smaller than x in some sense, they will always be possible.
A function that produces hashes is called a hash function. As an example of what we mean, consider the problem of comparing strings for equality. We could compare them character-by-character, but if we are going to be comparing the same set of strings over and over, maybe that’s too slow. Instead, we hash every string (where the hash of each string is smaller than the string itself) and compare the hashes:
- If hash(x) != hash(y) then we know that x and y are different strings.
- If hash(x) == hash(y) then we don't know for certain, so we still have to check x == y.
For strings, we're going to divide hashing into a two-step process:

1. Convert the string to an int. This is a common first step in hashing functions, as many expect the input to be of some fixed size, and strings (and many other data structures) obviously are not.
2. Run the resulting value \(k\) through a hash function to get a final hashed value \(0 \le h < m\), for some \(m\).
How can we map strings to ints? We need a function that converts strings to ints. Here are some possibilities:

- s.length() – just take the length of the string. Strings of the same length will have their hashes collide. This is a pretty terrible method.

- s.at(0) (or s.at(c) for any constant c) – just use the first (or c-th) character. All strings that start with the same character will collide. Like the previous method, this is bad.

  Note that at this point we run into a quirk of C++: char may be signed or unsigned. For normal characters this doesn't matter (as all the standard alphabetic characters are in the range 0-127) but if your text includes extended ASCII characters, those will be mapped to negative values, which will throw off your sum.

- s.at(0) + s.at(1) + ... – add all the characters together, and let integer overflow keep the result in the range of an int. We can write this as a simple loop:

  ```cpp
  unsigned int hash = 0;
  for(char c : s) {
      hash = hash + c;
  }
  ```

  This is one form of a checksum and it's easy to compute. The main problem with this is that it's easy to make two strings collide: just mix up the order of the characters: "ab" and "ba" will have the same hashes. (A similar problem arises if we just exclusive-OR all the characters together.)
The best way to sum would be to treat each character as a digit in a base-\(2^8\) number, and then take the result modulo some \(m\) (e.g., \(2^{32}\), or whatever the size of an int is). To compute this, we would calculate the sum

$$k = (s_0 \cdot 256^{n-1} + s_1 \cdot 256^{n-2} + \cdots + s_{n-2} \cdot 256 + s_{n-1}) \bmod m$$

where \(s_i\) is the \(i\)-th character and \(n\) is the length of the string. (Note that instead of waiting until the end to compute the mod, we can perform it after each addition and get the same result.)
To implement this, we can use Horner's rule to avoid having to compute the powers of 256. Horner's rule says that we can compute the polynomial

$$a_n x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0$$

at some \(x\) by rearranging it into

$$(\cdots((a_n x + a_{n-1}) x + a_{n-2}) x + \cdots) x + a_0$$

We start with \(a_n\) and then at each step we multiply by \(x\) and then add in the next lower coefficient. In other words, we run the loop

```cpp
auto result = 0;
for(int i = n; i >= 0; i--)
    result = (result * x + a[i]) % m;
```
For our purposes, this means that we can compute the "value" of a string as a base-256 number, mod \(m\), via

```cpp
unsigned int result = 0;
for(unsigned int i = 0; i < s.size(); i++)
    result = (result * 256 + s.at(i)) % m;
```

(The loop runs forward from the first character, because the first character is the highest digit. And bear in mind that you might need to adjust the value of s.at(i) if extended characters are in play!)
Note that these methods (you should only ever really use the last one) are not hash functions; they are just methods for turning an arbitrary length key (i.e., a string) into a fixed-length key, because many hash methods work only with fixed-length keys.
In terms of hash functions, we have a couple of common choices:
- We can take the input \(k \bmod m\). This is known as remainder hashing. It is interesting in that it distributes the hash values in exactly the same way as the input keys. That is, if the keys are uniformly distributed, then the hash values will be also.

  Some choices for \(m\) are (much!) better than others. The best choice is usually a large prime number that isn't too close to a power of 2. The fact that we can't choose any \(m\) we like somewhat limits the uses of this, but it's easy enough that we can use it for examples. It's also fast: nothing more than a single division.

  Note that in order to do this correctly, you should not wait until the end of the key computation to apply % m, but rather do it inside the loop.

- Multiplicative hashing is a technique that allows us to map the hash values into any target range we like, which will be important when we talk about hash tables. (The remainder method, in contrast, is restricted to moduli that are suitable primes.) The idea is to choose a floating point value \(0 < A < 1\) and then compute

  $$\mathtt{hash}(k) = \lfloor m\; \mathrm{frac}(kA) \rfloor$$

  where the function frac gives the fractional part of its input, and \(k\) is a suitable integer representation of the key value to be hashed. A good value for \(A\) (according to Knuth) is

  $$A = \frac{\sqrt{5} - 1}{2} \approx 0.6180339887\ldots$$

  Some values of \(A\) are better for different datasets.
As an example, let's hash the string "Hello". The ASCII values for the letters H e l l o are 72 101 108 108 111. Treating these as digits (high digit on the left) of a base-256 number gives us

```
k = 72*256^4 + 101*256^3 + 108*256^2 + 108*256 + 111 = 310939249775
```

Let's assume that \(m = 10\), so we have

$$\lfloor 10\; \mathrm{frac}(310939249775 \times 0.6180339887) \rfloor = 8$$

One problem with the multiplicative method is that for very large \(k\), as we had here, floating-point roundoff errors can occur, leading to badly distributed hashes. Some implementations use a fixed-point version, where we encode the fractional values in an int under the assumption that the decimal point has been shifted some fixed number of positions to the left.
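As a hedged sketch of that fixed-point idea (the function name is mine; it assumes 32-bit unsigned arithmetic and a table whose size is a power of two, and the constant 2654435769 is \(\lfloor A \cdot 2^{32} \rfloor\) for Knuth's \(A\)):

```cpp
#include <cstdint>

// Fixed-point multiplicative hash: A is stored as a 32-bit fixed-point
// fraction, so the low 32 bits of k*a are frac(kA) scaled by 2^32.
std::uint32_t fixed_mult_hash(std::uint32_t k, unsigned bits) {
    const std::uint32_t a = 2654435769u;  // floor(0.6180339887 * 2^32)
    std::uint32_t product = k * a;        // wraps mod 2^32: the "frac" part
    return product >> (32 - bits);        // top bits index a table of size 2^bits
}
```

Because everything stays in integer arithmetic, there is no roundoff to worry about.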
With a little finesse, it's possible to rewrite the multiplicative method in a way such that the intermediate results never exceed the range of a double: notice that inside the multiplicative method, we have \(kA\) where \(k\) is the result of the Horner's method sum:

$$kA = A(s_0 \cdot 256^{n-1} + s_1 \cdot 256^{n-2} + \cdots + s_{n-1})$$

We can distribute \(A\) over this giving

$$kA = (A s_0) 256^{n-1} + (A s_1) 256^{n-2} + \cdots + (A s_{n-1})$$

which turns Horner's method into

```cpp
hash = 256 * hash + A * s[i];
```

(hash must now be a float or double.)

The next step is to take the fractional part of this, but just as we can move the normal modulo inside the loop, we can move the fractional part inside the loop as well, because of the identity

$$\mathrm{frac}(x + j) = \mathrm{frac}(x) \quad \text{for any integer } j$$

(at each step, 256 times the integer part of hash is an integer, so it can be discarded). In C++, the frac function is actually fmod from <cmath>: \(\mathrm{frac}(x)\) is fmod(x, 1.0). Moving this inside gives us

```cpp
hash = fmod(256 * hash + A * s[i], 1.0);
```
and the values of hash are always in the range 0 to 1. After the loop is finished, we can multiply by \(m\) and then round down. This method ensures that we get the correct hash value, even if the intermediate results would have been big enough to overflow an int.
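Putting the pieces together, a minimal sketch of the whole computation might look like this (the function name and the unsigned char cast, which guards against the signed-char quirk mentioned earlier, are my choices):

```cpp
#include <cmath>   // std::fmod
#include <string>

// Sketch of the multiplicative string hash: Horner's rule with A
// distributed inside and the fractional part taken at every step.
unsigned multiplicative_hash(const std::string& s, unsigned m) {
    const double A = 0.6180339887;  // Knuth's (sqrt(5) - 1) / 2
    double hash = 0.0;
    for(char c : s) {
        // Cast so extended characters don't become negative digits
        unsigned char digit = static_cast<unsigned char>(c);
        hash = std::fmod(256.0 * hash + A * digit, 1.0);
    }
    return static_cast<unsigned>(m * hash);  // multiply by m, round down
}
```

For "Hello" with \(m = 10\), this should agree (up to floating-point roundoff) with the hash value 8 computed earlier.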
Hash function properties
A good hash function should

- have a low probability of collisions,
- distribute its outputs equally (uniformly),
- have a good avalanche effect, and
- be fast.

(We'll define these properties more carefully below.)
Applications of hash functions
So now we've got our hash function hash(k); what can we do with it? There are a few interesting options:

- If, every time we create a string, we store its hash with it, we get a cheap way to check for string inequality. If s1.hash != s2.hash then we know the strings are not equal. Otherwise, we go through with the full string equality check. (A lot of hash applications work like this, giving us a quick early check to see whether two things are different.)

- We can do the same thing with files: compute the hash of the entire file contents, and then compare hashes. Obviously, if we only do this once, there's no savings, but if we save the hashes, we can do it again in the future.
One property of a good hash function is that its output should look random (like the values of a random variable). An interesting thing happens if we insert a random stream of values into a binary search tree: the tree tends, on average, to end up roughly balanced. So one way to get a balanced binary search tree for free is to hash the keys, and then insert the hashes instead of the keys themselves. (But note that this breaks the advantage of the tree being ordered, because there is no correlation between the ordering of the hashes and the ordering of the keys.)
Hash functions are interesting in that comparing hashes gives us partial information: if two hashes are different, then we know something (the values were different), but if they are the same, we don’t know anything. This would suggest that a desirable property for a hash function to have is that it makes collisions as rare as possible (a collision is two different values that hash to the same result).
Hash tables
Hash tables are probably the main application of hashing functions that most people think of. The idea of a hash table is to create an array (or vector) big enough to hold all possible hashes (i.e., the size should be \(m\)). Then, to store a key/value in the table, we compute the key's hash and place the value in the array element indicated by the hash. To look up a value, we hash the key and find its array element, and then we have its value.
To be more precise, a hash table is an unordered map ADT implemented using an array (or vector) of values. It's unordered, unlike a search tree-based map, because the structure does not contain or use any information about the order of the values (the order of the hashes is unrelated to the order of the original values). So with a hash table map we cannot ask for the smallest/largest value, or the predecessor/successor of a value, and we cannot enumerate the values "in order" (we can enumerate them in an unpredictable order, the order of the hashes). A sketch of a hash table class might look something like
```cpp
template<typename Key, typename Value>
class hash_table {
  public:
    void insert(Key k, Value v);
    Value& find(Key k);

  private:
    std::vector<Value> table;

    // Hash function, result in the range 0...table.size()-1
    unsigned hash(Key k);
};
```
Given a pair (key, value), we store it in the hash table by simply doing

```cpp
void hash_table::insert(Key key, Value value) {
    table.at(hash(key)) = value;
}
```

(I am assuming that hash(x) automatically restricts its results to the size of the table, table.size().)
Similarly, to look up the value associated with a key we simply do

```cpp
Value& hash_table::find(Key key) {
    return table.at(hash(key));
}
```
Of course, the details are more complex than this; in particular:

- What hash function should we use?
- How do we indicate whether a given table entry is "empty" or "in use"?
- What happens if the entry at hash(key) is already filled by a different value? (I.e., how do we handle hash collisions?)
- What do we do when every element of the table is "in use" (i.e., the table is full)? Do we even wait that long, or do we maybe do something when the table gets to some percentage full?
- How big should we make the table to start with?
Hash functions
For hash tables, we have an additional requirement on the hash function: ideally, we want our hash values to be uniformly distributed over the size of the table. This means that for some set of input keys, the hashes produced have an equal chance of being placed in any of the locations in the table. If the distribution is not uniform, then some regions of the table will always be empty, while other regions will have many collisions. If we want to make efficient use of the table space, this is bad. We want the table to “fill up” regularly, everywhere at the same rate.
Another desirable property is called the avalanche effect: changing a small part of a value results in a totally different hash. This implies that very similar inputs will have very different hashes, so that values that would otherwise end up close to each other (or possibly even in the same cell) in the table get “spread out”.
Sometimes the hash function imposes requirements on the size of the hash table: e.g., the remainder hash function requires a large prime modulus, which will also be the size of the hash table. The multiplicative method has no such requirement.
Taken together, we have these desirable properties for hash functions:
Uniformly distributed (will use all of the range evenly)
Low probability of collision (related partly to the previous)
Avalanche effect: changing a value in a small way should generate a dramatically different hash (related to the previous)
Fast: the worst case should be \(O(n)\) in the size of the value to be hashed (e.g., length of a string).
Empty/full cells
The cells of the hash table hold values, and thus the choice of a representation for "emptiness" depends on the Value type. If Value is a pointer type, then nullptr is an obvious choice for "empty" (provided we never need to store nullptr in the table).

If no such unused Value is available, then we can also create an auxiliary vector<bool> to keep track of whether a given cell is empty or full.
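A sketch of that auxiliary-vector idea (the struct and member names are mine):

```cpp
#include <vector>

// Sketch: a parallel vector<bool> marks which cells hold real values.
template<typename Value>
struct table_storage {
    std::vector<Value> cells;
    std::vector<bool>  in_use;  // in_use.at(i) is true iff cells.at(i) is valid

    table_storage(unsigned m) : cells(m), in_use(m, false) { }
};
```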
If chaining (see below) is used, then we have a natural representation for empty cells: the empty list.
Collision handling
If there are more possible keys than there are hash values, then collisions (two keys hashing to the same value) are inevitable. To see why, imagine that we have only 3 keys and our hash function produces only 2 distinct hash values. It should be obvious that one of these hash values will correspond to (at least!) 2 of the three keys; there simply aren’t enough hash values to support all the keys.
If two keys hash to the same result, they will end up in the same cell in the table. What do we do now? There are several possibilities, which can broadly be divided into two methods: chaining vs. open addressing.

With chaining, each entry of the hash table is actually a linked list of values. When we have a collision, we simply add the additional value to the list.
Adding a key/value is \(O(1)\), because we simply insert the new element at the head of the list, and finding the correct list is just an \(O(1)\) array lookup.
Finding a key is \(O(k)\) where \(k\) is the length of the longest list. If the hash values are uniformly distributed, then we would expect all the lists to be roughly the same length, which should be \(n / m\) where \(n\) is the number of key/values in the table and \(m\) is the size of the table itself.
Deleting a key/value involves finding it, and thus is also \(O(n/m)\).
The value \(n / m\) is called the load factor of the table, and indicates roughly how full the table is.
The worst case for chaining is that all the elements hash to the exact same value, and thus are stored in a single list. In this case, finding and deleting are both \(O(n)\). A good hash function will usually mitigate this possibility.
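Here is a minimal sketch of chaining, assuming nonnegative int keys and remainder hashing (the class name chained_set is mine, and values are omitted to keep the sketch short):

```cpp
#include <list>
#include <vector>

// Sketch of a chained hash table for (nonnegative) int keys.
class chained_set {
  public:
    chained_set(unsigned m) : table(m) { }

    void insert(int key) {
        // O(1): push onto the front of the bucket's list
        table.at(hash(key)).push_front(key);
    }

    bool contains(int key) const {
        // Expected O(n/m): scan a single bucket
        for(int k : table.at(hash(key)))
            if(k == key)
                return true;
        return false;
    }

  private:
    unsigned hash(int key) const { return key % table.size(); }
    std::vector<std::list<int>> table;
};
```

Note how insert is \(O(1)\) while contains pays the \(n/m\) expected cost discussed above.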
To illustrate chaining, assume we have a hash table with size \(m = 10\). We have the following sequence of keys to be inserted:

$$20 \quad 39 \quad 15 \quad 11 \quad 19 \quad 3 \quad 50 \quad 41$$

and we'll assume simple remainder hashing (\(\mathtt{hash}(x) = x \bmod m\)).
Initially, all the lists in the table are empty. As we insert elements, we add them to the head of each list. (This is both easy to do without storing a tail pointer, and also means that recently-added elements are close to the front of the list, where they will be faster to find.) This gives us the hash table
Index | List |
---|---|
0 | 50, 20 |
1 | 41, 11 |
2 | |
3 | 3 |
4 | |
5 | 15 |
6 | |
7 | |
8 | |
9 | 19, 39 |
(This isn’t a very uniform distribution, but with so few keys pretty much no distribution will look uniform. Only when we’ve inserted many keys will the table begin to look uniformly filled.)
In open addressing we don't use any linked lists. Instead, if, after computing h0 = hash(k), we find that table[h0] is occupied, we use some probe function to generate h1 = p(h0) and check to see if table[h1] is free. If it is, we use that location, otherwise we generate h2 = p(h1) and continue. As long as the probe function is deterministic (always generates the same sequence of indexes), then we can reliably insert, find, and delete key/values from the table. (If you want, you can think of this as a linked list, where the "links" are implicitly defined by the probe function and the table indices it generates.)
Probe functions:

- One simple probe function is just

  $$p(h) = (h + 1) \bmod m$$

  which just moves from one element of the table to the next, wrapping around at the end. This is known as linear probing. Although easy to implement, this probe function suffers from primary clustering: when the table starts to get full, runs of full cells will appear. Any key that hashes into such a run will end up being placed at the end of the run, which has the effect of making the run even bigger, and thus worse for the next key to be hashed into it. (Primary clustering essentially makes collisions appear where there should be none, thus artificially increasing the apparent load factor of the table.) See the insertion sketch below.
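Here is that sketch of insertion with linear probing (assuming nonnegative int keys, remainder hashing, a parallel in_use vector as discussed earlier, and a table that is never completely full, so the loop terminates):

```cpp
#include <vector>

// Sketch: insert a key using linear probing, p(h) = (h + 1) mod m.
void insert_linear(std::vector<int>& table, std::vector<bool>& in_use, int key) {
    unsigned m = table.size();
    unsigned h = key % m;      // remainder hash
    while(in_use.at(h))        // cell occupied: try the next one
        h = (h + 1) % m;
    table.at(h) = key;
    in_use.at(h) = true;
}
```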
As an example, consider a hash table with 10 cells. Suppose we have some keys:

$$20 \quad 39 \quad 15 \quad 11 \quad 19 \quad 3 \quad 50 \quad 41$$

We'll assume simple remainder hashing (\(\mathrm{hash}(x) = x \bmod m\)). Notice that, as we insert the values into the table, the "run" of occupied cells starts to spill out into the rest of the table, as the worked result below shows.
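Working through the insertions in order (a worked result I've filled in; each collision walks forward to the next free cell):

Index | Key |
---|---|
0 | 20 |
1 | 11 |
2 | 19 |
3 | 3 |
4 | 50 |
5 | 15 |
6 | 41 |
7 | |
8 | |
9 | 39 |

Note that 19, 50, and 41 all landed far from their home cells (9, 0, and 1 respectively) because of the growing run at the start of the table.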
- Quadratic probing makes use of an auxiliary hash function hash2(x) and also of the "probe index", the number of times probing has been attempted (if we are computing h3 = p(h2) then the probe index is 3). The quadratic probing function is

  $$p(x,i) = (\mathtt{hash}_2 (x) + c_1 i + c_2 i^2) \bmod m$$

  where \(c_1\) and \(c_2\) are some positive constants. Quadratic probing has the advantage that as later probes are tried, it spreads the new attempts farther apart, reducing the possibility of primary clusters developing. It does, however, have the problem that if two key values legitimately collide (hash to the same value) then their probe sequences will be the same, which will lead to clusters, just not near the original hash location. This is known as secondary clustering.
As an example, we'll use the same sequence of keys as above

$$20 \quad 39 \quad 15 \quad 11 \quad 19 \quad 3 \quad 50 \quad 41$$

and we'll let \(c_1 = 5, c_2 = 2\) (again assuming simple remainder hashing for the hash function). Working this out (with \(\mathtt{hash}_2\) also the remainder hash, and the first probe after a collision using \(i = 1\)), every key lands in its home cell except 19, 50, and 41, which each probe once, to \((h + 5 + 2) \bmod 10\). This gives us the hash table:

Index | Key |
---|---|
0 | 20 |
1 | 11 |
2 | |
3 | 3 |
4 | |
5 | 15 |
6 | 19 |
7 | 50 |
8 | 41 |
9 | 39 |
- Double hashing replaces the basic hash function with a new one built on two different hash functions, hash1 and hash2. It is defined as

  $$p(x,i) = (\mathtt{hash}_1(x) + i\; \mathtt{hash}_2 (x)) \bmod m$$

  Like quadratic probing, this distributes later probes across the table; here, however, the "distance" between the later probes is based on the auxiliary hash function hash2. Thus, every key will be mapped to a different probe sequence, and clustering (both primary and secondary) will be minimized.

  The one difficult requirement is that the hash values generated by hash2 must be relatively prime to (i.e., share no factors with) the size of the hash table \(m\). If \(m\) is a power of 2, then an easy way to do this is to force the values returned by hash2 to be odd.

  Because double hashing uses two different hash functions, there are approximately \(m^2\) different possible probe sequences, whereas linear and quadratic probing are only capable of producing \(m\). The larger number of probe sequences means that secondary clustering is much less likely.
In practice, double hashing is as close as we can get to the theoretically ideal “uniform hashing” (where keys perfectly hash to uniformly distributed hash values).
If we want to analyze open addressing (assuming that clustering is not an issue), we do so by looking at the load factor \(\alpha = n / m\):

- We always perform one initial probe (to look at the entry for hash(x)).
- The entry table[x] is full with probability \(\alpha\). That is, the "proportion" of fullness at any particular cell is, for open addressing, the probability that any particular cell is full (again, ignoring the effects of clustering).
- If table[x] is full, then table[p(x)] is also full with probability \(\alpha\).
- And so forth.
For open addressing, we know that \(n \le m\), which implies that \(0 \le \alpha \le 1\). Thus, over a probe sequence, the expected number of probes in a search will be

$$1 + \alpha + \alpha^2 + \alpha^3 + \cdots = \sum_{i=0}^{\infty} \alpha^i = \frac{1}{1 - \alpha}$$

(The idea is that this represents the probability that the first probe succeeds, or that it fails and the second succeeds, or that the first and second fail and the third succeeds, and so forth.)
This is for an "unsuccessful" probe sequence, one where we are searching for an empty space in which to place a new element. A "successful" search, where we are looking for a particular key which is already in the table, is more difficult to analyze.
Rehashing: full hash tables
Regardless of the scheme we use for collision handling, the performance of a hash table will degrade as the load factor increases. It is generally unwise to allow a hash table to even approach truly being “full” (\(n = m\)), as by that time performance will have severely degraded. Instead, we set a threshold \(\beta\) and if \(\alpha \ge \beta\) then we rehash. This means increasing the size of the hash table, rehashing all the keys/values for the new \(m\) and placing them into the new table. Obviously this is an expensive operation, so we want to grow the table by a sufficient amount so as to make it infrequent. It turns out that the same scheme that we used for vectors works here as well: when we need to grow the hash table, we double its size.
With a chaining-based table, instead of triggering a rehash based on \(\alpha\), we usually trigger it based on the length of the longest chain. Thus, if any list gets “too long”, we rehash all lists.
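A minimal sketch of the rehash step for a chained table (the function name is mine; it assumes the vector-of-lists representation and remainder hashing from the chaining sketch earlier):

```cpp
#include <list>
#include <utility>  // std::move
#include <vector>

// Sketch: double the table size and re-insert every key under the new m.
void rehash(std::vector<std::list<int>>& table) {
    std::vector<std::list<int>> bigger(2 * table.size());
    for(const auto& bucket : table)
        for(int key : bucket)
            bigger.at(key % bigger.size()).push_front(key);
    table = std::move(bigger);  // old positions are now invalid
}
```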
Applications of hash tables
Hash tables are useful in any situation where we want to have near-\(O(1)\) lookups of key/value pairs. A common application is binding variable names to values in programming languages (compilers and interpreters). To illustrate this, we’re going to build a very simple interpreter for a minimal programming language. The language only has arithmetic and variable assignment (and only single-letter variables and single-digit integers at that) but it should serve as a useful introduction.
We're going to build a stack language, one that works off a stack. Our input will consist of a sequence of tokens (characters) separated by spaces. Each token is either a value, in which case it is pushed onto the stack, or an operation, in which case it can pop some values off the stack, manipulate them, and push some back on. For example, 1 is an integer value and x is a variable value; + is the addition operation: it pops the top two entries off the stack, adds them, and then pushes the result back onto the stack.
In this language, the computation

```
x = 1
y = 2
x * y + 3
```

looks like this:

```
1 x = 2 y = x y * 3 +
```
The possible operations are
Operation | Description |
---|---|
v1 v2 = | Sets the variable v2 to the value v1; pushes nothing |
v1 v2 + | Pops v1 and v2, adds them, and pushes the result |
- * / | Subtraction, multiplication, division (same pattern as +) |
v ~ | Unary minus: pops v, negates it, and pushes the result |
```cpp
#include <iostream>
#include <stack>
#include <unordered_map>

class stack_elem {
  public:
    stack_elem(char c) {
        if(c >= '0' && c <= '9') {
            is_var = false;
            value = c - '0';   // digit character to integer
        }
        else {
            is_var = true;     // anything else is a variable name
            name = c;
        }
    }

    stack_elem(int x) {
        is_var = false;
        value = x;
    }

    bool is_var = false;
    int value = 0;
    char name = '#';
};

std::stack<stack_elem> st;
std::unordered_map<char,int> variables;

void process(char c);
void execute(char c);
int evaluate(stack_elem s);

// ---------------------------------------------------------------------------

int main() {
    char c;
    while(std::cin >> c) {
        process(c);
        if(!st.empty() && !st.top().is_var)
            std::cout << st.top().value << std::endl;
    }
    return 0;
}

void process(char c) {
    if(c >= '0' && c <= '9')
        st.push(c); // Integer
    else if((c >= 'a' && c <= 'z') ||
            (c >= 'A' && c <= 'Z'))
        st.push(c); // Variable
    else
        execute(c); // Operation
}

void execute(char c) {
    int v1, v2, v, val;
    char var;
    switch(c) {
        case '=':
            var = st.top().name;      st.pop(); // variable name is on top
            val = evaluate(st.top()); st.pop();
            variables[var] = val;
            break;
        case '+':
            v2 = evaluate(st.top()); st.pop(); // second operand is on top
            v1 = evaluate(st.top()); st.pop();
            st.push(v1 + v2);
            break;
        case '-':
            v2 = evaluate(st.top()); st.pop();
            v1 = evaluate(st.top()); st.pop();
            st.push(v1 - v2);
            break;
        case '*':
            v2 = evaluate(st.top()); st.pop();
            v1 = evaluate(st.top()); st.pop();
            st.push(v1 * v2);
            break;
        case '/':
            v2 = evaluate(st.top()); st.pop();
            v1 = evaluate(st.top()); st.pop();
            st.push(v1 / v2);
            break;
        case '~':
            v = evaluate(st.top()); st.pop();
            st.push(-v);
            break;
        default:
            // ignore this character
            return;
    }
}

int evaluate(stack_elem s) {
    if(!s.is_var)
        return s.value;
    else
        return variables.at(s.name);
}
```
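As a quick check (my own example input, not from the notes above): feeding the program 1 x = 2 y = x y * 3 + on standard input should print some intermediate values along the way and end by printing 5, the value of x * y + 3 with x = 1 and y = 2.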