Assignment 5: Testing hash functions

In this assignment you are going to test various hash functions to see how good they are, in terms of how many collisions they have. Your input will be strings, in fact, all the strings that are defined in the system’s dictionary. This is located in /usr/share/dict/words; you can download a copy to your local computer for testing here. (If you try to download it your browser may say that its a binary file, but it is in fact a text file, with one word per line.) Note that if you try to open the system copy of the dictionary in your code on the server, you will have to open it for reading only, otherwise opening the file will fail. This file contains just under 100,000 English words, which we’re going to use to test the uniformity of various hash functions.

Your hash functions will hash strings into 16-bit (not 32-bit) ints. This is important, because we’re going to keep a table of the number of collisions for each hash value. With 16-bit ints there are only 65,536 possible hash values, so this table will easily fit in memory. If we used 32-bit ints then there would be 4,294,967,296 possble hashes, a more troublesome amount. For C++, you can get a 16-bit unsigned int type by doing

#include<cstdint>

uint16_t x; // x has exactly 16 bits, and is unsigned

For each of the possible hash functions your program should:

Create a vector<int> hashes of size 65,536
Process the list of words, and for each word, compute its hash h
Increment the entry in the table for that hash: hashes.at(h)++
When finished, use Pearson’s $X^2$ test to determine the probability that the resulting distribution is uniformly distributed. (See below for the deets.)

Hash functions to test

The hash (non)functions you should test are:

String length (modulo $2^{16}$)
First character
Additive checksum (add all characters together), modulo $2^{16}$
Remainder (use a modulo of 65413, this is the first prime that is smaller than the table size). Remember that you cannot just add up all the characters and then take the mod of the result; you have to thread the modulo through the loop that computes the sum.
Multiplicative (using the scheme described in class/in the lecture notes). Again, remember that you can’t just use the final sum; you have to incorporate the multiplicative calculation into hashing loop.

Also remember that in C++ char may be signed or unsigned; on your computers and on the server it is signed, so character values actually range from -128 to +127. Normally this doesn’t matter, as all of the standard characters are in the range 0-127. However, a few of the words in the dictionary use extended foreign characters in the range 128-255, which will map to negative values if char is signed, and negative values will throw off the hash functions. A hacky way to compensate for this is to just assume that char is signed and pass the character value c through the transformation

uint16_t(c)

before using it. (This will leave standard character codes alone, but will map extended characters to positive values above 127.)

Pearson’s $X^2$ test

The $X^2$ test is a statistical method for determining the probability that a given set of observed results (in this case, the number of hashes per “bin”) could have been generated from a given distribution. In our case, we want our distribution to be uniform, meaning that every bin should have

$$\text{expected} = \frac{\text{Number of words}}{65,536}$$

items in it (this will be a fractional value). To perform the test, you will take the hashes vector and from it compute the value $X^2$:

$$\mathtt{c2} = \sum_{i} \frac{(\text{expected} - \mathtt{hashes}[i])^2}{\text{expected}}$$

We also need to know the “degrees of freedom”, which is just the number of bins, minus 1: 65,535.

We’ll use the Boost.Math library for the statistical functions:

#include <boost/math/distributions/chi_squared.hpp>

(If you’re working on your own computer, Boost.Math is a header-only library, so you can just download the full Boost distribution and then extract boost_1_63_0/boost/math directory somewhere in the include path for your compiler or project.)

Create a $X^2$ distribution, with the desired degrees of freedom:

boost::math::chi_squared c2d(65535.0);

and then use the cumulative distribution function to get the probability:

float p = boost::math::cdf(c2d, c2);

p tells us the probability that c2 is not derived from a uniform distribution, so the smaller this is, the better! (You can also negate it and convert to a percentage, for a kind of “good distribution” measure: (1 - p) * 100.0.)

Implementation

This assignment doesn’t have a test runner. Just run each of the above hash functions through the full dictionary and then report on the $p$ value you get for each. You should get $p = 1$ for the really bad hash functions (indicating that they are very far from uniform) but $p < 1$ for at least the remainder method. (I got p = 0.251 for remainder hashing; multiplicative should give you a $p < 1$ but it’s tricky to get right, so don’t worry about it too much if you get $p = 1$.)

Submission

Save your work in a directory on the server named cs133/assign5/.

Extra Credit

If you make your code print out nice-looking histograms of each hash function’s distribution, then I’ll count this assignment as an extra assignment.

Hash functions to test

Pearson’s \(X^2\) test

Implementation

Submission

Extra Credit