In this assignment you are going to test various hash functions to see how
good they are, in terms of how many collisions they have. Your input will be
strings, in fact, all the strings that are defined in the system’s
dictionary. This is located in /usr/share/dict/words
; you can download a
copy to your local computer for testing here. (If you
try to download it your browser may say that its a binary file, but it is in
fact a text file, with one word per line.) Note that if you try to open the system
copy of the dictionary in your code on the server, you will have to open it for
reading only, otherwise opening the file will fail. This file contains
just under 100,000 English words, which we’re going to use to test the uniformity
of various hash functions.
Your hash functions will hash strings into 16-bit (not 32-bit) int
s. This
is important, because we’re going to keep a table of the number of collisions
for each hash value. With 16-bit int
s there are only 65,536 possible hash
values, so this table will easily fit in memory. If we used 32-bit int
s
then there would be 4,294,967,296 possble hashes, a more troublesome amount.
For C++, you can get a 16-bit unsigned int
type by doing
#include<cstdint>
uint16_t x; // x has exactly 16 bits, and is unsigned
For each of the possible hash functions your program should:
Create a
vector<int> hashes
of size 65,536Process the list of words, and for each word, compute its hash h
Increment the entry in the table for that hash:
hashes.at(h)++
When finished, use Pearson’s \(X^2\) test to determine the probability that the resulting distribution is uniformly distributed. (See below for the deets.)
Hash functions to test
The hash (non)functions you should test are:
String length (modulo \(2^{16}\))
First character
Additive checksum (add all characters together), modulo \(2^{16}\)
Remainder (use a modulo of 65413, this is the first prime that is smaller than the table size). Remember that you cannot just add up all the characters and then take the mod of the result; you have to thread the modulo through the loop that computes the sum.
Multiplicative (using the scheme described in class/in the lecture notes). Again, remember that you can’t just use the final sum; you have to incorporate the multiplicative calculation into hashing loop.
Also remember that in C++ char
may be signed or unsigned; on your computers
and on the server it is signed, so character values actually range from -128 to
+127. Normally this doesn’t matter, as all of the standard characters are in the
range 0-127. However, a few of the words in the dictionary use extended foreign
characters in the range 128-255, which will map to negative values if char
is signed, and negative values will throw off the hash functions.
A hacky way to compensate for this is to just assume that char
is signed
and pass the character value c
through the transformation
uint16_t(c)
before using it. (This will leave standard character codes alone, but will map extended characters to positive values above 127.)
Pearson’s \(X^2\) test
The \(X^2\) test is a statistical method for determining the probability that a given set of observed results (in this case, the number of hashes per “bin”) could have been generated from a given distribution. In our case, we want our distribution to be uniform, meaning that every bin should have
items in it (this will be a fractional value). To perform the test, you will
take the hashes
vector and from it compute the value \(X^2\):
We also need to know the “degrees of freedom”, which is just the number of bins, minus 1: 65,535.
We’ll use the Boost.Math library for the statistical functions:
#include <boost/math/distributions/chi_squared.hpp>
(If you’re working on your own computer, Boost.Math is a header-only
library, so you can just download the full Boost distribution and then extract boost_1_63_0/boost/math
directory
somewhere in the include path for your compiler or project.)
Create a \(X^2\) distribution, with the desired degrees of freedom:
boost::math::chi_squared c2d(65535.0);
and then use the cumulative distribution function to get the probability:
float p = boost::math::cdf(c2d, c2);
p
tells us the probability that c2
is not derived from a uniform distribution,
so the smaller this is, the better! (You can also negate it and convert to
a percentage, for a kind of “good distribution” measure: (1 - p) * 100.0
.)
Implementation
This assignment doesn’t have a test runner. Just run each of the above
hash functions through the full dictionary and then report on the \(p\)
value you get for each. You should get \(p = 1\) for the really bad hash
functions (indicating that they are very far from uniform) but \(p < 1\)
for at least the remainder method. (I got p = 0.251
for remainder
hashing; multiplicative should give you a \(p < 1\) but it’s tricky to get
right, so don’t worry about it too much if you get \(p = 1\).)
Submission
Save your work in a directory on the server named cs133/assign5/
.
Extra Credit
If you make your code print out nice-looking histograms of each hash function’s distribution, then I’ll count this assignment as an extra assignment.