Motivation: Tracking Copied Assignment Submissions
Suppose you are the instructor of a class, and your students have submitted an assignment. You want to know if any of your students have copied off of each other. Assuming you have n students, S₀ through Sₙ₋₁, you will have to check \(n(n-1)/2\) pairs of assignments against each other:
S₀ == S₁
S₀ == S₂
⋮
S₀ == Sₙ₋₁
S₁ == S₂
S₁ == S₃
⋮
S₁ == Sₙ₋₁
⋮
(Note that because equality is symmetric, once we’ve checked S₀ == S₁ we do not need to check S₁ == S₀.)
This means that the number of comparisons is \(O(n^2)\) in the number of students. However, we can reduce the number of comparisons as we go: if we discover that Sa == Sb and then later we discover that Sb == Sc, then we do not need to check Sa == Sc! This is due to the transitive property of equality: if A == B and B == C, then A == C.
Depending on how many students copied off each other, we may be able to significantly reduce the number of comparisons needed. To keep track of this information, we will use a disjoint set data structure. When two students are discovered to have copied off each other, we place them into the same set; if any other students were already in that set, they will all be grouped together.
The disjoint set data structure will support the following interface:
class disjoint_set {
public:
// Constructs a disjoint set with the given number of elements. Initially,
// every element is in its own set.
disjoint_set(int elem_count);
// Returns the representative element for the set containing `elem`.
// If two elements are in the same set, they will have the same rep.
int rep(int elem);
    // Merges the sets containing `a` and `b`. Does nothing if `a` and `b` are
    // already in the same set.
void merge(int a, int b);
// -------------------------- Utility functions --------------------------
// Returns true if elements `a` and `b` are in the same set.
bool in_same_set(int a, int b);
// Returns the number of elements, as specified in the constructor.
int elem_count() const;
// Returns the number of sets, between 1 and elem_count.
int set_count();
// Returns the number of elements in the same set as `elem`.
int set_size(int elem);
};
The main functions are rep and merge; the remaining functions are utilities.
In order to not have to deal with two concepts (sets and elements), we represent each set by one of its elements. The rep(e) function returns the representative element for the set containing e. If two elements are in the same set, then they are guaranteed to have the same representative; if two elements are in different sets, then their representatives will be different, too.
The merge(a,b) function merges the set containing element a with the set containing element b. (If a and b were already in the same set, it does nothing.) This means that after the merge is finished, rep(a) == rep(b), and likewise for all other pairs of elements from a’s set and b’s set.
Given this class, our process for checking the assignments for cheating becomes:
disjoint_set ds(student_count);
for(int a = 0; a < student_count; ++a)
for(int b = a+1; b < student_count; ++b) {
if(ds.in_same_set(a,b))
continue; // We already know a == b
else if(get_assignment(a) == get_assignment(b))
ds.merge(a,b); // a copied from b, merge their sets
}
When we’re done, we can look at the sets in ds to determine which (and how many) students cheated off each other.
The worst-case runtime for this code occurs when all students are honest: we then perform the full \(O(n^2)\) assignment comparisons, as well as \(O(n^2)\) in_same_set calls. The best-case runtime occurs when all students copy off each other! Then, although we still perform \(O(n^2)\) in_same_set tests, we only perform \(O(n)\) assignment checks.
Hopefully, in_same_set is fast enough that although this code is still technically \(O(n^2)\) in both the best and worst cases, it’s faster than if we didn’t have that check.
Disjoint Sets
A disjoint set data structure is a tree-like structure which stores equivalence classes: sets of objects which are “equal” to each other in some way. In terms of implementation, disjoint sets answer the question, “Is it useful to have a tree structure with no down pointers (left/right), only a parent pointer?”
A disjoint set’s operations include:
Determining the representative element of a set, given another element in that set.
Merging two sets.
Determining the number of elements in a given set.
Determining the total number of sets.
Determining whether the disjoint set structure represents a singleton set (i.e., whether the total number of sets == 1).
For simplicity, we will consider only disjoint sets where each “item” is identified by a non-negative integer value. Disjoint sets of strings, etc., can be accomplished by using a hash table to map the key type to non-negative integers. We will also only consider disjoint sets where the number of elements is known in advance: the constructor for disjoint_set takes a parameter giving the number of elements, and new elements cannot be added later.
Vector/Array-based Disjoint Set Implementation
The simplest possible implementation of a disjoint set is as a vector:
vector<int> elems;
elems[i] tells us the set number that element i belongs to. In this implementation, elements and sets are two different concepts, although they are both represented as ints. Initially, each element belongs to its own set:
class disjoint_set {
public:
disjoint_set(int set_count)
{
elems.resize(set_count);
for(size_t i = 0; i < elems.size(); ++i)
elems[i] = i;
}
⋮
private:
vector<int> elems;
};
To determine the representative of an element, we simply look it up in the vector:
int rep(int e)
{
return elems[e];
}
This is obviously an \(O(1)\)-time operation.
To merge the sets containing elements a and b, we find each set’s representative (set number), and then relabel every element of b’s set as belonging to a’s set:
void merge(int a, int b)
{
    int ra = rep(a);
    int rb = rep(b);
    if(ra == rb)
        return; // Already in the same set
    for(size_t i = 0; i < elems.size(); ++i)
        if(elems[i] == rb)
            elems[i] = ra;
}
This obviously takes \(O(n)\) time, regardless of the number of elements in a’s or b’s sets.
Determining the number of elements in the same set as a is relatively simple: we count the elements that share a’s set number:
int set_size(int a)
{
    int r = rep(a);
    int s = 0;
    for(size_t i = 0; i < elems.size(); ++i)
        if(elems[i] == r)
            ++s;
    return s;
}
Again, this takes \(O(n)\) time.
Determining whether there is only 1 set can be done slightly faster in the best case:
bool is_singleton()
{
int s = elems[0];
for(size_t i = 1; i < elems.size(); ++i)
if(elems[i] != s)
return false;
return true;
}
In the worst-case (singleton set), this takes \(O(n)\)-time, but in the best case (first two elements are in different sets), it takes only \(O(1)\)-time.
Determining the number of sets can be done in \(O(n \log n)\)-time, by sorting the vector and then counting the number of unique elements (counting unique elements without sorting takes \(O(n^2)\)-time).
int set_count()
{
vector<int> sorted = elems;
std::sort(sorted.begin(), sorted.end()); // O(n log n)
// Count unique elements: O(n)
int c = 1;
int s = sorted[0];
for(size_t i = 0; i < sorted.size(); ++i)
if(sorted[i] != s) {
++c;
s = sorted[i];
}
return c;
}
A Nested-vector-based Disjoint Set Implementation
As another approach to implementing a disjoint set, we can use a vector-of-vectors: the outer vector represents all the sets, while each inner vector is a single set:
class disjoint_set {
public:
disjoint_set(int set_count)
{
total_sets = set_count;
sets.resize(set_count);
for(int i = 0; i < set_count; ++i)
            sets.at(i).push_back(i);
}
⋮
private:
vector<vector<int>> sets;
int total_sets;
};
This initializes the vector-of-vectors to a structure like this:
{
{ 0 },
{ 1 },
{ 2 },
⋮
{ n-1 }
}
I.e., each element of sets is a vector of size 1, containing the index of that set.
To determine if two elements are in the same set, we need to discover if they are in the same inner-vector. To assist with this, we define the representative of an element n to be the index of the inner-vector it occurs in:
int rep(int n)
{
assert(n >= 0 and n < total_sets);
for(int i = 0; i < sets.size(); ++i)
for(int elem : sets.at(i))
if(elem == n)
return i;
// Unreachable: we will always find a set
return -1;
}
Then, determining if two elements are in the same set is just a matter of checking to see if they have the same representative:
bool in_same_set(int a, int b)
{
return rep(a) == rep(b);
}
The runtime complexity of in_same_set depends on the runtime complexity of rep. In the worst case, the complexity of rep is \(O(N)\), where N is the number of sets/elements. (Note that in merge, below, we will ensure that if all the elements are in one set, then sets.size() == 1, and thus the outer loop will run at most once.)
To merge two sets, we first find their representatives (i.e., their indexes within the vector). We then copy all the elements from one set (vector) into the other, and then erase the original:
void merge(int a, int b)
{
    a = rep(a);
    b = rep(b);
    if(a == b)
        return; // Already in the same set
    // Copy all of set b into a (at the end of a)
    sets.at(a).insert(
        sets.at(a).end(),   // Where in a to insert (at end)
        sets.at(b).begin(), // Where in b to start copying
        sets.at(b).end()    // Where in b to stop copying
    );
    // Erase b from the outer vector
    sets.erase(sets.begin() + b);
}
The runtime complexity of this is based on rep, insert, and erase:
The worst-case complexity of rep is \(O(N)\), as described above.
The worst case for insert occurs when a’s vector has 1 element and b’s vector has \(N-1\) elements, in which case insert takes \(O(N)\) time.
erase must shift all following elements in the vector down one index, resulting in \(O(N)\) runtime in the worst case.
Hence the total complexity in the worst case is \(O(N + N + N) = O(N)\). Of
course, many merge
operations will be closer to \(O(1)\), when the sets are
small (as they are at the beginning).
Tree/Forest-based Disjoint Set Implementation
A disjoint set is implemented as a forest; a collection of trees. Each tree is made of nodes where each node stores:
Its index (i.e., which student does it represent?)
Its parent, the index of another node. If a node is the root of a tree, then its parent is -1.
struct node
{
int index;
int parent;
};
Each node may have many (or zero!) children; given a node, there is no easy way to go down in the tree, to look at its children, but you can easily go up to its parent. Hence, the direct children of a node i are all the other nodes whose parent == i.
If we look at a node with parent == -1, then the node and all of its descendants (children, children’s children, etc.) are a single set. The root node is called the representative of the set. Note that the representative of any node can be found just by following the parents until we find a node whose parent == -1 (i.e., to find the representative for a node, walk up the tree from the node to the root of its tree).
Because we know in advance the maximum number of nodes we will need (it is the number of students), we don’t need to dynamically allocate them as in a traditional tree structure; we can create an array of nodes (actual nodes, not node pointers) and then use the array indexes instead of pointers.
class disjoint_set {
  public:
    disjoint_set(int set_count)
    {
        total_sets = set_count;
        forest = new node[total_sets];
        // Initialize all nodes as roots
        for(int i = 0; i < total_sets; ++i) {
            forest[i].index = i;
            forest[i].parent = -1;
        }
    }
    ~disjoint_set()
    {
        delete[] forest; // Release the dynamically-allocated array
    }
    ⋮
  private:
    ⋮
    node* forest = nullptr; // Array of nodes (tree roots)
    int total_sets;         // Size of the forest array
};
Initially, we create each node containing its own index and set its parent to -1; remember that at the beginning, each student is in a separate set.
A newly-created disjoint set with set_count == 8
looks like this:
DIAGRAM
Each node is the root of its own tree; hence, every student is in a separate set.
Finding the representative of a set
Suppose we have a set which looks like this:
DIAGRAM
How do we find the representative of set 3? Simple: we follow the parents in each node until we reach a node whose parent == -1; this is the root of 3’s tree and hence the representative of 3’s set. Thus, the rep(n) operation (which returns the representative of n) looks like this:
int rep(int n)
{
assert(n >= 0 and n < total_sets);
while(forest[n].parent != -1)
n = forest[n].parent;
return n;
}
The runtime complexity of this is proportional, in the worst case, to the size of n’s set (i.e., the number of elements in the same set as n). This is because it’s possible that n is at the bottom of a chain of nodes leading up to the root of its tree.
DIAGRAM
The best-case complexity occurs when n is already the root of a tree, in which case it is \(O(1)\).
Determining if two elements are in the same set
To determine if two elements are in the same set, simply find the representatives for both elements and compare:
bool in_same_set(int a, int b)
{
return rep(a) == rep(b);
}
The runtime complexity of this is \(O(1)\) in the best case, and proportional to the larger of a’s and b’s sets in the worst case.
Merging two sets
To merge two sets, we simply find their representatives and then make one a child of the other:
void merge(int a, int b)
{
a = rep(a);
b = rep(b);
if(a == b)
return; // Already in same set
// Make b a child of a
forest[b].parent = a;
}
The best-case runtime complexity is \(O(1)\); the worst case is proportional to the larger of the sets containing a and b. Note that the only non-\(O(1)\) operation here is the call to rep.
Optimizations
The current disjoint set implementation has the advantage of simplicity, but it exhibits a kind of “inverse amortized” performance: initially, all sets are small and performance is good (\(O(1)\)). But as sets are merged and become larger, performance becomes slower. If, in the end, all elements end up in the same set, then rep, and all operations which depend on it, run in \(O(n)\) time!
To combat this, we will add two optimizations which do not improve the performance of any single operation, but which have an amortized effect: over many operations, the runtime will be faster than expected. These two optimizations are:
When merging two sets (trees), take the height of the trees into account. rep runs slower on taller trees, so when merging trees, we want to make the taller tree the parent and the shorter tree the child. However, tracking the height of all the trees accurately turns out to take more time than it’s worth, so instead we use an approximation of height called rank, leading to an optimization called merge by rank.
When searching for the representative of a node, we can take the opportunity to “compress” the entire path from node to root into direct children of the root. This takes no extra time, and means that any future requests for the rep of the node, or of any of its ancestors along that path, will run in \(O(1)\) time. This is called path compression.
Merge-by-rank: The purpose of the merge-by-rank optimization is to avoid creating “too tall” trees when we perform a merge. To do this, we track the rank of each node:
struct node
{
int index;
int parent;
int rank;
};
The rank is not exactly the size or height of each tree, but rather an approximate upper-bound on the height of the node.
Initially, the rank of every node is 0. When we merge two sets, we make the set with the larger rank the parent (root) and the set with the smaller rank the child. We leave the rank of the parent unchanged unless both sets have the same rank, in which case we increment the parent’s rank.
For example, if we merge sets 1 and 2 below:
DIAGRAM
Note that because both sets had the same rank (3) the final rank of the root is \(3+1 = 4\).
Merge-by-rank results in a runtime of \(O(\log N)\).
Path compression: Previously, the rep operation did not modify the disjoint set structure. However, while we are following the path to the root, we can take each node we find and make it a direct child of the root, instead of a more distant descendant. This “compresses” the path from node to root, from a length of up to \(N\) to a length of 1. This does not affect the big-O runtime of rep, but it makes future calls to rep for any of those nodes much faster.
The general method is to follow the path from node to root twice: the first time, we find the representative r of the node (i.e., the root of its tree). The second time, we update the parent of every node along this path to be r, thus making them direct children:
DIAGRAM
The combination of merge-by-rank with path-compression results in \(O(\hat{\alpha}(N))\) amortized runtime, where \(\hat{\alpha}(N)\) is the inverse of the Ackermann function. The Ackermann function grows very quickly, so its inverse grows very slowly: for any N representable as a 64-bit integer, \(\hat{\alpha}(N) \le 4\), which is effectively constant.