Motivation: Tracking Copied Assignment Submissions
Suppose you are the instructor of a class, and your students have submitted an assignment. You want to know if any of your students have copied off of each other. Assuming you have n students, S₀ through Sₙ₋₁, you will have to check \(n(n-1)/2\) pairs of assignments against each other:
S₀ == S₁
S₀ == S₂
⋮
S₀ == Sₙ₋₁
S₁ == S₂
S₁ == S₃
⋮
S₁ == Sₙ₋₁
⋮
(Note that because equality is symmetric, once we’ve checked S₀ == S₁ we do not need to check S₁ == S₀.)
This means that the number of comparisons is \(O(n^2)\) in the number of students. However, we can reduce the number of comparisons as we go: if we discover that Sa == Sb and then later we discover that Sb == Sc, then we do not need to check Sa == Sc! This is due to the transitive property of equality: if A == B and B == C, then A == C.
Depending on how many students copied off each other, we may be able to significantly reduce the number of comparisons needed. To keep track of this information, we will use a disjoint set data structure. When two students are discovered to have copied off each other, we place them into the same set; if any other students were already in that set, they will all be grouped together.
The disjoint set data structure will support the following interface:
class disjoint_set {
public:
// Constructs a disjoint set with the given number of elements. Initially,
// every element is in its own set.
disjoint_set(int elem_count);
// Returns the representative element for the set containing `elem`.
// If two elements are in the same set, they will have the same rep.
int rep(int elem);
    // Merges the sets containing `a` and `b`. Does nothing if `a` and `b` are
    // already in the same set.
void merge(int a, int b);
// -------------------------- Utility functions --------------------------
// Returns true if elements `a` and `b` are in the same set.
bool in_same_set(int a, int b);
// Returns the number of elements, as specified in the constructor.
int elem_count() const;
// Returns the number of sets, between 1 and elem_count.
int set_count();
// Returns the number of elements in the same set as `elem`.
int set_size(int elem);
};
The main functions are rep and merge; the remaining functions are utilities.
In order to not have to deal with two concepts (sets and elements), we represent each set by one of its elements. The rep(e) function returns the representative element for the set containing e. If two elements are in the same set, then they are guaranteed to have the same representative; if two elements are in different sets, then their representatives will be different, too.
The merge(a,b) function merges the set containing element a with the set containing element b. (If a and b were already in the same set, it does nothing.) This means that after the merge is finished, rep(a) == rep(b), and likewise for all other pairs of elements from a’s set and b’s set.
Given this class, our process for checking the assignments for cheating becomes:
disjoint_set ds(student_count);
for(int a = 0; a < student_count; ++a)
for(int b = a+1; b < student_count; ++b) {
if(ds.in_same_set(a,b))
continue; // We already know a == b
else if(get_assignment(a) == get_assignment(b))
ds.merge(a,b); // a copied from b, merge their sets
}
When we’re done, we can look at the sets in ds to determine which (and how many) students cheated off each other.
The worst-case runtime for this code occurs when all students are honest: we then perform the full \(O(n^2)\) assignment comparisons, as well as \(O(n^2)\) in_same_set calls. The best-case runtime occurs when all students copy off each other! Then, although we still perform \(O(n^2)\) in_same_set tests, we only perform \(O(n)\) assignment checks.
Hopefully, in_same_set is fast enough that although this code is still technically \(O(n^2)\) in both the best and worst cases, it’s faster than if we didn’t have that check.
Disjoint Sets
A disjoint set data structure is a tree-like structure which stores equivalence classes: sets of objects which are “equal” to each other in some way. In terms of implementation, disjoint sets answer the question, “Is it useful to have a tree structure with no down pointers (left/right), only a parent pointer?”
A disjoint set’s operations include:
Determining the representative element of a set, given another element in that set.
Merging two sets.
Determining the number of elements in a given set.
Determining the total number of sets.
Determining whether the disjoint set structure represents a singleton set (i.e., whether the total number of sets == 1).
For simplicity, we will consider only disjoint sets where each “item” is identified by a non-negative integer value. Disjoint sets of strings, etc., can be accomplished by using a hash table to map the key type to non-negative integers. We will also only consider disjoint sets where the number of elements is known in advance: the constructor for disjoint_set takes a parameter giving the number of elements, and new elements cannot be added later.
Vector/Array-based Disjoint Set Implementation
The simplest possible implementation of a disjoint set is as a vector:
vector<int> elems;
elems[i] tells us the set number that element i belongs to. In this implementation, elements and sets are two different concepts, although they are both represented as ints. Initially, each element belongs to its own set:
class disjoint_set {
public:
disjoint_set(int set_count)
{
elems.resize(set_count);
for(size_t i = 0; i < elems.size(); ++i)
elems[i] = i;
}
⋮
private:
vector<int> elems;
};
To determine the representative of an element, we simply look it up in the vector:
int rep(int e)
{
return elems[e];
}
This is obviously an \(O(1)\)-time operation.
To merge the sets containing elements a and b, we find each set’s representative (set number), and then relabel every element of b’s set as belonging to a’s set:
void merge(int a, int b)
{
    int ra = rep(a);
    int rb = rep(b);
    if(ra == rb)
        return; // Already in the same set
    for(size_t i = 0; i < elems.size(); ++i)
        if(elems[i] == rb)
            elems[i] = ra;
}
This obviously takes \(O(n)\) time, regardless of the number of elements in a’s or b’s sets.
Determining the number of elements in the same set as a is relatively simple: we count the elements that share a’s set number:
int set_size(int a)
{
    int r = rep(a);
    int s = 0;
    for(size_t i = 0; i < elems.size(); ++i)
        if(elems[i] == r)
            ++s;
    return s;
}
Again, this takes \(O(n)\) time.
Determining whether there is only 1 set can be done slightly faster in the best case:
bool is_singleton()
{
int s = elems[0];
for(size_t i = 1; i < elems.size(); ++i)
if(elems[i] != s)
return false;
return true;
}
In the worst-case (singleton set), this takes \(O(n)\)-time, but in the best case (first two elements are in different sets), it takes only \(O(1)\)-time.
Determining the number of sets can be done in \(O(n \log n)\)-time, by sorting the vector and then counting the number of unique elements (counting unique elements without sorting takes \(O(n^2)\)-time).
int set_count()
{
vector<int> sorted = elems;
std::sort(sorted.begin(), sorted.end()); // O(n log n)
// Count unique elements: O(n)
int c = 1;
int s = sorted[0];
for(size_t i = 0; i < sorted.size(); ++i)
if(sorted[i] != s) {
++c;
s = sorted[i];
}
return c;
}
A Nested-vector-based Disjoint Set Implementation
As another approach to implementing a disjoint set, we can use a vector-of-vectors: the outer vector represents all the sets, while each inner vector is a single set:
class disjoint_set {
public:
disjoint_set(int set_count)
{
total_sets = set_count;
sets.resize(set_count);
for(int i = 0; i < set_count; ++i)
            sets.at(i).push_back(i);
}
⋮
private:
vector<vector<int>> sets;
int total_sets;
};
This initializes the vector-of-vectors to a structure like this:
{
{ 0 },
{ 1 },
{ 2 },
⋮
{ n-1 }
}
I.e., each element of sets is a vector of size 1, containing the index of that set.
To determine if two elements are in the same set, we need to discover if they are in the same inner-vector. To assist with this, we define the representative of an element n to be the index of the inner-vector it occurs in:
int rep(int n)
{
assert(n >= 0 and n < total_sets);
for(int i = 0; i < sets.size(); ++i)
for(int elem : sets.at(i))
if(elem == n)
return i;
// Unreachable: we will always find a set
return -1;
}
Then, determining if two elements are in the same set is just a matter of checking to see if they have the same representative:
bool in_same_set(int a, int b)
{
return rep(a) == rep(b);
}
The runtime complexity of in_same_set depends on the runtime complexity of rep. In the worst case, the complexity of rep is \(O(N)\), where N is the number of sets/elements. (Note that in merge, below, we will ensure that if all the elements are in one set, then sets.size() == 1, and thus the outer loop will run at most once.)
To merge two sets, we first find their representatives (i.e., their indexes within the vector). We then copy all the elements from one set (vector) into the other, and then erase the original:
void merge(int a, int b)
{
    a = rep(a);
    b = rep(b);
    if(a == b)
        return; // Already in the same set
    // Copy all of set b into a (at the end of a)
    sets.at(a).insert(
        sets.at(a).end(),   // Where in a to insert (at end)
        sets.at(b).begin(), // Where in b to start copying
        sets.at(b).end()    // Where in b to stop copying
    );
    // Erase b from the outer vector
    sets.erase(sets.begin() + b);
}
The runtime complexity of this is based on rep, insert, and erase:
The worst-case complexity of rep is \(O(N)\), as described above.
The worst case for insert occurs when a’s vector has 1 element and b’s vector has \(N-1\) elements, in which case insert takes \(O(N)\) time.
erase must shift all following elements in the vector down one index, resulting in \(O(N)\) runtime in the worst case.
Hence the total complexity in the worst case is \(O(N + N + N) = O(N)\). Of
course, many merge
operations will be closer to \(O(1)\), when the sets are
small (as they are at the beginning).
Tree/Forest-based Disjoint Set Implementation
A disjoint set is implemented as a forest; a collection of trees. Each tree is made of nodes where each node stores:
Its index (i.e., which student does it represent?)
Its parent, the index of another node. If a node is the root of a tree, then its parent is -1.
struct node
{
int index;
int parent;
};
Each node may have many (or zero!) children; given a node, there is no easy way to go down in the tree, to look at its children, but you can easily go up to its parent. Hence, the direct children of a node i are all the other nodes whose parent == i.
If we look at a node with parent == -1, then the node and all of its descendants (children, children’s children, etc.) are a single set. The root node is called the representative of the set. Note that the representative of any node can be found just by following the parents until we find a node whose parent == -1 (i.e., to find the representative for a node, walk up the tree from the node to the root of its tree).
Because we know in advance the maximum number of nodes we will need (it is the number of students), we don’t need to dynamically allocate them as in a traditional tree structure; we can create an array of nodes (actual nodes, not node pointers) and then use the array indexes instead of pointers.
class disjoint_set {
  public:
    disjoint_set(int set_count)
    {
        total_sets = set_count;
        forest = new node[total_sets];
        // Initialize all nodes as roots
        for(int i = 0; i < total_sets; ++i) {
            forest[i].index = i;
            forest[i].parent = -1;
        }
    }
    ~disjoint_set()
    {
        delete[] forest; // Release the dynamically-allocated array
    }
    ⋮
  private:
    ⋮
    node* forest = nullptr; // Array of nodes (tree roots)
    int total_sets;         // Size of the forest array
};
Initially, we create each node containing its own index and set its parent to -1; remember that at the beginning, each student is in a separate set.
A newly-created disjoint set with set_count == 8
looks like this:
DIAGRAM
Each node is the root of its own tree; hence, every student is in a separate set.
Finding the representative of a set
Suppose we have a set which looks like this:
DIAGRAM
How do we find the representative of set 3? Simple: we follow the parents in each node until we reach a node whose parent == -1; this is the root of 3’s tree and hence the representative of 3’s set. Thus, the rep(n) operation (which returns the representative of n) looks like this:
int rep(int n)
{
assert(n >= 0 and n < total_sets);
while(forest[n].parent != -1)
n = forest[n].parent;
return n;
}
The runtime complexity of this is proportional, in the worst case, to the size of n’s set (i.e., the number of elements in the same set as n). This is because it’s possible that n is at the bottom of a chain of nodes leading up to the root of its tree.
DIAGRAM
The best-case complexity occurs when n is already the root of a tree, in which case it is \(O(1)\).
Determining if two elements are in the same set
To determine if two elements are in the same set, simply find the representatives for both elements and compare:
bool in_same_set(int a, int b)
{
return rep(a) == rep(b);
}
The runtime complexity of this is \(O(1)\) in the best case, and proportional to the larger of a’s and b’s sets in the worst case.
Merging two sets
To merge two sets, we simply find their representatives and then make one a child of the other:
void merge(int a, int b)
{
a = rep(a);
b = rep(b);
if(a == b)
return; // Already in same set
// Make b a child of a
forest[b].parent = a;
}
The best-case runtime complexity is \(O(1)\); the worst case is proportional to the larger of the sets containing a and b. Note that the only non-\(O(1)\) operation here is the call to rep.
Optimizations
The current disjoint set implementation has the advantage of simplicity, but it exhibits a kind of “inverse amortized” performance: initially, all sets are small and performance is good (\(O(1)\)). But as sets are merged and become larger, performance becomes slower. If, in the end, all elements end up in the same set, then rep, and all operations which depend on it, run in \(O(n)\) time!
To combat this, we will add two optimizations which do not improve the performance of any single operation, but which have an amortized effect: over many operations, the runtime will be faster than expected. These two optimizations are:
When merging two sets (trees), take the height of the trees into account. rep runs slower on taller trees, so when merging trees, we want to make the taller tree the parent and the shorter tree the child. However, tracking the height of all the trees accurately turns out to take more time than it’s worth, so instead we use an approximation of height called rank, leading to an optimization called merge by rank.
When searching for the representative of a node, we can take the opportunity to “compress” the entire path from node to root into direct children of the root. This takes no extra time, and means that any future requests for the rep of the node, or of any of its ancestors along that path, will run in \(O(1)\) time. This is called path compression.
Merge-by-rank: The purpose of the merge-by-rank optimization is to avoid creating “too tall” trees when we perform a merge. To do this, we track the rank of each node:
struct node
{
int index;
int parent;
int rank;
};
The rank is not exactly the size or height of each tree, but rather an approximate upper-bound on the height of the node.
Initially, the rank of every node is 0. When we merge two sets, we make the set with the larger rank the parent (root) and the set with the smaller rank the child. We leave the rank of the parent unchanged unless both sets have the same rank, in which case we increment the parent’s rank.
For example, if we merge sets 1 and 2 below:
DIAGRAM
Note that because both sets had the same rank (3) the final rank of the root is \(3+1 = 4\).
Merge-by-rank results in a runtime of \(O(\log N)\).
Path compression: Previously, the rep operation did not modify the disjoint set structure. However, while we are following the path to the root, we can take each node we find and make it a direct child of the root, instead of a more distant descendant. This “compresses” the path from node to root, from a length of up to \(N\) to a length of 1. This does not affect the big-O runtime of rep, but it makes future calls to rep for any of those nodes much faster.
The general method is to follow the path from node to root twice: the first time, we find the representative r of the node (i.e., the root of its tree). The second time, we update the parent of every node along this path to be r, thus making them direct children:
DIAGRAM
The combination of merge-by-rank with path-compression results in \(O(\hat{\alpha}(N))\) amortized runtime, where \(\hat{\alpha}(N)\) is the inverse of the Ackermann function. The Ackermann function grows very quickly, so its inverse grows very slowly: for any N representable as a 64-bit integer, \(\hat{\alpha}(N) \le 4\), which is effectively constant.