Review

Insertion sort

Suppose we split the input into two “sections”, sorted and unsorted. The sorted section is initially empty. For each element in the unsorted section, we insert it into the sorted section, in its proper, sorted, position. Because this is an array/vector, inserting an element requires shifting all the elements after it up one step. There’s a simplification we can apply: the steps of finding the proper sorted position and inserting/shifting it there can be combined into a single loop. We take the element and swap it with the previous element, until either it is in the correct position, or it has reached the beginning of the vector.

Insertion sort is stable.

template<typename T>
void insertion_sort(vector<T>& data) 
{
    for(int i = 1; i < data.size(); ++i)        
        for(int j = i; j > 0 && data[j] < data[j-1]; --j)
            std::swap(data[j], data[j-1]);

}

In terms of number of lines, insertion sort is probably the simplest \(O(n^2)\) sorting algorithm.

What are the best and worst cases for this algorithm?

We can write a recursive version through a pretty straightforward removal of the outer loop:

template<typename T>
void insertion_sort(vector<T>& data, int i = 1) 
{
    if(i == data.size())
        return;
    else {    
        for(int j = i; j > 0 && data[j] < data[j-1]; --j)
            std::swap(data[j], data[j-1]);

        insertion_sort(data, i+1);
    }    
}

Is better sorting possible?

So far, all the sorting algorithms we’ve looked at have been \(O(n^2)\) in the worst case. Is that the best we can do, or can we do better? There’s no use hunting for a better sorting algorithm if none can exist, so let’s first ask whether better is even possible. We want to look at what an optimal sorting algorithm might look like.

When we talk about the “performance” of a sorting algorithm, we are usually referring to the number of comparisons. Obviously, the best case for any sorting algorithm is \(O(n)\); if the elements are in order, then all we need to do is check that, which requires \(O(n)\) comparisons.
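For instance, a single pass is enough to check sortedness (is_sorted_vec is just an illustrative name here; the standard library provides std::is_sorted for exactly this):

// One O(n) pass: true if every element is >= the one before it.
template<typename T>
bool is_sorted_vec(const vector<T>& data) {
    for(int i = 1; i < (int)data.size(); ++i)
        if(data[i] < data[i-1])
            return false;
    return true;
}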

The real question is, what is the worst case for the number of comparisons?

To answer this, we’re going to build a decision tree of all the comparisons a sort might perform, and of what follows from each outcome. The idea is that when a comparison finds a pair of elements out of order, we (conceptually) swap them; if they are in order, we leave them alone.

A tree for a sequence of length three looks like this:

                1:2
            /         \
           2:3        1:3
          /   \      /    \ 
        1,2,3 1:3  2,1,3  2:3
             /    \      /    \
           1,3,2 3,1,2 2,3,1 3,2,1

The leaves of the tree are permutations. The internal nodes are comparisons. E.g., suppose we found that element #1 was less than element #2. (This is the root node: we take the left branch if the comparison is \(<\), right if it is \(\ge\)). Next, we would check to see if the #2 element was less than the #3 element. If that is the case, then we have \(\#1 < \#2 < \#3\), so the sequence is already in the right order, so the permutation we want is \(\langle 1,2,3 \rangle\). This leaves us in the leftmost leaf node. (Once we’ve found the right permutation, actually rearranging the elements of the sequence is \(O(n)\).)

In order to find the correct permutation, we have to walk from the root of the tree, to one of its leaf nodes; the height of the tree from the root to the deepest leaf is the maximum number of comparisons we might need to do to find the correct sort. We don’t know the height of the tree, but we do know the number of leaf nodes: it must be at least \(n!\). That’s because there are \(n!\) permutations of \(n\) elements, and every permutation could be the result of some sort. (It’s possible for some permutations to be the results of more than one path, and thus there might be “duplicate” leaves.) We’ll call the number of leaf nodes \(l\) and we have

$$n! \le l$$

Since this is a binary tree, we need to figure out how the height of a binary tree is related to the number of leaf nodes in it. We call a tree with just the root node a 0-height tree:

  root

(A zero height tree has one leaf node.)

A tree of height 1 has the root and two children:

   root
   /  \
  left right

(A 1-height tree has at most 2 leaf nodes, if we fill out both branches.)

     r
   /   \
   *   *
  / \ / \
  # # # #

(A 2-height tree has at most 4 leaf nodes. Noticing a pattern?)

In fact, a complete binary tree of height \(h\) has \(2^h\) leaf nodes, which implies that an incomplete tree has \( \le 2^h \) leaf nodes. If we want to, we can prove this inductively:

Proof: the number of leaf nodes in a complete binary tree of height \(h\) is exactly \(2^h\).

Base case: \(h = 0\). Then the number of leaf nodes should be \(2^0 = 1\), which it is.

Inductive case: \(h = k, k \gt 0\)

Suppose we have two trees of height \(k-1\):

  /\      /\
 /  \    /  \
/  a \  /  b \

by the IH, we know that each of these has \(2^{k-1}\) leaf nodes. We can take these two trees and form a new tree, of height \(k\) by making them the left and right subtrees of a new tree (the new root is not a leaf, so it doesn’t count).

     root
    /    \
  /\      /\
 /  \    /  \
/  a \  /  b \

The height of the new tree is \(k\), and the two subtrees both have \(2^{k-1}\) leaf nodes, so the new tree has a combined total of \(2 (2^{k-1}) = 2^k\) leaf nodes, exactly what we expected.

Incorporating this into our prior work, we have

$$n! \le l \le 2^h$$

whatever \(h\) is. (E.g., in the above example, \(3! = 6\) but \(2^3 = 8\).) If we take the log of both sides, we get

$$\log (n!) \le h$$

In order to evaluate \(\log (n!)\), we’re going to derive a weak lower bound on the value of \(n!\). The first \(n/2\) terms in the product \(n! = n(n-1)(n-2)\ldots\) are each \(\ge n/2\), so as a lower bound, we can say that \(n! \ge (n/2)^{n/2}\). Combining this with \(n! \le 2^h\) and taking the log, we get

$$\log \left( (n/2)^{n/2} \right) \le h$$ $$\frac{n}{2} \log \frac{n}{2} \le h$$

which is \(\Omega(n \log n)\).

Since the number of comparisons needed to reach a sorted solution in the worst case is \(h\), any comparison-based sorting algorithm will, in the worst case, need to perform on the order of \(n \log n\) comparisons. (Some paths to a leaf are shorter than others, so particular inputs may be cheaper.) This is the target we should be aiming for.

(Note that we were fairly generous in our estimates: we allowed the tree to contain “duplicate” leaves, and we used a very crude bound on \(n!\). A sharper analysis, e.g., via Stirling’s approximation, gives the same \(\Theta(n \log n)\) bound, so \(O(n \log n)\) really is the best a comparison-based sort can do in the worst case.)

Shellsort: The first sub-quadratic sort algorithm

Let’s take a look at the first sorting algorithm to break the \(O(n^2)\) barrier: shell sort (invented by Donald Shell in 1959).

Shell sort can be thought of as a fancied-up version of insertion sort. It relies on sorting elements that are far apart, and then gradually moving closer together, until it’s finally sorting elements that are right next to each other.

A particular pass in Shell sort is characterized by its gap. The idea is that for a gap \(g\), we sort elements that are \(g\) spaces apart in the sequence. E.g., if \(g = 5\) then we sort the 1,6,11,16,… elements, then the 2,7,12,17,… elements, and so forth, until we’ve sorted the 5,10,15,20,… elements. That finishes the \(g = 5\) pass. (Thus, a pass with gap \(g\) actually consists of \(g\) sub-passes, but each sub-pass is short, because it only has about \(\lfloor n / g \rfloor\) elements.) Next, we might move on to a 3-pass, and finish up with a 1-pass. We must finish with a 1-pass for the algorithm to work.

As the gap gets closer to 1, there are fewer subpasses, but each subpass sorts more elements. The final pass sorts all the elements, but hopefully, by that point, most everything will be in the right place. The key is that if a sequence is \(k\)-sorted (for some gap \(k\)) and we then sort it with a smaller gap \(k' < k\), it remains \(k\)-sorted. In other words, smaller gaps don’t “break” any sorting that larger gaps have accomplished.

One of the first gap sequences to have sub-quadratic performance is given by

$$2^k - 1 = 1, 3, 7, 15, 31, 63, \ldots$$

This sequence should be used in reverse: find the largest element in the sequence that is smaller than \(n\) and then work your way down to 1. This gap sequence gives a number-of-comparisons that is \(O(n^{3/2})\), just below quadratic. Better sequences can reach \(O(n^{4/3})\), at the expense of having weird formulas for generating the sequence.
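For example, with \(n = 100\), the largest term of the sequence below 100 is 63, so the passes would use gaps 63, 31, 15, 7, 3, 1.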

Here’s a sketch of the sorting algorithm.

// Sort data, starting at index start (< gap) with a spacing of
// gap between elements.
template<typename T>
void insertion_sort(vector<T>& data, int start, int gap); 

// Generates the gap sequence for a sequence of size n, in descending order
vector<int> gap_sequence(int n) {
    int i = 1, g = (1 << i) - 1;
    while(g < n) {
        i++;
        g = (1 << i) - 1;
    }
    i--; // Back off one.

    vector<int> gaps;
    for(int j = i; j > 0; --j) {
        gaps.push_back((1 << j) - 1);
    }

    return gaps;
}

template<typename T>
void shell_sort(vector<T>& data) {
    vector<int> gaps = gap_sequence(data.size());

    for(int g : gaps) {
        for(int i = 0; i < g; ++i) 
            insertion_sort(data, i, g);
    }
}

You can actually use any sorting algorithm in place of insertion sort. Insertion sort has the advantage that it takes only \(O(n)\) time if the input is already sorted, so that means that later (smaller) gap passes will be faster, as the data is already partially sorted.
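The declaration above omits the body. A minimal sketch, assuming it is just ordinary insertion sort with a step of gap instead of 1, might look like this:

// Sketch: insertion sort over the elements start, start+gap, start+2*gap, ...
// (Assumes start < gap, as in the declaration above.)
template<typename T>
void insertion_sort(vector<T>& data, int start, int gap)
{
    for(int i = start + gap; i < (int)data.size(); i += gap)
        for(int j = i; j >= start + gap && data[j] < data[j-gap]; j -= gap)
            std::swap(data[j], data[j-gap]);
}

With start = 0 and gap = 1 this is exactly the insertion sort from the beginning of these notes.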

Mergesort: a near-optimal sorting algorithm

Mergesort is a sorting algorithm that is of order \(O(n \log n)\). It’s also fairly easy to understand, being based on a simple merge operation.

Suppose we have two vectors, both of size \(n/2\) which are already sorted. In this case, we can merge them into a single size-n vector in only \(O(n)\) time. We just keep two pointers into the two vectors, and advance them as we add the smaller element to the output vector:

template<typename T>
void merge(T* in, int mid, int size, T* out) {
    int i = 0, j = mid, k = 0;

    while(i < mid && j < size) {
        // Add the smaller of the two front elements to the output.
        if(in[i] <= in[j])
            out[k++] = in[i++];
        else
            out[k++] = in[j++];
    }

    // Reached the end of one half; add any remaining elements.
    // At most one of these loops will ever run, never both.
    while(i < mid)  out[k++] = in[i++];
    while(j < size) out[k++] = in[j++];
}

(walk through this)

It’s possible to do a recursive merge. The base case is when one of the input sequences is fully consumed (i.e., is empty); at that point we copy the other sequence to the output. The recursive case is when both input sequences are non-empty; we compare the first elements of both sequences, add one of them to the output, and then recurse on the rest of the input sequences and output sequence.
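As a sketch (not the version we’ll use below), the recursive merge might look something like this, with explicit indices standing in for “the rest of” each sequence; merge_rec is a hypothetical helper name:

// Recursive merge of in[i..mid) and in[j..size) into out[k..]
template<typename T>
void merge_rec(T* in, int i, int mid, int j, int size, T* out, int k) {
    if(i == mid) {                  // Left sequence empty: copy the rest of the right
        while(j < size) out[k++] = in[j++];
    } else if(j == size) {          // Right sequence empty: copy the rest of the left
        while(i < mid) out[k++] = in[i++];
    } else if(in[i] <= in[j]) {     // Take the smaller front element, then recurse
        out[k] = in[i];
        merge_rec(in, i+1, mid, j, size, out, k+1);
    } else {
        out[k] = in[j];
        merge_rec(in, i, mid, j+1, size, out, k+1);
    }
}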

It’s also possible to do an in-place merge, not requiring the separate output array. We’ll look at this later.

template<typename T>
void mergesort(T* data, T* output, int size) {
    if(size == 0)
        return;
    else if(size == 1) {
        output[0] = data[0]; // Copy the single element
        return;
    }

    T* temp = new T[size]; // Temporary space for the merge

    int mid = size / 2;

    mergesort(data,       data,       mid);           // Sort lower half (in place)
    mergesort(data + mid, data + mid, size - mid);    // Sort upper half (in place)
    merge(data, mid, size, temp);                     // Merge the sorted halves into temp

    // Copy to output
    for(int i = 0; i < size; ++i)
      output[i] = temp[i];

    delete[] temp;
}

Let’s try out the case where \(n = 8\) with the sequence {1,7,3,9,4,7,2,5}

What is the complexity of mergesort? Well, we divide the input in half each time. For each “level” of the subdivision, we have to do the merge (\(O(n)\) in total) and the copy \(O(n)\). So if there are \(\log n\) levels, and each level takes \(O(n)\) time, then that leaves us at \(O(n \log n)\) putting us at optimal. Note that this is the worst case runtime; there’s nothing we can do in terms of the input sequence that will make the performance worse than \(O(n \log n)\). Some sorting algorithms perform well on some sequences, but badly on others; mergesort has no “bad” input sequences.
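Written as a recurrence, each call does \(cn\) work (the merge plus the copy) on top of two half-sized subproblems:

$$T(n) = 2T(n/2) + cn, \qquad T(1) = c$$

which solves to \(T(n) = O(n \log n)\).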

Note that although mergesort uses an optimal number of comparisons, and performs well in practice, the version presented here has a minor wart: the new and the copy loop at each level. This can be avoided by not keeping a dedicated temp array; instead, we “ping-pong” the sorted results between the two arrays, input and output. Depending on how many levels are required to finish the sort, the results may be in the original input, or in the output array. Either way, we do need \(O(n)\) temporary space, which is a bit of a drawback. The next sorting algorithm can sort in-place, with no extra storage, and it’s still \(O(n \log n)\), but with a bit more algorithmic complexity.

The main downside to Mergesort is that it is not an in-place sort: it requires \(O(n)\) space for the output. It’s possible to do an in-place merge, but a correct implementation that preserves the \(O(n \log n)\) running time is much more complex. (A naive in-place merge, where we grow the merged section at the beginning and shift everything over by one each time we add a new element to it, makes each merge quadratic and drags the total time complexity down to \(O(n^2)\), defeating the purpose!)

Non-recursive mergesort

If we look at the recursion tree for Mergesort, we see that all of the real work is done on the way up. On the way down, we are just figuring out where the various splits in the array should happen. This suggests that we should be able to make a bottom-up Mergesort; one that starts with the array and just does the merge sequence. This leads to a non-recursive Mergesort.

  1. We start with the raw array, and first group the elements into pairs (sub-arrays of size 2), and then “merge” each pair (a merge at this point is really just a compare-and-swap). This sets things up for the next stage:

  2. We group elements into sub-arrays of size 4, and then merge each subarray.

  3. Continue, doubling the size of the sub-arrays at each step, until the entire array is sorted.

This method assumes that the initial size of the array is a power of 2. If this is not the case, then some of the midpoints won’t fall where we expect them to; it can still be done, but it’s more complicated.

Just as in a traditional Mergesort, we have to merge from one array into another. In order to avoid doing a copy at every step, we use two arrays (one the original input and the other a temporary array) and ping-pong the merge inputs and output between them. The arrays switch places after each merge pass. This leads to the following algorithm:

void mergesort(int* input, int size) 
{
  int* tmp =  new int[size];
  int* to =   tmp;
  int* from = input;  

  for(int s = 2; s <= size; s *= 2) {

    // Perform one merge "pass"
    for(int i = 0; i < size; i += s)       
      merge(from + i, s/2, s, to + i);    

    std::swap(from, to);
  }

  // Sorted results are in from; copy to input if needed
  if(input != from)
    for(int i = 0; i < size; ++i)
      input[i] = from[i];

  delete[] tmp;
}

Extending this to non-power-of-2 sizes is left as an exercise for the reader. (It’s annoying.)

Quicksort

Mergesort does all its work on the way back up the recursion, during the merge process. Quicksort is something of the opposite, as it does all its work on the way down; when it reaches the base case, there’s nothing left to be done. Mergesort also always divides the sequence evenly in half; in Quicksort the size of the two “halves” depends on the values in the sequence, which means that some sequences will sort faster than others, though we will take steps to mitigate this.

Quicksort is based on an operation called a partition. The idea is: given an index \(p\), \(0 \le p < n\), whose element data[p] is called the pivot, put every element data[j] < data[p] in the left part of the data, and put every element data[j] >= data[p] in the right part.

The easiest way to understand the partition operation is to imagine that we are making a copy of the input data: we keep two pointers into the copy, i starts from the beginning, moves forward, and places elements that are less than the pivot. j starts from the end, moves in reverse, and places elements that are greater than or equal to the pivot. For each element in the original, we compare it to the pivot, put it at the proper end of the copy, and advance the respective pointer.
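A minimal sketch of this copy-based partition (partition_copy is a made-up name, and for simplicity the pivot is passed in as a value) might look like this:

// Copy-based partition: elements < pivot go at the front of the copy,
// elements >= pivot go at the back. Returns the index of the first
// element of the right (>= pivot) section.
template<typename T>
int partition_copy(T* arr, int start, int finish, T pivot) {
    int n = finish - start;
    T* copy = new T[n];
    int i = 0, j = n - 1;

    for(int k = start; k < finish; ++k) {
        if(arr[k] < pivot)
            copy[i++] = arr[k];   // Place at the front, moving forward
        else
            copy[j--] = arr[k];   // Place at the back, moving backward
    }

    for(int k = 0; k < n; ++k)    // Copy back over the original
        arr[start + k] = copy[k];

    delete[] copy;
    return start + i;
}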

The only problem with this method is that it has to make a copy of the data, which requires temporary storage. It’s possible to write a partition algorithm that works in-place, by finding pairs of less-than and ≥ elements and swapping them. It’s a bit harder to understand, but doesn’t require \(O(n)\) extra space:

template<typename T>
int partition(T* arr, int start, int finish) {
    int p = ...;  // Choose a pivot index
    T x = arr[p]; // Save the pivot value; the pivot element itself may get moved

    int i = start - 1;
    int j = finish; // Finish is last + 1

    while(true) {

        do i++; while(arr[i] < x);
        do j--; while(arr[j] > x);

        if(i >= j)
            return j + 1;

        std::swap(arr[i], arr[j]);
    }
}

This partition scheme works by starting at both ends of the input sequence and walking inward until we find an inversion, a pair of values that are both in the wrong sections (one high, one low); at that point, we swap them and then continue looking for inversions. Note that we have to store the value of the pivot into x, because this procedure might very well move the pivot around! The location returned is, as before, the first element of right partition/end of the left partition. The pivot is located somewhere in the right partition.

A common (but bad!) choice for the pivot is the first

int p = start;

or last

int p = finish - 1;

element of the sequence, as these are easy to get. As we’ll see, both of these choices cause Quicksort to degrade to \(O(n^2)\) for some input sequences.

Note that the pivot must be an element of the input sequence. E.g., you might think we could take the average of the sequence (which can be computed in \(O(n)\) time) and use that, but the average might not be an element of the sequence. The ideal choice would be the median, but computing the exact median cheaply is not straightforward (the obvious way requires the sequence to already be sorted!).

Having written our partition scheme, the main Quicksort function is trivial:

template<typename T>
void quicksort(T* arr, int start, int finish) {
    if(start < finish - 1) {
        int p = partition(arr, start, finish); // First index of the right section
        quicksort(arr, start, p);              // Sort the left section
        quicksort(arr, p, finish);             // Sort the right section
    }
}

That is, we partition the input into two sections (elements no larger than the pivot value, and elements no smaller than it) and then we recursively sort each section. The base case is a section with fewer than two elements (start >= finish - 1); such a section needs no sorting.

Analysis

With Mergesort, the recursive step divided the input into exactly two halves (or as close as possible, if the input size was odd), thus we could see that it would produce \(O(n \log n)\) behavior pretty easily. With Quicksort, how many elements are in the less-than- vs. greater-than-partitions depends on our choice of pivot. Consider choosing the first element as the pivot, on an input sequence that is already sorted:

1 2 3 4 5 6 7 8

In this case, one partition always ends up containing just the pivot, and the other partition will contain the remaining \(n - 1\) elements each time. Instead of reducing the input size by half at each recursive step, we are reducing it only by 1. This gives us \(O(n^2)\) time. (A similar result occurs if the input sequence is sorted in reverse, or if we choose the last element to be the pivot.)
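In recurrence terms, instead of \(T(n) = 2T(n/2) + cn\) we get

$$T(n) = T(n-1) + cn$$

which sums to \(c(n + (n-1) + \cdots + 1) = O(n^2)\).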

In order to get behavior close to \(O(n \log n)\) we have to choose the pivot more carefully. Some methods that have been suggested: picking a random element, picking the middle element, or taking the median of the first, middle, and last elements (“median-of-three”).

Conclusions

Mergesort and quicksort are the two main sub-quadratic sort algorithms. Briefly: mergesort guarantees \(O(n \log n)\) on every input, but needs \(O(n)\) extra space (or a much more complicated in-place merge); quicksort sorts in place and is usually very fast in practice, but a poor pivot choice degrades it to \(O(n^2)\) on some inputs.

Even better sorting

Can we do better? Remember that when we derived the worst-case bound of \(O(n \log n)\), we explicitly limited ourselves to just comparing pairs of elements. If we can do more than just comparisons, we may be able to do better. (In general, the more information we have about the input, the better we may be able to do.)

Trivial sorting

Suppose we have a vector of size \(n\) which contains the numbers \(1\ldots n\), with all elements distinct (no repeats). What’s the fastest we can sort this? Believe it or not, \(O(n)\). How?

A puzzle: write a sort function

void sort(vector<int>& data) 

that sorts data, which contains the numbers 1 through data.size(), all distinct, in linear time. This means you only get one loop.

The solution:

void sort(vector<int>& data) {
    for(int i = 1; i <= data.size(); ++i)
        data.at(i-1) = i;
}

This seems like a silly exercise, but the point is: if you know where each and every element in the sequence goes, its final position, then sorting becomes trivial. You just put every element where it belongs in a single pass, and you’re done. The reason why we can’t do this normally is that we don’t know where things belong. Just because I see the number 5, I don’t know whether it’s the smallest element (and thus should go first), the largest (last), or something in between, until I’ve compared it to all the other elements.

There are a number of other sorting algorithms that depend on the property of having “extra” information about items.

Counting sort

Counting sort is a variation on the trivial “sort” from the previous section. Suppose we know that all the inputs are integers in the range \(0\ldots k\), where k isn’t too big. The idea is, for each value i, count the number of elements that are less than or equal to i. That count tells us where the last copy of i ends up in the sorted vector: at index count - 1.

In order to do this, we create an auxiliary vector<int> pos of size \(k+1\). For each integer i in \(0\ldots k\), pos.at(i) tells us how many elements are less than or equal to i, which is to say, when we’re done, the last copy of i goes at index pos.at(i) - 1 in the output array. A sketch of the algorithm looks like this

vector<int> count_sort(vector<int>& data, int k) {
  vector<int> pos(k+1);
  vector<int> output(data.size());

  // Store positions into pos...


  // Everythiiiiing, in its right plaaace...
  for(int i = data.size()-1; i >= 0; i--) {
    int e = data.at(i);        // current element
    int where = pos.at(e) - 1; // last unfilled slot for e
    output.at(where) = e;

    pos.at(e)--;
  }

  return output;
}

Why is pos.at(e)-- needed at the end? Because there might be duplicates in the input. If there’s more than one of some particular e, then we need to keep track of the fact that we’ve already put one in place, so the next one should go before it (because we are working backwards).

Now we just need to figure out the positions. We start by counting how many of each particular element there are (remember that we want how many are less than or equal to a particular element):

for(int e : data)
  pos.at(e)++; 

After this, pos.at(e) contains the number of es in the input array.

In order to find the number of elements that are less than or equal to some value i, we’re going to transform the counts in pos into a running sum. This just means adding the previous entry to each entry, working our way up:

for(int i = 1; i <= k; i++)
  pos.at(i) += pos.at(i - 1);

The complete code is thus

vector<int> count_sort(vector<int>& data, int k) {
  vector<int> pos(k+1);
  vector<int> output(data.size());

  // Count elements
  for(int e : data)
    pos.at(e)++; 

  // Convert counts to positions (running sum)
  for(int i = 1; i <= k; i++)
    pos.at(i) += pos.at(i - 1);

  // Everything, in its right plaaace...
  for(int i = data.size()-1; i >= 0; i--) {
    int e = data.at(i);        // current element
    int where = pos.at(e) - 1; // last unfilled slot for e
    output.at(where) = e;

    pos.at(e)--;
  }

  return output;
}

An example

Let’s walk through the process of count-sorting

2 5 3 0 2 3 0 3
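Counting each value gives pos = [2, 0, 2, 3, 0, 1] (two 0s, no 1s, two 2s, three 3s, no 4s, one 5). Converting to a running sum gives pos = [2, 2, 4, 7, 7, 8]: there are 2 elements ≤ 0, 4 elements ≤ 2, 7 elements ≤ 3, and 8 elements ≤ 5. Walking the input backwards and placing each element at pos.at(e) - 1 (decrementing as we go) produces the output

0 0 2 2 3 3 3 5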

Analysis

The first loop takes \(O(n)\) time. The second loop takes \(O(k)\) time, where k is the size of the range of input values. The third loop takes \(O(n)\) time, giving us a total of

$$O(n + k)$$

(Note that we need \(O(k + n)\) space as well.)

If k is small, this reduces to just \(O(n)\), linear time. But if k is large, then we need \(O(k)\) temporary space, and the algorithm will take \(O(k)\) time. E.g., imagine if we were sorting an array of ints with no limit on their values. Then the range is \(0\ldots 2^{32} - 1\), waaay too big.

Bucket Sort

Bucket sorting assumes that, over all the elements in the input sequence, they all occur roughly equally. In terms of probability, bucket sort assumes that its input elements are uniformly distributed. E.g., if the inputs are ints then there should be roughly the same number of 0s, 1s, 2s, etc. Like radix sort, bucket sort can make use of the digits in numbers (although it can also work with non-numeric data, provided it can be broken up into roughly-equal sized ranges).

Let’s assume we want to sort numbers. We begin by creating a vector of size 10 (why size 10? because there are 10 digits, \(0\ldots 9\)) containing linked lists. Each list is initially empty. We call each list a bucket.

E.g., 78 17 39 26 72 94 21 12 23 68

For each element in the input, we add it to the appropriate bucket, based on its highest digit (here, the tens digit). The catch is that we insert it into the list in sorted order, so the lists are always sorted. This is just insert from the ordered array example, but on a list so that no shifting is required.

After we’re done, we can just read off the elements from the buckets, in order, into an output array.

The whole algorithm looks like this

vector<int> bucket_sort(vector<int> data) {
  vector<list<int>> buckets(10);

  for(int e : data) {
    int digit = (e / 10) % 10;     // Get 10s digit
    auto& b = buckets.at(digit);   // The bucket for this digit

    // Insert into the bucket in sorted order (the "insert" from the
    // ordered-array example, but on a list, so no shifting is required).
    auto it = b.begin();
    while(it != b.end() && *it < e) ++it;
    b.insert(it, e);
  }

  vector<int> output(data.size());
  int i = 0;
  for(auto& l : buckets) 
    for(int e : l)
      output.at(i++) = e;

  return output;
}

Analysis

The first loop takes \(O(n)\) time if the elements are uniformly distributed. If that’s the case, then each bucket is relatively small, and inserting into it in sorted order is pretty fast. If the elements are not uniformly distributed, then we might end up with one huge bucket, and inserting into it would become slow. The second nested loop takes \(O(n)\) time, so ideally, bucket sort takes \(O(n)\) time.

Hybrid sorting

We know that insertion sort takes \(O(n^2)\) time and mergesort takes \(O(n \log n)\) time, but big-O notation ignores any constant multipliers. Since mergesort is more complex than insertion sort, it’s possible that its constant multiplier is actually larger. Suppose, exaggerating, we have the actual performance

Mergesort : \(100 (n \log n)\)
Insertion sort: \(10 n^2 \)

If we plot this out, we can see that there is a non-trivial range of small n where the insertion sort is actually faster!
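(With these made-up constants, the crossover is where \(10 n^2 = 100\, n \log_2 n\), i.e., \(n = 10 \log_2 n\), which happens around \(n \approx 60\); below that, insertion sort wins.)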

A hybrid sort is a sorting algorithm that switches between other sorting algorithms, based on the size of its input and other factors. For example, we might use insertion sort for small inputs, mergesort for large inputs, and radix sort for really huge inputs. This gives us the best behavior of each algorithm across the whole range of input sizes.
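As a concrete illustration, here is a minimal sketch of a hybrid mergesort that falls back to insertion sort on small ranges (the cutoff of 16 is an arbitrary choice; real implementations tune it empirically):

const int CUTOFF = 16; // Arbitrary; worth tuning empirically

template<typename T>
void hybrid_sort(vector<T>& data, int start, int finish) {
    // Small range: insertion sort (cheap constant factor)
    if(finish - start <= CUTOFF) {
        for(int i = start + 1; i < finish; ++i)
            for(int j = i; j > start && data[j] < data[j-1]; --j)
                std::swap(data[j], data[j-1]);
        return;
    }

    // Large range: the usual mergesort structure
    int mid = start + (finish - start) / 2;
    hybrid_sort(data, start, mid);
    hybrid_sort(data, mid, finish);

    // Merge the two sorted halves through a temporary vector
    vector<T> temp;
    temp.reserve(finish - start);
    int i = start, j = mid;
    while(i < mid && j < finish)
        temp.push_back(data[i] <= data[j] ? data[i++] : data[j++]);
    while(i < mid)    temp.push_back(data[i++]);
    while(j < finish) temp.push_back(data[j++]);

    for(int k = 0; k < (int)temp.size(); ++k)
        data[start + k] = temp[k];
}

Calling hybrid_sort(data, 0, data.size()) sorts the whole vector. Many standard-library sorts do this kind of switching, dropping to insertion sort once the ranges get small enough.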