Advanced topics: Parallelism

Parallelism is a fancy computer science word for “doing more than one thing at once”. The fact that all of you can be logged in to the server at the same time, compiling your code without affecting each other is one kind of parallelism: multiprocessing, the ability to run more than one instance of a program (each instance is called a process) at the same time, with instances being largely independent of each other. This is sometimes called coarse-grained parallelism, because it works at the level of whole programs. We’re going to look at a kind of parallelism that is a little more fine-grained, called multithreading. This involves multiple “threads” of execution, within a single program.

A note about libraries: Almost everything we’re going to talk about has been added to C++ as C++11’s thread library, but the version of the compiler installed on the server doesn’t support it (yet). So instead we’re going to use the Boost library Boost.Thread, which exists partly to bridge compatibility gaps like this. The Boost.Thread library is modeled after the standard library, so almost everything we’re going to do will apply to both; it’s mostly a matter of whether you use the std namespace or the boost namespace (std::thread versus boost::thread). Similarly, we’re going to #include <boost/thread.hpp>; for the standard version, we’d #include <thread>. Finally, when you compile, you’ll have to explicitly link with the boost_thread library (whereas the standard library is linked automatically):

g++ -o myprogram myprogram.cpp  -lboost_system -lboost_thread

You should also note that <boost/thread.hpp> is a very complex header file, and #include-ing it will make your code take longer-than-normal to compile. There are a few features which are in Boost.Thread which are not in the standard C++ thread library; I’ll try to point these out when we talk about them.

If you’re using the standard <thread> header, you may still need to link with a library to use it. For example, on the server you have to do

g++ -o my_program my_program.cpp -lpthread

Thread independence

What does it mean to say that a program can be executing multiple threads? It depends on the platform, but there are some things that always apply:

Threads share all function definitions and class definitions. All threads in your program are part of the same program, and thus can call all the same functions, etc.
Threads share all global variables; the memory space where global variables are stored is shared by all threads (though, as we will see, you have to be very careful about accessing a global variable from multiple threads). This includes class-static members!
Threads share the heap, so dynamically allocated objects can be shared between threads (but again, you have to do so very carefully).
Every thread has its own execution stack. This is the crucial part; this means that different threads can be executing different functions, at the same time, without interfering with each other. Similarly, the local variables in a function are local to each thread, even if multiple threads are executing the same function.
Threads can have “thread-local storage”, a portion of memory that is accessible only from a single thread. You have to take special action to allocate things here.

The fact that threads share all the program’s definitions means that they are all part of the same program. The fact that they share global variables and the heap provides a means for threads to work together and communicate. But the fact that threads have their own stack means that the execution state, what each thread is doing at any particular time, can be different for each thread.

A program without threads implicitly has one thread, the main thread of execution. But a mono-threaded program isn’t very interesting from a parallelism point of view, so we’ll only consider programs with two or more threads to be properly multithreaded.

Atomic operations

What kinds of things are ‘safe’ to do in multiple threads? Not much. Remember that threads share the same memory space, so if two threads try to write to the same memory address at the same “time” it is unpredictable which result will win out. Furthermore, most of the objects we operate on (strings, vectors) take up more than one memory location, so that “modifying” them may take several operations: you might see partial results from one thread and some results from another.

Even something as simple as incrementing a variable is not safe:

c++;

in assembly looks something like

mov eax, [c]
inc eax
mov [c], eax

and another thread could interrupt anywhere in this sequence. eax is a register and registers are not shared between threads. Let’s look at what happens if two threads try to both increment a variable:

Time	Thread 1	Thread 2
0	mov eax, [c]
1		mov eax, [c]
2	inc eax
3		inc eax
4	mov [c], eax
5		mov [c], eax

Remember that each thread has its own eax register. After this sequence of events, the value in memory location [c] will have been incremented once, not twice; one of the updates was lost (this is called the “lost update problem”).
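The interleaving in the table can be replayed deterministically in ordinary single-threaded code. The sketch below is only a simulation: eax1 and eax2 are hypothetical stand-ins for each thread's private eax register, the function name is made up for illustration, and no real threads are involved.

```cpp
// Single-threaded replay of the interleaving shown in the table.
// eax1 and eax2 are hypothetical stand-ins for each thread's private
// eax register; c plays the role of the shared memory location [c].
int simulate_lost_update() {
    int c = 0;        // shared variable
    int eax1 = 0;     // Thread 1's register
    int eax2 = 0;     // Thread 2's register

    eax1 = c;         // time 0: Thread 1 loads c (reads 0)
    eax2 = c;         // time 1: Thread 2 loads c (still reads 0)
    eax1 = eax1 + 1;  // time 2: Thread 1 increments its register
    eax2 = eax2 + 1;  // time 3: Thread 2 increments its register
    c = eax1;         // time 4: Thread 1 stores 1
    c = eax2;         // time 5: Thread 2 stores 1 again

    return c;         // one of the two increments was lost
}
```

Calling simulate_lost_update() returns the value that c holds after both "threads" have finished their three steps.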

The guarantee on Intel is that a single aligned 32-bit memory write will happen atomically: i.e., if you write an int to memory you don’t have to worry that another thread will see some of the bits but not all. Other than that, there are no guarantees.

Another example: why can’t we easily check whether a thread is still running?

if(thread.running())
  // Do stuff

Think about what we know if the if succeeds: do we really know, inside the if that the thread is still running? No! It might have stopped a millisecond after we tested it, so anything we do inside the if, based on the assumption that the thread is running, will be wrong!

Threads in C++

If each thread has its own flow of execution, what does a thread look like in C++? How do we tell it where to start, in our program?

In C++, threads are represented as instances of the class boost::thread. The thread class is just a wrapper that provides for starting up a thread, controlling it, etc. The real meat of the thread of execution is handled by a class that you must write, which must be a “callable” class. A callable class looks like this:

class my_thread {
  public:
    void operator() () {
        // Thread starts here
    }

    // ... Other members/methods as necessary
};

A callable can also be a function that takes no parameters and returns void, e.g.

void f() {
  ... // Thread starts here
}

Indeed, we can even use a lambda (anonymous function):

[]() { 
  // Thread code here
}

To return to the callable class: it is a class that publicly overloads the () function call operator, taking no arguments, so that if we create an instance of this class, we can use it as if it were a function:

my_thread t1;
t1(); // Calls operator() ()

Note that this code does not actually spawn a new thread! my_thread is just a normal class, and calling the overloaded operator() is no different from calling any other method. If we want to actually spawn a new thread, we have to create a boost::thread instance and tell it to run a my_thread:

my_thread mt;
boost::thread t1{mt}; // Spawns a new thread running mt
boost::thread t2{my_thread()}; // Also works (a second thread, built from a temporary)

... //  Normal program flow continues *in parallel* with mt

Note that the thread makes a copy of mt, so if your thread code modifies any data members, those changes will not be reflected in the members of mt.

Note that we can launch multiple threads running my_thread and they will all be independent of each other:

my_thread mt1, mt2;
boost::thread t1{mt1};
boost::thread t2{mt2};
...

The callable class can have data members; these will be local to the thread (the object itself is actually copied into the thread, so modifying mt1 after the thread t1 has been spawned will have no effect on t1). You can use this to customize what each thread does before you start it up. Thus, you can use the callable class to customize how the thread starts, but not to communicate with it after it has started.
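The copying behavior is easy to observe. The sketch below uses the standard <thread> header instead of Boost.Thread (an assumption; it needs a C++11 compiler), and counter_task and original_after_thread are hypothetical names chosen for illustration.

```cpp
#include <thread>

// A callable whose operator() modifies one of its own data members.
struct counter_task {
    int n = 0;
    void operator()() { n = 42; }  // modifies the thread's private copy
};

// Runs a thread on a copy of task and reports what the ORIGINAL
// object's n is afterwards.
int original_after_thread() {
    counter_task task;
    std::thread t{task};  // the thread receives its own copy of task
    t.join();             // wait until the thread has finished
    return task.n;        // unchanged: only the copy was modified
}
```

The value returned is the n of the object in the launching function, which the thread never touched.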

Note that you can’t copy a boost::thread object, so this won’t work:

boost::thread t1{mt1};
boost::thread t2 = t1; // Error!

However, you can “move” them, which means that you can write a function which returns boost::thread objects:

boost::thread make_thread();
...
boost::thread t1 = make_thread(); // Fine

(In C++ terminology, the object returned by the function is not copied but moved into t1. This is possible because C++ can see that the object returned cannot be used anywhere else, and thus a copy is not needed. In the previous example, t1 still exists after the creation of t2 and thus a (forbidden) copy is required.)
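As a minimal sketch of moving thread objects around, the following uses std::thread from the standard library (assumed here in place of boost::thread). The factory make_thread and the helper move_demo are hypothetical; the explicit std::move shows how ownership can be handed from one named thread object to another.

```cpp
#include <thread>
#include <utility>

// Hypothetical factory: starts a thread that writes 7 into *out.
std::thread make_thread(int* out) {
    return std::thread([out]() { *out = 7; });
}

// Returns true if ownership moved as expected and the thread did its work.
bool move_demo() {
    int result = 0;
    std::thread t1 = make_thread(&result);  // moved (or elided), never copied
    std::thread t2 = std::move(t1);         // explicit move: t1 gives up the thread
    bool t1_empty = !t1.joinable();         // t1 no longer refers to any thread
    t2.join();                              // t2 owns the thread, so t2 joins it
    return t1_empty && result == 7;
}
```

After the explicit move, t1 is in the same state as a default-constructed thread object, and only t2 may be joined.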

It’s also possible to create a thread object without assigning a callable to it; when we do this, the thread is said to be “Not-A-Thread”, the thread version of a nullptr. You can’t do anything interesting with a Not-A-Thread thread.

boost::thread t; // t is Not-A-Thread

thread objects have an ID value, which uniquely identifies a particular thread:

boost::thread t{thing};
t.get_id() // returns t's ID

The only rules for IDs are:

If t1.get_id() == t2.get_id() then t1 and t2 are the same thread.
If t1.get_id() == boost::thread::id() then t1 is Not-A-Thread.
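These rules can be checked directly. The sketch below assumes the standard library's std::thread and std::thread::id, which mirror the Boost versions; id_rules_hold is a hypothetical helper name.

```cpp
#include <thread>

// Checks the ID rules on a default-constructed thread and a real one.
bool id_rules_hold() {
    std::thread none;  // default-constructed: Not-A-Thread
    bool none_is_default = (none.get_id() == std::thread::id());

    std::thread t([]() { /* nothing to do */ });
    bool real_has_id = (t.get_id() != std::thread::id());
    bool differs_from_caller = (t.get_id() != std::this_thread::get_id());

    t.join();  // after join(), t would compare equal to std::thread::id() too
    return none_is_default && real_has_id && differs_from_caller;
}
```

Here std::this_thread::get_id() gives the ID of the calling thread, which must differ from the ID of the thread it just spawned.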

If any exceptions are uncaught in a thread, then your entire program will terminate (the same as if you had an uncaught exception in a single-threaded program).

Communicating with a thread

There are a few basic ways to communicate with a thread

Global variables
Static class members
Reference/pointer data members

It’s fairly easy to write a thread class that uses a global variable or a static class member:

int x = 0; // Global

class thing {
  public:
    void operator() () {
      x++; 
      y--;
    }

  static int y;
};

int thing::y = 100;

Using references/pointers is a little more involved:

class thing {
  public:
    thing(int *t) {
        target = t;
    }

    void operator() () {
      (*target)++;
    }

  int* target;
};

We use this like so:

int x = 0;
boost::thread t1{thing(&x)};

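After this thread has run, x will have been incremented once: operator() performs a single increment and then the thread ends. (Because the thread holds a pointer to x, x must stay alive until the thread has finished.)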

Using a reference requires the use of a member initializer list:

class thing {
  public:
    thing(int& t) : target(t) 
    { }

    void operator() () {
      target++; // Increments the int that target refers to
    }

  int& target;
};

...
int x = 0;
boost::thread t1{thing(x)};
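Put together as a complete function, the reference version might look like the sketch below. It assumes std::thread rather than boost::thread, and the names incrementer and increment_via_reference are hypothetical. The join() is what keeps x alive for as long as the thread needs it.

```cpp
#include <thread>

// Callable that holds a reference to an int owned by the caller.
class incrementer {
  public:
    explicit incrementer(int& t) : target(t) { }
    void operator()() { target++; }  // increments the referenced int once
  private:
    int& target;
};

int increment_via_reference() {
    int x = 0;
    std::thread t{incrementer(x)};  // the callable is copied into the thread,
                                    // but the copy still refers to the same x
    t.join();                       // wait, so x outlives the thread
    return x;
}
```

Reading x is safe here only because it happens after join(); reading it while the thread might still be running would be a race.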

Thread lifetime

How long does a thread last, and what happens when it ends? In Boost.Thread, there is only one way a thread can end: internally, because it reached the end of the callable’s operator() method. In particular, there is no way to “kill” a thread from outside it, because doing so is very unsafe: you don’t know what the thread might be doing, that you would be interrupting. That leaves the question of how we, outside the thread, can react to its termination. In Boost.Thread you have to explicitly tell the system one of two things:

I don’t care when this thread finishes. This is called detaching from the thread.
I want to wait until the thread finishes. This is called joining the thread.

What happens if a boost::thread object is destroyed (e.g., goes out of scope at the end of a function) without having been either joined or detached? The answer is, your entire program is terminated. It is considered an error to try to destroy a still-running thread, because interrupting a thread is a very unsafe operation (there’s no way to know what the thread might be doing when we stop it). So Boost.Thread (and the C++ standard) take the only safe option, which is to terminate the whole program (remember that when the whole program ends, the operating system will clean things up for us, hopefully undoing whatever things the thread might have been doing).

To detach from a thread we simply call its .detach() method:

boost::thread t{thing};
t.detach();
// Now t represents Not-A-Thread and can be destroyed

The thread itself continues to run, but t is no longer connected to it (t represents Not-A-Thread). t can be destroyed without ending our program.

To join a thread (wait for it to finish), we call its .join() method:

boost::thread t{thing};
...
t.join(); // Wait for t to finish
// Now t is finished and represents Not-A-Thread

The call to .join() will not return until the thread has ended. This means that after the join, we can be certain that the thread has finished.

A lot of functions will have this pattern:

void my_function() {
    using namespace boost;
    thread t1{...}, t2{...}, t3{...};

    // Do stuff...

    // Wait for all threads to finish
    t1.join(); t2.join(); t3.join();
}

To simplify this kind of thing, Boost.Thread provides a thread_joiner object which will automatically join a thread before the surrounding function exits:

void my_function() {
    using namespace boost;
    thread t1{...}, t2{...}, t3{...};
    thread_joiner tj1{t1}, tj2{t2}, tj3{t3};

    // Do stuff...

    // No need to call .join, the thread_joiners will do it for us
}
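To make the idea concrete, here is a hand-written joiner for std::thread. This is a sketch of the general RAII technique, not Boost's implementation: the class name scoped_join and the function run_with_joiners are hypothetical, and the standard <thread> header is assumed.

```cpp
#include <thread>

// Hypothetical helper: joins the referenced thread when it goes out of scope.
class scoped_join {
  public:
    explicit scoped_join(std::thread& t) : thread_(t) { }
    ~scoped_join() {
        if (thread_.joinable())  // skip threads already joined or detached
            thread_.join();
    }
    scoped_join(const scoped_join&) = delete;
    scoped_join& operator=(const scoped_join&) = delete;
  private:
    std::thread& thread_;
};

int run_with_joiners() {
    int a = 0, b = 0;
    {
        std::thread t1([&a]() { a = 1; });
        std::thread t2([&b]() { b = 2; });
        scoped_join j1{t1}, j2{t2};
        // ... other work could go here ...
    }  // j2 and j1 are destroyed (and join) before t2 and t1 are destroyed
    return a + b;
}
```

The joiners must be declared after the threads they guard: locals are destroyed in reverse order of declaration, so each joiner runs before its thread object's destructor.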

Like sleep_for and sleep_until, there are versions of join, try_join_for and try_join_until, that wait to join the thread but give up after some duration has passed, or some particular time has been reached. These versions return a bool that is true if the thread terminated (i.e., was joined) and false if it was still running when the time ran out.

Testing for thread completion

How can we check to see whether a thread is still running? We would expect there to be something like this (there isn’t):

boost::thread t{thing};
...
if(t.still_running()) {
    ...
}

But let’s think about the implementation of .still_running():

If the thread has ended, then .still_running() can return false, because a thread that has ended isn’t going to start back up again. (Once .still_running() == false it will always be that way.)
What if the thread is still running? Suppose .still_running() returns true. This seems OK, but suppose that immediately after we call .still_running() the thread ends; now we are inside the true branch of the if above, but the condition is no longer true! In fact, to the thread that is running the above code, it would appear that we magically got inside the true branch without the condition having been true!

Multithreading causes these kinds of problems a lot: you can’t just ask a thread a question, because the answer might change immediately after it is given. Because of this, Boost.Thread does not provide a built-in way to check whether a thread has finished. You can work around this by using try_join_for with a duration of 0:

boost::thread t{thing};
...
if(t.try_join_for(boost::chrono::milliseconds(0))) {
    // t has terminated
}
// else t might still be running

(Again, in the else case it’s possible that the thread terminated immediately after the call to try_join_for, and thus we don’t really know whether it is still running or not.)

A better way to deal with thread termination is to join with the thread, as mentioned above. After doing .join() you can be certain that the thread has finished. This also implies a certain kind of design for threads that are intended to be joined-with: they should be written so that they are always progressing towards completion; it should not be possible for a joined-with thread to get into an infinite loop. (If the thread gets into an infinite loop, then the .join() method will never return, and the outer thread will effectively be waiting forever.)

Sleeping

Sometimes a thread is waiting for something to happen and it doesn’t know how long it will take. One way to do this is to write a loop like this:

while(!thing_has_happened()) {
    // Do nothing
}

This is called busy-waiting and it’s rather inefficient, unless we expect the “thing” to happen very soon. While the loop is running, the thread is still using up CPU time. Instead, a better way is to wait for some amount of time inside the loop:

while(!thing_has_happened()) {
    this_thread::sleep_for(chrono::seconds(10));
}

(This requires the Boost.Chrono library, for working with times and durations.)

There is also this_thread::sleep_until(t), which sleeps until a particular absolute time (e.g., 12:37 PM).
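A small sketch of sleeping with explicit durations follows. It assumes the standard <thread> and <chrono> headers, whose sleep_for mirrors the Boost version; timed_sleep_ms is a hypothetical helper that also measures how long the sleep really took.

```cpp
#include <chrono>
#include <thread>

// Sleeps for the given number of milliseconds and returns how long the
// call actually took, measured on a monotonic clock.
long long timed_sleep_ms(int ms) {
    auto start = std::chrono::steady_clock::now();
    std::this_thread::sleep_for(std::chrono::milliseconds(ms));
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
}
```

The measured time is a lower bound only: sleep_for blocks for at least the requested duration, and the scheduler may add an arbitrary delay on top.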

Yielding the CPU

One last thing a thread can do is yield the CPU to another thread. This isn’t something you ever have to do, but it can be useful in threads that would otherwise tend to hog CPU time. this_thread::yield() tells the system to give the rest of this thread’s CPU time slice to the next thread waiting to execute.

This is just a request to yield; depending on the OS and state of the system, it might do nothing.

The “Hello, World” of multithreading

Let’s consider a bank application where we have one account but deposits and withdrawals can occur in parallel:

int account_balance = 0;

struct deposit {
    deposit(int amt) {
        amount = amt;
    }

    void operator() () {
        account_balance = account_balance + amount;
    }

  private:
    int amount;
};

struct withdraw {
    withdraw(int amt) {
        amount = amt;
    }

    void operator() () {
        account_balance = account_balance - amount;
    }

  private:
    int amount;
};

Note that both deposit and withdraw are callable; we can spawn new threads to do either action:

deposit dep(100); // Deposit $100
withdraw wth(50); // Withdraw $50
boost::thread t1{dep}, t2{wth};
// Party on...
t1.join(); t2.join();

What will be the result of the above code, after both threads have completed? We would expect the balance (starting at 0) to end up at 50, but is this assured? In fact, no! There are several different possible outcomes, and which one we get is totally unpredictable. To see how this can come about, let’s take apart the key operation inside both the deposit and withdraw classes: adding/subtracting a value from account_balance:

account_balance = account_balance + amount;

Inside the CPU, this single statement will be broken into a sequence of instructions, looking something like this:

COPY account_balance INTO r1;
ADD amount TO r1 INTO r2;
COPY r2 INTO account_balance;

for deposit and

COPY account_balance INTO r3;
SUB amount FROM r3 INTO r4;
COPY r4 INTO account_balance;

(where the r* are “registers”, storage locations that are inside the CPU itself, as opposed to account_balance and amount which are stored in memory. I could have used the same registers for both, because each thread gets its own registers, but I made them different so that which is which will be clear.)

Remember that the CPU may be running both threads at the same time; that means that these six instructions might be executed in almost any order relative to each other (only the order within each thread is fixed). While each thread has its own registers, and its own amount, both threads share the same account_balance; the whole point is to allow concurrent updates to the account balance. This means that the instructions could effectively be interleaved in many different ways:

d: COPY account_balance INTO r1;
d: ADD amount TO r1 INTO r2;
d: COPY r2 INTO account_balance;
w: COPY account_balance INTO r3;
w: SUB amount FROM r3 INTO r4;
w: COPY r4 INTO account_balance;

d: COPY account_balance INTO r1;
d: ADD amount TO r1 INTO r2;
w: COPY account_balance INTO r3;
d: COPY r2 INTO account_balance;
w: SUB amount FROM r3 INTO r4;
w: COPY r4 INTO account_balance;

…

In fact, out of the \(\binom{6}{3} = 20\) possible ways these two threads can be interleaved (each thread’s three instructions must stay in order, so we only choose which 3 of the 6 time slots go to the deposit thread), only 2 result in the correct behavior! (These two correspond to sequential execution, where one thread completes before the other starts; in order for correct behavior, one thread’s final COPY r INTO account_balance must execute before the other thread’s initial COPY account_balance INTO r.)
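The schedules can also be enumerated mechanically. The following single-threaded sketch simulates the three-instruction sequences for a deposit of 100 and a withdrawal of 50, keeping each thread's instructions in program order. The names run_schedule and enumerate are hypothetical, and one register per thread is used instead of two, which does not change the outcome.

```cpp
#include <string>

// Runs one schedule: 'd' means "the deposit thread executes its next
// instruction", 'w' the same for the withdraw thread. Returns the final balance.
int run_schedule(const std::string& schedule, int dep_amount, int wth_amount) {
    int balance = 0;
    int r_d = 0, r_w = 0;        // each thread's private register
    int step_d = 0, step_w = 0;  // next instruction of each thread
    for (char who : schedule) {
        if (who == 'd') {
            if (step_d == 0) r_d = balance;           // COPY account_balance INTO r
            else if (step_d == 1) r_d += dep_amount;  // ADD amount
            else balance = r_d;                       // COPY r INTO account_balance
            ++step_d;
        } else {
            if (step_w == 0) r_w = balance;
            else if (step_w == 1) r_w -= wth_amount;  // SUB amount
            else balance = r_w;
            ++step_w;
        }
    }
    return balance;
}

// Builds every schedule with d_left 'd' steps and w_left 'w' steps, counting
// all schedules (total) and those that end with the expected balance (correct).
void enumerate(std::string prefix, int d_left, int w_left, int& total, int& correct) {
    if (d_left == 0 && w_left == 0) {
        ++total;
        if (run_schedule(prefix, 100, 50) == 50) ++correct;
        return;
    }
    if (d_left > 0) enumerate(prefix + 'd', d_left - 1, w_left, total, correct);
    if (w_left > 0) enumerate(prefix + 'w', d_left, w_left - 1, total, correct);
}
```

Calling enumerate("", 3, 3, total, correct) with both counters initialized to 0 fills in the number of order-preserving schedules and the number that leave the expected balance of 50.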

Multithreaded programs have to deal with this constantly: the fact that virtually every “interesting” operation actually involves multiple steps, and those steps could end up interleaved in any order when executed by multiple threads. Even something so simple as

a = b;

isn’t necessarily safe. If a and b are anything more complex than an int it’s possible for b to be changed in the middle of the process of copying it into a, so that a ends up with something that is half old-b and half new-b. (Imagine copying a vector, when another thread tries to reallocate and grow the vector in the middle of the copy!)

A data type that can be safely accessed by multiple threads without things “going wrong” is called thread-safe. It should be obvious that our account implementation is not thread safe, as currently written.

Synchronization

As long as we are disciplined about what and how we access shared data, we won’t have any problems. Often, however, we will want to access shared data that is bigger than a single int, or we will want to access it in a way that requires some coordination. There are several synchronization tools that enable us to control access to shared data.

Mutex

A mutex (“mutual exclusion”) is an object that can be “held” by at most one thread at a time. Typically, only the thread holding the mutex is allowed to access some shared data or resource (by convention; the mutex doesn’t enforce this). You can think of a mutex as being like the “talking stick” of campfire stories; only the person holding the stick (mutex) is allowed to speak.

When no thread is holding the mutex, it is unlocked. In this state, no thread should be accessing the shared resource (reading from the resource may be OK or even that may be forbidden, depending on the nature of the resource). When a thread wants to use the resource, it locks the mutex. This will do one of two things:

If the mutex was unlocked, then it becomes locked, owned by the thread. No other thread is allowed to lock it.
If the mutex was already locked by another thread, then this thread blocks, waiting for the mutex to become available. (Rather than waiting on the mutex, a thread can use try_lock to check whether the mutex is unlocked, locking it if it is.)

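(A blocked thread is woken up once the mutex becomes available, so as long as every thread that locks the mutex eventually unlocks it, waiting threads get their turn. Strictly speaking, though, the C++ standard does not promise any particular order or fairness among the waiting threads.)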

When a thread is done with a mutex, it should unlock it, so that the next thread can have a turn.

Mutexes are somewhat low-level, but their behavior underlies a lot of other synchronization primitives. lock_guards and unique_locks provide a wrapper around mutexes that can make them easier to use.
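As a minimal sketch, here is the bank account from earlier guarded by a mutex. It assumes the standard library's std::mutex and std::lock_guard rather than the Boost versions; balance_mutex and run_transactions are hypothetical names, and deposits and withdrawals are written as plain functions to keep the example short.

```cpp
#include <mutex>
#include <thread>
#include <vector>

int account_balance = 0;
std::mutex balance_mutex;  // by convention, guards account_balance

void deposit(int amount) {
    std::lock_guard<std::mutex> lock(balance_mutex);  // locks the mutex here
    account_balance = account_balance + amount;
}  // lock's destructor unlocks the mutex

void withdraw(int amount) {
    std::lock_guard<std::mutex> lock(balance_mutex);
    account_balance = account_balance - amount;
}

// Runs 8 depositing and 8 withdrawing threads, 1000 transactions each.
int run_transactions() {
    account_balance = 0;
    std::vector<std::thread> workers;
    for (int i = 0; i < 8; ++i) {
        workers.emplace_back([]() { for (int k = 0; k < 1000; ++k) deposit(100); });
        workers.emplace_back([]() { for (int k = 0; k < 1000; ++k) withdraw(50); });
    }
    for (std::thread& w : workers) w.join();
    return account_balance;  // safe to read: all workers have finished
}
```

Because every read-modify-write of account_balance happens while holding balance_mutex, no update can be lost, and the final balance is 8 × 1000 × (100 − 50) regardless of how the threads are scheduled.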

Atomic variables

An atomic variable is essentially a variable wrapped in a mutex, so that modifications to it happen atomically, from the perspective of other threads.

struct thing {
  int a;
  float b;
  char c[8]; // Not a string: atomic requires a trivially copyable type
};

std::atomic<thing> at;

To replace the value in an atomic variable, use store or normal assignment:

at.store(thing{1,2.2,"Hello"}); 

// Equivalent
at = thing{1,2.2,"Hello"};

The replacement is done atomically: other threads see either the old value, or the new value, but never a mixture.

To access the value in an atomic variable, use load, or just convert the atomic to its stored type:

thing x = at.load();

thing y = at; // Equivalent, implicit cast

Once again, you are guaranteed to get a consistent thing from load().

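atomic overloads many of the operators, so that if the type you are wrapping supports them, you can use them directly. For example, with an atomic int:

std::atomic<int> ai;

ai = 0;
ai += 2; // Atomically adds 2
ai++;    // Atomically increments

(The increment and decrement operators are only provided for integer and pointer types, so ++ would not compile on an atomic float.)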
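The lost update problem from earlier disappears when the counter is atomic. The sketch below assumes the standard <atomic> and <thread> headers; count_with_atomic is a hypothetical helper name.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Starts num_threads threads that each increment a shared atomic counter
// increments_per_thread times, then returns the final count.
int count_with_atomic(int num_threads, int increments_per_thread) {
    std::atomic<int> counter{0};
    std::vector<std::thread> workers;
    for (int i = 0; i < num_threads; ++i) {
        workers.emplace_back([&counter, increments_per_thread]() {
            for (int k = 0; k < increments_per_thread; ++k)
                counter++;  // atomic read-modify-write: no update is lost
        });
    }
    for (std::thread& w : workers) w.join();
    return counter.load();
}
```

With a plain int in place of the atomic, the same code would contain a data race and the result would be unpredictable.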

Futures and promises

While some types of threads are intended to be long-running (probably detached), needing to periodically communicate, other types of threads are started with a specific task in mind, and should “return” a result when they are finished. Of course, threads are not functions, so they cannot return a result normally, but futures and promises provide a similar facility for threads.

A future is an object that will contain the result of some calculation, to be finished in the future. Once the calculation is complete the value in the future will not change.
A promise is the other half of a future, an object into which the final result is written.

You can think of the future as the “getter” and the promise as the “setter”. You get a future from a promise. (You can also get a future from a “packaged task”, a helper for turning a normal function into a promise, which runs the function in a separate thread and stores its return value into the promise. async does something similar.)

Given a future, you can only do a few things with it:

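Ask whether its associated task is complete (i.e., does the future contain a value). In Boost.Thread this is provided by the .is_ready() member function; the standard future has no direct equivalent, so there you call .wait_for() with a zero duration instead. (The .valid() member function answers a different question: whether the future is connected to a promise at all.)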
Get the future’s value, waiting for it if it is not available. This is provided by the .get() member function.
Wait for the future to complete (optional: wait for some amount of time, returning a status value that says whether the future finished or the wait timed out). This is provided by .wait(), .wait_for() and .wait_until().

To use a promise, a thread should write a result into it when it is done. This is done by calling .set_value() on the promise, passing it the final value, or by calling .set_value_at_thread_exit(). (The normal .set_value() makes the value available immediately; .set_value_at_thread_exit() stores the value right away but does not make it available to the future until the thread has actually finished and its thread-local objects have been destroyed.)

An interesting use case for futures/promises is to create a future which doesn’t store anything, a promise<void>. This can only be used to communicate completion to other threads, a kind of signal that the thread owning the promise is done.

Normally, only one future can wait on a single promise; if you need to allow multiple threads to wait on a promise, use a shared_future, a version of a future which can be copied freely, so all the waiting threads can hold a copy.
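A minimal sketch of a thread "returning" a value through a promise follows. It assumes the standard <future> header, which Boost's futures closely resemble; compute_in_thread is a hypothetical name, and the worker's calculation is a placeholder.

```cpp
#include <future>
#include <thread>
#include <utility>

int compute_in_thread() {
    std::promise<int> result_promise;                        // the "setter" half
    std::future<int> result = result_promise.get_future();   // the "getter" half

    // Promises cannot be copied, so the promise is moved into the worker.
    std::thread worker([](std::promise<int> p) { p.set_value(6 * 7); },
                       std::move(result_promise));

    int value = result.get();  // blocks until set_value has been called
    worker.join();
    return value;
}
```

Note that get() can be called only once on a plain future; a shared_future is needed if several threads want the same result.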

Waiting on a group of threads

Calling .join or using thread_joiner works fine as long as we have a fixed number of threads. What if we are dynamically allocating threads, and we aren’t sure how many of them we’ll have? In this case we can use a thread group. A thread group is a container for an arbitrary number of threads. The nice thing about it is that we can treat the entire group like a single thread: we can .join_all with it to wait for all the threads in the group to finish:

boost::thread_group grp;

// Create as many threads as there are things
for(auto x : things) {
    grp.create_thread(x);
}
...
grp.join_all(); // Wait for all threads to finish

It’s also possible to interrupt all the threads in a group:

grp.interrupt_all();

Of particular interest is the fact that thread_group itself is thread-safe, so a thread that is in the group can add other threads to it. This is useful for converting recursive procedures to parallel ones:

using namespace boost;

unsigned int location = 0;
bool found = false;
vector<int> data;

thread_group threads;

struct parallel_find {
    unsigned int low, high; // Range to search through
    int target;             // Value to search for

    void operator()() {
        if(high < low) {
            return;
        }
        else if(high == low) {
            if(data.at(low) == target) {
                location = low;
                found = true;
            }
        }
        else {
            // Split range in half and search in parallel
            unsigned mid = low + (high - low) / 2;
            threads.create_thread(parallel_find{low, mid, target});
            threads.create_thread(parallel_find{mid+1,high, target});
        }
    }
};

bool find(int target, int& where) {
    location = 0;
    found = false;

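    // Create initial thread (an empty vector has nothing to search,
    // and data.size()-1 would wrap around)
    if(data.empty())
        return false;
    threads.create_thread(parallel_find{0, static_cast<unsigned int>(data.size() - 1), target});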

    // Wait for all threads to finish
    threads.join_all();

    if(found)
        where = location;

    return found;
}

If there is enough parallelism in the system, this parallel find will find the location of a target in a vector in \(O(\log n)\) time, rather than the \(O(n)\) time a sequential linear search would require. But note that if the target occurs more than once in the vector data then which occurrence is referenced by location is totally unpredictable.

A safe framework for parallelism

A safer framework for both writing and analyzing parallel code is fork-join. This framework assumes that multiple threads never operate on the same data simultaneously; all threads operate independently of each other, and if we want to combine the results of two threads, we must join on them (wait for both to finish). (Some versions of this framework actually require all data structures to be immutable – they cannot be modified at all after creation!).

Consider a function intended to sum the elements of an array, specified as starting and ending pointers:

int sum(int* start, int *finish) {
  if(start == finish)
    return 0;
  else
    return *start + sum(start + 1, finish);
}

The traditional big-O complexity of this is \(O(n)\) in the distance between finish and start. In fork-join analysis we would call this the work of the function, the total amount of stuff that a function does over all threads.

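Here’s another version, which splits the range in half and recurses on each half:

int sum(int* start, int* finish) {
  if(start == finish)
    return 0;
  else if(finish - start == 1)
    return *start; // Single element; without this case the right half never shrinks
  else {
    int* mid = start + (finish - start)/2 ; // Midpoint
    int ls = sum(start, mid);
    int rs = sum(mid, finish);
    return ls + rs;
  }
}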

This does exactly the same amount of work as the previous, but notice that the two recursive calls are independent of each other; they look at completely disjoint parts of the array, and the array is not modified, so there’s no reason they could not run in parallel. In fact, we could fork both recursive computations to run in parallel, and then join on them, waiting until they finish to do the final addition.
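One way to sketch the fork and the join is with std::async from the standard <future> header. This is a swapped-in technique chosen for brevity, not Boost.Thread and not a dedicated fork-join framework. The function name parallel_sum and the cutoff parameter (below which the range is summed sequentially, to avoid creating a thread per element) are assumptions made for illustration.

```cpp
#include <future>

// Sum of the range [start, finish). Ranges longer than cutoff are split in
// half: the left half is forked as an asynchronous task, the right half runs
// in the calling thread, and future::get() joins the two.
int parallel_sum(const int* start, const int* finish, long cutoff = 1000) {
    long n = finish - start;
    if (n <= cutoff || n < 2) {
        int total = 0;  // small range: plain sequential loop
        for (const int* p = start; p != finish; ++p)
            total += *p;
        return total;
    }
    const int* mid = start + n / 2;
    std::future<int> left =
        std::async(std::launch::async, parallel_sum, start, mid, cutoff);  // fork
    int rs = parallel_sum(mid, finish, cutoff);  // right half, in this thread
    int ls = left.get();                         // join: wait for the left half
    return ls + rs;
}
```

Since the two halves are disjoint and the array is only read, no mutex is needed; the only synchronization is the join performed by get().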

The span of this operation is the depth of its recursion tree, in this case, \(O(\log n)\). Span effectively measures how long an algorithm would take to run if infinite threads were available, and the only constraints were the data dependencies (ls + rs has a dependency on both ls and rs).

The work divided by the span gives us the “amount of parallelism” available, a measure of how many threads can usefully be assigned to run an algorithm.

Suppose we have p threads available (e.g., 4, 8, etc). Then we can compute the real cost of the algorithm via \(O(\frac{w}{p} + s \log p)\). If \(w/p > s \log p\) then we say that “work dominates”: the sequential part will take more time than the parallel part. For sequential algorithms, we find that \(s \approx w\).

To compute the work of a (recursive) algorithm, we simply add up the work done by its recursive calls. Thus, the total work done by the recursive sum is just the work done by the left sum, plus the work done by the right sum. The base case does work 1.

To compute the span of a (recursive) algorithm, we take the max of the spans of its recursive calls, plus 1 for the calling function. The base case has span 1. This gives us the height of the recursion tree.
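Applying these two rules to the divide-in-half sum gives the recurrences below. This is a sketch under stated assumptions: \(n\) is a power of two, the base case is a single element, and the addition that combines the two halves is charged a constant cost of 1. \(W\) and \(S\) are notation introduced here for work and span.

```latex
\begin{align*}
W(1) &= 1, & W(n) &= 2\,W(n/2) + 1
  &&\Rightarrow\quad W(n) = 2n - 1 = O(n) \\
S(1) &= 1, & S(n) &= \max\bigl(S(n/2),\, S(n/2)\bigr) + 1 = S(n/2) + 1
  &&\Rightarrow\quad S(n) = \log_2 n + 1 = O(\log n)
\end{align*}
```

The ratio \(W(n)/S(n) = O(n / \log n)\) is then the amount of parallelism available in the divide-in-half sum.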

Low span is always better: it means that the recursion tree is flat, and we can distribute the work over as many threads as are available. If \(s = w\) then the algorithm is purely sequential: no parallelism at all can be exploited.

Note that any parallel algorithm designed this way can be run sequentially, with no modification: just run the “forked” recursive calls sequentially, one after the other.

There are fork-join frameworks available in C++. Some even let you annotate your code with the work and span measurements, so that the scheduler can choose the optimal number of threads to use (can choose whether to fork, or to run sequentially) for the particular inputs passed to your functions.