Advanced topics: Parallelism
Parallelism is a fancy computer science word for “doing more than one thing at once”. The fact that all of you can be logged in to the server at the same time, compiling your code without affecting each other is one kind of parallelism: multiprocessing, the ability to run more than one instance of a program (each instance is called a process) at the same time, with instances being largely independent of each other. This is sometimes called coarse-grained parallelism, because it works at the level of whole programs. We’re going to look at a kind of parallelism that is a little more fine-grained, called multithreading. This involves multiple “threads” of execution, within a single program.
A note about libraries: Most everything we’re going to talk about has been
added to C++ as C++11’s thread library, but the version of the compiler
installed on the server doesn’t support it (yet). So instead we’re going to use
the Boost library Boost.Thread, which exists partly to bridge compatibility
gaps like this. The Boost.Thread library is modeled after the standard library,
so most everything we’re going to do will apply to both; it’s just a matter of
whether you use the std namespace or the boost namespace. Similarly,
we’re going to #include <boost/thread.hpp>; for the standard version,
we’d #include <thread>. Finally, when you compile, you’ll have to
explicitly link with the boost_thread library (whereas the standard library
is standard):
g++ -o myprogram myprogram.cpp -lboost_system -lboost_thread
You should also note that <boost/thread.hpp>
is a very complex header file,
and #include
-ing it will make your code take longer-than-normal to compile.
There are a few features which are in Boost.Thread which are not in the
standard C++ thread library; I’ll try to point these out when we talk about
them.
If you’re using the standard <thread>
header, you may still need to link
with a library to use it. For example, on the server you have to do
g++ -o my_program my_program.cpp -lpthread
Thread independence
What does it mean to say that a program can be executing multiple threads? It depends on the platform, but there are some things that always apply:
Threads share all function definitions and class definitions. All threads in your program are part of the same program, and thus can call all the same functions, etc.
Threads share all global variables; the memory space where global variables are stored is shared by all threads (though, as we will see, you have to be very careful about accessing a global variable from multiple threads). This includes class-static members!
Threads share the heap, so dynamically allocated objects can be shared between threads (but again, you have to do so very carefully).
Every thread has its own execution stack. This is the crucial part; this means that different threads can be executing different functions, at the same time, without interfering with each other. Similarly, the local variables in a function are local to each thread, even if multiple threads are executing the same function.
Threads can have “thread-local storage”, a portion of memory that is accessible only from a single thread. You have to take special action to allocate things here.
The fact that threads share all the program’s definitions means that they are all part of the same program. The fact that they share global variables and the heap provides a means for threads to work together and communicate. But the fact that threads have their own stack means that the execution state, what each thread is doing at any particular time, can be different for each thread.
A program without threads implicitly has one thread, the main thread of execution. But a mono-threaded program isn’t very interesting from a parallelism point of view, so we’ll only consider programs with two or more threads to be properly multithreaded.
Atomic operations
What kinds of things are ‘safe’ to do in multiple threads? Not much. Remember that threads share the same memory space, so if two threads try to write to the same memory address at the same “time” it is unpredictable which result will win out. Furthermore, most of the objects we operate on (strings, vectors) take up more than one memory location, so that “modifying” them may take several operations: you might see partial results from one thread and some results from another.
Even something as simple as incrementing a variable is not safe:
c++;
in assembly looks something like
mov eax, [c]
inc eax
mov [c], eax
and another thread could interrupt anywhere in this sequence. eax
is a
register and registers are not shared between threads. Let’s look at
what happens if two threads try to both increment a variable:
Time | Thread 1 | Thread 2 |
---|---|---|
0 | mov eax, [c] | |
1 | | mov eax, [c] |
2 | inc eax | |
3 | | inc eax |
4 | mov [c], eax | |
5 | | mov [c], eax |
Remember that each thread has its own eax
register. After this sequence of
events, the value in memory location [c]
will have been incremented once,
not twice; one of the updates was lost (this is called the “lost update problem”).
The guarantee on Intel is that a single 32-bit memory write will happen
atomically: i.e., if you write an int
to memory you don’t have to worry
that another thread will see some of the bits but not all. Other than that,
there are no guarantees.
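One way to avoid the lost update is to make the whole increment a single indivisible operation. The sketch below is only an illustration: it uses the standard library's std::atomic (Boost has an equivalent), and the function name count_with_two_threads is made up for this example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: two threads each increment a shared counter n times.
int count_with_two_threads(int n) {
    assert(n >= 0);
    std::atomic<int> c{0};
    auto work = [&c, n] {
        for (int i = 0; i < n; ++i)
            c++;  // one atomic read-modify-write, so no update can be lost
    };
    std::thread t1{work}, t2{work};
    t1.join();
    t2.join();
    return c.load();
}
```

Calling count_with_two_threads(100000) should always give 200000. With a plain int in place of the atomic, the result could come up short.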
Another example: why can’t we easily check whether a thread is still running?
if(thread.running())
// Do stuff
Think about what we know if the if
succeeds: do we really know,
inside the if
that the thread is still running? No! It might have stopped
a millisecond after we tested it, so anything we do inside the if, based on
the assumption that the thread is running, will be wrong!
Threads in C++
If each thread has its own flow of execution, what does a thread look like in C++? How do we tell it where to start, in our program?
In C++, threads are represented as instances of the class boost::thread
. The
thread
class is just a wrapper that provides for starting up a thread,
controlling it, etc. The real meat of the thread of execution is handled by
a class that you must write, which must be a “callable” class. A callable
class looks like this:
class my_thread {
public:
void operator() () {
// Thread starts here
}
// ... Other members/methods as necessary
};
A callable can also be a function that takes no parameters and
returns void
, e.g.
void f() {
... // Thread starts here
}
Indeed, we can even use a lambda (anonymous function):
[]() {
// Thread code here
}
Returning to the callable class: it is a class that publicly overloads the ()
function call operator, taking no arguments, so that if we create an instance
of this class, we can pretend that it is a function:
my_thread t1;
t1(); // Calls operator() ()
Note that this code does not actually spawn a new thread! my_thread is
just a normal class, and calling the overloaded operator() is no different
from calling any other method. If we want to actually spawn a new thread, we
have to create a boost::thread instance and tell it to run a my_thread:
my_thread mt;
boost::thread t1{mt};           // Spawns a new thread running mt
boost::thread t2{my_thread()};  // Also works (a second, separate thread)
... // Normal program flow continues *in parallel* with mt
Note that the thread makes a copy of mt
, so if your thread code modifies
any data members, those changes will not be reflected in the members of mt
.
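The copying behavior is easy to see in a small experiment. This is a sketch using the standard std::thread, which copies its callable the same way. The names counter_task, count_after_copy and count_after_ref are invented for the illustration, and std::ref is shown as one possible way to make the thread run the original object instead of a copy.

```cpp
#include <cassert>
#include <functional>
#include <thread>

// Hypothetical callable with one data member.
struct counter_task {
    int count = 0;
    void operator()() { ++count; }
};

// The thread runs a copy, so the original object is untouched.
int count_after_copy() {
    counter_task task;
    std::thread t{task};
    t.join();
    return task.count;
}

// std::ref passes a reference wrapper, so the thread runs task itself.
int count_after_ref() {
    counter_task task;
    std::thread t{std::ref(task)};
    t.join();
    return task.count;
}
```

count_after_copy() returns 0 and count_after_ref() returns 1. In the second version the object must outlive the thread, which is why the join comes before the return.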
Note that we can launch multiple threads running my_thread
and they will
all be independent of each other:
my_thread mt1, mt2;
boost::thread t1{mt1};
boost::thread t2{mt2};
...
The callable class can have data members; these will be local to the thread
(the object itself is actually copied into the thread, so modifying mt1
after the thread t1 has been spawned will have no effect on t1). You can
use this to customize what each thread does before you start it up. Thus, you
can use the callable class to customize how the thread starts, but not to
communicate with it after it has started.
Note that you can’t copy a boost::thread
object, so this won’t work:
boost::thread t1{mt1};
boost::thread t2 = t1; // Error!
However, you can “move” them, which means that you can write a function which
returns boost::thread
objects:
boost::thread make_thread();
...
boost::thread t1 = make_thread(); // Fine
(In C++ terminology, the object returned by the function is not copied but
moved into t1
. This is possible because C++ can see that the object
returned cannot be used anywhere else, and thus a copy is not needed. In the
previous example, t1
still exists after the creation of t2
and thus a
(forbidden) copy is required.)
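A move can also be requested explicitly with std::move, which leaves the source object as Not-A-Thread. The following is a minimal sketch with the standard std::thread. The body of make_thread and the helper move_transfers_ownership are assumptions made up for the example.

```cpp
#include <cassert>
#include <thread>
#include <utility>

// Hypothetical factory: the returned thread object is moved out, not copied.
std::thread make_thread() {
    return std::thread{[] { /* thread code here */ }};
}

bool move_transfers_ownership() {
    std::thread t1 = make_thread();
    std::thread t2 = std::move(t1);  // explicit move: t1 gives up the thread
    bool ok = !t1.joinable() && t2.joinable();
    t2.join();                       // only t2 needs to be joined now
    return ok;
}
```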
It’s also possible to create a thread object without assigning a callable
to it; when we do this, the thread is said to be “Not-A-Thread”, the thread
version of a nullptr. You can’t do anything interesting with a
Not-A-Thread thread.
boost::thread t; // t is Not-A-Thread
thread
objects have an ID value, which uniquely identifies a particular
thread:
boost::thread t{thing};
t.get_id() // returns t's ID
The only rules for IDs are:
If t1.get_id() == t2.get_id() then t1 and t2 are the same thread.
If t1.get_id() == boost::thread::id() then t1 is Not-A-Thread.
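These rules can be checked directly. The sketch below uses the standard std::thread, whose IDs follow the same rules. The function id_rules_hold is a made-up name for the illustration.

```cpp
#include <cassert>
#include <thread>

bool id_rules_hold() {
    std::thread none;            // Not-A-Thread
    std::thread worker{[] {}};   // a real (if short-lived) thread

    bool none_has_default_id = (none.get_id() == std::thread::id());
    bool worker_has_real_id  = (worker.get_id() != std::thread::id());
    bool worker_is_not_us    = (worker.get_id() != std::this_thread::get_id());

    worker.join();
    // After the join, worker no longer represents a thread.
    bool joined_has_default_id = (worker.get_id() == std::thread::id());

    return none_has_default_id && worker_has_real_id &&
           worker_is_not_us && joined_has_default_id;
}
```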
If any exceptions are uncaught in a thread, then your entire program will terminate (the same as if you had an uncaught exception in a single-threaded program).
Communicating with a thread
There are a few basic ways to communicate with a thread
Global variables
Static class members
Reference/pointer data members
It’s fairly easy to write a thread class that uses a global variable or a static class member:
int x = 0; // Global
class thing {
public:
void operator() () {
x++;
y--;
}
static int y;
};
int thing::y = 100;
Using references/pointers is a little more involved:
class thing {
public:
thing(int *t) {
target = t;
}
void operator() () {
(*target)++;
}
int* target;
};
We use this like so:
int x = 0;
boost::thread t1{thing(&x)};
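After launching this thread, x will be incremented once by the new thread,
which then ends (to keep incrementing, operator() would need a loop of its
own). Note that x must outlive the thread, because the thread only holds a
pointer to it.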
Using a reference requires the use of a member initializer list:
class thing {
public:
    thing(int& t) : target(t)   // Take the int by reference, not by pointer
    { }
    void operator() () {
        target++;               // Increments the int that target refers to
    }
    int& target;
};
...
int x = 0;
boost::thread t1{thing(x)};
Thread lifetime
How long does a thread last, and what happens when it ends? In Boost.Thread,
there is only one way a thread can end: internally, because it reached the
end of the callable’s operator()
method. In particular, there is no way
to “kill” a thread from outside it, because doing so is very unsafe: you don’t
know what the thread might be doing, that you would be interrupting.
That leaves the question of how we,
outside the thread, can react to its termination. In Boost.Thread you have
to explicitly tell the system one of two things:
I don’t care when this thread finishes. This is called detaching from the thread.
I want to wait until the thread finishes. This is called joining the thread.
What happens if a boost::thread object is destroyed (e.g., goes out of
scope at the end of a function) without having been either joined or
detached? The answer is, your entire program is terminated. It is considered
an error to try to destroy a still-running thread, because interrupting a
thread is a very unsafe operation (there’s no way to know what the thread
might be doing when we stop it). So Boost.Thread (and the C++ standard) take
the only safe option, which is to terminate the whole program (remember that
when the whole program ends, the operating system will clean things up for
us, hopefully undoing whatever things the thread might have been doing).
To detach from a thread we simply call its .detach()
method:
boost::thread t{thing};
t.detach();
// Now t represents Not-A-Thread and can be destroyed
The thread itself continues to run, but t
is no longer connected to it
(t
represents Not-A-Thread). t
can be destroyed without ending our program.
To join a thread (wait for it to finish), we call its .join()
method:
boost::thread t{thing};
...
t.join(); // Wait for t to finish
// Now t is finished and represents Not-A-Thread
The call to .join()
will not return until the thread has ended. This means
that after the join, we can be certain that the thread has finished.
A lot of functions will have this pattern:
void my_function() {
    using namespace boost;
    thread t1{...}, t2{...}, t3{...};
    // Do stuff...
    // Wait for all threads to finish
    t1.join(); t2.join(); t3.join();
}
To simplify this kind of thing, Boost.Thread provides a thread_joiner
object
which will automatically join a thread before the surrounding function exits:
void my_function() {
    using namespace boost;
    thread t1{...}, t2{...}, t3{...};
    thread_joiner tj1{t1}, tj2{t2}, tj3{t3};
    // Do stuff...
    // No need to call .join, the thread_joiners will do it for us
}
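To see how such a helper can work, here is a hand-written sketch of the idea for the standard std::thread. The class name join_on_exit and the function run_with_guard are invented for this example, and this is not the actual Boost implementation.

```cpp
#include <cassert>
#include <thread>

// Hypothetical RAII helper: joins the thread it watches when it is destroyed.
class join_on_exit {
public:
    explicit join_on_exit(std::thread& t) : thread_(t) {}
    ~join_on_exit() {
        if (thread_.joinable())
            thread_.join();
    }
    join_on_exit(const join_on_exit&) = delete;
    join_on_exit& operator=(const join_on_exit&) = delete;
private:
    std::thread& thread_;
};

int run_with_guard() {
    int result = 0;
    {
        std::thread t{[&result] { result = 42; }};
        join_on_exit guard{t};
        // Do stuff...
    }   // guard is destroyed first and joins t; then t is destroyed safely
    return result;
}
```

The order of declaration matters: the guard must be declared after the thread, so that it is destroyed (and joins) before the thread object itself goes away.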
Like sleep_for
and sleep_until
, there are versions of join
—
try_join_for
and try_join_until
that wait to join the thread but give up
after some duration has passed, or some particular time has been reached.
These versions return a bool
that is true if the thread terminated (i.e.,
was joined) and false if it was still running when the time ran out.
Testing for thread completion
How can we check to see whether a thread is still running? We would expect there to be something like this (there isn’t):
boost::thread t{thing};
...
if(t.still_running()) {
...
}
But let’s think about the implementation of .still_running()
:
If the thread has ended, then .still_running() can return false, because a thread that has ended isn’t going to start back up again. (Once .still_running() == false it will always be that way.)
What if the thread is still running? Suppose .still_running() returns true. This seems OK, but suppose that immediately after we call .still_running() the thread ends; now we are inside the true branch of the if above, but the condition is no longer true! In fact, to the thread that is running the above code, it would appear that we magically got inside the true branch without the condition having been true!
Multithreading causes these kinds of problems a lot: you can’t just ask a thread
a question, because the answer might change immediately after it is given.
Because of this, Boost.Thread does not provide a built-in way to check
whether a thread has finished. You can work around this by using
try_join_for
with a duration of 0:
boost::thread t{thing};
...
if(t.try_join_for(boost::chrono::milliseconds(0))) {
    // t has terminated
}
// else t might still be running
(Again, in the else case it’s possible that the thread terminated immediately
after the call to try_join_for, and thus we don’t really know whether it
is still running or not.)
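The same one-way reasoning applies if the thread reports its own completion. Below is a sketch, assuming a flag that the thread sets as its very last action. The flag done and the function flag_is_set_after_join are made up for the example, and the flag is a std::atomic so that it can be read safely from another thread.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

bool flag_is_set_after_join() {
    std::atomic<bool> done{false};
    std::thread t{[&done] {
        // ... the thread's real work would go here ...
        done = true;  // last action of the thread
    }};

    // A "false" answer here may be out of date by the time we act on it;
    // a "true" answer can never go back to false.
    bool looked_finished_early = done;
    (void)looked_finished_early;

    t.join();     // after this, the thread has certainly finished
    return done;  // so the flag is certainly set
}
```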
A better way to deal with thread termination is to join with the thread,
as mentioned above. After doing .join()
you can be certain that the thread
has finished. This also implies a certain kind of design for threads that
are intended to be joined-with: they should be written so that they are always
progressing towards completion; it should not be possible for a joined-with
thread to get into an infinite loop. (If the thread gets into an infinite
loop, then the .join()
method will never return, and the outer thread will
effectively be waiting forever.)
Sleeping
Sometimes a thread is waiting for something to happen and it doesn’t know how long it will take. One way to do this is to write a loop like this:
while(!thing_has_happened()) {
// Do nothing
}
This is called busy-waiting and it’s rather inefficient, unless we expect the “thing” to happen very soon. While the loop is running, the thread is still using up CPU time. Instead, a better way is to wait for some amount of time inside the loop:
while(!thing_has_happened()) {
this_thread::sleep_for(chrono::seconds(10));
}
(This requires the Boost.Chrono library, for working with times and durations.)
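With the standard library, the same loop looks like this. It is a sketch in which the "thing" is simulated by a worker thread that sets a flag after a short delay. The names happened and wait_for_event, and the particular delays, are assumptions chosen for the example.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

bool wait_for_event() {
    std::atomic<bool> happened{false};

    // Simulated event: the worker sets the flag after about 20 ms.
    std::thread worker{[&happened] {
        std::this_thread::sleep_for(std::chrono::milliseconds(20));
        happened = true;
    }};

    // Check the flag, then sleep, instead of spinning at full speed.
    while (!happened)
        std::this_thread::sleep_for(std::chrono::milliseconds(5));

    worker.join();
    return happened;
}
```

The sleep duration is a trade-off: a longer sleep wastes less CPU time, but the waiting thread notices the event later.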
There is also .sleep_until(t)
which sleeps until a particular absolute
time (e.g., 12:37 PM).
Yielding the CPU
One last thing a thread can do is yield the CPU to another thread. This
isn’t something you ever have to do, but it can be useful in threads that
would otherwise tend to hog CPU time. this_thread::yield() tells the system
to give the rest of this thread’s CPU time slice to the next thread waiting
to execute.
This is just a request to yield; depending on the OS and state of the system, it might do nothing.
The “Hello, World” of multithreading
Let’s consider a bank application where we have one account but deposits and withdrawals can occur in parallel:
int account_balance = 0;
struct deposit {
deposit(int amt) {
amount = amt;
}
void operator() () {
account_balance = account_balance + amount;
}
private:
int amount;
};
struct withdraw {
withdraw(int amt) {
amount = amt;
}
void operator() () {
account_balance = account_balance - amount;
}
private:
int amount;
};
Note that both deposit
and withdraw
are callable; we can spawn new threads
to do either action:
deposit dep(100); // Deposit $100
withdraw wth(50); // Withdraw $50
boost::thread t1{dep}, t2{wth};
// Party on...
t1.join(); t2.join();
What will be the result of the above code, after both threads have completed?
We would expect the balance (starting at 0) to end up at 50, but is this
assured? In fact, no! There are several different possible outcomes, and
which one we get is totally unpredictable. To see how this can come about,
let’s take apart the key operation inside both the deposit
and withdraw
classes: adding/subtracting a value from account_balance
:
account_balance = account_balance + amount;
Inside the CPU, this single statement will be broken into a sequence of instructions, looking something like this:
COPY account_balance INTO r1;
ADD amount TO r1 INTO r2;
COPY r2 INTO account_balance;
for deposit
and
COPY account_balance INTO r3;
SUB amount FROM r3 INTO r4;
COPY r4 INTO account_balance;
(where the r*
are “registers”, storage locations that are inside the
CPU itself, as opposed to account_balance
and amount
which are stored in
memory. I could have used the same registers for both, because each thread
gets its own registers, but I made them different so that which is which will
be clear.)
Remember that the CPU will be running both threads at the same time; that means
that the three instructions of one thread might be interleaved with the three
instructions of the other in any order. While each thread has its own
registers, and its own amount, both threads share the same account_balance;
that’s the whole point: to allow concurrent updates to the account balance.
Here are two of the possible orders:
d: COPY account_balance INTO r1;
d: ADD amount TO r1 INTO r2;
d: COPY r2 INTO account_balance;
w: COPY account_balance INTO r3;
w: SUB amount FROM r3 INTO r4;
w: COPY r4 INTO account_balance;
d: COPY account_balance INTO r1;
d: ADD amount TO r1 INTO r2;
w: COPY account_balance INTO r3;
d: COPY r2 INTO account_balance;
w: SUB amount FROM r3 INTO r4;
w: COPY r4 INTO account_balance;
…
In fact, out of the \(\binom{6}{3} = 20\) possible ways these two threads
can execute (each thread’s own three instructions always stay in order, so we
are only choosing which three of the six time slots go to the deposit), only
2 result in the correct behavior! (These two correspond to sequential
execution, where one thread completes before the other starts; in order for
correct behavior, one thread’s final COPY r INTO account_balance must
execute before the other thread’s initial COPY account_balance INTO r.)
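If we keep each thread's own three instructions in order, the interleavings are few enough to check by brute force. The sketch below simulates every such interleaving of a $100 deposit and a $50 withdrawal and counts how many leave the balance at 50. The names interleaving_count and count_interleavings are made up for the illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

struct interleaving_count {
    int total;    // interleavings tried
    int correct;  // interleavings that end with a balance of 50
};

interleaving_count count_interleavings() {
    // 'd' = next deposit instruction, 'w' = next withdraw instruction.
    // The string starts sorted, so next_permutation visits each distinct
    // arrangement exactly once.
    std::string order = "dddwww";
    interleaving_count result{0, 0};
    do {
        int balance = 0, d_reg = 0, w_reg = 0;
        int d_step = 0, w_step = 0;
        for (char who : order) {
            if (who == 'd') {
                if (d_step == 0)      d_reg = balance;  // COPY balance INTO r
                else if (d_step == 1) d_reg += 100;     // ADD amount
                else                  balance = d_reg;  // COPY r INTO balance
                ++d_step;
            } else {
                if (w_step == 0)      w_reg = balance;
                else if (w_step == 1) w_reg -= 50;      // SUB amount
                else                  balance = w_reg;
                ++w_step;
            }
        }
        ++result.total;
        if (balance == 50)
            ++result.correct;
    } while (std::next_permutation(order.begin(), order.end()));
    return result;
}
```

In every other interleaving both threads read the starting balance of 0, so the final balance is either 100 or -50, depending on which thread writes last.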
Multithreaded programs have to deal with this constantly: the fact that virtually every “interesting” operation actually involves multiple steps, and those steps could end up interleaved in any order when executed by multiple threads. Even something so simple as
a = b;
isn’t necessarily safe. If a
and b
are anything more complex than an
int
it’s possible for b
to be changed in the middle of the process of
copying it into a
, so that a
ends up with something that is half old-b
and half new-b
. (Imagine copying a vector
, when another thread tries to
reallocate and grow the vector in the middle of the copy!)
A data type that can be safely accessed by multiple threads without things “going wrong” is called thread-safe. It should be obvious that our account implementation is not thread safe, as currently written.
Synchronization
As long as we are disciplined about what and how we access shared data, we won’t have any problems. Often, however, we will want to access shared data that is bigger than a single int, or we will want to access it in a way that requires some coordination. There are several synchronization tools that enable us to control access to shared data.
Mutex
A mutex (“mutual exclusion”) is an object that can be “held” by at most one thread at a time. Typically, only the thread holding the mutex is allowed to access some shared data or resource (by convention; the mutex doesn’t enforce this). You can think of a mutex as being like the “talking stick” of campfire stories; only the person holding the stick (mutex) is allowed to speak.
When no thread is holding the mutex, it is unlocked. In this state, no thread should be accessing the shared resource (reading from the resource may be OK or even that may be forbidden, depending on the nature of the resource). When a thread wants to use the resource, it locks the mutex. This will do one of two things:
If the mutex was unlocked, then it becomes locked, owned by the thread. No other thread is allowed to lock it.
If the mutex was already locked by another thread, then this thread blocks, waiting for the mutex to become available. (Rather than waiting on the mutex, a thread can use
try_lock
to check whether the mutex is unlocked, locking it if it is.)
(Blocking is done in such a way that no thread will have to wait forever; every thread eventually gets a turn at the mutex.)
When a thread is done with a mutex, it should unlock it, so that the next thread can have a turn.
Mutexes are somewhat low-level, but their behavior underlies a lot of other
synchronization primitives. lock_guard
s and unique_lock
s provide a wrapper
around mutexes that can make them easier to use.
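As an illustration, here is one way the bank account from earlier could be protected with a mutex. This is a sketch using the standard std::mutex and std::lock_guard (the Boost versions have the same names). The function names deposit, withdraw and run_transactions, and the name account_mutex, are choices made for this example.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

int account_balance = 0;
std::mutex account_mutex;  // by convention, guards account_balance

void deposit(int amount) {
    assert(amount >= 0);
    std::lock_guard<std::mutex> lock{account_mutex};  // locks here
    account_balance = account_balance + amount;
}   // lock is destroyed here, which unlocks the mutex

void withdraw(int amount) {
    assert(amount >= 0);
    std::lock_guard<std::mutex> lock{account_mutex};
    account_balance = account_balance - amount;
}

int run_transactions() {
    account_balance = 0;
    std::thread t1{deposit, 100}, t2{withdraw, 50};
    t1.join();
    t2.join();
    return account_balance;  // safe to read: both threads have finished
}
```

Because the whole read-modify-write happens while the mutex is held, the two updates can no longer interleave, and run_transactions() gives 50 no matter which thread goes first.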
Atomic variables
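An atomic variable behaves as if it were a variable wrapped in a mutex, so that
modifications to it happen atomically, from the perspective of other threads.
(The wrapped type must be trivially copyable, so a struct holding a string or
a vector will not work.)
struct thing {
    int a;
    float b;
    char c[8]; // Fixed-size buffer; a std::string is not trivially copyable
};
std::atomic<thing> at;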
To replace the value in an atomic variable, use store
or normal assignment:
at.store(thing{1,2.2,"Hello"});
// Equivalent
at = thing{1,2.2,"Hello"};
The replacement is done atomically: other threads see either the old value, or the new value, but never a mixture.
To access the value in an atomic variable, use load
, or just convert the
atomic to its stored type:
thing x = at.load();
thing y = at; // Equivalent, implicit cast
Once again, you are guaranteed to get a consistent thing
from load()
.
atomic overloads many of the arithmetic operators for the built-in integer
types, so that if the type you are wrapping is one of them, you can use them
directly. For example, with an atomic int:
std::atomic<int> ai;
ai = 0;
ai += 2; // Atomically adds 2
ai++; // Atomically increments
Futures and promises
While some types of threads are intended to be long-running (probably detached), needing to periodically communicate, other types of threads are started with a specific task in mind, and should “return” a result when they are finished. Of course, threads are not functions, so they cannot return a result normally, but futures and promises provide a similar facility for threads.
A future is an object that will contain the result of some calculation, to be finished in the future. Once the calculation is complete the value in the future will not change.
A promise is the other half of a future, an object into which the final result is written.
You can think of the future as the “getter” and the promise as the “setter”.
You get a future from a promise. (You can also get a future from a “packaged task”,
a helper for turning a normal function into a promise, which runs the function
in a separate thread and stores its return value into the promise. async
does something similar.)
Given a future, you can only do a few things with it:
Ask whether its associated task is complete (i.e., does the future contain a value). In Boost.Thread this is provided by the .is_ready() member function. (.valid() answers a different question: whether the future is connected to a promise at all.)
Get the future’s value, waiting for it if it is not available. This is provided by the .get() member function.
Wait for the future to complete (optionally giving up after some amount of time, in which case the return value reports whether the future finished by the deadline). This is provided by .wait(), .wait_for() and .wait_until().
To use a promise, a thread should write a result into it when it is done. Typically
this is done by calling .set_value_at_thread_exit()
on the promise, passing
it the final value. (The normal .set_value()
stores the value immediately; the
above waits until the thread is actually finished to store the value, which has
better performance characteristics.)
An interesting use case for futures/promises is to create a future which doesn’t
store anything, a promise<void>
. This can only be used to communicate
completion to other threads, a kind of signal that the thread owning the
promise is done.
Normally, only one future can wait on a single promise; if you need to allow
multiple threads to wait on a promise
, use a shared_future
, a version
of a future which can be copied freely, so all the waiting threads can hold
a copy.
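A small example may make the getter/setter picture concrete. This sketch uses the standard std::promise and std::future, and the plain .set_value() for simplicity. The function name compute_in_thread and the value computed are made up for the illustration.

```cpp
#include <cassert>
#include <future>
#include <thread>
#include <utility>

int compute_in_thread() {
    std::promise<int> p;                  // the "setter" half
    std::future<int> f = p.get_future();  // the "getter" half

    // The promise cannot be copied, so it is moved into the thread.
    std::thread worker{[](std::promise<int> pr) {
        pr.set_value(6 * 7);              // write the final result
    }, std::move(p)};

    int result = f.get();  // waits until the value is available
    worker.join();
    return result;
}
```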
Waiting on a group of threads
Calling .join
or using thread_joiner
works fine as long as we have a
fixed number of threads. What if we are dynamically allocating threads,
and we aren’t sure how many of them we’ll have? In this case we can use a
thread group. A thread group is a container for an arbitrary number of
threads. The nice thing about it is that we can treat the entire group
like a single thread: we can .join_all
with it to wait for all the threads
in the group to finish:
boost::thread_group grp;
// Create as many threads as there are things
for(auto x : things) {
grp.create_thread(x);
}
...
grp.join_all(); // Wait for all threads to finish
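The standard library has no thread group, but a vector of threads gives a similar effect for the simple case of "start many, then wait for all". In this sketch the function sum_of_squares and the work each thread does are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

int sum_of_squares(const std::vector<int>& things) {
    std::vector<int> results(things.size());
    std::vector<std::thread> group;

    // Create as many threads as there are things; each thread writes
    // only to its own slot of results, so no locking is needed.
    for (std::size_t i = 0; i < things.size(); ++i)
        group.emplace_back([&results, &things, i] {
            results[i] = things[i] * things[i];
        });

    // The equivalent of join_all
    for (std::thread& t : group)
        t.join();

    return std::accumulate(results.begin(), results.end(), 0);
}
```

Unlike thread_group, a plain vector is not thread-safe, so only the thread that owns the vector should add threads to it.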
It’s also possible to interrupt all the threads in a group:
grp.interrupt_all();
Of particular interest is the fact that thread_group
itself is thread-safe,
so a thread that is in the group can add other threads to it. This is useful
for converting recursive procedures to parallel ones:
using namespace boost;
unsigned int location = 0;
bool found = false;
std::vector<int> data;
thread_group threads;
struct parallel_find {
    unsigned int low, high; // Range to search through (inclusive)
    int target;             // Value to search for
    void operator()() {
        if(high < low)
            return;
        else if(high == low) {
            if(data.at(low) == target) {
                location = low;
                found = true;
            }
        }
        else {
            // Split range in half and search in parallel
            unsigned mid = low + (high - low) / 2;
            threads.create_thread(parallel_find{low, mid, target});
            threads.create_thread(parallel_find{mid+1, high, target});
        }
    }
};
bool find(int target, int& where) {
    location = 0;
    found = false;
    if(data.empty()) // Otherwise data.size()-1 would wrap around
        return false;
    // Create initial thread
    unsigned int last = static_cast<unsigned int>(data.size() - 1);
    threads.create_thread(parallel_find{0, last, target});
    // Wait for all threads to finish
    threads.join_all();
    if(found)
        where = location;
    return found;
}
If there is enough parallelism in the system, this parallel find
will
find the location of a target
in a vector in \(O(\log n)\) time,
rather than the \(O(n)\) time a sequential linear search would require. But
note that if the target
occurs more than once in the vector data
then
which occurrence is referenced by location
is totally unpredictable.
A safe framework for parallelism
A safer framework for both writing and analyzing parallel code is fork-join. This framework assumes that multiple threads never operate on the same data simultaneously; all threads operate independently of each other, and if we want to combine the results of two threads, we must join on them (wait for both to finish). (Some versions of this framework actually require all data structures to be immutable – they cannot be modified at all after creation!).
Consider a function intended to sum the elements of an array, specified as starting and ending pointers:
int sum(int* start, int *finish) {
if(start == finish)
return 0;
else
return *start + sum(start + 1, finish);
}
The traditional big-O complexity of this is \(O(n)\) in the distance between
finish
and start
. In fork-join analysis we would call this the work of the
function, the total amount of stuff that a function does over all threads.
Here’s another version which recurses on both halves of the range:
int sum(int* start, int* finish) {
    if(start == finish)
        return 0;
    else if(finish - start == 1)
        return *start; // One element: splitting again would never terminate
    else {
        int* mid = start + (finish - start)/2 ; // Midpoint
        int ls = sum(start, mid);
        int rs = sum(mid, finish);
        return ls + rs;
    }
}
This does exactly the same amount of work as the previous, but notice that the two recursive calls are independent of each other; they look at completely disjoint parts of the array, and the array is not modified, so there’s no reason they could not run in parallel. In fact, we could fork both recursive computations to run in parallel, and then join on them, waiting until they finish to do the final addition.
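Here is a sketch of what forking and joining the two halves could look like, using the standard std::async as the fork and .get() on the returned future as the join. The name parallel_sum and the depth parameter (which caps how many levels fork new threads) are assumptions added for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <future>

// depth limits forking: once it reaches 0 the halves run sequentially.
int parallel_sum(const int* start, const int* finish, int depth = 3) {
    std::ptrdiff_t n = finish - start;
    assert(n >= 0);
    if (n == 0) return 0;
    if (n == 1) return *start;

    const int* mid = start + n / 2;  // Midpoint
    if (depth <= 0)
        return parallel_sum(start, mid, 0) + parallel_sum(mid, finish, 0);

    // Fork: the left half runs in another thread...
    std::future<int> left = std::async(std::launch::async, [=] {
        return parallel_sum(start, mid, depth - 1);
    });
    // ...while this thread handles the right half.
    int right = parallel_sum(mid, finish, depth - 1);

    // Join: wait for the left half, then do the final addition.
    return left.get() + right;
}
```

Only one of the two halves is forked. The current thread would otherwise sit idle waiting, so it may as well do the other half itself.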
The span of this operation is the depth of its recursion tree, in this case,
\(O(\log n)\). Span effectively measures how long an algorithm would take
to run if infinite threads were available, and the only constraints were the
data dependencies (ls + rs
has a dependency on both ls
and rs
).
The work divided by the span gives us the “amount of parallelism” available, a measure of how many threads can usefully be assigned to run an algorithm.
Suppose we have p threads available (e.g., 4, 8, etc). Then we can compute the real cost of the algorithm via \(O(\frac{w}{p} + s \log p)\). If \(w/p > s \log p\) then we say that “work dominates”: the sequential part will take more time than the parallel part. For sequential algorithms, we find that \(s \approx w\).
To compute the work of a (recursive) algorithm, we simply add up the work
done by its recursive calls. Thus, the total work done by the recursive sum
is just the work done by the left sum, plus the work done by the right sum.
The base case does work 1.
To compute the span of a (recursive) algorithm, we take the max of the spans of its recursive calls, plus 1 for the calling function. The base case has span 1. This gives us the height of the recursion tree.
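As a worked instance of these two rules, take the divide-and-conquer sum on an array of \(n\) elements, assuming \(n\) is a power of two and that each call does 1 unit of work of its own (both simplifying assumptions made for this illustration):

\[ W(n) = 2\,W(n/2) + 1, \quad W(1) = 1 \quad\Longrightarrow\quad W(n) = 2n - 1 = O(n) \]

\[ S(n) = \max\bigl(S(n/2),\, S(n/2)\bigr) + 1 = S(n/2) + 1, \quad S(1) = 1 \quad\Longrightarrow\quad S(n) = \log_2 n + 1 = O(\log n) \]

The amount of parallelism is then \(W(n)/S(n) = O(n / \log n)\). For \(n = 1024\) this is \(2047 / 11 \approx 186\), so under these assumptions a few hundred threads could in principle be kept busy.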
Low span is always better: it means that the recursion tree is flat, and we can distribute the work over as many threads as are available. If \(s = w\) then the algorithm is purely sequential: no parallelism at all can be exploited.
Note that any parallel algorithm designed this way can be run sequentially, with no modification: just run the “forked” recursive calls sequentially, one after the other.
There are fork-join frameworks available in C++. Some even let you annotate your code with the work and span measurements, so that the scheduler can choose the optimal number of threads to use (can choose whether to fork, or to run sequentially) for the particular inputs passed to your functions.