And when I wave, I lose the time I save
There’s danger if I dare to stop
And here’s the reason why.
You see, I’m over due, I’m in a rabbit stew
The above are excerpts from a famous saying in the Walt Disney film “Alice In Wonderland” (1951).
You might be thinking to yourself: How does this saying relate to parallel programming?
When you learn parallel programming, the usual first experience is with using OpenMP in C++ or FORTRAN. We will use C++ in the sample code, but the same issues in FORTRAN and other languages with programming parallel extensions. Your typical first experience is with loops.
#pragma omp parallel for
for(int i = 0; i < count; ++i)
A[i] = B[i] + C[i];
To the programmer learning parallel programming, the statements above convey the impression that prior to the “#pragma omp parallel for”, one of the processors in the system is running the code and that the “#pragma omp parallel for” somehow magically focuses all the processors immediately on performing the for loop in pieces in parallel. Then someone will point out that processor in this context means core or hardware thread, or just plain thread.
The first impression of the above loop then is viewed as: prior to the loop the program runs with one thread, and during the loop the program uses n threads where n equals the number of concurrent threads available on the system (two on a single core with HT, 2, 4, 6, 8 12, 16 on various other current processor designs multiplied by the number of processor chips in the system).
Here is where “I'm late, I'm late for a very important date” comes to play.
All of the other threads on the system are not necessarily sitting in an idle state when your program reaches the “#pragma omp parallel for”. Even on a dedicated system, some of the threads may be busy running background programs or services (system update), or running other applications you launched (email, browser, playing music, internet radio, Task Manager, etc…).
The effect of other threads not being immediately available results in: not all of the threads scheduled to run the parallel for loop begin at the same time (or nearly the same time). Some threads will be late in starting their section of the loop.
In a similar manner that other applications and services running on the system can delay a thread or threads from starting their section of the loop, other threads within your application may also introduce delays. These other threads could be thread(s) driving your display and input device. Even in simple console applications, some of the runtime systems use background threads to write data to the console window.
The parallel programmer should learn and accept the fact that all threads will not be in complete control by the application. The programmer has a significant influence on the activities of the threads, but not absolute control over the activities of threads.
Now we examine the next line in the saying “And when I wave, I lose the time I save”
As you write parallel code with higher levels of complexity, you will find that some of the threads need to communicate with the other threads working with the same data. This access must be such that each thread does not interfere with any of the other threads use of the data.
There are many ways of performing intra-application thread-to-thread communication which we won’t address in detail in this blog post. Among these are: Critical Section, mutex, Interlocked operations, volatile flags, etc. Although the implementation details differ, the effect to your thread is its performing “And when I wave”. The consequence is: “I lose the time I save”.
The simplest of the inter-thread thread-safe communication (arguably) is the Interlocked operations. Internally the operation may be reduced to adding a single instruction prefix to an instruction such as ADD
LOCK add mem, reg
The LOCK prefix assures cache coherency between all threads (even across processors). In effect, this is your “And when I wave”. Adding that lock prefix (one byte) extends the execution time of that instruction 100 times to 300 times the time required without the LOCK. This is the “I lose the time I save” in writing the code as parallel code. You have similar penalties when using the other inter-thread communication (thread-safe) techniques.
There is a trade-off between the "time you save" against "the time you loose". The lesson learned is write your code such that you reduce the number of “waves” in your program.
Now consider: “There’s danger if I dare to stop, And here’s the reason why. You see, I’m over due, I’m in a rabbit stew”
Looking back at the simple parallel for loop, assume that all the other threads on the system were immediately available at the time the parallel for loop began. And that all these threads started at the same time. Will all these threads finish at the same time?
“There’s danger if I dare to stop”
Although your thread running the parallel for loop does not explicitly stop until it has finished its portion of the parallel for loop, the system will invariably have other things to do. Such as: background programs or services, and you may be running other applications too (email, browser, music, etc..).
Most operating systems perform an operation called pre-emption, also known as context switching, when there are more software threads active than available hardware threads. The operating system will run one thread for a period of time, and then it will suspend this thread and resume a different thread. And repeat this until threads become idle, or at least to the point where the number of busy software threads is less than or equal to the number of hardware threads.
The consequence of pre-emption is: “You see, I’m over due, I’m in a rabbit stew”
This means, although the parallel loop began with all threads starting at the same time, the finish time of each thread may differ as an unintended consequence of pre-emption.
This saying from “Alice In Wonderland” (I’m late, I’m late …) has an important message for parallel programmers. And it is important for you to keep this in mind as you write parallel code.
In future blogs we will examine how QuickThread addresses some of these issues.