Tuesday, December 22, 2009

New release of QuickThread available.

A new release of QuickThread is now available to registered users at the download and support site.

Principal difference is optimization of parallel_for. A new class of task node has been created: an iterating task node. The new parallel_for (and parallel_task) are now modified to make use of this new task node. Prior versions would enqueue multiple task nodes per slice of iteration space (per chunk when chunks used), resulting in one task enqueue per thread (or chunk as the case may be). New code enqueues one iterating task node regardless of the number of slices.


int chunk=100;
int nObjects = 123456789;
parallel_for(chunk, fnObject, 0, nObjects, a3);


old format has nObjects number of task node and enqueue/dequeue
new format has 1 task node and enqueue/dequeue


The new format has significantly fewer demands on task enqueue/dequeue and numbers of task nodes in flight. In the event that you experience problems you can disable this feature by editing parallel_for_v1.h and commenting out


#define USE_new_parallel_for

New Home page at www.quickthreadprogramming.com too.

The home page contains a new document comparing QuickThread and OpenMP under various system loads.


Jim Dempsey

Sunday, November 29, 2009

Comparison between QuickThread and OpenMP 3.0

Often when you see manufacturer’s published bench marks you generally see the results data when a test program is run under ideal conditions. If you have followed my White Rabbit blogs on the Intel Software Network Blogs you will see the effects of various load conditions as it relates to unexpected delays in a test program.

A similar test program was adapted from that program for use in this comparison between QuickThread and OpenMP. One test program is the evolutionary process of adapting a single threaded program to use OpenMP 3.0 ranging from simple design to more complex designs using OpenMP 3.0 features. The other test program performs the equivalent steps using QuickThread software development kit.

Each test program runs a series of tests ranging from single threaded through various design attempts to eek maximum performance out of the test program while run under various load conditions.

All tests ran on Intel Q6600 quad core 2.4GHz system. These tests are not setup to test scalability of the application. Excepting for single threaded tests, all parallel tests use all cores for their thread pools. This is to say, the tests are setup to determine the effects on performance of the test application while running under these different system load conditions:

No load
1 Load of a console command running DIR C:\ /S
2 Loads (2 console commands) of running DIR C:\ /S
1 console window running the following hang.bat file
echo off
:loop
goto loop
2 console windows running hang.bat
3 console windows running hang.bat
4 console windows running hang.bat

The functional diagram of the application is

process 1
process 2
process 3
process 4

Where processes 2 and 3 are dependent on process 1, and process 4 is dependent on processes 2 and 3.

process 1
process 2, process 3
process 4

The data and procedure within all processes are such that slices of the data sets can migrate through the processes.

The single threaded versions are below:

// process 1 - single threaded
void Process1_st()
{
for(int i = 0; i < arraySize; ++i)
A[i] = ((double)i) / 1000.0;
}

// process 2 - single thread
void Process2_st()
{
for(int i = 0; i < arraySize; ++i)
B[i] = sin(A[i]);
}

// process 3 - single thread
void Process3_st()
{
for(int i = 0; i < arraySize; ++i)
C[i] = cos(A[i]);
}

// process 4 - single thread
void Process4_st()
{
for(int i = 0; i < arraySize; ++i)
D[i] = sqrt((B[i]*B[i])+(C[i]*C[i]));
}

The size of the arrays are (1024*1024) which is 8MB and larger than the cache size on the system. The functions are meaningless other than they emulate lightly loaded processes that might exist in a system.

After the background load is established the test program runs and takes 100 sample times of each test run using various parallelization strategies. The strategies will be described following the commentary on the charts.







For both charts the lines represent the average of a 100 point samples for each test, under varying test load conditions. The X-axis represents tests. The Y-axis represents a scaling factor from the single threaded No Load minimum run time. The tests are

st single thread
st_Str single thread with stride
pf parallel for
pf_Str parallel for in stride loop
pf_Str_tq parallel for in stride loop driven by taskq (OpenMP 3.0 feature)
pfc parallel for with dynamic scheduling and chunk size
pfc_Str parallel for with dynamic scheduling and chunk size in stride loop
pfc_Str_tq parallel for with dynamic scheduling and chunk size in stride loop driven by taskq
pfs parallel for using sections
pfcs parallel for with dynamic scheduling and chunk size using sections

NOTE
QuickThread does not have a parallel construct with the name taskq, however equivalent functionality is available using parallel_task (which has additional capabilities). The same name was used on the chart for tests with equivalent functionality.

The charts illustrate that under No Load conditions OpenMP 3.0 yields almost as good as performance as QuickThread. (as long as you avoid pf_Str and pfc_Str programming practices).
QuickThread, on the other hand, yields consistently good results under varying load conditions as well as under various programming practices. This means that you will spend less time reworking code with QuickThread than you will with OpenMP

Let’s look at the programming differences. We won’t list all the code, but we will list enough for you to be able to extrapolate what the code would look like.

Single threaded test (st)

// single thread for loops
void test_st()
{
Process1_st();
Process2_st();
Process3_st();
Process4_st();
}

Where the process code was listed earler.

The Single thread with stride test (st_Str)

// single thread for loops using stride
void test_st_Stride()
{
for(int iBegin = 0; iBegin < arraySize; iBegin+=Stride)
{
int iEnd = iBegin + Stride;
if(iEnd > arraySize)
iEnd = arraySize;
Process1_st(iBegin, iEnd);
Process2_st(iBegin, iEnd);
Process3_st(iBegin, iEnd);
Process4_st(iBegin, iEnd);
}
}

The purpose of including this in the test was to provide some sense of the additional call overhead as well as any potential benefit of cache differences.

The parallel for tests are the same algorithms as the single threaded, so we will only list the Process1_pf functions

OpenMP

// process 4 - parallel for
void Process4_pf()
{
#pragma omp parallel for
for(int i = 0; i < arraySize; ++i)
D[i] = sqrt((B[i]*B[i])+(C[i]*C[i]));
}

QuickThread

// process 1 - parallel for
void Process1_pf()
{
parallel_for(
0, arraySize,
[&](int iBg, int iEn)
{
for(int i = iBg; i < iEn; ++i)
A[i] = ((double)i) / 1000.0;
});
}

The QuickThread implementation is using the C++0x Lambda functions. If you prefer to not use Lambda functions then you simply take the Lambda function out and provide a name for the function and supply the half open range as arguments to the new task function.


// process 1 - parallel for
void Process1_pf_task(int iBg, int iEn)
{
for(int i = iBg; i < iEn; ++i)
A[i] = ((double)i) / 1000.0;
}
void Process1_pf()
{
parallel_for(
Process1_pf_task,
0, arraySize); // as task args 1 and 2
}

The chunking version of the parallel for is

OpenMP

// process 4 - parallel for chunked
void Process4_pfc()
{
#pragma omp parallel for schedule(dynamic, chunk)
for(int i = 0; i < arraySize; ++i)
D[i] = sqrt((B[i]*B[i])+(C[i]*C[i]));
}

QuickThread

// process 1 - parallel for chunked
void Process1_pfc()
{
parallel_for(
chunk,
0,arraySize,
[&](int iBg, int iEn)
{
for(int i = iBg; i < iEn; ++i)
A[i] = ((double)i) / 1000.0;
});
}

The Parallel for Stride Taskq

void test_pf_Stride_taskq()
{
#pragma omp parallel
{
#pragma intel omp taskq
{
for(int iBegin = 0; iBegin < arraySize; iBegin += Stride)
{
int iEnd = iBegin + Stride;
if(iEnd > arraySize)
iEnd = arraySize;
#pragma intel omp task captureprivate(iBegin,iEnd)
{
Process1_pf(iBegin, iEnd);
Process2_pf(iBegin, iEnd);
Process3_pf(iBegin, iEnd);
Process4_pf(iBegin, iEnd);
}
}
}
}
}


QuickThread

// parallel for loops with stride and taskq
void test_pf_Stride_taskq()
{
qtControl StrideControl;
for(int iBegin = 0; iBegin < arraySize; iBegin+=Stride)
{
int iEnd = iBegin + Stride;
if(iEnd > arraySize)
iEnd = arraySize;

// on parallel_task use [=] as opposed to [&]
// as the values of iBegin2 and iEnd2 change
// while building up the task list
parallel_task(
&StrideControl,
[=]()
{
Process1_pf(iBegin, iEnd);
Process2_pf(iBegin, iEnd);
Process3_pf(iBegin, iEnd);
Process4_pf(iBegin, iEnd);
});
}
}

The QuickThread functional equivalent (superset) is the parallel_task

The parallel for in sections

OpenMP

// multi-threaded parallel for loop sections
void test_pfs()
{
Process1_pf();
#pragma omp parallel sections
{
#pragma omp section
{
Process2_pf();
}
#pragma omp section
{
Process3_pf();
}
}
Process4_pf();
}


QuickThread

// multi-threaded parallel for loop sections
void test_pfs()
{
Process1_pf();
parallel_invoke(
[&]() { Process2_pf(); },
[&]() { Process3_pf(); });
Process4_pf();
}

As you can see from the code snips there is some difference in programming style for the parallel constructs between OpenMP and QuickThread. Consistent performance under varying load conditions is a major plus as well as slightly better performance under ideal (no load) conditions.

Jim Dempsey

Saturday, November 28, 2009

White Rabbits on ISN

Dear readers,

There are three derivative postings of White Rabbits on Intel Software Network Blogs - Parallel Programming (here) At the time of this posting, two blog post have been posted, and the third is in for review. I would expect the third posting to be up on Monday, 30th of November. I won't repeat the contents of those posts here. Instead I will use the test program developed for White Rabbits to illustrate the evolutionary programming steps in conversion of a serial program to parallel program. The conversion will address issues, solutions, and results for conversion using OpenMP 3.0 and QuickThread. I hope to have the first installment posted over this weekend.

When reading the posts related to a topic, please read the similarly titled post from the bottom of the blog upwards (in chronological order).

Jim Dempsey

Saturday, November 21, 2009

I'm late, I'm late for a very important date


And when I wave, I lose the time I save

There’s danger if I dare to stop
And here’s the reason why.
You see, I’m over due, I’m in a rabbit stew


The above are excerpts from a famous saying in the Walt Disney film “Alice In Wonderland” (1951).

You might be thinking to yourself: How does this saying relate to parallel programming?

When you learn parallel programming, the usual first experience is with using OpenMP in C++ or FORTRAN. We will use C++ in the sample code, but the same issues in FORTRAN and other languages with programming parallel extensions. Your typical first experience is with loops.

#pragma omp parallel for
for(int i = 0; i < count; ++i)
A[i] = B[i] + C[i];

To the programmer learning parallel programming, the statements above convey the impression that prior to the “#pragma omp parallel for”, one of the processors in the system is running the code and that the “#pragma omp parallel for” somehow magically focuses all the processors immediately on performing the for loop in pieces in parallel. Then someone will point out that processor in this context means core or hardware thread, or just plain thread.

The first impression of the above loop then is viewed as: prior to the loop the program runs with one thread, and during the loop the program uses n threads where n equals the number of concurrent threads available on the system (two on a single core with HT, 2, 4, 6, 8 12, 16 on various other current processor designs multiplied by the number of processor chips in the system).

Here is where “I'm late, I'm late for a very important date” comes to play.

All of the other threads on the system are not necessarily sitting in an idle state when your program reaches the “#pragma omp parallel for”. Even on a dedicated system, some of the threads may be busy running background programs or services (system update), or running other applications you launched (email, browser, playing music, internet radio, Task Manager, etc…).

The effect of other threads not being immediately available results in: not all of the threads scheduled to run the parallel for loop begin at the same time (or nearly the same time). Some threads will be late in starting their section of the loop.

In a similar manner that other applications and services running on the system can delay a thread or threads from starting their section of the loop, other threads within your application may also introduce delays. These other threads could be thread(s) driving your display and input device. Even in simple console applications, some of the runtime systems use background threads to write data to the console window.

The parallel programmer should learn and accept the fact that all threads will not be in complete control by the application. The programmer has a significant influence on the activities of the threads, but not absolute control over the activities of threads.

Now we examine the next line in the saying “And when I wave, I lose the time I save”

As you write parallel code with higher levels of complexity, you will find that some of the threads need to communicate with the other threads working with the same data. This access must be such that each thread does not interfere with any of the other threads use of the data.

There are many ways of performing intra-application thread-to-thread communication which we won’t address in detail in this blog post. Among these are: Critical Section, mutex, Interlocked operations, volatile flags, etc. Although the implementation details differ, the effect to your thread is its performing “And when I wave”. The consequence is: “I lose the time I save”.

The simplest of the inter-thread thread-safe communication (arguably) is the Interlocked operations. Internally the operation may be reduced to adding a single instruction prefix to an instruction such as ADD

LOCK add mem, reg

or intrinsic

InterlockedAdd(mem, value);

The LOCK prefix assures cache coherency between all threads (even across processors). In effect, this is your “And when I wave”. Adding that lock prefix (one byte) extends the execution time of that instruction 100 times to 300 times the time required without the LOCK. This is the “I lose the time I save” in writing the code as parallel code. You have similar penalties when using the other inter-thread communication (thread-safe) techniques.

There is a trade-off between the "time you save" against "the time you loose". The lesson learned is write your code such that you reduce the number of “waves” in your program.

Now consider: “There’s danger if I dare to stop, And here’s the reason why. You see, I’m over due, I’m in a rabbit stew”

Looking back at the simple parallel for loop, assume that all the other threads on the system were immediately available at the time the parallel for loop began. And that all these threads started at the same time. Will all these threads finish at the same time?

“There’s danger if I dare to stop”

Although your thread running the parallel for loop does not explicitly stop until it has finished its portion of the parallel for loop, the system will invariably have other things to do. Such as: background programs or services, and you may be running other applications too (email, browser, music, etc..).

Most operating systems perform an operation called pre-emption, also known as context switching, when there are more software threads active than available hardware threads. The operating system will run one thread for a period of time, and then it will suspend this thread and resume a different thread. And repeat this until threads become idle, or at least to the point where the number of busy software threads is less than or equal to the number of hardware threads.

The consequence of pre-emption is: “You see, I’m over due, I’m in a rabbit stew”

This means, although the parallel loop began with all threads starting at the same time, the finish time of each thread may differ as an unintended consequence of pre-emption.

This saying from “Alice In Wonderland” (I’m late, I’m late …) has an important message for parallel programmers. And it is important for you to keep this in mind as you write parallel code.

In future blogs we will examine how QuickThread addresses some of these issues.

Jim Dempsey

Friday, November 20, 2009

Extreme Parallel Programming

Extreme Parallel Programming, if we really get technical, is programming at the extremes of the programming environment envelope. This envelop is not necessarily linear but comprises of either an area, or volume, depending on how you wish to map the topology.

On one extreme, you will find something like a single processor, single core system with Hyper Threading such as an Intel Atom within a Netbook. On the other extreme you might find a cloud programming environment with ubiquitous processors. This article will examine a subset consisting of what you can fit into a single box. On the low end this is called Single Processor (SP) and on the high end, this is typically called Symmetrical Multi-Processor system (SMP).

For the purpose of this article, the extremes will be the single core with HTon the low end (e.g. Atom), and on the high-end a NUMA SMP system with 4 or 8 processors, each an Intel Nehalem EX, with each processor having 8 cores and Hyper Threading (64 to 128 hardware threads). See this.

In between these two extremes you will have a non-linear area or volume depicting the various blends of processor generations. Note, this article references Intel platforms, what holds for Intel will hold for AMD.

With the playground defined as the full gambit of SP and SMP systems, is it possible to write a single application source file(s) that scales to the fullest extent of the environment? The applications I refer to are the applications that you intend to use or sell (i.e. not a simple benchmark doing trivial work).

While you can write programs that exhibit some measure of scaling across this environment, it is difficult to write programs that scale to the fullest extent of the runtime environment. Especially when the environment is not known until the program runs, and when the environment is not fully anticipated when the programmer writes the code.

While a benchmark may consist of a simple loop, who’s data set size can be adjusted to fit the environment, and scalability charted. A real world application is not generally this simple.

While on the low end you may have a limited amount of L1 and L2 cache shared by one pair of Hyper Threads, on the upper end you have multiple sets of L1, L2, L3 as well as NUMA node separations of one or two levels (3 levels when using 3rd party interfaces). Obtaining the fullest extent of processing capacity requires the knowledge of your topology, as well as knowledge as to how to schedule threads within this topology. This is so difficult to do well, that very few of the best programmers have done so successfully.

As stated in an earlier blog post, the principal design consideration of QuickThread was to provide for a programming paradigm that would scale to the maximum extent of the hardware and do so over a period of 10 to 15 years of generational development of processors and SMP system configurations. This programming paradigm had to be simple to use, would not encumber the system when run on low-end systems, and would scale exceptionally well onto next generation systems well into the future.

Remember that these goals were set for my own personal interest in advancing my Space Elevator simulations. My driving factor was to reduce years or decades of computation time into days or weeks. And I did not want to re-write or re-tweak the program every time a new platform came out.

The design criteria lead to a task level paradigm where task could be asynchronous, synchronous, loops, teams (fixed or variable), completion tasks, compute and I/O tasks, NUMA capable, based on availability of threads, sensitive to data location, among other factors. The principal problem was how to accomplish this idealized system without burdening the programmer with managing the details.

The solution to this was to add an optional qualifying variable (qtPlacement) and an optional task control object (qtControl). See the QuickThread Programmers Reference Guide for detailed description.

When programming without the optional qtPlacement and qtControl arguments, QuickThread is syntactically similar to Threading Building Blocks (another task based parallel programming toolkit).

The qtPlacement argument is most useful on the higher-end systems but has no adverse effect on the low-end systems. On the lower-end systems qtPlacement can be use for opportunistic thread scheduling.

Consider the Atom processor, with 1 core and 2 hardware threads. When using a loop level parallel programming paradigm such as with OpenMP 2.0 programs are written to fan-out and fan-it. That is to say you run in a serial region of your application using one thread, enter a parallel region to run on multiple threads (fan-out), exit the parallel region back to serial code (fan-in) and run to next parallel region. With this type of programming paradigm you are assured that your application’s extra threads are all available upon entry into the parallel region. Therefore when programming

serial code
parallel loop
serial code
parallel loop

You are confident that all threads are available as you enter each parallel loop

In a tasking system, you may have one or more of the above code sequences in progress at any one time. You may also have similar code sequences running in different task(s) at the same time. Should you care if multiple tasks concurrently execute the above sequence, or other similar sequences? More importantly to you, should you expend the effort to write your code to make this determination and to take two code paths:

serial code
if(threadsAvailable)
{
parallel loop
}
else
{
serial loop
}
serial code
if(threadsAvailable)
{
parallel loop
}
else
{
serial loop
}

And, does your threading system inform you as to if threads are available at the time you execute that section of code? Coding as above would be awkward and error prone.

With QuickThread the code would look something like this

serial code
parallel_for( Waiting_L3$, fnDoWork1, vec.begin(), vec.end(), vec);
serial code
parallel_for( Waiting_L3$, fnDoWork2, vec.begin(), vec.end(), vec);


Where the qtPlacement argument is the Waiting_L3$, followed by a functor (address of the work function), note this could be a C++0x Lambda function in line, then followed by the half open range then argument list (the object vector in this case).

What do you think happens with this code when it runs on an Atom processor which has no Level 3 cache (assuming you guessed at L3$ was referring to Level-3 cache). And what is this Waiting thing all about? (the above could have used Waiting$ + L3$)

On the Atom (which has L2 but no L3 cache), it says: if I had an L3, the current thread’s L2 would be the next smaller set of hardware threads adjoining the thread issuing the parallel_for. Using all potential threads in this sub-set (now L2), check to see if other threads in this set are currently idle. If any idle threads are found, divide the iteration space amongst the idle threads plus current thread, and schedule the idle threads to run a slice of the iteration space. The current thread is not scheduled through the QuickThread task manager, instead, the code makes a direct function call to perform the slice of the iteration space (thus avoiding a task enqueue/dequeue to itself).

The interesting thing to note about this is when there are no idle threads, there are no task enqueue/dequeue operations, resulting in the issuing thread performing the loop using the entire iteration space.

Task enqueue/dequeue is not free. It is relatively expensive. On the Atom processor, when the other thread is not available, there is no unnecessary overhead in enqueuing a task only to be performed by the current thread. Less overhead == more processing time to run application.

What does this do for you on a 4 processor, 8-cores per processor, 2 threads per core. system?

On this system, the current processor hardware threads (16 out of the 64 threads on the system) are examined to see which of these threads are idle. Those idle threads, plus the current thread will constitute the pool of threads to work on the sliced-up iteration space. Threads from the other sockets (other NUMA nodes) will not be involved (see note in programmers guide). And should no threads be idle in this subset, the task enqueuing is bypassed as with the Atom example.

The above code runs at maximal efficiency on each end of the programming envelope as well as at maximal efficiency at all positions within the envelope.

Read more about this in the QuickThread Programmers Guide at http://www.quickthreadprogramming.com

Jim Dempsey

Thursday, November 19, 2009

The birth of QuickThread

Most parallel programming extensions (OpenMP, Threading Building Blocks, Cilk, etc.) have a design heritage of incrementally adding parallel constructs into (onto) the primary programming language (e.g. C, C++, FORTRAN, etc…) without consideration of specific requirements of a particular application. These parallel programming extensions become the culmination of the incremental steps. As features are required, they may or may not get incorporated into the programming extension.

Although QuickThread has a birth date of November 18, 2009 (date it became available for purchase), it has a rather long gestation period and courtship. The development time was over a period of four years. The initial design criteria were to solve a specific problem as that problem requirements would evolve over a period of 10 to 15 years. The architectural design, and principal functionality, of QuickThread was made in 2006 for how I envisioned computing platforms would evolve through to 2020. The design criteria were to solve a particularly difficult engineering problem that needs to be solved within that time frame.

The engineering problem is a high fidelity simulation of a 2nd generation space elevator (here). Whereas a 1st generation space elevator is generally a passive tethered satellite using one tether of approximately 100,000 km, a 2nd generation space elevator is constructed using several dynamic tethers (tethers in motion). The total length of tether material for a 2nd generation space elevator might reach 300,000 km.

A high fidelity simulation might require examining the dynamics of the tether using 1m segments. Or about 300 million tether segments, each segment requiring 100-300 points of data. And with simulated (virtual) run times of multiple years. This requires a massive amount of computation.

The computational requirements of the simulation was amenable using an SMP processing architecture, but not amenable using message passing architecture. At the time of the design of QuickThread, AMD had NUMA capable Opteron dual core processors. I anticipated that Intel would offer many core NUMA capable processors in the then near future and that it would be reasonable to expect 8, 16 or 32 core NUMA capable processors in 2010 to 2015. And systems configured using multiples of these processors would be suitable for computing mid-fidelity simulations of 2nd generation space elevators.

With the computational requirements (I personally) demanded, the internals of QuickThread had to be fully NUMA capable and support multiple levels of processor cache and multiple levels of NUMA separation. This also required the thread scheduler to be fully cognizant of the platform capabilities without requiring programming change as platforms evolved, and to be able to schedule tasks to the cache level and/or NUMA node.

QuickThread also had to be an asynchronous tasking system with independent task group synchronization capabilities. This requirement was the principal design requirement of the thread scheduler within QuickThread. As a secondary, but nearly as important design requirement, was for the integration of the language extension to be relatively easy for the programmer in use with large legacy applications. In my case, the simulation code I began with had roots extending back to the mid 1980’s and consisted of approximately 750 source modules and 600,000 lines of code.

This was the basis for the initial design requirements of QuickThread. Additional posts will not delve too much into the historical aspects of QuickThread but rather into the features, capabilities and programming ease.

Jim Dempsey

Introduction

Welcome to The Parallel Void



The focus of this Blog is to share with you this writer's tips, techniques, and experience relating to parallel programming. This writer's programming experience extends over a 40 year period, including being the principal architect for two minicomputer operating systems and founding five start-ups. The 5th start-up, QuickThread Programming, LLC, can be found at http://www.quickthreadprogramming.com/ has as their core product QuickThread, a parallel programming software development kit.



My name is Jim Dempsey. Google-ing my name will likely result in hits with other Jim Dempsey's as well as mine. The easiest way for you to get a handle on me is to read my contributions to the Intel Software Developer's forum at http://software.intel.com/en-us/forums or simply Google:



"Jim Dempsey" site:intel.com



The vast majority of my posts are replies providing assistance to those in need. You will find that the majority of my posts tend to direct the thread towards resolution. Although on occasion I have mentioned QuickThread, my general posts were directed towards assisting others.



My subsequent blog posting will provide an insight into design phylosophy of QuickThread



Jim Dempsey