Sunday, November 29, 2009

Comparison between QuickThread and OpenMP 3.0

Often when you see a manufacturer's published benchmarks, you see results from a test program run under ideal conditions. If you have followed my White Rabbit blogs on the Intel Software Network you will have seen the effects of various load conditions, as they relate to unexpected delays, on a test program.

A similar test program was adapted from that blog series for use in this comparison between QuickThread and OpenMP. One test program is the evolutionary process of adapting a single-threaded program to use OpenMP 3.0, ranging from a simple design to more complex designs using OpenMP 3.0 features. The other test program performs the equivalent steps using the QuickThread software development kit.

Each test program runs a series of tests ranging from single threaded through various design attempts to eke maximum performance out of the test program while running under various load conditions.

All tests ran on an Intel Q6600 quad-core 2.4GHz system. These tests are not set up to test scalability of the application. Except for the single-threaded tests, all parallel tests use all cores for their thread pools. That is to say, the tests are set up to determine the effects on the performance of the test application while running under these different system load conditions:

No load
1 load: a console command running DIR C:\ /S
2 loads: 2 console commands running DIR C:\ /S
1 console window running the following hang.bat file:

echo off
:loop
goto loop

2 console windows running hang.bat
3 console windows running hang.bat
4 console windows running hang.bat

The functional diagram of the application is

process 1
process 2
process 3
process 4

Where processes 2 and 3 are dependent on process 1, and process 4 is dependent on processes 2 and 3. The dependencies therefore permit processes 2 and 3 to run concurrently:

process 1
process 2, process 3
process 4

The data and procedures within all processes are such that slices of the data sets can migrate through the processes.

The single threaded versions are below:

// process 1 - single threaded
void Process1_st()
{
    for(int i = 0; i < arraySize; ++i)
        A[i] = ((double)i) / 1000.0;
}

// process 2 - single thread
void Process2_st()
{
    for(int i = 0; i < arraySize; ++i)
        B[i] = sin(A[i]);
}

// process 3 - single thread
void Process3_st()
{
    for(int i = 0; i < arraySize; ++i)
        C[i] = cos(A[i]);
}

// process 4 - single thread
void Process4_st()
{
    for(int i = 0; i < arraySize; ++i)
        D[i] = sqrt((B[i]*B[i])+(C[i]*C[i]));
}

Each array holds 1024*1024 doubles, which is 8MB and larger than the cache size on the system. The functions are meaningless other than that they emulate lightly loaded processes that might exist in a system.
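The global declarations are not listed in this post. A minimal sketch of what they might look like, assuming double-precision arrays and a hypothetical value for the stride constant used later:

#include <cmath>                      // sin, cos, sqrt

const int arraySize = 1024*1024;      // 8 MB per array of doubles
const int Stride = 4096;              // hypothetical; the stride used in the tests is not shown

static double A[arraySize];
static double B[arraySize];
static double C[arraySize];
static double D[arraySize];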

After the background load is established, the test program runs and takes 100 sample times of each test using the various parallelization strategies. The strategies are described following the commentary on the charts.

[Chart: OpenMP 3.0 average run times per test under each load condition]

[Chart: QuickThread average run times per test under each load condition]
For both charts the lines represent the average of 100 samples for each test, under the varying test load conditions. The X-axis represents the tests. The Y-axis represents a scaling factor relative to the single-threaded No Load minimum run time. The tests are

st single thread
st_Str single thread with stride
pf parallel for
pf_Str parallel for in stride loop
pf_Str_tq parallel for in stride loop driven by taskq (an Intel workqueuing extension to OpenMP; see the note below)
pfc parallel for with dynamic scheduling and chunk size
pfc_Str parallel for with dynamic scheduling and chunk size in stride loop
pfc_Str_tq parallel for with dynamic scheduling and chunk size in stride loop driven by taskq
pfs parallel for using sections
pfcs parallel for with dynamic scheduling and chunk size using sections

NOTE
QuickThread does not have a parallel construct named taskq; however, equivalent functionality is available using parallel_task (which has additional capabilities). The same name was used on the chart for tests with equivalent functionality.

The charts illustrate that under No Load conditions OpenMP 3.0 yields almost as good performance as QuickThread (as long as you avoid the pf_Str and pfc_Str programming practices).
QuickThread, on the other hand, yields consistently good results under varying load conditions as well as under various programming practices. This means you will spend less time reworking code with QuickThread than you will with OpenMP.

Let’s look at the programming differences. We won’t list all the code, but we will list enough for you to be able to extrapolate what the code would look like.

Single threaded test (st)

// single thread for loops
void test_st()
{
    Process1_st();
    Process2_st();
    Process3_st();
    Process4_st();
}

Where the process code was listed earlier.

The Single thread with stride test (st_Str)

// single thread for loops using stride
void test_st_Stride()
{
    for(int iBegin = 0; iBegin < arraySize; iBegin += Stride)
    {
        int iEnd = iBegin + Stride;
        if(iEnd > arraySize)
            iEnd = arraySize;
        Process1_st(iBegin, iEnd);
        Process2_st(iBegin, iEnd);
        Process3_st(iBegin, iEnd);
        Process4_st(iBegin, iEnd);
    }
}

The purpose of including this test was to provide some sense of the additional call overhead as well as any potential benefit from cache effects.
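Note that the stride test calls ranged overloads of the process functions, which are not listed in the original; a sketch of one, with the others following the same pattern:

// process 1 - single threaded, half-open range [iBegin, iEnd)
void Process1_st(int iBegin, int iEnd)
{
    for(int i = iBegin; i < iEnd; ++i)
        A[i] = ((double)i) / 1000.0;
}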

The parallel for tests use the same algorithms as the single-threaded tests, so we will list only one representative process function for each implementation.

OpenMP

// process 4 - parallel for
void Process4_pf()
{
    #pragma omp parallel for
    for(int i = 0; i < arraySize; ++i)
        D[i] = sqrt((B[i]*B[i])+(C[i]*C[i]));
}

QuickThread

// process 1 - parallel for
void Process1_pf()
{
    parallel_for(
        0, arraySize,
        [&](int iBg, int iEn)
        {
            for(int i = iBg; i < iEn; ++i)
                A[i] = ((double)i) / 1000.0;
        });
}

The QuickThread implementation uses C++0x Lambda functions. If you prefer not to use Lambda functions, you simply lift the Lambda body out into a named function and supply the half-open range as arguments to the new task function:


// process 1 - parallel for
void Process1_pf_task(int iBg, int iEn)
{
    for(int i = iBg; i < iEn; ++i)
        A[i] = ((double)i) / 1000.0;
}

void Process1_pf()
{
    parallel_for(
        Process1_pf_task,
        0, arraySize); // as task args 1 and 2
}

The chunking version of the parallel for is

OpenMP

// process 4 - parallel for chunked
void Process4_pfc()
{
    #pragma omp parallel for schedule(dynamic, chunk)
    for(int i = 0; i < arraySize; ++i)
        D[i] = sqrt((B[i]*B[i])+(C[i]*C[i]));
}

QuickThread

// process 1 - parallel for chunked
void Process1_pfc()
{
    parallel_for(
        chunk,
        0, arraySize,
        [&](int iBg, int iEn)
        {
            for(int i = iBg; i < iEn; ++i)
                A[i] = ((double)i) / 1000.0;
        });
}
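In both versions, chunk is an integer work-unit size defined elsewhere in the test program; its value is not listed in the original. A hypothetical definition:

const int chunk = 4096; // hypothetical value; the original chunk size is not shown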

The parallel for in stride loop driven by taskq (pf_Str_tq)

OpenMP

void test_pf_Stride_taskq()
{
    #pragma omp parallel
    {
        #pragma intel omp taskq
        {
            for(int iBegin = 0; iBegin < arraySize; iBegin += Stride)
            {
                int iEnd = iBegin + Stride;
                if(iEnd > arraySize)
                    iEnd = arraySize;
                #pragma intel omp task captureprivate(iBegin,iEnd)
                {
                    Process1_pf(iBegin, iEnd);
                    Process2_pf(iBegin, iEnd);
                    Process3_pf(iBegin, iEnd);
                    Process4_pf(iBegin, iEnd);
                }
            }
        }
    }
}
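Note that the taskq/task pragmas above are Intel's workqueuing extension rather than standard OpenMP 3.0 tasking (a point raised in the comments below). A sketch of the equivalent written with the standard OpenMP 3.0 task construct, assuming the same ranged Process*_pf functions, would be (this variant was not benchmarked here):

// stride loop issuing standard OpenMP 3.0 tasks
void test_pf_Stride_task()
{
    #pragma omp parallel
    {
        #pragma omp single
        {
            for(int iBegin = 0; iBegin < arraySize; iBegin += Stride)
            {
                int iEnd = iBegin + Stride;
                if(iEnd > arraySize)
                    iEnd = arraySize;
                // firstprivate captures the current values,
                // as captureprivate did in the taskq version
                #pragma omp task firstprivate(iBegin, iEnd)
                {
                    Process1_pf(iBegin, iEnd);
                    Process2_pf(iBegin, iEnd);
                    Process3_pf(iBegin, iEnd);
                    Process4_pf(iBegin, iEnd);
                }
            }
        } // implicit barrier at end of single: all tasks complete here
    }
}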


QuickThread

// parallel for loops with stride and taskq
void test_pf_Stride_taskq()
{
    qtControl StrideControl;
    for(int iBegin = 0; iBegin < arraySize; iBegin += Stride)
    {
        int iEnd = iBegin + Stride;
        if(iEnd > arraySize)
            iEnd = arraySize;

        // on parallel_task use [=] as opposed to [&]
        // as the values of iBegin and iEnd change
        // while building up the task list
        parallel_task(
            &StrideControl,
            [=]()
            {
                Process1_pf(iBegin, iEnd);
                Process2_pf(iBegin, iEnd);
                Process3_pf(iBegin, iEnd);
                Process4_pf(iBegin, iEnd);
            });
    }
}

The QuickThread functional equivalent (and superset) of taskq is the parallel_task.

The parallel for in sections

OpenMP

// multi-threaded parallel for loop sections
void test_pfs()
{
    Process1_pf();
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            Process2_pf();
        }
        #pragma omp section
        {
            Process3_pf();
        }
    }
    Process4_pf();
}


QuickThread

// multi-threaded parallel for loop sections
void test_pfs()
{
    Process1_pf();
    parallel_invoke(
        [&]() { Process2_pf(); },
        [&]() { Process3_pf(); });
    Process4_pf();
}

As you can see from the code snippets, there is some difference in programming style for the parallel constructs between OpenMP and QuickThread. QuickThread's consistent performance under varying load conditions is a major plus, as is its slightly better performance under ideal (no load) conditions.

Jim Dempsey

1 comment:

  1. The OpenMP fragment posted in this blog under the title "The Parallel for Stride Taskq" is actually not really OpenMP 3.0. Instead it is Intel's own version of tasking (as indicated by the word "intel" in the pragma string). OpenMP 3.0 includes a standard tasking feature which is also available in the Intel compiler using the "#pragma omp task" pragma syntax. It would be useful to convert this code snippet to OpenMP 3.0 format and re-run the results to be consistent with the title of this article.
