Monday, July 22, 2013

QuickThread source code is now available under GPL

The QuickThread (R) source code is now available under the GNU General Public License at http://www.quickthreadprogramming.com. That website has been reported to have relatively slow download speeds, so I am in the process of creating a sourceforge.net project page.

Jim Dempsey

Tuesday, June 7, 2011

Downloads direct from website

A license change has been made to QuickThread(R) which permits free evaluation of the software. Links to download the software are available at www.quickthreadprogramming.com. You can download anonymously if you wish. If you need support or assistance, though, you will need to contact me via email at jim@quickthreadprogramming.com.

Note: the EULA posted on the website is the more recent and supersedes the EULA embedded in the archives. The embedded copies will be replaced on the next version update.

Wednesday, October 13, 2010

IDF 2010 Video Interviews Now Available

My ISN-TV interview from September's IDF in San Francisco is now available at:

http://software.intel.com/en-us/blogs/2010/10/11/idf-2010-video-interviews-now-available-jim-dempsey-gaston-hillar-james-reinders-and-softtalk-bloggers-sean-mcmanus-and-lauren-bishop/

Additionally, the .PDF compilation of my 5-part article series on the ISN Parallel Programming Community can be found at:

http://www.quickthreadprogramming.com/Superscalar_programming_101_parts_1-5.pdf

Jim

Thursday, June 10, 2010

Version 2.0 Beta Program

My silence over the last few months is due to overwork in preparing a version update to QuickThread. Version 2.0 will include support for running on Linux.

Major internal changes were made to accommodate a single source QuickThread scheduler, with a few conditional compilation directives, that works for both Windows and Linux operating systems.

Preliminary testing has been performed on Ubuntu 10.0 with g++ 4.5 (pre-release). g++ 4.4 works too; however, the C++0x Lambda functions are not available in g++ 4.4.

Externally, your programs will run with little or no change, other than being required to be re-compiled.

Exposed API differences mainly concern a different layout for the qtControl structure and some member functions called by the templates.

The internal changes relate to using pthreads on Linux to establish the QuickThread thread pool, while retaining Windows native threads on Windows.

On Linux the maximum number of compute class threads is 1024. This can be expanded but would require a recompilation of the libQuickThread.a library.

Using Windows native threads (prior to Windows 7), the upper compute class thread count is 64. If interest develops for more than 64 threads on Windows 7, this issue can be addressed at that time.

For those interested in participating in the beta program, please contact me via my email address on www.quickthreadprogramming.com

Jim

Thursday, January 28, 2010

New blog posts on ddj.com

I am writing today to mention that two new blog posts are available on www.ddj.com/go-parallel relating to using a QuickThread two-stage parallel pipeline to improve the performance of the input end and the output end of an application (as opposed to a 3-stage design).

The typical 3-stage pipeline is:

MyPipelineIOContext pio;
pio.Init(your init args here);
qtPipeline< MyPipelineIOContext, MyPipelineBuffer > pipeline;
pipeline.addPipe(InputPipe);
pipeline.addPipe(DoWorkPipe);
pipeline.addPipe(OutputPipe);
pipeline.run(&pio);
pipeline.WaitTillDone();

The 3-stage pipeline works quite well for most applications. However, there are certain classes of applications where a single linear pipeline is unsuitable. One such example was referenced in the ddj.com/go-parallel blog. For these you may find it better to sandwich your work phase between two 2-stage pipelines:

// 2-stage pipeline to suck-in and prep data
MyInputPipelineIOContext pio_in;
pio_in.Init(your init args here);
qtPipeline< MyInputPipelineIOContext, MyInputPipelineBuffer > pipeline_in;
pipeline_in.addPipe(InputPipe); // I/O
pipeline_in.addPipe(DoInputPrepWorkPipe); // compute
pipeline_in.run(&pio_in);
pipeline_in.WaitTillDone();
// input complete (back to serial code)

// start work (which has parallel constructs)
DoWorkOutsideOfPipeline();
// done with work

// use 2-stage pipeline to output results
MyOutputPipelineIOContext pio_out;
pio_out.Init(your init args here);
qtPipeline< MyOutputPipelineIOContext, MyOutputPipelineBuffer > pipeline_out;
pipeline_out.addPipe(DoOutputPrepWorkPipe); // compute
pipeline_out.addPipe(OutputPipe); // I/O
pipeline_out.run(&pio_out);
pipeline_out.WaitTillDone();

The prep work for input might consist of converting ASCII text input into binary format. The prep work for output might be converting binary format back to ASCII text. You are, of course, free to use the prep stages however you see fit.

Jim

Tuesday, December 22, 2009

New release of QuickThread available.

A new release of QuickThread is now available to registered users at the download and support site.

The principal difference is an optimization of parallel_for. A new class of task node has been created: an iterating task node. The new parallel_for (and parallel_task) are modified to make use of this new task node. Prior versions would enqueue one task node per slice of the iteration space, i.e. one per thread (or per chunk when chunking is used). The new code enqueues a single iterating task node regardless of the number of slices.


int chunk=100;
int nObjects = 123456789;
parallel_for(chunk, fnObject, 0, nObjects, a3);


old format: one task node and one enqueue/dequeue per chunk (nObjects/chunk of them)
new format: 1 task node and one enqueue/dequeue in total


The new format makes significantly fewer demands on task enqueue/dequeue and on the number of task nodes in flight. In the event that you experience problems, you can disable this feature by editing parallel_for_v1.h and commenting out:


#define USE_new_parallel_for

There is a new home page at www.quickthreadprogramming.com too.

The home page contains a new document comparing QuickThread and OpenMP under various system loads.


Jim Dempsey

Sunday, November 29, 2009

Comparison between QuickThread and OpenMP 3.0

Often when you look at a manufacturer's published benchmarks, you see the results obtained when the test program is run under ideal conditions. If you have followed my White Rabbit blogs on the Intel Software Network Blogs, you will have seen the effects of various load conditions as they relate to unexpected delays in a test program.

A similar test program was adapted from that program for use in this comparison between QuickThread and OpenMP. One test program represents the evolutionary process of adapting a single-threaded program to use OpenMP 3.0, ranging from a simple design to more complex designs using OpenMP 3.0 features. The other test program performs the equivalent steps using the QuickThread software development kit.

Each test program runs a series of tests ranging from single threaded through various design attempts to eke maximum performance out of the test program while running under various load conditions.

All tests were run on an Intel Q6600 quad core 2.4GHz system. These tests are not set up to test the scalability of the application. Except for the single threaded tests, all parallel tests use all cores for their thread pools. That is to say, the tests are set up to determine the effects on the performance of the test application while running under these different system load conditions:

No load
1 Load of a console command running DIR C:\ /S
2 Loads (2 console commands) of running DIR C:\ /S
1 console window running the following hang.bat file
echo off
:loop
goto loop
2 console windows running hang.bat
3 console windows running hang.bat
4 console windows running hang.bat

The functional diagram of the application is

process 1
process 2
process 3
process 4

Where processes 2 and 3 are dependent on process 1, and process 4 is dependent on processes 2 and 3, yielding the parallelizable ordering:

process 1
process 2, process 3
process 4

The data and procedure within all processes are such that slices of the data sets can migrate through the processes.

The single threaded versions are below:

// process 1 - single threaded
void Process1_st()
{
    for(int i = 0; i < arraySize; ++i)
        A[i] = ((double)i) / 1000.0;
}

// process 2 - single thread
void Process2_st()
{
    for(int i = 0; i < arraySize; ++i)
        B[i] = sin(A[i]);
}

// process 3 - single thread
void Process3_st()
{
    for(int i = 0; i < arraySize; ++i)
        C[i] = cos(A[i]);
}

// process 4 - single thread
void Process4_st()
{
    for(int i = 0; i < arraySize; ++i)
        D[i] = sqrt((B[i]*B[i])+(C[i]*C[i]));
}

Each of the arrays holds (1024*1024) doubles, which is 8MB, larger than the cache size on the system. The functions are meaningless other than that they emulate lightly loaded processes that might exist in a system.

After the background load is established the test program runs and takes 100 sample times of each test run using various parallelization strategies. The strategies will be described following the commentary on the charts.

For both charts the lines represent the average of 100 sample points for each test, under the varying test load conditions. The X-axis represents the tests. The Y-axis represents a scaling factor relative to the single threaded No Load minimum run time. The tests are:

st single thread
st_Str single thread with stride
pf parallel for
pf_Str parallel for in stride loop
pf_Str_tq parallel for in stride loop driven by taskq (OpenMP 3.0 feature)
pfc parallel for with dynamic scheduling and chunk size
pfc_Str parallel for with dynamic scheduling and chunk size in stride loop
pfc_Str_tq parallel for with dynamic scheduling and chunk size in stride loop driven by taskq
pfs parallel for using sections
pfcs parallel for with dynamic scheduling and chunk size using sections

NOTE
QuickThread does not have a parallel construct with the name taskq, however equivalent functionality is available using parallel_task (which has additional capabilities). The same name was used on the chart for tests with equivalent functionality.

The charts illustrate that under No Load conditions OpenMP 3.0 yields almost as good performance as QuickThread (as long as you avoid the pf_Str and pfc_Str programming practices).
QuickThread, on the other hand, yields consistently good results under varying load conditions as well as under various programming practices. This means you will spend less time reworking code with QuickThread than you will with OpenMP.

Let’s look at the programming differences. We won’t list all the code, but we will list enough for you to be able to extrapolate what the code would look like.

Single threaded test (st)

// single thread for loops
void test_st()
{
    Process1_st();
    Process2_st();
    Process3_st();
    Process4_st();
}

Where the process code was listed earlier.

The Single thread with stride test (st_Str)

// single thread for loops using stride
void test_st_Stride()
{
    for(int iBegin = 0; iBegin < arraySize; iBegin += Stride)
    {
        int iEnd = iBegin + Stride;
        if(iEnd > arraySize)
            iEnd = arraySize;
        Process1_st(iBegin, iEnd);
        Process2_st(iBegin, iEnd);
        Process3_st(iBegin, iEnd);
        Process4_st(iBegin, iEnd);
    }
}

The purpose of including this in the test was to provide some sense of the additional call overhead as well as any potential benefit of cache differences. (Note that the strided variants of the Process functions take a half open [iBegin, iEnd) range.)

The parallel for tests use the same algorithms as the single threaded versions, so we will only list a representative Process*_pf function for each.

OpenMP

// process 4 - parallel for
void Process4_pf()
{
    #pragma omp parallel for
    for(int i = 0; i < arraySize; ++i)
        D[i] = sqrt((B[i]*B[i])+(C[i]*C[i]));
}

QuickThread

// process 1 - parallel for
void Process1_pf()
{
    parallel_for(
        0, arraySize,
        [&](int iBg, int iEn)
        {
            for(int i = iBg; i < iEn; ++i)
                A[i] = ((double)i) / 1000.0;
        });
}

The QuickThread implementation uses the C++0x Lambda functions. If you prefer not to use Lambda functions, you simply take the Lambda function out, provide a name for the function, and supply the half open range as arguments to the new task function.


// process 1 - parallel for
void Process1_pf_task(int iBg, int iEn)
{
    for(int i = iBg; i < iEn; ++i)
        A[i] = ((double)i) / 1000.0;
}

void Process1_pf()
{
    parallel_for(
        Process1_pf_task,
        0, arraySize); // as task args 1 and 2
}

The chunking version of the parallel for is

OpenMP

// process 4 - parallel for chunked
void Process4_pfc()
{
    #pragma omp parallel for schedule(dynamic, chunk)
    for(int i = 0; i < arraySize; ++i)
        D[i] = sqrt((B[i]*B[i])+(C[i]*C[i]));
}

QuickThread

// process 1 - parallel for chunked
void Process1_pfc()
{
    parallel_for(
        chunk,
        0, arraySize,
        [&](int iBg, int iEn)
        {
            for(int i = iBg; i < iEn; ++i)
                A[i] = ((double)i) / 1000.0;
        });
}

The Parallel for Stride Taskq

OpenMP

void test_pf_Stride_taskq()
{
    #pragma omp parallel
    {
        #pragma intel omp taskq
        {
            for(int iBegin = 0; iBegin < arraySize; iBegin += Stride)
            {
                int iEnd = iBegin + Stride;
                if(iEnd > arraySize)
                    iEnd = arraySize;
                #pragma intel omp task captureprivate(iBegin,iEnd)
                {
                    Process1_pf(iBegin, iEnd);
                    Process2_pf(iBegin, iEnd);
                    Process3_pf(iBegin, iEnd);
                    Process4_pf(iBegin, iEnd);
                }
            }
        }
    }
}


QuickThread

// parallel for loops with stride and taskq
void test_pf_Stride_taskq()
{
    qtControl StrideControl;
    for(int iBegin = 0; iBegin < arraySize; iBegin += Stride)
    {
        int iEnd = iBegin + Stride;
        if(iEnd > arraySize)
            iEnd = arraySize;

        // on parallel_task use [=] as opposed to [&]
        // as the values of iBegin and iEnd change
        // while building up the task list
        parallel_task(
            &StrideControl,
            [=]()
            {
                Process1_pf(iBegin, iEnd);
                Process2_pf(iBegin, iEnd);
                Process3_pf(iBegin, iEnd);
                Process4_pf(iBegin, iEnd);
            });
    }
}

The QuickThread functional equivalent (in fact a superset) is parallel_task.

The parallel for in sections

OpenMP

// multi-threaded parallel for loop sections
void test_pfs()
{
    Process1_pf();
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            Process2_pf();
        }
        #pragma omp section
        {
            Process3_pf();
        }
    }
    Process4_pf();
}


QuickThread

// multi-threaded parallel for loop sections
void test_pfs()
{
    Process1_pf();
    parallel_invoke(
        [&]() { Process2_pf(); },
        [&]() { Process3_pf(); });
    Process4_pf();
}

As you can see from the code snippets, there is some difference in programming style for the parallel constructs between OpenMP and QuickThread. QuickThread's consistent performance under varying load conditions is a major plus, as is its slightly better performance under ideal (no load) conditions.

Jim Dempsey