Friday, November 20, 2009

Extreme Parallel Programming

Extreme Parallel Programming, if we really get technical, is programming at the extremes of the programming environment envelope. This envelope is not necessarily linear; it comprises either an area or a volume, depending on how you wish to map the topology.

At one extreme you will find something like a single-processor, single-core system with Hyper-Threading, such as an Intel Atom in a netbook. At the other extreme you might find a cloud programming environment with a virtually unlimited number of processors. This article will examine the subset consisting of what you can fit into a single box: on the low end this is called a Single Processor (SP) system, and on the high end it is typically called a Symmetrical Multi-Processor (SMP) system.

For the purposes of this article, the extremes will be a single core with HT on the low end (e.g. Atom), and on the high end a NUMA SMP system with 4 or 8 processors, each an Intel Nehalem-EX with 8 cores and Hyper-Threading (64 to 128 hardware threads).

In between these two extremes you will have a non-linear area or volume depicting the various blends of processor generations. Note: although this article references Intel platforms, what holds for Intel will generally hold for AMD.

With the playground defined as the full gamut of SP and SMP systems, is it possible to write a single application source file (or files) that scales to the fullest extent of the environment? The applications I refer to are the applications that you intend to use or sell (i.e. not a simple benchmark doing trivial work).

While you can write programs that exhibit some measure of scaling across this environment, it is difficult to write programs that scale to the fullest extent of the runtime environment, especially when the environment is not known until the program runs, and when the environment was not fully anticipated when the programmer wrote the code.

While a benchmark may consist of a simple loop, whose data set size can be adjusted to fit the environment and whose scalability can be charted, a real-world application is generally not this simple.

While on the low end you may have a limited amount of L1 and L2 cache shared by one pair of Hyper-Threads, on the upper end you have multiple sets of L1, L2, and L3 caches as well as NUMA node separations of one or two levels (three levels when using third-party interfaces). Obtaining the fullest extent of processing capacity requires knowledge of your topology, as well as knowledge of how to schedule threads within this topology. This is so difficult to do well that very few of the best programmers have done so successfully.
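Discovering that topology is itself non-trivial. As an illustration only (my own sketch; QuickThread performs its topology discovery for you, so this is not something its users need to write), here is roughly how one could enumerate cores, caches, and NUMA nodes on Windows with GetLogicalProcessorInformation:

#include <windows.h>
#include <vector>
#include <cstdio>

void ShowTopology()
{
    DWORD len = 0;
    GetLogicalProcessorInformation(NULL, &len);     // first call: get required buffer size
    if (len == 0)
        return;
    std::vector<SYSTEM_LOGICAL_PROCESSOR_INFORMATION> info(
        len / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION));
    if (!GetLogicalProcessorInformation(&info[0], &len))
        return;

    for (std::size_t i = 0; i < info.size(); ++i)
    {
        switch (info[i].Relationship)
        {
        case RelationProcessorCore:
            std::printf("core        mask=0x%llx\n",
                (unsigned long long)info[i].ProcessorMask);
            break;
        case RelationCache:
            std::printf("L%d cache    %lu KB  mask=0x%llx\n",
                info[i].Cache.Level, info[i].Cache.Size / 1024,
                (unsigned long long)info[i].ProcessorMask);
            break;
        case RelationNumaNode:
            std::printf("NUMA node %lu      mask=0x%llx\n",
                info[i].NumaNode.NodeNumber,
                (unsigned long long)info[i].ProcessorMask);
            break;
        default:
            break;
        }
    }
}

This is the kind of bookkeeping you would otherwise have to do by hand before you could even begin to place threads intelligently.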

As stated in an earlier blog post, the principal design consideration of QuickThread was to provide for a programming paradigm that would scale to the maximum extent of the hardware and do so over a period of 10 to 15 years of generational development of processors and SMP system configurations. This programming paradigm had to be simple to use, would not encumber the system when run on low-end systems, and would scale exceptionally well onto next generation systems well into the future.

Remember that these goals were set for my own personal interest in advancing my Space Elevator simulations. My driving factor was to reduce years or decades of computation time into days or weeks. And I did not want to re-write or re-tweak the program every time a new platform came out.

The design criteria led to a task-level paradigm where tasks could be asynchronous, synchronous, loops, teams (fixed or variable), completion tasks, compute and I/O tasks, NUMA capable, based on availability of threads, and sensitive to data location, among other factors. The principal problem was how to accomplish this idealized system without burdening the programmer with managing the details.

The solution was to add an optional qualifying variable (qtPlacement) and an optional task control object (qtControl). See the QuickThread Programmers Reference Guide for a detailed description.

When programming without the optional qtPlacement and qtControl arguments, QuickThread is syntactically similar to Threading Building Blocks (another task based parallel programming toolkit).
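To make the comparison concrete, here is a rough sketch of a simple loop in each form. The TBB version uses the standard blocked_range/body interface; the QuickThread version simply omits the optional qtPlacement argument from the call shown later in this article (treat the exact default behavior as something to confirm in the Programmers Reference Guide):

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <vector>

// Threading Building Blocks: body (here a C++0x lambda) over a blocked_range.
void DoubleAllTBB(std::vector<double>& vec)
{
    tbb::parallel_for(
        tbb::blocked_range<size_t>(0, vec.size()),
        [&vec](const tbb::blocked_range<size_t>& r)
        {
            for (size_t i = r.begin(); i != r.end(); ++i)
                vec[i] *= 2.0;
        });
}

// QuickThread, with no qtPlacement argument (approximation of the form used
// later in this article; fnDoWork1 is the work function, vec the trailing argument):
//
//     parallel_for( fnDoWork1, vec.begin(), vec.end(), vec );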

The qtPlacement argument is most useful on the higher-end systems but has no adverse effect on the low-end systems. On the lower-end systems qtPlacement can be used for opportunistic thread scheduling.

Consider the Atom processor, with 1 core and 2 hardware threads. When using a loop-level parallel programming paradigm such as OpenMP 2.0, programs are written to fan-out and fan-in. That is to say, you run in a serial region of your application using one thread, enter a parallel region to run on multiple threads (fan-out), exit the parallel region back to serial code (fan-in), and run to the next parallel region. With this type of programming paradigm you are assured that your application's extra threads are all available upon entry into the parallel region. Therefore when programming

serial code
parallel loop
serial code
parallel loop

you are confident that all threads are available as you enter each parallel loop.
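In OpenMP 2.0, for example, that fan-out/fan-in sequence looks roughly like this minimal sketch:

#include <omp.h>

void FanOutFanIn(double* a, double* b, int n)
{
    // serial code runs on one thread

#pragma omp parallel for        // fan-out: the loop runs on the team of threads
    for (int i = 0; i < n; ++i)
        a[i] = a[i] * 2.0;
                                // implicit barrier at the end of the loop: fan-in

    // serial code again, one thread

#pragma omp parallel for        // fan-out for the second loop
    for (int i = 0; i < n; ++i)
        b[i] = a[i] + 1.0;
                                // fan-in
}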

In a tasking system, you may have one or more of the above code sequences in progress at any one time. You may also have similar code sequences running in different tasks at the same time. Should you care whether multiple tasks concurrently execute the above sequence, or other similar sequences? More importantly, should you expend the effort to write your code to make this determination and to take two code paths:

serial code
if(threadsAvailable)
{
parallel loop
}
else
{
serial loop
}
serial code
if(threadsAvailable)
{
parallel loop
}
else
{
serial loop
}

And does your threading system even inform you as to whether threads are available at the time you execute that section of code? Coding as above would be awkward and error-prone.

With QuickThread the code would look something like this:

serial code
parallel_for( Waiting_L3$, fnDoWork1, vec.begin(), vec.end(), vec);
serial code
parallel_for( Waiting_L3$, fnDoWork2, vec.begin(), vec.end(), vec);


Where the qtPlacement argument is Waiting_L3$, followed by a functor (the address of the work function; note this could be a C++0x lambda function written in line), then the half-open range, then the argument list (the object vector in this case).
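As a sketch of the in-line lambda form (my assumption here is that the work function receives the begin/end of its assigned slice followed by the trailing arguments; consult the Programmers Reference Guide for the authoritative signature):

// Sketch only: the lambda written in line where fnDoWork2 was.
parallel_for( Waiting_L3$,
    [](std::vector<double>::iterator iBegin,
       std::vector<double>::iterator iEnd,
       std::vector<double>& v)
    {
        for (std::vector<double>::iterator it = iBegin; it != iEnd; ++it)
            *it = *it * 2.0;    // stand-in for whatever fnDoWork2 does per element
    },
    vec.begin(), vec.end(), vec );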

What do you think happens with this code when it runs on an Atom processor, which has no Level 3 cache (assuming you guessed that L3$ refers to the Level-3 cache)? And what is this Waiting thing all about? (The above could have used Waiting$ + L3$.)

On the Atom (which has L2 but no L3 cache), it says: if I had an L3, the current thread's L2 would be the next smaller set of hardware threads adjoining the thread issuing the parallel_for. Using all potential threads in this sub-set (now L2), check to see whether the other threads in this set are currently idle. If any idle threads are found, divide the iteration space amongst the idle threads plus the current thread, and schedule the idle threads to run a slice of the iteration space. The current thread is not scheduled through the QuickThread task manager; instead, the code makes a direct function call to perform its slice of the iteration space (thus avoiding a task enqueue/dequeue to itself).

The interesting thing to note is that when there are no idle threads, there are no task enqueue/dequeue operations; the issuing thread simply performs the loop over the entire iteration space.

Task enqueue/dequeue is not free. It is relatively expensive. On the Atom processor, when the other thread is not available, there is no unnecessary overhead of enqueuing a task only to have it performed by the current thread. Less overhead == more processing time to run the application.
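Expressed as a self-contained C++ sketch of the behavior described above (my illustration only, not QuickThread's actual implementation; every helper name below is a placeholder, not QuickThread API):

#include <vector>
#include <cstddef>

struct Range { std::ptrdiff_t begin, end; };    // half-open [begin, end)

// Placeholder helpers (NOT QuickThread API):
// which hardware threads sharing the caller's L3 (or the next smaller cache
// level when no L3 exists, e.g. the Atom's L2) are idle right now?
static std::vector<int> IdleThreadsSharingCache() { return std::vector<int>(); }
// hand a slice to an idle hardware thread as a task
static void EnqueueSlice(int /*hwThread*/, Range /*slice*/) {}
// the loop body, run over a half-open range
static void DoWork(Range r) { for (std::ptrdiff_t i = r.begin; i < r.end; ++i) { /* work */ } }

void ScheduleWaitingL3(Range whole)
{
    std::vector<int> idle = IdleThreadsSharingCache();

    if (idle.empty())
    {
        // No idle threads in the cache-sharing subset: no task enqueue/dequeue
        // at all; the issuing thread runs the entire iteration space directly.
        DoWork(whole);
        return;
    }

    // Split the iteration space among the idle threads plus this thread.
    std::ptrdiff_t parts = (std::ptrdiff_t)idle.size() + 1;
    std::ptrdiff_t chunk = (whole.end - whole.begin + parts - 1) / parts;
    std::ptrdiff_t pos = whole.begin;
    for (std::size_t i = 0; i < idle.size() && pos < whole.end; ++i)
    {
        Range slice = { pos, (pos + chunk < whole.end) ? pos + chunk : whole.end };
        EnqueueSlice(idle[i], slice);       // a task for each idle thread
        pos = slice.end;
    }
    Range mySlice = { pos, whole.end };
    DoWork(mySlice);    // direct call: the issuing thread performs its own slice
}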

What does this do for you on a 4-processor system with 8 cores per processor and 2 threads per core?

On this system, the current processor's hardware threads (16 out of the 64 threads on the system) are examined to see which of them are idle. Those idle threads, plus the current thread, constitute the pool of threads that will work on the sliced-up iteration space. Threads from the other sockets (other NUMA nodes) will not be involved (see the note in the programmers guide). And should no threads be idle in this subset, the task enqueuing is bypassed as in the Atom example.

The above code runs at maximal efficiency at each end of the programming envelope, as well as at all positions within it.

Read more about this in the QuickThread Programmers Guide at http://www.quickthreadprogramming.com

Jim Dempsey
