Open Multi-Processing (OpenMP) is an open standard for multithreading in C, C++, and Fortran. Most modern compilers, including GNU's gcc, Intel's icc, and Microsoft's Visual C++, include built-in support for OpenMP, so it provides a means of writing portable multithreaded programs.
In many multithreading platforms (e.g., POSIX threads, Java threads), the programmer must explicitly manage the threads. By contrast, OpenMP provides implicit multithreading, meaning the OpenMP library handles most of the details of managing the threads. Because of this, OpenMP is relatively easy to use compared to other multithreading platforms; in the HPC world, it has become the de facto standard for C multithreading.
The Fork-Join pattern plays a central role in OpenMP. To let you explore it, download and run forkJoin and forkJoin2. Read the comment at the beginning of each source file to see what you should do with it. Things to look for include:
Once you have explored these examples and are confident you understand how OpenMP's use of the Fork-Join pattern works, you may continue to the next part of today's exercise.
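For reference, the heart of OpenMP's Fork-Join pattern is the parallel directive. The following is a minimal sketch of the idea (it is not the downloaded forkJoin.c, and it assumes the program is compiled with OpenMP support, e.g., gcc -fopenmp):

#include <stdio.h>
#include <omp.h>

int main(void) {
    printf("Before: a single thread\n");    // sequential region: one thread
    #pragma omp parallel                    // fork: a team of threads is created
    {
        printf("During: hello from one of the threads\n");
    }                                       // join: the team synchronizes; only the initial thread continues
    printf("After: a single thread again\n");
    return 0;
}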
Like many other parallel platforms, OpenMP uses the Single Program Multiple Data (SPMD) pattern. To explore it, download and run spmd and spmd2. As before, read the comment at the beginning of each source file to see what you should do with it.
When you are confident you understand how OpenMP uses the SPMD pattern, continue on to the next part.
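As a quick reference, a minimal SPMD sketch (not the downloaded spmd.c) looks like this: every thread runs the same block, but each can learn its own ID and the size of the team:

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel                          // single program: every thread runs this block
    {
        int id = omp_get_thread_num();            // ... but each thread has its own ID
        int numThreads = omp_get_num_threads();   // ... and can see how many threads there are
        printf("Hello from thread %d of %d\n", id, numThreads);
    }
    return 0;
}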
OpenMP also provides built-in support for the Barrier pattern. To explore it, download and run barrier. As before, read the comment at the beginning of the source file to see what you should do with it.
When you are confident you understand how to use OpenMP's version of the Barrier pattern, continue on to the next part.
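For reference, OpenMP's barrier directive looks like this in a minimal sketch (not the downloaded barrier.c): no thread proceeds past the barrier until every thread in the team has reached it:

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        printf("Thread %d: before the barrier\n", id);
        #pragma omp barrier                 // each thread waits here until all threads arrive
        printf("Thread %d: after the barrier\n", id);
    }
    return 0;
}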
OpenMP provides functions by which a thread can discover how many threads are in its team and what its own ID is. These (combined with the Fork-Join pattern) are sufficient to implement the Master-Worker and Master-only patterns in OpenMP. To explore these patterns, download and run masterWorker and masterOnly. As before, read the comment at the beginning of each source file to see what you should do with it.
When you are confident you understand how to implement the Master-Worker and Master-only patterns in OpenMP, continue on to the next part.
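As a quick reference, one common way to express these two patterns (a sketch, not the downloaded masterWorker.c or masterOnly.c) is to branch on the thread ID, or to use OpenMP's master directive:

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        if (id == 0) {                       // Master-Worker: the master does one thing ...
            printf("Master (thread %d) coordinating\n", id);
        } else {                             // ... while the workers do another
            printf("Worker %d working\n", id);
        }

        #pragma omp master                   // Master-only: only the master executes this block
        {
            printf("Only the master runs this\n");
        }
    }
    return 0;
}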
Simple Parallel Loops. As we have seen before, there are two versions of the Parallel Loop pattern: the 'chunks' version and the 'slices' version.
OpenMP's parallel loop supports many other features, including dynamic scheduling (aka work stealing) and the ability to control the size of the blocks in each version of the Parallel Loop pattern. We will not explore them here, but you should feel free to investigate them on your own.
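As a quick reference, the sketch below (not one of the downloaded patternlets) shows a basic parallel for loop, plus the schedule clause used to request dynamic scheduling with blocks of 2 iterations:

#include <stdio.h>
#include <omp.h>

#define N 16

int main(void) {
    int a[N];

    #pragma omp parallel for                       // the N iterations are divided among the threads
    for (int i = 0; i < N; i++) {
        a[i] = i * i;
    }

    #pragma omp parallel for schedule(dynamic, 2)  // idle threads grab the next block of 2 iterations
    for (int i = 0; i < N; i++) {
        printf("Thread %d handled i = %d (a[i] = %d)\n", omp_get_thread_num(), i, a[i]);
    }
    return 0;
}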
Parallel While Loops. OpenMP's #pragma omp parallel for provides an easy and convenient way to parallelize a for loop, but it has some key restrictions: (i) the iterations of the loop must be independent of one another, and (ii) the loop must have the canonical for loop form:
for (initExpr; testExpr; incrExpr)
structuredBlock
The most efficient solutions to many problems require iteration without this second (canonical form) restriction; for example, the iteration may need to continue so long as changes keep occurring, or stop as soon as a solution is found.
In C/C++, there is an equivalent while loop for any for loop:
For Loop Structure:

for (initExpr; testExpr; incrExpr) {
    stmts
}

Equivalent While Loop Structure:

initExpr;
while (testExpr) {
    stmts
    incrExpr;
}
So for problems whose sequential solutions use a while loop or a non-canonical for loop, the solution can be parallelized in OpenMP using a parallel while loop instead of #pragma omp parallel for.
Download the files from the parallelWhile folder. The file parallelWhile.c is a patternlet that illustrates how to implement both the 'chunks' and the 'slices' versions of a parallel loop using OpenMP and a while loop. Examine the code in parallelWhile.c to see how to implement such loops, and follow the directions at the beginning of the file to explore their behavior.
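If you want a quick reference, the sketch below (not the downloaded parallelWhile.c) shows one common way to write the 'slices' version: each thread runs its own while loop, starting at its thread ID and striding by the number of threads:

#include <stdio.h>
#include <omp.h>

#define N 16

int main(void) {
    int a[N];

    #pragma omp parallel                     // SPMD region: each thread runs its own while loop
    {
        int id = omp_get_thread_num();
        int numThreads = omp_get_num_threads();
        int i = id;                          // 'slices': thread id handles i = id, id+P, id+2P, ...
        while (i < N) {
            a[i] = i * i;
            i += numThreads;
        }
    }

    for (int i = 0; i < N; i++) {
        printf("a[%d] = %d\n", i, a[i]);
    }
    return 0;
}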
Parallel Loops and Reductions. In OpenMP, the Reduction pattern is implemented as an optional clause of the omp parallel directive. Since OpenMP's omp parallel for is a special case of that directive, Reduction can also be used with it. To see how it works, download the files in the parallelLoop-reduction folder, and follow the directions at the beginning of the source file.
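As a quick reference, the reduction clause looks like this in a minimal sketch (not the downloaded patternlet): each thread accumulates into a private copy of sum, and OpenMP combines the copies when the loop ends:

#include <stdio.h>
#include <omp.h>

#define N 1000

int main(void) {
    int sum = 0;
    #pragma omp parallel for reduction(+:sum)   // each thread gets a private sum; the copies are added at the end
    for (int i = 1; i <= N; i++) {
        sum += i;
    }
    printf("sum = %d\n", sum);                  // 500500
    return 0;
}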
OpenMP has no implementation of the broadcast pattern because it is not needed: if the master thread has read a dataset into an array that is in shared memory, then all threads can access the dataset via that array.
However, when multiple threads are accessing a shared variable (e.g., the shared array just mentioned), it is all too easy to create race conditions.
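To illustrate why, here is a minimal sketch (not one of the downloaded patternlets) in which every thread increments a shared counter without any protection; the read-modify-write steps of different threads can interleave, so the final value is usually wrong:

#include <stdio.h>
#include <omp.h>

#define REPS 100000

int main(void) {
    int counter = 0;                    // shared by all threads
    #pragma omp parallel
    {
        for (int i = 0; i < REPS; i++) {
            counter++;                  // unprotected update of a shared variable: a race condition
        }
    }
    printf("counter = %d\n", counter);  // often less than numThreads * REPS
    return 0;
}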
As you work through the exercise, work with your neighbor to find the answers to these questions:
When a race condition arises because the code contains a critical section, the Mutual Exclusion pattern can be used to protect that section. OpenMP provides two different mechanisms for this pattern: the atomic mechanism and the critical mechanism:
You can explore the atomic mechanism by downloading the files from the atomic folder. As usual, follow the directions at the beginning of the file, and map the behavior you observe to the source code producing it.
You can explore the critical mechanism by downloading the files from the critical folder and the critical2 folder. By now, you should know the drill: follow the directions at the beginning of the file, and figure out how the source code is producing the behavior you observe.
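As a quick reference, the two mechanisms look like this in a minimal sketch (not the downloaded patternlets): atomic protects a single simple update, while critical protects an arbitrary block of statements:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int counter1 = 0, counter2 = 0;
    #pragma omp parallel
    {
        #pragma omp atomic            // atomic: only one thread at a time performs this one update
        counter1++;

        #pragma omp critical          // critical: only one thread at a time executes this block
        {
            counter2++;
            printf("counter2 is now %d\n", counter2);
        }
    }
    printf("counter1 = %d, counter2 = %d\n", counter1, counter2);
    return 0;
}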
If you compare these patterns in OpenMP against their counterparts in other parallel platforms (e.g., POSIX threads), it should be evident that OpenMP's mechanisms are much simpler and easier to use. This simplicity and comparative ease of use are what have made OpenMP so popular.
The Parallel Loop patterns can be used to divide a dataset into chunks or slices and process those pieces in parallel. As such, the Parallel Loop pattern supports a particular parallel algorithm design strategy called Data Decomposition: we decompose the dataset into pieces and process those pieces in parallel.
Another common parallel algorithm design strategy is called Task Decomposition, in which a problem is decomposed into tasks, and those that can be performed in parallel are performed by different PEs.
OpenMP supports Task Decomposition, so the final parallel pattern we will examine is the Parallel Task pattern. OpenMP implements this pattern using its sections directive. To see how it works, download the files from the sections folder. Note that each section can perform a different function, so if a problem can be decomposed into independent tasks (i.e., tasks that can run in parallel), the sections directive provides a way to map those tasks to different PEs.
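As a quick reference, the sections directive looks like this in a minimal sketch (not the downloaded patternlet); doTaskA and doTaskB are hypothetical functions standing in for the independent tasks:

#include <stdio.h>
#include <omp.h>

void doTaskA(void) { printf("task A on thread %d\n", omp_get_thread_num()); }   // hypothetical task
void doTaskB(void) { printf("task B on thread %d\n", omp_get_thread_num()); }   // hypothetical task

int main(void) {
    #pragma omp parallel sections      // the sections are independent tasks ...
    {
        #pragma omp section            // ... each of which may run on a different thread
        doTaskA();

        #pragma omp section
        doTaskB();
    }
    return 0;
}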
When you are finished, you may continue to this week's project.