CS 112 Lab 10: Shared-Memory Parallelism Using OpenMP

Objectives:

  1. Learn about parallel programming for multicore CPUs.
  2. Learn about simple shared-memory parallelism.
  3. Learn basic OpenMP pragmas.

Introduction

Before 2005, CPU clock speeds rose dramatically, doubling roughly every two years. The following chart shows this change in speed over 35 years:

Clock speeds rose exponentially from 1975 to 2005

Thanks to this rapid increase, software companies whose programs were sluggish and inefficient could count on faster clock speeds in the future to make their programs perform acceptably for their users. In some circles, this became known as the "free lunch," because the same software application would run twice as fast every two years, without its manufacturer having to do a thing.

As can be seen in the chart above, this "free lunch" ended in 2005, when the costs of dealing with heat buildup, electron leakage, and other problems made it impractical for CPU manufacturers to continue doubling their clock speeds.

However, while clock speeds stopped increasing in 2005, Moore's Law continued unabated: Every two years, CPU manufacturers could squeeze the same amount of circuitry into half the area. In 2005, CPU manufacturers used this capability to fit two complete CPUs into the space formerly required for one, and called the results dual-core CPUs. Two years later, manufacturers could fit four CPUs into the space of one, yielding quad-core CPUs. And so on, with many-core CPUs now on the horizon:

From 2004 to 2008, CPUs went from unicore to dual-core to quad-core.

In today's multicore world, CPUs are no longer speeding up, but a CPU's total processing capability continues to increase exponentially. That is, instead of a 5 GHz CPU, you can buy a 2.5 GHz dual-core CPU, whose two cores provide 5 GHz (2 x 2.5 GHz) worth of total CPU cycles. Instead of a 10 GHz CPU, you can buy a 2.5 GHz quad-core CPU, whose four cores provide 10 GHz (4 x 2.5 GHz) worth of total CPU cycles.

The exercises we have done previously have all involved writing sequential programs, meaning the program's statements are performed one at a time (i.e., in a sequence). This made sense when a CPU had just one core, since it could only perform one statement at a time.

By contrast, a multicore CPU can perform a different statement on each core at the same time. This makes it possible to write parallel programs -- programs that are divided into pieces, where each piece runs on a different core. This parallel execution can provide faster execution, but it can also make program design much more complex, because when multiple things can happen at the same time, the program logic has to be correct regardless of which combination of things occurs.

Today's exercise is to take two time-consuming operations and "parallelize" them, so that they run faster on a multicore CPU. The operations we will parallelize today are simple ones: Matrix addition and Matrix transpose.

Matrix Addition

The Matrix addition operation is fairly simple. A statement:

   matrix3 = matrix1 + matrix2;
sums matrix1 and matrix2, and stores the result into matrix3, as follows:
matrix addition.

The two matrices being summed must be the same dimensions, so the addition operation can be defined using two nested for loops:

Matrix Matrix::operator+(const Matrix& mat2) const {
   if (myRows != mat2.getRows() || myColumns != mat2.getColumns() ) {
      throw invalid_argument("Matrix::operator+: matrix dimensions mismatch");
   }
   Matrix result(myRows, myColumns);
   for (unsigned row = 0; row < myRows; row++) {
      for (unsigned col = 0; col < myColumns; col++) {
         result[row][col] = (*this)[row][col] + mat2[row][col] ;
      }
   }
   return result;
}
In this method, the receiver of the + message acts as matrix1, the argument mat2 acts as matrix2, and the return value result will be assigned to matrix3.

With these nested for loops, the values in each Matrix are accessed row by row. We might visualize this for matrix1 as follows:

matrix addition access pattern.

If our Matrix class uses a single dynamically allocated array for its storage, then the rows of the Matrix are stored in consecutive memory locations, so this access pattern amounts to sequentially stepping through memory:

sequentially accessing adjacent memory locations.

Note that the addition operation accesses the memory of all three matrices (matrix1, matrix2, matrix3) using this same sequential pattern. Many computers anticipate this kind of access and prefetch values from memory before they are actually needed, so operations that access memory sequentially run faster than operations that fetch each value only when it is needed.
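
To make this concrete, here is a small sketch (not from today's files) of the row-major layout this discussion assumes -- a single dynamically allocated array per Matrix; the actual storage details are in Matrix.h:

// Sketch only: with row-major storage, element (row, col) lives at
// index row*myColumns + col. In the addition loops, col varies fastest,
// so the indices visited are 0, 1, 2, 3, ... -- exactly the sequential
// pattern that prefetching rewards.
double getElement(const double* data, unsigned myColumns,
                  unsigned row, unsigned col) {
   return data[row * myColumns + col];
}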

Matrix Transpose

The Matrix transpose operation is also fairly simple. A statement:
   matrix2 = matrix1.transpose();
makes matrix2 contain the values of matrix1, but each row of matrix1 is a column in matrix2. The operation does not change matrix1. We can visualize this as follows:
matrix transpose
The transpose operation can also be defined using two nested for loops. The key idea is that we access the values in matrix1 row-by-row, but write the values into matrix2 column-by-column:
Matrix Matrix::transpose() const {
   Matrix result(myColumns, myRows);
   for (unsigned row = 0; row < myRows; row++) {
      for (unsigned col = 0; col < myColumns; col++) {
         result[col][row] = (*this)[row][col];
      }
   }
   return result;
}
This leads to different memory access patterns for the two matrices during the transpose operation:
transpose access patterns.

That is, the memory access pattern in matrix1 is the same sequential pattern we saw before, but the access pattern in matrix2 is non-sequential, "jumping around" in memory:

transpose does some non-sequential memory accesses.
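
Under the same row-major assumption, the write to result[col][row] lands at index col*myRows + row in result's array, so each step of the inner loop jumps a whole row ahead in result's memory. This toy trace (made-up dimensions, not from today's files) prints the two index sequences side by side:

#include <cstdio>

// Sketch only: during transpose, reads from the source are sequential,
// but writes to result are strided by myRows.
int main() {
   const unsigned myRows = 3, myColumns = 4;            // toy dimensions
   for (unsigned row = 0; row < myRows; row++) {
      for (unsigned col = 0; col < myColumns; col++) {
         unsigned readIndex  = row * myColumns + col;   // 0, 1, 2, 3, ...
         unsigned writeIndex = col * myRows   + row;    // 0, 3, 6, 9, 1, 4, ...
         std::printf("read %2u   write %2u\n", readIndex, writeIndex);
      }
   }
   return 0;
}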

Keep this in mind, as we work through today's exercise.

OpenMP

To parallelize these operations, we will use OpenMP (short for Open MultiProcessing), an API for shared-memory parallelism that is supported by most modern C/C++ compilers, including the GNU suite (gcc/g++), Microsoft Visual Studio, and Intel's compilers. To see how this works, suppose that we have a normal C/C++ program, and the work it does (i.e., the sequence of statements it performs) is represented by a blue line. A traditional program might look like this, as it executes from start to finish:

A traditional computation has 1 thread.

If a part of this program can be done in parallel, a programmer can embed an OpenMP #pragma directive into his or her program to form a parallel region. When the program work-flow reaches one of these parallel regions, the single program flow divides into multiple threads, normally one per core. So on a quad-core CPU, we might get this:

OpenMP divides a computation among multiple threads.

In this diagram, the blue line represents the original "parent" or "master" thread, and the green lines represent the new "child" threads that OpenMP forms when the program flow reaches the parallel region. On a multicore computer, each thread can run in parallel on a different core, so the total amount of work done during the computation remains the same, but doing pieces of the computation in parallel lets the program finish faster. We will be using this approach to "parallelize" the Matrix addition and transpose operations, so that they run faster.
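
To see this fork-join behavior in isolation, here is a minimal, self-contained sketch (not part of today's files); with the GNU compilers it would be built with g++ -fopenmp:

#include <omp.h>     // OpenMP's functions
#include <cstdio>    // printf (a single call's output is not split up by other threads)

int main() {
   std::printf("Before the parallel region: one thread.\n");
   #pragma omp parallel              // fork: by default, one thread per core
   {
      int id = omp_get_thread_num();         // this thread's id: 0, 1, 2, ...
      int total = omp_get_num_threads();     // number of threads in the region
      std::printf("Hello from thread %d of %d\n", id, total);
   }                                 // join: the child threads end here
   std::printf("After the parallel region: one thread again.\n");
   return 0;
}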

To measure how much faster we can get these operations to run, an important part of today's exercise is timing these Matrix operations. The more applications you have running, the more competition your program will have for memory and the CPU, and the less accurate your timing results will be, so close down all non-essential applications (e.g., e-mail, chat clients, etc.) before proceeding.

Getting Started

A tarball is the common Unix/Linux term for a folder of files that has been archived using the tar command, and then compressed. Because we have several files for today's exercise, we have created a tarball for you to download, so that you don't have to download and save each file individually.

Start by downloading and saving lab10.tar.gz. (The .tar.gz extension indicates this tarball was tarred and then compressed with the GNU ZIP-compression utility gzip.) Save it somewhere convenient, like your desktop.

To open the tarball, launch a terminal, and use the ls and cd commands to navigate to the directory/folder where you saved the tarball. Be sure to use the ls command to verify that the folder you are in contains the tarball.

Once you are in the right place, type:

   tar xvzf lab10.tar.gz
The x switch tells tar to extract the tarred folder, the v tells it to do so verbosely (i.e., with lots of output), the z tells it to decompress (unzip) the tarball, and the f tells it that the very next thing is the tarball file. Thanks to the v switch, you should see:
lab10/
lab10/MatrixTester.h
lab10/Matrix.h
lab10/MatrixTimer.h
lab10/Makefile
lab10/tester.cpp
lab10/MatrixTester.cpp
lab10/runTimeTrials.cpp
lab10/testFiles/
lab10/testFiles/mat2.txt
lab10/testFiles/mat6.txt
lab10/testFiles/2039x2039.txt
lab10/testFiles/mat3.txt
lab10/testFiles/mat5.txt
lab10/testFiles/4096x4096.txt
lab10/testFiles/mat1.txt
lab10/testFiles/mat4.txt
lab10/Matrix.cpp
If you enter the ls command, you should see both the tarball file and the lab10 folder listed. If you enter:
   ls lab10
you should see these files listed:
Makefile
Matrix.h
Matrix.cpp
MatrixTester.h
MatrixTester.cpp
MatrixTimer.h
runTimeTrials.cpp
tester.cpp
plus a folder named testFiles that contains some test matrices.

Next, launch Eclipse, as usual. However, when you use File > New > C++ Project to create a new project for today's exercise, specify Makefile project > Empty Project in the C++ Project dialog box. Once you have created the new project as a Makefile project, import all the files you extracted from the tarball into your new project. Note that to import the test files, you will need to first use New > Folder to create a folder named testFiles in your project, and then import the test files into that folder.

Then take a few minutes to open the files, and look over their contents, to see what they do and how they do it. Once you understand what purpose each file serves, continue.

Building

Unlike past exercises, we are going to use a Makefile to build the files for today's exercise, to simplify the task of building our multicore application. If you want, open up the Makefile and look over its contents. After a series of variable declarations, you should see a series of groups of lines, each with the form:
targetFile: dependencyFiles
	commandsToBuildTargetFile
The make program reads this file and uses the relationships and commands on these lines to compile and link the files in the project.

To build your project, open a terminal, cd to your project folder, and type make on the command line; alternatively, build the project from within Eclipse, which will execute

   make all
in your project folder. This will invoke the make program, tell it to build the target all, and it will look for a file named Makefile telling it how to do so. Since our Makefile specifies that the target all depends on the files tester and runTimeTrials, and these files do not exist, make will look for line-groups in the Makefile that tell it how to make each of them. If you compare the commands that are executed to build your project with those in the Makefile, you can trace the recursive logic that make uses as it compiles and links your files. When it is finished, you should see several files appear in your Project Explorer, including runTimeTrials and tester. You should also see several files whose names end in .o, which you can ignore. (You may need to right-click your project folder and choose Refresh to see these.)

It is worth mentioning that in a standard (i.e., non-Makefile) Eclipse project, there is still a Makefile behind the scenes coordinating the compilation and linking. The difference is that in a standard project, Eclipse auto-generates its own Makefile (it is stored in the project's Debug folder), while in a Makefile-project, the programmer must supply the Makefile.

Running The Tests

Let's start by running tester. To run it, open a terminal window, use the cd command to navigate to your project folder, and enter:
   ./tester
The tests should begin running, and their output should appear. (The operator+ test takes a while, so be patient.) When you have verified that the Matrix operations all pass their tests, you may continue.

Timing the Matrix Operations

The first part of today's exercise is to measure how long the Matrix addition and transpose operations take. This will provide a base-line against which we can compare subsequent times. There is a lot of variance in Unix-system clocks, so when we time an operation, we want to run it several times and compute the average of those times. The MatrixTimer class found in MatrixTimer.h lets us specify the number of times to run each command as an argument to its constructor. The program runTimeTrials.cpp uses 10 as the number of times, which is enough trials to provide a useful average.

If you examine the timing methods in this class, you will find that they use an OpenMP function, omp_get_wtime(), which returns the current wall-clock time in seconds; calling it before and after an operation lets us compute the elapsed time.

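A rough sketch of this timing pattern, assuming omp_get_wtime() is the function involved (the real MatrixTimer also logs each trial and computes the average):

#include <omp.h>       // omp_get_wtime()
#include "Matrix.h"    // the lab's Matrix class

// Sketch only: time one matrix addition using OpenMP's wall clock.
double timeOneAddition(const Matrix& m1, const Matrix& m2) {
   double start = omp_get_wtime();    // wall-clock time, in seconds
   Matrix m3 = m1 + m2;               // the operation being timed
   double stop = omp_get_wtime();
   return stop - start;               // elapsed time for this trial
}
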
To run the runTimeTrials program, use the same approach we used to run tester.

You should see an indicator that the matrices are loading. When this completes, the MatrixTimer displays the operation being timed (addition or transpose), then a "progress bar" of 10 asterisks as it performs its 10 time-trials, and finally the average of the times required to perform the operation.

The program also creates a log file (e.g., trials-10-001.txt) in which, for each operation, it records the time of each individual trial as well as the average of those trials. Each time you run your program, take a moment to look over the individual trial times for that run. See if you can find any interesting patterns from execution to execution...

Recording The Results

To keep track of our times, launch a spreadsheet application. (A Google-Doc spreadsheet should work fine; if you are using OpenOffice, be sure to use version 3.0 or later.) On your spreadsheet, record
  1. the average time for addition, and
  2. the average time for transpose.
We will be adding more values to the spreadsheet, so be sure you label each row and column, to clearly indicate what it represents.

Experiment 1: Using omp parallel for

There is a rule of thumb in computer science called The 90-10 Rule, which states that a program spends roughly 90% of its time executing 10% of its code. If you think about what happened when you ran tester, you can get a sense of this -- the Matrix addition operation makes up a small percentage of the overall code, but the program spends much of its time there because the matrices being summed are fairly large.

Open up Matrix.cpp and find the definition of operator+. As you can see, this method uses two nested for loops. In Matrix addition, each repetition of the outer loop accesses different matrix elements than every other repetition. This means that the outer loop's repetitions are independent -- they can be parallelized across different threads without the threads interfering with one another.

For situations like this -- where a for loop's iterations are independent of one another -- OpenMP provides a special directive:

   #pragma omp parallel for OptionalClauses
Inserting this directive before a for loop will cause OpenMP to distribute the iterations of the for loop across different threads. For example, on a dual core CPU, OpenMP will spawn two threads, and each thread will perform half of the iterations in parallel with the other thread. Note that the loop's iterations must be independent of one another for this to work properly -- if any iteration j of a for loop somehow depends on a different iteration i, then this directive cannot be used to parallelize that loop.
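
For contrast, here is a hypothetical loop whose iterations are not independent: each iteration reads a value that the previous iteration wrote, so placing #pragma omp parallel for in front of it would produce wrong answers.

// Sketch only: a loop-carried dependence. Iteration i needs prefixSum[i-1],
// which another thread may not have computed yet, so this loop must NOT be
// parallelized with a plain '#pragma omp parallel for'.
void runningTotal(const double* values, double* prefixSum, unsigned n) {
   if (n == 0) { return; }
   prefixSum[0] = values[0];
   for (unsigned i = 1; i < n; i++) {
      prefixSum[i] = prefixSum[i-1] + values[i];   // depends on iteration i-1
   }
}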

Fortunately, the iterations of the loops in the Matrix addition operation are independent of one another, so we can use this directive to speed up the operation. To illustrate, we can add it to our method as follows:

Matrix Matrix::operator+(const Matrix& mat2) const {
   if (myRows != mat2.getRows() || myColumns != mat2.getColumns() ) {
      throw invalid_argument("Matrix::operator+: matrix dimensions mismatch");
   }
   Matrix result(myRows, myColumns);
   #pragma omp parallel for
   for (unsigned row = 0; row < myRows; row++) {
      for (unsigned col = 0; col < myColumns; col++) {
         result[row][col] = (*this)[row][col] + mat2[row][col] ;
      }
   }
   return result;
}
Inserting this directive will cause different threads to perform different iterations of the outer for loop, so that different rows of the Matrix will be summed by different threads. By default, the number of threads is the number of cores on the computer, so on a dual-core computer, 2 threads will be used; on a quad-core computer, 4 threads will be used; and so on.

Use the same approach to "parallelize" the transpose operation. With these directives in place, rebuild your program. If you see two warnings:

   warning: iteration variable 'rows' is unsigned
you may safely ignore them -- earlier versions of OpenMP supported only signed loop-control variables, so the compiler flags our unsigned ones -- but you may not ignore any other warnings!

Then rerun runTimeTrials. Is the average time for addition very different? What about the average time for transpose()?

Add these times to your spreadsheet. Congratulations! You've just created and run your first parallel C++ program!

Experiment 2: Using omp parallel for schedule(dynamic)

The OpenMP parallel for directive has various clauses that can be used with it, so let's try one of them. Modify your addition method so that it looks like this:

Matrix Matrix::operator+(const Matrix& mat2) const {
   if (myRows != mat2.getRows() || myColumns != mat2.getColumns() ) {
      throw invalid_argument("Matrix::operator+: matrix dimensions mismatch");
   }
   Matrix result(myRows, myColumns);
   #pragma omp parallel for schedule(dynamic)
   for (unsigned row = 0; row < myRows; row++) {
      for (unsigned col = 0; col < myColumns; col++) {
         result[row][col] = (*this)[row][col] + mat2[row][col] ;
      }
   }
   return result;
}
By default, the omp parallel for directive divides up the repetitions of the outer for loop evenly across the threads. This means that if one thread should finish its work while another still has work to do, the thread that finishes early doesn't "help out" the other thread, but sits idle, waiting for the other thread to finish. The schedule(dynamic) clause changes that: if a thread finishes its work and there are still repetitions to be performed, the thread that's finished helps out and does some of those repetitions. When different iterations take differing lengths of time, this can balance the load or amount of work each thread performs. Such load balancing can make some computations run faster.
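
For intuition, here is a hypothetical loop (not from today's code) in which later iterations do far more work than earlier ones; an even, up-front split would leave the threads holding the cheap iterations idle near the end, while schedule(dynamic) keeps every thread busy:

#include <cmath>
#include <vector>

// Sketch only: iteration i does roughly i*1000 units of work, so the
// iterations take very different amounts of time. schedule(dynamic) hands
// out iterations as threads become free, balancing the load.
std::vector<double> unevenWork(unsigned n) {
   std::vector<double> results(n, 0.0);
   #pragma omp parallel for schedule(dynamic)
   for (unsigned i = 0; i < n; i++) {
      double sum = 0.0;
      for (unsigned j = 0; j < i * 1000u; j++) {
         sum += std::sqrt( double(j) );
      }
      results[i] = sum;
   }
   return results;
}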

Make the same change to the transpose operation; then rebuild and rerun the program, and record your times. Do you see any improvement in the addition method? What about the transpose method? How do they compare to your original (sequential) times?

The degree to which this kind of parallelization improves performance depends on lots of different things, including memory access patterns. Recall how a Matrix is laid out in memory. Then recall how the addition operation accesses that memory, and how the transpose operation accesses that memory. How are they different? Can that help explain the results you are observing?

Experiment 3: Using omp parallel for schedule(dynamic,2)

The schedule(dynamic) clause has an optional second parameter -- the chunk size -- that tells it how many loop repetitions to hand a thread at a time. By default, a thread gets one repetition at a time, so let's try giving it two at a time, to see if we get any improvement:

Matrix Matrix::operator+(const Matrix& mat2) const {
   if (myRows != mat2.getRows() || myColumns != mat2.getColumns() ) {
      throw invalid_argument("Matrix::operator+: matrix dimensions mismatch");
   }
   Matrix result(myRows, myColumns);
   #pragma omp parallel for schedule(dynamic, 2)
   for (unsigned row = 0; row < myRows; row++) {
      for (unsigned col = 0; col < myColumns; col++) {
         result[row][col] = (*this)[row][col] + mat2[row][col] ;
      }
   }
   return result;
}

Make the same change to the transpose operation; then rebuild and rerun the program, and record your times. Do you see any improvement in the addition method? What about transpose()? How do they compare to your original (sequential) times?

Experiment 4: Using the omp parallel block

Our next-to-last experiment is to try replacing the omp parallel for directive with an omp parallel block directive:


   #pragma omp parallel OptionalClauses
   {
      // statements to be parallelized
   }
Unlike the omp parallel for, which can only be applied to a for loop, the omp parallel block directive is more general, more flexible, and more broadly useful.

When execution reaches this directive, OpenMP (by default) spawns new threads so that there is a separate thread for each core in the CPU. Each thread then performs the statements inside the block in parallel:

OpenMP divides a computation among multiple threads.

The master thread is numbered 0, the first child thread is numbered 1, the second child thread is numbered 2, and so on. OpenMP provides functions that you can use inside the block to retrieve useful information about the threads: omp_get_thread_num() returns the id number of the thread that calls it, and omp_get_num_threads() returns the total number of threads executing the block.

Note that these operations only work this way inside the parallel block.
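
A quick self-contained sketch of that difference -- outside any parallel block, these functions report a "team" of just one thread:

#include <omp.h>
#include <cstdio>

int main() {
   // Outside a parallel block: thread 0 of 1.
   std::printf("outside: thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());

   #pragma omp parallel
   {
      // Inside the block: each thread sees its own id and the full team size.
      std::printf("inside:  thread %d of %d\n",
                  omp_get_thread_num(), omp_get_num_threads());
   }
   return 0;
}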

The OpenMP parallel block and these functions make it possible to build operations that specify the exact parallel behavior each thread is to perform. To illustrate, we can modify our addition method so that it has the following form:

Matrix Matrix::operator+(const Matrix& mat2) const {
   if (myRows != mat2.getRows() || myColumns != mat2.getColumns() ) {
      throw invalid_argument("Matrix::operator+: matrix dimensions mismatch");
   }
   Matrix result(myRows, myColumns);
   #pragma omp parallel
   {
      unsigned threadID = omp_get_thread_num();
      unsigned numThreads = omp_get_num_threads();
      for (unsigned row = threadID; row < myRows; row += numThreads) {
         for (unsigned col = 0; col < myColumns; col++) {
            result[row][col] = (*this)[row][col] + mat2[row][col];
         }
      }
   }
   return result;
}
Don't forget that closing-brace after the for loop!

When the threads reach this statement

      unsigned threadID = omp_get_thread_num();
the omp_get_thread_num() function returns 0 in the master thread, 1 in the first child thread, and so on -- each thread gets its own id number.

In the next statement:

      unsigned numThreads = omp_get_num_threads();
the omp_get_num_threads() function returns the number of threads in the parallel block: 2 on a dual-core CPU, 4 on a quad-core CPU, and so on.

Now, look carefully at the first line of the for loop.

    for (unsigned row = threadID; row < myRows; row += numThreads) {
Since the loop-variable row controls the number of the row the inner loop is processing, the master thread will process row 0, the first child thread will process row 1, and so on.

Moreover, when a thread finishes its row, it adds the number of threads in the block to row, which jumps it ahead to its next assigned row: with four threads, the master thread processes rows 0, 4, 8, ...; the first child thread processes rows 1, 5, 9, ...; and so on.

The effect is thus to explicitly spread the processing of the Matrix's rows across the threads.
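
As an illustration, this small sketch (made-up values: 10 rows and 4 threads, not from today's files) prints which rows each thread would process under that interleaved scheme:

#include <cstdio>

// Sketch only: simulate the interleaved row assignment sequentially.
// Thread 0 gets rows 0, 4, 8; thread 1 gets rows 1, 5, 9; and so on.
int main() {
   const unsigned myRows = 10, numThreads = 4;   // hypothetical values
   for (unsigned threadID = 0; threadID < numThreads; threadID++) {
      std::printf("thread %u: rows", threadID);
      for (unsigned row = threadID; row < myRows; row += numThreads) {
         std::printf(" %u", row);
      }
      std::printf("\n");
   }
   return 0;
}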

Discuss this question with your neighbor: Will this modified method still work correctly on a single-core CPU (i.e., if there is just a single thread)? Why or why not? Together, trace the execution of the for loop to see whether or not it still works correctly.

Since the number of cores available varies from computer to computer, it is important for parallel operations to scale -- to run correctly without modification -- regardless of whether there are 1, 2, 4, 8, ..., or N cores available. We don't want to have to change our program every time we move our program to a computer with a different number of cores!

Make the same change to the transpose operation; then rebuild and rerun the program, and record your times. How does this compare with your previous results for addition? How does this compare with your previous results for transpose?

Controlling the Number of Threads

OpenMP provides several ways to control the number of threads in a parallel block instead of using the default values. One way is the command:

   omp_set_num_threads(N);
which tells OpenMP to use N threads in parallel blocks. To illustrate, we might modify runTimeTrials.cpp as follows:
int main() {
	unsigned numTrials = 10;	      // number of trials to average
	string logFileName = buildLogFileName(numTrials);
	omp_set_num_threads(1);
	MatrixTimer matrixTimer(numTrials, logFileName);
	matrixTimer.run();
}
Rebuild your program, run it, and record your results. How do they compare to your previous results?

Next, change the command so that OpenMP uses two threads in its parallel blocks. Rebuild your program, run it, and record your results. How do they compare to your previous results?

Next, change the command so that OpenMP uses four threads in its parallel blocks. Rebuild your program, run it, and record your results. How do they compare to your previous results?

Change the command so that OpenMP uses eight threads in its parallel blocks. Rebuild your program, run it, and record your results. How do they compare to your previous results?

Finally, change the command so that OpenMP uses sixteen threads in its parallel blocks. Rebuild your program, run it, and record your results. How do they compare to your previous results?
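
For reference, the num_threads clause (which sets the thread count for a single parallel region) and the OMP_NUM_THREADS environment variable (which sets the default for an entire run) are two of the other ways OpenMP provides. A minimal sketch of the clause form:

#include <omp.h>
#include <cstdio>

int main() {
   // Sketch only: num_threads(4) requests 4 threads for just this region,
   // overriding the default thread count.
   #pragma omp parallel num_threads(4)
   {
      std::printf("thread %d of %d\n",
                  omp_get_thread_num(), omp_get_num_threads());
   }
   return 0;
}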

Visualizing Your Results

Using your spreadsheet:

  1. Create a chart for each experiment showing two series -- one for addition, and one for transpose -- comparing the average times of your sequential versions to the average times of your parallelized versions. For each of experiments 1-3, create a bar chart; for experiment 4, create a line chart that plots the execution times on the Y-axis against the number of threads used (i.e., 1, 2, 4, 8, 16) on the X-axis.
  2. Create two bar charts -- one for addition, and one for transpose -- in which you compare the times of the different approaches that we used to parallelize these operations (i.e., experiments 1, 2, 3, and 4), using all of the cores on your computer (2 on a dual-core machine, 4 on a quad-core machine, etc).
  3. In your spreadsheet, answer the following questions:
  4. Compare your charts with those of at least two classmates. Are your results consistent with, or different from theirs?

    If you find that your results for an experiment are consistent with those of your classmates, then your results are more likely to be significant and reliable. If you and your classmates get different results for an experiment, then those results are less likely to be significant and more likely to reflect imprecision in the clock/timing mechanism.

Turn In

Make sure your spreadsheet is saved in your project folder, in an Excel-compatible format (.xls). I recommend the one for 97/2000/XP, as it seems to work well.

Then copy your project to a new /home/cs/112/current/yourUserName/lab10 folder.

If you have time, you may wish to begin work on this week's project.

