HPC MPI Exercise 1: Hands-On Lab


Getting Started

Begin by making a new directory for this course, containing a subdirectory for this exercise, and save a copy of the C program spmd.c in that subdirectory. This program is a parallel "hello world"-style program that uses the Message Passing Interface (MPI). It demonstrates the single program multiple data (SPMD) pattern that is commonly used in parallel programming. We will use it to get started compiling and running MPI programs.

Compiling an MPI Program

To compile the program spmd.c, use the command:

   mpicc spmd.c -Wall -o spmd
mpicc compiles a C program in the usual way (i.e., using gcc), except that it correctly processes and links in the MPI routines the program uses. (These would generate errors if a regular C compiler were used.) The -Wall switch turns on the compiler's warnings, and the -o switch stores the resulting binary program in a file named spmd.

(If you get an error message like:

   bash: mpicc: command not found
then you will need to add MPI's bin directory to your PATH variable. For example, if MPI's bin directory were in /opt/openmpi/bin/, you would add:
    PATH=/opt/openmpi/bin/:$PATH
    export PATH 
to your .bash_profile file in your home directory.)
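
Since .bash_profile is only read when you start a new login shell, log out and back in (or run source ~/.bash_profile in your current terminal) for the change to take effect. You can then check that the shell can find the compiler with:

    which mpicc

which should print the full path to mpicc (the exact path depends on where MPI is installed on your system).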

Running an MPI Program

To run an MPI program in parallel on multiple workstations, we will use the mpirun command. In the Unix lab, we will invoke it using two switches:

  mpirun -np numProcesses -machinefile hostNameFile binaryFileName 
will run numProcesses instances of the program in binaryFileName, launching remote processes on the workstations listed in hostNameFile. For example, to run the program spmd using 4 processes on the workstations listed in the file hosts, we would enter:
   mpirun -np 4 -machinefile hosts ./spmd

The mpirun command will start up processes on the remote machines in the order they are listed in the machinefile, so you may want to ensure that the machine on which you are working occurs first in that file.
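
A machinefile is just a plain text file containing one workstation name per line. For example, a hosts file for four lab workstations might look like this (the machine names below are placeholders; use the names of your own lab's workstations):

   workstation01
   workstation02
   workstation03
   workstation04

With this file, mpirun will place the first process on workstation01, the second on workstation02, and so on.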

Run the program with different values for the -np switch (e.g., 1, 2, 4, 6, 8, 10, 12) and compare its behavior against its source code, until you understand how it generates the behavior you observe.
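
For reference, a minimal SPMD-style MPI program along the same lines as spmd.c might look like the following sketch (the exact message it prints, and whether it reports the host name, are assumptions; consult the actual spmd.c you downloaded):

   #include <stdio.h>
   #include <mpi.h>

   int main(int argc, char** argv) {
      int id = -1, numProcesses = -1, length = 0;
      char hostName[MPI_MAX_PROCESSOR_NAME];

      MPI_Init(&argc, &argv);                          // start up MPI
      MPI_Comm_rank(MPI_COMM_WORLD, &id);              // which process am I?
      MPI_Comm_size(MPI_COMM_WORLD, &numProcesses);    // how many processes are there?
      MPI_Get_processor_name(hostName, &length);       // which machine am I running on?

      printf("Greetings from process %d of %d on %s\n",
             id, numProcesses, hostName);

      MPI_Finalize();                                  // shut down MPI
      return 0;
   }

Note that every process runs this same program; what differs from process to process is the data MPI gives each one.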

Question (discuss with your neighbor): SPMD stands for Single Program Multiple Data; spmd.c is clearly a single program; what multiple data items do its processes have?

The Parallel Loop Pattern

When you are confident that you understand how spmd.c is working, you are ready to proceed to our second common parallel design pattern: the parallel loop pattern, which divides the iterations of a loop among a computation's processes.

A normal for loop iterates over a range of values. When these values are consecutive integers and the iterations are independent of one another:

   for (i = first; i <= last; i++) {
      // do something independent of other iterations
   }
then it is fairly easy to divide those iterations across multiple processes. There are different ways to do this, but two of the most common are:
  1. The simpler 'slices' approach (aka the 'stripes' approach), in which the loop's iterations are divided into fine-grained slices or stripes; and
  2. The more complex 'chunks' approach (aka the 'blocks' approach), in which the loop's iterations are divided into coarse-grained chunks or blocks.

1. The 'Slices' Approach

The behavior of the 'slices' version of a parallel loop might be described as follows:

   for (i = first + processID; i <= last; i += numberOfProcesses) {
      // do something independent of other iterations
   }
This pattern spreads the loop's iterations across a computation's processes.
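
The program parallelLoopSlices is not reproduced here, but a minimal, self-contained sketch of the 'slices' pattern in MPI might look like this (the iteration count REPS and the printed message are illustrative):

   #include <stdio.h>
   #include <mpi.h>

   #define REPS 16                      // total number of iterations (illustrative)

   int main(int argc, char** argv) {
      int i, id = -1, numProcesses = -1;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &id);
      MPI_Comm_size(MPI_COMM_WORLD, &numProcesses);

      for (i = id; i < REPS; i += numProcesses) {   // each process takes every P-th iteration
         printf("Process %d is performing iteration %d\n", id, i);
      }

      MPI_Finalize();
      return 0;
   }

Running a sketch like this with different values of -np and reading its (interleaved) output is a good way to see which iterations each process performs.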

To explore and experiment with it, create a new folder and download the program parallelLoopSlices. Using the same approach you used for spmd, compile and run parallelLoopSlices using different numbers of processes (e.g., 1, 2, 3, 4, 5, 6, 7, 8).

You will be using this 'slices' version of the parallel loop pattern in this week's homework project.

Question (discuss with your neighbor): If there are P processes and N iterations (P < N), then in the 'slices' version of the Parallel Loop, which iterations does process 0 do? Process 1? Process 2? Process P-1?

2. The 'Chunks' Approach

If the number of iterations is evenly divisible by the number of processes, then the 'chunks' version of the parallel loop can be implemented as follows:

   chunkSize = ITERATIONS / numberOfProcesses;
   start = processID * chunkSize;
   stop = start + chunkSize;
   for (i = start; i < stop; i++) {
      // do something independent of other iterations
   }
If the number of iterations is not evenly divisible by the number of processes, then implementing this pattern is slightly more complicated, as we shall see.

Like the 'slices' approach, the 'chunks' approach spreads the iterations across a computation's processes, but each of the P processes performs a 'chunk' of consecutive iterations.
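
When ITERATIONS is not evenly divisible by the number of processes, one common way to form the chunks is to give the first (ITERATIONS % numberOfProcesses) processes one extra iteration each. Here is a sketch of that approach (the variable names are illustrative, and the actual parallelLoopChunks program may compute its chunks differently):

   chunkSize = ITERATIONS / numberOfProcesses;   // minimum iterations per process
   remainder = ITERATIONS % numberOfProcesses;   // leftover iterations
   if (processID < remainder) {                  // the first 'remainder' processes
      start = processID * (chunkSize + 1);       //  each take one extra iteration
      stop = start + chunkSize + 1;
   } else {                                      // the rest take exactly chunkSize
      start = processID * chunkSize + remainder;
      stop = start + chunkSize;
   }
   for (i = start; i < stop; i++) {
      // do something independent of other iterations
   }

With this scheme every iteration is performed exactly once, and no two processes' chunks differ in size by more than one.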

To explore and experiment with this version of the pattern, create a new folder and download the program parallelLoopChunks. Using the same approach you used previously, compile and run parallelLoopChunks using different numbers of processes (e.g., 1, 2, 3, 4, 5, 6, 7, 8). Compare its output against that of the 'slices' version and take careful note of how the two versions of the Parallel Loop pattern differ.

Question (discuss with your neighbor): If there are P processes and N iterations (P < N), then in the 'chunks' version of the Parallel Loop, which iterations does process 0 do? Process 1? Process 2? Process P-1?

Digging Deeper

If you want to visualize how the two versions differ, you may find visualParallelLoopSlices and visualParallelLoopChunks to be of interest.

These programs use a graphical library called MPE, which requires some specific compiler switches, so you should save each to its own folder and compile it using the provided Makefile.

Each program creates a graphical window, assigns each process a color, and then uses the parallel loop pattern to iterate through the rows of the window, coloring each row's pixels with the current process's color. Since the rows are spread across multiple processes and each process has its own color, the result is a visual representation of how that version of the Parallel Loop pattern divides up the loop iterations. Try them using 1, 2, 4, 8, 16, 32, 64, and 128 processes and see how the window changes.

We have (so far) been unsuccessful in getting MPE graphics to work across the network, so when launching these programs, run all of the processes on your local workstation (i.e., don't use the -machinefile hosts switch).

Question (discuss with your neighbor): The 'chunks' version of the Parallel Loop pattern takes more work to implement than the 'slices' version. Can you think of circumstances in which the 'chunks' version will perform significantly faster than the 'slices' version? Can you think of circumstances in which the 'slices' version will perform significantly faster than the 'chunks' version?

Note that both versions of the Parallel Loop pattern divide a loop's iterations among the computation's processes. Given P processes and N iterations, some people find it helpful to think of the 'chunks' version as dividing the iterations into P blocks that are as equal in size as possible, and the 'slices' version as dividing the iterations into N blocks (i.e., blocks whose sizes are 1). For example, with N = 10 and P = 3, the 'chunks' version produces blocks of sizes 4, 3, and 3, while the 'slices' version produces 10 blocks of size 1. Put differently, the 'chunks' version divides the iterations into the P largest possible blocks, and the 'slices' version divides the iterations into the smallest possible blocks.

When you are confident you understand how these two versions of the Parallel Loop pattern behave, feel free to proceed to this week's homework project.


This page maintained by Joel Adams.