HPC MPI Exercise 1: Hands-On Lab


Getting Started

Begin by making a new directory for this course, containing a subdirectory for this exercise, and save a copy of the C program spmd.c in that subdirectory. This program is a parallel "hello world"-style program that uses the Message Passing Interface (MPI). It demonstrates the single program multiple data (SPMD) pattern that is commonly used in parallel programming. We will use it to get started compiling and running MPI programs.

Compiling an MPI Program

To compile the program spmd.c, use the command:

   mpicc spmd.c -Wall -o spmd
mpicc compiles a C program in the usual way (i.e., using gcc), except that it correctly processes and links in the MPI routines the program uses. (These would generate errors if a regular C compiler were used.) The -Wall switch turns on the compiler's warnings, and the -o switch stores the resulting binary program in a file named spmd.

(If you get an error message like:

   bash: mpicc: command not found
then you will need to add MPI's bin directory to your PATH variable. For example, if MPI's bin directory were in /opt/openmpi/bin/, you would add:
    PATH=/opt/openmpi/bin/:$PATH
    export PATH 
to your .bash_profile file in your home directory.)
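
Since .bash_profile is only read when you start a new login shell, log out and back in (or run source ~/.bash_profile in your current terminal) for the change to take effect. You can then check that the shell can find the compiler with:

    which mpicc

which should print the full path to mpicc (the exact path depends on where MPI is installed on your system).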

Running an MPI Program

To run an MPI program in parallel on multiple workstations, we will use the mpirun command. In the Unix lab, we will invoke it using two switches:

  mpirun -np numProcesses -machinefile hostNameFile binaryFileName 
will run numProcesses instances of the program in binaryFileName, launching remote processes on the workstations listed in hostNameFile. For example, to run the program spmd using 4 processes on the workstations listed in the file hosts, we would enter:
   mpirun -np 4 -machinefile hosts ./spmd

The mpirun command will start up processes on the remote machines in the order they are listed in the machinefile, so you may want to ensure that the machine on which you are working occurs first in that file.
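
A machinefile is just a plain text file containing one workstation name per line. For example, a hosts file for four lab workstations might look like this (the machine names below are placeholders; use the names of your own lab's workstations):

   workstation01
   workstation02
   workstation03
   workstation04

With this file, mpirun will place the first process on workstation01, the second on workstation02, and so on.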

Run the program with different values for the -np switch (e.g., 1, 2, 4, 6, 8, 10, 12) and compare its behavior against its source code, until you understand how it generates the behavior you observe.
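
For reference, a minimal SPMD-style MPI program along the same lines as spmd.c might look like the following sketch (the exact message it prints, and whether it reports the host name, are assumptions; consult the actual spmd.c you downloaded):

   #include <stdio.h>
   #include <mpi.h>

   int main(int argc, char** argv) {
      int id = -1, numProcesses = -1, length = 0;
      char hostName[MPI_MAX_PROCESSOR_NAME];

      MPI_Init(&argc, &argv);                          // start up MPI
      MPI_Comm_rank(MPI_COMM_WORLD, &id);              // which process am I?
      MPI_Comm_size(MPI_COMM_WORLD, &numProcesses);    // how many processes are there?
      MPI_Get_processor_name(hostName, &length);       // which machine am I running on?

      printf("Greetings from process %d of %d on %s\n",
             id, numProcesses, hostName);

      MPI_Finalize();                                  // shut down MPI
      return 0;
   }

Note that every process runs this same program; what differs from process to process is the data MPI gives each one.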

Question (discuss with your neighbor): SPMD stands for Single Program Multiple Data; spmd.c is clearly a single program; what multiple data items do its processes have?

The Parallel Loop Pattern

When you are confident that you understand how spmd.c is working, you are ready to proceed to our second common parallel design pattern: the parallel loop pattern, which divides the iterations of a loop among a computation's processes.

A normal for loop iterates over a range of values. When these values are consecutive integers and the iterations are independent of one another:

   for (i = first; i <= last; i++) {
      // do something independent of other iterations
   }
then it is fairly easy to divide those iterations across multiple processes. There are different ways to do this, but two of the most common are:
  1. The simpler 'slices' approach (aka the 'stripes' approach), in which the loop's iterations are divided into fine-grained slices or stripes; and
  2. The more complex 'chunks' approach (aka the 'blocks' approach), in which the loop's iterations are divided into coarse-grained chunks or blocks.

1. The 'Slices' Approach

The behavior of the 'slices' version of a parallel loop might be described as follows:

   for (i = first + processID; i <= last; i += numberOfProcesses) {
      // do something independent of other iterations
   }
This pattern spreads the loop's iterations across a computation's processes.
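
The program parallelLoopSlices is not reproduced here, but a minimal, self-contained sketch of the 'slices' pattern in MPI might look like this (the iteration count REPS and the printed message are illustrative):

   #include <stdio.h>
   #include <mpi.h>

   #define REPS 16                      // total number of iterations (illustrative)

   int main(int argc, char** argv) {
      int i, id = -1, numProcesses = -1;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &id);
      MPI_Comm_size(MPI_COMM_WORLD, &numProcesses);

      for (i = id; i < REPS; i += numProcesses) {   // each process takes every P-th iteration
         printf("Process %d is performing iteration %d\n", id, i);
      }

      MPI_Finalize();
      return 0;
   }

Running a sketch like this with different values of -np and reading its (interleaved) output is a good way to see which iterations each process performs.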

To explore and experiment with it, create a new folder and download the program parallelLoopSlices. Using the same approach you used for spmd, compile and run parallelLoopSlices using different numbers of processes (e.g., 1, 2, 3, 4, 5, 6, 7, 8).

You will be using this 'slices' version of the parallel loop pattern in this week's homework project.

Question (discuss with your neighbor): If there are P processes and N iterations (P < N), then in the 'slices' version of the Parallel Loop, which iterations does process 0 do? Process 1? Process 2? Process P-1?

2. The 'Chunks' Approach

If the number of iterations is evenly divisible by the number of processes, then the 'chunks' version of the parallel loop can be implemented as follows:

   chunkSize = ITERATIONS / numberOfProcesses;
   start = processID * chunkSize;
   stop = start + chunkSize;
   for (i = start; i < stop; i++) {
      // do something independent of other iterations
   }
If the number of iterations is not evenly divisible by the number of processes, then implementing this pattern is slightly more complicated, as we shall see.

Like the 'slices' approach, the 'chunks' approach spreads the iterations across a computation's processes, but each of the P processes performs a 'chunk' of consecutive iterations.
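
When ITERATIONS is not evenly divisible by the number of processes, one common way to form the chunks is to give the first (ITERATIONS % numberOfProcesses) processes one extra iteration each. Here is a sketch of that approach (the variable names are illustrative, and the actual parallelLoopChunks program may compute its chunks differently):

   chunkSize = ITERATIONS / numberOfProcesses;   // minimum iterations per process
   remainder = ITERATIONS % numberOfProcesses;   // leftover iterations
   if (processID < remainder) {                  // the first 'remainder' processes
      start = processID * (chunkSize + 1);       //  each take one extra iteration
      stop = start + chunkSize + 1;
   } else {                                      // the rest take exactly chunkSize
      start = processID * chunkSize + remainder;
      stop = start + chunkSize;
   }
   for (i = start; i < stop; i++) {
      // do something independent of other iterations
   }

With this scheme every iteration is performed exactly once, and no two processes' chunks differ in size by more than one.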

To explore and experiment with this version of the pattern, create a new folder and download the program parallelLoopChunks. Using the same approach you used previously, compile and run parallelLoopChunks using different numbers of processes (e.g., 1, 2, 3, 4, 5, 6, 7, 8). Compare its output against that of the 'slices' version and take careful note of how the two versions of the Parallel Loop pattern differ.

Question (discuss with your neighbor): If there are P processes and N iterations (P < N), then in the 'chunks' version of the Parallel Loop, which iterations does process 0 do? Process 1? Process 2? Process P-1?

Digging Deeper

If you want to visualize how the two versions differ, you may find visualParallelLoopSlices and visualParallelLoopChunks to be of interest.

These programs use a graphical library called MPE, which requires some specific compiler switches, so you should save each to its own folder and compile it using the provided Makefile.

Each program creates a graphical window, assigns each process a color, and then uses the parallel loop pattern to iterate through the rows of the window, coloring each row's pixels with the current process's color. Since the rows are spread across multiple processes and each process has its own color, the result is a visual representation of how that version of the Parallel Loop pattern divides up the loop iterations. Try them using 1, 2, 4, 8, 16, 32, 64, and 128 processes and see how the window changes.

We have (so far) been unsuccessful in getting MPE graphics to work across the network, so when launching these programs, run all of the processes on your local workstation (i.e., don't use the -machinefile hosts switch).

Question (discuss with your neighbor): The 'chunks' version of the Parallel Loop pattern takes more work to implement than the 'slices' version. Can you think of circumstances in which the 'chunks' version will perform significantly faster than the 'slices' version? Can you think of circumstances in which the 'slices' version will perform significantly faster than the 'chunks' version?

Note that both versions of the Parallel Loop pattern divide a loop's iterations among the computation's processes. Given P processes and N iterations, some people find it helpful to think of the 'chunks' version as dividing the iterations into P blocks that are as equal in size as possible, and the 'slices' version as dividing the iterations into N blocks (i.e., blocks whose sizes are 1). For example, with N = 10 and P = 3, the 'chunks' version produces blocks of sizes 4, 3, and 3, while the 'slices' version produces 10 blocks of size 1. Put differently, the 'chunks' version divides the iterations into the P largest possible blocks, and the 'slices' version divides the iterations into the smallest possible blocks.

When you are confident you understand how these two versions of the Parallel Loop pattern behave, feel free to proceed to this week's homework project.


This page maintained by Joel Adams.