HPC Project 4: Combining Patterns to Solve a Simple Problem


Overview

In this exercise, we want to use parallelism to speed up the solution to a simple problem: computing the sum of the squares of the values stored in an input file. The basic idea is to parallelize this summation by distributing it across N different PEs. To make the problem more interesting, we'll be using the large files you used in the lab part of this exercise.

Getting Started

The file squareAndSum.c contains a sequential program to sum the values in an array, and Makefile contains the commands needed to build it via make.

The files in the directory /home/cs/374/exercises/04 can be used to test this program, either by using their absolute pathnames, or by creating symbolic links to them in your project directory. Please access the files directly from there -- don't copy them, to avoid wasting space.

Add the necessary timing calls to squareAndSum.c to separately measure:

  1. the time required to open the file, allocate the array, and read the file's values into the array,
  2. the time required to compute the sum of the squares of the values in the array, and
  3. the total time (i.e., the sum of the previous two times).
Modify the printf() statement at the end so that it reports these times.

Then compile the file, and when it is error-free, execute it using each of the .txt files in /home/cs/374/exercises/04. In a spreadsheet, record the sums and times you get for each file.

Part I: Sequential Binary

As we saw in the lab exercise, reading from text files is very slow, so the first thing you should do to speed up the program is convert it to use binary files instead of text files.

Make a copy of squareAndSum.c and, using what we learned in the lab, modify it so that it reads the values from a binary file instead of a text file.

Compile your program and when it is error-free, execute it using each of the .bin files in /home/cs/374/exercises/04. You should get approximately the same sums for the .bin files as you got for the .txt files. Note that you are unlikely to get the exact same sums because the text representation of a number may not be quite the same as its binary representation.

In your spreadsheet, record the sums and times you get for each file.

Part II: Parallel Binary

To improve the performance of squareAndSum.c further, we can incorporate parallelism into it. To do so, you have two options:

  1. One option is to have the master process open and read the values from the file into the array, and then scatter the array's values to the workers. Each process then computes the sum of the squares of its fragment of the array, and the processes use a reduction to combine those partial sums into an overall sum. Note that since you cannot assume the number of values is evenly divisible by P, the number of processes, you will need to use MPI_Scatterv() if you choose this option.
  2. The other option for incorporating parallelism into squareAndSum.c is to convert it to C++ and use the OO_MPI_IO class templates to replace the input and scatter steps. Using a ParallelReader, each process can read its chunk of the input file, use a loop to square and sum the values in its chunk, and then use a reduction to combine these partial sums into a total sum.
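For the first option, the scatter-and-reduce structure might look like the following sketch (variable names are illustrative and error handling is omitted). Because P may not divide the number of values evenly, MPI_Scatterv() takes per-process counts and displacements:

```c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char** argv) {
    int id = 0, P = 1;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    long howMany = 0;
    double* values = NULL;
    double startTime = MPI_Wtime();
    if (id == 0) {
        /* ... open the binary file, set howMany, allocate values[],
         *     and read the values, as in Part I ... */
    }
    double inputTime = MPI_Wtime() - startTime;

    /* Tell every process how many values there are. */
    MPI_Bcast(&howMany, 1, MPI_LONG, 0, MPI_COMM_WORLD);

    /* Per-process counts and displacements: the first (howMany % P)
     * processes each receive one extra value. */
    int* counts = malloc(P * sizeof(int));
    int* displs = malloc(P * sizeof(int));
    int offset = 0;
    for (int i = 0; i < P; ++i) {
        counts[i] = howMany / P + (i < howMany % P ? 1 : 0);
        displs[i] = offset;
        offset += counts[i];
    }

    startTime = MPI_Wtime();
    double* myChunk = malloc(counts[id] * sizeof(double));
    MPI_Scatterv(values, counts, displs, MPI_DOUBLE,
                 myChunk, counts[id], MPI_DOUBLE, 0, MPI_COMM_WORLD);
    double scatterTime = MPI_Wtime() - startTime;

    startTime = MPI_Wtime();
    double localSum = 0.0, totalSum = 0.0;
    for (int i = 0; i < counts[id]; ++i) {
        localSum += myChunk[i] * myChunk[i];
    }
    MPI_Reduce(&localSum, &totalSum, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);
    double sumTime = MPI_Wtime() - startTime;

    if (id == 0) {
        printf("sum: %f; input: %f s; scatter: %f s; sum+reduce: %f s\n",
               totalSum, inputTime, scatterTime, sumTime);
    }
    free(myChunk); free(counts); free(displs); free(values);
    MPI_Finalize();
    return 0;
}
```

Note how the MPI_Wtime() calls bracket each phase, so the input, scatter, and sum+reduce times can be reported separately.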
Make a copy of your binary version of squareAndSum.c and modify it to use one of these approaches.

If you choose Option 1, add additional MPI_Wtime() calls to distinguish the input time from the scatter time, so that we can tease out how much each of these steps contributes to the overall time.

In addition to the I/O, you should parallelize the function that computes the sum of the squares of the values in the array.

Compile your program and when it is error-free, execute it three times using each of the .bin files in /home/cs/374/exercises/04. You should get approximately the same sums for the .bin files as you got previously.

Note that when performing operations like multiplying and adding real numbers, round-off errors occur, and the particular round-off errors depend on the order in which the multiplications and additions are performed. Since the exact order in which these operations occur within a parallel computation is non-deterministic, different executions of a parallel computation on real numbers may not produce exactly the same results.

When it works correctly, record the sums and times you get for each file in your spreadsheet using 1, 2, 4, 6, and 8 processes. For each number of processes, use the spreadsheet to compute the average of the three times, the parallel speedup, and the parallel efficiency.
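As a reminder, the standard definitions (where T_1 is the time using one process and T_P is the time using P processes) are:

  speedup:     S_P = T_1 / T_P
  efficiency:  E_P = S_P / P

For example, if the 1-process run takes 8.0 seconds and the 4-process run takes 2.5 seconds, then S_4 = 8.0 / 2.5 = 3.2 and E_4 = 3.2 / 4 = 0.8 (i.e., 80% efficiency).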

Part III: Visualizing the Results

For each of our four input file sizes, create two stacked bar-charts with time on the Y-axis:

In each bar, the time required for the input step (or the input+scatter steps) should be at the bottom of the stacked bar, and the square+sum (or square+sum+reduce) time on top of it; the height of the bar should reflect the total time.

Then create two line-charts for your parallel program, with the number of processes (2, 4, 6, and 8) on the X-axis. Each file-size should be a separate line/series on these charts.

This should produce a total of 10 charts that you can use to explore how the sequential and parallel versions of the program behave as the sizes of the input files change. Be sure to use descriptive titles and labels for your charts and their axes, and use bars or lines on your charts that can be easily distinguished from one another when the chart is printed on a gray-scale printer.

To give you sufficient time for all of this, this project is due the Wednesday after Thanksgiving break, giving you two calendar weeks to complete it. I strongly recommend that you complete at least Parts I and II this coming week, so that you have the second week to create your visualizations and perform your analysis.

Hand In

Hard copies of:

  1. A script-program recording in which you use cat to list the source code for the different versions of the program, and show they compile and run correctly using the files containing ten million values. Use 6 processes for the parallel programs.
  2. Your 10 charts and the spreadsheet data from which you created them. Make each chart a full page in size to maximize our ability to see its details.
  3. A 1-2 page analysis of your results. Explain what the timing measurements you've observed reveal. Compare and contrast the performance of the different versions of the program. Where are the different versions losing and/or gaining time? For example, in Part II, was the read, the scatter, or the square+sum+reduce the most time-consuming step? Did the file size make any difference? Do any of the programs' behaviors reflect Amdahl's or Gustafson's laws? As always, make this a quantitative analysis, citing specific data points from your data to support any claims.
Please staple these pages together and make certain that your name is on each page.


This page maintained by Joel Adams.