### HPC MPI Exercise 1: Homework Project

The Circuit Satisfiability Problem for a given binary circuit is to find the set of inputs that cause that circuit to produce '1' as its output. For example, here is a 32-bit circuit diagram:

Since the circuit contains AND-gates, OR-gates, and inverters, we can model that circuit in C using the AND (&&), OR (||), and NOT (!) operators. The sequential program circuitSatisfiability.c. does this, and solves the problem by using a for loop to iterate through all possible inputs, outputting those inputs that produce '1'.

#### Homework

This week's assignment has these parts:

1. In the Unix Lab:

Using your favorite text editor, modify circuitSatisfiability.c. so that it uses the MPI_Wtime() function to time the computation:

```     ...
double startTime = 0.0, totalTime = 0.0;
startTime = MPI_Wtime();

for (i = 0; i < UINT_MAX; i++) {
count += checkCircuit(id, i);
}

totalTime = MPI_Wtime() - startTime;

printf("Process %d finished in time %f secs.\n", id, totalTime);
...
```
(Note that you will have to #include <mpi.h> in order for the program to compile.)

Then compile the program as follows:

`     mpicc circuitSatisfiability.c -Wall -ansi -pedantic -std=c99 -o circuitSatisfiability`
Then run the program as follows:
`     ./circuitSatisfiability`
With these modifications, the program will self-report how long it took to check the circuit.

Warning: This program must try 232 values; on my machine, it took ~7 minutes to finish. You may want to bring some other work to do while it runs!

2. Create a Google Docs spreadsheet with column headings Number of Processes and Time. In the first row beneath these headings, enter '1' and the time (in seconds) reported by the program. This will serve as our baseline time against which we'll compare subsequent times. In an adjacent cell of the spreadsheet, record how many solutions this sequential version of the program found.
3. Using parallelLoopSlices.c as a model, use the parallel loop pattern to speed up circuitSatisfiability.c.
4. Use MPI's version of the reduction pattern to sum the distributed processes' count-values into a global count, and have process 0 output this global count. (Make certain it matches the count produced by the sequential version.)
5. When your program compiles correctly and produces the correct results, time its execution with 2, 4, 8, 16, 32, 64, 128, and 256 processes. Enter these numbers and the corresponding times in your spreadsheet.
6. Create a line-chart of your spreadsheet's values, with the number of processes on the X-axis and time on the Y-axis. Label the scale of your X-axis with the number of processes (1, 2, 4, 8, 16, 32, 64, 128, and 256) so that this axis indicates the scale at which measurements were taken.
2. On The Cluster:
1. Transfer your source program to the cluster (dahl.calvin.edu). To do this, you will need to set up key-based authentication on the cluster. Once you have done so, you can use the secure-copy program scp to transfer your file. For example, if my program is in a file named circuitSatisfiability.c, to transfer it to my 374/proj01 directory on Dahl, I would type:
`      scp -P 22122 circuitSatisfiability.c adams@dahl.calvin.edu:~/374/proj01 `
You can use the same approach to transfer your Makefile to the cluster, and can use a similar approach to transfer files from the cluster back to the Ulab.
2. Once you have transferred your program to the cluster, use ssh to login to the cluster. I would type:
`      ssh -p 22122 adams@dahl.calvin.edu `
Once you are logged into the cluster, you can cd to your project directory and use make or mpicc to recompile your program there.

To run your program on the cluster, we do not want your classmates' programs to affect the performance of your program. To keep your programs from conflicting with others, everyone must use a batch queueing system to run programs on the cluster. To run your program, follow these directions for Using Torque on Dahl.

Using these directions, repeat the timing experiments you did in the Unix lab, using 1, 2, 4, 8, 16, 32, 64, 128, and 256 processes. The cluster has 32 nodes, each with 8 cores. To get the best performance, do the 1, 2, 4, and 8 process runs using 1 node; then use 8 processes and double the number of nodes for each subsequent run.

Note that the entire class can work at the same time on the cluster. Since the batch system uses a queue to schedule the submitted jobs, and only runs a job when the resources it requires are available, it will prevent anyone's jobs from interfering with anyone else's jobs. You can even submit multiple jobs at the same time and they will not interfere with one another -- they will just be queued up and scheduled as the resources they require become available.

#### Hand In

Hard copies of:
1. Your line chart and spreadsheet data. Be sure you title your chart, label your axes, and clearly indicate the system each line represents (sufficient to tell them apart when printed in black-and-white or gray scale).
2. A 1-page analysis in which you interpret the data of your chart. Make this analysis as quantitative as you can, referring to specific data points where possible. You should also report any other observations you've made during this exercise.
3. A script file in which you cat your program and show that it compiles without errors or warnings.
4. One of the .o output files generated by Torque on the cluster.
Please staple these pages together and make certain that your name is on each page.

CS > 374 > Exercises > 01 > Homework Project