Within your 374 course folder, create a new folder for this lab exercise (e.g., 4).
The first part of this week's lab exercise is to explore different aspects of file input. The ideas we'll explore also apply to file output, but we will focus on input today.
To avoid wasting storage-space, we will all read from the same set of (large) data files, which are stored in the directory: /home/cs/374/exercises/04/. Verify that you have access to this directory by entering:
ls /home/cs/374/exercises/04
Each of the files in this folder contains randomly generated double values; the name of a file indicates the number of doubles in that file.
Note that this directory is located on a file server, which is "across the network" from your workstation. Reading from shared files on a file server is:
A. Sequential Text Input (Default).
Begin by making a new folder named seqTextIn, cd to it, and then download the files seqTextIn.c and Makefile from this folder.
Take a few minutes to look at the program in seqTextIn.c. It is a fairly simple program that reads the contents of a text file containing double values into an array. In order to determine the size of the array, the first line of this text file is an integer N that indicates the number of double values in the file. After opening the file, the program reads this value N, allocates an array of N double elements, and then uses a for loop to read the N values from the file into the array. The program thus assumes the file has a certain structure: N on the first line, followed by N double values.
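The file structure described above (N on the first line, then N double values) can be read along the following lines. This is a minimal sketch, not the code in seqTextIn.c: the function name readTextDoubles() and its error handling are assumptions made for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not the code in seqTextIn.c): reads a text file whose
 * first line is an integer N, followed by N double values.
 * On success, stores N in *nOut and returns a malloc'd array that the caller
 * must free(); on any failure, returns NULL. */
double* readTextDoubles(const char* fileName, long* nOut) {
    FILE* fin = fopen(fileName, "r");
    if (fin == NULL) return NULL;

    long n = 0;
    if (fscanf(fin, "%ld", &n) != 1 || n <= 0) {   /* read N from line 1 */
        fclose(fin);
        return NULL;
    }

    double* a = malloc((size_t)n * sizeof(double)); /* room for N doubles */
    if (a == NULL) {
        fclose(fin);
        return NULL;
    }

    for (long i = 0; i < n; ++i) {                  /* one conversion per value */
        if (fscanf(fin, "%lf", &a[i]) != 1) {
            free(a);
            fclose(fin);
            return NULL;
        }
    }

    fclose(fin);
    *nOut = n;
    return a;
}
```

Note that each value requires its own fscanf() call, which must convert a string of digit characters into a double.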
Identify the statements that open the file, fill the array, and close the file. Surround this group of statements with calls to MPI_Wtime() to time how long it takes the program to perform this group of statements.
Since MPI_Wtime() is an MPI function, you will also need to 'wrap' these calls in calls to MPI_Init() and MPI_Finalize(), and add #include <mpi.h> to the program's include-directives.
Finally, modify the printf() at the end so that it also reports how long the program took to open+read+close the file.
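The timing pattern is: record the time, run the statements being timed, record the time again, and subtract. The sketch below shows that pattern under one loudly stated assumption: so that it compiles without an MPI installation, it uses a hypothetical wallTime() helper built on POSIX clock_gettime() as a stand-in for MPI_Wtime(). In your lab program, call MPI_Wtime() in the same two places, between MPI_Init() and MPI_Finalize(). The helper timeOpenReadClose() is likewise an illustrative name, not part of seqTextIn.c.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* exposes clock_gettime() under -std=c99 */
#endif
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for MPI_Wtime(): wall-clock seconds as a double. */
double wallTime(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1.0e-9;
}

/* Hypothetical helper: times open + read + close for one text file.
 * Returns the elapsed seconds, or -1.0 if the file cannot be opened. */
double timeOpenReadClose(const char* fileName) {
    double start = wallTime();              /* start the clock */

    FILE* fin = fopen(fileName, "r");       /* open */
    if (fin == NULL) return -1.0;
    double value;
    while (fscanf(fin, "%lf", &value) == 1) {
        /* read: values are discarded here; the lab stores them in an array */
    }
    fclose(fin);                            /* close */

    return wallTime() - start;              /* stop the clock */
}
```

Both clock readings sit immediately around the open, read, and close statements, so the measurement excludes argument checking and output.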
Having made these changes, use the make command to build the program. Continue when your program builds without errors or warnings.
To run the program, enter:
./seqTextIn /home/cs/374/exercises/04/1m-doubles.txt
Your program should run, reporting that it read 1,000,000 double values and the time it took to do so. Note that the time reported is quite short, and short timings can be inaccurate. To compensate, we will perform three trials and use the minimum of these three times.
Open a spreadsheet, write Sequential Text Input, Default in the first row. Beneath that, create the column headings N, Trial 1, Trial 2, Trial 3, and Minimum. Record 1000000 beneath N and the time your program reported beneath Trial 1. Run the program 2 more times, record the times beneath Trial 2 and Trial 3; then use a spreadsheet function to compute and record the minimum of the three trials beneath Minimum.
Then repeat this procedure using the other text files: 10m-doubles.txt, 100m-doubles.txt, and 1b-doubles.txt.
You should now have a record of the trial and minimum times to read text files containing one million, ten million, one hundred million, and one billion double values.
To view a long-listing of these files sorted by size, enter the command
ls -lS /home/cs/374/exercises/04
The fifth column (to the left of the month) that ls displays is the size of each file in bytes. Add a new column File Size to your spreadsheet and beneath it, record each file's size information.
Enter the command
less /home/cs/374/exercises/04/1m-doubles.txt
The less program lets you use the spacebar to scroll forward in the file, the b key to scroll back, and the q key to quit.
Discuss with your neighbor: Are you able to read the double values this file contains?
When you have determined the answer to that question, type q to quit the less program and continue.
B. Sequential Text Input (Optimized).
If you examine your Makefile, you will see that it contains the line:
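CFLAGS = -Wall -ansi -pedantic -std=c99
This line defines the compilation flags being used to build the program. Because it contains no optimization switch, the compiler uses its default optimization level. Edit this line to add the -O2 switch:
CFLAGS = -O2 -Wall -ansi -pedantic -std=c99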
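The main GNU optimization levels are -O0 (no optimization, the default), -O1, -O2, and -O3, with each successive level enabling more optimizations than the one before.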
If you enter
make
it will tell you that your program is up to date, because we have not changed it since you last built it.
Enter:
touch seqTextIn.c
This will update the 'last modified' date on your source file,
so make will think it is newer than your program file
and rebuild the latter.
Then re-enter
make
and your program should rebuild.
Inspect the compilation command performed by make
to verify that the -O2 switch is being used.
If you find that adding -O2 causes the compiler to generate a warning about not using the return value of fread(), you can disable that warning by editing the Makefile again and adding -Wno-unused-result:
CFLAGS = -O2 -Wall -ansi -pedantic -std=c99 -Wno-unused-result
In your spreadsheet, add a new row Sequential Text Input, -O2 Optimized. Below this, copy-and-paste the same N, Trial 1, Trial 2, Trial 3, Minimum, and File Size column headings you used in Section A. To provide the values below each column-heading, repeat the activities from Section A. For each of the four input files (1m-doubles.txt, 10m-doubles.txt, 100m-doubles.txt, 1b-doubles.txt) run the newly-optimized version of your program three times, record these three trial-times in your spreadsheet, and use it to compute the minimum value for each file-size.
Discuss with your neighbor: How do these times compare with your times from Section A?
Optional: Feel free to experiment with the -O1 and -O3 optimization levels to see how they compare with the default and -O2 times.
C. Sequential Binary Input.
Next, let's see how using a binary input file affects performance.
Use cd .. to change to the parent directory, make a new directory there named seqBinIn, cd to that directory, and then download the files seqBinIn.c and Makefile from this folder to that directory.
Open up seqBinIn.c and take a few minutes to compare its contents to those of seqTextIn.c. Identify the statements that open the file, read values from the file into the array, and close the file, and take special note of how they differ from the corresponding statements in the previous program.
One key difference in this program is that it uses a function getFileSize() that uses POSIX system calls to determine the size of the file in bytes. The program then uses that information to compute N, the number of doubles in the file, and then uses N to allocate the array. This approach works because unlike a double in a text file, each double in a binary file occupies exactly the same number of bytes as a double in main memory.
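The size-to-count computation described above can be sketched as follows. This is an illustration under assumptions, not the getFileSize() in seqBinIn.c: it assumes the POSIX stat() call is used, and the names fileSizeInBytes() and countDoubles() are hypothetical.

```c
#include <sys/stat.h>

/* Hypothetical getFileSize()-style helper: returns the size of the named
 * file in bytes, or -1 if stat() fails (e.g., the file does not exist). */
long long fileSizeInBytes(const char* fileName) {
    struct stat info;
    if (stat(fileName, &info) != 0) return -1;
    return (long long)info.st_size;
}

/* Number of doubles stored in a binary file of doubles, or -1 on error.
 * Works because every double in the file occupies sizeof(double) bytes. */
long long countDoubles(const char* fileName) {
    long long bytes = fileSizeInBytes(fileName);
    if (bytes < 0) return -1;
    return bytes / (long long)sizeof(double);
}
```

The same division would not work for a text file, where the number of characters used to write a double varies from value to value.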
The other key difference is that this program reads all of the values from the file into the array via a single read. That one read should use the computer's Direct Memory Access (DMA) hardware to transfer all of the values from the file into the array. This ability to fill the array with a single read (plus the use of the binary format) will have a profound effect on the program's performance.
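A single-read fill of the array might look like the sketch below. It assumes the C standard library's fread() is the read call; seqBinIn.c may use a different one, and the name readBinaryDoubles() is hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not the code in seqBinIn.c): reads n doubles from a
 * binary file with a single fread() call. Returns a malloc'd array that the
 * caller must free(), or NULL on failure (including a file that holds fewer
 * than n doubles). */
double* readBinaryDoubles(const char* fileName, size_t n) {
    FILE* fin = fopen(fileName, "rb");
    if (fin == NULL) return NULL;

    double* a = malloc(n * sizeof(double));
    if (a == NULL) {
        fclose(fin);
        return NULL;
    }

    /* One call, no loop, and no text-to-double conversion. */
    size_t numRead = fread(a, sizeof(double), n, fin);
    fclose(fin);

    if (numRead != n) {
        free(a);
        return NULL;
    }
    return a;
}
```

Compare this with the text version, which needs a loop that performs one formatted conversion per value.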
To see this, add calls to MPI_Wtime() to time how long it takes to open+read+close the file, and modify the printf() function to report this time.
Then use the make command to build the program. Continue when your program builds without errors or warnings.
Discuss with your neighbor: What optimization level is the Makefile using to build the program?
To run this program, enter:
./seqBinIn /home/cs/374/exercises/04/1m-doubles.bin
Note that the file's .bin extension indicates that this is a binary-format file. Similar to seqTextIn, this program will read 1,000,000 doubles from a file, but the numbers are stored in binary format in this file.
Note that since a program can compute N, the number of items in a binary-format file, and use that value to allocate an array of the necessary size, there is no need to store N at the beginning of the file, the way we did with our text files.
In your spreadsheet, add a new row Sequential Binary Input. Beneath it, add column headings for N, Trial 1, Trial 2, Trial 3, Minimum, and File Size. Copy-paste the N values from a Sequential Text Input area of your spreadsheet and enter the time your program reported for N=1,000,000 under Trial 1. Then run the program twice more, record those times, and compute the minimum of the three trials.
Repeat this process using the other binary files:
Discuss with your neighbor: How do these times compare to your text input times?
Discuss with your neighbor: Which is more time-efficient, text or binary file input?
Re-enter the command:
ls -lS /home/cs/374/exercises/04
Back in your spreadsheet, enter the sizes for each of the binary files your program read under the File Size column heading.
Discuss with your neighbor: How do these binary file sizes compare to their text file counterparts?
Discuss with your neighbor: Which are more space-efficient, text or binary files?
Lastly, enter:
less /home/cs/374/exercises/04/1m-doubles.bin
Discuss with your neighbor: Are you able to read the values this file contains?
Discuss with your neighbor: Which are more human-friendly, text or binary files?
When you have determined the answers to those questions, type q to quit the less program and continue.
When you have finished all of the preceding steps, congratulations, you are ready for part 2 of this exercise!
Each CS lab workstation has a local solid state device (SSD) and any files stored in the directory /scratch are stored on this SSD. If you wish to compare the access times for files stored on the network file server vs. files stored on a local SSD, feel free to make a 374 folder in /scratch, copy the files from /home/cs/374/exercises/04 to /scratch/374/; then rerun the programs from this exercise using those local files as the input files to your programs (e.g., /scratch/374/1m-doubles.txt).
Note that since /scratch is local to each workstation, you will lose the convenience of being able to work from any workstation in the lab -- you will only be able to access those files from that workstation.