Many of today's computing systems contain powerful graphics cards that can be used as general-purpose computing devices. There are three programming platforms for using these graphics cards this way:
To get started, download the source file and Makefile from vectorAdd. We will be doing this for multiple programs, so store these in a folder named vectorAdd.
One thing to observe is that simple (one-file) CUDA programs are named using the .cu extension.
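For reference, here is the general shape such a .cu file takes. This is a simplified sketch, not the actual vectorAdd.cu from the Toolkit; the names and sizes are illustrative, and real code should check each CUDA call for errors.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Kernel: each GPU thread adds one pair of elements.
__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int n = 8;
    float hA[n], hB[n], hC[n];
    for (int i = 0; i < n; ++i) { hA[i] = i; hB[i] = 2 * i; }

    // Allocate device arrays and copy the inputs to the GPU.
    float *dA, *dB, *dC;
    cudaMalloc(&dA, n * sizeof(float));
    cudaMalloc(&dB, n * sizeof(float));
    cudaMalloc(&dC, n * sizeof(float));
    cudaMemcpy(dA, hA, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, n * sizeof(float), cudaMemcpyHostToDevice);

    add<<<1, n>>>(dA, dB, dC, n);   // launch 1 block of n threads

    // Copy the result back and print it.
    cudaMemcpy(hC, dC, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) printf("%g ", hC[i]);
    printf("\n");

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

Note the overall pattern: copy inputs to the device, launch the kernel, copy results back. That data movement is exactly what we will be timing below.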
Take a moment to view the Makefile. From it, we can learn that we:
This program is a tweaked version of a sample program that comes with Nvidia's CUDA Toolkit. Aside from cleaning up some error-handling and adding support for a command-line argument, the main tweak was to add a sequential loop that performs the same computation as the CUDA kernel, so that we can compare their performance. Use the provided Makefile to build the program, and verify that it builds and runs without errors before continuing. (The nvcc compiler is located in /usr/local/cuda-7.5/bin/; that should already be in your PATH variable.)
Our first research question is:
For what size problem is the CUDA computation faster than the sequential computation?
Using the omp_get_wtime() function, modify vectorAdd.cu so that it reports:
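One way to collect and report such timings is sketched below. The variable names (h_A, d_A, numElements, and so on) are assumptions based on the typical shape of the Toolkit's vectorAdd sample and may differ from your copy; adapt as needed. You will need to include <omp.h>, and the Makefile must pass an OpenMP flag to the host compiler (e.g. -Xcompiler -fopenmp for nvcc) so that omp_get_wtime() links.

```cuda
#include <omp.h>   // for omp_get_wtime()

// ... inside main(), around the existing computations:

// Time the host-to-device transfers, the kernel, and the
// device-to-host transfer separately so each can be charted.
double start = omp_get_wtime();
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
double xferIn = omp_get_wtime() - start;

start = omp_get_wtime();
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
cudaDeviceSynchronize();   // kernel launches are asynchronous, so wait
double kernelTime = omp_get_wtime() - start;

start = omp_get_wtime();
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
double xferOut = omp_get_wtime() - start;

// Time the equivalent sequential loop for comparison.
start = omp_get_wtime();
for (int i = 0; i < numElements; ++i) {
    h_C[i] = h_A[i] + h_B[i];
}
double seqTime = omp_get_wtime() - start;

printf("CUDA: xfer-in %f, kernel %f, xfer-out %f; sequential: %f\n",
       xferIn, kernelTime, xferOut, seqTime);
```

The cudaDeviceSynchronize() call matters: without it, the timer would stop while the kernel is still running and the kernel time would appear misleadingly small.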
Use the Makefile to build your modified version of the program. When it builds successfully, run it as follows:
./vectorAdd

By default, the program's array size is set to 50,000 elements.
In a spreadsheet, record and label your timings in a 50000 column. Which is faster, the CUDA version or the sequential version?
Perhaps the problem size is the issue. Run it again, but increase the size of the array to 500,000 elements:
./vectorAdd 500000

As before, record your timings. How do these timings compare to those using 50,000 elements?
Run it again, using 5,000,000 elements, and record your timings. How do these times compare to your previous ones?
Run it again, using 50,000,000 elements, and record your timings. How do these times compare?
Run it again, using 500,000,000 elements, and record your timings. What happens this time? (Our GTX 970 cards have 4GB of memory.)
Create a line chart, with one line for the sequential code's times and one line for the CUDA code's total times. Your X-axis should be labeled with 50,000; 500,000; 5,000,000; and 50,000,000; your Y-axis should be the time.
Then create a "stacked" bar chart of the CUDA times, with the same X and Y axes as your first chart. For each X-axis value, this chart should "stack" the CUDA computation's data-transfer times on top of its computation time, so you can see how much each contributes to the total.
What observations can you make about the CUDA vs the sequential computations? How much time does the CUDA computation spend transferring data compared to computing? What is the answer to our research question?
When you have completed Part I, continue on to Part II.
Let's try the same research question, but using a more "expensive" operation. Multiplication is a more expensive operation than addition, so let's try that.
In your vectorAdd directory, use
make clean

to remove the binary. Then use

cd ..
cp -r vectorAdd vectorMult

to create a copy of your vectorAdd folder named vectorMult. Inside it, rename vectorAdd.cu to vectorMult.cu and modify the Makefile to build vectorMult instead of vectorAdd.
Then edit vectorMult.cu and change it so that instead of storing the sum of A[i] and B[i] in C[i], the program stores the product of A[i] times B[i] in C[i]. Note that you will need to change:
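The change applies in both places. Assuming the sample's kernel shape (your parameter names may differ), it looks something like this:

```cuda
// Kernel: store the product instead of the sum.
__global__ void vectorMult(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        C[i] = A[i] * B[i];   // was: A[i] + B[i]
    }
}

// The sequential loop needs the same one-character change:
//     C[i] = A[i] * B[i];
```

Remember to update the kernel's name and its call site consistently, or the program will not compile.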
What is the answer to our research question? How do your results compare to those of Part I?
When you have completed Part II, continue to Part III.
Let's try the same research question, but using an even more "expensive" operation AND reducing the amount of data we're transferring. Square root is a more expensive operation than multiplication, so let's try that.
As in Part II, clean your vectorMult folder and make a copy of it named vectorRoot. Inside it, rename vectorMult.cu to vectorRoot.cu and modify the Makefile to build vectorRoot.
Then edit vectorRoot.cu and change it so that it stores the square root of A[i] in C[i].
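Since the square root reads only A[i], the B array is no longer needed; removing its allocation and its host-to-device copy is what reduces the amount of data transferred. A sketch, with illustrative names:

```cuda
// Kernel: C[i] = sqrt(A[i]). B is gone, so the host-to-device
// traffic is roughly half of what the add/multiply versions moved.
__global__ void vectorRoot(const float *A, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        C[i] = sqrtf(A[i]);   // single-precision square root
    }
}

// The sequential loop (include <math.h> for the host sqrtf):
//     C[i] = sqrtf(A[i]);
```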
Then build vectorRoot and run it using 50,000; 500,000; 5,000,000; and 50,000,000 array elements. As before, record the timings for each of these in your spreadsheet, and create charts to help us visualize the results.
What is the answer to our research question? How do these results compare to those of Parts I and II?
When you have completed Part III, continue to Part IV.
Let's try the same research question one more time. This time, we will use a less expensive operation than square root, but keep the amount of data we're transferring the same.
As in Part III, clean and make a copy of your vectorRoot folder named vectorSquare. Inside it, rename vectorRoot.cu to vectorSquare.cu and modify the Makefile to build vectorSquare.
Then edit vectorSquare.cu and change it so that it stores the square of A[i] in C[i].
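As in Part III, this version uses a single input array, so the data transferred stays the same while the arithmetic becomes cheaper. A sketch, again with illustrative names:

```cuda
// Kernel: C[i] = A[i] squared. Same single input array as Part III,
// so only the cost of the arithmetic changes.
__global__ void vectorSquare(const float *A, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        C[i] = A[i] * A[i];
    }
}

// The sequential loop:
//     C[i] = A[i] * A[i];
```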
Then build vectorSquare and run it using 50,000; 500,000; 5,000,000; and 50,000,000 array elements. As before, record the timings for each of these in your spreadsheet, and create charts to help us visualize the results.
What is the answer to our research question? How do your results compare to those of the previous parts?
When you are finished with Part IV, you may continue to this week's project.