Most of today's computing systems contain powerful graphics processing units (GPUs) that, in addition to graphical uses, can be used as general purpose computing devices. There are several different programming platforms for using these GPUs for general purpose computing, including:
A typical CUDA computation is a program that:
To see this in action, make a new folder for this exercise named cuda and then create a subfolder within it named hello; then download the source file and Makefile from helloCUDA into your cuda/hello folder. Note that CUDA source files end in the .cu extension.
Take a moment to view the Makefile. From it, we can learn a few things, including:
Using any text editor, open the file hello.cu, locate the kernel function and the main function, and look over each to get an idea of what the program does.
Discuss with your neighbor: What is the name of the kernel function?
CUDA devices organize their threads into groups called blocks, and organize their blocks into a grid, which can have 1, 2, or 3 dimensions. For today, we will keep things simple and just use 1-dimensional grids.
When a CUDA program launches a kernel on a GPU, it must pass two arguments to the GPU:
The technical name for these values is the kernel's execution configuration. For 1-dimensional grids, these values can be integers; for 2- or 3-dimensional grids, they must be variables of a special CUDA type called dim3.
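For example, a 2-dimensional configuration might be set up as in the sketch below. This is just an illustration of the dim3 type; the kernel name myKernel is hypothetical and not part of today's program:

dim3 numBlocks(8, 8);          // an 8x8 (2-dimensional) grid of blocks
dim3 threadsPerBlock(16, 16);  // each block contains 16x16 threads
myKernel<<<numBlocks, threadsPerBlock>>>();   // hypothetical 2-D kernel launch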
The main() function calls the kernel function with the following unusual syntax:
sayHello<<<numBlocks, threadsPerBlock>>>();

As this illustrates, we call a kernel function by giving its name, passing the execution configuration values surrounded by triple less-than and greater-than symbols, and finally a pair of parentheses (the function-call operator) containing any arguments the function requires. This kernel function has no parameters; if it had any, we would need to pass arguments to those parameters within the parentheses.
The rest of the main() function just checks that the kernel launched successfully, and then resets the GPU for subsequent use.
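Those two steps are typically done with CUDA runtime calls along the following lines, placed in main() after the kernel launch. This is a sketch of the general pattern, so the exact code in hello.cu may differ in its details:

cudaError_t err = cudaGetLastError();    // did the kernel launch succeed?
if (err != cudaSuccess) {
    fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}
cudaDeviceReset();                       // reset the GPU for subsequent use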
Building. Using the provided Makefile, try to build the program. If all is well, you should see the program build without any errors or warnings.
Running. To run the program with default values, just enter:
./hello

The program should run, launching the kernel function on the GPU using the default execution configuration: a 1-dimensional grid of 3 blocks, each with 4 threads.
You can vary the execution configuration by entering numBlocks and threadsPerBlock values on the command line:

./hello 2 5

Experiment with this a few times, comparing the kernel's code to the output it generates for a given execution configuration.
When CUDA launches a kernel, it uses the execution configuration to define several kernel global variables, including:
unsigned long id = blockDim.x * blockIdx.x + threadIdx.x;

The rest of the kernel function just uses this id value (and the blockIdx.x and threadIdx.x values) to display the thread's greeting.
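Putting these pieces together, the kernel probably looks roughly like the sketch below; the exact greeting text in hello.cu may differ:

__global__ void sayHello() {
    unsigned long id = blockDim.x * blockIdx.x + threadIdx.x;   // this thread's global id
    printf("Hello from thread %lu (thread %u of block %u)!\n",
           id, threadIdx.x, blockIdx.x);
}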
With your neighbor: Is there a maximum number of threads that one block can have?
With your neighbor: Is there a maximum number of blocks a CUDA program can have?
In the rest of this exercise, we will try to answer this research question:
When is a CUDA computation faster than the equivalent sequential computation?
Or put differently, when is using CUDA worthwhile?
The answer to this question depends on at least two factors, including:
We'll start very simply: given two vectors A and B, we'll add them together and put the result in a third vector C.
C = A + B;
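In CUDA, this is typically written as a kernel in which each thread computes one element of C, launched with enough blocks to cover the whole array. The sketch below shows the general pattern; the provided vectorAdd.cu may differ in its details, and the names d_A, d_B, d_C, and numElements are just illustrative:

__global__ void vectorAdd(const float *A, const float *B, float *C, int numElements) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;   // this thread's element
    if (i < numElements) {                           // guard against extra threads
        C[i] = A[i] + B[i];
    }
}

// Launch with enough 256-thread blocks to cover all numElements elements:
int threadsPerBlock = 256;
int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);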
To get started, within your cuda folder, create a new subfolder named vectorAdd and cd to that folder. Then download the source file and Makefile from vectorAdd into that folder.
The program in vectorAdd.cu is a tweaked version of a sample program that comes with Nvidia's CUDA Toolkit. Aside from cleaning up the error-handling and letting the user use a command-line argument to override the default array-size, the main change was to add a sequential loop that performs the same computation as the CUDA kernel, so that we can compare CUDA's performance against the equivalent sequential performance.
Use the provided Makefile to build the program, and verify that it builds and runs without errors before continuing.
Using the omp_get_wtime() function, modify vectorAdd.cu so that it calculates:
Readability suggestion: to make your timing results easy to compare, use tab characters (i.e., \t) in the format-string of your printf(), so that your timing results align with one another.
We do not want I/O to affect our timing results, so comment out the printf() statements being used to trace the execution through each of these sections. Don't forget to #include <omp.h> in order for omp_get_wtime() to be declared!
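Here is a minimal sketch of the timing pattern, assuming you want to time (for example) the sequential loop; the label in the printf() is just a placeholder:

#include <omp.h>    // declares omp_get_wtime()

double startTime = omp_get_wtime();
// ... the section being timed (e.g., the sequential loop) ...
double stopTime = omp_get_wtime();
printf("Sequential time:\t%lf secs\n", stopTime - startTime);

Repeating this pattern around each section of interest lets you compare the sequential computation's time against the CUDA computation's time, including its data-transfer steps.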
Save your changes and use the Makefile to build your modified version of the program. When it builds successfully, run it as follows:
./vectorAdd

By default, the program uses an array size of 20,000 elements.
Which is faster, the CUDA version or the sequential version? Are you seeing any speedup for the CUDA version?
Perhaps the problem size is the issue. Run it again, but increase the size of the array to 200,000 elements:
./vectorAdd 200000
Then run it again, but increase the size of the array to 2,000,000 elements:
./vectorAdd 2000000

How do these timings compare to those using 20,000 elements?
Run it again, using 20,000,000 elements. How do these times compare to your previous ones?
Run it again, using 200,000,000 elements. How do these times compare?
Run it again, using 2,000,000,000 elements. What happens this time?
In light of that, run it again using 1,000,000,000 elements. Does that work?
Time Trials. Now that we have identified a limitation on how many values the GPU's memory can store, let's conduct some time trials. Using a spreadsheet and your program, record the timings for 100,000; 1,000,000; 10,000,000; 100,000,000; and 1,000,000,000 array elements.
Visualization. Create a line chart, with a solid line for the sequential code's times and a dotted line for the CUDA code's total times. Your chart's X-axis should be labeled with 100,000; 1,000,000; 10,000,000; 100,000,000; and 1,000,000,000 array elements. Its Y-axis should be the time.
Then create a second chart, but make this one a "stacked" column chart of the CUDA times with the same X and Y axes as your first chart. For each X-axis value, this chart should "stack" the CUDA computation's
Discuss with your neighbor:
Let's revisit the same research question, but using a more "expensive" operation. Multiplication is a more time-consuming operation than addition, so let's try that.
In your vectorAdd directory, use

make clean

to remove the binary. Then use

cd ..
cp -r vectorAdd vectorMult

to create a copy of your vectorAdd folder named vectorMult. Inside that folder, rename vectorAdd.cu as vectorMult.cu and modify the Makefile to build vectorMult instead of vectorAdd.
Then edit vectorMult.cu and change it so that instead of storing the sum of A[i] and B[i] in C[i], the program stores the product of A[i] times B[i] in C[i]. Note that you will need to change:
Discuss with your neighbor: How do your results compare to those of Part A? Are they similar or different, and why?
Are you able to answer our research question?
Let's try again, but using an even more "expensive" operation AND reducing the amount of data we're transferring. Square root is a more expensive operation than multiplication, so let's try that.
As in Part B, clean and make a copy of your vectorMult folder named vectorRoot. Inside it, rename vectorMult.cu as vectorRoot.cu and modify the Makefile to build vectorRoot.
Then edit vectorRoot.cu and change it so that it computes the square root of A[i] in C[i].
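Assuming the arrays are of type float (as in the original sample), the key change in the kernel is roughly:

C[i] = sqrtf(A[i]);   // sqrtf(): single-precision square root, usable in device code

The sequential loop changes in the same way. Since B is no longer an input, only one array needs to be copied from the host to the device, which is what reduces the amount of data transferred.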
Then build vectorRoot and run it using 100,000; 1,000,000; 10,000,000; 100,000,000; and 1,000,000,000 array elements. As before, record the timings for each of these in your spreadsheet, and create charts to help visualize the results.
Discuss with your neighbor: How do these results compare to those of Parts A and B?
Has your ability to answer our research question changed or stayed the same?
Let's keep going. You should have seen some benefit from using CUDA in Part C, but it could have been either because (i) square root is an expensive operation, or (ii) we only transferred one array (instead of two) from the host to the device.
To try to see which of these two made the difference, let's use a less expensive operation than square root, but keep the amount of data we're transferring the same.
As in Part C, clean and make a copy of your vectorRoot folder named vectorSquare. Inside it, rename vectorRoot.cu as vectorSquare.cu and modify the Makefile to build vectorSquare.
Then edit vectorSquare.cu and change it so that it computes the square of A[i] in C[i].
Then build vectorSquare and run it using 100,000; 1,000,000; 10,000,000; 100,000,000; and 1,000,000,000 array elements. As before, record the timings for each of these in your spreadsheet, and create charts to help visualize the results.
Discuss with your neighbor: How do these results compare to those of the previous parts?
Has your ability to answer our research question changed or stayed the same?
Finally, let's transfer 3 vectors (as you did in Parts A and B) but use a more expensive computation than we have so far: calculating C[i] as the hypotenuse of a right triangle whose leg lengths are in A[i] and B[i].
Your final task is to write vectorHypot.cu that computes C[i] = sqrt( A[i]*A[i] + B[i]*B[i] ) for all the elements in arrays A, B, and C. Your program should compute this sequentially and using CUDA, time both computations, and verify the correctness of the computations, as we did in Parts A-D.
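One possible form of the kernel is sketched below (assuming float arrays, as in the earlier programs); the surrounding allocation, data-transfer, timing, and verification code should follow the same pattern you used in Parts A-D:

__global__ void vectorHypot(const float *A, const float *B, float *C, int numElements) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements) {
        C[i] = sqrtf(A[i]*A[i] + B[i]*B[i]);   // hypotenuse from the two leg lengths
    }
}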
As before, create a line chart that compares your sequential and CUDA computation times for arrays of 100,000; 1,000,000; 10,000,000; 100,000,000; and 1,000,000,000 elements, and a stacked column chart showing the times spent in the different portions of the CUDA computation.
Discuss with your neighbor: How do these results compare to those of the previous parts?
What is your final answer to our research question?
The comparisons we have performed in this exercise are highly dependent on two clock speeds of the computer on which you run these programs:
A core's clock speed controls the rate at which it performs instructions: cores with higher clock speeds perform more instructions per second than those with lower clock speeds.
In this exercise, we have run programs of varying difficulty, in terms of the 'time expense' of the required operations and data transfers, to gauge when CUDA provides a performance advantage.
If you run the programs of this exercise on a computer where the CPU's clock speed is much higher than that of the GPU, then the sequential computation's instructions will be performed much more quickly than the CUDA computation's instructions. That means a harder problem will be required in order for the many-but-slower cores of the GPU to surpass the CPU's single-but-faster sequential performance.
But if there is little difference between the CPU and GPU clock speeds, then the rate at which the CPU and GPU perform instructions will be more similar. The closer these two clock speeds are to one another, the easier it will be for the GPU's parallel cores to outperform the sequential computation on the CPU, so CUDA will provide a performance advantage on easier problems.
To bring this closer to home: If you perform this exercise in Calvin's Gold Lab and your friend performs it in the Systems Lab, you will get very different results, because the CPU and GPU clock speeds in the two labs' computers are different!
To find the clock speed of your Linux-system CPU, you can enter either:
cat /proc/cpuinfo | grep MHz

or

lscpu | grep Hz

The /proc/cpuinfo approach has the advantage of indicating the number of cores your CPU claims to have. The lscpu command indicates your CPU's base clock speed, its maximum (e.g., turbo-boosted) speed, and its minimum (e.g., power-saving) speed.
To find the clock speed of your Nvidia GPU, you can enter:
nvidia-smi base-clocks
To learn more about CUDA, see your textbook's chapter on CUDA and GPUs, or check out Nvidia's Introduction to CUDA, or their CUDA C++ Programming Guide.
Congratulations--you have reached the end of the final CS 374 exercise!
I hope you have found learning about HPC interesting, enlightening, and fun!!