HPC Computing with CUDA Exercise: Hands-On Lab


Introduction

Most of today's computing systems contain powerful graphics processing units (GPUs) that, in addition to graphical uses, can be used as general purpose computing devices. There are several different programming platforms for using these GPUs for general purpose computing, including:

  1. CUDA is Nvidia's proprietary but free platform that works (only) with Nvidia devices. It is the oldest and most well-established framework, and it is the best documented.
  2. OpenCL is a free, non-proprietary standard that can use any OpenCL compatible cores (CPU or GPU) in a system. Originally devised by Apple, it is now supported by a consortium that includes Apple, AMD/Radeon, Nvidia, Intel, ...
  3. OpenACC is a proprietary, but free standard that provides a higher level #pragma-based interface to the GPU, similar to OpenMP for multithreading. It is supported by Cray, as well as the Portland Group International (PGI) software company, which Nvidia bought in 2013.
  4. OpenMP was originally a multithreading library but is being extended with additional #pragma directives for offloading work to the GPU.
This week, we will explore how to use CUDA. The workstations in Calvin's Gold Lab have Nvidia RTX 3060 graphics cards, each of which has 3584 "Ampere" CUDA cores and 12 GB of VRAM to throw at a problem. (The Systems Lab workstations have Nvidia GTX 1050 Ti graphics cards, each with 768 "Pascal" CUDA cores and just 4 GB of VRAM, so I highly recommend you use a Gold Lab workstation for this exercise.)

Part 1. Getting Started

A typical CUDA computation is a program that:

  1. allocates memory on the host (CPU) and on the device (GPU),
  2. copies its input data from the host to the device,
  3. launches one or more functions on the GPU to process that data, and
  4. copies the results from the device back to the host.

A function that runs on the GPU is called a kernel, and is marked by having __global__ before its return-type. (Note: This kind of 'kernel' is unrelated to an operating system 'kernel'.)

To see this in action, make a new folder for this exercise named cuda and then create a subfolder within it named hello; then download the source file and Makefile from helloCUDA into your cuda/hello folder. Note that CUDA source files end in the .cu extension.

Take a moment to view the Makefile. From it, we can learn a few things, including:

If you have made any changes, save them before proceeding.

Using any text editor, open the file hello.cu, locate the kernel function and the main function, and look over each to get an idea of what the program does.

Discuss with your neighbor: What is the name of the kernel function?

CUDA devices organize their threads into groups called blocks, and organize their blocks into a grid, which can have 1, 2, or 3 dimensions. For today, we will keep things simple and just use 1-dimensional grids.

When a CUDA program launches a kernel on a GPU, it must pass two arguments to the GPU:

  1. The number of thread-blocks, and
  2. The number of threads per block.
If you study the main() function, you'll see that it defines variables for these values. These variables are initialized with default values (3 and 4, respectively), but the main function allows the user to reset their values via the commandline.

The technical name for these values is the kernel's execution configuration. For 1-dimensional grids, these values can be integers; for 2- or 3-dimensional grids, they must be variables of a special CUDA type called dim3.

The main() function calls the kernel function with the following unusual syntax:

   sayHello<<<numBlocks, threadsPerBlock>>>();
As this illustrates, we call a kernel function by giving its name, then the execution configuration values enclosed in triple angle brackets (<<< and >>>), and finally a pair of parentheses (the function-call operator) containing any arguments the function requires. This kernel function has no parameters; if it had any, we would pass the corresponding arguments within the parentheses.
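For instance, here is a minimal sketch (not code from hello.cu) contrasting a 1-D launch with a hypothetical 2-D launch; someKernel2D is an invented name used only for illustration:

   sayHello<<<3, 4>>>();              // 1-D grid: 3 blocks of 4 threads; plain integers suffice

   dim3 grid(8, 8);                   // an 8 x 8 grid of blocks...
   dim3 block(16, 16);                // ...each block containing 16 x 16 threads
   someKernel2D<<<grid, block>>>();   // 2- and 3-D launches use the dim3 type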

The rest of the main() function just checks that the kernel launched successfully, and then resets the GPU for subsequent use.
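A common pattern for these last two steps looks something like the sketch below; hello.cu's exact code may differ:

   cudaError_t err = cudaGetLastError();    // did the kernel launch fail?
   if (err != cudaSuccess) {
      fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(err));
      return 1;
   }
   cudaDeviceSynchronize();                 // wait for the GPU to finish its work
   cudaDeviceReset();                       // reset the GPU for subsequent use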

Building. Using the provided Makefile, try to build the program. If all is well, you should see the program build without any errors or warnings.

Running. To run the program with default values, just enter:

   ./hello
The program should run, launching the kernel function on the GPU using the default execution configuration: a 1-dimensional grid of 3 blocks, each with 4 threads.

You can vary the execution configuration by entering numBlocks and threadsPerBlock values on the commandline:

   ./hello 2 5
Experiment with this a few times, comparing the kernel's code to the output it generates for a given set of GPU parameters.

When CUDA launches a kernel, it uses the execution configuration to define several kernel global variables, including:

  1. gridDim, the dimensions of the grid (the number of blocks in each dimension);
  2. blockDim, the dimensions of each block (the number of threads per block in each dimension);
  3. blockIdx, the index of a thread's block within the grid; and
  4. threadIdx, the index of a thread within its block.

The hello.cu kernel function uses these values to compute each thread's global id, using the formula:
   unsigned long id = blockDim.x * blockIdx.x + threadIdx.x;
The rest of the kernel function just uses this id value (and the blockIdx.x and threadIdx.x values) to display its greeting from the thread.
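Putting these pieces together, the kernel looks something like the following sketch (the exact wording of the greeting in your copy of hello.cu may differ):

   __global__ void sayHello() {
      unsigned long id = blockDim.x * blockIdx.x + threadIdx.x;   // this thread's global id
      printf("Hello from thread %u of block %u (global thread %lu)!\n",
             threadIdx.x, blockIdx.x, id);                        // device-side printf
   }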

With your neighbor: Is there a maximum number of threads that one block can have?

With your neighbor: Is there a maximum number of blocks a CUDA program can have?

Part 2. Delving Deeper

In the rest of this exercise, we will try to answer this research question:

When is a CUDA computation faster than the equivalent sequential computation?

Or put differently, when is using CUDA worthwhile?

The answer to this question depends on at least two factors, including:

  1. the computational "expense" of the operation being performed on each data element, and
  2. the amount of data that must be transferred between the host and the device.

A. Vector Addition

We'll start very simply: given two vectors A and B, we'll add them together and put the result in a third vector C.

   C = A + B;

To get started, within your cuda folder, create a new subfolder named vectorAdd and cd into that folder. Then download the source file and Makefile from vectorAdd to that subfolder.

The program in vectorAdd.cu is a tweaked version of a sample program that comes with Nvidia's CUDA Toolkit. Aside from cleaning up the error-handling and allowing a command-line argument to override the default array size, the main change was to add a sequential loop that performs the same computation as the CUDA kernel, so that we can compare CUDA's performance against the equivalent sequential performance.
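For orientation, the heart of such a program typically looks something like the sketch below; the names (vectorAdd, numElements, and so on) follow Nvidia's sample, but the file you downloaded may differ in its details:

   // CUDA kernel: each thread computes one element of C
   __global__ void vectorAdd(const float *A, const float *B, float *C, int numElements) {
      int i = blockDim.x * blockIdx.x + threadIdx.x;
      if (i < numElements) {            // the grid may contain more threads than elements
         C[i] = A[i] + B[i];
      }
   }

   // the equivalent sequential loop, for comparison
   for (int i = 0; i < numElements; ++i) {
      C[i] = A[i] + B[i];
   }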

Use the provided Makefile to build the program, and verify that it builds and runs without errors before continuing.

Using the omp_get_wtime() function, modify vectorAdd.cu so that it calculates:

  1. The times required by the CUDA computation, specifically:
    1. The time spent copying the A and B arrays from the host to the device.
    2. The time spent computing the sum of the A and B arrays into C.
    3. The time spent copying the C array from the device to the host.
    4. The program's total time.
  2. The time required by the sequential computation.
At an appropriate place near the end of the program, add a printf() to display these times.

Readability suggestion: to make your timing results easy to compare, use tab characters (i.e., \t) in the format-string of your printf(), so that your timing results align with one another.

We do not want I/O to affect our timing results, so comment out the printf() statements being used to trace the execution through each of these sections. Don't forget to #include <omp.h> in order for omp_get_wtime() to be declared!
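Here is a minimal sketch of one way to structure the timing; the variable names (h_A, d_A, size, threadsPerBlock, blocksPerGrid, and the time variables) follow Nvidia's sample and are placeholders, so adapt them to the names actually used in your copy of vectorAdd.cu:

   #include <omp.h>                    // for omp_get_wtime()

   // ... inside main(), around the existing CUDA calls:
   double start = omp_get_wtime();
   cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);   // copy A to the device
   cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);   // copy B to the device
   double hostToDevTime = omp_get_wtime() - start;

   start = omp_get_wtime();
   vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
   cudaDeviceSynchronize();            // make sure the kernel has finished before stopping the clock
   double kernelTime = omp_get_wtime() - start;

   start = omp_get_wtime();
   cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);   // copy C back to the host
   double devToHostTime = omp_get_wtime() - start;

Time the sequential loop and the program's total time in the same way, and print all of the values in a single printf() near the end. If the linker complains that omp_get_wtime is undefined, you may need to pass an OpenMP flag to nvcc (e.g., -Xcompiler -fopenmp) in the Makefile.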

Save your changes and use the Makefile to build your modified version of the program. When it builds successfully, run it as follows:

   ./vectorAdd 
By default, the program uses an array size of 20,000 elements.

Which is faster, the CUDA version or the sequential version? Are you seeing any speedup for the CUDA version?

Perhaps the problem size is the issue. Run it again, but increase the size of the array to 200,000 elements:

   ./vectorAdd 200000

Then run it again, but increase the size of the array to 2,000,000 elements:

   ./vectorAdd 2000000
How do these timings compare to those using 20,000 elements?

Run it again, using 20,000,000 elements. How do these times compare to your previous ones?

Run it again, using 200,000,000 elements. How do these times compare?

Run it again, using 2,000,000,000 elements. What happens this time?

In light of that, run it again using 1,000,000,000 elements. Does that work?

Time Trials. Now that we have identified a limitation on how many values the GPU's memory can store, let's conduct some time trials. Using a spreadsheet and your program, record the timings for 100,000; 1,000,000; 10,000,000; 100,000,000; and 1,000,000,000 array elements.

Visualization. Create a line chart, with a solid line for the sequential code's times and a dotted line for the CUDA code's total times. Your chart's X-axis should be labeled with 100,000; 1,000,000; 10,000,000; 100,000,000; and 1,000,000,000 array elements; its Y-axis should be the time.

Then create a second chart, but make this one a "stacked" column chart of the CUDA times with the same X and Y axes as your first chart. For each X-axis value, this chart should "stack" the CUDA computation's

  1. host-to-device transfer time
  2. computation time
  3. device-to-host transfer time

Discuss with your neighbor:

B. Vector Multiplication

Let's revisit the same research question, but using a more "expensive" operation. Multiplication is a more time-consuming operation than addition, so let's try that.

In your vectorAdd directory, use

   make clean
to remove the binary. Then use
   cd .. 
   cp -r vectorAdd vectorMult
to create a copy of your vectorAdd folder named vectorMult. Inside that folder, rename vectorAdd.cu as vectorMult.cu and modify the Makefile to build vectorMult instead of vectorAdd.

Then edit vectorMult.cu and change it so that instead of storing the sum of A[i] and B[i] in C[i], the program stores the product of A[i] times B[i] in C[i]. Note that you will need to change at least:

  1. the CUDA kernel,
  2. the equivalent sequential loop, and
  3. any code that verifies the results of the computation.
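For example, the key line in both the kernel and the sequential loop becomes something like this (the array names may differ in your file):

   C[i] = A[i] * B[i];    // store the product instead of the sum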

Then build vectorMult and run it using 100,000; 1,000,000; 10,000,000; 100,000,000; and 1,000,000,000 array elements. As in Part A, record the timings for each of these in your spreadsheet, and recreate the same two charts as before to help visualize and compare the results.

Discuss with your neighbor: How do your results compare to those of Part A -- are they similar or different -- and why?

Are you able to answer our research question?

C. Vector Square Root

Let's try again, but using an even more "expensive" operation AND reducing the amount of data we're transferring. Square root is a more expensive operation than multiplication, so let's try that.

As in Part B, clean and make a copy of your vectorMult folder named vectorRoot. Inside it, rename vectorMult.cu as vectorRoot.cu and modify the Makefile to build vectorRoot.

Then edit vectorRoot.cu and change it so that it computes the square root of A[i] in C[i].
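Assuming the float arrays of the original sample, the key line becomes something like the sketch below; sqrtf() is the single-precision square-root function available in CUDA's math library (and, for the sequential loop, in <math.h>):

   C[i] = sqrtf( A[i] );    // square root of A[i]; B is no longer used

Since B is no longer used, you can also remove (or comment out) its allocation and its host-to-device copy; that is what reduces the amount of data being transferred.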

Then build vectorRoot and run it using 100,000; 1,000,000; 10,000,000; 100,000,000; and 1,000,000,000 array elements. As before, record the timings for each of these in your spreadsheet, and create charts to help visualize the results.

Discuss with your neighbor: How do these results compare to those of Parts A and B?

Has your ability to answer our research question changed or stayed the same?

D. Vector Square

Let's keep going. You should have seen some benefit from using CUDA in Part C, but it could have been either because (i) square root is an expensive operation, or (ii) we only transferred one array (instead of two) from the host to the device.

To try to see which of these two made the difference, let's use a less expensive operation than square root, but keep the amount of data we're transferring the same.

As in Part C, clean and make a copy of your vectorRoot folder named vectorSquare. Inside it, rename vectorRoot.cu as vectorSquare.cu and modify the Makefile to build vectorSquare.

Then edit vectorSquare.cu and change it so that it computes the square of A[i] in C[i].
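As before, the key line becomes something like this (names may differ):

   C[i] = A[i] * A[i];    // the square of A[i]; as in Part C, only A is transferred to the device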

Then build vectorSquare and run it using 100,000; 1,000,000; 10,000,000; 100,000,000; and 1,000,000,000 array elements. As before, record the timings for each of these in your spreadsheet, and create charts to help visualize the results.

Discuss with your neighbor: How do these results compare to those of the previous parts?

Has your ability to answer our research question changed or stayed the same?

E. Vector Hypotenuse

Finally, let's transfer 3 vectors (as you did in Parts A and B) but use a more expensive computation than we have so far: calculating C[i] as the hypotenuse of a right triangle whose leg lengths are in A[i] and B[i].

Your final task is to write vectorHypot.cu that computes C[i] = sqrt( A[i]*A[i] + B[i]*B[i] ) for all the elements in arrays A, B, and C. Your program should compute this sequentially and using CUDA, time both computations, and verify the correctness of the computations, as we did in Parts A-D.
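One possible kernel for this computation is sketched below; the names are placeholders, and your sequential loop should perform the same calculation on the host:

   __global__ void vectorHypot(const float *A, const float *B, float *C, int numElements) {
      int i = blockDim.x * blockIdx.x + threadIdx.x;
      if (i < numElements) {
         C[i] = sqrtf( A[i]*A[i] + B[i]*B[i] );   // hypotenuse of the triangle with legs A[i] and B[i]
      }
   }

(If you prefer, CUDA's math library also provides hypotf(), which computes the same value.)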

As before, create a line chart that compares your sequential and CUDA computation times for arrays of 100,000; 1,000,000; 10,000,000; 100,000,000; and 1,000,000,000 elements, and a stacked column chart showing the times spent in the different portions of the CUDA computation.

Discuss with your neighbor: How do these results compare to those of the previous parts?

What is your final answer to our research question?

Final Words

The comparisons we have performed in this exercise are highly dependent on two clock speeds of the computer on which you run these programs:

  1. the clock speed of its CPU, and
  2. the clock speed of its GPU.

A core's clock speed controls the rate at which it performs its instructions -- cores with higher clock speeds perform more instructions per second than those with lower clock speeds.

Cores running at higher clock speeds also generate more heat, and GPUs have many, many more cores than CPUs, so GPU clock speeds are generally much lower than CPU clock speeds to keep from generating too much heat.

In this exercise, we have run programs of varying difficulties--in terms of the 'time expense' of the required operations and data-transfers--to gauge when CUDA provides a performance advantage.

If you run the programs of this exercise on a computer where the CPU's clock speed is much higher than that of the GPU, then the sequential computation's instructions will be performed much more quickly than the CUDA computation's instructions. That means a harder problem will be required in order for the many-but-slower cores of the GPU to surpass the CPU's single-but-faster sequential performance.

But if there is little difference between the CPU and GPU clock speeds, then the rate at which the CPU and GPU perform instructions will be more similar. The closer these two clock speeds are to one another, the easier it will be for the GPU's parallel cores to outperform the sequential computation on the CPU, so CUDA will provide a performance advantage on easier problems.

To bring this closer to home: If you perform this exercise in Calvin's Gold Lab and your friend performs it in the Systems Lab, you will get very different results, because both the CPU and GPU clock speeds in each lab's computers are different!

To find the clock speed of your Linux-system CPU, you can enter either:

   cat /proc/cpuinfo | grep MHz
or
   lscpu | grep Hz
The /proc/cpuinfo approach has the advantage of indicating the number of cores your CPU claims to have (it prints one MHz line per core). The lscpu command indicates your CPU's base clock speed, its maximum (e.g., turbo-boosted) speed, and its minimum (e.g., power-saving) speed.

To find the clock speed of your Nvidia GPU, you can enter:

   nvidia-smi base-clocks

To learn more about CUDA, see your textbook's chapter on CUDA and GPUs, or check out Nvidia's Introduction to CUDA, or their CUDA C++ Programming Guide.

Congratulations--you have reached the end of the final CS 374 exercise!

I hope you have found learning about HPC interesting, enlightening, and fun!!




This page maintained by Joel Adams.