HPC CUDA Exercise: Hands-On Lab


Introduction

Most of today's computing systems contain powerful graphics processing units (GPUs) that, in addition to rendering graphics, can be used as general-purpose computing devices. There are several different programming platforms for general-purpose GPU computing, including:

  1. CUDA is Nvidia's proprietary but free platform, which works only with Nvidia devices. It is the oldest, most well-established, and best-documented framework.
  2. OpenCL is a non-proprietary standard that can use any OpenCL-compatible cores (CPU or GPU) in a system. Originally devised by Apple, it is now supported by a consortium that includes Apple, AMD/Radeon, Nvidia, Intel, and others.
  3. OpenACC is a (currently proprietary) standard that provides a higher-level #pragma-based interface to the GPU, similar to OpenMP for multithreading. It is supported by the Portland Group International (PGI) software company, which was acquired by Nvidia.
  4. OpenMP was originally a multithreading standard but has recently been extended to include #pragma directives for offloading work to the GPU.

This week, we will explore how to use CUDA. The workstations in our Gold Lab have Nvidia RTX 3060 graphics cards, each of which has 3584 "Ampere" CUDA cores and 12 GB of VRAM to throw at a problem. (The Maroon Lab workstations have Nvidia GTX 1050 Ti graphics cards, each with just 768 "Pascal" CUDA cores and 4 GB of VRAM, so I highly recommend you use a Gold Lab workstation for this exercise.)

In this exercise, our research question is:

When is a CUDA computation faster than the equivalent sequential computation?

Or put differently, when is using CUDA worthwhile?

The answer to this question depends on at least two factors:

  1. The "time expense" of the operation being performed on each array element.
  2. The amount of data being transferred between the host (CPU) and the device (GPU).

Part I: Vector Addition

We'll start very simply: given two vectors A and B, we'll add them together and put the result in a third vector C.

   C = A + B;

To get started, download the source file and Makefile from vectorAdd. We will be downloading several source programs and Makefiles today, so store these in a new folder named vectorAdd and cd to that folder.

One thing to observe is that simple (one-file) CUDA source files are named using the .cu extension.

Take a moment to view the Makefile. From it, we can learn a few things, including:

  1. CUDA programs are built using Nvidia's nvcc compiler.
  2. There is a (commented-out) definition of CUDA_FLAGS for each lab, since the two labs' GPUs have different architectures.

After uncommenting the definition of CUDA_FLAGS that corresponds to the lab in which you are working, save your changes, and open vectorAdd.cu.

The program in vectorAdd.cu is a tweaked version of a sample program that comes with Nvidia's CUDA Toolkit. Aside from cleaning up the error-handling and adding support for a command-line argument, the main change was to add a sequential loop that performs the same computation as the CUDA kernel, so that we can compare CUDA's performance against the equivalent sequential performance.

Use the provided Makefile to build the program, and verify that it builds and runs without errors before continuing. (The nvcc compiler is located in /usr/local/cuda/bin; that directory should already be in your PATH variable.)

Using the omp_get_wtime() function, modify vectorAdd.cu so that it calculates:

  1. The times required by the CUDA computation, specifically:
    a. The time spent copying the A and B arrays from the host to the device.
    b. The time spent computing the sum of the A and B arrays into C.
    c. The time spent copying the C array from the device to the host.
    d. The total time of the CUDA computation (i.e., the sum of a+b+c).
  2. The time required by the sequential computation.
Near the end of the program (i.e., after all the verification tests have been passed), add a printf() to display these times.

Readability suggestion: to make your timing results easy to compare, use tab characters (i.e., \t) in the format-string of your printf(), to align your timing results.

We do not want I/O to affect our timing results, so comment out the printf() statements being used to trace the execution through each of these sections. Likewise, memory allocation times are not usually very significant, so don't bother timing how long cudaMalloc() takes (unless you really want to). Don't forget to #include <omp.h> in order for omp_get_wtime() to be declared!
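
For example, your timing code might look something like the following sketch. (The names h_A, d_A, size, blocksPerGrid, threadsPerBlock, and seq_C are assumptions patterned after the CUDA Toolkit sample on which vectorAdd.cu is based; yours may differ. Note that a kernel launch returns immediately, so we must call cudaDeviceSynchronize() before stopping that clock.)

   double start = omp_get_wtime();
   cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);   // (a) host-to-device copies
   cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
   double hostToDevTime = omp_get_wtime() - start;

   start = omp_get_wtime();
   vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
   cudaDeviceSynchronize();                              // (b) wait for the kernel to finish
   double kernelTime = omp_get_wtime() - start;

   start = omp_get_wtime();
   cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);   // (c) device-to-host copy
   double devToHostTime = omp_get_wtime() - start;

   double totalCudaTime = hostToDevTime + kernelTime + devToHostTime;   // (d)

   start = omp_get_wtime();                              // 2. the sequential computation,
   for (int i = 0; i < numElements; ++i) {               //    using its own result array
       seq_C[i] = h_A[i] + h_B[i];                       //    so CUDA's results can still
   }                                                     //    be verified
   double seqTime = omp_get_wtime() - start;

   // near the end, after all the verification tests have passed:
   printf("CUDA:\t%lf\t%lf\t%lf\t%lf\nSeq:\t\t\t\t%lf\n",
          hostToDevTime, kernelTime, devToHostTime, totalCudaTime, seqTime);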

Save your changes and use the Makefile to build your modified version of the program. When it builds successfully, run it as follows:

   ./vectorAdd
By default, the program uses an array size of 20,000 elements.

Which is faster, the CUDA version or the sequential version? Are you seeing any speedup for the CUDA version?

Perhaps the problem size is the issue. Run it again, but increase the size of the array to 200,000 elements:

   ./vectorAdd 200000

Then run it again, but increase the size of the array to 2,000,000 elements:

   ./vectorAdd 2000000
How do these timings compare to those using 20,000 elements?

Run it again, using 20,000,000 elements. How do these times compare to your previous ones?

Run it again, using 200,000,000 elements. How do these times compare?

Run it again, using 2,000,000,000 elements. What happens this time?

In light of that, run it again using 1,000,000,000 elements. Does that work?

Time Trials. Now that we have identified a limitation on how many values the GPU's memory can store, let's conduct some time trials. Using a spreadsheet and your program, record the timings for 100,000; 1,000,000; 10,000,000; 100,000,000; and 1,000,000,000 array elements.

Visualization. Create a line chart, with a solid line for the sequential code's times and a dotted line for the CUDA code's total times. Your chart's X-axis should be labeled with 100,000; 1,000,000; 10,000,000; 100,000,000; and 1,000,000,000 array elements; its Y-axis should be the time.

Then create a second chart, but make this one a "stacked" bar chart of the CUDA times, with the same X and Y axes as your first chart. For each X-axis value, this chart should "stack" the CUDA computation's:

  1. host-to-device transfer time
  2. computation time
  3. device-to-host transfer time

Questions to ponder: At what problem size (if any) does the CUDA version begin to outperform the sequential version? Which portion of the CUDA computation consumes the most time?

When you have completed Part I, continue on to Part II.

Part II: Vector Multiplication

Let's revisit the same research question, but using a more "expensive" operation. Multiplication is a more time-consuming operation than addition, so let's try that.

In your vectorAdd directory, use

   make clean
to remove the binary. Then use
   cd .. 
   cp -r vectorAdd vectorMult
to create a copy of your vectorAdd folder named vectorMult. Inside that folder, rename vectorAdd.cu as vectorMult.cu, and modify the Makefile to build vectorMult instead of vectorAdd.
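
For example (a sketch; your exact commands may vary):

   cd vectorMult
   mv vectorAdd.cu vectorMult.cu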

Then edit vectorMult.cu and change it so that instead of storing the sum of A[i] and B[i] in C[i], the program stores the product of A[i] times B[i] in C[i]. Note that you will need to change:

  1. the CUDA kernel,
  2. the equivalent sequential loop, and
  3. the code that verifies the results.

A sketch of the kernel change follows this list.
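
For instance, the modified kernel might look something like this (a sketch patterned after the vectorAdd kernel in the CUDA Toolkit sample; your names may differ):

   __global__ void vectorMult(const float *A, const float *B, float *C, int numElements) {
       int i = blockDim.x * blockIdx.x + threadIdx.x;   // this thread's element index
       if (i < numElements) {
           C[i] = A[i] * B[i];                          // product instead of sum
       }
   }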

Then build vectorMult and run it using 100,000; 1,000,000; 10,000,000; 100,000,000; and 1,000,000,000 array elements. As in Part I, record the timings for each of these in your spreadsheet, and recreate the same two charts as before to help visualize and compare the results.

How do your results compare to those of Part I (are they similar or different), and why?

What is the answer to our research question?

When you have completed Part II, continue to Part III.

Part III: Vector Square Root

Let's revisit the same research question again, but using an even more "expensive" operation AND reducing the amount of data we're transferring. Square root is a more expensive operation than multiplication, so let's try that.

As in Part II, clean and make a copy of your vectorMult folder named vectorRoot. Inside it, rename vectorMult.cu as vectorRoot.cu and modify the Makefile to build vectorRoot.

Then edit vectorRoot.cu and change it so that it stores the square root of A[i] in C[i]. Since this computation no longer uses the B array, you should also remove B and its host-to-device transfer; that is how we reduce the amount of data we're transferring. A sketch of the kernel appears below.
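
For instance, the kernel might become something like this sketch (sqrtf() is CUDA's single-precision square root; your names may differ):

   __global__ void vectorRoot(const float *A, float *C, int numElements) {
       int i = blockDim.x * blockIdx.x + threadIdx.x;   // this thread's element index
       if (i < numElements) {
           C[i] = sqrtf(A[i]);                          // one input array instead of two
       }
   }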

Then build vectorRoot and run it using 100,000; 1,000,000; 10,000,000; 100,000,000; and 1,000,000,000 array elements. As before, record the timings for each of these in your spreadsheet, and create charts to help visualize the results.

How do these results compare to those of Parts I and II?

Has your answer to our research question changed or stayed the same?

When you have completed Part III, continue to Part IV.

Part IV: Vector Square

Let's revisit the same research question again. You should have seen some benefit from using CUDA in Part III, but it could have been either because (i) square root is an expensive operation, or (ii) we only transferred one array (instead of two) from the host to the device.

To try to see which of these two made the difference, let's use a less expensive operation than square root, but keep the amount of data we're transferring the same.

As in Part III, clean and make a copy of your vectorRoot folder named vectorSquare. Inside it, rename vectorRoot.cu as vectorSquare.cu and modify the Makefile to build vectorSquare.

Then edit vectorSquare.cu and change it so that it stores the square of A[i] in C[i], as sketched below.
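
A minimal sketch of the kernel (again, your names may differ):

   __global__ void vectorSquare(const float *A, float *C, int numElements) {
       int i = blockDim.x * blockIdx.x + threadIdx.x;   // this thread's element index
       if (i < numElements) {
           C[i] = A[i] * A[i];                          // a cheap multiply, but the same
       }                                                //  one-array transfer as Part III
   }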

Then build vectorSquare and run it using 100,000; 1,000,000; 10,000,000; 100,000,000; and 1,000,000,000 array elements. As before, record the timings for each of these in your spreadsheet, and create charts to help visualize the results.

How do these results compare to those of the previous parts?

Has your answer to our research question changed or stayed the same?

When you are finished with Part IV, you may continue to Part V, the last part.

Part V: Vector Hypotenuse

Finally, let's transfer 3 vectors (as you did in Parts I and II) but use a more expensive computation than we have so far--calculating C[i] as the hypotenuse of a right triangle whose leg lengths are in A[i] and B[i].

Your final task is to write vectorHypot.cu that computes C[i] = sqrt( A[i]*A[i] + B[i]*B[i] ) for all the elements in arrays A, B, and C. Your program should compute this sequentially and using CUDA, time both computations, and verify the correctness of the computations, as we did in parts I-IV.
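
Here is a sketch of what the CUDA kernel and the equivalent sequential loop might look like (hypothetical names, patterned after the earlier parts):

   __global__ void vectorHypot(const float *A, const float *B, float *C, int numElements) {
       int i = blockDim.x * blockIdx.x + threadIdx.x;      // this thread's element index
       if (i < numElements) {
           C[i] = sqrtf(A[i]*A[i] + B[i]*B[i]);            // the Pythagorean theorem
       }
   }

   // the equivalent sequential computation:
   for (int i = 0; i < numElements; ++i) {
       seq_C[i] = sqrtf(h_A[i]*h_A[i] + h_B[i]*h_B[i]);
   }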

As before, create a line chart that compares your sequential and CUDA computation times for arrays of 100,000; 1,000,000; 10,000,000; 100,000,000; and 1,000,000,000 elements, and a stacked bar chart showing the times spent in the different portions of the CUDA computation.

How do these results compare to those of the previous parts?

What is your final answer to our research question?

A Final Analysis

The comparisons we have performed in this exercise are highly dependent on two clock speeds of the computer on which you run these programs:

  1. the clock speed of its CPU, and
  2. the clock speed of its GPU.

A core's clock speed controls the rate at which it performs instructions: cores with higher clock speeds perform instructions more quickly; those with lower clock speeds, less quickly.

Cores running at higher clock speeds also generate more heat, and GPUs have many, many more cores than CPUs, so GPU clock speeds are generally lower than CPU clock speeds to avoid generating too much heat.

In this exercise, we have run programs of varying difficulties--in terms of the 'time expense' of the required operations and data-transfers--to gauge when CUDA provides a performance advantage.

If you run the programs of this exercise on a computer where the CPU's clock speed is much higher than that of the GPU, then the sequential computation's instructions will be performed much more quickly than the CUDA computation's instructions. That means a harder problem will be required in order for the many cores of the GPU to surpass the CPU's single-core sequential performance.

But if there is little difference between the CPU and GPU clock speeds, then the rate at which the CPU and GPU perform instructions will be more similar. The closer these two clock speeds are to one another, the easier it will be for the GPU's parallel cores to outperform the sequential computation on the CPU, so CUDA will provide a performance advantage on easier problems.

To bring this closer to home: if you perform this exercise in Calvin's Gold Lab and your friend performs it in the Maroon Lab (or the Systems Lab), you will get very different results, because both the CPU and GPU clock speeds in those labs' computers are different!

To find the clock speed of your Linux-system CPU, you can enter either:

   cat /proc/cpuinfo | grep MHz
or
   lscpu | grep Hz
The /proc/cpuinfo listing has the advantage of indicating the number of cores your CPU claims to have. The lscpu command indicates your CPU's base clock speed, its maximum (e.g., turbo-boosted) speed, and its minimum (e.g., power-saving) speed.

To find the clock speed of your Nvidia GPU, you can enter:

   nvidia-smi base-clocks
Nvidia also provides a utility called deviceQuery that reveals a wealth of information about their GPUs; to run it, enter:
   /usr/local/cuda/extras/demo_suite/deviceQuery
To filter these results and find the number of cores on the GPU, enter:
   /usr/local/cuda/extras/demo_suite/deviceQuery | grep Cores
and to find out the amount of global memory on the GPU, enter:
   /usr/local/cuda/extras/demo_suite/deviceQuery | grep MBytes
(The amount of global memory on the GPU matters because it ultimately limits the amount of data you can transfer to it using cudaMemcpy(), as we discovered in Part I.)

Congratulations--you have reached the end of the last CS 374 exercise! See the chapter in your textbook for more details about CUDA.

I hope you found learning about HPC enlightening and interesting!!


This page maintained by Joel Adams.