
GPU Programming with C++

Overview

In this guide, we’ll explore GPU programming with C++ using NVIDIA’s CUDA toolkit. C++ already offers strong performance on its own, and offloading highly parallel work to the GPU from a low-level language can make that work considerably faster still.

Requirements

While any machine capable of running a modern version of Linux can support a C++ compiler, you’ll need an NVIDIA-based GPU to follow along with this exercise. If you don’t have a GPU, you can spin up a GPU-powered instance in Amazon Web Services or another cloud provider of your choice.

If you choose a physical machine, please ensure you have the NVIDIA proprietary drivers installed. You can find instructions for this here: https://linuxhint.com/install-nvidia-drivers-linux/

In addition to the driver, you’ll need the CUDA toolkit. In this example, we’ll use Ubuntu 16.04 LTS, but there are downloads available for most major distributions at the following URL: https://developer.nvidia.com/cuda-downloads

For Ubuntu, you would choose the .deb based download. The downloaded file will not have a .deb extension by default, so I recommend renaming it to have a .deb at the end. Then, you can install with:

sudo dpkg -i package-name.deb

You will likely be prompted to install a GPG key, and if so, follow the instructions provided to do so.

Once you’ve done that, update your repositories and install the CUDA package:

sudo apt-get update
sudo apt-get install cuda -y

Once done, I recommend rebooting to ensure everything is properly loaded.

The Benefits of GPU Development

CPUs are generalists. They handle many kinds of input and output, provide a large set of functions to cover a wide range of program needs, and manage varying hardware configurations. They also handle memory, caching, the system bus, segmenting, and I/O, making them a jack of all trades.

GPUs are the opposite: they contain many individual processors, each focused on very simple mathematical functions. When a task can be split into many independent calculations, a GPU can run them in parallel and finish many times faster than a CPU. By specializing in scalar functions (functions that take one or more inputs but return only a single output), GPUs achieve extreme performance at the cost of extreme specialization.

Example Code

In the example code, we add two vectors together. I have included a CPU and a GPU version of the function for speed comparison.
Save the following as gpu-example.cu (nvcc treats files with a .cu extension as CUDA source):

#include <cuda_runtime.h>
#include <stdlib.h>
#include <iostream>
#include <chrono>

typedef std::chrono::high_resolution_clock Clock;

#define ITER 65535

// CPU version of the vector add function
void vector_add_cpu(int *a, int *b, int *c, int n) {
    int i;

    // Add the vector elements a and b to the vector c
    for (i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}

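// GPU version of the vector add function
__global__ void vector_add_gpu(int *gpu_a, int *gpu_b, int *gpu_c, int n) {
    // Each thread handles one element. Its global index is built from
    // its block number and its position within that block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // No for loop is needed because each thread adds a single element.
    // The last block may have more threads than remaining elements,
    // so guard against writing past the end of the vectors.
    if (i < n) {
        gpu_c[i] = gpu_a[i] + gpu_b[i];
    }
}

int main() {

    int *a, *b, *c;
    int *gpu_a, *gpu_b, *gpu_c;

    a = (int *)malloc(ITER * sizeof(int));
    b = (int *)malloc(ITER * sizeof(int));
    c = (int *)malloc(ITER * sizeof(int));

    // We need variables accessible to both the CPU and the GPU,
    // so cudaMallocManaged provides these
    cudaMallocManaged(&gpu_a, ITER * sizeof(int));
    cudaMallocManaged(&gpu_b, ITER * sizeof(int));
    cudaMallocManaged(&gpu_c, ITER * sizeof(int));

    for (int i = 0; i < ITER; ++i) {
        a[i] = i;
        b[i] = i;
        c[i] = i;
        // Give the GPU vectors the same input values
        gpu_a[i] = i;
        gpu_b[i] = i;
    }

    // Call the CPU function and time it
    auto cpu_start = Clock::now();
    vector_add_cpu(a, b, c, ITER);
    auto cpu_end = Clock::now();
    std::cout << "vector_add_cpu: "
              << std::chrono::duration_cast<std::chrono::nanoseconds>(cpu_end - cpu_start).count()
              << " nanoseconds.\n";

    // Call the GPU function and time it
    // The triple angle brackets are a CUDA extension that passes the
    // launch configuration of a kernel call: the number of thread blocks,
    // then the number of threads per block. A block is limited to 1024
    // threads on current NVIDIA GPUs, so we launch enough 256-thread
    // blocks to cover all ITER elements.
    int threads_per_block = 256;
    int blocks = (ITER + threads_per_block - 1) / threads_per_block;
    auto gpu_start = Clock::now();
    vector_add_gpu <<<blocks, threads_per_block>>> (gpu_a, gpu_b, gpu_c, ITER);
    cudaError_t launch_err = cudaGetLastError();     // invalid launch settings
    cudaError_t sync_err = cudaDeviceSynchronize();  // wait for the kernel
    auto gpu_end = Clock::now();
    if (launch_err != cudaSuccess || sync_err != cudaSuccess) {
        std::cerr << "vector_add_gpu failed: "
                  << cudaGetErrorString(launch_err != cudaSuccess ? launch_err : sync_err)
                  << "\n";
    }
    std::cout << "vector_add_gpu: "
              << std::chrono::duration_cast<std::chrono::nanoseconds>(gpu_end - gpu_start).count()
              << " nanoseconds.\n";

    // Free the managed (GPU-accessible) memory allocations
    cudaFree(gpu_a);
    cudaFree(gpu_b);
    cudaFree(gpu_c);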

    // Free the CPU-function based memory allocations
    free(a);
    free(b);
    free(c);

    return 0;
}
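
The launch configuration is the part of a CUDA program that is easiest to get wrong, so it can help to reason about it on the CPU first. The sketch below is plain C++, not CUDA: it emulates a one-dimensional grid with two nested loops and uses the usual global-index formula, blockIdx.x * blockDim.x + threadIdx.x. The names blocks_needed and emulate_vector_add are hypothetical helpers introduced for illustration only, and the threads-per-block value is whatever you choose to pass in.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: number of blocks needed so that
// blocks * threads_per_block >= n (ceiling division).
int blocks_needed(int n, int threads_per_block) {
    assert(n >= 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

// Hypothetical CPU emulation of a 1-D kernel launch. The two loops stand in
// for blockIdx.x and threadIdx.x; on a GPU these iterations run in parallel.
std::vector<int> emulate_vector_add(const std::vector<int> &a,
                                    const std::vector<int> &b,
                                    int threads_per_block) {
    assert(a.size() == b.size());
    const int n = static_cast<int>(a.size());
    const int blocks = blocks_needed(n, threads_per_block);
    std::vector<int> c(n, 0);

    for (int block = 0; block < blocks; ++block) {
        for (int thread = 0; thread < threads_per_block; ++thread) {
            // Same formula a kernel would use for its global index
            const int i = block * threads_per_block + thread;
            // The last block may overshoot n, so guard the access
            if (i < n) {
                c[i] = a[i] + b[i];
            }
        }
    }
    return c;
}
```

With 65535 elements and 256 threads per block, blocks_needed returns 256, which gives 65536 emulated threads; the bounds check discards the single extra one.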

Makefile contents below:

INC=-I/usr/local/cuda/include
NVCC=/usr/local/cuda/bin/nvcc
NVCC_OPT=-std=c++11

# Recipe lines must begin with a tab character, not spaces
all:
	$(NVCC) $(NVCC_OPT) gpu-example.cu -o gpu-example

clean:
	-rm -f gpu-example

To run the example, compile it:

make

Then run the program:

./gpu-example
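
Before reading anything into the timings, it is worth checking that the GPU result matches the CPU result, because a kernel that fails to launch can look very fast. The sketch below is plain C++ and assumes both results are available as int arrays of the same length. first_mismatch is a hypothetical helper, not part of the CUDA toolkit.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: returns the index of the first element where the two
// arrays differ, or -1 if all n elements match.
long first_mismatch(const int *expected, const int *actual, std::size_t n) {
    assert(expected != nullptr && actual != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        if (expected[i] != actual[i]) {
            return static_cast<long>(i);
        }
    }
    return -1;
}
```

In the example program, this could be called as first_mismatch(c, gpu_c, ITER) once cudaDeviceSynchronize() has returned, since managed memory can be read from the CPU at that point.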

If everything is working, the program prints one timing for the CPU version (vector_add_cpu) and one for the GPU version (vector_add_gpu), and the GPU version should be the faster of the two.

If it is not, you may need to raise the ITER define in gpu-example.cu. Launching work on the GPU carries a fixed setup cost, and for small inputs that cost can be larger than the entire CPU loop. I found 65535 to work well on my machine, but your mileage may vary. Once the input is large enough to clear this threshold, the GPU version should pull clearly ahead of the CPU version.
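
Single measurements like these are noisy, and the first GPU call often includes one-time setup cost. One common way to get a steadier comparison is to run each version several times and keep the fastest run. The sketch below is plain C++ using std::chrono; best_time_ns and its runs parameter are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>

// Hypothetical helper: calls work() `runs` times and returns the shortest
// elapsed wall-clock time in nanoseconds.
template <typename Work>
std::int64_t best_time_ns(Work work, int runs) {
    assert(runs > 0);
    using SteadyClock = std::chrono::steady_clock;
    std::int64_t best = std::numeric_limits<std::int64_t>::max();
    for (int r = 0; r < runs; ++r) {
        const auto start = SteadyClock::now();
        work();
        const auto end = SteadyClock::now();
        const std::int64_t ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
        if (ns < best) {
            best = ns;
        }
    }
    return best;
}
```

For the GPU version, the callable would need to include both the kernel launch and cudaDeviceSynchronize(), because the launch itself returns before the kernel has finished.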

Conclusion

I hope you’ve learned a lot from this introduction to GPU programming with C++. The example above doesn’t accomplish a great deal, but the concepts it demonstrates provide a framework you can build on to put your own ideas to work on the GPU.

About the author

Robert Oliver

Writer, System Admin, Full Stack Developer, Philosopher.
https://rwo2.com