
GPU Programming with Python

In this article, we’ll dive into GPU programming with Python. Using the ease of Python, you can unlock the incredible computing power of your video card’s GPU (graphics processing unit). In this example, we’ll work with NVIDIA’s CUDA library.


For this exercise, you’ll need either a physical machine running Linux with an NVIDIA-based GPU or a GPU-based instance on Amazon Web Services. Either should work fine, but if you choose a physical machine, make sure the NVIDIA proprietary drivers are installed; see instructions:

You’ll also need the CUDA Toolkit installed. This example uses Ubuntu 16.04 LTS specifically, but there are downloads available for most major Linux distributions at the following URL:

I prefer the .deb based download, and these examples will assume you chose that route. The file you download is a .deb package but doesn’t have a .deb extension, so renaming it to end in .deb is helpful. Then install it with:

sudo dpkg -i package-name.deb

If you are prompted about installing a GPG key, please follow the instructions given to do so.

Now you’ll need to install the cuda package itself. To do so, run:

sudo apt-get update
sudo apt-get install cuda -y

This part can take a while, so you might want to grab a cup of coffee. Once it’s done, I recommend rebooting to ensure all modules are properly reloaded.
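After the reboot, a quick sanity check can confirm that the driver and toolkit are where you expect them. The following is only a minimal sketch: the helper name check_cuda_install is made up for illustration, and the /usr/local/cuda path is an assumption about where the .deb packages usually place the toolkit, so adjust it if your install differs. It runs with any Python 3 interpreter, including the one that ships with Ubuntu.

```python
import os
import shutil


def check_cuda_install(cuda_home="/usr/local/cuda"):
    """Report which CUDA pieces can be found on this machine.

    cuda_home is an assumed default location, not a guaranteed one.
    """
    return {
        "nvidia-smi on PATH": shutil.which("nvidia-smi") is not None,
        "nvcc on PATH": shutil.which("nvcc") is not None,
        "toolkit directory exists": os.path.isdir(cuda_home),
    }


status = check_cuda_install()
for label, found in status.items():
    print("%-26s %s" % (label, "yes" if found else "no"))
```

A "no" for nvcc does not necessarily mean the install failed; the toolkit's bin directory may simply not be on your PATH yet.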

Next, you’ll need the Anaconda Python distribution. You can download that here:

Grab the 64-bit version and install it like this:

sh Anaconda*.sh

(the star in the above command is a wildcard, so the command matches the installer file regardless of its version number)

The default install location should be fine, and in this tutorial, we’ll use it. By default, it installs to ~/anaconda3

At the end of the install, you’ll be prompted to decide if you wish to add Anaconda to your path. Answer yes here to make running the necessary commands easier. To ensure this change takes place, after the installer finishes completely, log out then log back in to your account.

More info on Installing Anaconda:

Finally, we’ll need to install Numba. Numba uses the LLVM compiler to compile Python to machine code. This not only enhances the performance of regular Python code but also provides the glue necessary to send instructions to the GPU in binary form. To install it, run:

conda install numba
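Before moving on, it can be worth checking that Numba actually sees your GPU. The snippet below is a small sketch rather than an official diagnostic: describe_gpu_support is a hypothetical helper name, and it relies on Numba's cuda.is_available() and cuda.get_current_device() calls. It is written to return a message instead of crashing when Numba or a GPU is missing.

```python
def describe_gpu_support():
    """Return a one-line summary of what Numba can see."""
    try:
        from numba import cuda
    except ImportError:
        return "Numba is not installed in this environment."

    if not cuda.is_available():
        return "Numba is installed, but no usable CUDA GPU was found."

    # Older Numba releases report the device name as bytes.
    name = cuda.get_current_device().name
    if isinstance(name, bytes):
        name = name.decode()
    return "CUDA GPU detected: %s" % name


summary = describe_gpu_support()
print(summary)
```

If the message says no GPU was found on a machine that has one, the driver or the CUDA Toolkit install is the first place to look.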

Limitations and Benefits of GPU Programming

It’s tempting to think that we can convert any Python program into a GPU-based program, dramatically accelerating its performance. However, the GPU on a video card works considerably differently than a standard CPU in a computer.

CPUs handle a lot of different inputs and outputs and have a wide assortment of instructions for dealing with these situations. They also are responsible for accessing memory, dealing with the system bus, handling protection rings, segmenting, and input/output functionality. They are extreme multitaskers with no specific focus.

GPUs, on the other hand, are built to process simple functions at very high speed. To accomplish this, they expect more uniform input and output, and they specialize in scalar functions. A scalar function takes one or more inputs but returns only a single output. These values must be types pre-defined by NumPy.
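To make the idea of a scalar function concrete, here is a minimal sketch that runs on the CPU, so no GPU is needed to try it. The function scaled_sum is invented for illustration. It is written for single values, and the vectorizing wrapper applies it to every element of the input arrays. If Numba is not installed, the sketch falls back to NumPy's np.vectorize, which is slower but behaves the same way for this purpose.

```python
import numpy as np

try:
    from numba import vectorize
except ImportError:
    vectorize = None  # Fall back to NumPy if Numba is unavailable.


def scaled_sum(x, y):
    # A scalar function: two single values in, one single value out.
    return 2.0 * x + y


if vectorize is not None:
    # The signature declares NumPy types: float32 inputs, float32 output.
    scaled_sum_vec = vectorize(["float32(float32, float32)"], target="cpu")(scaled_sum)
else:
    scaled_sum_vec = np.vectorize(scaled_sum, otypes=[np.float32])

x = np.arange(4, dtype=np.float32)  # [0, 1, 2, 3]
y = np.ones(4, dtype=np.float32)    # [1, 1, 1, 1]
result = scaled_sum_vec(x, y)
print(result)  # [1. 3. 5. 7.]
```

Note that the function body never mentions arrays or loops. That uniformity is what lets the same code be spread across many GPU threads.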

Example Code

In this example, we’ll create a simple function that takes two arrays of values, adds them element by element, and returns the resulting array. To demonstrate the power of the GPU, we’ll run one version of this function on the CPU and one on the GPU and display the times. The documented code is below:

import numpy as np
from timeit import default_timer as timer
from numba import vectorize

# This should be a substantially high value. On my test machine, this took
# 33 seconds to run via the CPU and just over 3 seconds on the GPU.
NUM_ELEMENTS = 100000000

# This is the CPU version.
def vector_add_cpu(a, b):
  c = np.zeros(NUM_ELEMENTS, dtype=np.float32)
  for i in range(NUM_ELEMENTS):
    c[i] = a[i] + b[i]
  return c

# This is the GPU version. Note the @vectorize decorator. This tells
# numba to turn this into a GPU vectorized function.
@vectorize(["float32(float32, float32)"], target='cuda')
def vector_add_gpu(a, b):
  return a + b

def main():
  a_source = np.ones(NUM_ELEMENTS, dtype=np.float32)
  b_source = np.ones(NUM_ELEMENTS, dtype=np.float32)

  # Time the CPU function
  start = timer()
  vector_add_cpu(a_source, b_source)
  vector_add_cpu_time = timer() - start

  # Time the GPU function
  start = timer()
  vector_add_gpu(a_source, b_source)
  vector_add_gpu_time = timer() - start

  # Report times
  print("CPU function took %f seconds." % vector_add_cpu_time)
  print("GPU function took %f seconds." % vector_add_gpu_time)

  return 0

if __name__ == "__main__":
  main()

To run the example, type:


NOTE: If you run into issues when running your program, try using “conda install accelerate”.

As you can see, the CPU version runs considerably slower.

If not, the number of elements is probably too small. Increase NUM_ELEMENTS (on my machine, the break-even point seemed to be around 100 million). Setting up the GPU takes a small but noticeable amount of time, so the workload has to be large enough to make the operation worth it. Once you raise it above the threshold for your machine, you’ll notice substantial performance improvements of the GPU version over the CPU version.
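One way to look for that threshold is to time the same function at several array sizes. The helper below is a rough sketch: time_at_sizes is a hypothetical name, and np.add is used only as a stand-in so the snippet runs on any machine. To compare the two versions from the article, you would pass vector_add_gpu instead. The CPU version above would need a small change first, since it loops over NUM_ELEMENTS rather than the length of its inputs.

```python
import numpy as np
from timeit import default_timer as timer


def time_at_sizes(func, sizes):
    """Time func(a, b) once for each array length in sizes."""
    timings = {}
    for n in sizes:
        a = np.ones(n, dtype=np.float32)
        b = np.ones(n, dtype=np.float32)
        start = timer()
        func(a, b)
        timings[n] = timer() - start
    return timings


# np.add is a placeholder; swap in the function you want to measure.
timings = time_at_sizes(np.add, [1000, 100000, 1000000])
for n, seconds in timings.items():
    print("%10d elements: %f seconds" % (n, seconds))
```

A single run per size is noisy, so treat the numbers as a rough guide and repeat the measurement a few times before drawing conclusions.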


I hope you’ve enjoyed this basic introduction to GPU programming with Python. Though the example above is trivial, it provides the framework you need to take your ideas further using the power of your GPU.

About the author

Robert Oliver

Writer, System Admin, Full Stack Developer, Philosopher.