Using the GPU

One of Theano’s design goals is to specify computations at an abstract level, so that the internal function compiler has a lot of flexibility in how it carries out those computations. One way we take advantage of this flexibility is by carrying out calculations on an Nvidia graphics card when there is a CUDA-enabled device in your computer.

Setting up CUDA

The first thing you’ll need for Theano to use your GPU is Nvidia’s GPU-programming toolchain. You should install at least the CUDA driver and the CUDA Toolkit, as described here. The CUDA Toolkit installs a folder on your computer with subfolders bin, lib, and include, among others. (Sanity check: the bin subfolder should contain an nvcc program, which is the compiler for GPU code.) This folder is called the cuda root directory. On Linux or OS X >= 10.4, you must add the ‘lib’ subdirectory (and/or the ‘lib64’ subdirectory if you have a 64-bit computer) to your $LD_LIBRARY_PATH environment variable.
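The sanity check described above can be scripted. The following is a minimal sketch, not part of Theano: the helper name check_cuda_root is hypothetical, and it only tests the two conditions mentioned in the text (nvcc present under bin, and lib or lib64 listed in $LD_LIBRARY_PATH).

```python
import os


def check_cuda_root(cuda_root, env=None):
    """Return a list of problems found with a cuda root directory (empty if none).

    Hypothetical helper for illustration; Theano does not provide it.
    """
    env = os.environ if env is None else env
    problems = []

    # The compiler for GPU code should live in <cuda_root>/bin/nvcc.
    nvcc = os.path.join(cuda_root, 'bin', 'nvcc')
    if not os.path.isfile(nvcc):
        problems.append('nvcc not found at %s' % nvcc)

    # At least one of lib / lib64 should be listed in LD_LIBRARY_PATH.
    entries = env.get('LD_LIBRARY_PATH', '').split(os.pathsep)
    search_path = [os.path.normpath(p) for p in entries if p]
    lib_dirs = [os.path.normpath(os.path.join(cuda_root, d))
                for d in ('lib', 'lib64')]
    if not any(d in search_path for d in lib_dirs):
        problems.append('neither lib nor lib64 is in LD_LIBRARY_PATH')

    return problems


# Replace the placeholder path with your own cuda root directory.
for problem in check_cuda_root('/path/to/cuda/root'):
    print(problem)
```

An empty list means both checks passed; it does not prove that the driver itself works.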

Making Theano use CUDA

You must tell Theano where the cuda root folder is, and there are three ways to do it. Any one of them is enough.

  • Define a $CUDA_ROOT environment variable to equal the cuda root directory, as in CUDA_ROOT=/path/to/cuda/root, or
  • add a cuda.root flag to THEANO_FLAGS, as in THEANO_FLAGS='cuda.root=/path/to/cuda/root', or
  • add a [cuda] section to your .theanorc file containing the option root = /path/to/cuda/root.

Once that is done, the only thing left is to change the device option to name the GPU device in your computer. For example: THEANO_FLAGS='cuda.root=/path/to/cuda/root,device=gpu0'. You can also set the device option in the .theanorc file’s [global] section. If your computer has multiple GPU devices, you can address them as gpu0, gpu1, gpu2, or gpu3. (If you have more than 4 devices you are very lucky, but you’ll have to modify Theano’s configdefaults.py file and define more GPU devices to choose from.)
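To make the equivalence between the two configuration styles concrete, here is a small sketch that builds the same settings both ways. The function names theano_flags and theanorc_text are made up for this illustration; the option names and section names are the ones given above.

```python
def theano_flags(cuda_root, device):
    """Build a THEANO_FLAGS value: comma-separated key=value pairs."""
    return 'cuda.root=%s,device=%s' % (cuda_root, device)


def theanorc_text(cuda_root, device):
    """Build equivalent .theanorc contents.

    The device option goes in [global]; the cuda root goes in [cuda] as root.
    """
    lines = [
        '[global]',
        'device = %s' % device,
        '',
        '[cuda]',
        'root = %s' % cuda_root,
        '',
    ]
    return '\n'.join(lines)


print(theano_flags('/path/to/cuda/root', 'gpu0'))
print(theanorc_text('/path/to/cuda/root', 'gpu0'))
```

Either form should be enough on its own; there is no need to set both.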

Note

There is a compatibility issue affecting some Ubuntu 9.10 users, and probably anyone using CUDA 2.3 with gcc-4.4. Symptom: errors about “__sync_fetch_and_add” being undefined. Solution 1: make gcc-4.3 the default gcc (see http://pascalg.wordpress.com/2010/01/14/cuda-on-ubuntu-9-10linux-mint-helena/). Solution 2: make another gcc (e.g. gcc-4.3) the default just for nvcc. Do this by making a directory (e.g. $HOME/.theano/nvcc-bindir) and installing two symlinks in it: one called gcc pointing to gcc-4.3 (or lower) and one called g++ pointing to g++-4.3 (or lower). Then add compiler_bindir = /path/to/nvcc-bindir to the [nvcc] section of your .theanorc (libdoc_config).
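Solution 2 amounts to creating one directory and two symlinks. The sketch below does this from Python; the function name make_nvcc_bindir is hypothetical, and the compiler paths in the usage comment (/usr/bin/gcc-4.3 and /usr/bin/g++-4.3) are assumptions that you should adjust to wherever the older compiler is installed on your system.

```python
import os


def make_nvcc_bindir(bindir, gcc_path, gxx_path):
    """Create bindir containing 'gcc' and 'g++' symlinks to an older compiler.

    Hypothetical helper for illustration; existing links are replaced.
    """
    if not os.path.isdir(bindir):
        os.makedirs(bindir)
    for name, target in (('gcc', gcc_path), ('g++', gxx_path)):
        link = os.path.join(bindir, name)
        if os.path.lexists(link):
            os.remove(link)
        os.symlink(target, link)
    return bindir


# Example usage (paths are assumptions, adjust them before running):
# make_nvcc_bindir(os.path.expanduser('~/.theano/nvcc-bindir'),
#                  '/usr/bin/gcc-4.3', '/usr/bin/g++-4.3')
```

The directory you pass as bindir is the one to name in the compiler_bindir option.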

Putting it all Together

To see if your GPU is being used, cut and paste the following program into a file and run it.

from theano import function, config, shared, sandbox
import theano.tensor as T
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))
t0 = time.time()
for i in xrange(iters):
    r = f()
print 'Looping %d times took'%iters, time.time() - t0, 'seconds'
print 'Result is', r
print 'Used the', 'cpu' if any([isinstance(node.op, T.Elemwise) for node in f.maker.env.toposort()]) else 'gpu'

The program just computes the exp() of a bunch of random numbers. Note that we use the shared function to make sure that the input x is stored on the graphics device.

If I run this program (in thing.py) with device=cpu, my computer takes a little over 7 seconds, whereas on the GPU it takes just over 0.4 seconds. Note that the results are close but not identical! The GPU will not always produce the exact same floating-point numbers as the CPU. As a point of reference, a loop that calls numpy.exp(x.value) also takes about 7 seconds.

$ THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python thing.py
Looping 100 times took 7.17374897003 seconds
Result is [ 1.23178032  1.61879341  1.52278065 ...,  2.20771815  2.29967753 1.62323285]

$ THEANO_FLAGS=mode=FAST_RUN,device=gpu0,floatX=float32 python thing.py
Using gpu device 0: GeForce GTX 285
Looping 100 times took 0.418929815292 seconds
Result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761 1.62323296]

Note that for now GPU operations in Theano require floatX to be float32 (see below also).
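To put a number on "close but not identical", the sketch below compares the six values that are visible in each of the two printed result vectors. The helper max_relative_difference and the 1e-6 tolerance are illustrative choices made here, not something Theano prescribes.

```python
# The six values visible in the CPU and GPU outputs printed above.
cpu = [1.23178032, 1.61879341, 1.52278065, 2.20771815, 2.29967753, 1.62323285]
gpu = [1.23178029, 1.61879349, 1.52278066, 2.20771813, 2.29967761, 1.62323296]


def max_relative_difference(a, b):
    """Largest element-wise relative difference between two sequences."""
    return max(abs(x - y) / max(abs(x), abs(y)) for x, y in zip(a, b))


max_rel = max_relative_difference(cpu, gpu)
print('largest relative difference: %.1e' % max_rel)
print('identical: %s' % (cpu == gpu))
# 1e-6 is an arbitrary tolerance chosen for this illustration.
print('equal within 1e-6: %s' % (max_rel < 1e-6))
```

When testing code that may run on either device, comparing with a tolerance like this is usually more robust than testing for exact equality.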

Returning a handle to device-allocated data

The speedup is not greater in the example above because the function is returning its result as a numpy ndarray which has already been copied from the device to the host for your convenience. This is what makes it so easy to swap in device=gpu0, but if you don’t mind being less portable, you might prefer to see a bigger speedup by changing the graph to express a computation with a GPU-stored result. The gpu_from_host Op means “copy the input from the host to the gpu” and it is optimized away after the T.exp(x) is replaced by a GPU version of exp().

from theano import function, config, shared, sandbox
import theano.tensor as T
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], sandbox.cuda.basic_ops.gpu_from_host(T.exp(x)))
t0 = time.time()
for i in xrange(iters):
    r = f()
print 'Looping %d times took'%iters, time.time() - t0, 'seconds'
print 'Result is', r
print 'Numpy result is', numpy.asarray(r)
print 'Used the', 'cpu' if any([isinstance(node.op, T.Elemwise) for node in f.maker.env.toposort()]) else 'gpu'

The output from this program is

Using gpu device 0: GeForce GTX 285
Looping 100 times took 0.185714006424 seconds
Result is <CudaNdarray object at 0x3e9e970>
Numpy result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761 1.62323296]

Here we’ve shaved off about 50% of the run-time by simply not copying the resulting array back to the host. The object returned by each function call is now not a numpy array but a “CudaNdarray” which can be converted to a numpy ndarray by the normal numpy casting mechanism.

Running the GPU at Full Speed

To really get maximum performance in this simple example, we need to use an Out instance to tell Theano not to copy the output it returns to us. Theano allocates memory for internal use, like a working buffer, and by default it never returns a result that lives in that working buffer; it returns a copy instead. This is normally what you want, but our example is so simple that the extra copy has the unwanted side effect of really slowing things down.

from theano import function, config, shared, sandbox, Out
import theano.tensor as T
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([],
        Out(sandbox.cuda.basic_ops.gpu_from_host(T.exp(x)),
            borrow=True))
t0 = time.time()
for i in xrange(iters):
    r = f()
print 'Looping %d times took'%iters, time.time() - t0, 'seconds'
print 'Result is', r
print 'Numpy result is', numpy.asarray(r)
print 'Used the', 'cpu' if any([isinstance(node.op, T.Elemwise) for node in f.maker.env.toposort()]) else 'gpu'

Running this version of the code takes just under 0.05 seconds, over 140x faster than the CPU implementation!

Using gpu device 0: GeForce GTX 285
Looping 100 times took 0.0497219562531 seconds
Result is <CudaNdarray object at 0x31eeaf0>
Numpy result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761 1.62323296]
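The speedups quoted on this page follow directly from the four timings printed in the outputs above. As a quick arithmetic check (the variable names are chosen here for illustration):

```python
# Timings in seconds, copied from the program outputs on this page.
cpu_time = 7.17374897003          # device=cpu
gpu_copy_time = 0.418929815292    # device=gpu0, result copied back to the host
gpu_handle_time = 0.185714006424  # gpu_from_host: result left on the device
gpu_borrow_time = 0.0497219562531 # gpu_from_host with borrow=True

speedup_gpu = cpu_time / gpu_copy_time
saved_by_handle = 1.0 - gpu_handle_time / gpu_copy_time
speedup_borrow = cpu_time / gpu_borrow_time

print('GPU vs CPU: %.1fx' % speedup_gpu)
print('time saved by returning a device handle: %.0f%%' % (100 * saved_by_handle))
print('borrow=True vs CPU: %.1fx' % speedup_borrow)
```

This gives roughly 17x for the first GPU run, a bit over half of the run time saved by skipping the copy back to the host, and about 144x with borrow=True. These figures are specific to the machine the timings were taken on.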

This version of the code, using borrow=True, is slightly less safe: if we had saved the r returned from one function call, we would have to remember that its value might be overwritten by a subsequent function call. Although borrow=True makes a dramatic difference in this example, be careful! The advantage of borrow=True is much weaker in larger graphs, and there is a lot of potential for making a mistake by failing to account for the resulting memory aliasing.
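The aliasing hazard can be illustrated without a GPU. The sketch below is a pure-Python analogy, not Theano's implementation: the class Doubler is a made-up stand-in for a compiled function that reuses an internal working buffer, and its borrow argument only mimics the behaviour described above.

```python
class Doubler(object):
    """Toy stand-in for a compiled function with an internal working buffer."""

    def __init__(self, n):
        self._buffer = [0.0] * n  # reused on every call

    def __call__(self, values, borrow=False):
        for i, v in enumerate(values):
            self._buffer[i] = 2.0 * v
        # borrow=False hands back a private copy of the buffer;
        # borrow=True hands back the working buffer itself.
        return self._buffer if borrow else list(self._buffer)


f = Doubler(3)
safe = f([1.0, 2.0, 3.0])                  # a copy: later calls cannot touch it
aliased = f([1.0, 2.0, 3.0], borrow=True)  # the working buffer itself
f([10.0, 20.0, 30.0], borrow=True)         # a later call overwrites the buffer

print(safe)     # still the doubled values from the first call
print(aliased)  # now holds the doubled values from the last call
```

If you need to keep a borrowed result across calls, copy it first (for a CudaNdarray, converting with numpy.asarray as in the program above produces a host-side copy).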

What can be accelerated on the GPU?

The performance characteristics will change as we continue to optimize our implementations, and vary from device to device, but to give a rough idea of what to expect right now:

  • Only computations with float32 data-type can be accelerated. Better support for float64 is expected in upcoming hardware but float64 computations are still relatively slow (Jan 2010).
  • Matrix multiplication, convolution, and large element-wise operations can be accelerated a lot (5-50x) when arguments are large enough to keep 30 processors busy.
  • Indexing, dimension-shuffling and constant-time reshaping are as fast on the GPU as on the CPU.
  • Summation over rows/columns of tensors can be a little slower on the GPU than on the CPU.
  • Copying of large quantities of data to and from a device is relatively slow, and often cancels most of the advantage of one or two accelerated functions on that data. Getting GPU performance largely hinges on making data transfer to the device pay off.

Tips for improving performance on GPU

  • Consider adding floatX = float32 to your .theanorc file if you plan to do a lot of GPU work.
  • Prefer constructors like ‘matrix’, ‘vector’ and ‘scalar’ to ‘dmatrix’, ‘dvector’ and ‘dscalar’, because the former will give you float32 variables when floatX=float32.
  • Ensure that your output variables have a float32 dtype and not float64. The more float32 variables are in your graph, the more work the GPU can do for you.
  • Minimize transfers to the GPU device by using shared ‘float32’ variables to store frequently-accessed data (see shared()). When using the GPU, ‘float32’ tensor shared variables are stored on the GPU by default to eliminate transfer time for GPU ops using those variables.
  • If you aren’t happy with the performance you see, try building your functions with mode='PROFILE_MODE'. This should print some timing information at program termination (atexit). Is time being used sensibly? If an Op or Apply is taking more time than its share, and you know something about GPU programming, have a look at how it’s implemented in theano.sandbox.cuda. Check the line like ‘Spent Xs(X%) in cpu Op, Xs(X%) in gpu Op and Xs(X%) transfert Op’, which can tell you whether not enough of your graph is on the GPU or whether there is too much memory transfer.
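If you collect profiler output in a log, the summary line from the last tip can be picked apart with a regular expression. This is a rough sketch under stated assumptions: the function parse_profile_summary is hypothetical, the sample line is made up, and the exact number formatting of the real output is assumed rather than taken from Theano's source.

```python
import re

# Matches a line shaped like the one quoted in the last tip. The spelling
# 'transfert' follows that quoted line; the number format is an assumption.
PATTERN = re.compile(
    r'Spent\s+([\d.]+)s\(([\d.]+)%\) in cpu Op,\s+'
    r'([\d.]+)s\(([\d.]+)%\) in gpu Op and\s+'
    r'([\d.]+)s\(([\d.]+)%\) transfert Op')


def parse_profile_summary(line):
    """Return {'cpu'|'gpu'|'transfer': (seconds, percent)}, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    values = [float(group) for group in match.groups()]
    return {
        'cpu': (values[0], values[1]),
        'gpu': (values[2], values[3]),
        'transfer': (values[4], values[5]),
    }


# A made-up sample line, for illustration only.
sample = ('Spent 0.120s(12.0%) in cpu Op, 0.700s(70.0%) in gpu Op '
          'and 0.180s(18.0%) transfert Op')
summary = parse_profile_summary(sample)
print(summary['gpu'])
print(summary['transfer'])
```

A large cpu share suggests that parts of the graph are not running on the GPU, and a large transfer share suggests that data is moving between host and device too often.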