One of Theano’s design goals is to specify computations at an abstract level, so that the internal function compiler has a lot of flexibility in how it carries out those computations. One way we take advantage of this flexibility is by carrying out calculations on an Nvidia graphics card when your computer has a CUDA-enabled device.
The first thing you’ll need for Theano to use your GPU is Nvidia’s GPU-programming toolchain. You should install at least the CUDA driver and the CUDA Toolkit, as described here. The CUDA Toolkit installs a folder on your computer with subfolders bin, lib, include, and a few others. (Sanity check: the bin subfolder should contain an nvcc program, which is the compiler for GPU code.) This folder is called the cuda root directory. On Linux or OS-X >= 10.4, you must add the ‘lib’ subdirectory (and/or the ‘lib64’ subdirectory if you have a 64-bit computer) to your $LD_LIBRARY_PATH environment variable.
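The sanity check above can be scripted. The following is a minimal sketch in plain Python, not part of Theano: the function name check_cuda_root is hypothetical, and it only tests for the bin, lib, and include subfolders and the nvcc program mentioned in the text.

```python
import os


def check_cuda_root(cuda_root):
    """Return a list of problems found in a candidate cuda root directory.

    An empty list means the directory has the layout described above.
    (Hypothetical helper, not a Theano function.)
    """
    problems = []
    # The toolkit folder should contain at least these subfolders.
    for sub in ('bin', 'lib', 'include'):
        if not os.path.isdir(os.path.join(cuda_root, sub)):
            problems.append('missing subfolder: ' + sub)
    # nvcc, the compiler for GPU code, should live in bin.
    if not os.path.isfile(os.path.join(cuda_root, 'bin', 'nvcc')):
        problems.append('missing program: bin/nvcc')
    return problems
```

Calling check_cuda_root('/path/to/cuda/root') on a complete installation should return an empty list; anything else names what is missing.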
You must tell Theano where the cuda root folder is, and there are three ways to do it. Any one of them is enough.
Once that is done, the only thing left is to change the device option to name the GPU device in your computer. For example: THEANO_FLAGS='cuda.root=/path/to/cuda/root,device=gpu0'. You can also set the device option in the [global] section of the .theanorc file. If your computer has multiple GPU devices, you can address them as gpu0, gpu1, gpu2, or gpu3. (If you have more than 4 devices you are very lucky, but you’ll have to modify Theano’s configdefaults.py file and define more GPU devices to choose from.)
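If you prefer to set the flags from inside a Python script rather than on the command line, something like the sketch below may help. It assumes that Theano reads THEANO_FLAGS when it is first imported, so the variable must be set before that import; build_theano_flags is a hypothetical helper, not a Theano function, and the path is the placeholder from the example above.

```python
import os


def build_theano_flags(options):
    """Join (name, value) pairs into a THEANO_FLAGS string.

    (Hypothetical helper, not a Theano function.)
    """
    return ','.join('%s=%s' % (name, value) for name, value in options)


flags = build_theano_flags([
    ('cuda.root', '/path/to/cuda/root'),  # placeholder path from the text
    ('device', 'gpu0'),                   # first GPU
])

# Set the variable before any "import theano" runs in this process.
os.environ['THEANO_FLAGS'] = flags
print(flags)  # cuda.root=/path/to/cuda/root,device=gpu0
```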
Note
There is a compatibility issue affecting some Ubuntu 9.10 users, and probably anyone using CUDA 2.3 with gcc-4.4.
Symptom: errors about “__sync_fetch_and_add” being undefined.
Solution 1: make gcc-4.3 the default gcc (http://pascalg.wordpress.com/2010/01/14/cuda-on-ubuntu-9-10linux-mint-helena/).
Solution 2: make another gcc (e.g. gcc-4.3) the default just for nvcc. Do this by making a directory (e.g. $HOME/.theano/nvcc-bindir) and installing two symlinks in it: one called gcc pointing to gcc-4.3 (or lower) and one called g++ pointing to g++-4.3 (or lower). Then add compiler_bindir = /path/to/nvcc-bindir to the [nvcc] section of your .theanorc.
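Solution 2 can be carried out by hand with mkdir and ln -s, or scripted roughly as follows. This is a sketch under assumptions: make_nvcc_bindir is a hypothetical helper, and the compiler locations you pass in (for example /usr/bin/gcc-4.3 and /usr/bin/g++-4.3) depend on your distribution.

```python
import os


def make_nvcc_bindir(bindir, gcc_path, gxx_path):
    """Create bindir holding 'gcc' and 'g++' symlinks for nvcc to use.

    gcc_path and gxx_path should point at the older compilers; their
    locations are an assumption you need to check on your own system.
    Returns the text to place in .theanorc.
    """
    if not os.path.isdir(bindir):
        os.makedirs(bindir)
    for link_name, target in (('gcc', gcc_path), ('g++', gxx_path)):
        link = os.path.join(bindir, link_name)
        if os.path.lexists(link):
            os.remove(link)  # replace a stale link from an earlier run
        os.symlink(target, link)
    return '[nvcc]\ncompiler_bindir = %s\n' % bindir
```

For instance, make_nvcc_bindir(os.path.expanduser('~/.theano/nvcc-bindir'), '/usr/bin/gcc-4.3', '/usr/bin/g++-4.3') would create the two links and return the lines to add to your .theanorc.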
To see if your GPU is being used, cut and paste the following program into a file and run it.
from theano import function, config, shared, sandbox
import theano.tensor as T
import numpy
import time
vlen = 10 * 30 * 768 # 10 x #cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))
t0 = time.time()
for i in xrange(iters):
    r = f()
print 'Looping %d times took'%iters, time.time() - t0, 'seconds'
print 'Result is', r
print 'Used the', 'cpu' if any([isinstance(node.op, T.Elemwise) for node in f.maker.env.toposort()]) else 'gpu'
The program just computes the exp() of a bunch of random numbers. Note that we use the shared function to make sure that the input x is stored on the graphics device.
If I run this program (in thing.py) with device=cpu, my computer takes a little over 7 seconds, whereas on the GPU it takes just over 0.4 seconds. Note that the results are close but not identical! The GPU will not always produce the exact same floating-point numbers as the CPU. As a point of reference, a loop that calls numpy.exp(x.value) also takes about 7 seconds.
$ THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python thing.py
Looping 100 times took 7.17374897003 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753 1.62323285]
$ THEANO_FLAGS=mode=FAST_RUN,device=gpu0,floatX=float32 python thing.py
Using gpu device 0: GeForce GTX 285
Looping 100 times took 0.418929815292 seconds
Result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813 2.29967761 1.62323296]
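To see what “close but not identical” means for the two runs above, the printed values can be compared directly. The sketch below uses only the six entries visible in the output; the relative tolerance of 1e-6 is an assumption chosen to suit float32, not a value taken from Theano. (numpy.allclose performs the analogous check on whole arrays.)

```python
# Values copied from the two 'Result is' lines above. The elided middle
# of each array is not available, so only the printed entries are used.
cpu = [1.23178032, 1.61879341, 1.52278065, 2.20771815, 2.29967753, 1.62323285]
gpu = [1.23178029, 1.61879349, 1.52278066, 2.20771813, 2.29967761, 1.62323296]


def close(a, b, rel_tol=1e-6):
    """True if a and b agree to within a relative tolerance."""
    return abs(a - b) <= rel_tol * max(abs(a), abs(b))


exactly_equal = all(c == g for c, g in zip(cpu, gpu))
nearly_equal = all(close(c, g) for c, g in zip(cpu, gpu))
largest_gap = max(abs(c - g) for c, g in zip(cpu, gpu))

print(exactly_equal)  # False
print(nearly_equal)   # True
```

When testing code that may run on either device, compare results with a tolerance like this rather than with ==.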
Note that for now GPU operations in Theano require floatX to be float32 (see below also).
The speedup in the example above is limited because the function returns its result as a numpy ndarray, which has already been copied from the device to the host for your convenience. This is what makes it so easy to swap in device=gpu0, but if you don’t mind being less portable, you can get a bigger speedup by changing the graph to express a computation with a GPU-stored result. The gpu_from_host Op means “copy the input from the host to the GPU”, and it is optimized away after T.exp(x) is replaced by a GPU version of exp().
from theano import function, config, shared, sandbox
import theano.tensor as T
import numpy
import time
vlen = 10 * 30 * 768 # 10 x #cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], sandbox.cuda.basic_ops.gpu_from_host(T.exp(x)))
t0 = time.time()
for i in xrange(iters):
    r = f()
print 'Looping %d times took'%iters, time.time() - t0, 'seconds'
print 'Result is', r
print 'Numpy result is', numpy.asarray(r)
print 'Used the', 'cpu' if any([isinstance(node.op, T.Elemwise) for node in f.maker.env.toposort()]) else 'gpu'
The output from this program is
Using gpu device 0: GeForce GTX 285
Looping 100 times took 0.185714006424 seconds
Result is <CudaNdarray object at 0x3e9e970>
Numpy result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813 2.29967761 1.62323296]
Here we’ve shaved off more than half of the run-time by simply not copying the resulting array back to the host. The object returned by each function call is now not a numpy array but a “CudaNdarray”, which can be converted to a numpy ndarray with numpy.asarray, as the program above does.
To really get maximum performance in this simple example, we need to use an Out instance to tell Theano not to copy the output it returns to us. Theano allocates memory for internal use as a working buffer, but by default it will never return a result that is allocated in the working buffer. This is normally what you want, but our example is so simple that it has the unwanted side-effect of really slowing things down.
from theano import function, config, shared, sandbox, Out
import theano.tensor as T
import numpy
import time
vlen = 10 * 30 * 768 # 10 x #cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([],
        Out(sandbox.cuda.basic_ops.gpu_from_host(T.exp(x)),
            borrow=True))
t0 = time.time()
for i in xrange(iters):
    r = f()
print 'Looping %d times took'%iters, time.time() - t0, 'seconds'
print 'Result is', r
print 'Numpy result is', numpy.asarray(r)
print 'Used the', 'cpu' if any([isinstance(node.op, T.Elemwise) for node in f.maker.env.toposort()]) else 'gpu'
Running this version of the code takes just under 0.05 seconds, over 140x faster than the CPU implementation!
Using gpu device 0: GeForce GTX 285
Looping 100 times took 0.0497219562531 seconds
Result is <CudaNdarray object at 0x31eeaf0>
Numpy result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813 2.29967761 1.62323296]
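The speedup figures can be reproduced from the timings printed by the runs above. The short calculation below only does that arithmetic; the labels are ours, and the numbers are specific to the machine that produced those outputs.

```python
# Wall-clock times in seconds, copied from the program outputs above.
timings = [
    ('cpu', 7.17374897003),
    ('gpu, result copied to host', 0.418929815292),
    ('gpu, result left on device', 0.185714006424),
    ('gpu, borrow=True', 0.0497219562531),
]

cpu_time = timings[0][1]
speedups = dict((name, cpu_time / seconds) for name, seconds in timings)

for name, seconds in timings:
    print('%-28s %8.4f s  %6.1fx' % (name, seconds, speedups[name]))
```

The three GPU variants come out at roughly 17x, 39x, and 144x the CPU speed respectively.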
This version of the code using borrow=True is slightly less safe: if we had saved the r returned from one function call, we would have to remember that its value might be overwritten by a subsequent function call. Although borrow=True makes a dramatic difference in this example, be careful! The advantage of borrow=True is much weaker in larger graphs, and there is a lot of potential for making a mistake by failing to account for the resulting memory aliasing.
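The aliasing hazard is easier to see in a toy model. The class below is an analogy written for illustration only: ToyFunction is hypothetical and is not how Theano is implemented. It keeps one working buffer and either hands out a copy of it or the buffer itself.

```python
class ToyFunction(object):
    """Toy stand-in for a compiled function with an internal work buffer.

    Illustration only; this is not Theano's implementation.
    """

    def __init__(self, size, borrow):
        self.buffer = [0.0] * size  # internal working memory, reused per call
        self.borrow = borrow
        self.calls = 0

    def __call__(self):
        self.calls += 1
        for i in range(len(self.buffer)):
            self.buffer[i] = float(self.calls)  # pretend result of this call
        if self.borrow:
            return self.buffer    # caller sees the work buffer itself
        return list(self.buffer)  # caller gets a private copy


safe = ToyFunction(3, borrow=False)
safe_first = safe()
safe()
print(safe_first)  # [1.0, 1.0, 1.0]: unaffected by the second call

fast = ToyFunction(3, borrow=True)
fast_first = fast()
fast()
print(fast_first)  # [2.0, 2.0, 2.0]: overwritten by the second call
```

If you need to keep a borrowed result across calls, copy it first (here, list(fast_first)) before calling the function again.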
The performance characteristics will change as we continue to optimize our implementations, and vary from device to device, but to give a rough idea of what to expect right now: