For an introductory discussion of *Graphical Processing Units* (GPU)
and their use for intensive parallel computation purposes, see GPGPU.

One of Theano’s design goals is to specify computations at an abstract level, so that the internal function compiler has a lot of flexibility about how to carry out those computations. One of the ways we take advantage of this flexibility is in carrying out calculations on a graphics card.

There are two ways currently to use a gpu, one of which only supports NVIDIA cards (*CUDA backend*) and the other, in development, that should support any OpenCL device as well as NVIDIA cards (*GpuArray Backend*).

If you have not done so already, you will need to install Nvidia’s
GPU-programming toolchain (CUDA) and configure Theano to use it.
We provide installation instructions for *Linux*,
*MacOS* and *Windows*.

To see if your GPU is being used, cut and paste the following program into a file and run it.

```
from theano import function, config, shared, sandbox
import theano.tensor as T
import numpy
import time
vlen = 10 * 30 * 768 # 10 x #cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))
print f.maker.fgraph.toposort()
t0 = time.time()
for i in xrange(iters):
r = f()
t1 = time.time()
print 'Looping %d times took' % iters, t1 - t0, 'seconds'
print 'Result is', r
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
print 'Used the cpu'
else:
print 'Used the gpu'
```

The program just computes the `exp()` of a bunch of random numbers.
Note that we use the `shared` function to
make sure that the input *x* is stored on the graphics device.

If I run this program (in check1.py) with `device=cpu`, my computer takes a little over 3 seconds,
whereas on the GPU it takes just over 0.64 seconds. The GPU will not always produce the exact
same floating-point numbers as the CPU. As a benchmark, a loop that calls `numpy.exp(x.get_value())` takes about 46 seconds.

```
$ THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python check1.py
[Elemwise{exp,no_inplace}(<TensorType(float32, vector)>)]
Looping 1000 times took 3.06635117531 seconds
Result is [ 1.23178029 1.61879337 1.52278066 ..., 2.20771813 2.29967761
1.62323284]
Used the cpu
$ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python check1.py
Using gpu device 0: GeForce GTX 580
[GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>), HostFromGpu(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 0.638810873032 seconds
Result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813 2.29967761
1.62323296]
Used the gpu
```

Note that GPU operations in Theano require for now `floatX` to be *float32* (see also below).

The speedup is not greater in the preceding example because the function is
returning its result as a NumPy ndarray which has already been copied from the
device to the host for your convenience. This is what makes it so easy to swap in `device=gpu`, but
if you don’t mind less portability, you might gain a bigger speedup by changing
the graph to express a computation with a GPU-stored result. The `gpu_from_host`
op means “copy the input from the host to the GPU” and it is optimized away
after the `T.exp(x)` is replaced by a GPU version of `exp()`.

```
from theano import function, config, shared, sandbox
import theano.tensor as T
import numpy
import time
vlen = 10 * 30 * 768 # 10 x #cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], sandbox.cuda.basic_ops.gpu_from_host(T.exp(x)))
print f.maker.fgraph.toposort()
t0 = time.time()
for i in xrange(iters):
r = f()
t1 = time.time()
print 'Looping %d times took' % iters, t1 - t0, 'seconds'
print 'Result is', r
print 'Numpy result is', numpy.asarray(r)
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
print 'Used the cpu'
else:
print 'Used the gpu'
```

The output from this program is

```
$ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python check2.py
Using gpu device 0: GeForce GTX 580
[GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>)]
Looping 1000 times took 0.34898686409 seconds
Result is <CudaNdarray object at 0x6a7a5f0>
Numpy result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813 2.29967761
1.62323296]
Used the gpu
```

Here we’ve shaved off about 50% of the run-time by simply not copying
the resulting array back to the host. The object returned by each
function call is now not a NumPy array but a “CudaNdarray” which can
be converted to a NumPy ndarray by the normal NumPy casting mechanism
using something like `numpy.asarray()`.

For even more speed you can play with the `borrow` flag. See
*Borrowing when Constructing Function Objects*.

The performance characteristics will change as we continue to optimize our implementations, and vary from device to device, but to give a rough idea of what to expect right now:

- Only computations
with
*float32*data-type can be accelerated. Better support for*float64*is expected in upcoming hardware but*float64*computations are still relatively slow (Jan 2010). - Matrix multiplication, convolution, and large element-wise operations can be accelerated a lot (5-50x) when arguments are large enough to keep 30 processors busy.
- Indexing, dimension-shuffling and constant-time reshaping will be equally fast on GPU as on CPU.
- Summation over rows/columns of tensors can be a little slower on the GPU than on the CPU.
- Copying of large quantities of data to and from a device is relatively slow, and often cancels most of the advantage of one or two accelerated functions on that data. Getting GPU performance largely hinges on making data transfer to the device pay off.

- Consider
adding
`floatX=float32`to your`.theanorc`file if you plan to do a lot of GPU work. - Use the Theano flag
`allow_gc=False`. See*GPU Async capabilities* - Prefer
constructors like
`matrix`,`vector`and`scalar`to`dmatrix`,`dvector`and`dscalar`because the former will give you*float32*variables when`floatX=float32`. - Ensure
that your output variables have a
*float32*dtype and not*float64*. The more*float32*variables are in your graph, the more work the GPU can do for you. - Minimize
tranfers to the GPU device by using
`shared`*float32*variables to store frequently-accessed data (see`shared()`). When using the GPU,*float32*tensor`shared`variables are stored on the GPU by default to eliminate transfer time for GPU ops using those variables. - If you aren’t happy with the performance you see, try building your functions with
`mode='ProfileMode'`. This should print some timing information at program termination. Is time being used sensibly? If an op or Apply is taking more time than its share, then if you know something about GPU programming, have a look at how it’s implemented in theano.sandbox.cuda. Check the line similar to*Spent Xs(X%) in cpu op, Xs(X%) in gpu op and Xs(X%) in transfer op*. This can tell you if not enough of your graph is on the GPU or if there is too much memory transfer. - Use nvcc options. nvcc supports those options to speed up some computations: -ftz=true to flush denormals values to zeros., –prec-div=false and –prec-sqrt=false options to speed up division and square root operation by being less precise. You can enable all of them with the nvcc.flags=–use_fast_math Theano flag or you can enable them individually as in this example: nvcc.flags=-ftz=true –prec-div=false.

Ever since Theano 0.6 we started to use the asynchronous capability of GPUs. This allows us to be faster but with the possibility that some errors may be raised later than when they should occur. This can cause difficulties when profiling Theano apply nodes. There is a NVIDIA driver feature to help with these issues. If you set the environment variable CUDA_LAUNCH_BLOCKING=1 then all kernel calls will be automatically synchronized. This reduces performance but provides good profiling and appropriately placed error messages.

This feature interacts with Theano garbage collection of intermediate
results. To get the most of this feature, you need to disable the gc
as it inserts synchronization points in the graph. Set the Theano flag
`allow_gc=False` to get even faster speed! This will raise the memory
usage.

If you have not done so already, you will need to install libgpuarray as well as at least one computing toolkit. Instructions for doing so are provided at libgpuarray.

While all types of devices are supported if using OpenCL, for the remainder of this section, whatever compute device you are using will be referred to as GPU.

Warning

While it is fully our intention to support OpenCL, as of May 2014 this support is still in its infancy. A lot of very useful ops still do not support it because they were ported from the old backend with minimal change.

To see if your GPU is being used, cut and paste the following program into a file and run it.

```
from theano import function, config, shared, tensor, sandbox
import numpy
import time
vlen = 10 * 30 * 768 # 10 x #cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], tensor.exp(x))
print f.maker.fgraph.toposort()
t0 = time.time()
for i in xrange(iters):
r = f()
t1 = time.time()
print 'Looping %d times took' % iters, t1 - t0, 'seconds'
print 'Result is', r
if numpy.any([isinstance(x.op, tensor.Elemwise) and
('Gpu' not in type(x.op).__name__)
for x in f.maker.fgraph.toposort()]):
print 'Used the cpu'
else:
print 'Used the gpu'
```

The program just compute `exp()` of a bunch of random numbers. Note
that we use the `theano.shared()` function to make sure that the
input *x* is stored on the GPU.

```
$ THEANO_FLAGS=device=cpu python check1.py
[Elemwise{exp,no_inplace}(<TensorType(float64, vector)>)]
Looping 1000 times took 2.6071999073 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
1.62323285]
Used the cpu
$ THEANO_FLAGS=device=cuda0 python check1.py
Using device cuda0: GeForce GTX 275
[GpuElemwise{exp,no_inplace}(<GpuArray<float64>>), HostFromGpu(gpuarray)(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 2.28562092781 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
1.62323285]
Used the gpu
```

By default functions that execute on the GPU still return a standard
numpy ndarray. A transfer operation is inserted just before the
results are returned to ensure a consistent interface with CPU code.
This allows changing the deivce some code runs on by only replacing
the value of the `device` flag without touching the code.

If you don’t mind a loss of flexibility, you can ask theano to return the GPU object directly. The following code is modifed to do just that.

```
from theano import function, config, shared, tensor, sandbox
import numpy
import time
vlen = 10 * 30 * 768 # 10 x #cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], sandbox.gpuarray.basic_ops.gpu_from_host(tensor.exp(x)))
print f.maker.fgraph.toposort()
t0 = time.time()
for i in xrange(iters):
r = f()
t1 = time.time()
print 'Looping %d times took' % iters, t1 - t0, 'seconds'
print 'Result is', numpy.asarray(r)
if numpy.any([isinstance(x.op, tensor.Elemwise) and
('Gpu' not in type(x.op).__name__)
for x in f.maker.fgraph.toposort()]):
print 'Used the cpu'
else:
print 'Used the gpu'
```

Here the `theano.sandbox.gpuarray.basic.gpu_from_host()` call
means “copy input to the GPU”. However during the optimization phase,
since the result will already be on th gpu, it will be removed. It is
used here to tell theano that we want the result on the GPU.

The output is

```
$ THEANO_FLAGS=device=cuda0 python check2.py
Using device cuda0: GeForce GTX 275
[GpuElemwise{exp,no_inplace}(<GpuArray<float64>>)]
Looping 1000 times took 0.455810785294 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
1.62323285]
Used the gpu
```

While the time per call appears to be much lower than the two previous invocations (and should indeed be lower, since we avoid a transfer) the massive speedup we obtained is in part due to asynchronous nature of execution on GPUs, meaning that the work isn’t completed yet, just ‘launched’. We’ll talk about that later.

The object returned is a GpuArray from pygpu. It mostly acts as a
numpy ndarray with some exceptions due to its data being on the GPU.
You can copy it to the host and convert it to a regular ndarray by
using usual numpy casting such as `numpy.asarray()`.

For even more speed, you can play with the `borrow` flag. See
*Borrowing when Constructing Function Objects*.

The performance characteristics will of course vary from device to device, and also as we refine our implementation.

This backend supports all regular theano data types (float32, float64, int, ...) however GPU support varies and some units can’t deal with double (float64) or small (less than 32 bits like int16) data types. You will get an error at compile time or runtime if this is the case.

Complex support is untested and most likely completely broken.

In general, large operations like matrix multiplication, or element-wise operations with large inputs, will be significatly faster.

By default, all operations on the GPU are run asynchronously. This means that they are only scheduled to run and the function returns. This is made somewhat transparently by the underlying libgpuarray.

A forced synchronization point is introduced when doing memory transfers between device and host. Another is introduced when releasing active memory buffers on the GPU (active buffers are buffers that are still in use by a kernel).

It is possible to force synchronization for a particular GpuArray by
calling its `sync()` method. This is useful to get accurate timings
when doing benchmarks.

The forced synchronization points interact with the garbage collection
of the intermediate results. To get the fastest speed possible, you
should disable the garbage collector by using the theano flag
`allow_gc=False`. Be aware that this will increase memory usage
sometimes significantly.

Leaving aside Theano which is a meta-programmer, there are:

**CUDA**: GPU programming API by NVIDIA based on extension to C (CUDA C)- Vendor-specific
- Numeric libraries (BLAS, RNG, FFT) are maturing.

**OpenCL**: multi-vendor version of CUDA- More general, standardized.
- Fewer libraries, lesser spread.

**PyCUDA**: Python bindings to CUDA driver interface allow to access Nvidia’s CUDA parallel computation API from PythonConvenience:

Makes it easy to do GPU meta-programming from within Python.

Abstractions to compile low-level CUDA code from Python (

`pycuda.driver.SourceModule`).GPU memory buffer (

`pycuda.gpuarray.GPUArray`).Helpful documentation.

Completeness: Binding to all of CUDA’s driver API.

Automatic error checking: All CUDA errors are automatically translated into Python exceptions.

Speed: PyCUDA’s base layer is written in C++.

Good memory management of GPU objects:

Object cleanup tied to lifetime of objects (RAII, ‘Resource Acquisition Is Initialization’).

Makes it much easier to write correct, leak- and crash-free code.

PyCUDA knows about dependencies (e.g. it won’t detach from a context before all memory allocated in it is also freed).

(This is adapted from PyCUDA’s documentation and Andreas Kloeckner’s website on PyCUDA.)

**PyOpenCL**: PyCUDA for OpenCL

If you already enjoy a good proficiency with the C programming language, you may easily leverage your knowledge by learning, first, to program a GPU with the CUDA extension to C (CUDA C) and, second, to use PyCUDA to access the CUDA API with a Python wrapper.

The following resources will assist you in this learning process:

**CUDA API and CUDA C: Introductory****CUDA API and CUDA C: Advanced**- MIT IAP2009 CUDA (full coverage: lectures, leading Kirk-Hwu textbook, examples, additional resources)
- Course U. of Illinois (full lectures, Kirk-Hwu textbook)
- NVIDIA’s knowledge base (extensive coverage, levels from introductory to advanced)
- practical issues (on the relationship between grids, blocks and threads; see also linked and related issues on same page)
- CUDA optimisation

**PyCUDA: Introductory****PYCUDA: Advanced**

The following examples give a foretaste of programming a GPU with PyCUDA. Once you feel competent enough, you may try yourself on the corresponding exercises.

**Example: PyCUDA**

```
# (from PyCUDA's documentation)
import pycuda.autoinit
import pycuda.driver as drv
import numpy
from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
const int i = threadIdx.x;
dest[i] = a[i] * b[i];
}
""")
multiply_them = mod.get_function("multiply_them")
a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.zeros_like(a)
multiply_them(
drv.Out(dest), drv.In(a), drv.In(b),
block=(400,1,1), grid=(1,1))
assert numpy.allclose(dest, a*b)
print dest
```

Run the preceding example.

Modify and execute to work for a matrix of shape (20, 10).

**Example: Theano + PyCUDA**

```
import numpy, theano
import theano.misc.pycuda_init
from pycuda.compiler import SourceModule
import theano.sandbox.cuda as cuda
class PyCUDADoubleOp(theano.Op):
def __eq__(self, other):
return type(self) == type(other)
def __hash__(self):
return hash(type(self))
def __str__(self):
return self.__class__.__name__
def make_node(self, inp):
inp = cuda.basic_ops.gpu_contiguous(
cuda.basic_ops.as_cuda_ndarray_variable(inp))
assert inp.dtype == "float32"
return theano.Apply(self, [inp], [inp.type()])
def make_thunk(self, node, storage_map, _, _2):
mod = SourceModule("""
__global__ void my_fct(float * i0, float * o0, int size) {
int i = blockIdx.x*blockDim.x + threadIdx.x;
if(i<size){
o0[i] = i0[i]*2;
}
}""")
pycuda_fct = mod.get_function("my_fct")
inputs = [ storage_map[v] for v in node.inputs]
outputs = [ storage_map[v] for v in node.outputs]
def thunk():
z = outputs[0]
if z[0] is None or z[0].shape!=inputs[0][0].shape:
z[0] = cuda.CudaNdarray.zeros(inputs[0][0].shape)
grid = (int(numpy.ceil(inputs[0][0].size / 512.)),1)
pycuda_fct(inputs[0][0], z[0], numpy.intc(inputs[0][0].size),
block=(512,1,1), grid=grid)
return thunk
```

Use this code to test it:

```
>>> x = theano.tensor.fmatrix()
>>> f = theano.function([x], PyCUDADoubleOp()(x))
>>> xv=numpy.ones((4,5), dtype="float32")
>>> assert numpy.allclose(f(xv), xv*2)
>>> print numpy.asarray(f(xv))
```

Run the preceding example.

Modify and execute to multiply two matrices: *x* * *y*.

Modify and execute to return two outputs: *x + y* and *x - y*.

(Notice that Theano’s current *elemwise fusion* optimization is
only applicable to computations involving a single output. Hence, to gain
efficiency over the basic solution that is asked here, the two operations would
have to be jointly optimized explicitly in the code.)

Modify and execute to support *stride* (i.e. to avoid constraining the input to be *C-contiguous*).