PyCUDA/CUDAMat/Gnumpy compatibility


Currently, PyCUDA and Theano have different objects to store GPU data. The two implementations do not support the same set of features. Theano’s implementation is called CudaNdarray and supports strides. It also only supports the float32 dtype. PyCUDA’s implementation is called GPUArray and doesn’t support strides. However, it can deal with all NumPy and CUDA dtypes.

We are currently working on having the same base object for both that will also mimic Numpy. Until this is ready, here is some information on how to use both objects in the same script.


You can use the theano.misc.pycuda_utils module to convert GPUArray to and from CudaNdarray. The functions to_cudandarray(x, copyif=False) and to_gpuarray(x) return a new object that occupies the same memory space as the original. Otherwise it raises a ValueError. Because GPUArrays don’t support strides, if the CudaNdarray is strided, we could copy it to have a non-strided copy. The resulting GPUArray won’t share the same memory region. If you want this behavior, set copyif=True in to_gpuarray.

Compiling with PyCUDA

You can use PyCUDA to compile CUDA functions that work directly on CudaNdarrays. Here is an example from the file theano/misc/tests/

import sys
import numpy
import theano
import theano.sandbox.cuda as cuda_ndarray
import theano.misc.pycuda_init
import pycuda
import pycuda.driver as drv
import pycuda.gpuarray

def test_pycuda_theano():
    """Simple example with pycuda function and Theano CudaNdarray object."""
    from pycuda.compiler import SourceModule
    mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
  const int i = threadIdx.x;
  dest[i] = a[i] * b[i];

    multiply_them = mod.get_function("multiply_them")

    a = numpy.random.randn(100).astype(numpy.float32)
    b = numpy.random.randn(100).astype(numpy.float32)

    # Test with Theano object
    ga = cuda_ndarray.CudaNdarray(a)
    gb = cuda_ndarray.CudaNdarray(b)
    dest = cuda_ndarray.CudaNdarray.zeros(a.shape)
    multiply_them(dest, ga, gb,
                  block=(400, 1, 1), grid=(1, 1))
    assert (numpy.asarray(dest) == a * b).all()

Theano Op using a PyCUDA function

You can use a GPU function compiled with PyCUDA in a Theano op:

import numpy, theano
import theano.misc.pycuda_init
from pycuda.compiler import SourceModule
import theano.sandbox.cuda as cuda

class PyCUDADoubleOp(theano.Op):
    __props__ = ()
    def make_node(self, inp):
        inp = cuda.basic_ops.gpu_contiguous(
        assert inp.dtype == "float32"
        return theano.Apply(self, [inp], [inp.type()])
    def make_thunk(self, node, storage_map, _, _2, impl=None):
        mod = SourceModule("""
    __global__ void my_fct(float * i0, float * o0, int size) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
        o0[i] = i0[i] * 2;
        pycuda_fct = mod.get_function("my_fct")
        inputs = [ storage_map[v] for v in node.inputs]
        outputs = [ storage_map[v] for v in node.outputs]
        def thunk():
            z = outputs[0]
            if z[0] is None or z[0].shape!=inputs[0][0].shape:
                z[0] = cuda.CudaNdarray.zeros(inputs[0][0].shape)
            grid = (int(numpy.ceil(inputs[0][0].size / 512.)),1)
            pycuda_fct(inputs[0][0], z[0], numpy.intc(inputs[0][0].size),
                       block=(512, 1, 1), grid=grid)
        thunk.lazy = False
        return thunk


There are functions for conversion between CUDAMat objects and Theano’s CudaNdArray objects. They obey the same principles as Theano’s PyCUDA functions and can be found in

WARNING: There is a peculiar problem associated with stride/shape with those converters. In order to work, the test needs a transpose and reshape...


There are conversion functions between Gnumpy garray objects and Theano CudaNdArray objects. They are also similar to Theano’s PyCUDA functions and can be found in