pygpu package

pygpu.gpuarray module

class pygpu.gpuarray.GpuArray

Device array

To create instances of this class, use zeros(), empty() or array(). It cannot be instantiated directly.

You can also subclass this class and have the module create instances of your subclass by passing the cls argument to any method that returns a new GpuArray. This way of creating the class will NOT call your __init__() method.

You can also implement your own __init__() method, but you must take care to properly initialize the GpuArray C fields before using the object, or you will most likely crash the interpreter.
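A minimal creation-and-readback sketch (hedged: it assumes a CUDA device reachable as "cuda0", and falls back to plain numpy when pygpu or a usable device is absent):

```python
import numpy as np

try:
    import pygpu
    ctx = pygpu.init("cuda0")                     # first CUDA device
    g = pygpu.gpuarray.zeros((4,), dtype="float32", context=ctx)
    host = np.asarray(g)                          # copy back to the host
except Exception:
    # No pygpu build or no usable device: numpy equivalent of the result.
    host = np.zeros((4,), dtype="float32")

print(host)
```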

T
astype(dtype, order='A', copy=True)

Cast the elements of this array to a new type.

This function returns a new array with all elements cast to the supplied dtype, but otherwise unchanged.

If copy is False and the type and order match, self is returned.

Parameters:
  • dtype (str or numpy.dtype or int) – type of the elements of the result
  • order ({'A', 'C', 'F'}) – memory layout of the result
  • copy (bool) – Always return a copy?
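GpuArray.astype follows numpy's semantics, so the copy=False short-circuit can be pictured with numpy itself (a host-side sketch, not pygpu code):

```python
import numpy as np

a = np.arange(4, dtype="float32")

b = a.astype("float64")                  # new array, elements cast
same = a.astype("float32", copy=False)   # dtype and order already match

print(b.dtype)      # float64
print(same is a)    # True: no copy was made
```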
base
base_data

Return a pointer to the backing OpenCL object.

context
copy(order='C')

Return a copy of this array.

Parameters:order ({'C', 'A', 'F'}) – memory layout of the copy
data

Return a pointer to the raw OpenCL buffer object.

This will fail for arrays that have an offset.

dtype

The dtype of the elements

flags

Return a flags object describing the properties of this array.

This is mostly numpy-compatible with some exceptions:
  • Flags are always constant (numpy allows modification of certain flags in certain circumstances).
  • OWNDATA is always True, since the data is refcounted in libgpuarray.
  • UPDATEIFCOPY is not supported, therefore always False.
get_ipc_handle()
gpudata

Return a pointer to the raw backend object.

itemsize

The size of the base element.

ndim

The number of dimensions in this object

offset

Return the offset into the gpudata pointer for this array.

read(dst)

Reads from this GpuArray into a host numpy array.

This method can be as fast as, or faster than, the __array__ method (and thus numpy.asarray), because it skips allocating a new buffer in host memory to hold the device's GpuArray: it uses the existing numpy ndarray dst as the destination. The GpuArray and the numpy array must be compatible in byte size, contiguity and data type. dst must also be writeable and properly aligned in host memory, and self must be contiguous. The GpuArray and dst are allowed to have different shapes.

Parameters:dst (numpy.ndarray) – destination array in host
Raises:ValueError – If this GpuArray is not compatible with dst or if dst is not well behaved.
reshape(shape, order='C')

Returns a new array with the given shape and order.

The new shape must have the same size (total number of elements) as the current one.
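The size constraint is the same as numpy's; a host-side illustration:

```python
import numpy as np

a = np.arange(6, dtype="float32")
b = a.reshape((2, 3))        # 6 elements -> 2x3: OK

try:
    a.reshape((4, 2))        # 8 elements requested: size mismatch
except ValueError:
    print("rejected: total size must stay the same")
```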

shape

shape of this ndarray (tuple)

size

The number of elements in this object.

strides

data pointer strides (in bytes)

sync()

Wait for all pending operations on this array.

This is done automatically when reading from or writing to the array, but can be useful as a separate operation for timing purposes.

take1(idx)
transfer(new_ctx)
transpose(*params)
typecode

The gpuarray typecode for the data type of the array

view(cls=GpuArray)

Return a view of this array.

The returned array shares device data with this one and both will reflect changes made to the other.

Parameters:cls (type) – class of the view (must inherit from GpuArray)
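The sharing behaviour is the same as a numpy view; as a host-side illustration (views of a GpuArray share device memory in the same way):

```python
import numpy as np

a = np.zeros(3, dtype="float32")
v = a.view()     # shares the same underlying buffer
v[0] = 7.0       # write through the view

print(a[0])      # the base array sees the change
```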
write(src)

Writes a host numpy array into this GpuArray on the device.

This method can be as fast as, or faster than, asarray(), because it skips a possible allocation of a buffer in device memory: the already-allocated GpuArray buffer receives the contents of src from host memory. The GpuArray and the numpy array must be compatible in byte size and data type. The GpuArray must also be well behaved and contiguous. If src is not suitably aligned or contiguous, it will first be copied to a new numpy array that is. The GpuArray and src are allowed to have different shapes.

Parameters:src (numpy.ndarray) – source array in host
Raises:ValueError – If this GpuArray is not compatible with src or if it is not well behaved or contiguous.
exception pygpu.gpuarray.GpuArrayException

Exception used for most errors related to libgpuarray.

class pygpu.gpuarray.GpuContext

Class that holds all the information pertaining to a context.

The currently implemented modules (for the kind parameter) are “cuda” and “opencl”. Which ones are available depends on the build options for libgpuarray.

The flag values are defined in the gpuarray/buffer.h header and are in the “Context flags” group. If you want to use more than one value you must bitwise OR them together.

If you want an alternative interface check init().

Parameters:
  • kind (str) – module name for the context
  • devno (int) – device number
  • flags (int) – context flags
bin_id

Binary compatibility id

devname

Device name for this context

free_gmem

Size of free global memory on the device

kind
largest_memblock

Size of the largest memory block you can allocate

lmemsize

Size of the local (shared) memory, in bytes, for this context

maxgsize0

Maximum global size for dimension 0

maxgsize1

Maximum global size for dimension 1

maxgsize2

Maximum global size for dimension 2

maxlsize0

Maximum local size for dimension 0

maxlsize1

Maximum local size for dimension 1

maxlsize2

Maximum local size for dimension 2

numprocs

Number of compute units for this context

ptr

Raw pointer value for the context object

total_gmem

Total size of global memory on the device

unique_id

Device PCI Bus ID for this context

class pygpu.gpuarray.GpuKernel(source, name, types, context=None, have_double=False, have_small=False, have_complex=False, have_half=False, cuda=False, opencl=False)

Compile a kernel on the device

The kernel function is retrieved using the provided name which must match what you named your kernel in source. You can safely reuse the same name multiple times.

The have_* parameters tell libgpuarray that we need the particular type or feature to work for this kernel. If the request can’t be satisfied, an UnsupportedException will be raised in the constructor.

Once you have the kernel object you can simply call it like so:

k = GpuKernel(...)
k(param1, param2, n=n)

where n is the minimum number of threads to run. libgpuarray will try to stay close to this number but may run a few more threads to match the hardware preferred multiple and stay efficient. You should watch out for this in your code and make sure to test against the size of your data.
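The rounding behaviour can be pictured with plain integer arithmetic (round_up is a hypothetical helper for illustration only; libgpuarray does something similar internally when choosing the launch size):

```python
def round_up(n, multiple):
    # Round n up to the nearest multiple of `multiple`.
    return ((n + multiple - 1) // multiple) * multiple

# Asking for a minimum of 1000 threads with a preferred multiple of 32:
print(round_up(1000, 32))  # 1024, i.e. 24 extra threads may run
```

This is why kernel code should guard its work with a test such as `if (i < n)` against the size of the data.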

If you want more control over thread allocation you can use the gs and ls parameters like so:

k = GpuKernel(...)
k(param1, param2, gs=gs, ls=ls)

If you choose to use this interface, make sure to stay within the limits of k.maxlsize or the call will fail.

Parameters:
  • source (str) – complete kernel source code
  • name (str) – function name of the kernel
  • types (list or tuple) – list of argument types
  • context (GpuContext) – device on which the kernel is compiled
  • have_double (bool) – ensure working doubles?
  • have_small (bool) – ensure types smaller than float will work?
  • have_complex (bool) – ensure complex types will work?
  • have_half (bool) – ensure half-floats will work?
  • cuda (bool) – kernel is cuda code?
  • opencl (bool) – kernel is opencl code?

Notes

With the cuda backend, unless you use the cluda include, you must either pass the mangled name of your kernel or declare the function extern "C", because cuda unconditionally uses a C++ compiler.

Warning

If you do not set the have_* flags properly, you will either get a device-specific error (the good case) or silently get completely bogus data (the bad case).
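Putting the pieces together, a hedged sketch of compiling and calling a kernel. The KERNEL, GLOBAL_MEM, GID_0, LDIM_0 and LID_0 macros come from the cluda include; the exact form of the types list is an assumption here, and the GPU path is guarded so the example degrades to a CPU reference when pygpu or a device is unavailable:

```python
import numpy as np

# Elementwise doubling. Note the `i < n` guard: a few extra threads may
# run beyond the requested minimum n.
src = """
KERNEL void dbl(GLOBAL_MEM float *a, ga_size n) {
    ga_size i = GID_0 * LDIM_0 + LID_0;
    if (i < n) a[i] = 2 * a[i];
}
"""

data = np.arange(4, dtype="float32")
try:
    import pygpu
    ctx = pygpu.init("cuda0")
    g = pygpu.gpuarray.array(data, context=ctx)
    # The types list below is an assumption about the expected format.
    k = pygpu.gpuarray.GpuKernel(src, "dbl",
                                 [pygpu.gpuarray.GpuArray, "uintp"],
                                 context=ctx)
    k(g, g.size, n=g.size)
    out = np.asarray(g)
except Exception:
    out = 2 * data  # CPU reference when no GPU/pygpu is available

print(out)
```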

context
maxlsize

Maximum local size for this kernel

numargs

Number of arguments to kernel

preflsize

Preferred multiple for local size for this kernel

exception pygpu.gpuarray.UnsupportedException
pygpu.gpuarray.abi_version()
pygpu.gpuarray.api_version()
pygpu.gpuarray.array(obj, dtype='float64', copy=True, order=None, ndmin=0, context=None, cls=None)

Create a GpuArray from existing data

This function creates a new GpuArray from the data provided in obj, except if obj is already a GpuArray, all the parameters match its properties, and copy is False.

The properties of the resulting array depend on the input data except where overridden by other parameters.

This function is similar to numpy.array() except that it returns GpuArrays.

Parameters:
  • obj (array-like) – data to initialize the result
  • dtype (string or numpy.dtype or int) – data type of the result elements
  • copy (bool) – return a copy?
  • order (str) – memory layout of the result
  • ndmin (int) – minimum number of result dimensions
  • context (GpuContext) – allocation context
  • cls (type) – result class (must inherit from GpuArray)
pygpu.gpuarray.asarray(a, dtype=None, order='A', context=None)

Returns a GpuArray from the data in a

If a is already a GpuArray and all other parameters match, the object itself is returned. If a is an instance of a subclass of GpuArray, a view of the base class will be returned. Otherwise a new object is created and the data is copied into it.

context is optional if a is a GpuArray (but must match exactly the context of a if specified) and is mandatory otherwise.

Parameters:
  • a (array-like) – data
  • dtype (str, numpy.dtype or int) – type of the elements
  • order ({'A', 'C', 'F'}) – layout of the data in memory, one of ‘A’ny, ‘C’ or ‘F’ortran
  • context (GpuContext) – context in which to do the allocation
pygpu.gpuarray.ascontiguousarray(a, dtype=None, context=None)

Returns a contiguous array in device memory (C order).

context is optional if a is a GpuArray (but must match exactly the context of a if specified) and is mandatory otherwise.

Parameters:
  • a (array-like) – input
  • dtype (str, numpy.dtype or int) – type of the return array
  • context (GpuContext) – context to use for a new array
pygpu.gpuarray.asfortranarray(a, dtype=None, context=None)

Returns a contiguous array in device memory (Fortran order)

context is optional if a is a GpuArray (but must match exactly the context of a if specified) and is mandatory otherwise.

Parameters:
  • a (array-like) – input
  • dtype (str, numpy.dtype or int) – type of the elements
  • context (GpuContext) – context in which to do the allocation
pygpu.gpuarray.cl_wrap_ctx(ptr)

Wrap an existing OpenCL context (the cl_context struct) into a GpuContext class.

pygpu.gpuarray.count_devices(kind, platform)

Returns the number of devices on the host's platform that are compatible with kind.

pygpu.gpuarray.count_platforms(kind)

Return the number of the host's platforms compatible with kind.

pygpu.gpuarray.cuda_wrap_ctx(ptr)

Wrap an existing CUDA driver context (CUcontext) into a GpuContext class.

If own is true, libgpuarray becomes responsible for the context and will destroy it once there are no more references to it. Otherwise, the context will not be destroyed and remains the calling code's responsibility.

pygpu.gpuarray.dtype_to_ctype(dtype)

Return the C name for a type.

Parameters:dtype (numpy.dtype) – type to get the name for
pygpu.gpuarray.dtype_to_typecode(dtype)

Get the internal typecode for a type.

Parameters:dtype (numpy.dtype) – type to get the code for
pygpu.gpuarray.empty(shape, dtype='float64', order='C', context=None, cls=None)

Returns an empty (uninitialized) array of the requested shape, type and order.

Parameters:
  • shape (iterable of ints) – number of elements in each dimension
  • dtype (str, numpy.dtype or int) – type of the elements
  • order ({'A', 'C', 'F'}) – layout of the data in memory, one of ‘A’ny, ‘C’ or ‘F’ortran
  • context (GpuContext) – context in which to do the allocation
  • cls (type) – class of the returned array (must inherit from GpuArray)
class pygpu.gpuarray.flags
aligned
behaved
c_contiguous
carray
contiguous
f_contiguous
farray
fnc
forc
fortran
num
owndata
updateifcopy
writeable
pygpu.gpuarray.from_gpudata(data, offset, dtype, shape, context=None, strides=None, writable=True, base=None, cls=None)

Build a GpuArray from pre-allocated gpudata

Parameters:
  • data (int) – pointer to a gpudata structure
  • offset (int) – offset to the data location inside the gpudata
  • dtype (numpy.dtype) – data type of the gpudata elements
  • shape (iterable of ints) – shape to use for the result
  • context (GpuContext) – context of the gpudata
  • strides (iterable of ints) – strides for the results (C contiguous if not specified)
  • writable (bool) – is the data writable?
  • base (object) – base object that keeps gpudata alive
  • cls (type) – view type of the result

Notes

This function might be deprecated in a later release since the only way to create gpudata pointers is through libgpuarray functions that aren’t exposed at the Python level. It can be used with the value of the gpudata attribute of an existing GpuArray.

Warning

This function is intended for advanced use and will crash the interpreter if used improperly.

pygpu.gpuarray.get_default_context()

Return the currently defined default context (or None).

pygpu.gpuarray.init(dev, sched='default', single_stream=False, kernel_cache_path=None, max_cache_size=sys.maxsize, initial_cache_size=0)

Creates a context from a device specifier.

Device specifiers are composed of the type string and the device id like so:

"cuda0"
"opencl0:1"

For cuda the device id is the numeric identifier. You can see what devices are available by running nvidia-smi on the machine. Be aware that the ordering in nvidia-smi might not correspond to the ordering in this library. This is due to how cuda enumerates devices. If you don’t specify a number (e.g. ‘cuda’) the first available device will be selected according to the backend order.

For opencl the device id is the platform number, a colon (:) and the device number. There is no widespread or easy way to list available platforms and devices. You can experiment with the values; unavailable ones will just raise an error, and there are no gaps in the valid numbers.

Parameters:
  • dev (str) – device specifier
  • sched ({'default', 'single', 'multi'}) – optimize scheduling for which type of operation
  • disable_alloc_cache (bool) – disable allocation cache (if any)
  • single_stream (bool) – enable single stream mode
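A hedged usage sketch (assumes a CUDA device; the guard makes it degrade gracefully where none is present):

```python
try:
    import pygpu
    ctx = pygpu.init("cuda0")        # first CUDA device
    # ctx = pygpu.init("opencl0:1")  # OpenCL platform 0, device 1
    devname = ctx.devname
except Exception:
    devname = None                   # no pygpu build or no usable device

print(devname)
```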
pygpu.gpuarray.may_share_memory(a, b)

Returns True if a and b may share memory, False otherwise.

pygpu.gpuarray.open_ipc_handle(c, hpy, l)

Open an IPC handle to get a new GpuArray from it.

Parameters:
  • c (GpuContext) – context
  • hpy (bytes) – binary handle data received
  • l (int) – size of the referred memory block
pygpu.gpuarray.register_dtype(dtype, cname)

Make a new type known to the cluda machinery.

This function returns the associated internal typecode for the new type.

Parameters:
  • dtype (numpy.dtype) – new type
  • cname (str) – C name for the type declarations
pygpu.gpuarray.set_default_context(ctx)

Set the default context for the module.

The provided context will be used as a default value for all the other functions in this module that take a context parameter. Call with None to clear the default value.

If you don't call this function, the context argument to the other functions is mandatory.

This can be helpful to reduce clutter when working with only one context. It is strongly discouraged to use this function when working with multiple contexts at once.

Parameters:ctx (GpuContext) – default context
pygpu.gpuarray.zeros(shape, dtype='float64', order='C', context=None, cls=None)

Returns an array of zero-initialized values of the requested shape, type and order.

Parameters:
  • shape (iterable of ints) – number of elements in each dimension
  • dtype (str, numpy.dtype or int) – type of the elements
  • order ({'A', 'C', 'F'}) – layout of the data in memory, one of ‘A’ny, ‘C’ or ‘F’ortran
  • context (GpuContext) – context in which to do the allocation
  • cls (type) – class of the returned array (must inherit from GpuArray)

pygpu.elemwise module

class pygpu.elemwise.GpuElemwise
class pygpu.elemwise.arg
name
read
scalar
type
write
pygpu.elemwise.as_argument(o, name, read=False, write=False)
pygpu.elemwise.elemwise1(a, op, oper=None, op_tmpl='res = %(op)sa', out=None, convert_f16=True)
pygpu.elemwise.elemwise2(a, op, b, ary, odtype=None, oper=None, op_tmpl='res = (%(out_t)s)a %(op)s (%(out_t)s)b', broadcast=False, convert_f16=True)
pygpu.elemwise.ielemwise2(a, op, b, oper=None, op_tmpl='a = a %(op)s b', broadcast=False, convert_f16=True)
pygpu.elemwise.compare(a, op, b, broadcast=False, convert_f16=True)

pygpu.operations module

pygpu.operations.array_split(ary, indices_or_sections, axis=0)
pygpu.operations.atleast_1d(*arys)
pygpu.operations.atleast_2d(*arys)
pygpu.operations.atleast_3d(*arys)
pygpu.operations.concatenate(arys, axis=0, context=None)
pygpu.operations.dsplit(ary, indices_or_sections)
pygpu.operations.dstack(tup, context=None)
pygpu.operations.hsplit(ary, indices_or_sections)
pygpu.operations.hstack(tup, context=None)
pygpu.operations.split(ary, indices_or_sections, axis=0)
pygpu.operations.vsplit(ary, indices_or_sections)
pygpu.operations.vstack(tup, context=None)

pygpu.reduction module

class pygpu.reduction.ReductionKernel(context, dtype_out, neutral, reduce_expr, redux, map_expr=None, arguments=None, preamble='', init_nd=None)
pygpu.reduction.massage_op(operation)
pygpu.reduction.parse_c_args(arguments)
pygpu.reduction.reduce1(ary, op, neutral, out_type, axis=None, out=None, oper=None)

pygpu.blas module

pygpu.blas.dot(X, Y, Z=None, overwrite_z=False)
pygpu.blas.gemm(alpha, A, B, beta, C=None, trans_a=False, trans_b=False, overwrite_c=False)
pygpu.blas.gemmBatch_3d(alpha, A, B, beta, C=None, trans_a=False, trans_b=False, overwrite_c=False)
pygpu.blas.gemv(alpha, A, X, beta=0.0, Y=None, trans_a=False, overwrite_y=False)
pygpu.blas.ger(alpha, X, Y, A=None, overwrite_a=False)

pygpu.collectives module

class pygpu.collectives.GpuComm(cid, ndev, rank)

Represents a communicator which participates in a multi-gpu clique.

It is used to invoke collective operations among the gpus inside its clique.

Parameters:
  • cid (GpuCommCliqueId) – Unique id shared among participating communicators.
  • ndev (int) – Number of communicators inside the clique.
  • rank (int) – User-defined rank of this communicator inside the clique. It influences order of collective operations.
all_gather(self, src, dest=None, nd_up=1)

AllGather collective operation for ranks in a communicator world.

Parameters:
  • src (GpuArray) – Array to be gathered.
  • dest (GpuArray) – Array to receive all gathered arrays from ranks in GpuComm.
  • nd_up (int) – Used when creating the result array. Indicates how many extra dimensions the user wants the result to have. The default is 1, which means that the result will store each rank’s gathered array along one extra new dimension.

Notes

  • Providing nd_up == 0 means that gathered arrays will be appended to the dimension with the largest stride.
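The nd_up semantics can be pictured on the host with numpy (a sketch only; the real gather runs on the devices):

```python
import numpy as np

# Pretend each of two ranks contributed this 2x3 array.
part = np.arange(6, dtype="float32").reshape(2, 3)

nd1 = np.stack([part, part])        # nd_up=1: one extra leading dimension
nd0 = np.concatenate([part, part])  # nd_up=0: appended along an existing axis

print(nd1.shape)  # (2, 2, 3)
print(nd0.shape)  # (4, 3)
```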
all_reduce(self, src, op, dest=None)

AllReduce collective operation for ranks in a communicator world.

Parameters:
  • src (GpuArray) – Array to be reduced.
  • op (str) – Key indicating operation type.
  • dest (GpuArray) – Array to collect reduce operation result.

Notes

  • If dest is not provided, a new compatible GpuArray will be created and the result returned in it.
broadcast(self, array, root=-1)

Broadcast collective operation for ranks in a communicator world.

Parameters:
  • array (GpuArray) – Array to be broadcast from the root rank (and to receive the data on the other ranks).
  • root (int) – Rank in GpuComm which broadcasts its array.

Notes

  • root is necessary when invoking from a non-root rank. Root caller does not need to provide root argument.
count

Total number of communicators inside the clique

rank

User-defined rank of this communicator inside the clique

reduce(self, src, op, dest=None, root=-1)

Reduce collective operation for ranks in a communicator world.

Parameters:
  • src (GpuArray) – Array to be reduced.
  • op (str) – Key indicating operation type.
  • dest (GpuArray) – Array to collect reduce operation result.
  • root (int) – Rank in GpuComm which will collect result.

Notes

  • root is necessary when invoking from a non-root rank. Root caller does not need to provide root argument.
  • If dest is not provided by the root caller, a new compatible GpuArray will be created and the result returned in it.
reduce_scatter(self, src, op, dest=None)

ReduceScatter collective operation for ranks in a communicator world.

Parameters:
  • src (GpuArray) – Array to be reduced.
  • op (str) – Key indicating operation type.
  • dest (GpuArray) – Array to collect reduce operation scattered result.

Notes

  • If dest is not provided, a new compatible GpuArray will be created and the result returned in it.
class pygpu.collectives.GpuCommCliqueId(context=None, comm_id=None)

Represents a unique id shared among GpuComm communicators which participate in a multi-gpu clique.

Parameters:
  • context (GpuContext) – Reference to which gpu this GpuCommCliqueId object belongs.
  • comm_id (bytes) – Existing unique id to be passed in this object.
comm_id

Unique clique id to be used by each GpuComm in a group of devices

context

pygpu.dtypes module

Type mapping helpers.

pygpu.dtypes.dtype_to_ctype(dtype)

Return the C type that corresponds to dtype.

Parameters:dtype (data type) – a numpy dtype
pygpu.dtypes.get_common_dtype(obj1, obj2, allow_double)

Returns the proper output type for a numpy operation involving the two provided objects. This may not be suitable for certain obscure numpy operations.

If allow_double is False, a return type of float64 will be forced to float32 and complex128 will be forced to complex64.
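The forcing rule can be sketched with numpy's own promotion (common_dtype below is a hypothetical reimplementation of the described behaviour, not the pygpu source):

```python
import numpy as np

def common_dtype(obj1, obj2, allow_double):
    # Promote as numpy would, then demote 64-bit results when doubles
    # are not allowed.
    dt = np.result_type(obj1, obj2)
    if not allow_double:
        if dt == np.float64:
            dt = np.dtype("float32")
        elif dt == np.complex128:
            dt = np.dtype("complex64")
    return dt

a = np.ones(2, dtype="float32")
b = np.ones(2, dtype="float64")
print(common_dtype(a, b, True))   # float64
print(common_dtype(a, b, False))  # float32
```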

pygpu.dtypes.get_np_obj(obj)

Returns a numpy object with the same dtype and behaviour as the source, suitable for output dtype determination.

This is used because the casting rules of numpy are rather obscure and the best way to imitate them is to try an operation and see what it does.

pygpu.dtypes.parse_c_arg_backend(c_arg, scalar_arg_class, vec_arg_class)
pygpu.dtypes.register_dtype(dtype, c_names)

Associate a numpy dtype with its C equivalents.

Will register dtype for use with the gpuarray module. If the c_names argument is a list then the first element of that list is taken as the primary association and will be used for generated C code. The other types will be mapped to the provided dtype when going in the other direction.

Parameters:
  • dtype (numpy.dtype or string) – type to associate
  • c_names (str or list) – list of C type names
pygpu.dtypes.upcast(*args)

pygpu.tools module

class pygpu.tools.Argument(dtype, name)
ctype()
class pygpu.tools.ArrayArg(dtype, name)
decltype()
expr()
isarray()
spec()
class pygpu.tools.ScalarArg(dtype, name)
decltype()
expr()
isarray()
spec()
pygpu.tools.as_argument(obj, name)
pygpu.tools.check_args(args, collapse=False, broadcast=False)

Returns the properties of the arguments and checks that they all match (i.e. are all the same shape).

If collapse is True, dimension collapsing will be performed; if False, it will not.

If broadcast is True, array broadcasting will be performed: dimensions that are of size 1 in some arrays but not in others are repeated to match the size of the other arrays. If broadcast is False, no broadcasting takes place.
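The broadcast rule is numpy's; a quick host-side illustration:

```python
import numpy as np

a = np.ones((3, 1), dtype="float32")
b = np.ones((3, 4), dtype="float32")

# The size-1 dimension of `a` is (virtually) repeated to match `b`.
shape = np.broadcast_shapes(a.shape, b.shape)
print(shape)  # (3, 4)
```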

pygpu.tools.lru_cache(maxsize=20)
pygpu.tools.prod(iterable)

Module contents

pygpu.get_include()
pygpu.test()