nnet – Ops for neural networks

theano.tensor.nnet.nnet.sigmoid(x)
Returns the standard sigmoid nonlinearity applied to x
Parameters:

x - symbolic Tensor (or compatible)

Return type:

same as x

Returns:

element-wise sigmoid: sigmoid(x) = \frac{1}{1 + \exp(-x)}.

note:

See ultra_fast_sigmoid() or hard_sigmoid() for faster but less precise versions. Speed comparison for 100M float64 elements on a Core2 Duo @ 3.16 GHz:

  • hard_sigmoid: 1.0s
  • ultra_fast_sigmoid: 1.3s
  • sigmoid (with amdlibm): 2.3s
  • sigmoid (without amdlibm): 3.7s

Precision: sigmoid (with or without amdlibm) > ultra_fast_sigmoid > hard_sigmoid.

[Figure: sigmoid_prec.png, precision comparison of the sigmoid variants]

Example:

import theano.tensor as T

x, y, b = T.dvectors('x', 'y', 'b')
W = T.dmatrix('W')
y = T.nnet.sigmoid(T.dot(W, x) + b)

Note

The underlying code will return an exact 0 or 1 if an element of x is too small or too big.
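
The saturation described in this note can be reproduced outside Theano. The following is a minimal plain-Python sketch of the sigmoid formula, not Theano's implementation; the helper name sigmoid_ref is hypothetical. It branches on the sign of x so that exp never overflows, and shows that in float64 the result rounds to exactly 0 or 1 once |x| is large enough.

```python
import math


def sigmoid_ref(x):
    """Reference sigmoid 1 / (1 + exp(-x)) for one float (hypothetical helper)."""
    if x >= 0:
        # exp(-x) <= 1 here, so no overflow is possible.
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x use the equivalent form exp(x) / (1 + exp(x)).
    e = math.exp(x)
    return e / (1.0 + e)


print(sigmoid_ref(0.0))     # midpoint of the curve
print(sigmoid_ref(40.0))    # 1 + exp(-40) rounds to 1, so the result is exactly 1.0
print(sigmoid_ref(-800.0))  # exp(-800) underflows to 0, so the result is exactly 0.0
```

A practical consequence is that log(sigmoid(x)) computed from such a value can be -inf, which is one reason the stabilization optimizations mentioned in this section matter.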

theano.tensor.nnet.nnet.ultra_fast_sigmoid(x)
Returns the approximated standard sigmoid() nonlinearity applied to x.
Parameters: x - symbolic Tensor (or compatible)
Return type: same as x
Returns: approximated element-wise sigmoid: sigmoid(x) = \frac{1}{1 + \exp(-x)}.
note: To automatically change all sigmoid() ops to this version, use the Theano optimization local_ultra_fast_sigmoid. This can be done with the Theano flag optimizer_including=local_ultra_fast_sigmoid. This optimization is done late, so it should not affect stabilization optimization.

Note

The underlying code will return 0.00247262315663 as the minimum value and 0.997527376843 as the maximum value. So it never returns 0 or 1.

Note

Using ultra_fast_sigmoid directly in the graph disables the stabilization optimizations associated with it. Letting the optimization insert it instead does not disable them.

theano.tensor.nnet.nnet.hard_sigmoid(x)
Returns the approximated standard sigmoid() nonlinearity applied to x.
Parameters: x - symbolic Tensor (or compatible)
Return type: same as x
Returns: approximated element-wise sigmoid: sigmoid(x) = \frac{1}{1 + \exp(-x)}.
note: To automatically change all sigmoid() ops to this version, use the Theano optimization local_hard_sigmoid. This can be done with the Theano flag optimizer_including=local_hard_sigmoid. This optimization is done late, so it should not affect stabilization optimization.

Note

The underlying code will return an exact 0 or 1 if an element of x is too small or too big.

Note

Using hard_sigmoid directly in the graph disables the stabilization optimizations associated with it. Letting the optimization insert it instead does not disable them.
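
To illustrate why a hard sigmoid can return an exact 0 or 1, here is a minimal plain-Python sketch of a piecewise-linear approximation followed by clipping. The helper name hard_sigmoid_ref and the constants slope=0.2 and shift=0.5 are assumptions made for illustration; they are not stated in the text above.

```python
def hard_sigmoid_ref(x, slope=0.2, shift=0.5):
    """Piecewise-linear sigmoid approximation for one float.

    The slope and shift defaults are assumptions made for illustration.
    """
    y = slope * x + shift
    # Clipping to [0, 1] is where the exact 0 and 1 outputs come from.
    return min(1.0, max(0.0, y))


print(hard_sigmoid_ref(0.0))    # centre of the linear segment
print(hard_sigmoid_ref(10.0))   # clipped at the upper bound
print(hard_sigmoid_ref(-10.0))  # clipped at the lower bound
```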

theano.tensor.nnet.nnet.softplus(x)
Returns the softplus nonlinearity applied to x.
Parameters: x - symbolic Tensor (or compatible)
Return type: same as x
Returns: element-wise softplus: softplus(x) = \log_e{\left(1 + \exp(x)\right)}.

Note

The underlying code will return an exact 0 if an element of x is too small.

import theano.tensor as T

x, y, b = T.dvectors('x', 'y', 'b')
W = T.dmatrix('W')
y = T.nnet.softplus(T.dot(W, x) + b)
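
The exact-0 behaviour for very negative inputs can be seen in a plain-Python sketch of the softplus formula. This is a reference computation under the definition log(1 + exp(x)), not Theano's implementation, and the helper name softplus_ref is hypothetical.

```python
import math


def softplus_ref(x):
    """Reference softplus log(1 + exp(x)) for one float (hypothetical helper)."""
    # max(x, 0) + log1p(exp(-|x|)) equals log(1 + exp(x)) but cannot overflow.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


print(softplus_ref(0.0))     # log(2)
print(softplus_ref(-800.0))  # exp(-800) underflows, so the result is exactly 0.0
print(softplus_ref(800.0))   # approaches x for large positive input
```
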
theano.tensor.nnet.nnet.softsign(x)

Returns the element-wise softsign activation function \varphi(x) = \frac{x}{1 + |x|}.

theano.tensor.nnet.nnet.softmax(x)
Returns the softmax function of x.
Parameters: x - symbolic 2D Tensor (or compatible).
Return type: same as x
Returns: a symbolic 2D tensor whose ij-th element is softmax_{ij}(x) = \frac{\exp(x_{ij})}{\sum_k \exp(x_{ik})}.

The softmax function will, when applied to a matrix, compute the softmax values row-wise.

note:

This op supports Hessian-free optimization as well. The softmax op is more numerically stable than a direct implementation of the formula because it subtracts the row-wise maximum before exponentiating:

e_x = exp(x - x.max(axis=1, keepdims=True))
out = e_x / e_x.sum(axis=1, keepdims=True)

Example of use:

x, y, b = T.dvectors('x', 'y', 'b')
W = T.dmatrix('W')
y = T.nnet.softmax(T.dot(W, x) + b)
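
The effect of subtracting the row maximum can be checked with a small plain-Python sketch. It mirrors the two-line computation quoted in this entry, using lists of lists instead of tensors; the helper name softmax_rows is hypothetical.

```python
import math


def softmax_rows(matrix):
    """Row-wise softmax of a list of lists of floats (hypothetical helper)."""
    result = []
    for row in matrix:
        row_max = max(row)
        # Shifting by the row maximum leaves the result unchanged but
        # keeps every exponent <= 0, so exp cannot overflow.
        exps = [math.exp(v - row_max) for v in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


# The second row would overflow exp() without the shift.
probs = softmax_rows([[1.0, 2.0, 3.0], [1000.0, 1000.0, 1000.0]])
for row in probs:
    print(row, sum(row))
```
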
theano.tensor.nnet.relu(x, alpha=0)

Compute the element-wise rectified linear activation function.

New in version 0.7.1.

Parameters:
  • x (symbolic tensor) – Tensor to compute the activation function for.
  • alpha (scalar or tensor, optional) – Slope for negative input, usually between 0 and 1. The default value of 0 will lead to the standard rectifier, 1 will lead to a linear activation function, and any value in between will give a leaky rectifier. A shared variable (broadcastable against x) will result in a parameterized rectifier with learnable slope(s).
Returns:

Element-wise rectifier applied to x.

Return type:

symbolic tensor

Notes

This is numerically equivalent to T.switch(x > 0, x, alpha * x) (or T.maximum(x, alpha * x) for alpha < 1), but uses a faster formulation or an optimized Op, so we encourage you to use this function.
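
The equivalence stated in this note can be checked on scalars. The sketch below compares the branching form with one branch-free algebraic form, f1 * x + f2 * |x|; that this is the kind of "faster formulation" meant here is an assumption, and both helper names are hypothetical.

```python
def relu_switch(x, alpha=0.0):
    """Rectifier written as a branch: x if x > 0 else alpha * x."""
    return x if x > 0 else alpha * x


def relu_abs(x, alpha=0.0):
    """Branch-free form f1 * x + f2 * |x| (assumed formulation)."""
    f1 = 0.5 * (1.0 + alpha)
    f2 = 0.5 * (1.0 - alpha)
    # For x > 0 this gives (f1 + f2) * x = x,
    # for x < 0 it gives (f1 - f2) * x = alpha * x.
    return f1 * x + f2 * abs(x)


for value in (-2.0, -0.5, 0.0, 0.5, 3.0):
    print(value, relu_switch(value, 0.1), relu_abs(value, 0.1))
```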

theano.tensor.nnet.elu(x, alpha=1)

Compute the element-wise exponential linear activation function [2].

New in version 0.8.0.

Parameters:
  • x (symbolic tensor) – Tensor to compute the activation function for.
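  • alpha (scalar) – Scale of the negative part. Following [2], negative input x maps to alpha * (exp(x) - 1), which saturates at -alpha.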
Returns:

Element-wise exponential linear activation function applied to x.

Return type:

symbolic tensor
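
As a reference for the shape of this function, here is a minimal plain-Python sketch of the ELU definition from the cited paper: x for positive input and alpha * (exp(x) - 1) otherwise. It is not Theano's implementation, and the helper name elu_ref is hypothetical.

```python
import math


def elu_ref(x, alpha=1.0):
    """ELU for one float: x if x > 0 else alpha * (exp(x) - 1)."""
    # expm1 computes exp(x) - 1 accurately for x close to 0.
    return x if x > 0 else alpha * math.expm1(x)


print(elu_ref(2.0))                 # identity for positive input
print(elu_ref(-1.0))                # smooth negative branch
print(elu_ref(-1000.0, alpha=1.5))  # saturates at -alpha
```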

References

[2] Djork-Arne Clevert, Thomas Unterthiner, Sepp Hochreiter, “Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs),” <http://arxiv.org/abs/1511.07289>.

theano.tensor.nnet.selu(x)

Compute the element-wise Scaled Exponential Linear unit [3].

New in version 0.9.0.

Parameters: x (symbolic tensor) – Tensor to compute the activation function for.
Returns: Element-wise scaled exponential linear activation function applied to x.
Return type: symbolic tensor
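
A minimal plain-Python sketch of the scaled exponential linear unit, using the definition and constants from the cited paper, rounded to float64. It is a reference computation rather than Theano's implementation; the names selu_ref, SELU_ALPHA and SELU_SCALE are hypothetical.

```python
import math

# Constants from the cited paper, rounded to float64 precision.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def selu_ref(x):
    """SELU for one float: scale * (x if x > 0 else alpha * (exp(x) - 1))."""
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * math.expm1(x))


print(selu_ref(1.0))      # equals the scale constant
print(selu_ref(-1000.0))  # saturates at -scale * alpha
```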

References

[3] G. Klambauer, T. Unterthiner, A. Mayr, S. Hochreiter, “Self-Normalizing Neural Networks,” <https://arxiv.org/abs/1706.02515>.

theano.tensor.nnet.nnet.binary_crossentropy(output, target)
Computes the binary cross-entropy between a target and an output.
Parameters:
  • target - symbolic Tensor (or compatible)
  • output - symbolic Tensor (or compatible)
Return type:

same as target

Returns:

a symbolic tensor, where the following is applied element-wise: crossentropy(o, t) = -(t \cdot \log(o) + (1 - t) \cdot \log(1 - o)).

The following block implements a simple auto-associator with a sigmoid nonlinearity and a reconstruction error which corresponds to the binary cross-entropy (note that this assumes that x will contain values between 0 and 1):

x, y, b, c = T.dvectors('x', 'y', 'b', 'c')
W = T.dmatrix('W')
V = T.dmatrix('V')
h = T.nnet.sigmoid(T.dot(W, x) + b)
x_recons = T.nnet.sigmoid(T.dot(V, h) + c)
recon_cost = T.nnet.binary_crossentropy(x_recons, x).mean()
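
For concrete numbers, the element-wise formula can be evaluated in plain Python. This is a sketch of the formula on lists of floats, not Theano's implementation; the helper name binary_crossentropy_ref is hypothetical, and every output is assumed to lie strictly between 0 and 1.

```python
import math


def binary_crossentropy_ref(output, target):
    """Element-wise binary cross-entropy for lists of floats.

    Each output must lie strictly between 0 and 1, otherwise log fails.
    """
    return [-(t * math.log(o) + (1.0 - t) * math.log(1.0 - o))
            for o, t in zip(output, target)]


# The loss grows as the output moves away from the target value 1.
losses = binary_crossentropy_ref([0.9, 0.5, 0.1], [1.0, 1.0, 1.0])
print(losses)
```
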
theano.tensor.nnet.nnet.sigmoid_binary_crossentropy(output, target)
Computes the binary cross-entropy between a target and the sigmoid of an output.
Parameters:
  • target - symbolic Tensor (or compatible)
  • output - symbolic Tensor (or compatible)
Return type:

same as target

Returns:

a symbolic tensor, where the following is applied element-wise: crossentropy(o, t) = -(t \cdot \log(sigmoid(o)) + (1 - t) \cdot \log(1 - sigmoid(o))).

It is equivalent to binary_crossentropy(sigmoid(output), target), but with more efficient and numerically stable computation, especially when taking gradients.

The following block implements a simple auto-associator with a sigmoid nonlinearity and a reconstruction error which corresponds to the binary cross-entropy (note that this assumes that x will contain values between 0 and 1):

x, y, b, c = T.dvectors('x', 'y', 'b', 'c')
W = T.dmatrix('W')
V = T.dmatrix('V')
h = T.nnet.sigmoid(T.dot(W, x) + b)
x_precons = T.dot(V, h) + c
# final reconstructions are given by sigmoid(x_precons), but we leave
# them unnormalized as sigmoid_binary_crossentropy applies sigmoid
recon_cost = T.nnet.sigmoid_binary_crossentropy(x_precons, x).mean()
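
The stability benefit can be illustrated in plain Python. One stable way to write the fused loss uses the identities -log(sigmoid(o)) = softplus(-o) and -log(1 - sigmoid(o)) = softplus(o). That this matches Theano's internal formulation is an assumption, and all helper names below are hypothetical.

```python
import math


def softplus_ref(x):
    """Overflow-free softplus log(1 + exp(x)) for one float."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid_bce_ref(o, t):
    """Cross-entropy between target t and sigmoid(o), for single floats."""
    # -log(sigmoid(o)) = softplus(-o) and -log(1 - sigmoid(o)) = softplus(o).
    return t * softplus_ref(-o) + (1.0 - t) * softplus_ref(o)


def naive_bce(o, t):
    """Direct composition; only safe for moderate |o|."""
    s = 1.0 / (1.0 + math.exp(-o))
    return -(t * math.log(s) + (1.0 - t) * math.log(1.0 - s))


print(sigmoid_bce_ref(0.3, 1.0), naive_bce(0.3, 1.0))  # agree for moderate input
print(sigmoid_bce_ref(1000.0, 0.0))                    # stays finite for extreme input
```

With o = 1000 the naive form fails because sigmoid(o) rounds to 1 and log(1 - 1) is undefined, while the fused form returns a finite loss.
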
theano.tensor.nnet.nnet.categorical_crossentropy(coding_dist, true_dist)

Return the cross-entropy between an approximating distribution and a true distribution. The cross entropy between two probability distributions measures the average number of bits needed to identify an event from a set of possibilities, if a coding scheme is used based on a given probability distribution q, rather than the “true” distribution p. Mathematically, this function computes H(p,q) = - \sum_x p(x) \log(q(x)), where p=true_dist and q=coding_dist.

Parameters:
  • coding_dist - symbolic 2D Tensor (or compatible). Each row represents a distribution.
  • true_dist - symbolic 2D Tensor OR symbolic vector of ints. In the case of an integer vector argument, each element represents the position of the ‘1’ in a 1-of-N encoding (aka “one-hot” encoding)
Return type:

tensor of rank one-less-than coding_dist

Note

An application of the scenario where true_dist has a 1-of-N representation is in classification with softmax outputs. If coding_dist is the output of the softmax and true_dist is a vector of correct labels, then the function will compute y_i = - \log(coding_dist[i, true_dist[i]]), which corresponds to computing the neg-log-probability of the correct class (which is typically the training criterion in classification settings).

y = T.nnet.softmax(T.dot(W, x) + b)
cost = T.nnet.categorical_crossentropy(y, o)
# o is either the above-mentioned 1-of-N vector or 2D tensor
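
The two accepted forms of true_dist give the same result when the 2D form is one-hot. The plain-Python sketch below shows this on lists; it is not Theano's implementation, and the helper name categorical_crossentropy_ref is hypothetical.

```python
import math


def categorical_crossentropy_ref(coding_dist, true_dist):
    """Row-wise cross-entropy H(p, q) = -sum_x p(x) log q(x).

    coding_dist: list of rows, each a probability distribution q.
    true_dist: list of rows (distributions p) or list of int class indices.
    """
    losses = []
    for q, p in zip(coding_dist, true_dist):
        if isinstance(p, int):
            # Integer label: only the log-probability of that class counts.
            losses.append(-math.log(q[p]))
        else:
            # This sketch skips terms with p(x) = 0 so log is never
            # evaluated where it would contribute nothing.
            losses.append(-sum(pi * math.log(qi)
                               for pi, qi in zip(p, q) if pi > 0))
    return losses


q = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
by_index = categorical_crossentropy_ref(q, [0, 2])
by_one_hot = categorical_crossentropy_ref(
    q, [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
print(by_index, by_one_hot)  # one loss per row in both cases
```
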
theano.tensor.nnet.h_softmax(x, batch_size, n_outputs, n_classes, n_outputs_per_class, W1, b1, W2, b2, target=None)

Two-level hierarchical softmax.

This function implements a two-layer hierarchical softmax. It is commonly used as an alternative to the softmax when the number of outputs is large (it is common to use it for millions of outputs). See reference [1] for more information about the computational gains.

The n_outputs outputs are organized in n_classes classes, each class containing the same number n_outputs_per_class of outputs. For an input x (last hidden activation), the first softmax layer predicts its class and the second softmax layer predicts its output among its class.

If target is specified, it will only compute the outputs of the corresponding targets. Otherwise, if target is None, it will compute all the outputs.

The outputs are grouped in classes in the same order as they are initially defined: if n_outputs=10 and n_classes=2, then the first class is composed of the outputs labeled {0,1,2,3,4} while the second class is composed of {5,6,7,8,9}. If you need to change the classes, you have to re-label your outputs.
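
The grouping rule can be sketched in plain Python. Assuming the contiguous grouping described in this entry, an output label splits into a class index and a position within that class by integer division and remainder, and the output probability is the product of the two softmax levels. All names below are hypothetical, and within_class_probs stands in for the per-class distributions of the second softmax.

```python
def split_target(target, n_outputs_per_class):
    """Map an output label to (class index, index within that class)."""
    return target // n_outputs_per_class, target % n_outputs_per_class


def output_probability(class_probs, within_class_probs, target,
                       n_outputs_per_class):
    """P(output) = P(class) * P(output | class) for one input."""
    c, k = split_target(target, n_outputs_per_class)
    return class_probs[c] * within_class_probs[c][k]


# n_outputs=10 and n_classes=2: label 7 is the third output of the second class.
print(split_target(7, 5))
print(output_probability([0.25, 0.75],
                         [[0.2] * 5, [0.1, 0.1, 0.6, 0.1, 0.1]], 7, 5))
```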

New in version 0.7.1.

Parameters:
  • x (tensor of shape (batch_size, number of features)) – the minibatch input of the two-layer hierarchical softmax.
  • batch_size (int) – the size of the minibatch input x.
  • n_outputs (int) – the number of outputs.
  • n_classes (int) – the number of classes of the two-layer hierarchical softmax. It corresponds to the number of outputs of the first softmax. See note at the end.
  • n_outputs_per_class (int) – the number of outputs per class. See note at the end.
  • W1 (tensor of shape (number of features of the input x, n_classes)) – the weight matrix of the first softmax, which maps the input x to the probabilities of the classes.
  • b1 (tensor of shape (n_classes,)) – the bias vector of the first softmax layer.
  • W2 (tensor of shape (n_classes, number of features of the input x, n_outputs_per_class)) – the weight matrix of the second softmax, which maps the input x to the probabilities of the outputs.
  • b2 (tensor of shape (n_classes, n_outputs_per_class)) – the bias vector of the second softmax layer.
  • target (tensor of shape either (batch_size,) or (batch_size, 1)) – (optional, default None) contains the indices of the targets for the minibatch input x. For each input, the function computes the output for its corresponding target. If target is None, then all the outputs are computed for each input.
Returns:

Output tensor of the two-layer hierarchical softmax for input x. Depending on argument target, it can have two different shapes. If target is not specified (None), then all the outputs are computed and the returned tensor has shape (batch_size, n_outputs). Otherwise, when target is specified, only the corresponding outputs are computed and the returned tensor has thus shape (batch_size, 1).

Return type:

tensor of shape (batch_size, n_outputs) or (batch_size, 1)

Notes

The product of n_outputs_per_class and n_classes has to be greater than or equal to n_outputs. If it is strictly greater, then the irrelevant outputs will be ignored. n_outputs_per_class and n_classes have to be the same as the corresponding dimensions of the tensors W1, b1, W2 and b2. The most computationally efficient configuration is when n_outputs_per_class and n_classes are both equal to the square root of n_outputs.

Examples

The following example builds a simple hierarchical softmax layer.

>>> import numpy as np
>>> import theano
>>> from theano import tensor
>>> from theano.tensor.nnet import h_softmax
>>>
>>> # Parameters
>>> batch_size = 32
>>> n_outputs = 100
>>> dim_x = 10  # dimension of the input
>>> n_classes = int(np.ceil(np.sqrt(n_outputs)))
>>> n_outputs_per_class = n_classes
>>> output_size = n_classes * n_outputs_per_class
>>>
>>> # First level of h_softmax
>>> floatX = theano.config.floatX
>>> W1 = theano.shared(
...     np.random.normal(0, 0.001, (dim_x, n_classes)).astype(floatX))
>>> b1 = theano.shared(np.zeros((n_classes,), floatX))
>>>
>>> # Second level of h_softmax
>>> W2 = np.random.normal(0, 0.001,
...     size=(n_classes, dim_x, n_outputs_per_class)).astype(floatX)
>>> W2 = theano.shared(W2)
>>> b2 = theano.shared(np.zeros((n_classes, n_outputs_per_class), floatX))
>>>
>>> # We can now build the graph to compute a loss function, typically the
>>> # negative log-likelihood:
>>>
>>> x = tensor.imatrix('x')
>>> target = tensor.imatrix('target')
>>>
>>> # This only computes the output corresponding to the target.
>>> # The complexity is O(n_classes + n_outputs_per_class).
>>> y_hat_tg = h_softmax(x, batch_size, output_size, n_classes,
...                      n_outputs_per_class, W1, b1, W2, b2, target)
>>>
>>> negll = -tensor.mean(tensor.log(y_hat_tg))
>>>
>>> # We may need to compute all the outputs (at test time usually):
>>>
>>> # This computes all the outputs.
>>> # The complexity is O(n_classes * n_outputs_per_class).
>>> output = h_softmax(x, batch_size, output_size, n_classes,
...                    n_outputs_per_class, W1, b1, W2, b2)

References

[1] J. Goodman, “Classes for Fast Maximum Entropy Training,” ICASSP, 2001, <http://arxiv.org/abs/cs/0108006>.