# Module weights

Routine to initialize weights.

Note: We assume that numpy.random.seed() has already been called.
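
A minimal sketch of what this note implies for callers, assuming the module draws from NumPy's global random state (the seed value `1234` and the array shape are arbitrary choices for illustration, not values taken from the module):

```python
import numpy

# Seed the global NumPy generator once, before any weights are drawn.
numpy.random.seed(1234)  # arbitrary illustrative seed
first = numpy.random.uniform(-1.0, 1.0, size=(3, 2))

# Re-seeding with the same value reproduces exactly the same draws.
numpy.random.seed(1234)
second = numpy.random.uniform(-1.0, 1.0, size=(3, 2))

print(numpy.array_equal(first, second))
```

Seeding once at program start, rather than inside the weight routine, keeps every later draw reproducible without the routine resetting the generator on each call.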

 Functions

 random_weights(nin, nout, scale_by=0.57735026919, power=0.5) Generate an initial weight matrix with nin inputs (rows) and nout outputs (cols).

 Variables
sqrt3 = `1.73205080757`

Imports: pow, sqrt, numpy

 Function Details

### random_weights(nin, nout, scale_by=0.57735026919, power=0.5)

```
Generate an initial weight matrix with nin inputs (rows) and nout
outputs (cols).
Each weight is chosen uniformly at random to be in range:
[-scale_by*sqrt(3)/pow(nin,power), +scale_by*sqrt(3)/pow(nin,power)]
@note: Play with scale_by, but reasonable values are <=1, maybe 1./sqrt3
power=0.5 is strongly recommended (see below).

Suppose these weights w are used in dot products as follows:
output = w' input
If w_i ~ Uniform(-r,r), E[input_i]=0, Var[input_i]=1 and the input_i's are
independent (with d = nin terms in the sum), then
Var[w_i] = r^2/3
Var[output] = Var[ sum_{i=1}^d w_i input_i ] = d r^2 / 3
To make sure that the variance is not changed by the dot product,
we therefore want Var[output]=1 and r = sqrt(3)/sqrt(d).  This choice
corresponds to scale_by=1 and power=0.5 (the default scale_by is 1/sqrt(3)).
More generally, with power=0.5, Var[output] = Var[input] * scale_by^2.

Now, if these are weights in a deep multi-layer neural network,
we would like the top layers to be initially more linear, so as to let
gradients flow back more easily (this is an explanation by Ronan Collobert).
To achieve this we want scale_by smaller than 1.
Ronan used scale_by=1/sqrt(3) (by mistake!) and got better results than scale_by=1
in the experiment of his ICML'2008 paper.
Note that if we have a multi-layer network, ignoring the effect of the tanh non-linearity,
the variance of the layer outputs would go down roughly by a factor scale_by^2 at each
layer (making the layers more linear as we go up towards the output).

```
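
The range and the variance argument in the docstring can be checked numerically. The block below is a sketch only: `random_weights_sketch` is a hypothetical re-implementation written from the documented range, not the module's actual source, and the layer sizes, sample count and seed are arbitrary. With power=0.5 the half-width is r = scale_by*sqrt(3)/sqrt(nin), so the docstring's expression nin * r^2 / 3 reduces to scale_by^2, and the measured variances should come out close to 1 and to 1/3.

```python
from math import sqrt

import numpy

sqrt3 = sqrt(3.0)


def random_weights_sketch(nin, nout, scale_by=1.0 / sqrt3, power=0.5):
    """Hypothetical re-implementation of the documented behaviour."""
    # Half-width of the uniform range given in the docstring.
    r = scale_by * sqrt3 / pow(nin, power)
    # Draw from the global NumPy generator, assumed to be seeded by the caller.
    return numpy.random.uniform(low=-r, high=r, size=(nin, nout))


numpy.random.seed(0)  # arbitrary seed, for reproducibility only
nin, nout, n_samples = 500, 400, 2000

# Zero-mean, unit-variance, independent inputs, as the derivation assumes.
inputs = numpy.random.standard_normal((n_samples, nin))

measured = {}
for scale_by in (1.0, 1.0 / sqrt3):
    w = random_weights_sketch(nin, nout, scale_by=scale_by)
    outputs = inputs.dot(w)  # each row holds w' input for one sample
    measured[scale_by] = outputs.var()
    print("scale_by=%.4f  Var[output]=%.4f" % (scale_by, measured[scale_by]))
```

Stacking several such layers without a non-linearity would multiply these factors together, which is the shrinking-variance effect described in the last paragraph of the docstring.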