random_weights(nin, nout, scale_by=0.57735026919, power=0.5)

Generate an initial weight matrix with nin inputs (rows) and nout
outputs (cols).
Each weight is drawn uniformly at random from the range:
[-scale_by*sqrt(3)/pow(nin,power), +scale_by*sqrt(3)/pow(nin,power)]
@note: Play with scale_by, but reasonable values are <=1, maybe 1./sqrt(3).
power=0.5 is strongly recommended (see below).
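A minimal sketch of the sampling rule described above, assuming NumPy and a symmetric range [-r, +r] with r = scale_by*sqrt(3)/pow(nin,power). The name random_weights_sketch and its rng argument are hypothetical, introduced only for illustration; this is not the library's actual source.

```python
import numpy as np


def random_weights_sketch(nin, nout, scale_by=1.0 / np.sqrt(3), power=0.5, rng=None):
    """Illustrative re-implementation (hypothetical, not the library source).

    Returns an (nin, nout) array whose entries are drawn uniformly from
    [-r, +r], where r = scale_by * sqrt(3) / nin**power.
    """
    if rng is None:
        rng = np.random.default_rng()
    r = scale_by * np.sqrt(3) / nin ** power
    return rng.uniform(low=-r, high=r, size=(nin, nout))


# Usage: 100 inputs, 20 outputs, default scale_by and power.
w = random_weights_sketch(100, 20, rng=np.random.default_rng(0))

# With these defaults the half-width is (1/sqrt(3)) * sqrt(3) / sqrt(100) = 0.1.
bound = (1.0 / np.sqrt(3)) * np.sqrt(3) / np.sqrt(100)
print(w.shape, np.abs(w).max(), bound)
```

Passing an explicit rng keeps the draw reproducible; that argument is a convenience of this sketch and is not part of the documented signature.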
Suppose these weights w are used in dot products as follows:
output = w' input
If w_i ~ Uniform(-r,r), E[input_i]=0, Var[input_i]=1, and the input_i's are independent
of each other and of w, then
Var[w_i] = r^2/3
Var[output] = Var[ sum_{i=1}^d w_i input_i ] = d r^2 / 3
To make sure that the variance is not changed by the dot product,
we therefore want Var[output]=1, i.e. r = sqrt(3)/sqrt(d). With d = nin, this choice
corresponds to scale_by=1 and power=0.5 (note that the default scale_by in the
signature is 1/sqrt(3), not 1; see below).
More generally, with power=0.5, we see that Var[output] = Var[input] * scale_by^2.
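The derivation above can be checked numerically. The following is a rough Monte Carlo sketch, assuming NumPy, zero-mean unit-variance Gaussian inputs, and power=0.5; the values of d, n_samples and scale_by are arbitrary choices for the illustration. Substituting r = scale_by*sqrt(3)/sqrt(d) into d r^2 / 3 gives scale_by^2, which is what the simulation should approach.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_samples, scale_by = 128, 10000, 0.5  # arbitrary illustrative values

# Half-width of the range with power=0.5: r = scale_by * sqrt(3) / sqrt(d).
r = scale_by * np.sqrt(3) / np.sqrt(d)

# One fresh weight vector and one fresh unit-variance input per sample.
w = rng.uniform(-r, r, size=(n_samples, d))
x = rng.standard_normal(size=(n_samples, d))
output = np.sum(w * x, axis=1)  # output = w' input, once per sample

predicted_var = d * r ** 2 / 3  # algebraically equal to scale_by**2
empirical_var = output.var()
print(predicted_var, empirical_var)
```

The empirical variance should land close to scale_by**2 (0.25 here), up to sampling noise.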
Now, if these are weights in a deep multilayer neural network,
we would like the top layers to be initially more linear, so as to let
gradients flow back more easily (this is an explanation by Ronan Collobert).
To achieve this we want scale_by smaller than 1.
Ronan used scale_by=1/sqrt(3) (by mistake!) and got better results than scale_by=1
in the experiment of his ICML'2008 paper.
Note that if we have a multilayer network, ignoring the effect of the tanh nonlinearity,
the variance of the layer outputs would go down roughly by a factor scale_by^2 at each
layer (the standard deviation by a factor scale_by), making the layers more linear
as we go up towards the output.
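As a rough illustration of this layer-by-layer shrinkage, the sketch below stacks purely linear layers (the tanh is deliberately left out, as in the argument above) and prints the ratio of output variance to input variance at each layer. It assumes NumPy, power=0.5, and square layers; width, n_layers and batch are arbitrary values chosen for the example. Since each layer has Var[output] = d r^2 / 3 times the input variance, the ratios should sit near scale_by^2.

```python
import numpy as np

rng = np.random.default_rng(0)
width, n_layers, batch = 200, 4, 2000  # arbitrary illustrative values
scale_by = 1.0 / np.sqrt(3)

# Unit-variance inputs, one row per example.
h = rng.standard_normal(size=(batch, width))
variances = [h.var()]

for _ in range(n_layers):
    # Same range as in the docstring, with nin = width and power = 0.5.
    r = scale_by * np.sqrt(3) / np.sqrt(width)
    w = rng.uniform(-r, r, size=(width, width))
    h = h @ w  # linear layer only; no tanh
    variances.append(h.var())

# Ratio of each layer's output variance to its input variance.
ratios = [variances[i + 1] / variances[i] for i in range(n_layers)]
print(ratios)
```

With scale_by = 1/sqrt(3), each ratio should be roughly 1/3, so after four layers the variance has dropped by about a factor of 81. A real network with tanh units would deviate from this, since the nonlinearity also compresses large activations.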
