Introduction to SeriesTables

SeriesTables was created to make it easy to record series of scalar data, most notably the evolution of errors (training, validation, test) during training. I foresee other common use cases, such as recording basic statistics (mean, min/max, variance) of parameters during training to diagnose problems.

I also think that if such recording is easily accessible, it might lead us to record other statistics, such as statistics concerning activations in the network (e.g. to diagnose unit saturation problems).

Each element of a series is indexed and timestamped. By default, the index is named “epoch”, which means that an epoch number is stored with each row (but this can easily be customized). By default, a timestamp taken at row creation time is also stored, along with the CPU clock() time. This allows plotting error series against either epoch number or training time.
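Conceptually, each appended row pairs the index with these automatic columns. Here is a rough sketch in plain Python of what one row carries, using time.time() and time.process_time() as stand-ins for the timestamp and CPU clock; the field names are illustrative, not the library’s internals:

```python
import time

def make_row(index, value):
    """Sketch of the fields stored for one appended row."""
    return {
        "epoch": index,                    # the (customizable) index
        "timestamp": int(time.time()),     # wall-clock time at row creation
        "cpuclock": time.process_time(),   # CPU time consumed so far
        "validation_error": value,         # the recorded scalar itself
    }

row = make_row(1, 32.0)
print(sorted(row.keys()))
```

Having both timestamp and cpuclock in every row is what makes it possible to plot the same error series against wall-clock time, CPU time, or epoch number after the fact.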

Series are saved in HDF5 files, which I’ll introduce briefly.

Introduction to PyTables and HDF5

HDF5 is a file format intended for the storage of big numerical datasets. In practice, for our purposes, you’ll create a single .h5 file in which many tables, corresponding to different series, will be stored. Datasets in a single file are organized hierarchically, in the equivalent of “folders”, called “groups”. The “files” in this analogy would be our tables.

A useful property of HDF5 is that metadata is stored alongside the data itself; notably, the table names and column names live inside the file. We can also attach richer metadata, such as a title, or even complex objects (which will be pickled), as attributes.

PyTables is a Python library to use the HDF5 format.

Here’s a basic Python session in which I create a new file and store a few rows in a single table:

>>> import tables
>>>
>>> hdf5_file = tables.openFile("mytables.h5", "w")
>>>
>>> # Create a new subgroup under the root group "/"
... mygroup = hdf5_file.createGroup("/", "mygroup")
>>>
>>> # Define the type of data we want to store
... class MyDescription(tables.IsDescription):
...     int_column_1 = tables.Int32Col(pos=0)
...     float_column_1 = tables.Float32Col(pos=1)
...
>>> # Create a table under mygroup
... mytable = hdf5_file.createTable("/mygroup", "mytable", MyDescription)
>>>
>>> newrow = mytable.row
>>>
>>> # a first row
... newrow["int_column_1"] = 15
>>> newrow["float_column_1"] = 30.0
>>> newrow.append()
>>>
>>> # and a second row
... newrow["int_column_1"] = 16
>>> newrow["float_column_1"] = 32.0
>>> newrow.append()
>>>
>>> # make sure we write to disk
... hdf5_file.flush()
>>>
>>> hdf5_file.close()

And here’s a session in which I reload the data and explore it:

>>> import tables
>>>
>>> hdf5_file = tables.openFile("mytables.h5", "r")
>>>
>>> mytable = hdf5_file.getNode("/mygroup", "mytable")
>>>
>>> # tables can be "sliced" this way
... mytable[0:2]
array([(15, 30.0), (16, 32.0)],
      dtype=[('int_column_1', '<i4'), ('float_column_1', '<f4')])
>>>
>>> # or we can access columns individually
... mytable.cols.int_column_1[0:2]
array([15, 16], dtype=int32)
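The structured array returned by slicing behaves like any NumPy record array, so the result shown above can be rebuilt directly in NumPy if you want to experiment without an .h5 file at hand:

```python
import numpy as np

# Same dtype as the table defined above: a 32-bit int column
# followed by a 32-bit float column.
rows = np.array([(15, 30.0), (16, 32.0)],
                dtype=[('int_column_1', '<i4'), ('float_column_1', '<f4')])

# Column access, analogous to mytable.cols.int_column_1[0:2]
print(rows['int_column_1'])

# Boolean selection also works on the slice once it is in memory
print(rows[rows['float_column_1'] > 31.0])
```

This is handy when post-processing: slice the rows you need out of the table once, then do all further selection and arithmetic on the in-memory array.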

Using SeriesTables: a basic example

Here’s a very basic example usage:

>>> import tables
>>> from pylearn.io.seriestables import *
>>>
>>> tables_file = tables.openFile("series.h5", "w")
>>>
>>> error_series = ErrorSeries(error_name="validation_error", \
...                         table_name="validation_error", \
...                         hdf5_file=tables_file)
>>>
>>> error_series.append((1,), 32.0)
>>> error_series.append((2,), 28.0)
>>> error_series.append((3,), 26.0)

I can then open the file series.h5, which will contain a table named validation_error with a column named epoch and another named validation_error. There will also be timestamp and cpuclock columns, as this is the default behavior. The table rows will correspond to the data added with append() above.

Indices

You may notice that the first parameter to append() is a tuple. This is because the index may have multiple levels; the index is what gives rows their order.

In the default case for ErrorSeries, the index only has an “epoch” level, so the tuple has a single element. But in the ErrorSeries(...) constructor you could have specified the index_names parameter, e.g. ('epoch','minibatch'), which lets you specify both the epoch and the minibatch as the index.
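Because the index is a tuple, rows with a multi-level index order lexicographically: first by epoch, then by minibatch within an epoch. A quick illustration in plain Python (the index tuples here are just made-up examples):

```python
# Two-level (epoch, minibatch) indices, as produced with
# index_names=('epoch', 'minibatch')
indices = [(2, 0), (1, 5), (1, 0), (2, 3)]

# Python tuples compare element by element, so sorting orders the
# rows by epoch first, then by minibatch inside each epoch.
print(sorted(indices))
```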

Summary of the most useful classes

By default, for each of these series, there are also columns for timestamp and CPU clock() value when append() is called. This can be changed with the store_timestamp and store_cpuclock parameters of their constructors.

ErrorSeries
This records one floating point (32 bit) value along with an index in a new table.
AccumulatorSeriesWrapper
This wraps another Series and calls its append() method when its own append() has been called N times, N being a parameter given when constructing the AccumulatorSeriesWrapper. A simple use case: say you want to store the mean of the training error every 100 minibatches. You create an ErrorSeries, wrap it with an Accumulator, and then call its append() for every minibatch. It will collect the errors until it has 100 of them, take their mean (with numpy.mean), store it in the ErrorSeries, and start over. Other “reducing” functions can be used in place of the mean.
BasicStatisticsSeries
This stores the mean, the min, the max and the standard deviation of arrays you pass to its append() method. This is useful, notably, to see how the weights (and other parameters) evolve during training without actually storing the parameters themselves.
SharedParamsStatisticsWrapper
This wraps a few BasicStatisticsSeries. It is specifically designed so you can pass it a list of shared (as in theano.shared) parameter arrays. Each array will get its own table, under a new HDF5 group. You can name each table, e.g. “layer1_b”, “layer1_W”, etc.
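To make the accumulation behavior described above concrete, here is a minimal pure-Python sketch of how such a wrapper could work. The class and parameter names mirror the description of AccumulatorSeriesWrapper, but this is an illustration of the idea, not the library’s actual implementation:

```python
import numpy

class TinyAccumulator(object):
    """Illustrative stand-in for AccumulatorSeriesWrapper: buffers
    values and forwards their reduction every `reduce_every` appends."""

    def __init__(self, base_append, reduce_every=100,
                 reduce_function=numpy.mean):
        self.base_append = base_append        # e.g. an ErrorSeries' append
        self.reduce_every = reduce_every
        self.reduce_function = reduce_function
        self.buffer = []

    def append(self, index, value):
        self.buffer.append(value)
        if len(self.buffer) == self.reduce_every:
            # Forward the reduced value under the most recent index,
            # then start accumulating again.
            self.base_append(index, self.reduce_function(self.buffer))
            self.buffer = []

# Usage: append 200 values; two reduced rows reach the base series.
stored = []
acc = TinyAccumulator(lambda idx, v: stored.append((idx, v)),
                      reduce_every=100)
for i in range(200):
    acc.append((0, i), float(i))
print(len(stored))
```

Swapping numpy.mean for numpy.max or numpy.min gives the other “reducing” behaviors mentioned above.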

Example of real usage

The following is a function where I create the series used to record errors and statistics about parameters in a stacked denoising autoencoder script:

import os
import tables
from pylearn.io.seriestables import *

def create_series(num_hidden_layers):

    # Replace series we don't want to save with DummySeries, e.g.
    # series['training_error'] = DummySeries()

    series = {}

    basedir = os.getcwd()

    h5f = tables.openFile(os.path.join(basedir, "series.h5"), "w")

    # Training error is accumulated over 100 minibatches, then the
    # mean is computed and saved in the training_base series.
    training_base = ErrorSeries(error_name="training_error",
                                table_name="training_error",
                                hdf5_file=h5f,
                                index_names=('epoch', 'minibatch'),
                                title="Training error (mean over 100 minibatches)")

    # This series wraps training_base and performs the accumulation.
    series['training_error'] = \
        AccumulatorSeriesWrapper(base_series=training_base,
                                 reduce_every=100)

    # Validation and test errors are not accumulated; they are saved directly.
    series['validation_error'] = ErrorSeries(error_name="validation_error",
                                             table_name="validation_error",
                                             hdf5_file=h5f,
                                             index_names=('epoch',))

    series['test_error'] = ErrorSeries(error_name="test_error",
                                       table_name="test_error",
                                       hdf5_file=h5f,
                                       index_names=('epoch',))

    # Next we want to store statistics about the parameters, so we
    # first create a table name for each parameter, based on its
    # position in the list.
    param_names = []
    for i in range(num_hidden_layers):
        param_names += ['layer%d_W' % i, 'layer%d_b' % i, 'layer%d_bprime' % i]
    param_names += ['logreg_layer_W', 'logreg_layer_b']

    series['params'] = SharedParamsStatisticsWrapper(new_group_name="params",
                                                     base_group="/",
                                                     arrays_names=param_names,
                                                     hdf5_file=h5f,
                                                     index_names=('epoch',))

    return series

Then, here’s an example of append() usage for each of these series, in pseudocode:

series = create_series(num_hidden_layers=3)

...

for epoch in range(num_epochs):
    for mb_index in range(num_minibatches):
        train_error = finetune(mb_index)
        series['training_error'].append((epoch, mb_index), train_error)

    valid_error = compute_validation_error()
    series['validation_error'].append((epoch,), valid_error)

    test_error = compute_test_error()
    series['test_error'].append((epoch,), test_error)

    # Suppose all_params is a list [layer1_W, layer1_b, ...]
    # where each element is a shared (as in theano.shared) array.
    series['params'].append((epoch,), all_params)

Other targets for appending (e.g. printing to stdout)

SeriesTables was created with an HDF5 file in mind, but often, for debugging, it’s useful to be able to redirect the series elsewhere, notably the standard output. A mechanism was added to do just that.

What you do is create an AppendTarget instance (or more than one) and pass it as an argument to the Series constructor. For example, to print every appended row to the standard output, you use StdoutAppendTarget.
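The idea behind a target is simple: anything with an append()-style hook can receive each row. Here is a rough sketch of a hypothetical target that collects rows in memory instead of printing them; the class name and exact hook signature are illustrative, not the actual AppendTarget API:

```python
class ListAppendTarget(object):
    """Hypothetical target that collects appended rows in memory;
    StdoutAppendTarget would format and print each row instead."""

    def __init__(self):
        self.rows = []

    def append(self, row_fields):
        # row_fields: a dict mapping column names to values for one row
        self.rows.append(dict(row_fields))

target = ListAppendTarget()
target.append({"epoch": 1, "validation_error": 30.0})
target.append({"epoch": 2, "validation_error": 26.0})
print(len(target.rows))
```

A target like this is convenient in unit tests, where you want to inspect what a series recorded without opening an HDF5 file.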

If you want to skip appending to the HDF5 file entirely, this is also possible. You simply specify skip_hdf5_append=True in the constructor. You still need to pass in a valid HDF5 file, though, even though nothing will be written to it (for, err, legacy reasons).

Here’s an example:

import os
import tables
from pylearn.io.seriestables import *

def create_series(num_hidden_layers):

    # Replace series we don't want to save with DummySeries, e.g.
    # series['training_error'] = DummySeries()

    series = {}

    basedir = os.getcwd()

    h5f = tables.openFile(os.path.join(basedir, "series.h5"), "w")

    # Here we create the new target, with a message prepended
    # before every row is printed to stdout.
    stdout_target = StdoutAppendTarget(
        prepend='\n-----------------\nValidation error',
        indent_str='\t')

    # Notice that here we won't even write to the HDF5 file.
    series['validation_error'] = ErrorSeries(error_name="validation_error",
                                             table_name="validation_error",
                                             hdf5_file=h5f,
                                             index_names=('epoch',),
                                             other_targets=[stdout_target],
                                             skip_hdf5_append=True)

    return series

Now calls to series['validation_error'].append() will print output like this to stdout:

----------------
Validation error
        timestamp : 1271202144
        cpuclock : 0.12
        epoch : 1
        validation_error : 30.0

----------------
Validation error
        timestamp : 1271202144
        cpuclock : 0.12
        epoch : 2
        validation_error : 26.0

Visualizing in vitables

vitables is a program with which you can easily explore an HDF5 .h5 file. Here’s a screenshot in which I visualize the series produced by the preceding example:

[Screenshot: _images/vitables_example_series.png]