Application Requirements

Terminology and Abbreviations:

MLA - machine learning algorithm

learning problem - a machine learning application typically characterized by a dataset (possibly dataset folds), one or more functions to be learned from the data, and one or more metrics to evaluate those functions. Learning problems are the benchmarks for empirical model comparison.

n. of - number of

SGD - stochastic gradient descent


Users:

  • New master's and PhD students in the lab should be able to quickly move into ‘production’ mode without having to reinvent the wheel.
  • Students in the two ML classes should be able to play with the library to explore new ML variants. This means some APIs (e.g. Experiment level) must be really well documented and conceptually simple.
  • Researchers outside the lab (who might study and experiment with our algorithms)
  • Partners outside the lab (e.g. Bell, Ubisoft) with closed-source commercial projects.


R1. reproduce previous work (our own and others’)

R2. explore MLA variants by swapping components (e.g. optimization algorithm, dataset)
R3. analyze experimental results (e.g. plotting training curves, finding best
models, marginalizing across hyper-parameter choices)
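
As a rough illustration of "marginalizing across hyper-parameter choices" in R3, the sketch below averages a validation error over every hyper-parameter except one. The record layout (dicts with the hypothetical keys `lr`, `n_hidden`, `valid_error`) and the function name `marginalize` are assumptions made for this example, not an existing API, and the numbers are made up.

```python
from collections import defaultdict


def marginalize(results, keep, metric):
    """Average `metric` over all hyper-parameters except `keep`."""
    groups = defaultdict(list)
    for record in results:
        groups[record[keep]].append(record[metric])
    return {value: sum(scores) / len(scores) for value, scores in groups.items()}


# Made-up results of a small grid over two hypothetical hyper-parameters.
results = [
    {"lr": 0.1, "n_hidden": 10, "valid_error": 0.30},
    {"lr": 0.1, "n_hidden": 100, "valid_error": 0.20},
    {"lr": 0.01, "n_hidden": 10, "valid_error": 0.50},
    {"lr": 0.01, "n_hidden": 100, "valid_error": 0.30},
]

# Mean validation error per learning rate, averaged over n_hidden.
by_lr = marginalize(results, keep="lr", metric="valid_error")
best_lr = min(by_lr, key=by_lr.get)
```

Keeping results as flat records, one per trained model, is one possible design choice: it lets the same table serve plotting, best-model lookup and marginalization.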

R4. disseminate (or serve as platform for disseminating) our own published algorithms

R5. provide implementations of common MLA components (e.g. classifiers, datasets,
optimization algorithms, meta-learning algorithms)
R6. drive large-scale parallelizable computations (e.g. grid search, bagging,
random search)
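
To make R6 more concrete, here is a minimal sketch of a grid search expressed as a list of independent jobs handed to an executor. The names `expand_grid` and `run_job` and the hyper-parameters are hypothetical, and a thread pool stands in for whatever cluster scheduler would be used in practice.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor


def expand_grid(grid):
    """Yield one dict per point of the Cartesian product of `grid`."""
    names = sorted(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def run_job(config):
    """Hypothetical stand-in for training a model; returns a fake score."""
    return config["lr"] * config["n_hidden"]


grid = {"lr": [0.1, 0.01], "n_hidden": [10, 100]}
configs = list(expand_grid(grid))

# The jobs are independent, so any executor can run them in parallel.
# map() returns the scores in the same order as `configs`.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(run_job, configs))
```

Because each job is fully described by its configuration dict, the same list could be handed to a different backend without changing the job function.
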
R7. provide implementations of standard pre-processing algorithms (e.g. PCA,
stemming, Mel-scale spectrograms, GIST features, etc.)

R8. provide high performance suitable for large-scale experiments

R9. be able to use the most efficient algorithms in special case combinations of
learning algorithm components (e.g. when there is a fast k-fold validation algorithm for a particular model family, the library should not require users to rewrite their standard k-fold validation script to use it)
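
One way to read R9 is that the generic validation routine should dispatch to a specialized implementation when the model offers one. The sketch below assumes a hypothetical optional hook named `fast_kfold_mse`; the toy model predicts the training mean, for which each fold's training mean can be derived from the global sum without refitting.

```python
def folds(n, k):
    """Split range(n) into k contiguous folds."""
    bounds = [n * i // k for i in range(k + 1)]
    return [range(bounds[i], bounds[i + 1]) for i in range(k)]


def generic_kfold_mse(model, data, k):
    """Refit the model on every training split (works for any model)."""
    total = 0.0
    for fold in folds(len(data), k):
        held_out = set(fold)
        train = [x for i, x in enumerate(data) if i not in held_out]
        prediction = model.fit(train)
        total += sum((prediction - data[i]) ** 2 for i in fold)
    return total / len(data)


def kfold_mse(model, data, k):
    """User-facing entry point: use the model's fast path if it has one."""
    fast = getattr(model, "fast_kfold_mse", None)  # hypothetical hook
    if fast is not None:
        return fast(data, k)
    return generic_kfold_mse(model, data, k)


class MeanModel:
    """Toy model: fit() returns the constant it would predict."""

    def fit(self, train):
        return sum(train) / len(train)


class FastMeanModel(MeanModel):
    def fast_kfold_mse(self, data, k):
        # Each fold's training mean follows from the global sum, no refit.
        n, s = len(data), sum(data)
        total = 0.0
        for fold in folds(n, k):
            fold_sum = sum(data[i] for i in fold)
            mean = (s - fold_sum) / (n - len(fold))
            total += sum((mean - data[i]) ** 2 for i in fold)
        return total / n


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
slow = kfold_mse(MeanModel(), data, k=3)
fast = kfold_mse(FastMeanModel(), data, k=3)
```

The caller's script is identical in both cases; only the model class changes, which is the property R9 asks for.
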
R10. support experiments on a variety of datasets (e.g. movies, images, text,
sound, reinforcement learning?)

R11. support efficient computations on datasets larger than RAM and GPU memory

R12. support infinite datasets (i.e. generated on the fly)
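
A dataset "generated on the fly" (R12) can be sketched as a generator that never terminates, leaving it to the consumer to decide when to stop. The toy problem and the names `noisy_line_stream` and `minibatches` are assumptions made for this illustration.

```python
import itertools
import random


def noisy_line_stream(seed, slope=2.0, noise=0.1):
    """Endless stream of (x, y) examples from a hypothetical toy problem."""
    rng = random.Random(seed)
    while True:
        x = rng.uniform(-1.0, 1.0)
        yield x, slope * x + rng.gauss(0.0, noise)


def minibatches(stream, batch_size):
    """Group an endless example stream into lists of `batch_size` examples."""
    while True:
        yield list(itertools.islice(stream, batch_size))


# The consumer, not the dataset, decides when to stop.
batches = minibatches(noisy_line_stream(seed=0), batch_size=4)
first_three = list(itertools.islice(batches, 3))
```

Seeding a private `random.Random` instance makes the stream repeatable, so the same sequence of examples can be regenerated from the seed alone.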

R13. apply trained models “in production”.
  • e.g. after trying many combinations of preprocessing, models and associated hyper-parameters, you want to easily recover the full “processing pipeline” that performs best and use it on real/test data later.
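
A minimal sketch of this use case, assuming each trial stores a complete, serializable description of its pipeline (step names plus fitted parameters) next to its score. The step registry, the spec layout and all numbers are hypothetical.

```python
import json

# Hypothetical registry of processing steps, keyed by name.
STEPS = {
    "shift": lambda xs, offset: [x + offset for x in xs],
    "scale": lambda xs, factor: [x * factor for x in xs],
}


def build_pipeline(spec):
    """Turn a saved spec into a single callable."""
    def pipeline(xs):
        for name, params in spec["steps"]:
            xs = STEPS[name](xs, **params)
        return xs
    return pipeline


# Made-up trial records: each stores the full spec next to its score.
trials = [
    {"spec": {"steps": [["scale", {"factor": 2.0}]]}, "valid_error": 0.31},
    {"spec": {"steps": [["shift", {"offset": -3.0}],
                        ["scale", {"factor": 0.5}]]}, "valid_error": 0.12},
    {"spec": {"steps": [["shift", {"offset": 1.0}]]}, "valid_error": 0.25},
]

best = min(trials, key=lambda t: t["valid_error"])
saved = json.dumps(best["spec"])  # what would be written to disk

# Later, in "production": rebuild the pipeline from the saved spec alone.
production = build_pipeline(json.loads(saved))
output = production([1.0, 2.0, 6.0])
```

The point of the sketch is that the saved spec carries everything needed to reapply the winning pipeline, independently of the search that produced it.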

OD comments: Note that R9 and R13 may conflict with each other. Some optimizations made to satisfy R9 may modify the input “symbolic graph” in such a way that extracting the components required for production use (R13) becomes more difficult (or even impossible). Imagine, for instance, that the graph is modified to take advantage of the fact that k-fold validation can be performed efficiently internally by some specific algorithm. Then it may no longer be obvious how to remove the k-fold split from the saved model you want to use in production.

Requirements for component architecture

R14. Serializability of experiments. (essentially in pursuit of R6)

Jobs that run a learning algorithm with our components (datasets, models, algorithms) must be able to serialize the experiment’s state to a string (typically written to disk) and to restart from such a string. There must be a mechanism to tell a job to serialize the experiment as soon as possible; a latency of up to 10 seconds is acceptable. It must also be possible to deserialize the experiment for introspection (inspecting the state of individual components), not just for continuing the experiment. The experiment can assume that resources on disk that were present when the experiment started will be present when the experiment resumes. The experiment cannot assume that resources written by the experiment will still be there (e.g. in /tmp or cwd). Implementations should make an effort to keep the serialized representation compact by omitting anything that can be recomputed or reloaded from disk at deserialization time.

This requirement is aimed at enabling process migration and job control as well as post-hoc analysis of experiment results.
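
The following sketch shows one possible shape for this requirement, under the assumption that the experiment's state is small and JSON-representable. The class `SgdExperiment` and its methods are hypothetical; a real experiment would also need to handle large arrays and references to on-disk resources.

```python
import json


class SgdExperiment:
    """Toy experiment: minimize (w - target)^2 by gradient descent."""

    def __init__(self, target=3.0, lr=0.1):
        self.target, self.lr = target, lr
        self.w, self.step = 0.0, 0

    def run(self, n_steps):
        for _ in range(n_steps):
            self.w -= self.lr * 2.0 * (self.w - self.target)
            self.step += 1

    def serialize(self):
        """Return the full state as a string (typically written to disk)."""
        return json.dumps(vars(self))

    @classmethod
    def deserialize(cls, text):
        """Rebuild an experiment, for resuming or for introspection."""
        exp = cls.__new__(cls)
        exp.__dict__.update(json.loads(text))
        return exp


# Uninterrupted reference run.
reference = SgdExperiment()
reference.run(20)

# Same run, interrupted after 8 steps and resumed from the string.
first = SgdExperiment()
first.run(8)
checkpoint = first.serialize()
resumed = SgdExperiment.deserialize(checkpoint)
resumed.run(12)
```

The deserialized object can be inspected (for example `resumed.w` or `resumed.step`) without continuing the run, which covers the introspection part of the requirement.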

PL: I’m not sure if the job should have to return its current state. I think it would be enough that it returns a consistent checkpoint, even if it is from some time in the past (ideally, not more than a few minutes ago).

OD: I agree with PL.

OD asks: When you say “The experiment cannot assume that resources written by the experiment will still be there”, do you mean we should be able to recover the exact same output after interrupting an experiment, wiping its expdir, and restarting it? This would mean that any output saved on disk by the experiment also has to be serialized within the experiment, which may lead to very big serialization files (and possibly memory issues?) A less constraining interpretation of your statement (which I like better) is that we allow “previous” output to be lost: we only ask that the experiment should be able to produce the “new” outputs after a wipe+restart.

Requirements from the 21 September meeting (some of them overlap with the requirements above):

R15. If you see the library as a driver that controls several components (and we argue that any approach can be seen like this), the driver should always:

  • be serializable
  • respond to internal interrupts (“checkpoints”)
  • respond to external interrupts (timeout)
  • respond to asynchronous interrupts (e.g. SIGTERM)
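
The driver behaviour listed above can be sketched as a loop that polls for stop requests between steps. The class `Driver`, its arguments and the in-memory checkpoint list are assumptions made for this illustration; the signal handler only sets a flag, so the loop always stops at a consistent point.

```python
import signal
import time


class Driver:
    """Hypothetical driver loop that polls for stop requests between steps."""

    def __init__(self, step_fn, timeout=None, checkpoint_every=10):
        self.step_fn = step_fn
        self.timeout = timeout
        self.checkpoint_every = checkpoint_every
        self.stop_requested = False
        self.checkpoints = []  # stand-in for state written to disk
        self.n_steps = 0

    def request_stop(self, signum=None, frame=None):
        # Handlers should do as little as possible: just set a flag.
        self.stop_requested = True

    def run(self, max_steps):
        deadline = None
        if self.timeout is not None:
            deadline = time.monotonic() + self.timeout
        reason = "finished"
        while self.n_steps < max_steps:
            if self.stop_requested:
                reason = "signal"
                break
            if deadline is not None and time.monotonic() >= deadline:
                reason = "timeout"
                break
            self.step_fn(self)
            self.n_steps += 1
            if self.n_steps % self.checkpoint_every == 0:
                self.checkpoints.append(self.n_steps)  # internal checkpoint
        self.checkpoints.append(self.n_steps)  # final state on exit
        return reason


def install_sigterm_handler(driver):
    """Route SIGTERM to the driver (must be called from the main thread)."""
    signal.signal(signal.SIGTERM, driver.request_stop)


def do_nothing(driver):
    pass


def stop_at_fifth_step(driver):
    # Simulate an asynchronous SIGTERM by calling the handler directly.
    if driver.n_steps == 4:
        driver.request_stop()


completed = Driver(do_nothing)
reason_completed = completed.run(max_steps=25)

interrupted = Driver(stop_at_fifth_step)
reason_interrupted = interrupted.run(max_steps=25)

timed_out = Driver(do_nothing, timeout=0.0)
reason_timed_out = timed_out.run(max_steps=25)
```

In a real job, `install_sigterm_handler(driver)` would be called once before `run()`, and the checkpoint list would be replaced by serialization to disk.
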
R16. Cognitive load should be minimal (debatable requirement)
Note: It is hard to actually measure cognitive load, so this should be a soft requirement. In other words, among all proposals that satisfy all other requirements, the one with the least cognitive load should be selected.

R17. The library should allow surgery on long-running jobs

R18. Distributed computation
The library should support at least the following:
  • launch jobs transparently across networks
  • resume jobs (or continue jobs)