Pylearn2 Pull Request Checklist

Last updated: June 8, 2014

This is a preliminary list of common pull request fixes requested. It’s presumed that your pull request should already pass the Travis buildbot, including docstring and code formatting checks.

Are you breaking a statement over multiple lines?

Python supports breaking a logical line over multiple file lines in a number of ways. One is to use backslashes before the line ending. Another is to enclose the broken section is parentheses (), ([] also works but you should only use this if you are otherwise creating a list). Note that if you have open parentheses from a function call you do not need additional parentheses.

In Pylearn2 we generally prefer parentheses, because it means there’s less markup to maintain and leads to less spurious errors.

Yes:

assert some_complicated_conditional_thing, (
    "This is the assertion error on a separate line."
)

No:

assert some_complicated_conditional_thing, \
    "This just gets annoying, especially if there are multiple " \
    "lines of text."

Note that string concatenation across lines is automatic, no need for +. If enclosed in parentheses you don’t need a either:

# Valid Python.
print ("The quick brown fox jumps over the lazy dog. And then "
       "the fox did it again.")

See the PEP8 indentation recommendations for how to arrange indentation for continuation lines.

Do tests exist for the code you’re modifying?

Pylearn2 grew rapidly in the beginning, often without proper attention to testing. Modifying a piece of code in the codebase may alter how it works; if you make such a modification, you should not only verify that tests pass but that tests _exist_ for the piece of code you’re modifying. You should verify that those tests exist and update them as needed, including a test case for the behaviour you’re adding or modifying.

Usually tests for a module foo are found in tests/test_foo.py.

Are you fixing a bug? Did you add a regression test?

Tests that test for previously existing bugs are particularly critical, as further modification of the code may reintroduce the bug by those who are not aware of the subtleties that led to it in the first place.

Are you fixing an issue that is on the issue tracker?

Your pull request description (or a commit message for one of the commits) should include one of the supported variants of the syntax so that the issue is auto-closed upon merge.

Have you squashed out any nuisance commits?

Pull requests with lots and lots of tiny commits are hard to review. Lots of commits that subsequently introduce a minor bug and then fix them can also make bisecting a pain.

Your final pull request should comprise as few commits as logically make sense. Each commit should ideally leave the repository in a working state (tests passing, functionality preserved).

You can squash commits using git rebase -i and following the instructions. Note that you will have to git push –force origin my_branch_name after a rebase.

You should squash to a minimal set of semantically distinct commits before asking for a review, and then possibly squash again if you’ve made lots of commits in response to feedback (note that you can reorder the commits in the editor window given by git rebase -i).

Are you using OrderedDict where necessary? Are you iterating over sets?

The order of iteration over dictionaries in Python is not guaranteed to remain the same across different invocations of the same program. This is a result of a randomized hashing algorithm and is actually an important security feature for preventing certain kinds of Denial-of-Service attacks. Unfortunately, where such data structures are employed in scientific simulations, this can pose reproducibility problems.

The main reason for this is that computations in floating point do not precisely obey the typical laws of arithmetic (commutativity, associativity, distributivity), and slight differences in the order of operations can introduce small differences in result, which can have butterfly effects that significantly alter the results of a long-running job. The order of operations can be altered by the order in which a Theano graph is assembled, and the precise form it takes can unfortunately sometimes alter which compile-time graph optimizations are performed.

In order to stamp out inconsistencies introduced by an unpredictable iteration order, we make extensive use of the OrderedDict class. This class is part of the collections module in Python 2.7 and Python 3.x, however, in order to maintain Python 2.6 compatibility, we import it from theano.compat.python2x, which provides an equivalent pure-Python implementation if the built-in version is not available.

You should consider carefully whether the iteration order over a dictionary you’re using could result in different behaviour. If in doubt, use an OrderedDict. For the updates parameter when creating Theano functions, you _must_ use an OrderedDict, or a list of (shared_variable, update_expression) tuples.

When iterating over sets, consider whether you should first sort your set. The sorted() built-in function is a simple way of doing this.

Are you using print statements?

In most cases you should be using logging statements instead. You can initialize a logger in a new module with:

import logging

log = logging.getLogger(__name__)

And subsequently call into it with log.info(), log.warning(), etc.

Are you creating a sequence and then immediately iterating over it?

If so, consider using the faster and more memory efficient versions.

  • xrange instead of range.
  • from theano.compat.six.moves import zip as izip instead of zip. This import is for Python 3 compatibility.

Are you using zip()/izip() on sequences you expect to be the same length?

Note that zip and izip truncate the sequence of tuples they produce to the length of the shortest input sequence. If you expect, as is often the case, that the sequences you are zipping together should be the same length, use safe_zip or safe_izip defined in pylearn2.utils.

Also see itertools.izip_longest if you want to zip together sequences of unequal length with a fill value.

Are you using the dict/OrderedDict methods keys()/values()/items()?

For values() and items() consider whether itervalues() or iteritems() would be more appropriate, if you’re only iterating over them once, not keeping the result around for any length of time, and don’t need random access.

Also, don’t bother with keys() or iterkeys() at all if you’re just going to iterate over it. for k in my_dictionary iterates over keys by default.

An exception to these rules is if you are _modifying_ the dictionary within the loop. Then you probably want to duplicate things with the keys(), values() and items() calls.

Are you updating a dictionary or OrderedDict with .update()?

If you are using the update() method of a dictionary or OrderedDict and you expect that none of the keys in the argument should already be in the dictionary, use safe_update() defined in pylearn2.utils.

Do you have an except: block?

You should almost never have a bare except: in library code. Use:

except Exception:
    ...

instead. This catches any subclass of Exception but lets through certain low-level exceptions like KeyboardInterrupt, SystemExit, etc. that inherit from BaseException instead. You almost certainly do not want your code to catch these.

Don’t raise a new exception, use the reraise_as method from pylearn2.utils.exc instead.:

except Exception:
    reraise_as(ValueError("Informative error message here"))

This retains the traceback and original error message, allowing for easier debugging using a tool like pdb.

Are you checking to see if an argument is iterable?

In places where a list, tuple, or other iterable object (say, a deque) will suffice, use pylearn2.utils.is_iterable.

Are you checking if something is a string?

Unless you have a very good reason you should probably be using isinstance(foo, basestring) which correctly handles both str and unicode instances.

Are you checking if something is a number?

Usually such checks are unnecessary but where they might be, we’ve defined some helpful constants.

Are you checking if something is _any_ kind of number?

Use isinstance(foo, pylearn2.utils.py_number_types). This checks against Python builtins as well as NumPy-defined numerical types.

Are you checking if something is an integer?

Use isinstance(foo, pylearn2.utils.py_integer_types). This checks against Python builtins as well as NumPy-defined integer types.

Are you checking if something is a float?

First, ask yourself: do you really need to? Would passing an integer here be inappropriate in all circumstances? Would a cast (i.e. float() be sufficient?

If you really need to, use isinstance(foo, pylearn2.utils.py_float_types). This checks against Python builtins as well as NumPy-defined float types.

Are you checking if something is a complex number?

Again, ask yourself whether passing a real here would be an error, and whether you can get away with a cast.

If you really need to, use isinstance(foo, pylearn2.utils.py_complex_types). This checks against Python builtins as well as NumPy-defined complex types.

Are you checking for the presence of np.nan or np.inf in an array?

If so, use pylearn2.utils.contains_nan or pylearn2.utils.contains_inf. To check for either np.nan or np.inf, use pylearn2.utils.isfinite. These functions are faster and more memory effecient than np.any(np.isnan(X)) or np.any(np.isinf(X)).

Are you creating Theano functions?

If you’re building Theano functions, use pylearn2.utils.function. This disables the on_unused_input check, which in most cases you don’t want to consider an error if you’re doing any kind of generic graph building.

Are you creating Theano shared variables?

If your model, cost, etc. creates floating point shared variables you should probably create them with pylearn2.utils.sharedX, which creates shared variables with the dtype set to theano.config.floatX.

Are you casting symbols/constants to a Theano floating point type?

Use pylearn2.utils.as_floatX to cast symbolic quantities to the default floating point type, and use constantX to create symbolic constants from a scalar or ndarray with the dtype specified in theano.config.floatX.

Do you have big nested loops for generating a Cartesian product?

Example:

stuff = []
for i in range(50):
    for j in range(20):
        for k in range(30):
            stuff.append((i, j, k))

Consider whether itertools.product will get the job done more readably and probably more efficiently.

Are you generating combinations or permutations of a set (or list, ...)?

itertools contains the functions permutations, combinations and combinations_with_replacement that will probably get the job done more efficiently than your own code.

Are you overriding methods in your class?

Use the decorator pylearn2.utils.wraps to inherit the docstring if it is unchanged. If you add a docstring to a function that is wrapped in this fashion, it will be appended below the inherited docstring.

Are you writing functions that uses pseudo-random numbers?

If you are using the NumPy generator, are you providing a way to seed it as well as a default seed ? You should never be using numpy.random functions directly. Use pylearn2.utils.rng.make_np_rng with a user-provided seed and a default_seed argument.

If you are using the Theano RNG you should create it similarly with pylearn2.utils.rng.make_theano_rng.

Are you assembling filesystem paths with dir + / + filename or similar?

Use os.path.join rather than concatenating together with ‘/’. This ensures the code still works on Windows.

Are you extracting the directory name or base filename from a file path?

Use os.path.basename and os.path.dirname to ensure Windows compatibility.

Are you opening/closing files?

Use the with statement, i.e.:

with open(fname, 'w') as f:
    f.write('blah blah blah')

This is cleaner and ensures that the file always gets closed, even in an error condition.

Are you adding new files or changing files permissions?

The files containing unit tests (named test_....py) should never be executable, otherwise nose will ignore them, and not execute the tests.