Note
This section assumes the reader has already read through the MLP tutorial and through the documentation of Jobman. You should be familiar with the state/channel variable notions, the sql/sqlschedule(s) commands. You should also be able to connect to the PostgreSQL database.
Note
The code for this section is available for download in the examples directory, under mlp_jobman.py
In this part of the tutorial, we will explore the way Jobman can be used for exploring the hyper-parameters of an MLP. As we saw, these hyper-parameters are (in decreasing order of importance):
- learning rate
- number of hidden units
- minibatch size
- number of training epochs
- L1/L2 penalty
This tutorial was written as a self-containing, complete example of performing hyper-parameter exploration with Jobman.
We need a method that takes as inputs a state and a channel variable. The simplest way to do that with our existing code is to simply add a method that reads in the hyper-parameters from the state variable and passes them to the method that does the optimization:
def experiment(state,channel):
(best_validation_loss, test_score, minutes_trained, iter) = \
sgd_optimization_mnist(state.learning_rate, state.n_hidden, state.L1_reg,
state.L2_reg, state.batch_size, state.n_iter)
state.best_validation_loss = best_validation_loss
state.test_score = test_score
state.minutes_trained = minutes_trained
state.iter = iter
return channel.COMPLETE
Note that we have modified the sgd_optimization_mnist method to return a tuple containing the best validation loss, the corresponding test score, the number of minutes that training took and the iteration at which we obtained the best validation score. This tuple is then added to the state variable.
We present two ways of scheduling jobs in the database, for later execution. The first method uses the command-line and a configuration file, the more advanced one uses a Python script.
A good practice is to keep a flat file with the default values for the hyperparameters. For example, mlp_jobman.conf contains:
learning_rate=0.01
L1_reg=0.00
L2_reg=0.0001
n_iter=50
batch_size=20
n_hidden=10
The reason this is good practice is because in many cases we will be exploring only a few of the hyper-parameters at a time, rather than changing them all at the same time.
The following command
jobman sqlschedule postgres://ift6266h10@gershwin/ift6266h10_sandbox_db?table=mlp_dumi \
mlp_jobman.experiment mlp_jobman.conf
will
- read in the configuration stored in mlp_jobman.conf
- create the table mlp_dumi in database ift6266h10_sandbox_db on gershwin (using the account ift6266h10 and the password stored in ~/.pgpass)
- populate the table with the values inferred from mlp_jobman.conf
- “schedule” mlp_jobman.experiment to run with these values
The output of this function is
ADDING {'n_iter': 10, 'jobman.experiment': 'mlp_jobman.experiment',
'jobman.status': 0, 'jobman.sql.priority': 1.0, 'L1_reg': 0.0, 'learning_rate':
0.01, 'n_hidden': 100, 'batch_size': 20, 'L2_reg': 0.0001}
basically giving us a detailed overview of the job that was just added to the table. You will notice that jobman inserted a couple of different fields:
- jobman.experiment: the name of the module.method that took the (state, channel) as input
- jobman.status: the status of the job (0 = not started, 1 = incomplete, 2 = complete, 3 = err while starting, 4 = err while syncing, 5 = job error, -1 = canceled)
- jobman.sql.priority: the priority at which the job will be scheduled. Higher number == higher priority.
The following command
jobman sqlschedules postgres://ift6266h10@gershwin/ift6266h10_sandbox_db?table=mlp_dumi mlp_jobman.experiment\
mlp_jobman.conf 'n_hidden={{20,30}}'
will populate the mlp_dumi table with values inferred from mlp_jobman.conf and the ones given on the command line. The n_hidden variable values are expanded, so the output is:
[['mlp_jobman.conf', 'n_hidden=20'], ['mlp_jobman.conf', 'n_hidden=30']] [['n_hidden=20'], ['n_hidden=30']]
ADDING {'n_iter': 10, 'jobman.experiment': 'mlp_jobman.experiment', 'jobman.status': 0, 'jobman.sql.priority': 1.0, 'L1_reg': 0.0, 'learning_rate': 0.01, 'n_hidden': 20, 'batch_size': 20, 'L2_reg': 0.0001}
ADDING {'n_iter': 10, 'jobman.experiment': 'mlp_jobman.experiment', 'jobman.status': 0, 'jobman.sql.priority': 1.0, 'L1_reg': 0.0, 'learning_rate': 0.01, 'n_hidden': 30, 'batch_size': 20, 'L2_reg': 0.0001}
This shows how to keep default values for the parameters in a .conf file and add jobs corresponding to multiple values of a given (set of) hyperparameter(s).
If you to do random search of the hyper-paramter, you can do if with the key:=value syntax like this:
jobman sqlschedules postgres://ift6266h10@gershwin/ift6266h10_sandbox_db?table=mlp_dumi mlp_jobman.experiment\
mlp_jobman.conf 'n_hidden={{20,30}}' "learning_rate:=@numpy.random.uniform(1e-6, 1e-1)"
jobman sqlview postgres://ift6266h10@gershwin/ift6266h10_sandbox_db?table=mlp_dumi mlp_dumi_view
This creates a virtual table called mlp_dumi_view for later viewing and interpreting results (see below). Note that each time a new field is added to state, the view should be recreated. For instance, if you create the view before any job is finished, the table will not contain the best_validation_loss column, and if you want to use it you will have to call jobman sqlview ... once again after one job finishes.
Since Jobman is a Python package, we can easily call functions from a Python script, allowing us more flexibility when inserting jobs.
The main differences with the first approach are:
The values of the hyperparameters are in some dictionary-like object, not read from another configuration file (of course, you can have Python code that reads values from a conf file):
state = DD()
state.learning_rate = 0.01
state.L1_reg = 0.00
state.L2_reg=0.0001
state.n_iter=50
state.batch_size=20
state.n_hidden=10
The jobs are inserted in the database one by one:
sql.insert_job(experiment, flatten(state), db)
Here is a script that explore different values for n_hidden, and for the L1 and L2 regularization coefficients, without using both type of regularizations at the same time in an experiment. This would not be possible with one call to jobmah sqlschedules, because it would explore all the combinations of hyperparameter values.
from jobman import api0, sql
from jobman.tools import DD, flatten
# Experiment function
from mlp_jobman import experiment
# Database
TABLE_NAME = 'mlp_dumi'
db = api0.open_db('postgres://ift6266h10@gershwin/ift6266h10_sandbox_db?table='+TABLE_NAME)
# Default values
state = DD()
state.learning_rate = 0.01
state.L1_reg = 0.00
state.L2_reg = 0.0001
state.n_iter = 50
state.batch_size = 20
state.n_hidden = 10
# Hyperparameter exploration
for n_hidden in 20, 30:
state.n_hidden = n_hidden
# Explore L1 regularization w/o L2
state.L2_reg = 0.
for L1_reg in 0., 1e-6, 1e-5, 1e-4:
state.L1_reg = L1_reg
# Insert job
sql.insert_job(experiment, flatten(state), db)
# Explore L2 regularization w/o L1
state.L1_reg = 0.
for L2_reg in 1e-5, 1e-4:
state.L1_reg = L1_reg
# Insert job
sql.insert_job(experiment, flatten(state), db)
# Create the view
db.createView(TABLE_NAME+'_view')
If you simply execute
jobman sql 'postgres://ift6266h10@gershwin/ift6266h10_sandbox_db?table=mlp_dumi' .
at the command line, jobman connects to the database and retrieves a job that:
- has not been executed yet
- has the highest priority (or lowest job ID)
and then actually runs it locally.
The question is now how to run jobs in parallel on a cluster. Jobman now comes with a script called jobdispatch, that schedules jobs on a variety of back-ends, including condor at DIRO. In fact, it is quite simple:
jobdispatch --condor --repeat_jobs=3 jobman sql 'postgres://ift6266h10@gershwin/ift6266h10_sandbox_db?table=mlp_dumi' .
The different options mean:
- –condor: will launch on the condor system (clusters and workstations seen as compute servers) at DIRO
- –repeat_jobs=3: how many copies of jobman sql 'postgres://ift6266h10@gershwin/ift6266h10_sandbox_db?table=mlp_dumi' . it should launch
- jobman sql ‘postgres://ift6266h10@gershwin/ift6266h10_sandbox_db?table=mlp_dumi’ . - the command to launch
This is perhaps slightly confusing. What should be understood is that jobdispatch has no knowledge of the job that should be executed. So, without the –repeat_jobs=3 parameter, it would simply launch 1 jobman command on a condor machine. When –repeat_jobs is 3, jobdispatch launches the jobman command 3 times (on 3 different machines). Each of these jobman commands will try to fetch a task from the database and each will get a different task, even if the requests are sent at the same time (because of the atomicity of database transactions).
To add to the confusion, jobman has itself a mechanism for running more than one job, but in sequential manner. For instance, the following command will launch 3 tasks on the condor cluster. Each of them will execute jobman with the given parameters. These parameters say that each task will fetch up to 5 jobs from the database and run them sequentially.
dbidispatch --condor --repeat_jobs=3 jobman sql -n 5 'postgres://ift6266h10@gershwin/ift6266h10_sandbox_db?table=mlp_dumi' .
Assuming that there are at least 15 tasks in the database, they will all be executed after a while. Note that if n<=0 then each task will run as many jobs as possible (i.e. it will keep trying to fetch jobs from the database as long as there are some left).
The reason these two options exist separately is to allow a certain degree of trade-off between the number of machines used to execute jobs vs. the number of jobs per machine.
jobman sqlstatus [--all] [--print=KEY] [--print-keys] [--status=JOB_STATUS] [--set_status=JOB_STATUS] [--resert_prio] [--select=key=value] [--quiet] [--ret_nb_jobs] postgres://ift6266h10@gershwin/ift6266h10_sandbox_db?table=mlp_dumi JOB_ID1, JOB_ID2, ...
Default: only print the current status of those jobs.
–all: add all jobs in the db to the list of jobs.
–print=KEY: print the value of KEY of jobs in the db. Accept multiple –print option.
–print-keys: print the keys in the state of the first job.
–status=JOB_STATUS: Query the db and add jobs with the status JOB_STATUS to the list jobs.
–reset_prio: put the priority of jobs back to the default value.
–select=key=value: Query the db and add jobs that match all select parameter to the list of jobs.
–quiet: less verbose
–ret_nb_jobs: print only the number of jobs selected.
JOB_STATUS={START,RUNNING,DONE,ERR_START,ERR_SYNC,ERR_RUN,CANCELED} or their equivelent number
psql -h gershwin -U ift6266h10 -d ift6266h10_sandbox_db
ift6266h10_sandbox_db=> select batchsize,bestvalidationloss,learningrate,minutestrained,nhidden,niter,testscore from dumi_mlp_view where jobman_status=2 order by bestvalidationloss;
batchsize | bestvalidationloss | learningrate | minutestrained | nhidden | niter | testscore
-----------+--------------------+--------------+----------------+---------+-------+-----------
20 | 0.0202 | 0.05 | 22.0346666667 | 100 | 50 | 0.0216
20 | 0.0204 | 0.05 | 28.1765 | 150 | 50 | 0.019
20 | 0.0208 | 0.1 | 25.8041666667 | 150 | 50 | 0.0214
20 | 0.0226 | 0.1 | 26.4415 | 100 | 50 | 0.0227
20 | 0.0255 | 0.02 | 9.53816666667 | 50 | 50 | 0.0276
20 | 0.0265 | 0.05 | 19.237 | 50 | 50 | 0.0284
20 | 0.0267 | 0.1 | 10.2448333333 | 50 | 50 | 0.028
20 | 0.0267 | 0.01 | 23.9046666667 | 100 | 50 | 0.0272
20 | 0.0288 | 0.01 | 19.6711666667 | 50 | 50 | 0.0303
20 | 0.1173 | 0.01 | 0.135 | 30 | 1 | 0.1187
20 | 0.1266 | 0.01 | 0.101833333333 | 20 | 1 | 0.1352
20 | 0.1491 | 0.01 | 0.0655 | 10 | 1 | 0.1563
(12 rows)
Note that jobman_status=2 means that only finished jobs will be selected.