Some of the tricks, often not mentioned in papers, end up playing a crucial role.
Contrastive Divergence using one step of Gibbs sampling:
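A minimal NumPy sketch of one CD-1 update for a binary RBM may help fix ideas; all function and variable names here are illustrative, not the notation of this report:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, b, c, v0, lr=0.1, rng=np.random.default_rng(0)):
    """One CD-1 update for a binary RBM (illustrative sketch).
    W: (n_visible, n_hidden) weights; b: visible biases; c: hidden biases.
    v0: (batch, n_visible) binary data batch."""
    # Positive phase: hidden probabilities given the data, then a sample
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # One Gibbs step: sample reconstructed visibles, recompute hidden probs
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)
    # Gradient approximation: positive statistics minus negative statistics
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / n
    b += lr * (v0 - v1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
    return W, b, c
```

The key feature of CD is that the negative-phase Gibbs chain is restarted at the data on every update.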
Stochastic Maximum Likelihood method using one step of Gibbs sampling:
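For contrast, a sketch of one SML (persistent CD) update under the same illustrative conventions; the only change from CD-1 is that the negative-phase chain `v_persist` is carried across updates rather than restarted at the data:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sml_update(W, b, c, v_data, v_persist, lr=0.1,
               rng=np.random.default_rng(0)):
    """One SML/PCD-1 update (illustrative sketch).
    v_persist: (batch, n_visible) persistent Gibbs-chain states."""
    # Positive phase computed from the data, as in CD
    ph_data = sigmoid(v_data @ W + c)
    # Negative phase: one Gibbs step continuing the persistent chain
    ph_p = sigmoid(v_persist @ W + c)
    h_p = (rng.random(ph_p.shape) < ph_p).astype(float)
    pv_p = sigmoid(h_p @ W.T + b)
    v_persist = (rng.random(pv_p.shape) < pv_p).astype(float)
    ph_neg = sigmoid(v_persist @ W + c)
    n = v_data.shape[0]
    W += lr * (v_data.T @ ph_data - v_persist.T @ ph_neg) / n
    b += lr * (v_data - v_persist).mean(axis=0)
    c += lr * (ph_data - ph_neg).mean(axis=0)
    return W, b, c, v_persist
```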
It is trickier to analyse the convergence of CD, but the theory still says a lot about the tricks (e.g. momentum, constant learning rates) used to make CD work in practice.
There are many ways to assess the performance of an RBM, including the log-likelihood, training misclassification error, test misclassification error, reconstruction error, and the quality of samples generated from the model.
Ideally, one would use both the log-likelihood and the test error to assess the performance of training an RBM for classification.
More hidden units are expected to improve classification performance, but running many experiments then becomes prohibitive; in general the model is computationally impractical when the number of hidden units is too large.
Constant learning rate for RBM training.
We found empirically that using mini-batches improves the convergence of the optimization. What is more surprising is that smaller mini-batches seem to converge to a lower test error than runs that use a larger batch size.
We found that properly sampling the second set of visible units is an important part of the algorithm, and that the shortcut of skipping this sampling should not be employed when seeking good classification performance.
This suggests a new strategy in which we anneal the weight decay from a high value to a low value over the course of training, in order to force the RBM to utilize as many hidden units as possible.
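One simple way to realize such a schedule is linear interpolation between a high initial value and a low final value; the function and the particular endpoint values below are illustrative assumptions, not the schedule used in our experiments:

```python
def annealed_weight_decay(epoch, n_epochs, wd_start=1e-2, wd_end=1e-5):
    """Linearly anneal the weight-decay coefficient from wd_start down to
    wd_end over n_epochs epochs (endpoint values are illustrative)."""
    t = epoch / max(n_epochs - 1, 1)  # fraction of training completed
    return (1 - t) * wd_start + t * wd_end
```

The returned coefficient would multiply the weights in the usual weight-decay term of the update at each epoch.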
It seems fairly clear that for shallow training of generative binary RBMs for classification, SML (Stochastic Maximum Likelihood) is superior.
This indicates that test error is not the ideal metric to use when choosing the parameters of an RBM to initialize a DBN.
In order to choose good RBM parameters for use in a DBN, perhaps a different error metric could be used that would be more indicative of the performance of the DBN. One possibility is the reconstruction error: the difference between the data d and the reconstruction v' produced by one iteration of sampling h' from d, followed by sampling v' from h'. This may be more appropriate for predicting the quality of the features learned by the unsupervised portion of DBN training.
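The reconstruction error just described can be sketched as follows; here we sample h' from d and then take the mean-field visible probabilities as v', and measure the difference as a mean squared error (both choices are assumptions for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reconstruction_error(W, b, c, d, rng=np.random.default_rng(0)):
    """Mean squared error between data d and its one-step reconstruction v'.
    W: (n_visible, n_hidden); b, c: visible/hidden biases; d: (batch, n_visible)."""
    ph = sigmoid(d @ W + c)
    h = (rng.random(ph.shape) < ph).astype(float)  # sample h' from d
    v = sigmoid(h @ W.T + b)                       # mean-field v' from h'
    return np.mean((d - v) ** 2)
```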
To do this, we appended class labels to the data and trained a 794-1000-500-250-2 deep autoencoder with Gaussian activations in the last hidden layer.
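Appending the labels amounts to concatenating a one-hot label vector onto each 784-dimensional image, giving the 794-dimensional visible vectors; a minimal sketch (function name and 10-class assumption are illustrative for MNIST-style data):

```python
import numpy as np

def append_labels(images, labels, n_classes=10):
    """Concatenate a one-hot label vector to each image row.
    images: (batch, 784); labels: (batch,) integer class labels.
    Returns a (batch, 784 + n_classes) array."""
    one_hot = np.eye(n_classes)[labels]  # (batch, n_classes)
    return np.concatenate([images, one_hot], axis=1)
```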
In Hinton's "Reducing the dimensionality of data with neural networks", the autoencoder is trained in a purely unsupervised way: the architecture is 784-1000-500-250-2, i.e. there are no class labels in the visible data.