The class notes could be found http://www.cs.toronto.edu/~hinton/csc2535/lectures.html

Cast machine learning to numerical optimization problems.

Quantify solution to problem based on scalar objective function, evaluated on sets of inputs/outputs.

The goal is to adjust model parameters to minimize the objective function given inputs and outputs.

The key is to design machine learning systems is to select representation of inputs and outputs, and mathematical formulation of task as objective function.

The mechanics: optimize objective function given observed data to find best parameters.

Probabilistic models allow us to:

1) account for noisy sensors, actuators

2) make decisions based on partial information about the world

3) describe inherently stochastic aspects of natural world

4) consider data not well described by model

Supervised learning is conditional density estimation P(Y|X);

Unsupervised learning is density estimation P(X);

If we learn the joint distribution, P(X,Y) we could marginalize to get P(X) by summing over all possible Y; also we could use Bayes theory to compute P(Y|X). Thus the joint distribution is the primary object of interest. The central problems concern representing it compactly, and robustly learning the parameters.

Key point about Directed Acyclic Graphical Models is missing edges imply conditional independence.

Sum-Product algorithm: Message definitions

Learning via the likelihood:

Simple Monte Carlo:

Importance sampling:

Sums and integrals, often expectations, occur frequently in statistics.

Monte Carlo approximate expectations with a sample average.

Rejection sampling draws samples from complex distributions.

Importance sampling applies Monte Carlo to any sum/integral.

Markov chain Monte Carlo constructs a random walk that explores target distributions.

On Sun, Oct 24, 2010 at 8:53 PM, Troy Lee <troy.lee2008@gmail.com> wrote:

The theoretical formalism described in this paper seems to be well suited to the description of many different variations of back-propagation.From a historical point of view, back-propagation had been used in the field of optimal control long before its application to connectionist systems has been porposed.The central problem that back-propagation solves is the evaluation of the influence of a parameter on a function whose computation involves several elementary steps.This paper presents a mathematical framework for studying back-propagation based on the Lagrangian formalism. In this framework, inspired by optimal control theory, back-propagation is formulated as an optimization problem with non-linear constraints.The Lagrange function is the sum of an output objective function, which is usually a squared sum of the difference between the actual output and the desired output, and a constraint term which describes the network dynamics.This approach suggests many natural extensions to the basic algorithm.Other easily described variations involve either additional terms in the error function, additional constraints on the set of solutions, or transformations of the parameter space.

Machine learning lecture notes from : http://www.cs.toronto.edu/~zemel/Courses/CS411/lects.html

Constrain NN weights:

Pruning network weights:

1) Weights decay;

2) Optimal Brain damage

Some of the tricks, often not mentioned in papers, end up playing a crucial role. **Contrastive Divergence using one step of Gibbs sampling:****Stochastic Maximum Likelihood method using one step of Gibbs sampling:**

It is trickier to analyse the convergence of CD, but the theory still says a lot about the tricks (e.g. momentum, constant learning rates) used to make CD working in practice.

There are many ways to assess the performance of an RBM, these include log-likelihood, train misclassification error, test misclassification error, reconstruction error, and samples generated from the model.

While ideally one would choose log-likelihood as well as test error for assessing the performance of training an RBM for classification.

It is expected that more hidden units will improve classification performance, but then it becomes prohibitive to run many experiments; in general the model is computationally impractical when the number of hidden units is too large.

Constant learning rate for RBM training.

We found empirically that using mini-batches provides an improvement in the convergence of the optimization. What is more surprising is that smaller mini-batches seem to converge to a lower test error than the methods that use a higher batch size.

We found that sampling the second set of visible units properly is an important part of the algorithm, and that this trick should not be employed when seeking good classification performance.

This suggests a new strategy where we anneal the weight decay from a high value to a low value over the choice of training, in order to force RBM to utilize as many hidden units as possible.

It seems fairly clear that for shallow training of generative binary RBMs for classification, SML (Stochastic Maximum Likelihood) is superior.

This indicates that test error is not the ideal metric to use when choosing the parameters of an RBM to initialize a DBN.

In order to choose good RBM parameters for use in a DBN, perhaps a different error metric could be used that would be more indicative of the performance of the DBN. One possibility is to use the reconstruction error, which is the difference between the data d, and the generated data v' caused by one iteration of sampling h' from d and followed by v' from h'. This may be more appropriate for predicting the quality of the features learned by the unsupervised portion of DBN training.

To do this, we appended class labels to the data and trained a 794-1000-500-250-2 deep autoencoder with Gaussian activations in the last hidden layer.

In Hinton's "Reducing the dimensionality of data with neural networks", the autoencoder is done in a purely unsupervised way. The setting is 784-1000-500-250-2, i.e. no class labels in the visible data.

Not only mixture models, but also a wide variety of other classical statistical models for density estimation are representable as simple networks with one or more layers of adaptive weights.

Following steps convert the standard Bayes rule into a logistic function:

To achieve good generalization it is important to have more data points than adaptive parameters in the model.

It has been demonstrated that MLP models of this form (with one hidden layer) can approximate to arbitrary accuracy any continuous function, defined on a compact domain, provided the number M of hidden units is sufficiently large.

The linear, logistic, and softmax functions are (inverse) canonical links for the Gaussian, Bernoulli, and multinomial distributions, respectively.

A variety of such pruning algorithms are available [cf. Bishop, 1995].

Some theoretical insight into the problem of overfitting can be obtained by decomposing the error into the sum of bias and variance terms. A model which is too inflexible is unable to represent the true structure in the underlying density function and this gives rise to a high bias. Conversely, a model which is too flexible becomes tuned to the specific details of the particular data set and gives a high variance. The best generalization is obtained from the optimum trade-off of bias against variance.

Regularization methods can be justified within a general theoretical framework known as structural risk minimization. Structural risk minimization provides a quantitative measure of complexity known as the VC dimension. The theory shows that the VC dimension predicts the different between performance on a training set and performance on a test set; thus, the sum log likelihood and (some function of) VC dimension provides a measure of generalization performance. This motivates regularization methods and provides some insight into possible forms for the regularizer.

Pre-processing is important for NN:

1) A simple normalization of the input to give it zero mean and unit variance;

2) Reducing the dimensionality of the input space;

A standard technique for dimensionality reduction is principle component analysis. Such methods, however, make use only of the input data and ignore the target values, and can sometimes be significantly sub-optimal.

One way of achieving such translation invariance for NN is to make use of the technique of shared weights. This involves a network architecture having many hidden layers in which each unit takes inputs only from a small patch, called a receptive field, of units in the previous layer. By a process of constraining neighboring units to have common weights, it can be arranged that the output of the network is insensitive to translation of the input image. A further benefit of weight sharing is that the number of independent parameters is much smaller than the number of weights, which assists with the problem of model complexity.

The relationship between probabilistic graphical models and neural networks is rather strong; indeed it is often possible to reduce one kind of model to the other.

Following figure shows that it is possible to treat an HMM as a special case of a Boltzmann machine.

Feedforward neural networks trained by error backpropagation are examples of nonparametric regression estimators.

The fundamental challenges in neural modeling are about representation rather than learning per se.

Learning, as it is represented in some current neural networks, can be formulated as a (nonlinear) regression problem.

The **essence of the bias/variance dilemma** lies in the fact that estimation error can be decomposed into two components, know as bias and variance; whereas incorrect models lead to high bias, truly model-free inference suffers from high variance. Thus, mode-free (tabula rasa) approaches to complex inference tasks are slot to "converge", in the sense that large training samples are required to achieve acceptable performance. This is the effect of high variance, and is a consequence of the large number of parameters, indeed infinite number in truly mode-free inference, that need to be estimated. Prohibitively large training sets are then required to reduce the variance contribution to estimation error. Parallel architectures and fast hardware do not help here: this "convergence problem" has to do with training set size rather than implementation. The only way to control the variance in complex inference problems is to use model-based inference is bias-prone: proper models are hard to identify for these more complex (and interesting) inference problems, and any model-based scheme is likely to be incorrect for the task at hand, that is highly biased.

In other words, among all functions of x, the regression is the best predictor of y given x, in the mean-squared-error sense.

This is indeed a reassuring property, but it comes with a high price: depending on the particular algorithm and the particular regression, non-parametric methods can be extremely slow to converge. That is, they may require very large numbers of examples to make relatively crude approximations of the target regression function. Indeed, with small smaples the estimator may be too dependent on the particular samples observed, that is, on the particular realizations of (x,y) (we say that the variance of the estimator is high). Thus, for a fixed and finite training set, a parametric estimator may actually outperform a nonparametric estimator, even when the true regression is outside of the parameterized class.

The **regression** problem is to construct a function f(x) based on a "training set" for the purpose of approximating y at future observations of x. This is sometimes called "generalization" a term borrowed from psychology.

An unbiased estimator may still have a large mean-squared error if the variance is large. Thus either bias or variance can contribute to poor performance.

There is often a **tradeoff between the bias and variance contributions to the estimation error**, which makes for a kind of "uncertainty principle". Typically, variance is reduced through "smoothing", via a combining, for example, of the influences of samples that are nearby in the input space. This, however, will introduce bias, as details of the regression function will be lost; for example, sharp peaks and valleys will be blurred.

The **general recipe of for obtaining consistency in nonparametric estimation:** slowly remove bias.

Nonparametric estimators are generally indexed by** one or more parameters which control bias and variance;** these parameters must be properly adjusted, as functions of sample size, to ensure consistency, that is, convergence of mean-squared error to zero in the large-sample-size limit. The number of neighbors k, the kernel bandwidth sigma, and the number of hidden units play these roles, respectively, in nearest-neighbor, Parzen-window, and feedforward-neural-network estimators. These "smoothing parameters" typically enforce a degree of regularity (hence bias), thereby "controlling" the variance. As we shall see in Section 4.2, consistency theorems specify asymptotic rates of growth or decay of these parameters to guarantee convergence to the unknown regression, or more generally, to the object of estimation. Thus, for example, a rate of growth of the number of neighbors of of the number of hidden units, or a rate of decay of the bandwidth, is specified as a function of the sample size N.

The most widely studied approach to automatic smoothing is **"cross-validation"**. The idea of this technique, usually attributed to Stone(1974).

Nonparametric methods may be characterized by the property of consistency: in the appropriate asymptotic limit they achieve the best possible performance for any learning task given to them, however difficult this task may be.

We also saw that **mean-squared error can be decomposed into a bias term and a variance term**. Both have to be made small if we want to achieve good performance. The practical issue is then the following: Can we hope to make both bias and variance "small", with "reasonably" sized training sets, in "interesting" problems, using nonparametric inference algorithms such as nearest-neighbor, CART, feedforward networks, etc.?

Layered networks, Boltzmann Machines, and older methods like nearest neighbor or window estimators, can indeed form the basis of a trainable, "from scratch", speech recognition system, or a device for invariant object recognition. With enough examples and enough computing power, the performance of these machines will necessarily approximate the best possible for the task at hand. **There would be no need for prepocessing or devising special representations: the "raw data" would do.**

For many problems, the constrained structures can do better than general purpose structures.

The use of the **graph-matching distance** yields significantly better results on this task.

Adopting an appropriate data representation is an efficient means for designing the bias required for solving many hard problems in machine perception. This view is of course shared by many authors. As Anderson and Rosenfeld (1988) put it: "**A good representation does most of the work**".

The theoretical formalism described in this paper seems to be well suited to the description of many different variations of back-propagation.

From a historical point of view, back-propagation had been used in the field of optimal control long before its application to connectionist systems has been porposed.

The central problem that back-propagation solves is the evaluation of the influence of a parameter on a function whose computation involves several elementary steps.

This paper presents a mathematical framework for studying back-propagation based on the Lagrangian formalism. In this framework, inspired by optimal control theory, back-propagation is formulated as an optimization problem with non-linear constraints.

The Lagrange function is the sum of an output objective function, which is usually a squared sum of the difference between the actual output and the desired output, and a constraint term which describes the network dynamics.

This approach suggests many natural extensions to the basic algorithm.

Other easily described variations involve either additional terms in the error function, additional constraints on the set of solutions, or transformations of the parameter space.

Connectionist networks are composed of relatively simple, neuron-like processing elements that store all their long-term knowledge in the strengths of the connections between processors.

The network generally learns to use distributed representations in which each input vector is represented by activity in many different hidden units, and each hidden unit is involved in representing many different input vectors.

The ability to represent complex hierarchical structures efficiently and to apply structure sensitive operations to these representations seems to be essential.

The outcomes of these two battles suggest that as the learning procedures become more sophisticated the advantage of automatic parameter tuning may more than outweigh the representational inadequacies of the restricted systems that admit such optimization techniques.

Clearly, the ultimate goal is efficient learning procedures for representationally powerful systems. The disagreement is about which of these two objectives should be sacrificed in the short term.

The original article on Error Back Propagation for Neural Networks.

The Matlab function imagesc automatically scales the data matrices to be displayed as a colored figure. The actual color depends on the color map used.

How does the scaling work? Actually, it is just a linear scaling which maps the original data values to the color map indices.

A blog post from Steve explained it detailedly, which can be found http://blogs.mathworks.com/steve/2006/02/10/all-about-pixel-colors-part-3/.

PLearn is a C++ Machine Learning package.**The installation on Ubuntu is much easy:** **python-numarray** is not needed, but **python-numpy-dev** is necessary

Follow the guidance on this page: http://plearn.berlios.de/installation_guide/node4.html

Install all the prerequisite and recommended packages through system's syntactic package manager.

Note that in your system, the version maybe different, the version number should not be smaller than the number on the guidance page.

In my installation, "edfblas" package is not found and not installed, instead the libblas is installed;

also for lapack package, no lapack3, lapack3-dev are found; liblapack3 and liblapack3-dev are installed instead;

After the package installation, edit the ~/.bashrc file, to include following lines:

export PLEARNDIR=${HOME}/PLearn

`export PATH=$PLEARNDIR/scripts:`

`$PLEARNDIR/commands:${PATH}`

`export PYTHONPATH=$PLEARNDIR/python_modules`

Change them to the correct path on your own system.

Next just following what the instructions on that page tells you.

To compile the executable, **pymake plearn.cc**

The installation on Suse SLES 10sp3 is a little difficult:

The major problem is to install the required packages. On the Suse system, the package name are different from what are used in Ubuntu.

there is not libboost, but boost package

libnspr4 is called mozilla-nspr4

libncurses is called ncurse directly

Anyway try to search the key name of that package when not found.

One more comments on installing those packages is that **do use Yast instead of zypper**. I'm not sure why some packages are not found with zypper but with Yast, with the same installation sources.

After the package installation, everything becomes the same and you are near the success.

Deep Architectures have been proposed in recent years, and experiments on simple tasks have shown much better results compared with existing shallow structure models. **The deep learning algorithms also need to be adapted in order to scale to harder, potentially "real life" problems.**

The idea of Deep architecture is that higher level abstract features could be captured, which is believed to be much more robust to the variations in the original feature space.

Through the deep architecture, the effects of different variation factors could be modeled by the layered structure.

The deep architecture is believed to work well on problems that the underlying data distribution can be thought as the product of factor distributions, which means that a sample corresponds to a combination of particular value for these factors.

In this paper, the authors experimented with plenty of factors of variations on the MNIST digit database to compare with different models including shallow SVM models and deep believe network, stacked auto-associators.

The shallow structures:

The DBN structure:

The stacked auto-associator structure:

The results of their experiments:

From the results, most of the time the deep architecture works well with variations, but there are also some cases they are worse than the shallow architectures.

In the talk presented by one of the author, Dumitru Erhan, another set of experiments using **Multi-layered kernel machines** for comparison, which was proposed by Schoelkopf et al. in 1998 by stacking up kernel-PCAs.

They have shown that the Multi-layer kernel machines work pretty well.

One last thing, their experiments are done using **PLearn**.

Subscribe to:
Posts (Atom)