Thursday, July 8, 2010

Introduction to Deep Learning

From: http://www.iro.umontreal.ca/~pift6266/H10/notes/deepintro.html

Introduction to Deep Learning Algorithms

See the following article for a recent survey of deep learning:

Yoshua Bengio, Learning Deep Architectures for AI, Foundations and Trends in Machine Learning, 2(1), 2009

Depth

The computations involved in producing an output from an input can be represented by a flow graph: a graph in which each node represents an elementary computation and a value (the result of applying that computation to the values at the node's children). The set of computations allowed at each node, together with the possible graph structures, defines a family of functions. Input nodes have no children; output nodes have no parents.

The flow graph for the expression sin(a^2 + b/a) could be represented by a graph with two input nodes a and b; one node for the division b/a, taking a and b as input (i.e., as children); one node for the square, taking only a as input; one node for the addition, whose value would be a^2 + b/a, taking the a^2 and b/a nodes as input; and finally one output node computing the sine, with a single input coming from the addition node.
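This flow graph can be sketched directly in code. The small Node class below is illustrative (it is not from the course notes); children act as a node's inputs, and evaluating the output node walks the graph:

```python
import math

# A tiny flow graph for sin(a**2 + b/a). Each node holds an elementary
# operation and its children (inputs).
class Node:
    def __init__(self, op, children=()):
        self.op, self.children = op, children

    def value(self):
        return self.op(*(c.value() for c in self.children))

def const(x):
    return Node(lambda: x)        # input node: no children

a, b = const(2.0), const(6.0)
square = Node(lambda x: x * x, (a,))       # a^2
ratio  = Node(lambda x, y: x / y, (b, a))  # b/a
total  = Node(lambda x, y: x + y, (square, ratio))
out    = Node(math.sin, (total,))          # sin(a^2 + b/a) = sin(7.0) here

print(out.value())
```

The longest path from an input to the output (e.g. a → square → total → out) has length 3, which is the depth of this graph in the sense defined below.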

A particular property of such flow graphs is depth: the length of the longest path from an input to an output.

Traditional feedforward neural networks can be considered to have depth equal to the number of layers (i.e. the number of hidden layers plus 1, for the output layer). Support Vector Machines (SVMs) have depth 2 (one for the kernel outputs or for the feature space, and one for the linear combination producing the output).

Motivations for Deep Architectures

The main motivations for studying learning algorithms for deep architectures are the following:

Insufficient depth can hurt

Depth 2 is enough in many cases (e.g. logical gates, formal [threshold] neurons, sigmoid neurons, Radial Basis Function [RBF] units as in SVMs) to represent any function to a given target accuracy. But this may come at a price: the required number of nodes in the graph (i.e., computations, and also the number of parameters when we try to learn the function) may grow very large. Theoretical results show that there exist function families for which the required number of nodes grows exponentially with the input size. This has been shown for logical gates, formal neurons, and RBF units. For logical gates, Håstad has shown families of functions which can be efficiently (compactly) represented with O(n) nodes (for n inputs) when depth is d, but for which an exponential number (O(2^n)) of nodes is needed if depth is restricted to d-1.
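A classic concrete instance of this depth/size trade-off for logical gates is the parity function. The sketch below simply counts components under the standard constructions (the function names are made up): a deep circuit chains two-input XOR gates, while a flat depth-2 sum-of-products needs one AND term per odd-weight input pattern.

```python
from itertools import product

def deep_gate_count(n):
    # A chain (or balanced tree) of 2-input XOR gates: n - 1 gates.
    return n - 1

def shallow_term_count(n):
    # Depth-2 sum-of-products for parity: one AND term per input
    # pattern with an odd number of 1s, i.e. 2**(n-1) terms.
    return sum(1 for bits in product([0, 1], repeat=n)
               if sum(bits) % 2 == 1)

for n in (4, 8, 12):
    print(n, deep_gate_count(n), shallow_term_count(n))
```

The gate count grows linearly with depth available, but exponentially when the depth is capped at 2.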

One can see a deep architecture as a kind of factorization. Most randomly chosen functions cannot be represented efficiently, whether with a deep or a shallow architecture. But many functions that can be represented efficiently with a deep architecture cannot be represented efficiently with a shallow one (see the polynomials example in the Bengio survey paper). The existence of a compact deep representation indicates that some kind of structure exists in the underlying function to be represented. If there were no structure whatsoever, it would not be possible to generalize well.

The brain has a deep architecture

For example, the visual cortex is well-studied and shows a sequence of areas each of which contains a representation of the input, and signals flow from one to the next (there are also skip connections and at some level parallel paths, so the picture is more complex). Each level of this feature hierarchy represents the input at a different level of abstraction, with more abstract features further up in the hierarchy, defined in terms of the lower-level ones.

Note that representations in the brain are in between dense distributed and purely local: they are sparse: about 1% of neurons are active simultaneously in the brain. Given the huge number of neurons, this is still a very efficient (exponentially efficient) representation.

Cognitive processes seem deep

  • Humans organize their ideas and concepts hierarchically.
  • Humans first learn simpler concepts and then compose them to represent more abstract ones.
  • Engineers break up solutions into multiple levels of abstraction and processing.

It would be nice to learn / discover these concepts (knowledge engineering failed because of poor introspection?). Introspection of linguistically expressible concepts also suggests a sparse representation: only a small fraction of all possible words/concepts are applicable to a particular input (say a visual scene).

Breakthrough in Learning Deep Architectures

Before 2006, attempts at training deep architectures failed: training a deep supervised feedforward neural network tended to yield worse results (in both training and test error) than shallow ones (with 1 or 2 hidden layers).

Three papers changed that in 2006, spearheaded by Hinton’s revolutionary work on Deep Belief Networks (DBNs):

The following key principles are found in all three papers:

  • Unsupervised learning of representations is used to (pre-)train each layer.
  • Layers are pre-trained one at a time, each on top of the previously trained ones; the representation learned at each level is the input for the next layer.
  • Supervised training is then used to fine-tune all the layers (in addition to one or more extra layers dedicated to producing predictions).

DBNs use Restricted Boltzmann Machines (RBMs) for unsupervised learning of the representation at each layer. The Bengio et al. paper explores and compares RBMs and auto-encoders (a neural network that predicts its input through a bottleneck internal representation layer). The Ranzato et al. paper uses a sparse auto-encoder (similar to sparse coding) in the context of a convolutional architecture. Auto-encoders and convolutional architectures will be covered later in the course.

Since 2006, a plethora of other papers on the subject of deep learning has been published, some of them exploiting other principles to guide training of intermediate representations. See Learning Deep Architectures for AI for a survey.

Posted via email from Troy's posterous

An Introduction to Machine Learning

From: http://www.iro.umontreal.ca/~pift6266/H10/notes/mlintro.html

Very Brief Introduction to Machine Learning for AI

The topics summarized here are covered in these slides.

Intelligence

The notion of intelligence can be defined in many ways. Here we define it as the ability to take the right decisions, according to some criterion (e.g., survival and reproduction, for most animals). Taking better decisions requires knowledge in a form that is operational, i.e., that can be used to interpret sensory data and to inform decisions.

Artificial Intelligence

Computers already possess some intelligence thanks to all the programs that humans have crafted, which allow them to “do things” we consider useful (and that is basically what we mean by a computer taking the right decisions). But there are many tasks that animals and humans are able to do rather easily yet which remain, at the beginning of the 21st century, out of reach of computers. Many of these tasks fall under the label of Artificial Intelligence, and include many perception and control tasks. Why have we failed to write programs for these tasks? I believe it is mostly because we do not know explicitly (formally) how to do them, even though our brain (coupled with a body) can. Doing these tasks involves knowledge that is currently implicit, but we have information about them through data and examples (e.g., observations of what a human would do given a particular request or input). How do we get machines to acquire that kind of intelligence? Using data and examples to build operational knowledge is what learning is about.

Machine Learning

Machine learning has a long history and numerous textbooks have been written that do a good job of covering its main principles. Among the recent ones I suggest:

Here we focus on a few concepts that are most relevant to this course.

Formalization of Learning

First, let us formalize the most common mathematical framework for learning. We are given training examples

{\cal D} = \{z_1, z_2, \ldots, z_n\}

with the z_i being examples sampled from an unknown process P(Z). We are also given a loss functional L which takes as argument a decision function f and an example z, and returns a real-valued scalar. We want to minimize the expected value of L(f,Z) under the unknown generating process P(Z).

Supervised Learning

In supervised learning, each example is an (input, target) pair: Z = (X, Y), and f takes an X as argument. The most common settings are

  • regression: Y is a real-valued scalar or vector, the output of f is in the same set of values as Y, and we often take as loss functional the squared error

L(f,(X,Y)) = ||f(X) - Y||^2

  • classification: Y is a finite integer (e.g. a symbol) corresponding to a class index, and we often take as loss function the negative conditional log-likelihood, with the interpretation that f_i(X) estimates P(Y=i|X):

    L(f,(X,Y)) = -\log f_Y(X)

    where we have the constraints

    f_Y(X) \geq 0 \;\;,\; \sum_i f_i(X) = 1
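Both loss functionals are easy to state in code. Here is a minimal sketch for a single example (X, Y); the function names are illustrative, not from the course notes:

```python
import math

def squared_error(f_x, y):
    # Regression loss: ||f(X) - Y||^2 for vector-valued outputs.
    return sum((a - b) ** 2 for a, b in zip(f_x, y))

def neg_log_likelihood(f_x, y):
    # Classification loss: -log f_Y(X), where f_x[i] estimates P(Y=i|X).
    # The constraints: non-negative entries summing to 1.
    assert all(p >= 0 for p in f_x) and abs(sum(f_x) - 1.0) < 1e-9
    return -math.log(f_x[y])

print(squared_error([1.0, 2.0], [0.0, 2.0]))       # 1.0
print(neg_log_likelihood([0.1, 0.7, 0.2], 1))      # -log(0.7)
```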

Unsupervised Learning

In unsupervised learning we are learning a function f which helps to characterize the unknown distribution P(Z). Sometimes f is directly an estimator of P(Z) itself (this is called density estimation). In many other cases f is an attempt to characterize where the density concentrates. Clustering algorithms divide up the input space into regions (often centered around a prototype example or centroid). Some clustering algorithms create a hard partition (e.g. the k-means algorithm) while others construct a soft partition (e.g. a Gaussian mixture model), which assigns to each Z a probability of belonging to each cluster. Another kind of unsupervised learning algorithm constructs a new representation for Z. Many deep learning algorithms fall in this category, as does Principal Components Analysis.
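As a tiny illustration of a hard-partition clustering algorithm, the sketch below runs k-means on 1-D toy data (the data and names are made up, and real implementations work on vectors):

```python
def kmeans_1d(xs, centroids, iters=10):
    # Lloyd's algorithm: assign each point to its nearest centroid
    # (hard partition), then move each centroid to its cluster's mean.
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for x in xs:
            nearest = min(range(len(centroids)),
                          key=lambda j: abs(x - centroids[j]))
            clusters[nearest].append(x)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

xs = [0.0, 0.2, 0.4, 9.8, 10.0, 10.2]
print(kmeans_1d(xs, [0.0, 5.0]))  # centroids settle near 0.2 and 10.0
```

A Gaussian mixture model would replace the hard nearest-centroid assignment with a probability of membership in each cluster.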

Local Generalization

The vast majority of learning algorithms exploit a single principle for achieving generalization: local generalization. It assumes that if input example x_i is close to input example x_j, then the corresponding outputs f(x_i) and f(x_j) should also be close. This is basically the principle used to perform local interpolation. It is very powerful, but it has limitations: what if we have to extrapolate? Or, equivalently, what if the target unknown function has many more variations than the number of training examples? In that case local generalization cannot work, because we need at least as many examples as there are ups and downs of the target function in order to cover those variations and generalize by this principle. This issue is deeply connected to the so-called curse of dimensionality, for the following reason: when the input space is high-dimensional, it is easy for it to have a number of variations of interest that is exponential in the number of input dimensions. For example, imagine that we want to distinguish between 10 different values of each input variable (each element of the input vector), and that we care about all 10^n configurations of these n variables. Using only local generalization, we need to see at least one example of each of these 10^n configurations in order to generalize to all of them.
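The counting argument is stark even for small n:

```python
# Number of training examples required by purely local generalization
# when each of n input variables takes 10 distinguishable values:
# one example per configuration of interest.
values_per_dim = 10
for n in (2, 5, 10):
    print(n, values_per_dim ** n)   # 100, 100000, 10000000000
```

At n = 10 we already need ten billion examples, which is why local generalization alone breaks down in high dimensions.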

Distributed versus Local Representation and Non-Local Generalization

A simple-minded binary local representation of integer N is a sequence of B bits (with N < B) that are all 0 except the N-th one. A simple-minded binary distributed representation of integer N is a sequence of log_2 B bits giving the usual binary encoding of N. In this example we see that distributed representations can be exponentially more compact than local ones. In general, for learning algorithms, distributed representations have the potential to capture exponentially more variations than local ones for the same number of free parameters. They hence offer the potential for better generalization, because learning theory shows that the number of examples needed to tune O(B) effective degrees of freedom (so as to achieve a desired degree of generalization performance) is O(B).
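The two encodings can be compared side by side; the helper names below are made up for illustration:

```python
import math

def one_hot(n, b):
    # Local code: B bits, all zero except bit n (0-indexed).
    return [1 if i == n else 0 for i in range(b)]

def binary(n, b):
    # Distributed code: ceil(log2(B)) bits, the usual binary encoding.
    width = max(1, math.ceil(math.log2(b)))
    return [(n >> i) & 1 for i in reversed(range(width))]

print(one_hot(5, 8))   # [0, 0, 0, 0, 0, 1, 0, 0] -- 8 bits for B = 8
print(binary(5, 8))    # [1, 0, 1]                -- 3 bits suffice
```

Eight bits under the local code carry the same information as three bits under the distributed code, and the gap widens exponentially as B grows.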

Another illustration of the difference between distributed and local representations (and the corresponding local and non-local generalization) is the contrast between (traditional) clustering and Principal Component Analysis (PCA) or Restricted Boltzmann Machines (RBMs). The former is local while the latter two are distributed. With k-means clustering we maintain a vector of parameters for each prototype, i.e., one for each of the regions distinguishable by the learner. With PCA we represent the distribution by keeping track of its major directions of variation. Now imagine a simplified interpretation of PCA in which we care mostly, for each direction of variation, whether the projection of the data in that direction is above or below a threshold. With d directions, we can thus distinguish between 2^d regions. RBMs are similar in that they define d hyper-planes and associate a bit indicating which side of each hyper-plane the input falls on. An RBM therefore associates one input region to each configuration of its representation bits (these bits are called the hidden units, in neural network parlance). The number of parameters of the RBM is roughly equal to the number of these bits times the input dimension. Again, we see that the number of regions representable by an RBM or by PCA (distributed representations) can grow exponentially in the number of parameters, whereas the number of regions representable by traditional clustering (e.g. k-means or a Gaussian mixture, local representations) grows only linearly with the number of parameters. Another way to look at this: an RBM can generalize to a new region corresponding to a configuration of its hidden-unit bits for which no example was seen, something not possible for clustering algorithms (except in the trivial sense of locally generalizing to that new region what has been learned for the nearby regions for which examples have been seen).


A Course on Deep Learning

From: http://www.iro.umontreal.ca/~pift6266/H10/notes/contents.html

Contents


Wednesday, July 7, 2010

Python Module Search Path

Modifying Python’s Search Path

When the Python interpreter executes an import statement, it searches for both Python code and extension modules along a search path. A default value for the path is configured into the Python binary when the interpreter is built. You can determine the path by importing the sys module and printing the value of sys.path.
$ python
Python 2.2 (#11, Oct 3 2002, 13:31:27)
[GCC 2.96 20000731 (Red Hat Linux 7.3 2.96-112)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.path
['', '/usr/local/lib/python2.3', '/usr/local/lib/python2.3/plat-linux2',
 '/usr/local/lib/python2.3/lib-tk', '/usr/local/lib/python2.3/lib-dynload',
 '/usr/local/lib/python2.3/site-packages']
>>>

The null string in sys.path represents the current working directory.

The expected convention for locally installed packages is to put them in the .../site-packages/ directory, but you may want to install Python modules into some arbitrary directory. For example, your site may have a convention of keeping all software related to the web server under /www. Add-on Python modules might then belong in /www/python, and in order to import them, this directory must be added to sys.path. There are several different ways to add the directory.

The most convenient way is to add a path configuration file to a directory that’s already on Python’s path, usually to the .../site-packages/ directory. Path configuration files have an extension of .pth, and each line must contain a single path that will be appended to sys.path. (Because the new paths are appended to sys.path, modules in the added directories will not override standard modules. This means you can’t use this mechanism for installing fixed versions of standard modules.)

Paths can be absolute or relative, in which case they’re relative to the directory containing the .pth file. See the documentation of the site module for more information.
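The .pth mechanism can be exercised programmatically through the standard site module. The sketch below uses a temporary directory rather than the real site-packages directory, and the file and directory names are made up:

```python
import os
import site
import sys
import tempfile

# A throwaway "site directory" containing one .pth file.
base = tempfile.mkdtemp()
extra = os.path.join(base, "extra_modules")
os.mkdir(extra)

# The .pth file lists one directory per line to append to sys.path.
with open(os.path.join(base, "extra.pth"), "w") as f:
    f.write(extra + "\n")

site.addsitedir(base)       # processes *.pth files found in base
print(extra in sys.path)    # the listed directory was appended
```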

A slightly less convenient way is to edit the site.py file in Python’s standard library and modify sys.path. site.py is automatically imported when the Python interpreter is executed, unless the -S switch is supplied to suppress this behaviour. So you could simply edit site.py and add two lines to it:

import sys
sys.path.append('/www/python/')

However, if you reinstall the same major version of Python (perhaps when upgrading from 2.2 to 2.2.2, for example) site.py will be overwritten by the stock version. You’d have to remember that it was modified and save a copy before doing the installation.

There are two environment variables that can modify sys.path. PYTHONHOME sets an alternate value for the prefix of the Python installation. For example, if PYTHONHOME is set to /www/python, the search path will be set to ['', '/www/python/lib/pythonX.Y/', '/www/python/lib/pythonX.Y/plat-linux2', ...].

The PYTHONPATH variable can be set to a list of paths that will be added to the beginning of sys.path. For example, if PYTHONPATH is set to /www/python:/opt/py, the search path will begin with ['/www/python', '/opt/py']. (Note that directories must exist in order to be added to sys.path; the site module removes paths that don’t exist.)
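A quick way to see this without touching your shell configuration is to spawn a child interpreter with PYTHONPATH set. Temporary directories are used here so the entries exist; this sketch assumes a reasonably modern Python (3.7+) for subprocess.run(capture_output=...):

```python
import os
import subprocess
import sys
import tempfile

# Two directories to prepend via PYTHONPATH in a child interpreter.
d1, d2 = tempfile.mkdtemp(), tempfile.mkdtemp()
env = dict(os.environ, PYTHONPATH=os.pathsep.join([d1, d2]))

out = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.path)"],
    env=env, capture_output=True, text=True,
).stdout

print(d1 in out and d2 in out)  # both directories appear in the child's path
```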

Finally, sys.path is just a regular Python list, so any Python application can modify it by adding or removing entries.

