Thursday, July 29, 2010

[HTK] HVite output information

hvite.pdf (71 KB)

When running HVite for either recognition or alignment, setting the trace level to 1 (-T 1) prints information in the following format to the terminal:

File: /home/li-bo/research/databases/WSJ/mfcc_0_d_a_z/c2l/c2la0102.mfcc
c2l  ==  [482 frames] -72.9054 [Ac=-35140.4 LM=0.0] (Act=72.4)

The first line shows which file is being processed.
The second line first prints the recognized sequence, terminated by "=="; after that, the model likelihood information is given:

[482 frames] : total number of frames in the utterance;
-72.9054 : overall average log likelihood per frame for the utterance, which equals (acoustic log likelihood + language model log likelihood) / total number of frames;
[Ac=-35140.4  : the total acoustic log likelihood for the whole utterance;
LM=0.0]   :  the total language model log likelihood for the whole utterance;
(Act=72.4)  :  the average number of active models.
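
The fields above can be pulled out of the trace line and checked against each other. The following is a minimal Python sketch, not part of HTK: the regular expression and the name parse_trace_line are illustrative assumptions based only on the example line shown in this post.

```python
import re

# Regular expression for the HVite summary line shown above.
# The pattern is an assumption derived from that single example.
TRACE_RE = re.compile(
    r"^(?P<labels>.*?)\s*==\s*"
    r"\[(?P<frames>\d+) frames\]\s+"
    r"(?P<avg>-?\d+(?:\.\d+)?)\s+"
    r"\[Ac=(?P<ac>-?\d+(?:\.\d+)?)\s+LM=(?P<lm>-?\d+(?:\.\d+)?)\]\s+"
    r"\(Act=(?P<act>-?\d+(?:\.\d+)?)\)$"
)


def parse_trace_line(line):
    """Split one trace summary line into its labelled fields."""
    match = TRACE_RE.match(line.strip())
    if match is None:
        raise ValueError("not an HVite trace summary line: %r" % line)
    return {
        "labels": match.group("labels").split(),
        "frames": int(match.group("frames")),
        "avg": float(match.group("avg")),
        "ac": float(match.group("ac")),
        "lm": float(match.group("lm")),
        "act": float(match.group("act")),
    }


info = parse_trace_line(
    "c2l  ==  [482 frames] -72.9054 [Ac=-35140.4 LM=0.0] (Act=72.4)"
)
# Recompute the per-frame average from the totals: (Ac + LM) / frames.
per_frame = (info["ac"] + info["lm"]) / info["frames"]
print(info["labels"], round(per_frame, 4))
```

The recomputed per-frame value should agree with the printed average up to rounding, which confirms the relation between the three numbers.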

Similarly, in the recognized MLF file or aligned MLF file, there are also scores:

0 33900000 c02 -79.211006

In the MLF, the recognized or aligned labels are written one per line between the file name line and the terminating "." line.
Each line in this example has 4 fields; the number of fields depends on the HVite settings, e.g. with "-f" the state-level alignment information is also kept.

In this example, 
the first field is the start time of that segment;
the second field is the ending time of that segment;
the third field is the recognized symbol;
the last field is the total log likelihood of that segment.
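
A label line of this form can be read with a few lines of Python. This is a small sketch under stated assumptions: the function name parse_label_line is hypothetical, and the conversion relies on HTK label times being written in units of 100 ns.

```python
HTK_TIME_UNIT_SECONDS = 1e-7  # HTK label times are in units of 100 ns


def parse_label_line(line):
    """Parse 'start end label [score]' into a dictionary.

    Only the 4-field layout from the example is handled; extra fields
    produced by other HVite settings are ignored in this sketch.
    """
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("expected at least start, end and label: %r" % line)
    start, end = int(fields[0]), int(fields[1])
    return {
        "start_s": start * HTK_TIME_UNIT_SECONDS,
        "end_s": end * HTK_TIME_UNIT_SECONDS,
        "label": fields[2],
        "score": float(fields[3]) if len(fields) > 3 else None,
    }


seg = parse_label_line("0 33900000 c02 -79.211006")
duration_s = seg["end_s"] - seg["start_s"]
print(seg["label"], duration_s, seg["score"])
```

For the example line, the segment c02 spans roughly 3.39 seconds.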

The attached file gives another example.

Posted via email from Troy's posterous

Wednesday, July 28, 2010

[Math] Principal Component Analysis

Principal Component Analysis (PCA) is an easy and useful technique for identifying patterns in high-dimensional data.

Matlab provides the function princomp for performing PCA.

[coeff, score, latent]=princomp(X)

X is the data matrix arranged by rows: each row is an observation (an instance) and each column corresponds to a random variable. Suppose X has dimension [n,p], i.e. there are n observations, each of dimension p;

coeff: the returned eigenvector matrix. Each column is an eigenvector, and the columns are ordered by their corresponding eigenvalues from largest to smallest. coeff has dimension [p,p];

score: the representation of X in the principal component space, i.e. the mean-centered data projected onto the eigenvectors; it has dimension [n,p]. If all the eigenvectors are used, X can be recovered exactly as score * coeff' + mean(X); if the eigenvectors with small eigenvalues are dropped, the reconstruction differs only slightly from X;

latent: the eigenvalues corresponding to the eigenvectors in coeff.

With the principal components (i.e. the eigenvectors), the scores are obtained by:

score = ( X - mean(X) ) * coeff

The eigenvectors and eigenvalues are actually the eigenvectors and eigenvalues of the covariance matrix of the data. 

Thus to compute them, 
1) first subtract the mean of the data;
2) compute the covariance matrix of the data;
3) compute the eigenvectors and eigenvalues of the covariance matrix;

then it's done.
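
The three steps can be sketched in Python with NumPy. This is an illustrative re-implementation, not Matlab's princomp: the function name pca and the random test data are assumptions, and the signs of the eigenvectors may differ from what Matlab returns.

```python
import numpy as np


def pca(X):
    """PCA via the eigen-decomposition of the covariance matrix.

    X has shape (n, p): one observation per row, one variable per column.
    Returns (coeff, score, latent, mean), mirroring the names used above.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean                        # 1) subtract the mean
    cov = np.cov(Xc, rowvar=False)       # 2) covariance matrix, shape (p, p)
    latent, coeff = np.linalg.eigh(cov)  # 3) eigenvalues and eigenvectors
    order = np.argsort(latent)[::-1]     # eigh returns ascending order
    latent = latent[order]
    coeff = coeff[:, order]
    score = Xc @ coeff                   # project centered data
    return coeff, score, latent, mean


rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.0, 0.3, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing
coeff, score, latent, mean = pca(X)

# Using all eigenvectors, the original data is recovered from the scores.
X_rec = score @ coeff.T + mean
print(latent)
```

Because the scores are projections onto orthonormal eigenvectors, the variance of each score column equals the corresponding eigenvalue in latent.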

Tuesday, July 27, 2010

[General] Phonology Basics II

Phonology_response.pdf (20560 KB)

This paper is the authors' response to other researchers' comments on the previous paper. 

It provides a lot of information complementary to the paper posted previously.

[General] Phonology Basics I

HL0835.pdf (12831 KB)

This is a very good introductory paper on articulatory phonology.

Monday, July 26, 2010

[General] IPA to ARPAbet mapping

table.pdf (62 KB)

IPA is the standard phonetic notation for pronunciation, while in automatic speech recognition research the ARPAbet phoneme set is commonly used for English.

The attached table maps between these two sets of phonemes.

Articulatory Related Resources - 2006

The first paper:
Articulatory feature-based methods for acoustic and audio-visual speech recognition: summary from the 2006 JHU summer workshop
The second paper:
Manual transcription of conversational speech at the articulatory feature level

The two papers above present the results obtained using articulatory features for speech recognition.

Their project is at the above URL.

A great place for articulatory speech recognition!!!

[Articulatory Features] Multilingual Articulatory Features

Ostendorf, for example, argues that pronunciation variability in spontaneous speech is the main reason for the poor performance. She claims that though it is possible to model pronunciation variants using a phonetic representation of words, the success of this approach has been limited. Ostendorf therefore assumes that pronunciation variants are only poorly described by means of phoneme substitution, deletion and insertion. She also thinks that the use of linguistically motivated distinctive features could provide the necessary granularity to better deal with pronunciation variants by using context-dependent rules that describe the value changes of features.

Kirchhoff also acknowledges that it is easier to model pronunciation variants with the help of articulatory features. She points out that articulatory features exhibit a dual nature because they have a relation to the speech signal as well as to higher-level linguistic units. Furthermore, since a feature often is common to multiple phonemes, training data is better shared for features than for phonemes. Also for AF detection fewer classes have to be distinguished (e.g. binary features). Therefore statistical models can be trained more robustly for articulatory features than for phonemes. Consequently feature recognition rates frequently outperform phoneme recognition rates.

Although this paper uses articulatory features for multilingual speech recognition, it describes those articulatory features in detail. It is helpful material for research related to articulatory features.

[General] Universal Background Model

A Universal Background Model (UBM) is usually used to represent general characteristics. For example, in speaker verification, the UBM is adopted to model general speaker characteristics, and individual speakers can be modeled either by training a speaker-dependent model or by adapting the UBM to each speaker.

A Gaussian Mixture Model is commonly adopted as the UBM to capture the universal information.

In the attached paper, the author gives a brief introduction to the UBM.

Thursday, July 22, 2010

[Speech Production] Speech Production knowledge in automatic speech recognition

Everyday Audio

This is a term that represents a wide range of speech, speaker, channel, and environmental conditions that people typically encounter and routinely adapt to in responding and recognizing speech signals. Currently, ASR systems deliver significantly degraded performance when they encounter audio signals that differ from the limited conditions under which they were originally developed and trained. This is true in many cases even if the differences are slight.

This focused research area would concentrate on creating and developing systems that would be much more robust against variability and shifts in acoustic environments, reverberations, external noise sources, communication channels (e.g., far-field microphones, cellular phones), speaker characteristics (e.g., speaker style, nonnative accents, emotional state), and language characteristics (e.g., formal/informal styles, dialects, vocabulary, topic domain). New techniques and architectures are proposed to enable exploring these critical issues in environments as diverse as meeting-room presentations and unstructured conversations. A primary focus would be exploring alternatives for automatically adapting to changing conditions in multiple dimensions, even simultaneously. The goal is to deliver accurate and useful speech transcripts automatically under many more environments and diverse circumstances than is now possible, thereby enabling many more applications. This challenging problem can productively draw on expertise and knowledge from related disciplines, including natural-language processing, information retrieval, and cognitive science.

Speaker Characteristics and Style

It is well known that speech characteristics (e.g., age, nonnative accent) vary widely among speakers due to many factors, including speaker physiology, speaker style (e.g., speech rate, spontaneity of speech, emotional state of the speaker), and accents (both regional and nonnative). The primary method currently used for making ASR systems more robust to variations in speaker characteristics is to include a wide range of speakers in the training. Speaker adaptation mildly alleviates problems with new speakers within the "span" of known speaker and speech types but usually fails for new types.

Current ASR systems assume a pronunciation lexicon that models native speakers of a language. Furthermore, they train on large amounts of speech data from various native speakers of the language. A number of modeling approaches have been explored in modeling accented speech, including explicit modeling of accented speech, adaptation of native acoustic models via accented speech data, and hybrid systems that combine these two approaches. Pronunciation variants have also been tried in the lexicon to accommodate accented speech. Except for small gains, the problem is largely unsolved.

Similarly, some progress has been made for automatically detecting speaking rate from the speech signal, but such knowledge is not exploited in ASR systems, mainly due to the lack of any explicit mechanism to model speaking rate in the recognition process.

Cited from Baker, J.; Deng, L.; Glass, J.; Khudanpur, S.; Lee, C.-H.; Morgan, N.; O'Shaughnessy, D., "Developments and directions in speech recognition and understanding," IEEE Signal Processing Magazine, vol. 26, no. 3, pp. 75-80, May 2009.

Challenges in speech recognition and understanding

speech.web.pdf (158 KB)

The report is from :

Grand Challenges:

Everyday audio
Rapid portability to emerging languages
Self-adaptive language capabilities
Detection of rare, key events
Cognition-derived speech and language systems
Spoken-language comprehension (mimicking average language skills at a first-to-third-grade level)

Details can be found in the report.

Wednesday, July 21, 2010

The two review papers for speech recognition and understanding

51891.pdf (139 KB)

51879.pdf (208 KB)

Great Overview Article


I came across this great material last night before going to sleep; it is really helpful to me.

Janet M. Baker, Li Deng,
James Glass, Sanjeev Khudanpur,
Chin-Hui Lee, Nelson Morgan, and
Douglas O’Shaughnessy

Research Developments and Directions in Speech Recognition and Understanding, Part 1
Research Developments and Directions in Speech Recognition and Understanding, Part 2 

This article was the “MINDS 2006–2007 Report of the Speech Understanding Working Group,” one of five reports emanating from two workshops titled “Meeting of the MINDS: Future Directions for Human Language Technology,” sponsored by the U.S. Disruptive Technology Office (DTO). For me it was striking that spontaneous events are so important; I had never thought about them from this point of view.

The whole state of things is also nicely described in Mark Gales's talk "Acoustic Modelling for Speech Recognition: Hidden Markov Models and Beyond?". The picture on the left is taken from it.

Saturday, July 17, 2010

Conditional Random Fields, beginning


Thomas G. Dietterich. Machine Learning for Sequential Data: A Review. In Structural, Syntactic, and Statistical Pattern Recognition; Lecture Notes in Computer Science, Vol. 2396, T. Caelli (Ed.), pp. 15–30, Springer-Verlag, 2002.

 Statistical learning problems in many fields involve sequential data. This paper formalizes the principal learning tasks and describes the methods that have been developed within the machine learning research community for addressing these problems. These methods include sliding window methods, recurrent sliding windows, hidden Markov models, conditional random fields, and graph transformer networks. The paper also discusses some open research issues. 

Simon Lacoste-Julien. Combining SVM with graphical models for supervised classification: an introduction to Max-Margin Markov Networks. CS281A Project Report, UC Berkeley, 2003.

 The goal of this paper is to present a survey of the concepts needed to understand the novel Max-Margin Markov Networks (M3-net) framework, a new formalism invented by Taskar, Guestrin and Koller which combines both the advantages of the graphical models and the Support Vector Machines (SVMs) to solve the problem of multi-label multi-class supervised classification. We will compare generative models, discriminative graphical models and SVMs for this task, introducing the basic concepts at the same time, leading at the end to a presentation of the M3-net paper. 

Ben Taskar, Carlos Guestrin and Daphne Koller. Max-Margin Markov Networks. In Advances in Neural Information Processing Systems 16 (NIPS 2003), 2004.

 In typical classification tasks, we seek a function which assigns a label to a single object. Kernel-based approaches, such as support vector machines (SVMs), which maximize the margin of confidence of the classifier, are the method of choice for many such tasks. Their popularity stems both from the ability to use high-dimensional feature spaces, and from their strong theoretical guarantees. However, many real-world tasks involve sequential, spatial, or structured data, where multiple labels must be assigned. Existing kernel-based methods ignore structure in the problem, assigning labels independently to each object, losing much useful information. Conversely, probabilistic graphical models, such as Markov networks, can represent correlations between labels, by exploiting problem structure, but cannot handle high-dimensional feature spaces, and lack strong theoretical generalization guarantees. In this paper, we present a new framework that combines the advantages of both approaches: Maximum margin Markov (M3) networks incorporate both kernels, which efficiently deal with high-dimensional features, and the ability to capture correlations in structured data. We present an efficient algorithm for learning M3 networks based on a compact quadratic program formulation. We provide a new theoretical bound for generalization in structured domains. Experiments on the task of handwritten character recognition and collective hypertext classification demonstrate very significant gains over previous approaches. 

Fuchun Peng and Andrew McCallum (2004). Accurate Information Extraction from Research Papers using Conditional Random Fields. In Proceedings of Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics (HLT/NAACL-04), 2004.

 With the increasing use of research paper search engines, such as CiteSeer, for both literature search and hiring decisions, the accuracy of such systems is of paramount importance. This paper employs Conditional Random Fields (CRFs) for the task of extracting various common fields from the headers and citation of research papers. The basic theory of CRFs is becoming well-understood, but best-practices for applying them to real-world data requires additional exploration. This paper makes an empirical exploration of several factors, including variations on Gaussian, exponential and hyperbolic priors for improved regularization, and several classes of features and Markov order. On a standard benchmark data set, we achieve new state-of-the-art performance, reducing error in average F1 by 36%, and word error rate by 78% in comparison with the previous best SVM results. Accuracy compares even more favorably against HMMs. 

Sunita Sarawagi and William W. Cohen. Semi-Markov Conditional Random Fields for Information Extraction. In Advances in Neural Information Processing Systems 17 (NIPS 2004), 2005.

 We describe semi-Markov conditional random fields (semi-CRFs), a conditionally trained version of semi-Markov chains. Intuitively, a semi-CRF on an input sequence x outputs a "segmentation" of x, in which labels are assigned to segments (i.e., subsequences) of x rather than to individual elements xi of x. Importantly, features for semi-CRFs can measure properties of segments, and transitions within a segment can be non-Markovian. In spite of this additional power, exact learning and inference algorithms for semi-CRFs are polynomial-time—often only a small constant factor slower than conventional CRFs. In experiments on five named entity recognition problems, semi-CRFs generally outperform conventional CRFs. 

Monday, July 12, 2010

[Speaker Adaptation][1995] Acoustic characteristics of speaker individuality: Control and conversion

Voice individuality, i.e. speaker characteristics, is explained in detail, and the research history on this topic is also reviewed.

Voice individuality, in particular, is important not only because it helps us identify the person to whom we are talking, but also because it enriches our daily life with variety. However, for most speaker-independent speech recognition tasks, voice individuality is simply an obstacle that must be overcome, and speaker normalization and adaptation are methods that have been developed for that purpose.

What is voice individuality?

Speaking style
social status
community the speaker belongs to
sound or timbre
emotional state

Acoustic characteristics of voice individuality:

1) voice source:
the average pitch frequency;
the time-frequency pattern of pitch (the pitch contour);
the pitch frequency fluctuation
the glottal wave shape

2) vocal tract resonance:
the shape of spectral envelope and spectral tilt;
the absolute values of formant frequencies;
the time-frequency pattern of formant frequencies (formant trajectories)
the long-term average speech spectrum
the formant bandwidth

Spectral conversion algorithms:

1) Code book mapping

2) Speaker difference vector inter/extra-polation

3) Frequency scaling

4) Multi-speaker inter/extra-polation

5) Unsupervised conversion

Three stages of the mapping methods:

1) parametric representation of acoustic characteristics;

2) mapping algorithms;

3) speech corpus development to train the weights on the mappings.

[Speaker Adaptation][2009] Estimating speaker characteristics for speech recognition

A hierarchical tree of speech recognition models based on speaker characteristics is proposed (as shown in the figure below).

Two kinds of features are adopted for the tree node splitting:
1) 1D vocal tract length warping factor
2) 4D vector: vocal tract length, two spectral slope parameters and a model variance scaling factor

At each internal node, the value in each dimension is an interval; only at the leaf nodes does a node correspond to a unique speaker profile vector.

The speaker profile vector is used to adapt the originally trained model by estimating a profile-specific transformation. (However, how the transformation is estimated is not clearly explained.)

Recognition is done by first passing the test speaker's profile down through the tree to find the best model (if I understand correctly).

Experiments are carried out on recognizing children's connected-digit speech using models originally trained on adults' speech data.

Knowledge of speech production can play an important role in speech recognition by imposing constraints on the structure of trained and adapted models.

[Speaker Recognition][2003] Modeling prosodic dynamics for speaker recognition

The fundamental frequency and energy trajectories are adopted to capture long-term speaker information for speaker recognition.

Prosodic information can be used to effectively improve performance of and add robustness to speaker recognition systems.

1) Global statistics of some prosody-based features are estimated and compared between two utterances, e.g. comparing the mean and standard deviation of the fundamental frequency between enrollment and test utterances;

2) Appending the prosodic features to standard spectral-based features and using the traditional distribution modeling systems, may not well capture the temporal dynamic information.

In this paper, the relation between dynamics of the fundamental frequency and energy trajectories is used to characterize the speaker's identity.

Another approach proposed is to allow explicit template matching of the f0 contours of a predefined set of words and phrases.

(How to explain EER:)
The performance measure used to evaluate the described systems is the equal error rate (EER). It represents the system performance when the false acceptance rate (accepting an impostor) is equal to the missed detection rate (rejecting a true speaker).
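
As a concrete illustration of this definition, the following Python sketch estimates the EER from two lists of verification scores. The function name equal_error_rate, the threshold sweep and the toy scores are illustrative assumptions and do not come from the paper; with few trials the two error rates may never be exactly equal, so the sketch reports the point where they are closest.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Return (eer, threshold) where false acceptance and miss rates meet.

    A trial is accepted when its score is >= threshold. Every observed
    score is tried as a threshold, and the one with the smallest gap
    between the two error rates is kept.
    """
    best = None
    for threshold in sorted(set(target_scores) | set(impostor_scores)):
        # False acceptance: impostor trials that are accepted.
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        # Missed detection: true-speaker trials that are rejected.
        miss = sum(s < threshold for s in target_scores) / len(target_scores)
        gap = abs(far - miss)
        if best is None or gap < best[0]:
            best = (gap, (far + miss) / 2.0, threshold)
    return best[1], best[2]


targets = [2.1, 1.7, 0.9, 1.4, 0.2]      # scores of true-speaker trials
impostors = [-1.0, 0.1, 0.5, -0.3, 1.0]  # scores of impostor trials
eer, threshold = equal_error_rate(targets, impostors)
print(eer, threshold)
```

On this toy data one impostor is accepted and one true speaker is rejected at the chosen threshold, so both error rates are one in five.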

With prosody, as with other aspects of spoken language, speaker information may be found in both static and dynamic forms and may originate from anatomical, physiological, or behavioral differences among individuals.

(Explain why system fusion is needed:)
Since the baseline system is modeling the absolute f0 and energy values while the slope system is modeling the relative f0 and energy contour dynamics, it is expected that a fusion of these systems should produce better performance than the individual systems.

Saturday, July 10, 2010

[Speaker Adaptation][2004] Semi-supervised speaker adaptation

The semi-supervised speaker adaptation is carried out by combining the MLLR and MAP adaptation techniques with a confidence measure (CM) score generated by a neural network (NN).

MLLR is better able to adapt the model with a limited amount of adaptation data, so it is used at the early stage of the unsupervised, online adaptation.

After that the MLLR adapted model is used as the prior for the MAP method to update the model. With the increase of the amount of speaker specific data, MAP adaptation would lead to the SD system.
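
The behaviour described here can be illustrated with the textbook MAP update of a single Gaussian mean. This Python sketch is an assumption for illustration only, not the paper's implementation: the name map_adapt_mean, the relevance factor tau and the toy data are hypothetical, and every frame is assumed to belong to the one Gaussian being adapted.

```python
def map_adapt_mean(prior_mean, frames, tau=10.0):
    """MAP update of one Gaussian mean.

    new_mean = (tau * prior_mean + sum of frames) / (tau + number of frames)
    tau controls how strongly the prior (here standing in for the
    MLLR-adapted mean) is trusted relative to the adaptation data.
    """
    n = len(frames)
    dim = len(prior_mean)
    sums = [sum(frame[d] for frame in frames) for d in range(dim)]
    return [(tau * prior_mean[d] + sums[d]) / (tau + n) for d in range(dim)]


prior = [0.0, 0.0]            # stands in for an MLLR-adapted mean
speaker_frame = [1.0, -2.0]   # toy speaker-specific observation

few = map_adapt_mean(prior, [speaker_frame] * 2)      # little data
many = map_adapt_mean(prior, [speaker_frame] * 2000)  # lots of data
print(few, many)
```

With two frames the estimate stays close to the prior, while with two thousand frames it is dominated by the speaker's own data, which is the sense in which MAP adaptation approaches a speaker-dependent system as data accumulates.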

Instead of using the standard CM scores obtained from the recognition process, which have a high computational cost, an NN is trained to predict whether the phones are recognized correctly or not. The features adopted to predict the CM scores are mainly related to phoneme durations, speaker rate and acoustic score.

[Speaker Adaptation][1999] Combined speech and speaker recognition with speaker adapted connectionist models

A Twin-Output MLP model is proposed in this paper, which predicts two sets of phone posteriors, one for the adapted speaker and the other for the "world" (i.e. speakers except the adapted one).

The model can be used for both speech recognition and speaker identification.

The structure of the model is shown below:

The model is trained in the following steps:

1) Train the SI MLP model first using all the training data;
2) Duplicate the SI MLP's output layer to obtain two identical sets of phone units; both the weights and the biases are duplicated;
3) Update the cloned model using the speaker-specific data together with the same amount of data randomly selected from the previously used training data;
4) For speech recognition, the authors show that using only the posteriors of the speaker-specific phone units (renormalized) yields improved performance over the original SI model.
5) The model can also be used for speaker recognition, since the two sets of posteriors are obtained simultaneously for a second-stage speaker verification.
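
Steps 2 and 4 can be sketched with NumPy as follows. This is a toy illustration under stated assumptions rather than the paper's code: the layer sizes, the random weights, the helper names, the single softmax over all twin outputs, and the choice of the first half of the outputs as the speaker-specific units are all hypothetical.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(z - z.max())
    return e / e.sum()


def clone_output_layer(W, b):
    """Step 2: duplicate output weights (hidden x phones) and biases."""
    return np.concatenate([W, W], axis=1), np.concatenate([b, b])


rng = np.random.default_rng(0)
n_hidden, n_phones = 4, 3                 # toy sizes
W_si = rng.normal(size=(n_hidden, n_phones))
b_si = rng.normal(size=n_phones)
W_twin, b_twin = clone_output_layer(W_si, b_si)

h = rng.normal(size=n_hidden)             # a hidden-layer activation vector
p_si = softmax(h @ W_si + b_si)           # SI phone posteriors
p_twin = softmax(h @ W_twin + b_twin)     # posteriors over both output sets

# Step 4: keep only the speaker-specific units and renormalize them.
p_speaker = p_twin[:n_phones] / p_twin[:n_phones].sum()
print(p_speaker)
```

Immediately after cloning, before any speaker-specific update, the renormalized speaker posteriors coincide with the SI posteriors, so training in step 3 starts from the SI model's behaviour.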

[Speaker Adaptation][1996] Speaker Adaptation by modeling the speaker variations in a continuous speech recognition system

Explicit speaker variation modelling is important for the acoustic modelling.

This paper tries to model the speaker variations in the hybrid NN/HMM system.

In the acoustic NN model, two speaker space units are used to indicate the speaker variations.

The activations of those two speaker space units are obtained from a number of speaker units (in fact, one per speaker in the training set).

The topology of the speaker sensitive NN is shown below:

Finally, real PDF annotating under Linux! (with help from Wine)


Finally, a way to annotate PDF files under Linux (provided you can run Wine)

I have been looking for years for a solution to annotate PDF files from my Linux box. I usually do a lot of proof-reading, and these highlight and post-it features are just gold when you have to transmit your comments using the internet.

On the other hand, there is as far as I know NO software that can add annotations to PDF files in a clean way. Here are the ones I tried:

  • PDFEdit caused some horrible glitch on my screen when I tried to change the document. Anyway, it looks good at modifying PDF files, but I could not even figure out whether it supports annotations. Looks pretty complicated to use.
  • Foxit Reader has a Linux binary available. Of course, it segfaulted as soon as I tried to open a PDF file.
  • Xournal and its derivatives are often claimed to support that feature. However, all they do is turn the PDF file into an image that you can annotate. Not exactly the same thing.
  • Okular is the only tool that actually has a real annotation tool for PDFs. It just looked like the holy grail, until I realized the annotations were not saved within the PDF, but written separately... Which makes them unusable for any other reader.

One of the reasons why PDF annotation support is so poor is that no Linux PDF library supports it. As a consequence, software that uses them cannot either. So we will probably be stuck with this situation until GNU PDF gets mature (which may take a while).

The solution came from the controversial Wine. I resigned myself to trying a couple of Windows programs under it. This is where I realized that the Windows software world is very different - Foxit requires you to try some shit before you can download it freely, other software is paid, and so on. Moments like that remind me why I'm not part of this world.

But, finally, I found a hassle-free, doing-the-job program that just installs and works flawlessly under Wine. It is called PDF-XChange Viewer and did not ask me to waste my time or my money before I could use it. I just needed to download the installation binary, give it to wine, then run the software through wine without any particular twiddling. It just worked.

Sure, this is not free (as in free speech) software, nor is it native Linux, but waiting for a real free solution this is still a better compromise than dual-booting or buying software that doesn't work.

[Speaker Adaptation][1992] A new approach to speaker adaptation by modeling the pronunciation in automatic speech recognition

In standard ASR systems, the pronunciation models are derived from the dictionary. Different speakers may pronounce the same word with different phone sequences, e.g. by dropping some consonants or merging some adjacent phones.

In this paper, the pronunciation model (i.e. the lexicon) is adapted to each speaker to deal with this kind of variability.

Thursday, July 8, 2010

Upload release files to SourceForge using SCP


User jsmith seeks to put to the Rel_1 directory of his project, fooproject:

scp jsmith,

Google Code and Mercurial Tutorial for Eclipse Users


This post is a short introduction to using the Eclipse IDE and the HgEclipse plugin for projects hosted at Google Code Mercurial repositories.

HgEclipse is a free and open source Eclipse plugin that supports the Mercurial Distributed Version Control System right within the IDE, thus making these two a very convenient and efficient toolset for Java and C/C++ development. It is important to note that HgEclipse is absolutely not limited to Google Code's Mercurial repositories, it works with any Mercurial repository. We use Google Code merely because of its immediate availability and easy use.

What is Google Code? It is a popular hosting service for open source projects. It provides revision control, a rudimentary wiki, a basic issue tracker and file downloads. As for revision control, since April 2009 Google Code has offered the Mercurial Distributed Version Control System as an option besides Subversion. (Git is still unsupported by Google Code. If you are a Git user, codeBeamer Managed Repositories is a compelling option for you!)

We recommend watching the guided video first, then reading the explanatory notes below. First, here is the video tutorial. Watch it in full-screen HD:

1 - Installing HgEclipse

You can easily install the plugin with the Eclipse Update Manager. Read here how.

2 - Cloning a repository from Google Code

You start by making a local copy (or clone according to the right DVCS terminology) of the repository hosted at the Google servers, and import that to Eclipse as a new project.

Luckily, HgEclipse offers a dedicated wizard for cloning and it makes the whole process very easy. The only thing you will need is the repository URL that is visible at the Command-line access section, under the Source tab in Google Code. You don't even need to authenticate with user name and password, since the Google Code repositories can be cloned anonymously.

3 - Working locally and committing your changes

You can add, delete and modify files just as you do it without Mercurial. The small grey stars over the regular file icons always indicate the files which have uncommitted changes.

When you decide you want to commit some changes, the most convenient way to do that is switching to the Synchronization View. This view shows three types of changes:

  1. Uncommitted: your local changes that are waiting to be committed to your local repo.
  2. Outgoing: changes committed to the local repo, but not pushed to Google Code yet.
  3. Incoming: changes that were pushed to Google Code by other developers, but are not pulled into your local repo yet.
Accordingly, you can use this view to commit changes to your local repository, push changes up to Google Code or pull changes down from Google Code. For now, just select some changes and commit them.

Don't forget that, unlike in centralized version control systems, at this point these changes are still only in your local repo, and not in Google Code! The distributed approach also has the advantage that you can commit even offline. You need to be online only when pushing to Google Code.

4 - Pushing your changes to Google Code

If you have changes that are worth contributing to the Google Code repo, then it's time to push them. This requires you to log in to Google Code. Important: to identify yourself, you have to use your Google Code specific password, not your regular Google user account password. You can see this password by clicking your user name in the web interface, then switching to the Settings tab.

ICML papers on Deep Learning

In the following, we first list some papers published since 2008, to reflect the new research activities since the last deep learning workshop held at NIPS, Dec 2007, and then list some earlier papers as well.


Ahmed et al., 2008
Ahmed, A., Yu, K., Xu, W., Gong, Y., & Xing, E. P. (2008). 
Training hierarchical feed-forward visual recognition models using transfer learning from pseudo tasks. 
European Conference on Computer Vision.

Bengio, 2009
Bengio, Y. (2009). 
Learning deep architectures for AI. 
To appear in Foundations and Trends in Machine Learning.

Bengio et al., 2007
Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). 
Greedy layer-wise training of deep networks. 
Neural Information Processing Systems.

Bengio & LeCun, 2007
Bengio, Y., & LeCun, Y. (2007). 
Scaling learning algorithms towards AI. 
Large Scale Kernel Machines.

Collobert & Weston, 2008
Collobert, R., & Weston, J. (2008). 
A unified architecture for natural language processing: Deep neural networks with multitask learning. 
International Conference on Machine Learning.

Hinton, 2006
Hinton, G. (2006). 
To recognize shapes, first learn to generate images. 
Technical Report.

Hinton et al., 2006
Hinton, G., Osindero, S., & Teh, Y.-W. (2006). 
A fast learning algorithm for deep belief nets. 
Neural Computation, 18, 1527-1554.

Hinton & Salakhutdinov, 2006
Hinton, G. E., & Salakhutdinov, R. R. (2006). 
Reducing the dimensionality of data with neural networks. 
Science, 313, 504-507.

Karklin & Lewicki, 2008
Karklin, Y., & Lewicki, M. (2008). 
Emergence of complex cell properties by learning to generalize in natural scenes. 

Larochelle et al., 2007
Larochelle, H., Erhan, D., Courville, A., Bergstra, J., & Bengio, Y. (2007). 
An empirical evaluation of deep architectures on problems with many factors of variation. 

Le Roux & Bengio, 2008
Le Roux, N., & Bengio, Y. (2008). 
Representational power of restricted Boltzmann machines and deep belief networks. 
Neural Computation.

Lee et al., 2008
Lee, H., Ekanadham, C., & Ng, A. Y. (2008). 
Sparse deep belief network model for visual area V2. 
Neural Information Processing Systems.

Lee et al., 1998
Lee, T., Mumford, D., Romero, R., & Lamme, V. (1998). 
The role of the primary visual cortex in higher level vision. 
Vision Research, 38, 2429-2454.

Mnih & Hinton, 2009
Mnih, A., & Hinton, G. (2009). 
A scalable hierarchical distributed language model. 
Neural Information Processing Systems.

Ranzato et al., 2008
Ranzato, M., Boureau, Y.-L., & LeCun, Y. (2008). 
Sparse feature learning for deep belief networks. 
Neural Information Processing Systems.

Ranzato et al., 2007
Ranzato, M., Huang, F.-J., Boureau, Y.-L., & LeCun, Y. (2007). 
Unsupervised learning of invariant feature hierarchies with applications to object recognition. 
IEEE Conference on Computer Vision and Pattern Recognition.

Ranzato & Szummer, 2008
Ranzato, M., & Szummer, M. (2008). 
Semi-supervised learning of compact document representations with deep networks. 
International Conference on Machine Learning.

Salakhutdinov & Hinton, 2007
Salakhutdinov, R., & Hinton, G. (2007). 
Semantic hashing. 
SIGIR Workshop on Information Retrieval and Applications of Graphical Models.

Salakhutdinov & Hinton, 2008
Salakhutdinov, R., & Hinton, G. (2008). 
Using deep belief nets to learn covariance kernels for Gaussian Processes. 
Neural Information Processing Systems.

Salakhutdinov & Murray, 2008
Salakhutdinov, R., & Murray, I. (2008). 
On the quantitative analysis of deep belief networks. 
International Conference on Machine Learning.

Torralba et al., 2008
Torralba, A., Fergus, R., & Weiss, Y. (2008). 
Small codes and large image databases for recognition. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Vincent et al., 2008
Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P.-A. (2008). 
Extracting and composing robust features with denoising autoencoders. 
International Conference on Machine Learning.

Welling et al., 2004
Welling, M., Rosen-Zvi, M., & Hinton, G. E. (2004). 
Exponential family harmoniums with an application to information retrieval. 

Weston et al., 2008
Weston, J., Ratle, F., & Collobert, R. (2008). 
Deep learning via semi-supervised embedding. 
International Conference on Machine Learning.

Yu et al., 2009
Yu, K., Xu, W., & Gong, Y. (2009). 
Deep learning with kernel regularization for visual recognition. 
Neural Information Processing Systems.


Introduction to Deep Learning


Introduction to Deep Learning Algorithms

See the following article for a recent survey of deep learning:

Yoshua Bengio, Learning Deep Architectures for AI, Foundations and Trends in Machine Learning, 2(1), 2009


The computations involved in producing an output from an input can be represented by a flow graph: a graph representing a computation, in which each node represents an elementary computation and a value (the result of the computation, applied to the values at the children of that node). The set of computations allowed in each node, together with the possible graph structures, defines a family of functions. Input nodes have no children. Output nodes have no parents.

The flow graph for the expression sin(a^2+b/a) could be represented by a graph with two input nodes a and b, one node for the division b/a taking a and b as input (i.e. as children), one node for the square (taking only a as input), one node for the addition (whose value would be a^2+b/a) taking as input the nodes a^2 and b/a, and finally one output node computing the sine, with a single input coming from the addition node.

A particular property of such flow graphs is depth: the length of the longest path from an input to an output.
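As a small illustration, the flow graph for sin(a^2+b/a) can be written down as a dictionary and its depth computed by walking from the output back to the inputs. This is only a sketch: the node names and the helper functions `evaluate` and `depth` are made up for this example, and depth is counted as the number of edges on the longest input-to-output path.

```python
import math

# Each node maps to (operation, list of children). Input nodes have no children.
GRAPH = {
    "a": (None, []),
    "b": (None, []),
    "square": (lambda a: a ** 2, ["a"]),
    "div": (lambda b, a: b / a, ["b", "a"]),
    "add": (lambda u, v: u + v, ["square", "div"]),
    "sin": (math.sin, ["add"]),  # output node: no other node uses it as a child
}


def evaluate(graph, node, inputs):
    """Compute the value of a node from the values of its children."""
    op, children = graph[node]
    if not children:
        return inputs[node]
    return op(*(evaluate(graph, child, inputs) for child in children))


def depth(graph, node):
    """Length (in edges) of the longest path from any input to this node."""
    _, children = graph[node]
    if not children:
        return 0
    return 1 + max(depth(graph, child) for child in children)


print(evaluate(GRAPH, "sin", {"a": 2.0, "b": 4.0}))  # sin(2^2 + 4/2) = sin(6)
print(depth(GRAPH, "sin"))  # 3: input -> square (or div) -> add -> sin
```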

Traditional feedforward neural networks can be considered to have depth equal to the number of layers (i.e. the number of hidden layers plus 1, for the output layer). Support Vector Machines (SVMs) have depth 2 (one for the kernel outputs or for the feature space, and one for the linear combination producing the output).

Motivations for Deep Architectures

The main motivations for studying learning algorithms for deep architectures are the following:

Insufficient depth can hurt

Depth 2 is enough in many cases (e.g. logical gates, formal [threshold] neurons, sigmoid neurons, Radial Basis Function [RBF] units as in SVMs) to represent any function with a given target accuracy. But this may come at a price: the required number of nodes in the graph (i.e. computations, and also the number of parameters when we try to learn the function) may grow very large. Theoretical results show that there exist function families for which the required number of nodes may in fact grow exponentially with the input size. This has been shown for logical gates, formal neurons, and RBF units. In the latter case Hastad has shown families of functions which can be efficiently (compactly) represented with O(n) nodes (for n inputs) when the depth is d, but for which an exponential number (O(2^n)) of nodes is needed if the depth is restricted to d-1.
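A familiar toy case of this trade-off is the parity function on n bits. The sketch below is only an illustration in the same spirit, not the function family from the theoretical results above: a chain of two-input XOR gates needs n-1 gates, while the depth-2 OR-of-ANDs form needs one AND term for every input with an odd number of ones, i.e. 2^(n-1) terms. All function names are invented for this example.

```python
from itertools import product


def parity_deep(bits):
    """Parity via a chain of two-input XOR gates (n - 1 gates for n bits)."""
    acc = bits[0]
    for b in bits[1:]:
        acc ^= b
    return acc


def parity_dnf_terms(n):
    """AND terms of the depth-2 OR-of-ANDs form: all inputs with odd parity."""
    return [bits for bits in product((0, 1), repeat=n) if sum(bits) % 2 == 1]


def parity_shallow(bits, terms):
    """Depth-2 evaluation: OR over terms, each term an AND of exact bit matches."""
    return int(any(all(b == t for b, t in zip(bits, term)) for term in terms))


n = 4
terms = parity_dnf_terms(n)
print(n - 1, "XOR gates in the deep version")
print(len(terms), "AND terms in the depth-2 version")  # 8 = 2^(n-1)
```

Both versions compute the same function; only the number of nodes differs, and the gap doubles with every extra input bit.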

One can see a deep architecture as a kind of factorization. Most randomly chosen functions can’t be represented efficiently, whether with a deep or a shallow architecture. But many that can be represented efficiently with a deep architecture cannot be represented efficiently with a shallow one (see the polynomials example in the Bengio survey paper). The existence of a compact and deep representation indicates that some kind of structure exists in the underlying function to be represented. If there were no structure whatsoever, it would not be possible to generalize well.

The brain has a deep architecture

For example, the visual cortex is well-studied and shows a sequence of areas each of which contains a representation of the input, and signals flow from one to the next (there are also skip connections and at some level parallel paths, so the picture is more complex). Each level of this feature hierarchy represents the input at a different level of abstraction, with more abstract features further up in the hierarchy, defined in terms of the lower-level ones.

Note that representations in the brain are in between dense distributed and purely local: they are sparse, with about 1% of neurons active simultaneously. Given the huge number of neurons, this is still a very efficient (exponentially efficient) representation.
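To get a feel for why a sparse code is still so efficient, one can count how many distinct patterns each kind of representation can express. The numbers below (1000 units, 10 of them active) are arbitrary values picked for illustration, and `count_patterns` is a helper invented for this sketch.

```python
from math import comb


def count_patterns(n_units, n_active):
    """Number of distinct patterns with exactly n_active of n_units switched on."""
    return comb(n_units, n_active)


n_units = 1000   # assumed population size, for illustration only
n_active = 10    # 1% of the units active at once

print("local (one unit per pattern):", n_units)
print("sparse (1% active):          ", count_patterns(n_units, n_active))
print("dense (any subset active):   ", 2 ** n_units)
```

A purely local code distinguishes only as many patterns as it has units, while the sparse code already distinguishes astronomically more, even though it is far from the dense upper bound.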

Cognitive processes seem deep

  • Humans organize their ideas and concepts hierarchically.
  • Humans first learn simpler concepts and then compose them to represent more abstract ones.
  • Engineers break up solutions into multiple levels of abstraction and processing.

It would be nice to learn / discover these concepts (knowledge engineering failed because of poor introspection?). Introspection of linguistically expressible concepts also suggests a sparse representation: only a small fraction of all possible words/concepts are applicable to a particular input (say a visual scene).

Breakthrough in Learning Deep Architectures

Before 2006, attempts at training deep architectures failed: training a deep supervised feedforward neural network tended to yield worse results (both in training and in test error) than shallow ones (with 1 or 2 hidden layers).

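Three papers changed that in 2006, spearheaded by Hinton’s revolutionary work on Deep Belief Networks (DBNs):

  • Hinton, G. E., Osindero, S., & Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527-1554.
  • Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layer-wise training of deep networks. Neural Information Processing Systems.
  • Ranzato, M., Poultney, C., Chopra, S., & LeCun, Y. (2007). Efficient learning of sparse representations with an energy-based model. Neural Information Processing Systems.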

The following key principles are found in all three papers:

  • Unsupervised learning of representations is used to (pre-)train each layer.
  • Unsupervised training of one layer at a time, on top of the previously trained ones. The representation learned at each level is the input for the next layer.
  • Use supervised training to fine-tune all the layers (in addition to one or more additional layers that are dedicated to producing predictions).

The DBNs use RBMs for unsupervised learning of the representation at each layer. The Bengio et al. paper explores and compares RBMs and auto-encoders (an auto-encoder is a neural network that predicts its input through a bottleneck internal layer of representation). The Ranzato et al. paper uses a sparse auto-encoder (which is similar to sparse coding) in the context of a convolutional architecture. Auto-encoders and convolutional architectures will be covered later in the course.
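The first two principles (unsupervised training, one layer at a time, each layer fed by the previous one) can be sketched in a few lines of plain Python. This is a toy illustration under several assumptions of mine, not the procedure of any of the three papers: it uses small sigmoid auto-encoders with tied weights rather than RBMs, plain stochastic gradient descent on squared reconstruction error, and a made-up data set. The supervised fine-tuning step is left out.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class AutoEncoder:
    """Sigmoid auto-encoder with tied weights (decoder uses W transposed)."""

    def __init__(self, n_in, n_hidden, rng):
        self.W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                  for _ in range(n_hidden)]
        self.b = [0.0] * n_hidden  # encoder biases
        self.c = [0.0] * n_in      # decoder biases

    def encode(self, x):
        return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bj)
                for row, bj in zip(self.W, self.b)]

    def decode(self, h):
        return [sigmoid(sum(self.W[j][i] * h[j] for j in range(len(h))) + ci)
                for i, ci in enumerate(self.c)]

    def train_step(self, x, lr):
        """One SGD step on 0.5 * ||decode(encode(x)) - x||^2."""
        h = self.encode(x)
        r = self.decode(h)
        # Error signal at the decoder pre-activations.
        d_out = [(ri - xi) * ri * (1.0 - ri) for ri, xi in zip(r, x)]
        # Error signal back-propagated to the encoder pre-activations.
        d_hid = [sum(self.W[j][i] * d_out[i] for i in range(len(x)))
                 * h[j] * (1.0 - h[j]) for j in range(len(h))]
        for j in range(len(h)):
            for i in range(len(x)):
                # Tied weights: decoder and encoder gradients add up.
                self.W[j][i] -= lr * (d_out[i] * h[j] + d_hid[j] * x[i])
            self.b[j] -= lr * d_hid[j]
        for i in range(len(x)):
            self.c[i] -= lr * d_out[i]


def pretrain_stack(data, layer_sizes, epochs=100, lr=0.5, seed=0):
    """Greedy layer-wise pre-training: train a layer, then feed its codes upward."""
    rng = random.Random(seed)
    layers, inputs = [], data
    for n_hidden in layer_sizes:
        ae = AutoEncoder(len(inputs[0]), n_hidden, rng)
        for _ in range(epochs):
            for x in inputs:
                ae.train_step(x, lr)
        layers.append(ae)
        inputs = [ae.encode(x) for x in inputs]  # input for the next layer
    return layers


# Toy binary data, invented for this example.
DATA = [
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [1, 1, 1, 1, 0, 0],
]

layers = pretrain_stack(DATA, layer_sizes=[4, 2])
code = DATA[0]
for ae in layers:
    code = ae.encode(code)
print(code)  # top-level representation of the first example
```

In a full system the pre-trained weights would then initialize a feedforward network with an extra output layer, and all layers would be fine-tuned with supervised training.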

Since 2006, a plethora of other papers on the subject of deep learning has been published, some of them exploiting other principles to guide training of intermediate representations. See Learning Deep Architectures for AI for a survey.
