Tuesday, November 30, 2010

[Feature] SIFT

Download now or preview on posterous
tutSIFT04.pdf (3040 KB)

In spite of significant progress in automatic speech recognition over the years, robustness still appears to be a stumbling block. Current commercial products are quite sensitive to changes in recording device, to acoustic clutter in the form of additional speech signals, and so on. The goal of replicating human performance in a machine remains far from sight.

Scale Invariant Feature Transform (SIFT) is an approach for detecting and extracting local feature descriptors that are reasonably invariant to changes in illumination, image noise, rotation, scaling, and small changes in viewpoint.

Detection stages for SIFT features:
1) Scale-space extrema detection

Interest points for SIFT features correspond to local extrema of difference-of-Gaussian filters at different scales.

Interest points (called keypoints in the SIFT framework) are identified as local maxima or minima of the DoG (difference of Gaussian) images across scales. Each pixel in the DoG images is compared to its 8 neighbors at the same scale, plus the 9 corresponding neighbors in each of the two neighboring scales (26 neighbors in total). If the pixel is a local maximum or minimum, it is selected as a candidate keypoint.
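
The neighbor comparison described above can be sketched in Python as follows. This is only a minimal illustration, not Lowe's implementation: it assumes the DoG images have already been computed and stacked into a NumPy array of shape (scales, height, width), and the function name find_dog_extrema is hypothetical. Each interior pixel is compared with the other 26 values in its 3x3x3 neighborhood (8 at the same scale and 9 in each adjacent scale).

```python
import numpy as np


def find_dog_extrema(dog):
    """Return (scale, row, col) of strict local extrema in a DoG stack.

    dog is assumed to have shape (num_scales, height, width).
    Border pixels and the first/last scale are skipped because they
    do not have a full 3x3x3 neighborhood.
    """
    num_scales, height, width = dog.shape
    candidates = []
    for s in range(1, num_scales - 1):
        for r in range(1, height - 1):
            for c in range(1, width - 1):
                cube = dog[s - 1:s + 2, r - 1:r + 2, c - 1:c + 2]
                value = dog[s, r, c]
                # Index 13 is the center of the flattened 3x3x3 cube.
                others = np.delete(cube.ravel(), 13)
                if value > others.max() or value < others.min():
                    candidates.append((s, r, c))
    return candidates


# Tiny synthetic stack: a single bright spot in the middle scale.
dog = np.zeros((3, 5, 5))
dog[1, 2, 2] = 1.0
print(find_dog_extrema(dog))
```

The triple loop is slow but keeps the comparison explicit; a real detector would vectorize it and then apply the refinement steps listed below.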

For each candidate keypoint:
- Interpolation of nearby data is used to accurately determine its position;
- Keypoints with low contrast are removed;
- Responses along edges are eliminated;
- The keypoint is assigned an orientation.

To determine the keypoint orientation, a gradient orientation histogram is computed in the neighborhood of the keypoint (using the Gaussian image at the closest scale to the keypoint's scale). The contribution of each neighboring pixel is weighted by the gradient magnitude and by a Gaussian window with a sigma that is 1.5 times the scale of the keypoint.

Peaks in the histogram correspond to dominant orientations. A separate keypoint is created for the direction corresponding to the histogram maximum, and any other direction within 80% of the maximum value.
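
A rough Python sketch of this orientation step is given here. It assumes the gradient magnitudes and angles (in degrees) around the keypoint are already available as arrays; the helper names gaussian_window and dominant_orientations are hypothetical. The 36-bin histogram is the value used in Lowe's paper and is not stated in these notes, and the sketch returns bin centers rather than interpolating the peak position.

```python
import numpy as np


def gaussian_window(radius, sigma):
    """Circular Gaussian weights on a (2*radius+1) x (2*radius+1) patch."""
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    return np.exp(-(xs ** 2 + ys ** 2) / (2.0 * sigma ** 2))


def dominant_orientations(magnitude, angle_deg, weights,
                          num_bins=36, peak_ratio=0.8):
    """Return the bin-center angles of all histogram peaks that reach
    peak_ratio times the highest peak."""
    bin_width = 360.0 / num_bins
    angles = np.asarray(angle_deg, dtype=float) % 360.0
    bins = np.floor(angles / bin_width).astype(int) % num_bins
    # Each pixel votes with gradient magnitude times its Gaussian weight.
    votes = np.asarray(magnitude, dtype=float) * np.asarray(weights, dtype=float)
    hist = np.bincount(bins.ravel(), weights=votes.ravel(), minlength=num_bins)
    if hist.max() <= 0:
        return []
    threshold = peak_ratio * hist.max()
    peaks = []
    for b in range(num_bins):
        left = hist[(b - 1) % num_bins]
        right = hist[(b + 1) % num_bins]
        if hist[b] >= threshold and hist[b] > left and hist[b] > right:
            peaks.append(b * bin_width + bin_width / 2.0)
    return peaks


# Weights for a keypoint at scale 2.0: sigma is 1.5 times the scale.
weights = gaussian_window(radius=4, sigma=1.5 * 2.0)
```

Each returned angle would give rise to its own keypoint at the same location and scale.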

All the properties of the keypoint are measured relative to the keypoint orientation; this provides invariance to rotation.

2) Keypoint localization

3) Orientation assignment

4) Generation of keypoint descriptors

Posted via email from Troy's posterous

[Feature] Speech Recognition with localized time-frequency pattern detectors

Characteristics of the localized time-frequency features:
1) Local in the frequency domain, unlike MFCCs, where each feature is affected by all frequencies;
2) Temporal dynamics: they model long and variable time durations, whereas MFCC features are all short-time and of fixed duration.

In this paper, the set of filters adopted is very simple: they are essentially basic edge detectors taking only the values +1 and -1. The selection includes vertical edges (of varying frequency span and temporal duration) for onsets and offsets; wide horizontal edges for frication cutoffs; and horizontal edges tilted at various slopes to model formant transitions. The ranges of the various parameters were chosen based on acoustic phonetic knowledge, such as typical formant bandwidths, average phone durations, typical rates of formant movement, etc. [Book: Acoustic Phonetics]


With these filters, the features are computed as follows.
For each filter:
1) center it at a particular frequency and convolve it with the whole spectrogram along the time axis at that frequency;
2) for each frequency value, this yields a time series of convolution sums;
3) to reduce the feature dimension, the convolution sums are downsampled onto a 16*32 point grid;
4) the 32 frequency points are spaced linearly between 0 and 4 kHz;
5) the 16 time points (which are specific to their task) are the centers of the 16 states (of the HMM word model) in the state alignment.

Thus, a feature refers to both the filter shape, and the time-frequency point (16*32 point grid).
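
The procedure above can be sketched in Python as follows. This is a hedged illustration rather than the paper's code: the function names filter_response and grid_features are hypothetical, the filter is applied without flipping (cross-correlation, which differs from convolution only by a reversal of the filter), borders are zero-padded, and the grid is passed in as explicit frequency and time indices instead of being derived from an HMM state alignment.

```python
import numpy as np


def filter_response(spectrogram, filt, freq_index):
    """Slide a +1/-1 filter along time, centered near row freq_index.

    spectrogram is assumed to have shape (num_freq_bins, num_frames).
    Returns one filter sum per frame.
    """
    fh, fw = filt.shape
    padded = np.pad(spectrogram,
                    ((fh // 2, fh // 2), (fw // 2, fw // 2)),
                    mode="constant")
    band = padded[freq_index:freq_index + fh, :]
    num_frames = spectrogram.shape[1]
    return np.array([np.sum(band[:, t:t + fw] * filt)
                     for t in range(num_frames)])


def grid_features(spectrogram, filt, freq_indices, time_indices):
    """Sample the filter response on a (time, frequency) grid."""
    feats = np.zeros((len(time_indices), len(freq_indices)))
    for j, f in enumerate(freq_indices):
        response = filter_response(spectrogram, filt, f)
        feats[:, j] = response[list(time_indices)]
    return feats


# Toy spectrogram with an onset at frame 5, and a vertical-edge filter.
spec = np.zeros((8, 10))
spec[:, 5:] = 1.0
edge = np.tile([-1.0, 1.0], (3, 1))   # 3 frequency bins x 2 frames
feats = grid_features(spec, edge, [2, 4], [1, 5, 8])
print(feats)
```

In the paper's setting the grid would have 16 time points and 32 frequency points per filter; the toy grid here is 3 by 2 only to keep the output readable.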

In this paper, the task is to classify isolated digits. Each spectrogram contains one single digit. The features are computed per digit and are thus global to the target class.


The problem with current frame-based features is their lack of localization, which makes it difficult to model speaker variability.
As shown in the following figure:


Monday, November 29, 2010

[Speech] Phonetic cues

While automatic speech recognition systems have steadily improved and are now in widespread use, their accuracy continues to lag behind human performance, particularly in adverse conditions. 

There has been much progress and ASR technology is now in widespread use; however, there is still a considerable gap between human and machine performance, particularly in adverse conditions.

How has human evolution made humans different from machines in perceiving speech signals?

What are the major differences between humans and machines when processing speech signals? And which are the crucial ones?

The parts-based model (PBM), based on previous work in machine vision, uses graphical models to represent speech with a deformable template of spectro-temporally localized "parts", as opposed to modeling speech as a sequence of fixed spectral profiles.

Perhaps most importantly, ASR systems have benefited greatly from general improvements in computer technology. The availability of very large datasets and the ability to utilize them for training models has been very beneficial. Also, with ever increasing computing power, more powerful search techniques can be utilized during recognition.

Reaching the ultimate goal of ASR - human-level (or beyond) performance in all conditions and on all tasks - will require investigating other regions of this landscape, even if doing so results in back-tracking in progress in the short-term.

Utilize the knowledge of acoustic phonetics and human speech perception in speech recognition.

We will argue that well known phonetic cues crucial to human speech perception are not modeled effectively in standard ASR systems, and that there is a benefit to model such cues explicitly, rather than implicitly.

The "glimpsing" model of speech perception suggests that humans can robustly decode noise-corrupted speech by taking advantage of local T-F regions having high SNR, and is supported by empirical evidence and computational models.

Auditory Neuroscience, Tonotopic maps. Also, recent research seeking to characterize the behavior of individual neurons in the mammalian auditory cortex has resulted in models in which cortical neurons act as localized spectro-temporal pattern detectors (represented by their so-called spectro-temporal receptive field, or STRF).

+ Localized T-F pattern detectors
+ Explicit modeling of phonetic cues


acoustic ----------> phonetic cues  ------------------> phonemes


[DBN] Learning rate for RBM training

Thursday, November 25, 2010

[Speech] Spectrogram

From: http://www-3.unipv.it/cibra/edu_spectrogram_uk.html

To analyze sounds, one needs an acoustic receiver (a microphone, a hydrophone, or a vibration transducer) and an analyzer suitable for the frequencies of the signals to be measured. Optionally, a recorder can store the sounds permanently for later analysis or playback.

A spectrograph transforms sounds into images to make "visible", and thus measurable and comparable, sound features the human ear can't perceive. Spectrograms (also called sonograms or sonagrams) may show infrasounds, like those emitted by some large whales or by elephants, as well as ultrasounds, like those emitted by echolocating dolphins and bats, and also by insects and small rodents.

Spectrograms may reveal features, like fast frequency or amplitude modulations we can't hear even if they lie within our hearing frequency limits (30 Hz - 16 kHz). Spectrograms are widely used to show the features of animal voices, of the human voice and also of machinery noise.

A real-time spectrograph continuously displays the results of the analysis of incoming sounds with a very small, often imperceptible, delay. This kind of instrumentation is very useful in field research because it makes it possible to continuously monitor the sounds received by the sensors, to immediately evaluate their features, and to classify the received signals. A spectrograph can be a dedicated instrument or a normal computer equipped with suitable hardware for receiving and digitizing sounds and software to analyze the sounds and convert them into a graphical representation.

Normally, a spectrogram represents time on the x axis, frequency on the y axis, and the amplitude of the signals by a scale of grays or a scale of colours. In some applications, in particular those related to military uses, the x and y axes are swapped.

The quality and features of a spectrogram are controlled by a set of parameters. A default set can be used for generic display, but some parameters can be changed to optimize the display of specific features of the signals.
Also, by modifying the colour scale it is possible to optimize the display of the amplitude range of interest.
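
As an illustration of how such a display is computed and which parameters control it, here is a minimal Python sketch of a short-time Fourier transform spectrogram. It is not taken from the quoted page: the function name spectrogram, the Hann window, and the default frame and hop lengths are assumptions chosen for the example. The output rows are frequencies and the columns are time frames, matching the usual axis layout.

```python
import numpy as np


def spectrogram(signal, sample_rate, frame_length=256, hop_length=128):
    """Return (level_db, freqs, times) for a 1-D signal.

    level_db has shape (num_freq_bins, num_frames): frequency on the
    first axis, time on the second, level in dB as the value.
    """
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_length)
    num_frames = 1 + (len(signal) - frame_length) // hop_length
    frames = np.stack([
        signal[i * hop_length:i * hop_length + frame_length] * window
        for i in range(num_frames)
    ])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    level_db = 10.0 * np.log10(power + 1e-12)   # small offset avoids log(0)
    freqs = np.fft.rfftfreq(frame_length, d=1.0 / sample_rate)
    times = (np.arange(num_frames) * hop_length + frame_length / 2) / sample_rate
    return level_db.T, freqs, times


# One second of a 1 kHz tone sampled at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000.0 * t)
level_db, freqs, times = spectrogram(tone, sample_rate)
print(level_db.shape, freqs[np.argmax(level_db[:, 10])])
```

A longer frame gives finer frequency resolution at the cost of time resolution, which is the kind of trade-off the adjustable parameters expose; the dB values can then be mapped onto a gray or colour scale for display.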


Tuesday, November 23, 2010

[News] ACMTech Nov.23

AT&T Ups the Ante in Speech Recognition
CNet (11/18/10) Marguerite Reardon

AT&T says it has devised technologies to boost the accuracy of speech and language recognition technology as well as broaden voice activation to other modes of communication.  AT&T's Watson technology platform is a cloud-based system of services that identifies words as well as interprets meaning and contexts to make results more accurate.  AT&T recently demonstrated various technologies such as the iRemote, an application that transforms smartphones into voice-activated TV remotes that let users speak natural sentences asking to search for specific programs, actors, or genres.  Most voice-activated remotes respond to prerecorded commands, but the iRemote not only recognizes words, but also employs other language precepts such as syntax and semantics to interpret and comprehend the request's meaning.  AT&T also is working on voice technology that mimics natural voices through its AT&T Natural Voices technology, which builds on text-to-speech technology to enable any message to be spoken in various languages, including English, French, Italian, German, or Spanish when text is processed via the AT&T cloud-based service.  The technology accesses a database of recorded sounds that, when combined by algorithms, generate spoken phrases.

What If We Used Poetry to Teach Computers to Speak Better?
McGill University (11/17/10)

McGill University linguistics researcher Michael Wagner is studying how English and French speakers use acoustic cues to stress new information over old information.  Finding evidence of a systematic difference in how the two languages use these cues could aid computer programmers in their effort to produce more realistic-sounding speech.  Wagner is working with Harvard University's Katherine McCurdy to gain a better understanding of how people decide where to put emphasis.  They recently published research that examined the use of identical rhymes in poetry in each language.  The study found that even when repeated words differ in meaning and sound the same, the repeated information should be acoustically reduced as otherwise it will sound odd.  "Voice synthesis has become quite impressive in terms of the pronunciation of individual words," Wagner says.  "But when a computer 'speaks,' whole sentences still sound artificial because of the complicated way we put emphasis on parts of them, depending on context and what we want to get across."  Wagner is now working on a model that better predicts where emphasis should fall in a sentence given the context of discourse.


Monday, November 22, 2010

Enabling Terminal's directory and file color highlighting in Mac

From: http://www.geekology.co.za/blog/2009/04/enabling-bash-terminal-directory-file-color-highlighting-mac-os-x/

By default Mac OS X’s Terminal application uses the Bash shell (Bourne Again SHell) but doesn’t have directory and file color highlighting enabled to indicate resource types and permissions settings.


Enabling directory and file color highlighting requires that you open (or create) ~/.bash_profile in your favourite text editor, add these contents:

export CLICOLOR=1
export LSCOLORS=ExFxCxDxBxegedabagacad

… save the file and open a new Terminal window (shell session). Any variant of the “ls” command:

ls
ls -l
ls -la
ls -lah

… will then display its output in color.

More details on the LSCOLORS variable can be found by looking at the man page for “ls”:

man ls

LSCOLORS needs 11 sets of letters indicating foreground and background colors:

  1. directory
  2. symbolic link
  3. socket
  4. pipe
  5. executable
  6. block special
  7. character special
  8. executable with setuid bit set
  9. executable with setgid bit set
  10. directory writable to others, with sticky bit
  11. directory writable to others, without sticky bit

The possible letters to use are:

  a  black
  b  red
  c  green
  d  brown
  e  blue
  f  magenta
  g  cyan
  h  light grey
  A  bold black, usually shows up as dark grey
  B  bold red
  C  bold green
  D  bold brown, usually shows up as yellow
  E  bold blue
  F  bold magenta
  G  bold cyan
  H  bold light grey; looks like bright white
  x  default foreground or background
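
To see how an LSCOLORS value maps onto the 11 attributes, the string can be decoded pair by pair. The following Python sketch is only an illustration (the function name decode_lscolors is hypothetical and not part of any tool); it uses the letter meanings from the ls man page, where g stands for cyan, and reads each pair as foreground followed by background.

```python
COLOR_NAMES = {
    "a": "black", "b": "red", "c": "green", "d": "brown",
    "e": "blue", "f": "magenta", "g": "cyan", "h": "light grey",
    "A": "bold black", "B": "bold red", "C": "bold green",
    "D": "bold brown", "E": "bold blue", "F": "bold magenta",
    "G": "bold cyan", "H": "bold light grey", "x": "default",
}

ATTRIBUTES = [
    "directory", "symbolic link", "socket", "pipe", "executable",
    "block special", "character special",
    "executable with setuid bit set", "executable with setgid bit set",
    "directory writable to others, with sticky bit",
    "directory writable to others, without sticky bit",
]


def decode_lscolors(value):
    """Map each attribute to its (foreground, background) colour names."""
    if len(value) != 2 * len(ATTRIBUTES):
        raise ValueError("LSCOLORS must contain 11 pairs of letters")
    decoded = {}
    for i, attribute in enumerate(ATTRIBUTES):
        foreground, background = value[2 * i], value[2 * i + 1]
        decoded[attribute] = (COLOR_NAMES[foreground], COLOR_NAMES[background])
    return decoded


for attribute, colors in decode_lscolors("ExFxCxDxBxegedabagacad").items():
    print(attribute, colors)
```

With the value used in this post, directories come out as bold blue on the default background, and block special files as blue on cyan.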



[Apple] Old versions of Xcode

There are a bunch of tools for Mac development on the site http://connect.apple.com

It also provides old versions of Xcode.


Old versions of iPhone SDK

From: http://iphonesdkdev.blogspot.com/2010/04/old-versions-of-iphone-sdk.html

You need an Apple developer account to log in.
However, Apple has disabled some of the links recently.

iPhone SDK 2.2.1 Leopard (10.5.4)

iPhone SDK 3.0 (Xcode 3.1.3) Leopard (10.5.7)

iPhone SDK 3.0 (Xcode 3.2) Snow Leopard (10.6.0)

iPhone SDK 3.1 with Xcode 3.1.4 Leopard (10.5.7)

iPhone SDK 3.1 with Xcode 3.2.1 for Snow Leopard (10.6.0)

iPhone SDK 3.1.2 with Xcode 3.1.4 for Leopard (10.5.7)

iPhone SDK 3.1.2 with Xcode 3.2.1 for Snow Leopard (10.6.0)

Update: You are too late; Apple has removed the links above.

iPhone SDK 3.1.3 with Xcode 3.1.4 for Leopard (10.5.7)

iPhone SDK 3.1.3 with Xcode 3.2.1 for Snow Leopard (10.6.0)

iPhone SDK 3.2 Final with Xcode 3.2.2 for Snow Leopard (10.6.0)

Xcode 3.2.3 and iPhone SDK 4 GM seed for Snow Leopard (10.6.2)

Xcode 3.2.3 and iPhone SDK 4 Final for Snow Leopard (10.6.2)

Xcode 3.2.3 and iOS SDK 4.0.1 for Snow Leopard (10.6.4)

Xcode 3.2.3 and iOS SDK 4.0.2 for Snow Leopard (10.6.4)

Credits go to Cédric Luthi for telling us the correct url above

Xcode 3.2.4 and iOS SDK 4.1 for Snow Leopard (10.6.4)

Xcode 3.2.5 and iOS SDK 4.2 GM for Snow Leopard (10.6.4)


Tuesday, November 9, 2010

[DBN] Machine learning for sequential data: a review

For sequential problems, the sequences exhibit significant sequential correlation. That is, nearby x and y values are likely to be related to each other.

To model the sequential correlations, we usually adopt the joint probability for the whole sequence as the objective function for learning.

Methods that analyze the entire sequence of x_t values before predicting the y_t labels typically can give better performance on the sequential supervised learning problem.

Non-uniform loss functions are usually represented by a cost matrix C(i,j), which gives the cost of assigning label i to an example whose true label is j. In such cases, the goal is to find a classifier with minimum expected cost.
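
As a small worked example of minimum expected cost, the Python sketch below picks the label i that minimizes the sum over j of C(i,j) P(j|x). The function name and the numbers are made up for illustration; the posterior is assumed to come from some already-trained probabilistic classifier.

```python
import numpy as np


def min_expected_cost_label(posterior, cost):
    """Return (best_label, expected_costs).

    posterior[j] is P(true label = j | x).
    cost[i, j] is the cost of assigning label i when the true label is j.
    """
    posterior = np.asarray(posterior, dtype=float)
    cost = np.asarray(cost, dtype=float)
    expected = cost @ posterior          # expected[i] = sum_j C(i,j) P(j|x)
    return int(np.argmin(expected)), expected


posterior = [0.6, 0.4]
# Assigning label 0 to a true label-1 example is ten times worse
# than the opposite mistake.
cost = [[0.0, 10.0],
        [1.0, 0.0]]
label, expected = min_expected_cost_label(posterior, cost)
print(label, expected)
```

Here label 0 is the more probable one, yet the asymmetric costs make label 1 the cheaper decision; with a 0/1 cost matrix the rule reduces to picking the most probable label.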

How can this kind of loss function be incorporated into sequential supervised learning? One approach is to view the learning problem as the task of predicting the (conditional) joint probability of all the labels in the output sequence: P(y|x). If this joint distribution can be accurately predicted, then all of the various loss functions can be evaluated.

We would like to find methods for extracting features that handle long-distance interactions.

Approaches reviewed in this paper:

1) The sliding window method;
2) Recurrent sliding window
3) Hidden Markov Models
4) Maximum Entropy Markov Models
5) Input-Output Markov Models
6) Conditional Random Fields
7) Graph Transformer Networks
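
To make the first approach in this list concrete: the sliding window method turns the sequence problem into ordinary classification by predicting each y_t from a fixed-width window of inputs around x_t. The Python sketch below only builds those windows; the function name, the padding symbol, and the half-width parameter are illustrative assumptions, and any standard classifier could then be trained on the resulting (window, y_t) pairs.

```python
def sliding_windows(xs, half_width, pad=None):
    """Return one window of 2*half_width + 1 inputs centered on each x_t.

    Positions that fall outside the sequence are filled with pad.
    """
    xs = list(xs)
    padded = [pad] * half_width + xs + [pad] * half_width
    width = 2 * half_width + 1
    return [tuple(padded[t:t + width]) for t in range(len(xs))]


# Each window becomes the feature vector for the label at its center.
for window in sliding_windows("abc", half_width=1, pad="_"):
    print(window)
```

Because each window is classified independently, correlations between neighboring y_t values are not modeled; the remaining approaches in the list address that limitation in different ways.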


Sunday, November 7, 2010

[DBN] Advance machine learning lecture notes - III

Regarding the layer size for a deep belief network:
   The layers should not get smaller and they should be initialized correctly.

For visible and hidden units, whether to use binary values or probabilities:
  Only the values of the first hidden units are binary; the others, such as v0, v1, and h1, use real-valued probabilities.

How to measure the quality of the models as the probability is hard to compute due to the partition function:


[DBN] Energy based model for sparse overcomplete representation

In this paper, the authors propose energy-based models for overcomplete sparse representation problems.

They compare the energy-based approach with Independent Component Analysis (ICA) and show some similarities between the following training approaches:

1) The causal generative approach for ICA
2) The information maximization approach for ICA
3) The energy based approach 


[DBN] Advance machine learning lecture notes - II

Comparison between two model combination approaches:

PoE learning rule:

The derivation from the second equation to the third one is detailed in the attached file.

Download now or preview on posterous
poe.pdf (152 KB)


Thursday, November 4, 2010

[Papers] Selected papers from Interspeech 2010 by Nickolay

Following are a bunch of papers selected from INTERSPEECH 2010. The original post can be found here: http://nsh.nexiwave.com/2010/10/reading-interspeech-2010-program.html

Unsupervised Discovery and Training of Maximally Dissimilar Cluster Models
Francoise Beaufays (Google)
Vincent Vanhoucke (Google)
Brian Strope (Google)
One of the difficult problems of acoustic modeling for Automatic Speech Recognition (ASR) is how to adequately model the wide variety of acoustic conditions which may be present in the data. The problem is especially acute for tasks such as Google Search by Voice, where the amount of speech available per transaction is small, and adaptation techniques start showing their limitations. As training data from a very large user population is available however, it is possible to identify and jointly model subsets of the data with similar acoustic qualities. We describe a technique which allows us to perform this modeling at scale on large amounts of data by learning a tree-structured partition of the acoustic space, and we demonstrate that we can significantly improve recognition accuracy in various conditions through unsupervised Maximum Mutual Information (MMI) training. Being fully unsupervised, this technique scales easily to increasing numbers of conditions.

Techniques for topic detection based processing in spoken dialog systems
Rajesh Balchandran (IBM T J Watson Research Center)
Leonid Rachevsky (IBM T J Watson Research Center)
Bhuvana Ramabhadran (IBM T J Watson Research Center)
Miroslav Novak (IBM T J Watson Research Center)
In this paper we explore various techniques for topic detection in the context of conversational spoken dialog systems and also propose variants over known techniques to address the constraints of memory, accuracy and scalability associated with their practical implementation of spoken dialog systems. Tests were carried out on a multiple-topic spoken dialog system to compare and analyze these techniques. Results show benefits and compromises with each approach suggesting that the best choice of technique for topic detection would be dependent on the specific deployment requirements.

A Hybrid Approach to Robust Word Lattice Generation Via Acoustic-Based Word Detection
Icksang Han (Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd.)
Chiyoun Park (Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd.)
Jeongmi Cho (Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd.)
Jeongsu Kim (Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd.)
A large-vocabulary continuous speech recognition (LVCSR) system usually utilizes a language model in order to reduce the complexity of the algorithm. However, the constraint also produces side-effects including low accuracy of the out-of-grammar sentences and the error propagation of misrecognized words. In order to compensate for the side-effects of the language model, this paper proposes a novel lattice generation method that adopts the idea from the keyword detection method. By combining the word candidates detected mainly from the acoustic aspect of the signal to the word lattice from the ordinary speech recognizer, a hybrid lattice is constructed. The hybrid lattice shows 33% improvement in terms of the lattice accuracy under the condition where the lattice density is the same. In addition, it is observed that the proposed model shows less sensitivity to the out-of-grammar sentences and to the error propagation due to misrecognized words.

Time Conditioned Search in Automatic Speech Recognition Reconsidered
David Nolden (RWTH Aachen)
Hermann Ney (RWTH Aachen)
Ralf Schlueter (RWTH Aachen)
In this paper we re-investigate the time conditioned search (TCS) method in comparison to the well known word conditioned search, and analyze its applicability on state-of-the-art large vocabulary continuous speech recognition tasks. In contrast to current standard approaches, time conditioned search offers theoretical advantages particularly in combination with huge vocabularies and huge language models, but it is difficult to combine with across word modelling, which was proven to be an important technique in automatic speech recognition. Our novel contributions for TCS are a pruning step during the recombination called Early Word End Pruning, an additional recombination technique called Context Recombination, the idea of a Startup Interval to reduce the number of started trees, and a mechanism to combine TCS with across word modelling. We show that, with these techniques, TCS can outperform WCS on a current task.

Direct Construction of Compact Context-Dependency Transducers From Data
David Rybach (RWTH Aachen University, Germany)
Michael Riley (Google Inc., USA)
This paper describes a new method for building compact context-dependency transducers for finite-state transducer-based ASR decoders. Instead of the conventional phonetic decision-tree growing followed by FST compilation, this approach incorporates the phonetic context splitting directly into the transducer construction. The objective function of the split optimization is augmented with a regularization term that measures the number of transducer states introduced by a split. We give results on a large spoken-query task for various n-phone orders and other phonetic features that show this method can greatly reduce the size of the resulting context-dependency transducer with no significant impact on recognition accuracy. This permits using context sizes and features that might otherwise be unmanageable.

On the relation of Bayes Risk, Word Error, and Word Posteriors in ASR
Ralf Schlueter (Lehrstuhl fuer Informatik 6 - Computer Science Department, RWTH Aachen University)
Markus Nussbaum-Thom (Lehrstuhl fuer Informatik 6 - Computer Science Department, RWTH Aachen University)
Hermann Ney (Lehrstuhl fuer Informatik 6 - Computer Science Department, RWTH Aachen University)
In automatic speech recognition, we are faced with a well-known inconsistency: Bayes decision rule is usually used to minimize sentence (word sequence) error, whereas in practice we want to minimize word error, which also is the usual evaluation measure. Recently, a number of speech recognition approaches to approximate Bayes decision rule with word error (Levenshtein/edit distance) cost were proposed. Nevertheless, experiments show that the decisions often remain the same and that the effect on the word error rate is limited, especially at low error rates. In this work, further analytic evidence for these observations is provided. A set of conditions is presented, for which Bayes decision rule with sentence and word error cost function leads to the same decisions. Furthermore, the case of word error cost is investigated and related to word posterior probabilities. The analytic results are verified experimentally on several large vocabulary speech recognition tasks.

Efficient Data Selection for Speech Recognition Based on Prior Confidence Estimation Using Speech and Context Independent Models
Satoshi KOBASHIKAWA (NTT Cyber Space Laboratories, NTT Corporation)
Taichi ASAMI (NTT Cyber Space Laboratories, NTT Corporation)
Yoshikazu YAMAGUCHI (NTT Cyber Space Laboratories, NTT Corporation)
Hirokazu MASATAKI (NTT Cyber Space Laboratories, NTT Corporation)
Satoshi TAKAHASHI (NTT Cyber Space Laboratories, NTT Corporation)
This paper proposes an efficient data selection technique to identify well recognized texts in massive volumes of speech data. Conventional confidence measure techniques can be used to obtain this accurate data, but they require speech recognition results to estimate confidence. Without a significant level of confidence, considerable computer resources are wasted since inaccurate recognition results are generated only to be rejected later. The technique proposed herein rapidly estimates the prior confidence based on just an acoustic likelihood calculation by using speech and context independent models before speech recognition processing; it then recognizes data with high confidence selectively. Simulations show that it matches the data selection performance of the conventional posterior confidence measure with less than 2 % of the computation time.

Discovering an Optimal Set of Minimally Contrasting Acoustic Speech Units: A Point of Focus for Whole-Word Pattern Matching
Guillaume Aimetti (University of Sheffield)
Roger Moore (University of Sheffield)
Louis ten Bosch (Radboud University)
This paper presents a computational model that can automatically learn words, made up from emergent sub-word units, with no prior linguistic knowledge. This research is inspired by current cognitive theories of human speech perception, and therefore strives for ecological plausibility with the desire to build more robust speech recognition technology. Firstly, the particulate structure of the raw acoustic speech signal is derived through a novel acoustic segmentation process, the `acoustic DP-ngram algorithm'. Then, using a cross-modal association learning mechanism, word models are derived as a sequence of the segmented units. An efficient set of sub-word units emerges as a result of a general purpose lossy compression mechanism and the algorithm's sensitivity to discriminate acoustic differences. The results show that the system can automatically derive robust word representations and dynamically build re-usable sub-word acoustic units with no pre-defined language-specific rules.

Modeling pronunciation variation using context-dependent articulatory feature decision trees
Samuel Bowman (Linguistics, The University of Chicago)
Karen Livescu (TTI-Chicago)
We consider the problem of predicting the surface pronunciations of a word in conversational speech, using a feature-based model of pronunciation variation. We build context-dependent decision trees for both phone-based and feature-based models, and compare their perplexities on conversational data from the Switchboard Transcription Project. We find that feature-based decision trees using feature bundles based on articulatory phonology outperform phone-based decision trees, and are much more robust to reductions in training data. We also analyze the usefulness of various context variables.

Accelerating Hierarchical Acoustic Likelihood Computation on Graphics Processors
Pavel Kveton (IBM)
Miroslav Novak (IBM)
The paper presents a method for performance improvements of a speech recognition system by moving a part of the computation - acoustic likelihood computation - onto a Graphics Processor Unit (GPU). In the system, GPU operates as a low cost powerful coprocessor for linear algebra operations. The paper compares GPU implementation of two techniques of acoustic likelihood computation: full Gaussian computation of all components and a significantly faster Gaussian selection method using hierarchical evaluation. The full Gaussian computation is an ideal candidate for GPU implementation because of its matrix multiplication nature. The hierarchical Gaussian computation is a technique commonly used on a CPU since it leads to much better performance by pruning the computation volume. Pruning techniques are generally much harder to implement on GPUs, nevertheless, the paper shows that hierarchical Gaussian computation can be efficiently implemented on GPUs.

The AMIDA 2009 Meeting Transcription System
Thomas Hain (Univ Sheffield)
Lukas Burget (Brno Univ. of Technology)
John Dines (Idiap)
Philip N. Garner (Idiap)
Asmaa El Hannani (Univ. Sheffield)
Marijn Huijbregts (Univ. Twente)
Martin Karafiat (Brno Univ. of Technology)
Mike Lincoln (Univ. of Edinburgh)
Wan Vincent (Univ. Of Sheffield)
We present the AMIDA 2009 system for participation in the NIST RT'2009 STT evaluations. Systems for close-talking, far field and speaker attributed STT conditions are described. Improvements to our previous systems are: segmentation and diarisation; stacked bottle-neck posterior feature extraction; fMPE training of acoustic models; adaptation on complete meetings; improvements to WFST decoding; automatic optimisation of decoders and system graphs. Overall these changes gave a 6-13% relative reduction in word error rate while at the same time reducing the real-time factor by a factor of five and using considerably less data for acoustic model training.

Robert Peharz (Graz University of Technology)
Michael Stark (Graz University of Technology)
Franz Pernkopf (Graz University of Technology)
Yannis Stylianou (University of Crete)
We propose a probabilistic factorial sparse coder model for single channel source separation in the magnitude spectrogram domain. The mixture spectrogram is assumed to be the sum of the sources, which are assumed to be generated frame-wise as the output of sparse coders plus noise. For dictionary training we use an algorithm which can be described as non-negative matrix factorization with ℓ0 sparseness constraints. In order to infer likely source spectrogram candidates, we approximate the intractable exact inference by maximizing the posterior over a plausible subset of solutions. We compare our system to the factorial-max vector quantization model, where the proposed method shows a superior performance in terms of signal-to-interference ratio. Finally, the low computational requirements of the algorithm allows close to real time applications.

Yasmina Benabderrahmane (INRS-EMT Telecommunications Canada)
Sid Ahmed Selouani (Université de Moncton Canada)
Douglas O’Shaughnessy (INRS-EMT Telecommunications Canada)
This paper deals with blind speech separation of convolutive mixtures of sources. The separation criterion is based on Oriented Principal Components Analysis (OPCA) in the frequency domain. OPCA is a (second order) extension of standard Principal Component Analysis (PCA) aiming at maximizing the power ratio of a pair of signals. The convolutive mixing is obtained by modeling the Head Related Transfer Function (HRTF). Experimental results show the efficiency of the proposed approach in terms of subjective and objective evaluation, when compared to the Degenerate Unmixing Evaluation Technique (DUET) and the widely used C-FICA (Convolutive Fast-ICA) algorithm.

Speaker Adaptation Based on System Combination Using Speaker-Class Models
Tetsuo Kosaka (Yamagata University)
Takashi Ito (Yamagata University)
Masaharu Kato (Yamagata University)
Masaki Kohda (Yamagata University)
In this paper, we propose a new system combination approach for an LVCSR system using speaker-class (SC) models and a speaker adaptation technique based on these SC models. The basic concept of the SC-based system is to select speakers who are acoustically similar to a target speaker to train acoustic models. One of the major problems regarding the use of the SC model is determining the selection range of the speakers. In other words, it is difficult to determine the number of speakers that should be selected. In order to solve this problem, several SC models, each trained on a different number of speakers, are prepared in advance. In the recognition step, acoustically similar models are selected from the above SC models, and the scores obtained from these models are merged using a word graph combination technique. The proposed method was evaluated using the Corpus of Spontaneous Japanese (CSJ), and showed significant improvement in a lecture speech recognition task.

Feature versus Model Based Noise Robustness
Kris Demuynck (Katholieke Universiteit Leuven, dept. ESAT)
Xueru Zhang (Katholieke Universiteit Leuven, dept. ESAT)
Dirk Van Compernolle (Katholieke Universiteit Leuven, dept. ESAT)
Hugo Van hamme (Katholieke Universiteit Leuven, dept. ESAT)
Over the years, the focus in noise-robust speech recognition has shifted from noise-robust features to model-based techniques such as parallel model combination and uncertainty decoding. In this paper, we contrast prime examples of both approaches in the context of large vocabulary recognition systems such as those used for automatic audio indexing and transcription. We look at the approximations the techniques require to keep the computational load reasonable, the resulting computational cost, and the accuracy measured on the Aurora4 benchmark. The results show that a well-designed feature-based scheme is capable of providing recognition accuracies at least as good as the model-based approaches at a substantially lower computational cost.

The role of higher-level linguistic features in HMM-based speech synthesis
Oliver Watts (Centre for Speech Technology Research, University of Edinburgh, UK)
Junichi Yamagishi (Centre for Speech Technology Research, University of Edinburgh, UK)
Simon King (Centre for Speech Technology Research, University of Edinburgh, UK)
We analyse the contribution of higher-level elements of the linguistic specification of a data-driven speech synthesiser to the naturalness of the synthetic speech which it generates. The system is trained using various subsets of the full feature-set, in which features relating to syntactic category, intonational phrase boundary, pitch accent and boundary tones are selectively removed. Utterances synthesised by the different configurations of the system are then compared in a subjective evaluation of their naturalness. The work presented forms background analysis for an on-going set of experiments in performing text-to-speech (TTS) conversion based on shallow features: features that can be trivially extracted from text. By building a range of systems, each assuming the availability of a different level of linguistic annotation, we obtain benchmarks for our on-going work.

Latent Perceptual Mapping: A New Acoustic Modeling Framework for Speech Recognition
Shiva Sundaram (Deutsche Telekom Laboratories, Ernst-Reuter-Platz-7, Berlin 10587. Germany)
Jerome Bellegarda (Apple Inc., 3 Infinite Loop, Cupertino, 95014 California. USA.)
While hidden Markov modeling is still the dominant paradigm for speech recognition, in recent years there has been renewed interest in alternative, template-like approaches to acoustic modeling. Such methods sidestep usual HMM limitations as well as inherent issues with parametric statistical distributions, though typically at the expense of large amounts of memory and computing power. This paper introduces a new framework, dubbed latent perceptual mapping, which naturally leverages a reduced dimensionality description of the observations. This allows for a viable parsimonious template-like solution where models are closely aligned with perceived acoustic events. Context-independent phoneme classification experiments conducted on the TIMIT database suggest that latent perceptual mapping achieves results comparable to conventional acoustic modeling but at potentially significant savings in online costs.

State-based labelling for a sparse representation of speech and its application to robust speech recognition
Tuomas Virtanen (Department of Signal Processing, Tampere University of Technology, Finland)
Jort F. Gemmeke (Centre for Language and Speech Technology, Radboud University Nijmegen, The Netherlands)
Antti Hurmalainen (Department of Signal Processing, Tampere University of Technology, Finland)
This paper proposes a state-based labelling for acoustic patterns of speech and a method for using this labelling in noise-robust automatic speech recognition. Acoustic time-frequency segments of speech, exemplars, are obtained from a training database and associated with time-varying state labels using the transcriptions. In the recognition phase, noisy speech is modeled by a sparse linear combination of noise and speech exemplars. The likelihoods of states are obtained by linear combination of the exemplar weights, which can then be used to estimate the most likely state transition path. The proposed method was tested in the connected digit recognition task with noisy speech material from the Aurora-2 database where it is shown to produce better results than the existing histogram-based labelling method.
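
The last two steps of the abstract above (state scores as a linear combination of exemplar weights, then a search for the best state path) can be sketched as follows. This is a minimal toy version under stated assumptions: the label matrix, the activations, the left-to-right transition constraints and the use of a product of frame scores as the path score are all hypothetical, and the sparse decomposition that would produce the exemplar weights is not shown.

```python
def state_scores(weights, labels):
    """Score each state as a linear combination of exemplar weights.

    weights: one non-negative activation per speech exemplar (one frame).
    labels[s][e]: how strongly exemplar e is associated with state s.
    """
    return [sum(l * w for l, w in zip(row, weights)) for row in labels]

def best_path(frame_scores, allowed):
    """Viterbi search maximising the product of per-frame state scores.

    allowed[i][j] is True if a transition from state i to state j is permitted.
    """
    n_states = len(frame_scores[0])
    best = list(frame_scores[0])
    back = []
    for scores in frame_scores[1:]:
        prev, best, ptr = best, [], []
        for j in range(n_states):
            cands = [(prev[i], i) for i in range(n_states) if allowed[i][j]]
            p, i = max(cands, default=(0.0, 0))  # unreachable states score 0
            best.append(p * scores[j])
            ptr.append(i)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: best[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

# Toy setup: 3 exemplars, 2 states, left-to-right transitions only.
labels = [[1.0, 0.0, 0.5],
          [0.0, 1.0, 0.5]]
activations = [[0.9, 0.1, 0.0], [0.2, 0.3, 0.5], [0.0, 0.8, 0.2]]
allowed = [[True, True], [False, True]]
frames = [state_scores(w, labels) for w in activations]
print(best_path(frames, allowed))
```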

Single-channel speech enhancement using Kalman filtering in the modulation domain
Stephen So (Signal Processing Laboratory, Griffith University)
Kamil K. Wojcicki (Signal Processing Laboratory, Griffith University)
Kuldip K. Paliwal (Signal Processing Laboratory, Griffith University)
In this paper, we propose the modulation-domain Kalman filter (MDKF) for speech enhancement. In contrast to previous modulation-domain enhancement methods based on bandpass filtering, the MDKF is an adaptive and linear MMSE estimator that uses models of the temporal changes of the magnitude spectrum for both speech and noise. Also, because the Kalman filter is a joint magnitude and phase spectrum estimator, under non-stationarity assumptions, it is highly suited for modulation-domain processing, as modulation phase tends to contain more speech information than acoustic phase. Experimental results from the NOIZEUS corpus show the ideal MDKF (with clean speech parameters) to outperform all the acoustic and time-domain enhancement methods that were evaluated, including the conventional time-domain Kalman filter with clean speech parameters. A practical MDKF that uses the MMSE-STSA method to enhance noisy speech in the acoustic domain prior to LPC analysis was also evaluated and showed promising results.

Metric Subspace Indexing for Fast Spoken Term Detection
Taisuke Kaneko (Toyohashi University of Technology)
Tomoyosi Akiba (Toyohashi University of Technology)
In this paper, we propose a novel indexing method for Spoken Term Detection (STD). The proposed method can be considered as using metric space indexing for the approximate string-matching problem, where the distance between a phoneme and a position in the target spoken document is defined. The proposed method does not require the use of thresholds to limit the output, instead being able to output the results in increasing order of distance. It can also deal easily with the multiple candidates obtained via Automatic Speech Recognition (ASR). The results of preliminary experiments show promise for achieving fast STD.

Discriminative Language Modeling Using Simulated ASR Errors
Preethi Jyothi (Department of Computer Science and Engineering, The Ohio State University, USA)
Eric Fosler-Lussier (Department of Computer Science and Engineering, The Ohio State University, USA)
In this paper, we approach the problem of discriminatively training language models using a weighted finite state transducer (WFST) framework that does not require acoustic training data. The phonetic confusions prevalent in the recognizer are modeled using a confusion matrix that takes into account information from the pronunciation model (word-based phone confusion log likelihoods) and information from the acoustic model (distances between the phonetic acoustic models). This confusion matrix, within the WFST framework, is used to generate confusable word graphs that serve as inputs to the averaged perceptron algorithm to train the parameters of the discriminative language model. Experiments on a large vocabulary speech recognition task show significant word error rate reductions when compared to a baseline using a trigram model trained with the maximum likelihood criterion.
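
For readers unfamiliar with the averaged perceptron mentioned in the abstract above, here is a minimal sketch of how it could train a discriminative language model from references paired with confusable hypotheses. The unigram/bigram feature set, the plain candidate lists (standing in for the confusable word graphs) and the toy sentences are assumptions for illustration; the WFST-based generation of confusions is not shown.

```python
from collections import defaultdict

def features(words):
    """Unigram and bigram counts of a hypothesis (a hypothetical feature set)."""
    feats = defaultdict(float)
    for i, w in enumerate(words):
        feats[("uni", w)] += 1.0
        if i > 0:
            feats[("bi", words[i - 1], w)] += 1.0
    return feats

def score(weights, words):
    """Linear model score of a word sequence."""
    return sum(weights.get(f, 0.0) * v for f, v in features(words).items())

def train_averaged_perceptron(data, epochs=5):
    """data: list of (reference, confusable_hypotheses) pairs of word lists."""
    weights = defaultdict(float)
    totals = defaultdict(float)
    steps = 0
    for _ in range(epochs):
        for reference, hypotheses in data:
            # Reference goes last so that ties are resolved against it.
            candidates = hypotheses + [reference]
            guess = max(candidates, key=lambda h: score(weights, h))
            if guess != reference:
                # Standard perceptron update: reward reference, penalise guess.
                for f, v in features(reference).items():
                    weights[f] += v
                for f, v in features(guess).items():
                    weights[f] -= v
            # Accumulate the running sum used for averaging.
            for f, v in weights.items():
                totals[f] += v
            steps += 1
    return {f: t / steps for f, t in totals.items()}

data = [
    (["recognize", "speech"], [["wreck", "a", "nice", "beach"]]),
    (["two", "words"], [["too", "words"]]),
]
model = train_averaged_perceptron(data)
print(score(model, ["two", "words"]), score(model, ["too", "words"]))
```

Averaging the weights over all training steps, rather than keeping only the final weights, is what distinguishes the averaged perceptron from the plain one and usually makes the result less sensitive to the order of the training examples.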

Learning a Language Model from Continuous Speech
Graham Neubig (Graduate School of Informatics, Kyoto University)
Masato Mimura (Graduate School of Informatics, Kyoto University)
Shinsuke Mori (Graduate School of Informatics, Kyoto University)
Tatsuya Kawahara (Graduate School of Informatics, Kyoto University)
This paper presents a new approach to language model construction, learning a language model not from text, but directly from continuous speech. A phoneme lattice is created using acoustic model scores, and Bayesian techniques are used to robustly learn a language model from this noisy input. A novel sampling technique is devised that allows for the integrated learning of word boundaries and an n-gram language model with no prior linguistic knowledge. The proposed techniques were used to learn a language model directly from continuous, potentially large-vocabulary speech. This language model was able to significantly reduce the ASR phoneme error rate over a separate set of test data, and the proposed lattice processing and lexical acquisition techniques were found to be important factors in this improvement.

New Insights into Subspace Noise Tracking
Mahdi Triki (Philips Research Laboratories)

Posted via email from Troy's posterous