Thursday, November 4, 2010

[Papers] Selected papers from Interspeech 2010 by Nickolay

Following are a bunch of papers selected from INTERSPEECH 2010. The original post can be found here: http://nsh.nexiwave.com/2010/10/reading-interspeech-2010-program.html

Unsupervised Discovery and Training of Maximally Dissimilar Cluster Models
Francoise Beaufays (Google)
Vincent Vanhoucke (Google)
Brian Strope (Google)
One of the difficult problems of acoustic modeling for Automatic Speech Recognition (ASR) is how to adequately model the wide variety of acoustic conditions which may be present in the data. The problem is especially acute for tasks such as Google Search by Voice, where the amount of speech available per transaction is small, and adaptation techniques start showing their limitations. As training data from a very large user population is available, however, it is possible to identify and jointly model subsets of the data with similar acoustic qualities. We describe a technique which allows us to perform this modeling at scale on large amounts of data by learning a tree-structured partition of the acoustic space, and we demonstrate that we can significantly improve recognition accuracy in various conditions through unsupervised Maximum Mutual Information (MMI) training. Being fully unsupervised, this technique scales easily to increasing numbers of conditions.
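
As a toy illustration of the tree-structured partition idea (this is my own sketch, not the authors' algorithm, which grows the tree jointly with MMI-trained acoustic models), per-utterance summary vectors could be split recursively with 2-means; all names, stopping rules, and the choice of features are assumptions:

```python
import numpy as np

def two_means(X, n_iter=20, seed=0):
    """Plain 2-means clustering; returns the assignment of each row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=2, replace=False)].copy()
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # (N, 2)
        assign = d.argmin(axis=1)
        for k in range(2):
            if np.any(assign == k):
                centers[k] = X[assign == k].mean(axis=0)
    return assign

def build_tree(X, min_size=100, depth=0, max_depth=4):
    """Recursively split per-utterance summary vectors into a binary tree of
    acoustic-condition clusters (divisive clustering)."""
    if depth >= max_depth or len(X) < 2 * min_size:
        return {"leaf": True, "size": len(X)}
    assign = two_means(X)
    return {"leaf": False,
            "children": [build_tree(X[assign == k], min_size, depth + 1, max_depth)
                         for k in range(2)]}
```

Each leaf would then get its own acoustic model trained on the utterances that fall into it, which is the "maximally dissimilar cluster models" part of the title.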

Techniques for topic detection based processing in spoken dialog systems
Rajesh Balchandran (IBM T J Watson Research Center)
Leonid Rachevsky (IBM T J Watson Research Center)
Bhuvana Ramabhadran (IBM T J Watson Research Center)
Miroslav Novak (IBM T J Watson Research Center)
In this paper we explore various techniques for topic detection in the context of conversational spoken dialog systems, and we also propose variants of known techniques to address the constraints of memory, accuracy, and scalability associated with their practical implementation in spoken dialog systems. Tests were carried out on a multiple-topic spoken dialog system to compare and analyze these techniques. Results show benefits and compromises with each approach, suggesting that the best choice of technique for topic detection depends on the specific deployment requirements.

A Hybrid Approach to Robust Word Lattice Generation Via Acoustic-Based Word Detection
Icksang Han (Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd.)
Chiyoun Park (Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd.)
Jeongmi Cho (Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd.)
Jeongsu Kim (Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd.)
A large-vocabulary continuous speech recognition (LVCSR) system usually utilizes a language model in order to reduce the complexity of the algorithm. However, this constraint also produces side effects, including low accuracy on out-of-grammar sentences and error propagation from misrecognized words. In order to compensate for these side effects of the language model, this paper proposes a novel lattice generation method that adopts ideas from keyword detection. By combining the word candidates detected mainly from the acoustic aspect of the signal with the word lattice from the ordinary speech recognizer, a hybrid lattice is constructed. The hybrid lattice shows a 33% improvement in terms of lattice accuracy under the condition that the lattice density is the same. In addition, it is observed that the proposed model shows less sensitivity to out-of-grammar sentences and to error propagation due to misrecognized words.

Time Conditioned Search in Automatic Speech Recognition Reconsidered
David Nolden (RWTH Aachen)
Hermann Ney (RWTH Aachen)
Ralf Schlueter (RWTH Aachen)
In this paper we re-investigate the time conditioned search (TCS) method in comparison to the well-known word conditioned search (WCS), and analyze its applicability to state-of-the-art large vocabulary continuous speech recognition tasks. In contrast to current standard approaches, time conditioned search offers theoretical advantages, particularly in combination with huge vocabularies and huge language models, but it is difficult to combine with across-word modelling, which has proven to be an important technique in automatic speech recognition. Our novel contributions for TCS are a pruning step during recombination called Early Word End Pruning, an additional recombination technique called Context Recombination, the idea of a Startup Interval to reduce the number of started trees, and a mechanism to combine TCS with across-word modelling. We show that, with these techniques, TCS can outperform WCS on a current task.

Direct Construction of Compact Context-Dependency Transducers From Data
David Rybach (RWTH Aachen University, Germany)
Michael Riley (Google Inc., USA)
This paper describes a new method for building compact context-dependency transducers for finite-state transducer-based ASR decoders. Instead of the conventional phonetic decision-tree growing followed by FST compilation, this approach incorporates the phonetic context splitting directly into the transducer construction. The objective function of the split optimization is augmented with a regularization term that measures the number of transducer states introduced by a split. We give results on a large spoken-query task for various n-phone orders and other phonetic features that show this method can greatly reduce the size of the resulting context-dependency transducer with no significant impact on recognition accuracy. This permits using context sizes and features that might otherwise be unmanageable.
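
Schematically, the regularized split selection described in the abstract can be written as follows (my notation, not the paper's):

```latex
q^{*} \;=\; \arg\max_{q} \Big[\, \Delta\mathcal{L}(q) \;-\; \lambda\, \Delta S(q) \,\Big]
```

where \Delta\mathcal{L}(q) is the likelihood gain of candidate split q, \Delta S(q) is the number of transducer states the split would introduce, and \lambda trades off model fit against the size of the context-dependency transducer.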

On the relation of Bayes Risk, Word Error, and Word Posteriors in ASR
Ralf Schlueter (Lehrstuhl fuer Informatik 6 - Computer Science Department, RWTH Aachen University)
Markus Nussbaum-Thom (Lehrstuhl fuer Informatik 6 - Computer Science Department, RWTH Aachen University)
Hermann Ney (Lehrstuhl fuer Informatik 6 - Computer Science Department, RWTH Aachen University)
In automatic speech recognition, we are faced with a well-known inconsistency: the Bayes decision rule is usually used to minimize sentence (word sequence) error, whereas in practice we want to minimize word error, which is also the usual evaluation measure. Recently, a number of speech recognition approaches that approximate the Bayes decision rule with a word error (Levenshtein/edit distance) cost were proposed. Nevertheless, experiments show that the decisions often remain the same and that the effect on the word error rate is limited, especially at low error rates. In this work, further analytic evidence for these observations is provided. A set of conditions is presented for which the Bayes decision rule with sentence and word error cost functions leads to the same decisions. Furthermore, the case of word error cost is investigated and related to word posterior probabilities. The analytic results are verified experimentally on several large vocabulary speech recognition tasks.
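
For reference, the two decision rules being contrasted are the standard ones (textbook notation, not taken from the paper):

```latex
% sentence-error (0-1) cost: the usual MAP rule
\hat{W}_{\mathrm{sent}} \;=\; \arg\max_{W}\; p(W \mid X)

% word-error (Levenshtein) cost: minimize the expected edit distance
\hat{W}_{\mathrm{word}} \;=\; \arg\min_{W}\; \sum_{V} p(V \mid X)\, \mathrm{Lev}(W, V)
```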

Efficient Data Selection for Speech Recognition Based on Prior Confidence Estimation Using Speech and Context Independent Models
Satoshi Kobashikawa (NTT Cyber Space Laboratories, NTT Corporation)
Taichi Asami (NTT Cyber Space Laboratories, NTT Corporation)
Yoshikazu Yamaguchi (NTT Cyber Space Laboratories, NTT Corporation)
Hirokazu Masataki (NTT Cyber Space Laboratories, NTT Corporation)
Satoshi Takahashi (NTT Cyber Space Laboratories, NTT Corporation)
This paper proposes an efficient data selection technique to identify well recognized texts in massive volumes of speech data. Conventional confidence measure techniques can be used to obtain such accurate data, but they require speech recognition results to estimate confidence. Without a significant level of confidence, considerable computing resources are wasted, since inaccurate recognition results are generated only to be rejected later. The technique proposed herein rapidly estimates prior confidence based on just an acoustic likelihood calculation, using speech and context independent models before speech recognition; it then selectively recognizes the data with high confidence. Simulations show that it matches the data selection performance of the conventional posterior confidence measure with less than 2% of the computation time.
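
The idea, as I read it, is to score each utterance with cheap models before decoding and only recognize the confident ones. A rough sketch of such a pre-recognition score (the GMM likelihood-ratio form below is my assumption, not necessarily the exact measure used in the paper):

```python
import numpy as np
from scipy.special import logsumexp

def gmm_frame_loglik(X, means, variances, log_weights):
    """Per-frame log-likelihood under a diagonal-covariance GMM.
    X: (T, D) features; means, variances: (M, D); log_weights: (M,)."""
    diff = X[:, None, :] - means[None, :, :]                     # (T, M, D)
    ll = -0.5 * (np.log(2 * np.pi * variances).sum(-1)           # (M,)
                 + (diff ** 2 / variances).sum(-1))              # (T, M)
    return logsumexp(ll + log_weights, axis=-1)                  # (T,)

def prior_confidence(X, speech_gmm, background_gmm):
    """Average frame log-likelihood ratio between a context-independent speech
    model and a background model: a cheap score computed before recognition."""
    return float((gmm_frame_loglik(X, *speech_gmm)
                  - gmm_frame_loglik(X, *background_gmm)).mean())

# Utterances whose score exceeds a threshold would be passed to the recognizer;
# the rest are skipped, saving the decoding cost of likely-inaccurate data.
```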

Discovering an Optimal Set of Minimally Contrasting Acoustic Speech Units: A Point of Focus for Whole-Word Pattern Matching
Guillaume Aimetti (University of Sheffield)
Roger Moore (University of Sheffield)
Louis ten Bosch (Radboud University)
This paper presents a computational model that can automatically learn words, made up from emergent sub-word units, with no prior linguistic knowledge. This research is inspired by current cognitive theories of human speech perception, and therefore strives for ecological plausibility with the desire to build more robust speech recognition technology. Firstly, the particulate structure of the raw acoustic speech signal is derived through a novel acoustic segmentation process, the `acoustic DP-ngram algorithm'. Then, using a cross-modal association learning mechanism, word models are derived as sequences of the segmented units. An efficient set of sub-word units emerges as a result of a general-purpose lossy compression mechanism and the algorithm's ability to discriminate acoustic differences. The results show that the system can automatically derive robust word representations and dynamically build re-usable sub-word acoustic units with no pre-defined language-specific rules.

Modeling pronunciation variation using context-dependent articulatory feature decision trees
Samuel Bowman (Linguistics, The University of Chicago)
Karen Livescu (TTI-Chicago)
We consider the problem of predicting the surface pronunciations of a word in conversational speech, using a feature-based model of pronunciation variation. We build context-dependent decision trees for both phone-based and feature-based models, and compare their perplexities on conversational data from the Switchboard Transcription Project. We find that feature-based decision trees using feature bundles based on articulatory phonology outperform phone-based decision trees, and are much more robust to reductions in training data. We also analyze the usefulness of various context variables.

Accelerating Hierarchical Acoustic Likelihood Computation on Graphics Processors
Pavel Kveton (IBM)
Miroslav Novak (IBM)
The paper presents a method for improving the performance of a speech recognition system by moving a part of the computation - acoustic likelihood computation - onto a Graphics Processing Unit (GPU). In the system, the GPU operates as a low-cost, powerful coprocessor for linear algebra operations. The paper compares GPU implementations of two techniques for acoustic likelihood computation: full Gaussian computation of all components, and a significantly faster Gaussian selection method using hierarchical evaluation. The full Gaussian computation is an ideal candidate for GPU implementation because of its matrix-multiplication nature. Hierarchical Gaussian computation is a technique commonly used on a CPU, since it leads to much better performance by pruning the computation volume. Pruning techniques are generally much harder to implement on GPUs; nevertheless, the paper shows that hierarchical Gaussian computation can also be implemented efficiently on GPUs.
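
The matrix-multiplication nature of the full Gaussian computation is easy to see: for diagonal-covariance Gaussians, the per-frame log-likelihoods of all components reduce to a single GEMM. A minimal numpy sketch of that reduction (my own, not the paper's implementation; the same two lines map directly onto a GPU BLAS call, e.g. via CuPy):

```python
import numpy as np

def precompute_coeffs(means, variances):
    """Pack diagonal-Gaussian parameters so that per-frame log-likelihoods
    for all M components reduce to one matrix multiplication."""
    inv_var = 1.0 / variances                                    # (M, D)
    const = -0.5 * (np.log(2 * np.pi * variances).sum(1)
                    + (means ** 2 * inv_var).sum(1))             # (M,)
    # Coefficient matrix A of shape (2D + 1, M)
    return np.vstack([(-0.5 * inv_var).T,      # multiplies x_d^2
                      (means * inv_var).T,     # multiplies x_d
                      const[None, :]])         # constant term

def batch_loglik(X, A):
    """X: (T, D) frames; returns (T, M) log-likelihoods via one matmul."""
    T, D = X.shape
    Phi = np.hstack([X ** 2, X, np.ones((T, 1))])   # (T, 2D + 1)
    return Phi @ A
```

The hierarchical (Gaussian selection) variant evaluates only a data-dependent subset of components per frame, which is exactly the kind of branching that GPUs dislike; the interesting part of the paper is showing that it can still be done efficiently.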

The AMIDA 2009 Meeting Transcription System
Thomas Hain (Univ Sheffield)
Lukas Burget (Brno Univ. of Technology)
John Dines (Idiap)
Philip N. Garner (Idiap)
Asmaa El Hannani (Univ. Sheffield)
Marijn Huijbregts (Univ. Twente)
Martin Karafiat (Brno Univ. of Technology)
Mike Lincoln (Univ. of Edinburgh)
Vincent Wan (Univ. of Sheffield)
We present the AMIDA 2009 system for participation in the NIST RT'2009 STT evaluations. Systems for close-talking, far field and speaker attributed STT conditions are described. Improvements to our previous systems are: segmentation and diarisation; stacked bottle-neck posterior feature extraction; fMPE training of acoustic models; adaptation on complete meetings; improvements to WFST decoding; automatic optimisation of decoders and system graphs. Overall these changes gave a 6-13% relative reduction in word error rate while at the same time reducing the real-time factor by a factor of five and using considerably less data for acoustic model training.

A Factorial Sparse Coder Model for Single Channel Source Separation
Robert Peharz (Graz University of Technology)
Michael Stark (Graz University of Technology)
Franz Pernkopf (Graz University of Technology)
Yannis Stylianou (University of Crete)
We propose a probabilistic factorial sparse coder model for single channel source separation in the magnitude spectrogram domain. The mixture spectrogram is assumed to be the sum of the sources, which are assumed to be generated frame-wise as the output of sparse coders plus noise. For dictionary training we use an algorithm which can be described as non-negative matrix factorization with ℓ0 sparseness constraints. In order to infer likely source spectrogram candidates, we approximate the intractable exact inference by maximizing the posterior over a plausible subset of solutions. We compare our system to the factorial-max vector quantization model, where the proposed method shows superior performance in terms of signal-to-interference ratio. Finally, the low computational requirements of the algorithm allow close-to-real-time applications.
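
For readers unfamiliar with this family of methods, here is a much simplified, standard NMF-style separation baseline (not the authors' ℓ0-constrained factorial sparse coder; function names and the masking step are mine) showing the dictionary-based decomposition idea:

```python
import numpy as np

def nnls_activations(V, W, n_iter=200, eps=1e-9):
    """Non-negative activations H minimizing ||V - W H||_F with W fixed
    (standard multiplicative updates)."""
    H = np.full((W.shape[1], V.shape[1]), 0.1)
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return H

def separate(V_mix, W_speech, W_noise):
    """Decompose a mixture magnitude spectrogram against two dictionaries and
    return the masked speech estimate."""
    W = np.hstack([W_speech, W_noise])
    H = nnls_activations(V_mix, W)
    k = W_speech.shape[1]
    S = W_speech @ H[:k]           # speech magnitude estimate
    N = W_noise  @ H[k:]           # interference magnitude estimate
    mask = S / (S + N + 1e-9)      # Wiener-like soft mask
    return mask * V_mix
```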

Oriented PCA Method for Blind Speech Separation of Convolutive Mixtures
Yasmina Benabderrahmane (INRS-EMT Telecommunications Canada)
Sid Ahmed Selouani (Université de Moncton Canada)
Douglas O’Shaughnessy (INRS-EMT Telecommunications Canada)
This paper deals with blind speech separation of convolutive mixtures of sources. The separation criterion is based on Oriented Principal Component Analysis (OPCA) in the frequency domain. OPCA is a (second order) extension of standard Principal Component Analysis (PCA) aiming at maximizing the power ratio of a pair of signals. The convolutive mixing is obtained by modeling the Head Related Transfer Function (HRTF). Experimental results show the efficiency of the proposed approach in terms of subjective and objective evaluation, when compared to the Degenerate Unmixing Estimation Technique (DUET) and the widely used C-FICA (Convolutive Fast-ICA) algorithm.
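
The core OPCA step is a second-order power-ratio maximization, i.e. a generalized eigenvalue problem on two covariance matrices. A minimal sketch of just that step (my own names; the paper applies it per frequency bin to the convolutive HRTF mixtures):

```python
import numpy as np
from scipy.linalg import eigh

def opca_directions(X_target, X_interference, ridge=1e-8):
    """Oriented PCA: directions w maximizing the power ratio
    (w' C_t w) / (w' C_i w), i.e. the generalized eigenvectors of the two
    covariance matrices. Rows of X_target / X_interference are observations."""
    C_t = np.cov(X_target, rowvar=False)
    C_i = np.cov(X_interference, rowvar=False) \
          + ridge * np.eye(X_interference.shape[1])   # keep C_i positive definite
    vals, vecs = eigh(C_t, C_i)          # generalized symmetric eigenproblem
    order = np.argsort(vals)[::-1]       # largest power ratio first
    return vecs[:, order], vals[order]
```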

Speaker Adaptation Based on System Combination Using Speaker-Class Models
Tetsuo Kosaka (Yamagata University)
Takashi Ito (Yamagata University)
Masaharu Kato (Yamagata University)
Masaki Kohda (Yamagata University)
In this paper, we propose a new system combination approach for an LVCSR system using speaker-class (SC) models and a speaker adaptation technique based on these SC models. The basic concept of the SC-based system is to select speakers who are acoustically similar to a target speaker to train acoustic models. One of the major problems regarding the use of the SC model is determining the selection range of the speakers; in other words, it is difficult to determine the number of speakers that should be selected. In order to solve this problem, several SC models, each trained on a different number of speakers, are prepared in advance. In the recognition step, acoustically similar models are selected from among these SC models, and the scores obtained from them are merged using a word-graph combination technique. The proposed method was evaluated using the Corpus of Spontaneous Japanese (CSJ) and showed significant improvement on a lecture speech recognition task.

Feature versus Model Based Noise Robustness
Kris Demuynck (Katholieke Universiteit Leuven,  dept. ESAT)
Xueru Zhang (Katholieke Universiteit Leuven,  dept. ESAT)
Dirk Van Compernolle (Katholieke Universiteit Leuven,  dept. ESAT)
Hugo Van hamme (Katholieke Universiteit Leuven,  dept. ESAT)
Over the years, the focus in noise robust speech recognition has shifted from noise robust features to model based techniques such as parallel model combination and uncertainty decoding. In this paper, we contrast prime examples of both approaches in the context of large vocabulary recognition systems such as those used for automatic audio indexing and transcription. We look at the approximations the techniques require to keep the computational load reasonable, the resulting computational cost, and the accuracy measured on the Aurora4 benchmark. The results show that a well-designed feature based scheme is capable of providing recognition accuracies at least as good as the model based approaches at a substantially lower computational cost.

The role of higher-level linguistic features in HMM-based speech synthesis
Oliver Watts (Centre for Speech Technology Research, University of Edinburgh, UK)
Junichi Yamagishi (Centre for Speech Technology Research, University of Edinburgh, UK)
Simon King (Centre for Speech Technology Research, University of Edinburgh, UK)
We analyse the contribution of higher-level elements of the linguistic specification of a data-driven speech synthesiser to the naturalness of the synthetic speech which it generates. The system is trained using various subsets of the full feature-set, in which features relating to syntactic category, intonational phrase boundary, pitch accent and boundary tones are selectively removed. Utterances synthesised by the different configurations of the system are then compared in a subjective evaluation of their naturalness. The work presented forms background analysis for an on-going set of experiments in performing text-to-speech (TTS) conversion based on shallow features: features that can be trivially extracted from text. By building a range of systems, each assuming the availability of a different level of linguistic annotation, we obtain benchmarks for our on-going work.

Latent Perceptual Mapping: A New Acoustic Modeling Framework for Speech Recognition
Shiva Sundaram (Deutsche Telekom Laboratories, Ernst-Reuter-Platz 7, 10587 Berlin, Germany)
Jerome Bellegarda (Apple Inc., 3 Infinite Loop, Cupertino, California 95014, USA)
While hidden Markov modeling is still the dominant paradigm for speech recognition, in recent years there has been renewed interest in alternative, template-like approaches to acoustic modeling. Such methods sidestep usual HMM limitations as well as inherent issues with parametric statistical distributions, though typically at the expense of large amounts of memory and computing power. This paper introduces a new framework, dubbed latent perceptual mapping, which naturally leverages a reduced dimensionality description of the observations. This allows for a viable parsimonious template-like solution where models are closely aligned with perceived acoustic events. Context-independent phoneme classification experiments conducted on the TIMIT database suggest that latent perceptual mapping achieves results comparable to conventional acoustic modeling but at potentially significant savings in online costs.

State-based labelling for a sparse representation of speech and its application to robust speech recognition
Tuomas Virtanen (Department of Signal Processing, Tampere University of Technology, Finland)
Jort F. Gemmeke (Centre for Language and Speech Technology, Radboud University Nijmegen, The Netherlands)
Antti Hurmalainen (Department of Signal Processing, Tampere University of Technology, Finland)
This paper proposes a state-based labelling for acoustic patterns of speech and a method for using this labelling in noise-robust automatic speech recognition. Acoustic time-frequency segments of speech, exemplars, are obtained from a training database and associated with time-varying state labels using the transcriptions. In the recognition phase, noisy speech is modeled by a sparse linear combination of noise and speech exemplars. The likelihoods of states are obtained by a linear combination of the exemplar weights, which can then be used to estimate the most likely state transition path. The proposed method was tested on the connected digit recognition task with noisy speech material from the Aurora-2 database, where it is shown to produce better results than the existing histogram-based labelling method.
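
The mapping from exemplar activations to state likelihoods is essentially a matrix-vector product. A simplified single-frame sketch (my own simplification; in the paper exemplars span multiple frames and carry time-varying state labels):

```python
import numpy as np

def state_likelihoods(speech_weights, exemplar_state_labels):
    """Turn exemplar activations into per-state scores.
    speech_weights: (N_exemplars,) non-negative activations of the speech
        exemplars, from a sparse decomposition of the noisy spectrogram.
    exemplar_state_labels: (N_states, N_exemplars) matrix whose column j
        encodes which state(s) exemplar j was associated with in training."""
    scores = exemplar_state_labels @ speech_weights
    return scores / (scores.sum() + 1e-12)     # normalize to a distribution
```

The resulting per-frame state scores can then be fed to a Viterbi-style search for the most likely state path, as described in the abstract.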

Single-channel speech enhancement using Kalman filtering in the modulation domain
Stephen So (Signal Processing Laboratory, Griffith University)
Kamil K. Wojcicki (Signal Processing Laboratory, Griffith University)
Kuldip K. Paliwal (Signal Processing Laboratory, Griffith University)
In this paper, we propose the modulation-domain Kalman filter (MDKF) for speech enhancement. In contrast to previous modulation domain-enhancement methods based on bandpass filtering, the MDKF is an adaptive and linear MMSE estimator that uses models of the temporal changes of the magnitude spectrum for both speech and noise. Also, because the Kalman filter is a joint magnitude and phase spectrum estimator, under non-stationarity assumptions, it is highly suited for modulation-domain processing, as modulation phase tends to contain more speech information than acoustic phase. Experimental results from the NOIZEUS corpus show the ideal MDKF (with clean speech parameters) to outperform all the acoustic and time-domain enhancement methods that were evaluated, including the conventional time-domain Kalman filter with clean speech parameters. A practical MDKF that uses the MMSE-STSA method to enhance noisy speech in the acoustic domain prior to LPC analysis was also evaluated and showed promising results.
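
To make the modulation-domain Kalman filter idea concrete: each acoustic-frequency bin's magnitude trajectory across frames can be treated as a noisy observation of a clean trajectory that follows an AR model, and filtered frame by frame. A minimal sketch under those assumptions (parameter estimation, subband handling, and phase are all omitted; names are mine, not the paper's):

```python
import numpy as np

def kalman_filter_track(y, a, q_var, r_var):
    """Kalman-filter one magnitude trajectory y[t] (a single frequency bin
    across frames), assuming an AR(p) model of the clean trajectory:
        s[t] = a[0]*s[t-1] + ... + a[p-1]*s[t-p] + w[t],   y[t] = s[t] + v[t]
    with w ~ N(0, q_var) and v ~ N(0, r_var)."""
    a = np.asarray(a, dtype=float)
    p = len(a)
    F = np.zeros((p, p)); F[0, :] = a; F[1:, :-1] = np.eye(p - 1)
    H = np.zeros((1, p)); H[0, 0] = 1.0
    Q = np.zeros((p, p)); Q[0, 0] = q_var
    x, P = np.zeros(p), np.eye(p)
    est = np.empty(len(y))
    for t, yt in enumerate(y):
        x = F @ x                                # predict
        P = F @ P @ F.T + Q
        S = float(H @ P @ H.T) + r_var           # innovation variance
        K = (P @ H.T) / S                        # Kalman gain, shape (p, 1)
        x = x + K[:, 0] * (yt - x[0])            # update with innovation
        P = (np.eye(p) - K @ H) @ P
        est[t] = x[0]
    return est
```

For example, est = kalman_filter_track(noisy_bin, a=[1.3, -0.4], q_var=0.01, r_var=0.1) filters one bin under an assumed AR(2) clean-speech model; in the actual MDKF the AR parameters are estimated from speech and noise models rather than fixed.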

Metric Subspace Indexing for Fast Spoken Term Detection
Taisuke Kaneko (Toyohashi University of Technology)
Tomoyosi Akiba (Toyohashi University of Technology)
In this paper, we propose a novel indexing method for Spoken Term Detection (STD). The proposed method can be considered as using metric space indexing for the approximate string-matching problem, where the distance between a phoneme and a position in the target spoken document is defined. The proposed method does not require the use of thresholds to limit the output, instead being able to output the results in increasing order of distance. It can also deal easily with the multiple candidates obtained via Automatic Speech Recognition (ASR). The results of preliminary experiments show promise for achieving fast STD.

Discriminative Language Modeling Using Simulated ASR Errors
Preethi Jyothi (Department of Computer Science and Engineering, The Ohio State University, USA)
Eric Fosler-Lussier (Department of Computer Science and Engineering, The Ohio State University, USA)
In this paper, we approach the problem of discriminatively training language models using a weighted finite state transducer (WFST) framework that does not require acoustic training data. The phonetic confusions prevalent in the recognizer are modeled using a confusion matrix that takes into account information from the pronunciation model (word-based phone confusion log likelihoods) and information from the acoustic model (distances between the phonetic acoustic models). This confusion matrix, within the WFST framework, is used to generate confusable word graphs that serve as inputs to the averaged perceptron algorithm to train the parameters of the discriminative language model. Experiments on a large vocabulary speech recognition task show significant word error rate reductions when compared to a baseline using a trigram model trained with the maximum likelihood criterion.
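
The training loop behind the averaged perceptron part is fairly simple; below is a sketch that assumes the confusable word graphs and a decoder are already available (ngram_features, decode, and the data layout are all placeholders of mine, not the paper's code):

```python
from collections import defaultdict

def ngram_features(words, n=2):
    """Count unigram and bigram features of a word sequence."""
    feats = defaultdict(int)
    for i in range(len(words)):
        for k in range(1, n + 1):
            if i + k <= len(words):
                feats[tuple(words[i:i + k])] += 1
    return feats

def perceptron_train(data, decode, n_epochs=5):
    """Averaged-perceptron training of a discriminative LM.
    data: list of (confusion_graph, reference_words) pairs, where the graphs
          are generated from the confusion matrix described above.
    decode: function(graph, weights) -> best word sequence under the weights."""
    w = defaultdict(float)
    w_sum = defaultdict(float)
    t = 0
    for _ in range(n_epochs):
        for graph, ref in data:
            hyp = decode(graph, w)
            if hyp != ref:
                for f, c in ngram_features(ref).items():
                    w[f] += c                    # promote reference n-grams
                for f, c in ngram_features(hyp).items():
                    w[f] -= c                    # demote confusable n-grams
            for f, v in w.items():
                w_sum[f] += v
            t += 1
    return {f: v / t for f, v in w_sum.items()}  # averaged weights
```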

Learning a Language Model from Continuous Speech
Graham Neubig (Graduate School of Informatics, Kyoto University)
Masato Mimura (Graduate School of Informatics, Kyoto University)
Shinsuke Mori (Graduate School of Informatics, Kyoto University)
Tatsuya Kawahara (Graduate School of Informatics, Kyoto University)
This paper presents a new approach to language model construction, learning a language model not from text, but directly from continuous speech. A phoneme lattice is created using acoustic model scores, and Bayesian techniques are used to robustly learn a language model from this noisy input. A novel sampling technique is devised that allows for the integrated learning of word boundaries and an n-gram language model with no prior linguistic knowledge. The proposed techniques were used to learn a language model directly from continuous, potentially large-vocabulary speech. This language model was able to significantly reduce the ASR phoneme error rate over a separate set of test data, and the proposed lattice processing and lexical acquisition techniques were found to be important factors in this improvement.

New Insights into Subspace Noise Tracking
Mahdi Triki (Philips Research Laboratories)

Posted via email from Troy's posterous
