Thursday, January 27, 2011

[Conference] PASCAL CHiME

PASCAL CHiME Speech Separation and Recognition Challenge

                 Deadline: April 14, 2011
       Workshop: September 1, 2011, Florence, Italy

Posted via email from Troy's posterous

Wednesday, January 26, 2011

[Speech] FFT algorithm

The FFT implemented in HTK is the decimation-in-time (DIT) FFT algorithm, which is explained in detail in the attached file. The computation can be represented by the following flow graph (8-point DFT):

A minor difference is that the HTK implementation uses Wn = exp(j * 2 * pi / n), whereas the commonly adopted definition is Wn = exp(-j * 2 * pi / n).
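To make the two sign conventions concrete, here is a minimal recursive radix-2 DIT FFT in Python. It is only a sketch and not HTK's actual code: the function name `fft_dit` and the `sign` parameter are made up for illustration, and the input length is assumed to be a power of two.

```python
import cmath


def fft_dit(x, sign=-1):
    """Recursive radix-2 decimation-in-time FFT (illustrative sketch).

    sign=-1 uses the common twiddle factor Wn = exp(-j*2*pi/n);
    sign=+1 uses Wn = exp(+j*2*pi/n), the convention described above for HTK.
    len(x) must be a power of two.
    """
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft_dit(x[0::2], sign)  # DFT of the even-indexed samples
    odd = fft_dit(x[1::2], sign)   # DFT of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)  # twiddle factor Wn^k
        out[k] = even[k] + w * odd[k]                # butterfly, upper output
        out[k + n // 2] = even[k] - w * odd[k]       # butterfly, lower output
    return out
```

For a real-valued input, the two conventions give spectra that are complex conjugates of each other, so the magnitude spectrum is the same either way.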

The above FFT algorithm only deals with complex numbers. To compute the DFT of a real-valued signal sequence we need a real version of the FFT, which can be built on top of the complex FFT. The second document, from Texas Instruments, gives a fast implementation of the FFT for real-valued sequences (on pages 12-19). The algorithm is illustrated briefly in the following figure:
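One standard way to reuse a complex FFT for real input is to pack the even and odd samples into the real and imaginary parts of a half-length complex sequence, then separate the two spectra afterwards. The Python sketch below shows that packing trick. It is not copied from the Texas Instruments document, and the names `fft` and `real_fft` are hypothetical. The input length is assumed to be a power of two (at least 2).

```python
import cmath


def fft(x):
    """Radix-2 DIT complex FFT with Wn = exp(-j*2*pi/n); power-of-two length."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    half = n // 2
    tw = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(half)]
    return [even[k] + tw[k] for k in range(half)] + \
           [even[k] - tw[k] for k in range(half)]


def real_fft(x):
    """DFT of a real sequence of length N using one N/2-point complex FFT.

    Returns bins 0..N/2; the remaining bins follow from X[N-k] = conj(X[k]).
    """
    n = len(x)
    m = n // 2
    # Pack even samples into the real part, odd samples into the imaginary part.
    z = [complex(x[2 * i], x[2 * i + 1]) for i in range(m)]
    Z = fft(z)
    out = []
    for k in range(m):
        zc = Z[(m - k) % m].conjugate()
        xe = (Z[k] + zc) / 2   # spectrum of the even-indexed samples
        xo = (Z[k] - zc) / 2j  # spectrum of the odd-indexed samples
        out.append(xe + cmath.exp(-2j * cmath.pi * k / n) * xo)
    # Nyquist bin k = N/2: the twiddle factor is -1 there.
    out.append(complex(Z[0].real - Z[0].imag))
    return out
```

The saving comes from running a single complex FFT of length N/2 instead of a length-N complex FFT whose imaginary inputs are all zero.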

Thursday, January 20, 2011

[Speech] Front end analysis of speech recognition: A review

[Machine Learning] GTM: The Generative Topographic Mapping

[Misc] Command Line Keyboard Shortcuts for Mac OS X


The command line in Mac OS X can be a very powerful and fun tool, so it’s good to know how to maneuver around if you find yourself in it. By default, the Mac OS X Terminal uses the Bash shell, which is what these keyboard shortcuts are intended for. So if you’re ready to get your feet wet, open up the Terminal and try these shortcuts out; they’re sure to make your command line life easier. The list isn’t too long, so you should be able to try all of them within a minute or two. Have fun:

Ctrl + A   Go to the beginning of the line you are currently typing on
Ctrl + E   Go to the end of the line you are currently typing on
Ctrl + L   Clears the screen, similar to the clear command
Ctrl + U   Clears the line before the cursor position. If you are at the end of the line, clears the entire line.
Ctrl + H   Same as backspace
Ctrl + R   Lets you search through previously used commands
Ctrl + C   Kill whatever you are running
Ctrl + D   Exit the current shell
Ctrl + Z   Puts whatever you are running into a suspended background process. fg restores it.
Ctrl + W   Delete the word before the cursor
Ctrl + K   Clear the line after the cursor
Ctrl + T   Swap the last two characters before the cursor
Esc + T    Swap the last two words before the cursor

Wednesday, January 19, 2011

[Tool] HMM Toolkit STK from Speech@FIT

[Speech] Hierarchical structures of neural networks for phoneme recognition

Four neural-network-based phoneme recognition systems are investigated:
a) the TRAPs system (Fig. 1a) - separate networks for processing of speech in frequency bands;
b) the split temporal context (STC) system (Fig. 1b) - separate networks for processing of blocks of spectral vectors;
c) a combination of both (Fig. 1c) - split in both frequency and time;
d) a tandem of two networks - the front-end network is trained in the classical way, and the back-end network is trained on the combination of the front-end's posteriors and the original features.

The assumptions for those systems are:
a) Independent processing of speech in critical bands;
b) Independent processing of parts of phonemes;
c) both a) and b).
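The different ways of splitting the input can be sketched in a few lines of Python. This shows only the data slicing, not the networks themselves. The function names are hypothetical, and the even, non-overlapping time split is an assumption: the paper may define block boundaries and windowing differently.

```python
def band_trajectories(block):
    """TRAPs-style split: one temporal trajectory per frequency band.

    `block` is a list of frames, each frame a list of band energies.
    """
    return [list(traj) for traj in zip(*block)]


def time_blocks(block, n_blocks):
    """STC-style split: cut the block into contiguous groups of frames.

    Assumption: blocks do not overlap and are as equal in size as possible.
    """
    n = len(block)
    bounds = [(i * n) // n_blocks for i in range(n_blocks + 1)]
    return [block[bounds[i]:bounds[i + 1]] for i in range(n_blocks)]


def time_and_band_split(block, n_blocks):
    """Combined split: band trajectories inside each time block."""
    return [band_trajectories(part) for part in time_blocks(block, n_blocks)]


# Toy block: 4 frames x 3 bands.
block = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
per_band = band_trajectories(block)      # input for per-band networks
per_time = time_blocks(block, 2)         # input for per-time-block networks
per_both = time_and_band_split(block, 2) # split in both frequency and time
```

Each resulting piece would be fed to its own network, which is where the independence assumptions listed above come in.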

Phoneme strings are a basic representation for automatic language recognition, and language recognition results have been shown to be highly correlated with phoneme recognition results. Phoneme posteriors are a useful representation for acoustic keyword search: they contain enough information to distinguish among all words, and they are small enough to store compared with, for example, the posteriors from context-dependent Gaussian mixture models.

Two ways to provide additional information for NN training:
i) windowing: a multi-frame context window, with a Hamming window to emphasize the central frame;
ii) output representation: some improvements have been observed when a net was trained for multiple tasks at the same time.

A special phoneme set mapping is adopted in this paper: closures are merged with the following burst instead of with silence (bcl b -> b, not bcl b -> pau b). This mapping is believed to be more appropriate for features that use a longer temporal context.
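As a rough illustration, the closure merging can be applied to a label sequence as below. Only the bcl b -> b case comes from the notes above. The full closure table assumes TIMIT-style labels, the function name is hypothetical, and leaving a closure untouched when it is not followed by its own burst is an assumption.

```python
# Assumed TIMIT-style closure labels and their matching bursts.
CLOSURE_TO_BURST = {
    "bcl": "b", "dcl": "d", "gcl": "g",
    "pcl": "p", "tcl": "t", "kcl": "k",
}


def merge_closures(labels):
    """Merge each closure with the burst that follows it (bcl b -> b)."""
    merged = []
    i = 0
    while i < len(labels):
        burst = CLOSURE_TO_BURST.get(labels[i])
        if burst is not None and i + 1 < len(labels) and labels[i + 1] == burst:
            merged.append(burst)  # closure + burst become one phoneme
            i += 2
        else:
            merged.append(labels[i])  # anything else is kept as-is (assumption)
            i += 1
    return merged


print(merge_closures(["pau", "bcl", "b", "iy", "tcl", "ch"]))
# prints ['pau', 'b', 'iy', 'tcl', 'ch']
```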

The number of neurons in the hidden layer of the neural networks was increased until the phoneme error rate (PER) saturated. The resulting hidden layer size was approximately 500 neurons.

Table 1 shows the superiority of long-context Mel-bank energies, and also the large improvement that comes from the three-state model. (A block of 31 vectors of Mel-bank energies (MBE) spans 310 ms. The temporal trajectories in each band were weighted by a Hamming window and down-sampled by a DCT to 11 coefficients.)
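The per-band processing described in the parentheses can be sketched as follows: weight a 31-frame trajectory with a Hamming window, then keep the first 11 DCT coefficients. This is only a sketch. The unnormalized DCT-II and the function names are assumptions, since the notes do not say which DCT variant or scaling the paper uses.

```python
import math


def hamming(n):
    """Symmetric Hamming window of length n (peak of 1.0 at the centre)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def dct_ii(x, n_coeffs):
    """First n_coeffs coefficients of an unnormalized DCT-II of x."""
    n = len(x)
    return [
        sum(x[t] * math.cos(math.pi * k * (t + 0.5) / n) for t in range(n))
        for k in range(n_coeffs)
    ]


def compress_trajectory(trajectory, n_coeffs=11):
    """Hamming-weight one band's temporal trajectory, then reduce it by DCT."""
    window = hamming(len(trajectory))
    weighted = [v * w for v, w in zip(trajectory, window)]
    return dct_ii(weighted, n_coeffs)


# One band over 31 frames (constant toy values) -> 11 coefficients.
coeffs = compress_trajectory([1.0] * 31)
```

With a 31-frame block this turns 31 values per band into 11. The window makes the central frame contribute most, and the DCT truncation keeps only the slowly varying part of the trajectory.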

The best PER reported in this paper is obtained with the 5-block STC system with a bigram LM, as shown in the following table:

Monday, January 17, 2011

[Machine Learning] Machine Learning Summer School 2008 - Kioloa

Monte Carlo Simulation for Statistical Inference, Model Selection and Decision Making

[Tool] TexPoint - A latex plugin for Microsoft Office

[Speech] Mixture Density Network (Technical Report)

[Speech] A trajectory density mixture network for acoustic articulatory inversion mapping

Friday, January 7, 2011

Wednesday, January 5, 2011

Matlab v7.3 mat file and python


It looks like MATLAB version 7.3 and later can write objects in the so-called MATLAB 7.3 file format. At first glance it looks like another proprietary format, but it appears to be in fact the Hierarchical Data Format version 5, or HDF5 for short.

So you can do all sorts of neat things:

  1. Let's create a matrix in MATLAB first and save it:

    >> x=[[1,2,3];[4,5,6];[7,8,9]]
    x =
         1     2     3
         4     5     6
         7     8     9
    >> save -v7.3 x.mat x

  2. Let's investigate that file from the shell:

    $ h5ls x.mat
    x                        Dataset {3, 3}
    $ h5dump x.mat
    HDF5 "x.mat" {
    GROUP "/" {
       DATASET "x" {
          DATATYPE  H5T_IEEE_F64LE
          DATASPACE  SIMPLE { ( 3, 3 ) / ( 3, 3 ) }
          DATA {
          (0,0): 1, 4, 7,
          (1,0): 2, 5, 8,
          (2,0): 3, 6, 9
          }
          ATTRIBUTE "MATLAB_class" {
             DATATYPE  H5T_STRING {
                STRSIZE 6;
                STRPAD H5T_STR_NULLTERM;
                CSET H5T_CSET_ASCII;
                CTYPE H5T_C_S1;
             }
             DATASPACE  SCALAR
             DATA {
             (0): "double"
             }
          }
       }
    }
    }

  3. And load it from Python:

    >>> import h5py
    >>> import numpy
    >>> f = h5py.File('x.mat', 'r')
    >>> x = f["x"]
    >>> x
    <HDF5 dataset "x": shape (3, 3), type "<f8">
    >>> numpy.array(x)
    array([[ 1., 4., 7.],
           [ 2., 5., 8.],
           [ 3., 6., 9.]])

So using MATLAB's 7.3 format actually seems to be a good idea for interoperability.
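One thing to watch for: the array loaded in Python is the transpose of the MATLAB matrix. MATLAB stores matrices column by column, while HDF5 and NumPy read the same buffer row by row. The pure-Python sketch below reproduces the effect without needing h5py; the helper names are made up for illustration. With h5py, the practical fix would be to transpose after loading, for example `numpy.array(x).T`.

```python
def as_row_major(flat, rows, cols):
    """Interpret a flat buffer row by row (HDF5 / NumPy default)."""
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


def as_col_major(flat, rows, cols):
    """Interpret a flat buffer column by column (MATLAB storage order)."""
    return [[flat[c * rows + r] for c in range(cols)] for r in range(rows)]


def transpose(matrix):
    """Swap rows and columns."""
    return [list(col) for col in zip(*matrix)]


# MATLAB's x = [1 2 3; 4 5 6; 7 8 9] laid out in storage order.
buffer = [1, 4, 7, 2, 5, 8, 3, 6, 9]

seen_by_python = as_row_major(buffer, 3, 3)  # [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
seen_by_matlab = as_col_major(buffer, 3, 3)  # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
recovered = transpose(seen_by_python)        # same as seen_by_matlab
```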
