## Thursday, January 27, 2011

### [Conference] PASCAL CHiME

PASCAL CHiME Speech Separation and Recognition Challenge

Workshop: September 1, 2011, Florence, Italy

## Wednesday, January 26, 2011

### [Speech] FFT algorithm

The FFT algorithm implemented in HTK is the decimation-in-time (DIT) FFT algorithm, explained in detail in the attached file. The computation can be represented by the following flow graph (8-point DFT):

A minor difference is that the HTK implementation uses Wn = exp(j * 2 * pi / n), while the more common convention is Wn = exp(-j * 2 * pi / n).
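To make the convention difference concrete, here is a minimal recursive DIT FFT sketch (not the HTK code itself) with the twiddle sign as a parameter, so either convention can be selected:

```python
import cmath

def fft_dit(x, sign=-1):
    """Radix-2 decimation-in-time FFT (recursive sketch).

    sign=-1 uses the common convention Wn = exp(-j * 2 * pi / n);
    sign=+1 corresponds to the HTK-style convention described above.
    Input length must be a power of two.
    """
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    # Split into even- and odd-indexed subsequences (the "decimation in time").
    even = fft_dit(x[0::2], sign)
    odd = fft_dit(x[1::2], sign)
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)  # twiddle factor
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]       # butterfly
    return out
```

The two sign choices give complex-conjugate spectra for real input, which is why the difference is usually harmless as long as it is applied consistently.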

The above FFT algorithm only deals with complex numbers. To compute the DFT of a real-valued signal sequence we need a real version of the FFT, which can be built on top of the complex FFT. The second document, from Texas Instruments, gives a fast FFT implementation for real-valued sequences (on pages 12-19). The algorithm is illustrated briefly in the following figure:

ul.pdf (362 KB)
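The real-FFT trick described above can be sketched as follows: pack the even-indexed samples into the real part and the odd-indexed samples into the imaginary part of a length-N/2 complex sequence, run one N/2-point complex FFT, then untangle the two half-spectra. This is my own NumPy sketch of the standard packing algorithm, checked against `numpy.fft.rfft`; it is in the same spirit as the TI app-note version, though the exact split equations there may be written differently.

```python
import numpy as np

def rfft_via_half_size(x):
    """Real-input FFT of even length n using one n/2-point complex FFT."""
    x = np.asarray(x, dtype=float)
    n = x.size
    assert n % 2 == 0, "length must be even for the packing trick"
    # Pack: even samples -> real part, odd samples -> imaginary part.
    z = x[0::2] + 1j * x[1::2]
    Z = np.fft.fft(z)                      # one complex FFT of length n/2
    half = n // 2
    k = np.arange(half)
    Zc = np.conj(Z[(-k) % half])           # Z[(n/2 - k) mod n/2], conjugated
    Fe = 0.5 * (Z + Zc)                    # DFT of the even-indexed samples
    Fo = -0.5j * (Z - Zc)                  # DFT of the odd-indexed samples
    X = np.empty(half + 1, dtype=complex)
    X[:half] = Fe + np.exp(-2j * np.pi * k / n) * Fo
    X[half] = Fe[0] - Fo[0]                # Nyquist bin
    return X
```

The payoff is one complex FFT of half the length instead of a full-length complex FFT with the imaginary inputs zeroed.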

## Thursday, January 20, 2011

### [Misc] Command Line Keyboard Shortcuts for Mac OS X

The command line in Mac OS X can be a very powerful and fun tool, so it’s good to know how to maneuver around if you find yourself in it. By default, the Mac OS X Terminal uses the Bash shell, which is what these keyboard shortcuts are intended for. If you’re ready to get your feet wet, open up the Terminal and try these shortcuts out; they’re sure to make your command-line life easier. The list isn’t too long, so you should be able to try them all within a minute or two:

- `Ctrl + A`: Go to the beginning of the line you are currently typing on
- `Ctrl + E`: Go to the end of the line you are currently typing on
- `Ctrl + L`: Clears the screen, similar to the `clear` command
- `Ctrl + U`: Clears the line before the cursor position. If you are at the end of the line, clears the entire line.
- `Ctrl + H`: Same as backspace
- `Ctrl + R`: Lets you search through previously used commands
- `Ctrl + C`: Kill whatever you are running
- `Ctrl + D`: Exit the current shell
- `Ctrl + Z`: Puts whatever you are running into a suspended background process; `fg` restores it
- `Ctrl + W`: Delete the word before the cursor
- `Ctrl + K`: Clear the line after the cursor
- `Ctrl + T`: Swap the last two characters before the cursor
- `Esc + T`: Swap the last two words before the cursor

## Wednesday, January 19, 2011

### [Tool] HMM Toolkit STK from Speech@FIT

http://www.fit.vutbr.cz/research/groups/speech/sw/stk.html

Simple example scripts for MMI, MPE training.

### [Speech] Hierarchical structures of neural networks for phoneme recognition

Four neural-network-based phoneme recognition systems are investigated:
a) the TRAPs system (Fig. 1a): separate networks for processing of speech in frequency bands;
b) the split temporal context (STC) system (Fig. 1b): separate networks for processing of blocks of spectral vectors;
c) a combination of both (Fig. 1c): split in both frequency and time;
d) a tandem of two networks: the front-end network is trained in the classical way, and the back-end is trained on the combination of the front-end's posteriors and the original features.

The assumptions for those systems are:
a) Independent processing of speech in critical bands;
b) Independent processing of parts of phonemes;
c) both a) and b).

Phoneme strings are a basic representation for automatic language recognition, and language recognition results have been shown to be highly correlated with phoneme recognition results. Phoneme posteriors are a useful representation for acoustic keyword search: they contain enough information to distinguish among all words, and they are small enough to store compared, for example, to posteriors from context-dependent Gaussian mixture models.

Two ways to provide additional information for NN training:
i) windowing: a multi-frame context window, with a Hamming window to emphasize the central frame;
ii) output representation: some improvements have been observed when a net was trained on multiple tasks at the same time.

A special phoneme-set mapping adopted in this paper is that closures are merged with the following burst instead of with silence (bcl b -> b rather than bcl b -> pau b). This mapping is believed to be more appropriate for features that use a longer temporal context.

The number of neurons in the hidden layer was increased until the phoneme error rate (PER) saturated; the resulting number of hidden-layer neurons was approximately 500.

Table 1 shows the superiority of long mel-bank energy trajectories, but also a large improvement coming from the three-state model. (A block of 31 vectors of mel-bank energies (MBE) = 310 ms; the temporal trajectory in each band was weighted by a Hamming window and down-sampled by a DCT to 11 coefficients.)
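The trajectory preprocessing described in that parenthesis can be sketched in NumPy. The function name, shapes, and edge-padding choice below are my own illustrative assumptions, not the paper's code; only the 31-frame context, Hamming weighting, and truncation to 11 DCT coefficients come from the text above.

```python
import numpy as np

def temporal_dct_features(mbe, context=31, n_coefs=11):
    """Sketch: per-band temporal trajectories, Hamming-weighted,
    down-sampled by a truncated DCT-II.

    mbe: (n_frames, n_bands) mel-bank energies.
    Returns (n_frames, n_bands, n_coefs). Edge padding at the
    utterance boundaries is an assumption.
    """
    n_frames, n_bands = mbe.shape
    half = context // 2
    ham = np.hamming(context)
    # Truncated DCT-II basis: n_coefs rows, context columns.
    n = np.arange(context)
    basis = np.cos(np.pi * np.outer(np.arange(n_coefs), 2 * n + 1) / (2 * context))
    padded = np.pad(mbe, ((half, half), (0, 0)), mode="edge")
    feats = np.empty((n_frames, n_bands, n_coefs))
    for t in range(n_frames):
        block = padded[t:t + context] * ham[:, None]  # (context, n_bands)
        feats[t] = (basis @ block).T                  # (n_bands, n_coefs)
    return feats
```

Each frame thus contributes `n_bands * 11` coefficients to the network input instead of `n_bands * 31` raw trajectory samples.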

The best final PER reported in this paper uses the 5-block STC system with a bigram LM, as shown in the following table:

## Friday, January 7, 2011

### [DBN] Learning Multiple Layers of Features from Tiny Images

In this paper, there are more detailed equation derivations for RBM, especially the Gaussian-Bernoulli RBM.
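For quick reference, the Gaussian-Bernoulli RBM is usually written with the energy function below. This is the standard parameterization from memory, which I believe matches the paper up to sign/scaling conventions; check the derivations there for the exact form used:

```latex
E(\mathbf{v},\mathbf{h})
  = \sum_i \frac{(v_i - b_i)^2}{2\sigma_i^2}
  - \sum_j c_j h_j
  - \sum_{i,j} \frac{v_i}{\sigma_i}\, w_{ij}\, h_j
```

with the resulting conditionals

```latex
p(h_j = 1 \mid \mathbf{v})
  = \operatorname{sigm}\!\Big(c_j + \sum_i \frac{v_i}{\sigma_i} w_{ij}\Big),
\qquad
p(v_i \mid \mathbf{h})
  = \mathcal{N}\!\Big(v_i \;\Big|\; b_i + \sigma_i \sum_j w_{ij} h_j,\; \sigma_i^2\Big),
```

i.e. Gaussian visible units with per-unit variances and the usual Bernoulli hidden units.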

It looks like MATLAB version 7.3 and later are capable of writing out objects in the so-called MATLAB 7.3 file format. While at first glance it looks like another proprietary format, it is in fact the Hierarchical Data Format version 5, or HDF5 for short.

So you can do all sorts of neat things:

1. Let's create a matrix in MATLAB first and save it:

    ```matlab
    >> x = [[1,2,3];[4,5,6];[7,8,9]]

    x =

         1     2     3
         4     5     6
         7     8     9

    >> save -v7.3 x.mat x
    ```
2. Let's investigate that file from the shell:

    ```
    $ h5ls x.mat
    x                        Dataset {3, 3}
    $ h5dump x.mat
    HDF5 "x.mat" {
    GROUP "/" {
       DATASET "x" {
          DATATYPE  H5T_IEEE_F64LE
          DATASPACE  SIMPLE { ( 3, 3 ) / ( 3, 3 ) }
          DATA {
          (0,0): 1, 4, 7,
          (1,0): 2, 5, 8,
          (2,0): 3, 6, 9
          }
          ATTRIBUTE "MATLAB_class" {
             DATATYPE  H5T_STRING {
                STRSIZE 6;
                STRPAD H5T_STR_NULLTERM;
                CSET H5T_CSET_ASCII;
                CTYPE H5T_C_S1;
             }
             DATASPACE  SCALAR
             DATA {
             (0): "double"
             }
          }
       }
    }
    }
    ```
3. And load it from Python:

    ```python
    >>> import h5py
    >>> import numpy
    >>> f = h5py.File('x.mat')
    >>> x = f["x"]
    >>> x
    <HDF5 dataset "x": shape (3, 3), type "<f8">
    >>> numpy.array(x)
    array([[ 1.,  4.,  7.],
           [ 2.,  5.,  8.],
           [ 3.,  6.,  9.]])
    ```
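One thing worth noting in the Python output: the matrix comes back transposed relative to what was typed in MATLAB, because MATLAB stores arrays column-major while NumPy reads the HDF5 dataset row-major. A transpose on the Python side recovers the MATLAB layout. The snippet below demonstrates this with a synthetic stand-in for `x.mat` (writing the transposed data with h5py, as MATLAB's `-v7.3` writer effectively does on disk):

```python
import numpy as np
import h5py

# The matrix as typed in MATLAB.
matlab_x = np.array([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]])

# Mimic what MATLAB -v7.3 stores: the column-major layout appears
# transposed when interpreted row-major.
with h5py.File("x_demo.mat", "w") as f:
    f["x"] = matlab_x.T

# On the Python side, transpose back to recover the MATLAB orientation.
with h5py.File("x_demo.mat", "r") as f:
    x = np.array(f["x"]).T
```

So when exchanging matrices this way, a single `.T` (or an equivalent `permute` in MATLAB for higher-dimensional arrays) is the only fix-up needed.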

So it actually seems to be a good idea to use MATLAB's 7.3 format for interoperability.