Dream & Passion: December 2009

Thursday, December 17, 2009

bad data or over pruning

HERest -C src/ConfigHVite -I lists/train.phonemlf -t 250.0 150.0 1000.0 -S train.mfcc.list -H hmm0/macros -H hmm0/hmmdefs -M hmm1 lists/monophones1
ERROR [-7324] StepBack: File ... bad data or over pruning

Possible causes include corrupt MFCC files and non-matching or non-existent labels. In this case, I had to recalculate the mean and variance for the prototype HMM using only half the data, and the problem went away. If every file is reported as bad data, the features were probably derived incorrectly; go back to HCopy and check the parameters in its config file.

From: http://www.ling.ohio-state.edu/~bromberg/htk_problems.html
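
Before re-estimating anything, it can help to rule out the two simplest causes named above: missing or empty feature files, and feature files with no label entry. The following is a minimal sketch, not part of HTK. It assumes the script file lists one feature path per line and that MLF entry headers are quoted lines ending in .lab or .rec; the function names are hypothetical.

```python
from pathlib import Path


def read_mlf_labels(mlf_path):
    """Return the base names that have an entry in an HTK master label file."""
    names = set()
    for line in Path(mlf_path).read_text().splitlines():
        line = line.strip()
        # Assumption: entry headers are quoted patterns such as "*/utt001.lab".
        if line.startswith('"') and line.endswith(('.lab"', '.rec"')):
            names.add(Path(line.strip('"')).stem)
    return names


def find_suspect_files(scp_path, mlf_path):
    """List (feature_file, reason) pairs for entries worth checking by hand."""
    labelled = read_mlf_labels(mlf_path)
    suspects = []
    for line in Path(scp_path).read_text().splitlines():
        feat = line.strip()
        if not feat:
            continue
        path = Path(feat)
        if not path.is_file():
            suspects.append((feat, "feature file missing"))
        elif path.stat().st_size <= 12:
            # An HTK parameter file starts with a 12-byte header,
            # so a file this small contains no frames.
            suspects.append((feat, "feature file has no frames"))
        elif path.stem not in labelled:
            suspects.append((feat, "no label entry in MLF"))
    return suspects
```

With the file names from the command above, a call would look like find_suspect_files("train.mfcc.list", "lists/train.phonemlf"). An empty result only rules out these basic mismatches; it does not prove the features themselves are sound.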

Tuesday, December 8, 2009

WSJCAM0

Some information about the WSJCAM0 corpus:
1. 140 speakers in total, with about 110 utterances per speaker.

2. 92 training speakers, 20 development test speakers and two sets of 14 evaluation test speakers. Each speaker provides approximately 90 utterances and an additional 18 adaptation utterances.

3. The same set of 18 adaptation sentences was recorded by each speaker, consisting of one recording of background noise, 2 phonetically balanced sentences and the first 15 adaptation sentences from the initial WSJ experiment.

4. Each training speaker read out some 90 training sentences, selected randomly in paragraph units. This is the empirically determined maximum number of sentences that could be squeezed into one hour of speaker time.

5. Each of the 48 test speakers read 80 sentences. The final development test group consists of 20 speakers.

The CD-ROM publication consists of six discs, with contents organized as follows:
discs 1 and 2 - training data from head-mounted microphone
disc 3 - development test data from head-mounted microphone, plus first set of evaluation test data
discs 4 and 5 - training data from desk-mounted microphone
disc 6 - development test data from desk-mounted microphone, plus second set of evaluation test data
There are 90 utterances from each of 92 speakers that are designated as training material for speech recognition algorithms. An additional 48 speakers each read 40 sentences containing only words from a fixed 5,000 word vocabulary, and another 40 sentences using a 64,000 word vocabulary, to be used as testing material. Each of the total of 140 speakers also recorded a common set of 18 adaptation sentences. Recordings were made from two microphones: a far-field desk microphone and a head-mounted close-talking microphone.

http://ccl.pku.edu.cn/doubtfire/CorpusLinguistics/LDC_Corpus/available_corpus_from_ldc.html#wsjcam0
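
The speaker and sentence counts quoted above can be cross-checked with a little arithmetic. This is only a sketch using the nominal figures from the description (the per-speaker training count is given as approximate), and the constant and function names are made up for illustration.

```python
# Nominal figures taken from the corpus description above.
TRAIN_SPEAKERS = 92
DEV_TEST_SPEAKERS = 20
EVAL_TEST_SPEAKERS = 2 * 14            # two evaluation sets of 14 speakers
TRAIN_SENTENCES_PER_SPEAKER = 90       # stated as "approximately 90"
TEST_SENTENCES_PER_SPEAKER = 40 + 40   # 5,000-word plus 64,000-word vocabulary
ADAPTATION_PER_SPEAKER = 18


def corpus_totals():
    """Derive speaker and utterance totals from the per-group figures."""
    test_speakers = DEV_TEST_SPEAKERS + EVAL_TEST_SPEAKERS
    speakers = TRAIN_SPEAKERS + test_speakers
    return {
        "test_speakers": test_speakers,
        "speakers": speakers,
        "training_utterances": TRAIN_SPEAKERS * TRAIN_SENTENCES_PER_SPEAKER,
        "test_utterances": test_speakers * TEST_SENTENCES_PER_SPEAKER,
        "adaptation_utterances": speakers * ADAPTATION_PER_SPEAKER,
    }


if __name__ == "__main__":
    for name, value in corpus_totals().items():
        print(f"{name}: {value}")
```

The group sizes add up to the 48 test speakers and 140 speakers mentioned in the text, which suggests the two descriptions are consistent with each other.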

Monday, December 7, 2009

Install CDT for Eclipse on Ubuntu

The package list has no eclipse-cdt package, but CDT can be installed from within Eclipse.

1. Open a terminal and start Eclipse as root: sudo eclipse
2. Go to Help -> Install New Software... -> Add
3. Add:
Name: galileo
Url: http://download.eclipse.org/tools/cdt/releases/galileo
and return.
4. Type "c" or "cdt" into the filter text box, then select the CDT Main Feature package from the listed packages.
5. Click to install.

Tuesday, December 1, 2009

wget: Download entire websites easily

wget is a nice tool for downloading resources from the internet. The basic usage is wget url:

wget http://linuxreviews.org/

Therefore, wget plus less is all you need to surf the internet. The power of wget is that you can download sites recursively, meaning you also get all pages (and images and other data) linked from the front page:

wget -r http://linuxreviews.org/

But many sites do not want you to download their entire site. To prevent this, they check how browsers identify themselves. Many sites refuse the connection or send a blank page if they detect that you are not using a web browser. You might get a message like:

Sorry, but the download manager you are using to view this site is not supported. We do not support use of such download managers as flashget, go!zilla, or getright

Wget has a very handy -U option for sites like this. Use -U My-browser to tell the site you are using some commonly accepted browser:

wget -r -p -U Mozilla http://www.stupidsite.com/restricedplace.html

The most important command line options are --limit-rate= and --wait=. You should add --wait=20 to pause 20 seconds between retrievals; this helps keep you from being manually added to a blacklist. --limit-rate is given in bytes per second by default; append K to set KB/s. Example:

wget --wait=20 --limit-rate=20K -r -p -U Mozilla http://www.stupidsite.com/restricedplace.html

A website owner will probably get upset if you attempt to download his entire site using a simple wget http://foo.bar command. However, the owner is far less likely to notice you if you limit the download transfer rate and pause between fetching files.
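
If the same polite settings are reused from a script, the command line can be assembled in one place. The sketch below only builds the argument list from the options discussed above (-r, -p, -U, --wait, --limit-rate); the function name and its defaults are assumptions for illustration, and nothing is downloaded unless the list is actually executed.

```python
import shlex


def polite_wget_args(url, wait_seconds=20, limit_rate="20K", user_agent="Mozilla"):
    """Build a recursive wget command that pauses and throttles itself."""
    return [
        "wget",
        f"--wait={wait_seconds}",      # pause between retrievals, in seconds
        f"--limit-rate={limit_rate}",  # bytes per second; K suffix means KB/s
        "-r",                          # recursive retrieval
        "-p",                          # also fetch images and other page requisites
        "-U", user_agent,              # identify as this user agent
        url,
    ]


if __name__ == "__main__":
    args = polite_wget_args("http://www.stupidsite.com/restricedplace.html")
    # Print the command for inspection instead of running it.
    print(shlex.join(args))
```

To run the command rather than print it, the list could be passed to subprocess.run(args, check=True), assuming wget is installed and on the PATH.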

Use --no-parent

--no-parent is a very handy option that guarantees wget will never ascend to the parent directory, so it will not download anything from the folders above the folder you want to acquire. Use this to make sure wget does not fetch more than it needs to if you just want to download the files in a folder.

From: http://linuxreviews.org/quicktips/wget/
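
To make the effect of --no-parent concrete: with this option, a recursive download stays inside the directory of the start URL and whatever lies below it. The following sketch approximates that rule for illustration only. It is not wget's implementation, the function names are hypothetical, and it does not normalise paths containing ".." segments.

```python
import posixpath
from urllib.parse import urlsplit


def start_directory(start_url):
    """Directory part of the start URL's path, always ending in '/'."""
    path = urlsplit(start_url).path or "/"
    if path.endswith("/"):
        return path
    # Without a trailing slash the last segment is treated as a file name.
    parent = posixpath.dirname(path)
    return parent if parent.endswith("/") else parent + "/"


def allowed_by_no_parent(start_url, candidate_url):
    """Approximate check: is the candidate at or below the start directory?"""
    start, cand = urlsplit(start_url), urlsplit(candidate_url)
    if cand.netloc != start.netloc:
        # Simplification: other hosts are treated as out of scope here.
        return False
    return (cand.path or "/").startswith(start_directory(start_url))
```

For a start URL of http://example.com/docs/guide/, a link to http://example.com/docs/guide/img/a.png would be followed, while http://example.com/docs/index.html would not. Note the trailing slash: under this sketch, http://example.com/docs/guide without it is read as a file inside /docs/, which widens the scope.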