Tuesday, May 15, 2012

testing cmusphinx3 alignment

Task: perform forced alignment with the cmusphinx3 tools.

1) Data preparation

Using the prompts of one of the WSJ si_et_05 test speakers, 42 wav files were recorded as the test data. Each wav file is saved with standard PCM encoding, i.e. the basic MS wav format. The file names alone (without path or extension) are then listed in "wav_fids.scp", which the cmusphinx community calls a control file.
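The control file can be generated from the recordings folder. A minimal Python sketch, assuming the recordings sit in a single folder (../wav in the sphinx_fe command below) and all end in ".wav"; the function name write_control_file is hypothetical:

```python
from pathlib import Path


def write_control_file(wav_dir: str, ctl_path: str) -> list[str]:
    """List every .wav file in wav_dir by its bare name (no directory,
    no extension), sorted, and write one name per line to ctl_path."""
    fids = sorted(p.stem for p in Path(wav_dir).glob("*.wav"))
    Path(ctl_path).write_text("".join(f"{fid}\n" for fid in fids))
    return fids
```

For the layout used here, write_control_file("../wav", "wav_fids.scp") would list the 42 recordings. Sorting is not required by the tools; it only makes the file order predictable.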

Convert the prompts to the following format:
FIRST COMMODITY APPEALED THE EXPULSION AND FINE TO THE C. F. T. C. (441C0201)
The last item, enclosed in "(" and ")", is the file name of the corresponding recording. The resulting file, transcription.txt, serves as the transcription for the alignment.
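How the prompts are stored is not shown here, so the sketch below assumes they are already available as a Python dict mapping each file id to its prompt text (a hypothetical input). It upper-cases the words, appends the file id in parentheses, and writes the lines in the same order as wav_fids.scp, on the assumption that the aligner pairs control-file entries and transcription lines by position. Function names are hypothetical.

```python
def transcription_line(fid: str, prompt: str) -> str:
    """Format one prompt as 'WORDS ... (fid)': upper-case words separated
    by single spaces, followed by the file id in parentheses."""
    words = " ".join(prompt.split()).upper()
    return f"{words} ({fid})"


def write_transcription(ctl_path: str, prompts: dict[str, str],
                        out_path: str = "transcription.txt") -> None:
    """Write one transcription line per control-file entry, keeping the
    control-file order. `prompts` maps file id -> prompt text."""
    with open(ctl_path) as ctl:
        fids = [line.strip() for line in ctl if line.strip()]
    with open(out_path, "w") as out:
        for fid in fids:
            out.write(transcription_line(fid, prompts[fid]) + "\n")
```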

Extract the cepstral features with the sphinx_fe command (located in sphinxbase/src/sphinx_fe):
sphinx_fe -verbose yes -c wav_fids.scp -mswav yes -di "../wav" -ei "wav" -do "../feat" -eo "mfc" 
This command leaves most feature extraction parameters at their default values; adjust them where needed, in particular so that they match the front-end settings used to train the acoustic model. Afterwards, the folder "../feat" contains one ".mfc" file for each ".wav" file in "../wav".
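Before moving on, it can be worth checking that sphinx_fe produced a feature file for every control-file entry. A small sketch; the helper name missing_features is hypothetical:

```python
from pathlib import Path


def missing_features(ctl_path: str, feat_dir: str, ext: str = "mfc") -> list[str]:
    """Return the control-file ids that have no <feat_dir>/<id>.<ext> file."""
    lines = Path(ctl_path).read_text().splitlines()
    fids = [line.strip() for line in lines if line.strip()]
    return [fid for fid in fids if not (Path(feat_dir) / f"{fid}.{ext}").is_file()]
```

Run from the folder holding wav_fids.scp, missing_features("wav_fids.scp", "../feat") should return an empty list.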

To view the contents of a ".mfc" feature file, use sphinx_cepview (also located in sphinxbase/src):
sphinx_cepview -header 1 -describe 1 -d 13 -f ../feat/441c0216.mfc 
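sphinx_cepview is the tool for inspection. If the features are needed inside a script, the sketch below reads a .mfc file directly. It assumes the plain Sphinx feature layout: a 4-byte integer holding the total number of values, followed by 32-bit floats, with 13 cepstra per frame (the sphinx_fe default, matching -d 13 above). Rather than assuming an endianness, it tries both byte orders and keeps the one whose header matches the file size. The function name is hypothetical.

```python
import struct
from pathlib import Path


def read_sphinx_mfc(path: str, ncep: int = 13) -> list[list[float]]:
    """Read a Sphinx-format cepstral file into a list of frames."""
    data = Path(path).read_bytes()
    # Pick the byte order whose header agrees with the file size.
    for order in (">", "<"):
        (count,) = struct.unpack(order + "i", data[:4])
        if 4 + 4 * count == len(data):
            break
    else:
        raise ValueError(f"{path}: header does not match file size")
    if count % ncep:
        raise ValueError(f"{path}: {count} values is not a multiple of {ncep}")
    values = struct.unpack(f"{order}{count}f", data[4:])
    # Group the flat value sequence into frames of ncep coefficients.
    return [list(values[i:i + ncep]) for i in range(0, count, ncep)]
```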

2) Prepare the dictionary

As most of the example scripts that come with cmusphinx use cmudict.0.6d, we also use this version here instead of the newest one, cmudict.0.7a.

First, download the dictionary from https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/cmudict/ and then remove the comment lines at the beginning and the stress markers (the digits 0, 1 and 2 attached to the vowels) to produce the dictionary "wsj_all.dic" used for alignment.
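A sketch of this clean-up step. It assumes comment lines start with "##" (";;;", the comment marker of cmudict.0.7a, is skipped as well) and that stress is marked by a trailing 0, 1 or 2 on a phone; check both against your copy. Function names are hypothetical.

```python
import re

# Assumed comment markers; adjust if your dictionary copy differs.
COMMENT_PREFIXES = ("##", ";;;")
STRESS = re.compile(r"[012]$")  # trailing stress digit on a phone


def strip_stress(line: str):
    """Return the entry with stress digits removed from each phone,
    or None for comment and blank lines. The word column is untouched."""
    if not line.strip() or line.startswith(COMMENT_PREFIXES):
        return None
    word, *phones = line.split()
    return " ".join([word] + [STRESS.sub("", p) for p in phones])


def convert_dictionary(src: str, dst: str) -> None:
    """Write the stress-free version of dictionary src to dst."""
    with open(src, encoding="latin-1") as fin, open(dst, "w", encoding="latin-1") as fout:
        for line in fin:
            entry = strip_stress(line)
            if entry is not None:
                fout.write(entry + "\n")
```

Usage: convert_dictionary("cmudict.0.6d", "wsj_all.dic"). Removing stress can make two pronunciations of a word identical; this sketch keeps such duplicates rather than renumbering the (n) variants.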

Next, generate the phone list file "wsj_all.phones", which contains the 39 phones used in the dictionary plus an extra "SIL".
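The phone list can be derived from wsj_all.dic itself. A sketch (function names hypothetical); on the stress-free dictionary it should produce the 39 phones followed by SIL:

```python
def phone_list(dict_lines, extra=("SIL",)) -> list[str]:
    """Collect the distinct phones of a dictionary, sorted, then append
    any extra phones (SIL for silence) not already present."""
    phones = set()
    for line in dict_lines:
        parts = line.split()
        phones.update(parts[1:])  # parts[0] is the word itself
    return sorted(phones) + [p for p in extra if p not in phones]


def write_phone_list(dict_path: str = "wsj_all.dic",
                     out_path: str = "wsj_all.phones") -> list[str]:
    with open(dict_path, encoding="latin-1") as f:
        phones = phone_list(f)
    with open(out_path, "w") as f:
        f.write("".join(p + "\n" for p in phones))
    return phones
```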

Finally, create the filler dictionary "wsj_all.filler" with the following contents:
<s>   SIL
</s>  SIL
<sil> SIL
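A word in the transcription that is missing from wsj_all.dic cannot be aligned properly, so a quick out-of-vocabulary check before running the aligner can save time. A sketch with hypothetical function names; it strips the (n) pronunciation suffix from dictionary headwords and the trailing (file id) from each transcription line:

```python
import re

VARIANT = re.compile(r"\(\d+\)$")  # alternate-pronunciation suffix, e.g. THE(2)


def dictionary_words(dict_lines) -> set[str]:
    """Headwords of a Sphinx dictionary with any (n) suffix removed."""
    return {VARIANT.sub("", line.split()[0]) for line in dict_lines if line.strip()}


def find_oov(transcript_lines, words: set[str]) -> dict[str, list[str]]:
    """Map each file id to the transcription words missing from `words`."""
    missing = {}
    for line in transcript_lines:
        tokens = line.split()
        if not tokens:
            continue
        fid = tokens[-1].strip("()")  # last token is "(file id)"
        oov = [w for w in tokens[:-1] if w not in words]
        if oov:
            missing[fid] = oov
    return missing
```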

3) Prepare the model

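In this experiment, we use the existing WSJ acoustic model trained by Keith Vertanen (http://www.keithv.com/software/sphinx/us/sphinx_wsj_all_cont_3no_8000_32.zip). Download the archive and extract it so that its model_architecture and model_parameters folders sit under a "model" folder in the current directory, which is where the alignment command below looks for them.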

No language model is required for alignment, since the word sequence is already given by the transcription.

4) Do the alignment

The alignment is done with sphinx3_align (from sphinx3/src/programs) using the following configuration:
sphinx3_align \
-logbase 1.0001 \
-feat 1s_12c_12d_3p_12dd \
-mdef model/model_architecture/wsj_all_cont_3no_8000.mdef \
-senmgau .cont. \
-mean model/model_parameters/wsj_all_cont_3no_8000_32/means \
-var model/model_parameters/wsj_all_cont_3no_8000_32/variances \
-mixw model/model_parameters/wsj_all_cont_3no_8000_32/mixture_weights \
-tmat model/model_parameters/wsj_all_cont_3no_8000_32/transition_matrices \
-beam 1e-80 \
-dict wsj_all.dic \
-fdict wsj_all.filler \
-ctl wav_fids.scp \
-cepdir ../feat \
-cepext .mfc \
-insent transcription.txt \
-outsent alignments/output.txt \
-wdsegdir segmentations,CTL \
-agc none \
-cmn current 

Make sure the two folders "alignments" and "segmentations" exist in the current directory before running the command.
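When the alignment is run repeatedly, it can help to wrap it in a script that creates the two output folders first. The sketch below passes exactly the options of the command above (nothing new) and calls sphinx3_align through subprocess; it assumes sphinx3_align is on the PATH. The names ALIGN_OPTIONS, build_align_command and run_alignment are hypothetical.

```python
import subprocess
from pathlib import Path

MODEL = "model/model_parameters/wsj_all_cont_3no_8000_32"

# The options of the sphinx3_align command above, as (flag, value) pairs.
ALIGN_OPTIONS = [
    ("-logbase", "1.0001"),
    ("-feat", "1s_12c_12d_3p_12dd"),
    ("-mdef", "model/model_architecture/wsj_all_cont_3no_8000.mdef"),
    ("-senmgau", ".cont."),
    ("-mean", f"{MODEL}/means"),
    ("-var", f"{MODEL}/variances"),
    ("-mixw", f"{MODEL}/mixture_weights"),
    ("-tmat", f"{MODEL}/transition_matrices"),
    ("-beam", "1e-80"),
    ("-dict", "wsj_all.dic"),
    ("-fdict", "wsj_all.filler"),
    ("-ctl", "wav_fids.scp"),
    ("-cepdir", "../feat"),
    ("-cepext", ".mfc"),
    ("-insent", "transcription.txt"),
    ("-outsent", "alignments/output.txt"),
    ("-wdsegdir", "segmentations,CTL"),
    ("-agc", "none"),
    ("-cmn", "current"),
]


def build_align_command(options=ALIGN_OPTIONS, binary: str = "sphinx3_align") -> list[str]:
    """Flatten the (flag, value) pairs into an argument list."""
    cmd = [binary]
    for flag, value in options:
        cmd += [flag, value]
    return cmd


def run_alignment(options=ALIGN_OPTIONS) -> None:
    # Create the output folders first, as required before running the aligner.
    for folder in ("alignments", "segmentations"):
        Path(folder).mkdir(exist_ok=True)
    subprocess.run(build_align_command(options), check=True)
```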


5) The results

Each line of the file "output.txt" in the "alignments" folder has the form:
<s> <sil> FIRST <sil> COMMODITY APPEALED <sil> THE(2) <sil> EXPULSION AND <sil> FINE TO(2) THE C. F. T. C. </s>  (441c0201)
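It shows the aligned word sequence, including the silences (<sil>) inserted between words, and which dictionary pronunciation was chosen for each word: THE(2), for example, is the second pronunciation of THE.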
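To recover the chosen pronunciations programmatically, each output line can be split into its file id and a list of (word, pronunciation number) pairs. A sketch with a hypothetical function name; it treats <s>, </s> and <sil> as fillers and counts a word without a (n) suffix as pronunciation 1:

```python
import re

FILLERS = {"<s>", "</s>", "<sil>"}
VARIANT = re.compile(r"^(.*)\((\d+)\)$")  # e.g. THE(2) -> ("THE", "2")


def parse_output_line(line: str):
    """Return (file id, [(word, pronunciation number), ...]) for one
    line of alignments/output.txt, with fillers dropped."""
    tokens = line.split()
    fid = tokens[-1].strip("()")  # last token is "(file id)"
    words = []
    for tok in tokens[:-1]:
        if tok in FILLERS:
            continue
        m = VARIANT.match(tok)
        words.append((m.group(1), int(m.group(2))) if m else (tok, 1))
    return fid, words
```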

The "segmentations" folder contains one ".wdseg" file per ".mfc" file. SFrm and EFrm give the first and last frame of each segment, and SegAScr its acoustic score. For example, "441c0201.wdseg" contains:
SFrm  EFrm    SegAScr Word
   0    16   -1547495 <s>
  17    50    -187467 <sil>
  51    81   -1144739 FIRST
  82    96    -589279 <sil>
  97   161   -3079573 COMMODITY
 162   217   -2497321 APPEALED
 218   231    -878705 <sil>
 232   254    -993106 THE(2)
 255   257    -268085 <sil>
 258   314   -6117691 EXPULSION
 315   371   -2952645 AND
 372   376    -269014 <sil>
 377   409   -1439215 FINE
 410   424    -618393 TO(2)
 425   449   -2016365 THE
 450   481   -1288960 C.
 482   504   -1931603 F.
 505   532   -1083254 T.
 533   576   -1412650 C.
 577   676    -488706 </s>
 Total score:   -30804266
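The frame numbers can be converted to times. The sketch below assumes the sphinx_fe default frame rate of 100 frames per second (10 ms shift), since the feature extraction above kept the defaults. The table shows consecutive segments such as 0-16 and 17-50, so EFrm is treated as inclusive. The function name is hypothetical.

```python
FRAMES_PER_SECOND = 100  # assumption: sphinx_fe default frame rate


def read_wdseg(lines, frate: int = FRAMES_PER_SECOND):
    """Parse .wdseg rows into (word, start_s, end_s, score) tuples.
    EFrm is inclusive, so a segment ends at (EFrm + 1) / frate."""
    segments = []
    for line in lines:
        parts = line.split()
        # Data rows have four fields and start with a frame number; this
        # skips the "SFrm EFrm ..." header and the "Total score" line.
        if len(parts) != 4 or not parts[0].isdigit():
            continue
        sfrm, efrm, score, word = int(parts[0]), int(parts[1]), int(parts[2]), parts[3]
        segments.append((word, sfrm / frate, (efrm + 1) / frate, score))
    return segments
```

With these assumptions, FIRST (frames 51 to 81) spans 0.51 s to 0.82 s.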


