Task: perform forced alignment with the CMU Sphinx3 tools.
1) Data preparation:
Using the prompts of one of the WSJ si_et_05 test speakers, 42 wav files were recorded as test data. The audio is saved with standard PCM encoding, i.e. the basic MS wav format. Put the file names of the recordings (without path or suffix), one per line, into a list file "wav_fids.scp", which the cmusphinx community calls a control file.
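The control file can be generated from the recording folder instead of being typed by hand. The following is a minimal sketch, assuming a POSIX shell and that all recordings end in ".wav"; the function name make_ctl is hypothetical and not part of cmusphinx.

```shell
# make_ctl WAV_DIR OUT_FILE  (hypothetical helper, not a cmusphinx tool)
# Writes one utterance ID per line: the wav file name without path and suffix.
make_ctl() {
    wav_dir=$1
    out=$2
    : > "$out"                          # create or truncate the control file
    for f in "$wav_dir"/*.wav; do
        [ -e "$f" ] || continue         # skip the literal pattern if nothing matches
        basename "$f" .wav >> "$out"    # strip the directory and the ".wav" suffix
    done
}

# Example call for the layout used here:
#   make_ctl ../wav wav_fids.scp
```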
Convert the prompts to the following format:
FIRST COMMODITY APPEALED THE EXPULSION AND FINE TO THE C. F. T. C. (441C0201)
The item in parentheses at the end of each line is the file name of the corresponding recording. This file serves as the transcription file used for alignment.
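Since every recording needs a transcription line, it can help to cross-check the two files before aligning. The sketch below lists control-file entries that have no matching "(ID)" in the transcription file. The function name check_transcripts is hypothetical, and the case-insensitive comparison is an assumption made only so that "441C0201" and "441c0201" count as the same utterance.

```shell
# check_transcripts CTL_FILE TRANSCRIPT_FILE  (hypothetical helper)
# Prints each utterance ID from the control file that has no transcription line.
check_transcripts() {
    awk '
        NR == FNR {                      # first file read: the transcription file
            if (NF > 0) {
                id = $NF                 # last field, e.g. "(441C0201)"
                gsub(/[()]/, "", id)     # drop the parentheses
                seen[tolower(id)] = 1    # assumption: IDs compared case-insensitively
            }
            next
        }
        NF > 0 && !(tolower($1) in seen) { print $1 }   # second file: control file
    ' "$2" "$1"
}

# Example call:
#   check_transcripts wav_fids.scp transcription.txt
```

An empty output means every entry of the control file has a transcription.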
Extract the cepstral features with the sphinx_fe command (located in sphinxbase/src/sphinx_fe):
sphinx_fe -verbose yes -c wav_fids.scp -mswav yes -di "../wav" -ei "wav" -do "../feat" -eo "mfc"
This command leaves most of the feature extraction parameters at their default values; adjust them as your setup requires. After it finishes, the folder "../feat" contains one ".mfc" file for each ".wav" file in the folder "../wav".
To view the content of the ".mfc" feature file, use the command sphinx_cepview (which is also located in sphinxbase/src):
sphinx_cepview -header 1 -describe 1 -d 13 -f ../feat/441c0216.mfc
2) Prepare the dictionary
Most of the example scripts that come with cmusphinx use cmudict.0.6d, so we use that version here as well instead of the newest cmudict.0.7a.
Also generate the phone list file "wsj_all.phones", which contains all 39 phones used in the dictionary plus an extra "SIL".
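The phone list can be derived from the dictionary itself. The sketch below assumes each dictionary line has the form "WORD PH1 PH2 ..." with stress markers already removed, and that comment lines start with ";;;" or "##"; both are assumptions about the dictionary file, and the function name make_phone_list is hypothetical.

```shell
# make_phone_list DICT_FILE OUT_FILE  (hypothetical helper)
# Collects every distinct phone used in the dictionary and adds SIL.
make_phone_list() {
    {
        awk '
            /^(;;;|##)/ || NF < 2 { next }              # skip comments and empty entries
            { for (i = 2; i <= NF; i++) print $i }      # fields after the word are phones
        ' "$1"
        echo SIL                                        # the extra silence phone
    } | sort -u > "$2"
}

# Example call:
#   make_phone_list wsj_all.dic wsj_all.phones
#   wc -l wsj_all.phones    # the text above expects 39 phones plus SIL
```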
Also create the filler dictionary "wsj_all.filler" with the contents of:
3) Prepare the model
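For alignment, no language model is required; only the acoustic model files (model definition, means, variances, mixture weights and transition matrices) passed to sphinx3_align in the next step are needed.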
4) Do the alignment
The alignment is done with sphinx3_align (from sphinx3/src/programs) using the following configuration:
sphinx3_align \
-logbase 1.0001 \
-feat 1s_12c_12d_3p_12dd \
-mdef model/model_architecture/wsj_all_cont_3no_8000.mdef \
-senmgau .cont. \
-mean model/model_parameters/wsj_all_cont_3no_8000_32/means \
-var model/model_parameters/wsj_all_cont_3no_8000_32/variances \
-mixw model/model_parameters/wsj_all_cont_3no_8000_32/mixture_weights \
-tmat model/model_parameters/wsj_all_cont_3no_8000_32/transition_matrices \
-beam 1e-80 \
-dict wsj_all.dic \
-fdict wsj_all.filler \
-ctl wav_fids.scp \
-cepdir ../feat \
-cepext .mfc \
-insent transcription.txt \
-outsent alignments/output.txt \
-wdsegdir segmentations,CTL \
-agc none \
-cmn current
Make sure that the two folders "alignments" and "segmentations" exist under the current path.
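One way to create both output folders before running sphinx3_align, assuming a POSIX shell; mkdir -p leaves the folders untouched if they already exist:

```shell
# Create the output folders for -outsent and -wdsegdir in the current directory.
mkdir -p alignments segmentations
```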
5) The results
In the "output.txt" file under the "alignments" folder, each line has the form:
<s> <sil> FIRST <sil> COMMODITY APPEALED <sil> THE(2) <sil> EXPULSION AND <sil> FINE TO(2) THE C. F. T. C. </s> (441c0201)
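Each line shows the aligned word sequence, including the inserted silences, and the pronunciation chosen from the dictionary for each word; for example, THE(2) is the second pronunciation of THE.
Under the "segmentations" folder, there is one ".wdseg" file for each ".mfc" file. For example, the content of "441c0201.wdseg" is: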
SFrm EFrm SegAScr Word
0 16 -1547495 <s>
17 50 -187467 <sil>
51 81 -1144739 FIRST
82 96 -589279 <sil>
97 161 -3079573 COMMODITY
162 217 -2497321 APPEALED
218 231 -878705 <sil>
232 254 -993106 THE(2)
255 257 -268085 <sil>
258 314 -6117691 EXPULSION
315 371 -2952645 AND
372 376 -269014 <sil>
377 409 -1439215 FINE
410 424 -618393 TO(2)
425 449 -2016365 THE
450 481 -1288960 C.
482 504 -1931603 F.
505 532 -1083254 T.
533 576 -1412650 C.
577 676 -488706 </s>
Total score: -30804266
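The SFrm and EFrm columns are frame indices, so they usually have to be converted to times before use. The sketch below does this under the assumption that the features were extracted at 100 frames per second (believed to be the sphinx_fe default, which the command above does not override) and that EFrm is inclusive. The function name wdseg_to_seconds is hypothetical.

```shell
# wdseg_to_seconds WDSEG_FILE [FRAME_RATE]  (hypothetical helper)
# Prints "start_sec end_sec word" for every segment line of a .wdseg file.
wdseg_to_seconds() {
    frate=${2:-100}                     # assumption: 100 frames per second
    awk -v frate="$frate" '
        # Segment lines have 4 fields and start with a frame number;
        # this skips the header line and the "Total score" line.
        NF == 4 && $1 ~ /^[0-9]+$/ {
            printf "%.2f %.2f %s\n", $1 / frate, ($2 + 1) / frate, $4
        }
    ' "$1"
}

# Example call:
#   wdseg_to_seconds segmentations/441c0201.wdseg
```

With these assumptions, the line "51 81 -1144739 FIRST" is printed as "0.51 0.82 FIRST".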
Useful References: