Wednesday, May 30, 2012

First use of Bazaar and Launchpad

1. Check out the Launchpad branch:

bzr branch lp:~troy-lee2008/pronunciationeval/branch-troy

2. Push updates to Launchpad:

bzr push lp:~troy-lee2008/pronunciationeval/branch-troy

Execute this every time you want your local changes to be reflected on the server.

3. Useful operations

bzr merge

bzr add [file/directory]

bzr mkdir [directory]

bzr mv [file/directory...]

bzr rm [file/directory]

bzr commit -m "message"

Most of the time, the operations follow these steps:
a) bzr merge: sync to the most recent revision on the server
b) bzr add, mv, rm, etc.: make the necessary changes
c) bzr commit: commit the changes to your local repository
d) bzr push: submit your changes to the server repository


Monday, May 28, 2012

[GSoC 2012: Pronunciation Evaluation #Troy] Week 1

The first week of GSoC 2012 has already made for a busy summer. Here is what I managed to accomplish:
  1. The server-side rtmplite is now configured to save recordings into the folder [path_to_webroot]/data on the server. For the audioRecorder app, all data goes under the [path_to_webroot]/data/audioRecorder folder, with a separate folder for each user (e.g. [path_to_webroot]/data/audioRecorder/user1). Each recorded utterance is named in the format [sentence name]_[quality level].flv
  2. So far, the conversion from FLV to WAV is done purely on the server side inside rtmplite, with Python's subprocess.Popen() function calling FFmpeg. As soon as rtmplite closes the FLV file, the conversion is carried out, and the converted WAV file has exactly the same path and name except for the suffix, which is WAV instead of FLV. Many thanks to Guillem for helping me test "sox" for the conversion. However, when I tried sox directly in the terminal to convert FLV to WAV, it failed with "unknown format flv"; if needed, I will figure out whether the problem lies with my sox build. James then pointed out that we can do the conversion inside rtmplite, which really helps. Why send an extra HTTP request just to invoke a PHP process for the conversion?
  3. To verify the recording parameter, i.e. the quality level for Speex encoding, I recorded the same utterance ("NO ONE AT THE STATE DEPARTMENT WANTS TO LET SPIES IN") with the quality varying from 0 to 10. As expected, the higher the quality, the larger the FLV file, and to my ears the better the sound, although it is hard to notice any difference above level 7. I also generated alignment scores to see whether the quality level affects the alignment; from the results (the comparison graph is omitted here), the acoustic scores do not seem comparable across different recordings. For now, we will set the recording quality to 8.
  4. In the audioRecorder, the UI and other events now proceed only after the NetConnection event and the NetStream open and close events have successfully finished. A 0.5 s delay is also inserted at the start and end of the record-button click event to avoid clipping.
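The FLV-to-WAV conversion in item 2 can be sketched in Python. This is a minimal sketch, not the project's actual code: the ffmpeg flags and the helper names flv_to_wav_cmd/convert are assumptions, and the hook into rtmplite's file-close event is left out.

```python
import os
import subprocess

def flv_to_wav_cmd(flv_path):
    """Build an ffmpeg command converting an FLV recording to WAV.

    The output keeps exactly the same path and name; only the suffix
    changes from .flv to .wav, mirroring the naming scheme above.
    """
    wav_path = os.path.splitext(flv_path)[0] + ".wav"
    # -y: overwrite an existing output file without prompting
    return ["ffmpeg", "-y", "-i", flv_path, wav_path], wav_path

def convert(flv_path):
    """Run the conversion right after rtmplite closes the FLV file."""
    cmd, wav_path = flv_to_wav_cmd(flv_path)
    subprocess.Popen(cmd).wait()  # block until ffmpeg finishes
    return wav_path
```

The same ffmpeg command also works from a shell for a one-off manual conversion.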

For the 2nd week:
  1. Solve the problem encountered when converting FLV to WAV using FFmpeg via Python's Popen(). If the main Python script (call it for now) is run in the terminal with "python", there are no problems and everything works great. However, if I put it in the background and log off the server with "python &", every time Popen() is invoked the whole process hangs with a "Stopped + &" error message. I have tried various approaches found on the web for two days without success. I will try to find a workaround; otherwise, I may turn to Guillem's suggestion and figure out whether sox works.
  2. Finish the upload interface. There will be two kinds of interfaces: one for students and one for exemplar pronunciations. For the students' interface, we want to display one to five phrases below space for a graphic or animation, assuming the smallest possible screen but with HTML that also looks good in a large window. For the exemplar interface, we only need to display one phrase, but we should also have per-upload form fields (name, age, sex, native English speaker?, where they lived at ages 6-8 (which determines accent), self-reported accent, etc.) which should persist across multiple uploads by the same user (with cookies?).
  3. Test rtmplite with multiple users, and with the same user across multiple recording sessions.
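A likely cause of the "Stopped + &" problem in item 1: a background process that tries to read from the controlling terminal is suspended by SIGTTIN, and ffmpeg reads stdin by default. A minimal sketch of the workaround, redirecting the child's stdin away from the terminal (run_detached is a hypothetical helper, not code from the project):

```python
import os
import subprocess

def run_detached(cmd):
    """Run cmd with stdin redirected to /dev/null so the child never
    reads from the controlling terminal; this avoids the SIGTTIN
    suspension ("Stopped + &") when the parent runs in the background."""
    with open(os.devnull, "rb") as devnull:
        proc = subprocess.Popen(cmd, stdin=devnull,
                                stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT)
        out, _ = proc.communicate()
    return proc.returncode, out
```

With ffmpeg specifically, passing the -nostdin flag (available in newer builds) should achieve the same effect.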

Saturday, May 26, 2012

First use of SVN

Due to the requirements of GSoC 2012, we need to check our code into the cmusphinx Subversion repository. Here is what I tried.

Step 1: Creating a new branch:

Then a log message template will be opened in vi; add some comments to it, save and quit. Next press 'c' and Enter; you will be prompted for authentication, and if the suggested username is not your SVN username, just press Enter and it will ask for both username and password.

After that, seeing the message "Committed revision 11369" probably indicates a successful operation.

However, I did receive an email saying that the operation "is being held until the list moderator can review it for approval". Browsing the online SVN branches, though, I can see the folder is already there.

Step 2: Check out the branch:

Some other useful commands:

svn add [filename/folder]

svn delete [filename/folder]

svn commit -m "message"

Thursday, May 24, 2012

[GSoC 2012] Before Week 1

GSoC 2012 officially started this Monday (21 May). Although the expected weekly reports begin next Monday, it seems better to start with a brief overview of the preparations done during the bonding period.

The project started with a group chat with our mentor James and the other student, Ronanki. From that chat, together with the email exchanges that followed, the project has become much clearer to me. For my project, the major focuses will be:

1) a web portal for automatic pronunciation evaluation audio collection;
2) Android based mobile automatic pronunciation evaluation app.

The core of these two applications is edit-distance-grammar-based automatic pronunciation evaluation using cmusphinx3, which will serve as the foundation for both.

Following are the preparations I have done during the bonding period:
  1. Tried out the basic wami-recorder demo on my school's server;
  2. Switched to rtmplite for audio recording. rtmplite is a Python implementation of a Flash RTMP server with the minimum support needed for real-time streaming and recording using AMF0. On the server side, the RTMP server daemon process listens by default on TCP port 1935 for connections and streaming. On the client side, the user uses NetConnection to set up a session with the server, and NetStream for audio and video streaming as well as recording. The demo application has been set up at:
  3. Based on my understanding of the demo application, which does real-time streaming and recording of both audio and video, I started writing my own audio recorder, the key component of both the web-based audio data collection and the evaluation app. A basic version of the recorder is hosted at: . The current implementation includes:
    1. Distinguishing recordings from different users by user id;
    2. Loading pre-defined text sentences for recording, which may be useful for the data collection;
    3. Real-time audio recording;
    4. Playing back the recordings from the server;
    5. Basic event-control logic, such as preventing users from recording and playing at the same time, etc.
  4. Besides, I have also learnt how to do alignment using cmusphinx3. To generate phoneme alignment scores, two alignment steps are needed. The details of how to carry out the alignment can be found in my more tech-oriented posts ( and ) on my personal blog.
Currently, the following things are ongoing:
  1. Setting up the server-side process to properly manage the user recordings, i.e. distinguishing between users and between different utterances.
  2. Figuring out how to automatically convert the recorded server-side FLV files to WAV files after the user stops recording.
  3. Verifying the recording parameters against the recording quality, also taking the network bandwidth into consideration.
  4. Incorporating delays between network events in the recorder. The current version does not wait for network events (such as connection setup, data packet transmission, etc.) to finish before processing the next user event, which often causes the recordings to be clipped.

Sunday, May 20, 2012

Configure Ubuntu 12.04 to boot without a monitor

Normally, the Ubuntu 12.04 desktop system is meant for a PC connected to a monitor, a keyboard and a mouse. Since I only want to log in to the computer remotely, I removed those devices and left the machine with only power and network. Everything worked fine until I needed to reboot the machine remotely: after that, I could not connect to it again until I attached a monitor and rebooted.

To solve the problem, I finally find this post: . The solution is:

Step 1. Back up the original xorg.conf to xorg.conf.bk just in case, then create a new xorg.conf in /etc/X11 with the following content.

Section "Device"
    Identifier "VNC Device"
    Driver "vesa"
EndSection

Section "Screen"
    Identifier "VNC Screen"
    Device "VNC Device"
    Monitor "VNC Monitor"
    SubSection "Display"
        Modes "1024x768"
    EndSubSection
EndSection

Section "Monitor"
    Identifier "VNC Monitor"
    HorizSync 30-70
    VertRefresh 50-75
EndSection

Step 2. Disable KMS for your video card

The list below tells you, for each video card manufacturer, the command to create the appropriate kms.conf file containing an "options ... modeset=0" line. If you have access to the GUI, you could just as easily create or modify the file and put the "options ... modeset=0" line in as appropriate.

The following are entered in a terminal window as commands (run as root):

# ATI Radeon:
echo options radeon modeset=0 > /etc/modprobe.d/radeon-kms.conf

# Intel:
echo options i915 modeset=0 > /etc/modprobe.d/i915-kms.conf

# Nvidia (this should revert you to using -nv or -vesa):
echo options nouveau modeset=0 > /etc/modprobe.d/nouveau-kms.conf

As for my case, the card is Intel, so I added "options i915 modeset=0" to /etc/modprobe.d/dkms.conf .

Step 3. Reboot

cudaGetDeviceCount returned 38

The machine was initially configured to use the NVIDIA card for video output, with the IGD (integrated graphics device) that comes with the board disabled. To make full use of the GPU's computation power, I had to reconfigure the BIOS to use the IGD for video output and leave the NVIDIA card for computation.

However, after installing the NVIDIA driver, CUDA 4.2 and SDK on Ubuntu 12.04, the test program deviceQuery cannot find the CUDA device:

$ ./deviceQuery
[deviceQuery] starting...
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 38
-> No CUDA-capable device is detected
[deviceQuery] test results...

Press ENTER to exit...

Checking the device:

$ lspci | grep -i NVIDIA
01:00.0 VGA compatible controller: NVIDIA Corporation Tesla C2075 (rev a1)
With the help of the post ( ) and Google Translate, I finally realized the problem is actually stated in the installation doc:

4. If you do not use a GUI environment, ensure that the device files /dev/nvidia* exist and have the correct file permissions. (This would be done automatically when initializing a GUI environment.) This can be done by creating a startup script like the following to load the driver kernel module and create the entries as a superuser at boot time:

/sbin/modprobe nvidia

if [ "$?" -eq 0 ]; then
    # Count the number of NVIDIA controllers found.
    NVDEVS=`lspci | grep -i NVIDIA`
    N3D=`echo "$NVDEVS" | grep "3D controller" | wc -l`
    NVGA=`echo "$NVDEVS" | grep "VGA compatible controller" | wc -l`
    N=`expr $N3D + $NVGA - 1`
    for i in `seq 0 $N`; do
        mknod -m 666 /dev/nvidia$i c 195 $i
    done

    mknod -m 666 /dev/nvidiactl c 195 255
else
    exit 1
fi

Thus, to solve the problem, add the above script to /etc/rc.local so it runs at startup.

Linking error while compiling CUDA SDK in Ubuntu 12.04

Following the installation guide on the CUDA website, all the dependency libraries were installed through apt-get:

sudo apt-get install freeglut3-dev build-essential libx11-dev
libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa libglu1-mesa-dev

When compiling the SDK, still get the error:

../../lib/librendercheckgl_x86_64.a(rendercheck_gl.cpp.o): In function `CheckBackBuffer::checkStatus(char const*, int, bool)':
rendercheck_gl.cpp:(.text+0xfbb): undefined reference to `gluErrorString'
collect2: ld returned 1 exit status

After several rounds of uninstalling and reinstalling, I finally found a post ( ) that solves this problem:

The problem is in the makefile under ~/NVIDIA_GPU_Computing_SDK/C/common/

Lines like:
should have the two libraries in reverse order.

Tuesday, May 15, 2012

cmusphinx3 phoneme alignment

Task: Generate phoneme level alignment acoustic scores.

Here is just one way to solve the problem; it requires two passes of alignment. If there are better ways to do it, let me know!

1) Do word level alignment using sphinx3_align as in the previous post.

sphinx3_align \
-logbase 1.0001 \
-feat 1s_12c_12d_3p_12dd \
-mdef model/model_architecture/wsj_all_cont_3no_8000.mdef \
-senmgau .cont. \
-hmm model/model_parameters/wsj_all_cont_3no_8000_32 \
-beam 1e-150 \
-dict ../../lib/dicts/wsj_all.dic \
-fdict ../../lib/dicts/wsj_all.filler \
-ctl ../../lib/flists/wav_fids.scp \
-cepdir ../../data/feat \
-cepext .mfc \
-insent ../../lib/wlabs/441c0200.lsn \
-outsent alignments/output.txt \
-wdsegdir wdseg,CTL \
-phlabdir phnlab,CTL \
-agc none \
-cmn current 

2) Convert the phoneme labels to a phoneme transcription file
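Step 2 can be sketched as follows. This assumes the phone label files use segmentation-style columns with the phone name in the last column, possibly carrying a triphone context such as F(SIL,ER); the exact column layout varies, so adjust the parsing to the actual sphinx3 output, and phones_to_transcription is a hypothetical helper name.

```python
def phones_to_transcription(label_lines, fileid):
    """Collapse one utterance's phone labels into a transcription line
    in the same "... (fileid)" format used for word-level alignment.

    Assumes each data line starts with a frame number and ends with
    the phone name; any triphone context in parentheses is stripped.
    """
    phones = []
    for line in label_lines:
        parts = line.split()
        if not parts or not parts[0].isdigit():
            continue  # skip the header and blank lines
        phone = parts[-1].split("(")[0]  # drop triphone context, if any
        phones.append(phone)
    return " ".join(phones) + " (" + fileid + ")"
```

Running this over every utterance in the control file produces the trans_phn.txt used in step 3.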

3) Do phoneme alignment with sphinx3_align

sphinx3_align \
-logbase 1.0001 \
-feat 1s_12c_12d_3p_12dd \
-mdef model/model_architecture/wsj_all_cont_3no_8000.mdef \
-senmgau .cont. \
-hmm model/model_parameters/wsj_all_cont_3no_8000_32 \
-beam 1e-150 \
-insert_sil 0 \
-dict ../../lib/dicts/phone.dic \
-ctl ../../lib/flists/wav_fids.scp \
-cepdir ../../data/feat \
-cepext .mfc \
-insent trans_phn.txt \
-outsent alignments/output_phn.txt \
-wdsegdir phnseg,CTL \
-agc none \
-cmn current 

testing cmusphinx3 alignment

Task: do alignment with the cmusphinx3 tools.

1) Data preparation:

Using the prompts of one of the WSJ si_et_05 test speakers, 42 wav files were recorded as the testing data. The wav files are saved in standard PCM encoding, i.e. the basic MS wav format. Only the file names (without path or suffix) are put into a list file "wav_fids.scp", which is called the control file in the cmusphinx community.

Convert the prompts to the following format:
The last item between "(" and ")" is the file name of the corresponding recording. This will serve as the transcription file used for alignment.
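A tiny helper illustrates the target format (the function name prompt_to_lsn is hypothetical, and the exact layout of the source prompts file is an assumption):

```python
def prompt_to_lsn(text, fileid):
    """Format one prompt as a sphinx3_align transcription line:
    the uppercase sentence followed by the file name in parentheses."""
    return text.upper() + " (" + fileid + ")"
```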

To extract the cepstral features, use the sphinx_fe command (located in sphinxbase/src/sphinx_fe):
sphinx_fe -verbose yes -c wav_fids.scp -mswav yes -di "../wav" -ei "wav" -do "../feat" -eo "mfc" 
With this command, most of the feature extraction parameters use their default values; adjust the values according to your specific requirements. Afterwards, there will be a ".mfc" file under the folder "../feat" corresponding to each ".wav" file in the folder "../wav".

To view the content of a ".mfc" feature file, use the sphinx_cepview command (also located in sphinxbase/src):
sphinx_cepview -header 1 -describe 1 -d 13 -f ../feat/441c0216.mfc 

2) Prepare the dictionary

As most of the example scripts that come with cmusphinx use cmudict.0.6d, we will also use that version here instead of the newest cmudict.0.7a.

First, download the dictionary from  and remove the comments at the beginning as well as the stress symbols, producing the dictionary "wsj_all.dic" for alignment.

Meanwhile, generate the phone list file "wsj_all.phones", which contains the 39 phones from the dictionary plus an extra "SIL".
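Both preparation steps, stripping the stress digits and collecting the phone set plus SIL, can be sketched in a few lines of Python (clean_cmudict is a hypothetical helper; cmudict comment lines are assumed to start with ";;;"):

```python
def clean_cmudict(lines):
    """Remove ';;;' comment lines and the 0-2 stress digits from each
    phone; return the cleaned entries and the sorted phone set + SIL."""
    entries, phones = [], set()
    for line in lines:
        if line.startswith(";;;") or not line.strip():
            continue
        word, pron = line.split(None, 1)
        bare = [p.rstrip("012") for p in pron.split()]  # AH0 -> AH, AW1 -> AW
        entries.append(word + "  " + " ".join(bare))
        phones.update(bare)
    return entries, sorted(phones | {"SIL"})
```

Writing the first return value to wsj_all.dic and the second, one phone per line, to wsj_all.phones gives the two files described above.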

Also create the filler dictionary "wsj_all.filler" with the contents of:
<s>   SIL
</s>  SIL
<sil> SIL

3) Prepare the model

In this experiment, we use an existing WSJ acoustic model trained by Keith Vertanen ( ). Simply download and extract the folders.

For alignment, no language model is required.

4) Do the alignment

The alignment is done with sphinx3_align (from sphinx3/src/programs) with the following configuration:
sphinx3_align \
-logbase 1.0001 \
-feat 1s_12c_12d_3p_12dd \
-mdef model/model_architecture/wsj_all_cont_3no_8000.mdef \
-senmgau .cont. \
-mean model/model_parameters/wsj_all_cont_3no_8000_32/means \
-var model/model_parameters/wsj_all_cont_3no_8000_32/variances \
-mixw model/model_parameters/wsj_all_cont_3no_8000_32/mixture_weights \
-tmat model/model_parameters/wsj_all_cont_3no_8000_32/transition_matrices \
-beam 1e-80 \
-dict wsj_all.dic \
-fdict wsj_all.filler \
-ctl wav_fids.scp \
-cepdir ../feat \
-cepext .mfc \
-insent transcription.txt \
-outsent alignments/output.txt \
-wdsegdir segmentations,CTL \
-agc none \
-cmn current 

Make sure the two folders "alignments" and "segmentations" exist under the current path.

5) The results

In the "output.txt" file under the "alignments" folder, each line is of the form:
<s> <sil> FIRST <sil> COMMODITY APPEALED <sil> THE(2) <sil> EXPULSION AND <sil> FINE TO(2) THE C. F. T. C. </s>  (441c0201)
representing the alignment and the pronunciation variant chosen for each word from the dictionary.

Under the "segmentations" folder, there is a ".wdseg" file for each ".mfc" file. For example, the content of "441c0201.wdseg" is:
SFrm  EFrm    SegAScr Word
   0    16   -1547495 <s>
  17    50    -187467 <sil>
  51    81   -1144739 FIRST
  82    96    -589279 <sil>
  97   161   -3079573 COMMODITY
 162   217   -2497321 APPEALED
 218   231    -878705 <sil>
 232   254    -993106 THE(2)
 255   257    -268085 <sil>
 258   314   -6117691 EXPULSION
 315   371   -2952645 AND
 372   376    -269014 <sil>
 377   409   -1439215 FINE
 410   424    -618393 TO(2)
 425   449   -2016365 THE
 450   481   -1288960 C.
 482   504   -1931603 F.
 505   532   -1083254 T.
 533   576   -1412650 C.
 577   676    -488706 </s>
 Total score:   -30804266
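Because SegAScr accumulates over frames, longer words naturally get larger-magnitude scores; normalizing by segment length makes the words easier to compare. A sketch that parses the .wdseg layout shown above (parse_wdseg is a hypothetical helper; the frame range is assumed inclusive):

```python
def parse_wdseg(lines):
    """Parse .wdseg lines into (word, sfrm, efrm, score, score_per_frame).

    Expects the 'SFrm EFrm SegAScr Word' layout shown above; the header
    and the trailing 'Total score:' line are skipped.
    """
    segs = []
    for line in lines:
        parts = line.split()
        if len(parts) != 4 or not parts[0].lstrip("-").isdigit():
            continue  # header or total-score line
        s, e, score = int(parts[0]), int(parts[1]), int(parts[2])
        segs.append((parts[3], s, e, score, score / float(e - s + 1)))
    return segs
```

The per-frame averages are one simple way to compare acoustic scores across recordings of different lengths.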


Friday, May 11, 2012

setup rtmplite demo I

1) Make sure the server has at least Python 2.6; otherwise, install a newer version;

2) Download the rtmplite package from ; extract it and navigate to the folder;

3) Start the server with default options and debug trace by issuing the command: python -d

4) Open the link in a browser and connect to rtmp:// ; for the NetStream name, just use the default "user1", then click "Publish" to send the data to the server. If "Enable recording" is selected, an FLV file will be created to record the content.

5) Open the same link as in 4) in another browser window and connect to the same application domain rtmp:// . Use the same stream name and click "Play"; we can then play back the recording made in the previous window.