Dream & Passion: [GSoC 2012: Pronunciation Evaluation #Troy] Week 1

The first week of GSoC 2012 has already been a busy summer. Following are the stuff I managed to accomplish:

The server side rtmplite is now configured to save the recordings into the folder [path_to_webroot]/data on the server. And for the audioRecorder app, all the data will be under the [path_to_webroot]/data/audioRecorder folder and for each user there will be a separate folder (e.g. [path_to_webroot]/data/audioRecorder/user1). For each recording utterance, the file name is now in the format of [sentence name]_[quality level].flv
Till now the conversion from FLV to WAV is done purely on the server side inside rtmplite with Python's subprocess.Popen() function calling FFMPEG. After the rtmplite close the FLV file, the conversion will be carried out immediately and the converted WAV file has exactly the same path and name except the suffix, which is WAV instead of FLV. Thanks very much for Guillem to help me test "sox" to do the conversion. However, I failed to use sox directly to convert FLV to WAV in the terminal, which said "unknown format flv". If needed I will try to figure out whether it is the problem of the sox I have. James then pointed out we can do it inside rtmplite. This really helps. Yes, why must send an extra HTTP request to invoke a PHP process for the conversion?
To verify the recording parameter, i.e. the quality for speex encoding, I tried to record the same utterance ("NO ONE AT THE STATE DEPARTMENT WANTS TO LET SPIES IN") with the quality varying from 0 to 10. Apparently, the higher the quality, the larger the FLV file will be. From my own listening, the better the quality. However, it is hard to notice the differences above level 7. I also tried to generate alignment scores to see whether the quality affects the alignment. However, from the results shown in the following graph, the acoustic scores seems incomparable among different recordings. However, we will for now set the recording quality to 8.
For the audioRecorder, only when the NetConnection event and NetStream open and close events successfully finished will the UI and other events carry on. Also 0.5s delay is inserted into the starting and ending of the recording button click event to avoid clipping.

For the 2nd week,

Solve the problem encountered in converting FLV to WAV using FFMEPG with Python's Popen(). If the main Python script (call it test.py for now) is run in the terminal with "python test.py", then no problems, everything works great. However, if I want to put it in background and log off the server by doing "python test.py &", everytime when Popen() is invoked, the whole process hangs there with a "Stopped + test.py &" error message. I have tried different approaches found from web without success for two days. I will try to figure out a way to work around it. Otherwise, maybe turn to Guillem's suggestion and figure out whether sox works.
Finish the upload interface. There will be two kinds of interfaces: one for students and one for exemplar pronunciations. For the students': we want to display one to five phrases below space for a graphic or animation, assuming the smallest screen possible but with HTML which also looks good in a large window. For the exemplar, we just need to display one phrase but we should also have per-upload form fields (name, age, sex, native English speaker?, where lived ages 6-8 (determines accent), self-reported accent, etc.) which should persist across multiple uploads by the same user (with cookies?)
Testing the rtmplite for multiple users and the same user's multiple recording sessions.

Monday, May 28, 2012

[GSoC 2012: Pronunciation Evaluation #Troy] Week 1

No comments:

Post a Comment