Thursday, May 24, 2012

[GSoC 2012] Before Week 1

GSoC 2012 officially started this Monday (21 May). Although the weekly reports are not expected to begin until next Monday, it seems worthwhile to give a brief overview of the preparations we have done during the community bonding period.

The project started with a group chat with our mentor James and the other student, Ronanki. Through that chat and the follow-up email discussions, the project has become much clearer to me. The major focuses of my project will be:

1) a web portal for collecting audio for automatic pronunciation evaluation;
2) an Android-based mobile automatic pronunciation evaluation app.

The core of these two applications is the edit distance grammar based automatic pronunciation evaluation using cmusphinx3, which will serve as both the foundation and the first priority.
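To make the edit distance idea concrete, below is a minimal sketch of the dynamic-programming computation involved, comparing an expected phoneme sequence against a recognized one. The function name and phoneme examples are illustrative only, not the project's actual code.

```python
def edit_distance(expected, recognized):
    """Levenshtein distance between two phoneme sequences."""
    m, n = len(expected), len(recognized)
    # dp[i][j] = cost of aligning expected[:i] with recognized[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # deletions
    for j in range(n + 1):
        dp[0][j] = j  # insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if expected[i - 1] == recognized[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # match/substitution
    return dp[m][n]

# Example: expected vs. recognized phonemes for the word "with"
print(edit_distance(["W", "IH", "DH"], ["W", "IH", "TH"]))  # -> 1
```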

The following are the preparations I have done during the bonding period:
  1. Tried out the basic wami-recorder demo on my school's server;
  2. Switched to rtmplite for audio recording. rtmplite is a Python implementation of a Flash RTMP server with the minimum support needed for real-time streaming and recording using AMF0. On the server side, the RTMP server daemon listens by default on TCP port 1935 for connections and streaming. On the client side, the user sets up a session with the server via NetConnection and uses NetStream for audio and video streaming as well as recording. The demo application has been set up at: http://talknicer.net/~li-bo/testClient/bin-debug/testClient.html (a small connectivity sketch appears after this list);
  3. Building on my understanding of the demo application, which does real-time streaming and recording of both audio and video, I started writing my own audio recorder, the key component of both the web-based audio data collection and the evaluation app. A basic version of the recorder is hosted at: http://talknicer.net/~li-bo/audioRecorder/audioRecorder.html . The current implementation includes:
    1. Distinguishing recordings from different users by user ID;
    2. Loading pre-defined text sentences for recording, which may be useful for data collection;
    3. Real-time audio recording;
    4. Playing back recordings from the server;
    5. Basic event control logic, such as preventing a user from recording and playing back at the same time (the state-guard sketch after this list mirrors this logic);
  4. Besides the recorder, I have also learned how to do alignment with cmusphinx3 from http://cmusphinx.sourceforge.net/wiki/sphinx4:sphinxthreealigner. Generating the phoneme alignment scores requires two alignment steps. The details of how to carry out the alignment can be found in my more tech-oriented posts (http://troylee2008.blogspot.com/2012/05/testing-cmusphinx3-alignment.html and http://troylee2008.blogspot.com/2012/05/cmusphinx3-phoneme-alignment.html) on my personal blog; a scripted wrapper is also sketched below.
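To go with the rtmplite setup in item 2, a quick sanity check is to confirm the daemon is actually accepting connections on TCP port 1935 before pointing the Flash client at it. This is just a connectivity sketch; the host name is a placeholder for wherever the server runs.

```python
import socket

def rtmp_server_up(host="localhost", port=1935, timeout=3.0):
    """Return True if something is accepting TCP connections on the RTMP port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print("rtmplite reachable:", rtmp_server_up())
```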
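The event control logic from item 3 boils down to a small state guard so that recording and playback can never run concurrently. The real recorder runs in the Flash client; this Python sketch, with hypothetical method names, only mirrors the logic.

```python
class RecorderGuard:
    """Allow only one of {recording, playing} to be active at a time."""
    IDLE, RECORDING, PLAYING = range(3)

    def __init__(self):
        self.state = self.IDLE

    def start_recording(self):
        if self.state != self.IDLE:
            return False  # refuse: a recording or playback is in progress
        self.state = self.RECORDING
        return True

    def start_playback(self):
        if self.state != self.IDLE:
            return False  # refuse: a recording or playback is in progress
        self.state = self.PLAYING
        return True

    def stop(self):
        self.state = self.IDLE
```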
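And for the alignment in item 4, each sphinx3_align pass can be scripted, which will matter once alignments have to run automatically on uploaded recordings. The flags below are standard sphinx3_align options, but every path and model name is a placeholder assumption, not the project's actual configuration.

```python
import subprocess

def align_pass(ctl_file, transcript, phsegdir,
               hmm="model/hmm", dictionary="model/cmudict.dict",
               filler="model/filler.dict", cepdir="feats"):
    """Run one sphinx3_align pass; call once per alignment step."""
    subprocess.check_call([
        "sphinx3_align",
        "-hmm", hmm,            # acoustic model directory (placeholder)
        "-dict", dictionary,    # pronunciation dictionary
        "-fdict", filler,       # filler (silence/noise) dictionary
        "-ctl", ctl_file,       # list of utterance IDs to align
        "-insent", transcript,  # reference transcriptions
        "-cepdir", cepdir,      # directory holding the feature files
        "-phsegdir", phsegdir,  # where phoneme segmentations are written
    ])
```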
Currently, the following things are ongoing:
  1. Setting up the server-side process to properly manage user recordings, i.e., distinguishing between users and between different utterances (a possible naming scheme is sketched after this list).
  2. Figuring out how to automatically convert the recorded server-side FLV files to WAV files once the user stops recording (see the conversion sketch after this list).
  3. Verifying the recording parameters against the resulting recording quality, while also taking network bandwidth into consideration.
  4. Incorporating waits on network events in the recorder. The current version does not wait for network events (such as connection setup and data packet transmission) to finish before processing the next user event, which often causes recordings to be clipped.
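For the first item, one straightforward convention, sketched here under assumed names and layout, is to encode the user ID and utterance ID in each recording's path so files from different users and utterances can never collide:

```python
import os
import time

def recording_path(root, user_id, utterance_id):
    """Build a unique per-user, per-utterance path for a recording.

    Assumed layout: <root>/<user_id>/<utterance_id>_<timestamp>.flv
    """
    user_dir = os.path.join(root, str(user_id))
    os.makedirs(user_dir, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    return os.path.join(user_dir, "%s_%s.flv" % (utterance_id, stamp))

print(recording_path("recordings", "user42", "sentence003"))
```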
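For the FLV-to-WAV conversion in the second item, ffmpeg can extract the audio track once the server closes the FLV file. The 16 kHz mono settings below are assumptions to be tuned together with the quality and bandwidth questions in item 3.

```python
import subprocess

def flv_to_wav(flv_path, wav_path, sample_rate=16000):
    """Extract mono PCM audio from a recorded FLV using ffmpeg."""
    subprocess.check_call([
        "ffmpeg",
        "-y",                     # overwrite the output if it exists
        "-i", flv_path,           # input FLV recorded by the RTMP server
        "-ar", str(sample_rate),  # resample for the recognizer
        "-ac", "1",               # downmix to a single channel
        wav_path,
    ])

flv_to_wav("recordings/user42/sentence003.flv",
           "recordings/user42/sentence003.wav")
```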
