Dream & Passion

Saturday, April 28, 2012

Testing wami-recorder

Installation and configuration:

1) Download Flex SDK from http://www.adobe.com/products/flex.html (http://www.adobe.com/devnet/flex/flex-sdk-download.edu.html)

After the download finishes, extract the archive to a directory and add the following two paths in either the ~/.profile or the ~/.bashrc file:

[Flex_sdk_path]/bin to $PATH

[Flex_sdk_path]/lib to $LD_LIBRARY_PATH

2) Check out the wami-recorder code using: hg clone https://code.google.com/p/wami-recorder/

Then navigate to the [wami-recorder] folder, which has two subfolders: example and src. Compile the client with the following command:

mxmlc -compiler.source-path=src -static-link-runtime-shared-libraries=true -output example/client/Wami.swf src/edu/mit/csail/wami/client/Wami.mxml

The command will generate a Wami.swf file under the example/client folder. Next we can start testing the wami recorder.

3) Testing

A) Upload both the client and the server PHP example to my own server and test with basic.html; both recording and playback work fine.

B) Change the recording target to a file on my server instead of the default one, which is on the wami group's server.

C) Instead of specifying an absolute path for wami to save the recording to, use a PHP file to save the recording.

D) Change the hard-coded file name to a variable that can be generated automatically.

E) Check the recording format, which is PCM, signed 16-bit integer, 22050 Hz sample rate. Only the sample rate differs from what we want, which is 16000 Hz. For now, the recording can be converted on the server with the command-line tool sox. I have already found the wami recorder interface for setting the recording parameters, but the code does not take effect yet.
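The server-side conversion mentioned in E) could look roughly like the sketch below. It assumes that sox is installed and on the PATH and that the recording is stored as a WAV file with a header, so sox can detect the input format by itself. The function names and file names are made up for illustration; they are not part of wami-recorder.

```python
import subprocess

def sox_resample_cmd(src, dst, rate=16000):
    """Build the sox command line that resamples src into dst at the given rate."""
    # Format options placed right before the output file apply to the output.
    return ["sox", src, "-r", str(rate), dst]

def resample(src, dst, rate=16000):
    """Run sox; raises CalledProcessError if the conversion fails."""
    subprocess.run(sox_resample_cmd(src, dst, rate), check=True)

# Hypothetical usage: resample("recording.wav", "recording_16k.wav")
```

Keeping the command construction separate from the call makes it easy to log or inspect the exact sox invocation before running it on the server.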

4) todos:

A) Get the recording parameter setup to work during wami recorder initialization

B) Try a better UI for the recorder, which currently uses the three basic buttons

Wednesday, April 25, 2012

GSoC 2012 Applications Accepted

When I saw the poster for Google Summer of Code in my department, it was already April 6th. Thanks to the time difference, I still had one day to apply before the deadline. Searching the list of projects with "speech recognition" as the keyword showed CMU Sphinx as the only result. It was great that there was something related to my research interests, which include acoustic modeling and speaker adaptation. While checking the CMU Sphinx project page, I was so excited to see the language learning project there. I had published a paper on that topic. That's what I will do! I contacted the mentor, James, for that project. He is really nice and gave me quite a lot of suggestions for my application. Also I have to thank Ronanki, who may not know that his well written project proposal helped me a lot with my application.

Finally, Ronanki's proposal, Pronunciation Evaluation using CMU Sphinx3, and mine, Accurate and Efficient Pronunciation Evaluation using CMUSphinx for Spoken Language Learning, were both accepted this Monday! Thanks so much to all the mentors and reviewers, and also to Google for giving us this great opportunity to work on open source projects.

Pronunciation learning is one of the most important parts of second language acquisition. The aim of this project is to use automatic speech recognition technology to facilitate learning spoken language and reading skills. Ronanki and I will work on the same pronunciation evaluation project with different focuses. Ronanki will focus on building the web-based pronunciation evaluation system with CMU Sphinx3. I will mainly focus on developing edit-distance-based mispronunciation detection grammars, collecting speech data, and maximizing the potential learner population by implementing a mobile application that works with our pronunciation evaluation system. We also plan to design and implement a game front end to make the learning process much more fun. My project involves four specific sub-tasks: automatic edit distance scoring grammar generation, exemplar pronunciation data collection, an Android app client implementation, and development of a game-based learning system.

As a first-time open source contributor, I have lots of things to learn. I believe we will have a great summer this year. Any comments or suggestions are appreciated. Thanks again to everyone who made this happen!

All the posts for GSoC 2012 will also appear in our team blog: http://pronunciationeval.blogspot.com/.

Tuesday, April 24, 2012

GSoC 2012

Finally, my proposal for GSoC 2012 got accepted!

Accurate and Efficient Pronunciation Evaluation using CMUSphinx for Spoken Language Learning

Thanks so much to my mentor James for his great suggestions on my hurried application!

Let's start doing something great!

Wednesday, February 1, 2012

WSJ Setup

WSJ SI84: 7138 sentences (after removing speaker 410's duplicate session from the original data)

WSJ SI284: 37416 sentences (after removing speaker 410's duplicate session from the data set)

Speaker 410 contributed two sessions of recordings and thus has twice as many sentences as the other speakers. During processing, we should remove 11-2.1/wsj0/si_tr_s/401 to avoid having too much data for a single speaker.

The 5K and 20K test sets for WSJ0 are the Nov.92 test sets.

The 64K test set of WSJ1 is the commonly mentioned Nov.93 test set.
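A minimal sketch of that removal step, assuming the training files are available as a list of paths. The helper name and the example file names in the comment are invented for illustration; only the excluded directory comes from the note above.

```python
def drop_duplicate_session(paths, excluded="11-2.1/wsj0/si_tr_s/401"):
    """Return the paths that do not lie under the excluded directory.

    The comparison ignores case and accepts either slash style.
    """
    needle = excluded.lower()
    return [p for p in paths if needle not in p.lower().replace("\\", "/")]

# Hypothetical usage:
# files = drop_duplicate_session(open("train.flist").read().split())
```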

http://books.google.com.sg/books?id=WJeAWYFa0i0C&pg=PA127&lpg=PA127&dq=WSJ+SI284+number+sentences&source=bl&ots=I2laYM2emM&sig=plZXYseK9n6umvawaWZ7vhOrhcw&hl=en&sa=X&ei=ueUoT5P5Jo-HrAfQ4oG9AQ&ved=0CGYQ6AEwBw

Posted via email from Troy's posterous

Wednesday, November 30, 2011

[HTK] Chinese Encoding

Modifying the HParse and HVite parts of the HTK source code to support Chinese

2010-03-24 12:05

From: http://hi.baidu.com/cbyhit2008/blog/item/e642d6c7c16bd4179c163d5b.html

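When building speech recognition models with the HTK toolkit, the corresponding low-level network cannot be generated properly if the task grammar contains Chinese, so parts of the HTK source code have to be modified. Below is my record of the changes I made to the HParse and HVite parts of the HTK source. I hope it helps anyone who needs it, and it also serves as my own backup!
Add the following function: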
static int IsSpace(char c)
{
if ((c == 0x09) ||( c == 0x0D) || (c == ' ' ))
return 1;
return 0;
}
Modify the following functions:
static void PGetSym(void)
{
..../////////////
+++while ( IsSpace(ch) || (ch=='/' && inlyne[curpos]=='*') ) /* was: isspace((int) ch) */
{
+++    if (IsSpace(ch)) /* skip space; was: isspace((int) ch) */
PGetCh();
else {            /* skip comment */
PGetCh(); PGetCh();
while (!(ch=='*' && inlyne[curpos]=='/')) PGetCh();
PGetCh(); PGetCh();
}
}
..../////////////the rest of the code below is unchanged
}

static void PGetIdent(void)
{
int i=0;
Ident id;

do {
if (ch==ESCAPE) PGetCh();
if (i<MAXIDENT) id[i++]=ch;
PGetCh();
+++ } while (!IsSpace(ch)&& ch!='{' && ch!='}' && ch!='[' && ch!=']' &&//!isspace( (int)ch)
ch!='<' && ch!='>' && ch!='(' && ch!=')' && ch!='=' &&
ch!=';' && ch!='|' && ch!='/' && ch!='%');
id[i]='\0';
ident = GetLabId(id,TRUE);
}

ReturnStatus WriteOneLattice(Lattice *lat,FILE *file,LatFormat format)
{
...///////////////////////////////
else if (ln->word!=NULL) {
fprintf(file,"W=%-19s ",ln->word->wordName->name); /* write the raw name */
/* commented out:
   ReWriteString(ln->word->wordName->name,
                 NULL,ESCAPE_CHAR)); */
...////////////////////////////////
}
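With these changes, the generated low-level network shows the Chinese characters themselves instead of their byte codes. Below is a simple example I tested.
This is the content of taskgram: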
$word = 好
| 浩
| 尼
| 你;
( START_SIL ([sil] )(<$word>)( [sil]) END_SIL )
The network generated by the unmodified HParse:
VERSION=1.0
N=11   L=22
I=0    W=END_SIL
I=1    W=sil
I=2    W=\304\343
I=3    W=!NULL
I=4    W=\304\341
I=5    W=\272\306
I=6    W=\272\303
I=7    W=sil
I=8    W=START_SIL
I=9    W=!NULL
I=10   W=!NULL
J=0     S=1    E=0
J=1     S=3    E=0
J=2     S=3    E=1
J=3     S=3    E=2
J=4     S=7    E=2
J=5     S=8    E=2
J=6     S=2    E=3
J=7     S=4    E=3
J=8     S=5    E=3
J=9     S=6    E=3
J=10    S=3    E=4
J=11    S=7    E=4
J=12    S=8    E=4
J=13    S=3    E=5
J=14    S=7    E=5
J=15    S=8    E=5
J=16    S=3    E=6
J=17    S=7    E=6
J=18    S=8    E=6
J=19    S=8    E=7
J=20    S=10   E=8
J=21    S=0    E=9
The network after the modification:
VERSION=1.0
N=11   L=22
I=0    W=END_SIL
I=1    W=sil
I=2    W=你
I=3    W=!NULL
I=4    W=尼
I=5    W=浩
I=6    W=好
I=7    W=sil
I=8    W=START_SIL
I=9    W=!NULL
I=10   W=!NULL
J=0     S=1    E=0
J=1     S=3    E=0
J=2     S=3    E=1
J=3     S=3    E=2
J=4     S=7    E=2
J=5     S=8    E=2
J=6     S=2    E=3
J=7     S=4    E=3
J=8     S=5    E=3
J=9     S=6    E=3
J=10    S=3    E=4
J=11    S=7    E=4
J=12    S=8    E=4
J=13    S=3    E=5
J=14    S=7    E=5
J=15    S=8    E=5
J=16    S=3    E=6
J=17    S=7    E=6
J=18    S=8    E=6
J=19    S=8    E=7
J=20    S=10   E=8
J=21    S=0    E=9
As for HVite, it took me nearly a whole afternoon to find the place to change: modify the WriteString function in HShell.c
n=*p;
fputc(n,f);
//   fputc(ESCAPE_CHAR,f);
// fputc(((n/64)%8)+'0',f);fputc(((n/8)%8)+'0',f);fputc((n%8)+'0',f);
I commented out the corresponding lines and write each character directly to the file, so the Chinese text is now visible in the result files.


[HTK] Chinese encoding

HTK can directly read "gbk"-encoded MLF, dictionary, and similar files. In fact, it can read files in any encoding. HTK simply reads in every byte (char type), and when it prints the text out, each byte of a multi-byte character is written in the form "\abc", where abc is the octal representation of the byte value (= a*64 + b*8 + c).
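To make the "\abc" form concrete, here is a small sketch that reproduces the escaping in Python. It is not HTK's code: the function name is mine, and the rule that printable ASCII is passed through unchanged is an assumption based on how plain words such as "sil" appear in HTK output.

```python
def htk_escape(data):
    """Render bytes in the octal-escaped style described above."""
    out = []
    for b in data:
        # Assumption: printable ASCII other than the backslash is kept as-is.
        if 0x20 < b < 0x7F and b != 0x5C:
            out.append(chr(b))
        else:
            a, rest = divmod(b, 64)  # b == a*64 + (rest // 8)*8 + rest % 8
            out.append("\\%d%d%d" % (a, rest // 8, rest % 8))
    return "".join(out)

# The character 你 is the two bytes 0xC4 0xE3 in gbk.
print(htk_escape("你".encode("gbk")))  # prints \304\343
```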

Thus, to convert the HTK-generated files back to readable characters, we need the following steps:

1) convert the HTK octal representation of the byte values to a byte array

2) decode the byte array with the corresponding encoding (e.g. for Chinese, we could use "gbk")
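The two steps can be sketched for a single string as follows. This is an illustrative helper rather than anything from HTK: the function name is hypothetical, and using a regular expression means that escapes mixed with plain ASCII on the same line are handled too.

```python
import re

# Matches one HTK-style escape: a backslash followed by exactly 3 octal digits.
_OCTAL_ESCAPE = re.compile(rb"\\([0-7]{3})")

def htk_unescape(text, encoding="gbk"):
    """Step 1: turn each \\nnn escape into its byte. Step 2: decode the bytes."""
    raw = text.encode("latin-1")
    data = _OCTAL_ESCAPE.sub(lambda m: bytes([int(m.group(1), 8)]), raw)
    return data.decode(encoding)

print(htk_unescape(r"\304\343\272\303"))  # prints 你好
```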

The following Python 3 script converts the HTK-generated MLF into a readable "gbk"-encoded MLF file:

import codecs

# Input: HTK MLF with octal-escaped bytes; output: the same MLF encoded in gbk.
with open('vom_utt_wlab.mlf') as fin, \
     codecs.open('vom_utt_wlab.gbk.mlf', encoding='gbk', mode='w') as fout:
    while True:
        sr = fin.readline()
        if sr == '':  # end of file
            break
        sr = sr.strip()
        if sr.endswith('.lab"'):
            # start of one label entry; copy the file name line
            print(sr, file=fout)
            while True:
                sr = fin.readline().strip()
                if sr == '.':  # end of this label entry
                    break
                if sr.startswith('\\'):
                    # get the list of octal representations, one per byte
                    lst = sr.strip('\\').split('\\')
                    bins = bytearray()
                    for itm in lst:
                        # each escape has exactly 3 octal digits, i.e. the form \nnn
                        bins.append(int(itm[:3], 8))
                    print(bins.decode('gbk'), file=fout)
                else:
                    print(sr, file=fout)
            print('.', file=fout)
        else:
            print(sr, file=fout)


Thursday, November 24, 2011

[HTK] Increase HTK feature dimension limit

Each HTK feature file starts with a header that specifies the basic information about the parameters.

HTK format files consist of a contiguous sequence of samples preceded by a header. Each sample is a vector of either 2-byte integers or 4-byte floats. 2-byte integers are used for compressed forms as described below and for vector quantised data as described later in section 5.11. HTK format data files can also be used to store speech waveforms as described in section 5.8.

The HTK file format header is 12 bytes long and contains the following data

nSamples - number of samples in file (4-byte integer)

sampPeriod - sample period in 100ns units (4-byte integer)

sampSize - number of bytes per sample (2-byte integer)

parmKind - a code indicating the sample kind (2-byte integer)

http://blush.ee.columbia.edu/doc/HTKBook21/node58.html
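As an illustration of the 12-byte layout, the header can be unpacked with Python's struct module. This is only a sketch: it assumes the file was written in big-endian byte order, and the function name and dictionary keys are my own choice.

```python
import struct

# Two 4-byte integers followed by two 2-byte integers; ">" assumes big-endian.
HTK_HEADER = struct.Struct(">iihh")

def read_htk_header(data):
    """Parse the first 12 bytes of an HTK parameter file into a dict."""
    n_samples, samp_period, samp_size, parm_kind = HTK_HEADER.unpack(data[:12])
    return {
        "nSamples": n_samples,      # number of samples in the file
        "sampPeriod": samp_period,  # sample period in 100 ns units
        "sampSize": samp_size,      # bytes per sample
        "parmKind": parm_kind,      # code for the sample kind
    }

# Hypothetical usage:
# with open("feature.htk", "rb") as f:
#     header = read_htk_header(f.read(12))
```

For uncompressed float features, sampSize divided by 4 gives the feature dimension.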

From the above specification, sampSize is a 2-byte (short) integer, so its maximum value is 32767. For uncompressed data stored as 4-byte floats, the maximum dimension of each sample is thus 32767/4, i.e. 8191 after rounding down. However, usually even a feature of just 1000+ dimensions causes the HTK tools to report the following error:

OpenParmChannel: cannot read HTK Header in File

The reason is that the function ReadHTKHeader in the file HWave.c contains a check on the sampSize value:

if (hdr.sampSize <= 0 || hdr.sampSize > 5000 || hdr.nSamples <= 0 ||
    hdr.sampPeriod <= 0 || hdr.sampPeriod > 1000000)
   return FALSE;

That is to say, in HTK the dimension of the feature vector is limited by this check rather than by the data type specified in the header format. In the standard version of HTK, features of at most 1250 dimensions (5000/4) can be used. To raise the limit, we only need to change the number 5000, but remember that sampSize is a short integer, so changing it to any value larger than 32767 would be useless.

This check is at about line 1427 of the file HTKLib/HWave.c.
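The arithmetic behind these limits can be checked with a short sketch. The second function mirrors the condition quoted from ReadHTKHeader; both function names and the limit parameter are mine, introduced only for illustration.

```python
def max_float_dim(samp_size_limit, bytes_per_value=4):
    """Largest feature dimension whose sampSize still fits under the limit."""
    return samp_size_limit // bytes_per_value

def header_passes_check(n_samples, samp_period, samp_size, limit=5000):
    """Python mirror of the quoted sanity check (True means the header is accepted)."""
    return not (samp_size <= 0 or samp_size > limit or n_samples <= 0
                or samp_period <= 0 or samp_period > 1000000)

print(max_float_dim(5000))   # 1250, the limit in the standard HTK source
print(max_float_dim(32767))  # 8191, the ceiling imposed by the 2-byte sampSize
```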
