Wednesday, November 30, 2011

[HTK] Chinese encoding

HTK can directly read "gbk"-encoded MLF, dictionary, and similar files. In fact, it can read files in any encoding. HTK simply reads the input byte by byte (as char), and when it prints the strings back out, each byte outside the printable ASCII range is written in the form "\abc", where abc is the three-digit octal representation of the byte value (= a*64 + b*8 + c).
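
As a small illustration of that formula, here is a sketch in Python 3 that produces the same kind of "\abc" text for a GBK string. The helper name octal_escape is made up for this example, and passing plain ASCII bytes through unchanged is a simplifying assumption about HTK's output, not a copy of its actual code.

```python
def octal_escape(text, encoding="gbk"):
    """Hypothetical helper: render the non-ASCII bytes of `text` as \\abc escapes."""
    out = []
    for byte in text.encode(encoding):
        if byte < 0x80:
            # Assumption: ASCII bytes are passed through unchanged.
            out.append(chr(byte))
        else:
            # Split the byte value into three octal digits: byte = a*64 + b*8 + c
            a, b, c = byte // 64, (byte // 8) % 8, byte % 8
            out.append("\\%d%d%d" % (a, b, c))
    return "".join(out)


# The GBK bytes of the two characters are C4 E3 BA C3.
print(octal_escape("你好"))  # \304\343\272\303
```

For example, the byte 0xC4 is 196 = 3*64 + 0*8 + 4, so it shows up as \304.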

Thus, to convert HTK-generated files back to readable characters, we need the following steps:
1) convert the HTK octal representation of the byte values back to a byte array
2) decode the byte array with the corresponding encoding (e.g. for Chinese, we could use "gbk")
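
The two steps above can be sketched for a single string as follows (Python 3). The names OCTAL_ESCAPE and decode_htk_escapes are only for illustration, and the sketch assumes the HTK output is pure ASCII with every escape written as a backslash plus exactly three octal digits.

```python
import re

# One escaped byte: a backslash followed by three octal digits (\000 to \377).
OCTAL_ESCAPE = re.compile(rb"\\([0-3][0-7]{2})")


def decode_htk_escapes(text, encoding="gbk"):
    """Hypothetical helper: turn \\nnn escapes back into bytes, then decode them."""
    raw = text.encode("ascii")  # assumes the HTK output itself is pure ASCII
    # Step 1: replace every \nnn with the single byte it stands for.
    raw = OCTAL_ESCAPE.sub(lambda m: bytes([int(m.group(1), 8)]), raw)
    # Step 2: decode the resulting byte sequence with the original encoding.
    return raw.decode(encoding)


print(decode_htk_escapes("\\304\\343\\272\\303") == "你好")  # True
```

Using a regular expression here means plain ASCII text mixed in with the escapes is kept as it is, which matters because the second byte of a GBK character can fall in the ASCII range.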

The following code converts an HTK-generated MLF to a readable "gbk"-encoded MLF file:

# Written for Python 3. The MLF written by HTK is plain ASCII, because
# non-ASCII bytes are escaped as \nnn.
with open('vom_utt_wlab.mlf', encoding='ascii') as fin, \
        open('vom_utt_wlab.gbk.mlf', 'w', encoding='gbk') as fout:
    in_label = False  # True between a '....lab"' header line and the closing '.'
    for line in fin:
        sr = line.strip()
        if sr.endswith('.lab"'):
            in_label = True
        elif sr == '.':
            in_label = False
        elif in_label and '\\' in sr:
            parts = sr.split('\\')  # each of parts[1:] starts with the octal digits of one byte
            bins = bytearray(parts[0].encode('ascii'))  # plain text before the first escape
            for itm in parts[1:]:
                bins.append(int(itm[:3], 8))  # each escape has exactly 3 octal digits, i.e. the form \nnn
                bins.extend(itm[3:].encode('ascii'))  # plain ASCII text that follows the escape
            sr = bins.decode('gbk')
        print(sr, file=fout)

Posted via email from Troy's posterous
