HTK can directly read "gbk"-encoded MLF, dictionary, and similar files. In fact, it can read a file in any encoding. HTK simply reads the input byte by byte (as char values), and when it prints them out, each byte that is not printable ASCII is written in the form "\abc", where abc is the octal representation of the byte value (= a*64 + b*8 + c).
Thus, to convert HTK-generated files back to readable characters, we need the following steps:
1) convert the HTK octal representation of the byte values to a byte array
2) decode the byte array with the corresponding encoding (e.g. for Chinese, we could use "gbk")
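The two steps above can be sketched in isolation as follows. This is a minimal Python 3 illustration, not part of HTK; the function names `htk_octal_to_bytes` and `htk_decode` are hypothetical, and it assumes the input consists only of three-digit `\nnn` escapes.

```python
def htk_octal_to_bytes(escaped: str) -> bytes:
    # Step 1: split on the backslashes and read each 3-digit group as octal,
    # i.e. digits a, b, c give the byte value a*64 + b*8 + c.
    out = bytearray()
    for item in escaped.strip("\\").split("\\"):
        out.append(int(item[:3], 8))
    return bytes(out)


def htk_decode(escaped: str, encoding: str = "gbk") -> str:
    # Step 2: decode the recovered bytes with the encoding of the original file.
    return htk_octal_to_bytes(escaped).decode(encoding)


# 3*64 + 2*8 + 6 = 214 (0xD6) and 3*64 + 2*8 + 0 = 208 (0xD0)
print(list(htk_octal_to_bytes(r"\326\320")))
```

For example, the GBK bytes 0xD6 0xD0 encode the character "中", so `htk_decode(r"\326\320")` gives that character back.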
The following is the code I used to convert an HTK-generated MLF to a readable "gbk"-encoded MLF file:
# Python 2 script (uses the "print >>file" syntax)
import codecs

fin=open('vom_utt_wlab.mlf')
fout=codecs.open('vom_utt_wlab.gbk.mlf', encoding='gbk', mode='w')
while True:
    sr=fin.readline()
    if sr=='':break # end of file
    sr=sr.strip()
    if sr.endswith('.lab"'):
        # start of one label entry: copy the file name line, then its labels
        print >>fout, sr
        while True:
            sr=fin.readline()
            if sr=='':break # stop at end of file even if the closing '.' is missing
            sr=sr.strip()
            if sr=='.':break # '.' terminates the entry
            if sr.startswith('\\'):
                lst=(sr.strip('\\')).split('\\') # get the list of octal representation of each byte
                bins=bytearray()
                for itm in lst:
                    # each octal number has exactly 3 digits, i.e. of the form \nnn
                    bins.append(int(itm[:3], 8))
                print >>fout, bins.decode('gbk')
            else:
                print >>fout, sr
        print >>fout, '.'
    else:
        # header or any other line: copy unchanged
        print >>fout, sr
fin.close()
fout.close()
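The script above only converts label lines that begin with a backslash. As a possible variant, here is a Python 3 sketch that replaces runs of `\nnn` escapes wherever they occur in a line, which would also cover label lines carrying extra fields such as start and end times. The names `unescape_line` and `convert_mlf` are hypothetical, and the sketch assumes the HTK output is plain ASCII apart from the octal escapes; other backslash escapes are left untouched.

```python
import re

# One or more consecutive \nnn escapes (octal 000-377, i.e. one byte each).
_OCTAL_RUN = re.compile(r"(?:\\[0-3][0-7]{2})+")


def unescape_line(line: str, encoding: str = "gbk") -> str:
    def _decode(match):
        # "\326\320" -> ["", "326", "320"] -> bytes -> decoded text
        tokens = match.group(0).split("\\")[1:]
        return bytes(int(tok, 8) for tok in tokens).decode(encoding)

    return _OCTAL_RUN.sub(_decode, line)


def convert_mlf(src_path: str, dst_path: str, encoding: str = "gbk") -> None:
    # Assumption: the HTK-generated file is pure ASCII.
    with open(src_path, encoding="ascii") as fin, \
         open(dst_path, "w", encoding=encoding) as fout:
        for line in fin:
            fout.write(unescape_line(line.rstrip("\n"), encoding) + "\n")
```

A call such as `convert_mlf('vom_utt_wlab.mlf', 'vom_utt_wlab.gbk.mlf')` would then mirror the script above. Decoding a whole run of escapes at once matters because a multi-byte character spans several `\nnn` groups.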