how to transfer my utf8 code saved in a file to gbk code
higer
higerinbeijing at gmail.com
Sun Jun 7 22:32:59 EDT 2009
On Jun 7, 11:25 pm, John Machin <sjmac... at lexicon.net> wrote:
> On Jun 7, 10:55 pm, higer <higerinbeij... at gmail.com> wrote:
>
> > My file contains strings like this:
> > \xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a
>
> Are you sure? Does that occupy 9 bytes in your file or 36 bytes?
>
It was saved in a file, so it occupies 36 bytes. If I just put this
string in a variable, it certainly produces the correct result, but how
do I get the right answer when reading it from a file?
>
>
> > I want to read the content of this file and convert it to the
> > corresponding GBK encoding, a Chinese character encoding.
> > Every time I try to convert it, the output is the same no
> > matter which method I use.
> > It seems that when Python reads the file, it treats '\' as an
> > ordinary character, so the string ends up represented as
> > "\\xe6\\x97\\xa5\\xe6\\x9c\\x9f\\xef\\xbc\\x9a". The "\" is then
> > output 'correctly', but that is not what I want.
>
> > Can anyone help me?
>
> try this:
>
> utf8_data = your_data.decode('string-escape')
> unicode_data = utf8_data.decode('utf8')
> # unicode derived from your sample looks like this 日期: is that what
> you expected?
You are right, the result is 日期, which is just what I expected. If you
save the string in a variable, you can certainly get the correct result.
But it is just a sample, so I gave a short string; what if there are
many characters in a file?
> gbk_data = unicode_data.encode('gbk')
>
I have tried the method you just described, but unfortunately it
does not work (the output is garbled).
> If that "doesn't work", do three things:
> (1) give us some unambiguous hard evidence about the contents of your
> data:
> e.g. # assuming Python 2.x
My Python version is 2.5.2
> your_data = open('your_file.txt', 'rb').read(36)
> print repr(your_data)
> print len(your_data)
> print your_data.count('\\')
> print your_data.count('x')
>
The result is:
'\\xe6\\x97\\xa5\\xe6\\x9c\\x9f\\xef\\xbc\\x9a'
36
9
9
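A quick sketch of what these numbers mean. It is written for Python 3
rather than the 2.5.2 used in this thread (an assumption, made only so
the snippet runs on a current interpreter), and the variable names are
illustrative. The file holds the literal characters backslash, "x", and
two hex digits for each byte, so each of the 9 UTF-8 bytes takes 4 bytes
on disk:

```python
# The 36 bytes as they sit in the file: a literal backslash, 'x', two hex digits.
escaped = b"\\xe6\\x97\\xa5\\xe6\\x9c\\x9f\\xef\\xbc\\x9a"

# The 9 real UTF-8 bytes that those escapes describe.
real = b"\xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a"

# Same diagnostics as in the thread: length, backslash count, 'x' count.
print(len(escaped), escaped.count(b"\\"), escaped.count(b"x"))  # 36 9 9
print(len(real))  # 9

# Only the 9-byte form is valid UTF-8 for the intended text (3 characters).
text = real.decode("utf-8")
print(len(text))  # 3
```

So the repr() output above confirms that the file stores escape
sequences as text, not the encoded bytes themselves.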
> (2) show us the source of the script that you used
def UTF8ToChnWords():
    # Read the raw bytes and print the diagnostics requested above.
    f = open("123.txt","rb")
    content=f.read()
    f.close()
    print repr(content)
    print len(content)
    print content.count("\\")
    print content.count("x")

if __name__ == '__main__':
    UTF8ToChnWords()
> (3) Tell us what "doesn't work" means in this case
It doesn't work because no matter how we deal with it, we keep getting
a 36-byte string, not a 9-byte one. Thus, we cannot get the correct
answer.
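For reference, a minimal sketch of the whole conversion applied to file
content instead of a variable. It assumes Python 3 (not the 2.5.2 used
in this thread), where the 'string-escape' codec no longer exists, so
the escapes are interpreted with 'unicode_escape' followed by a latin-1
round trip. The function name escaped_utf8_to_gbk and the temporary
stand-in for 123.txt are hypothetical, and the approach assumes the file
contains only ASCII escape text like the sample:

```python
import os
import tempfile


def escaped_utf8_to_gbk(raw):
    """Convert backslash-x escape text (as bytes) to GBK-encoded bytes."""
    # Step 1: interpret the escapes. unicode_escape turns the four
    # characters \xe6 into U+00E6, and latin-1 maps U+00E6 to byte 0xE6.
    utf8_bytes = raw.decode("unicode_escape").encode("latin-1")
    # Step 2: the recovered bytes are UTF-8, so decode them to text.
    text = utf8_bytes.decode("utf-8")
    # Step 3: encode the text as GBK.
    return text.encode("gbk")


# Hypothetical stand-in for the poster's 123.txt, holding the 36-byte sample.
path = os.path.join(tempfile.mkdtemp(), "123.txt")
with open(path, "wb") as f:
    f.write(b"\\xe6\\x97\\xa5\\xe6\\x9c\\x9f\\xef\\xbc\\x9a")

# Read the whole file as bytes, exactly as the script above does.
with open(path, "rb") as f:
    content = f.read()

gbk_data = escaped_utf8_to_gbk(content)
print(len(content), len(gbk_data))  # 36 6
```

On Python 2.x the same three steps are the ones John suggested, applied
to the bytes returned by f.read():
content.decode('string-escape').decode('utf8').encode('gbk'). The GBK
bytes display correctly only on a terminal or in a viewer that uses GBK;
elsewhere they will look garbled even though the conversion succeeded.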
>
> Cheers,
> John
Thank you very much,
higer