how to transfer my utf8 code saved in a file to gbk code

higer higerinbeijing at gmail.com
Sun Jun 7 22:32:59 EDT 2009


On Jun 7, 11:25 pm, John Machin <sjmac... at lexicon.net> wrote:
> On Jun 7, 10:55 pm, higer <higerinbeij... at gmail.com> wrote:
>
> > My file contains strings such as:
> > \xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a
>


> Are you sure? Does that occupy 9 bytes in your file or 36 bytes?
>

It was saved in a file, so it occupies 36 bytes. If I assign this
string to a variable in my code, the conversion certainly gives the
correct result, but how do I get the right answer when reading it from
a file?
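
To make the 9-versus-36 distinction concrete, here is a small sketch.
It is written in Python 3 syntax (the thread itself uses Python 2.x),
and the variable names are only illustrative.

```python
# Illustration only (Python 3 syntax; the thread itself uses Python 2.x).
# The variable names are made up for this sketch.

# Nine real bytes: the UTF-8 encoding of three Chinese characters.
real_utf8 = b"\xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a"

# Thirty-six ASCII characters: the same escapes stored as literal text,
# which is what ends up in a file when the escapes are typed out.
escaped_text = rb"\xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a"

print(len(real_utf8))               # 9
print(len(escaped_text))            # 36
print(escaped_text.count(b"\\"))    # 9
print(escaped_text.count(b"x"))     # 9
```

The counts for the second value match the diagnostics requested later
in this thread: 36 characters, 9 backslashes and 9 occurrences of 'x'.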

>
>
> > I want to read the content of this file and convert it to the
> > corresponding GBK encoding, a Chinese character encoding.
> > Every time I try to convert it, the output is the same no
> > matter which method I use.
> > It seems that when Python reads it, Python treats '\' as an
> > ordinary character, so the string ends up represented as
> > "\\xe6\\x97\\xa5\\xe6\\x9c\\x9f\\xef\\xbc\\x9a". The "\" is then
> > output 'correctly', but that is not what I want.
>
> > Can anyone help me?
>
> try this:
>
> utf8_data = your_data.decode('string-escape')
> unicode_data = utf8_data.decode('utf8')
> # unicode derived from your sample looks like this 日期: is that what
> # you expected?

You are right, the result is 日期, which is exactly what I expect. If
you save the string in a variable, you can surely get the correct
result. But it is just a sample, so I gave a short string. What if
there are many such characters in a file?
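
The decode steps suggested above do not depend on the length of the
data, so they can be applied to the whole content of a file at once.
Below is a minimal sketch in Python 3, where the 'string-escape' codec
no longer exists; decoding with 'unicode_escape' and re-encoding as
'latin-1' is used here as a stand-in, which assumes the data holds only
ASCII text and \xNN escapes. The function name is hypothetical.

```python
def escaped_utf8_to_gbk(data):
    """Convert bytes holding literal \\xNN escapes of UTF-8 text to GBK bytes.

    Assumption: `data` contains only ASCII characters and \\xNN escapes.
    """
    # Step 1: turn each 4-character escape such as \xe6 into one real byte.
    utf8_bytes = data.decode("unicode_escape").encode("latin-1")
    # Step 2: interpret those bytes as UTF-8 to get text.
    text = utf8_bytes.decode("utf-8")
    # Step 3: encode the text as GBK.
    return text.encode("gbk")


# The sample from the thread, repeated many times to mimic a longer file.
sample = rb"\xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a" * 1000
gbk_data = escaped_utf8_to_gbk(sample)
print(len(sample), len(gbk_data))
```

Each of the three sample characters takes two bytes in GBK, so the
output is much shorter than the escaped input.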

> gbk_data = unicode_data.encode('gbk')
>

I have tried the method you just described, but unfortunately it does
not work (the output is garbled).


> If that "doesn't work", do three things:
> (1) give us some unambiguous hard evidence about the contents of your
> data:
> e.g. # assuming Python 2.x

My Python version is 2.5.2.

> your_data = open('your_file.txt', 'rb').read(36)
> print repr(your_data)
> print len(your_data)
> print your_data.count('\\')
> print your_data.count('x')
>

The result is:

'\\xe6\\x97\\xa5\\xe6\\x9c\\x9f\\xef\\xbc\\x9a'
36
9
9

> (2) show us the source of the script that you used

def UTF8ToChnWords():
    # Read the raw bytes of the file and print the diagnostics asked for.
    f = open("123.txt", "rb")
    try:
        content = f.read()
    finally:
        f.close()
    print repr(content)
    print len(content)
    print content.count("\\")
    print content.count("x")

if __name__ == '__main__':
    UTF8ToChnWords()

> (3) Tell us what "doesn't work" means in this case

It doesn't work because no matter how I process the data, I still get
a 36-byte string rather than a 9-byte one. Thus, I cannot get the
correct answer.
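
Since the file stores each byte as a 4-character escape, the content
has to be unescaped after reading and before any UTF-8 or GBK step.
One possible file-to-file sketch in Python 3 follows. It swaps in a
regular expression for the 'string-escape' codec, so that only \xNN
sequences are touched. The function name and the output file name are
hypothetical, and the demo writes its own temporary copy of the sample.

```python
import os
import re
import tempfile

# Matches one literal escape such as \xe6 (backslash, 'x', two hex digits).
HEX_ESCAPE = re.compile(rb"\\x([0-9a-fA-F]{2})")


def convert_file(src_path, dst_path):
    """Read escaped UTF-8 from src_path and write GBK bytes to dst_path."""
    with open(src_path, "rb") as src:
        escaped = src.read()
    # Replace every 4-character escape with the single byte it names.
    utf8_bytes = HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), escaped)
    with open(dst_path, "wb") as dst:
        dst.write(utf8_bytes.decode("utf-8").encode("gbk"))
    # Return both lengths so the 36 -> 9 shrinkage can be checked.
    return len(escaped), len(utf8_bytes)


# Demo with a temporary copy of the 36-character sample from the thread.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "123.txt")
dst_path = os.path.join(workdir, "123_gbk.txt")
with open(src_path, "wb") as f:
    f.write(rb"\xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a")

print(convert_file(src_path, dst_path))  # (36, 9)
```

If the unescaped length is still 36, the unescape step did not run on
the data that was actually read from the file.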

>
> Cheers,
> John

Thank you very much,
higer


