how to transfer my utf8 code saved in a file to gbk code
metolone+gmane at gmail.com
Mon Jun 8 01:58:16 EDT 2009
"higer" <higerinbeijing at gmail.com> wrote in message
news:0c786326-1651-42c8-ba39-4679f3558660 at r13g2000vbr.googlegroups.com...
> On Jun 7, 11:25 pm, John Machin <sjmac... at lexicon.net> wrote:
>> On Jun 7, 10:55 pm, higer <higerinbeij... at gmail.com> wrote:
>> > My file contains such strings :
>> > \xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a
>> Are you sure? Does that occupy 9 bytes in your file or 36 bytes?
> It was saved in a file, so it occupies 36 bytes. If I just use a
> variable to hold this string, it certainly produces the correct
> result, but how do I get the right answer when reading from a file?
Did you create this file? If it is 36 characters long, it contains literal
backslash characters, not the 9 bytes that encode those three characters in
UTF-8. If you created the file yourself, show us the code.
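To make the difference concrete, here is a small sketch. It is written in Python 3 syntax, which is an assumption (this thread uses Python 2.x), and the variable names are made up for illustration. It compares the real UTF-8 bytes with the spelled-out escape text that the file appears to hold:

```python
# The three characters from the sample, written with \u escapes so the
# source file itself stays pure ASCII.
real_utf8 = "\u65e5\u671f\uff1a".encode("utf-8")

# What a 36-byte file would hold: the escape sequences typed out as text.
# The raw-bytes literal keeps every backslash as a literal backslash.
literal_text = rb"\xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a"

print(len(real_utf8))             # 9
print(len(literal_text))          # 36
print(literal_text.count(b"\\"))  # 9
print(literal_text.count(b"x"))   # 9
```

If the counts for your file match the second set of numbers, the file holds text that looks like escapes rather than the encoded bytes themselves.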
>> > I want to read the content of this file and convert it to the
>> > corresponding gbk code, a Chinese character encoding.
>> > Every time I tried to convert it, the output was the same no
>> > matter which method I used.
>> > It seems that when Python reads it, Python takes '\' as an
>> > ordinary char, so the string ends up represented as
>> > "\\xe6\\x97\\xa5\\xe6\\x9c\\x9f\\xef\\xbc\\x9a", and the "\" is then
>> > 'correctly' output, but that's not what I want to get.
>> > Can anyone help me?
>> try this:
>> utf8_data = your_data.decode('string-escape')
>> unicode_data = utf8_data.decode('utf8')
>> # unicode derived from your sample looks like this: 日期：
>> # is that what you expected?
> You are right, the result is 日期 which is just what I expected. If you
> save the string in a variable, you surely can get the correct result. But
> it is just a sample, so I gave a short string; what if there are many
> characters in a file?
>> gbk_data = unicode_data.encode('gbk')
> I have tried the method you just told me, but unfortunately it
> does not work (mess code).
How are you determining this is 'mess code'? How are you viewing the
result? You'll need to use a viewer that understands GBK encoding, such as
Notepad on Chinese Windows.
>> If that "doesn't work", do three things:
>> (1) give us some unambiguous hard evidence about the contents of your
>> file, e.g. # assuming Python 2.x
> My Python version is 2.5.2
>> your_data = open('your_file.txt', 'rb').read(36)
>> print repr(your_data)
>> print len(your_data)
>> print your_data.count('\\')
>> print your_data.count('x')
> The result is:
>> (2) show us the source of the script that you used
> def UTF8ToChnWords():
>     f = open("123.txt","rb")
>     content = f.read()  # read the whole file as raw bytes
>     print repr(content)
>     print len(content)
>     print content.count("\\")
>     print content.count("x")
utf8data = content.decode('string-escape')
unicodedata = utf8data.decode('utf8')
gbkdata = unicodedata.encode('gbk')
print repr(gbkdata)
open("456.txt", "wb").write(gbkdata)
The print should give:
'\xc8\xd5\xc6\xda\xa3\xba'
This is correct for GBK encoding. 456.txt should contain the 6 bytes of GBK
data. View the file with a program that understands GBK encoding.
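For anyone on Python 3, where the 'string-escape' codec no longer exists, the same pipeline can be sketched as below. This is only a sketch under assumptions: it swaps in 'unicode_escape' followed by a 'latin-1' encode to turn the \xNN text back into raw bytes, the helper name escaped_utf8_file_to_gbk is hypothetical, and it writes the sample into a temporary directory instead of touching your real 123.txt and 456.txt.

```python
import os
import tempfile


def escaped_utf8_file_to_gbk(src_path, dst_path):
    """Hypothetical helper: read \\xNN escape text, write GBK bytes."""
    with open(src_path, "rb") as src:
        escaped = src.read()
    # Interpret the literal \xNN sequences; each becomes one code point
    # below 256, so latin-1 maps them straight back to single bytes.
    utf8_data = escaped.decode("unicode_escape").encode("latin-1")
    text = utf8_data.decode("utf-8")   # bytes -> Unicode text
    gbk_data = text.encode("gbk")      # Unicode text -> GBK bytes
    with open(dst_path, "wb") as dst:
        dst.write(gbk_data)
    return gbk_data


# Recreate the 36-byte sample from this thread in a scratch directory.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "123.txt")
dst_path = os.path.join(workdir, "456.txt")
with open(src_path, "wb") as f:
    f.write(rb"\xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a")

result = escaped_utf8_file_to_gbk(src_path, dst_path)
print(repr(result))
print(os.path.getsize(dst_path))

# Check the output without a GBK-aware viewer: decode it back.
with open(dst_path, "rb") as f:
    round_trip = f.read().decode("gbk")
print(round_trip == "\u65e5\u671f\uff1a")
```

Decoding the output file back with the 'gbk' codec is a way to confirm the bytes are right even when the program used to view them shows garbage.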