[Chicago] understanding unicode problems

Feihong Hsu hsu.feihong at yahoo.com
Fri Nov 16 16:57:10 CET 2007


Carl, that's a pretty good question.

There's probably no good, complete answer that can be given in a short email post. Basically, there's supposed to be a standard encoding for Unicode text: UTF-8. However, go to google.cn, for instance, and you'll see that it's served as GB2312. That alone tells you there are millions (billions?) of people using other encodings. So no, in practice there's no single encoding that is the standard. I think UTF-8 is the way to go because it's the closest thing to a standard right now, and also because it's byte-for-byte identical to ASCII as long as the text contains only ASCII characters (everything else takes more than one byte).

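So we have to encode/decode because there is no standard encoding yet. Text has to be turned into bytes before it can be stored or sent anywhere, and whoever reads those bytes has to know which encoding was used in order to turn them back into text. That's why GB2312 and all those other bizarro encodings are packed into the Python standard library.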
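
To make that a bit more concrete, here's a rough sketch of the same two-character string run through two different encodings. It's written in Python 3 syntax (an assumption on my part, where str is Unicode text and bytes is encoded data), and the variable names are made up for illustration:

```python
# Python 3 syntax (assumed): str is Unicode text, bytes is encoded data.
text = "\u4e2d\u6587"      # two Chinese characters, written as escapes
ascii_text = "hello"       # plain ASCII text for comparison

# Encoding turns text into bytes; the result depends on the codec chosen.
utf8_bytes = text.encode("utf-8")
gb_bytes = text.encode("gb2312")
print(len(utf8_bytes), len(gb_bytes))  # same text, different byte counts

# For pure-ASCII text, UTF-8 produces exactly the ASCII bytes.
ascii_compatible = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# Decoding with the codec that was used for encoding recovers the text.
round_trip_ok = (utf8_bytes.decode("utf-8") == text
                 and gb_bytes.decode("gb2312") == text)

# Decoding with the wrong codec fails here (in other cases it can
# silently produce the wrong characters instead).
wrong_codec_failed = False
try:
    gb_bytes.decode("utf-8")
except UnicodeDecodeError:
    wrong_codec_failed = True

print(ascii_compatible, round_trip_ok, wrong_codec_failed)
```

The point is just that the bytes by themselves don't say which encoding produced them, so you have to know (or be told) which one to decode with.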

Maybe someday a bunch of tech-savvy pop stars will get together and put out a rendition of "We Are the World" to raise awareness about this whole text encoding mess, and we will finally unite under one glorious standard. On that note, does anyone have Bono's cell phone number?

- Feihong

Carl Karsten <carl at personnelware.com> wrote:
Kumar McMillan wrote:
> On Nov 15, 2007 4:13 PM, Carl Karsten  wrote:
>> of course now a unicode problem just hit me.
>>
>> i use the  django admin to enter  Ivan Krstic'
>> and reportlab spits out: http://dev.personnelware.com/carl/a/IvanK1.pdf
>>
>> so pretty much 100% python.
>>
>> I am told:
>>
>>  > Make sure that you are using utf-8 and not some other encoding, such as
>>  > latin-1.
>>
>> But I really don't know what that means, nor do I even know how to debug this.
> 
> I wrote up a little something about it when it finally clicked for me:
> http://farmdev.com/thoughts/23/what-i-thought-i-knew-about-unicode-in-python-amounted-to-nothing/
> (I was in the same spot, I knew I *should* use UTF-8 but wasn't sure
> how or why or what that even implied)

"However, it's not always possible to work with unicode all the time because not 
everything supports it. As just one example, you'll need to create a wrapper 
that temporarily encodes / decodes data when reading a csv file using the 
standard csv module."

Is there a standard way of encoding?

A string (unicode or not) is a bunch of bytes.  unicode chars may use more than 
one byte.  What I don't understand:  Why do I need to encode / decode?  I get 
the feeling the error caused is a reminder "so that you know that you need to do 
the other operation later."

Carl K
_______________________________________________
Chicago mailing list
Chicago at python.org
http://mail.python.org/mailman/listinfo/chicago


       