[Chicago] understanding unicode problems

Feihong Hsu hsu.feihong at yahoo.com
Fri Nov 16 18:15:16 CET 2007

I learned a lot about how to handle Unicode in Python when I gave my talk on it back in March. So clearly, the best way to understand Unicode is to give a talk on it. That's why you should give the talk, Carl. We'll be here to help you out ;-)


Carl Karsten <carl at personnelware.com> wrote:
Kumar McMillan wrote:
> On Nov 16, 2007 9:07 AM, Carl Karsten wrote:
>> Kumar McMillan wrote:
>>> I wrote up a little something about it when it finally clicked for me:
>>> http://farmdev.com/thoughts/23/what-i-thought-i-knew-about-unicode-in-python-amounted-to-nothing/
>>> (I was in the same spot, I knew I *should* use UTF-8 but wasn't sure
>>> how or why or what that even implied)
>> "However, it's not always possible to work with unicode all the time because not
>> everything supports it. As just one example, you'll need to create a wrapper
>> that temporarily encodes / decodes data when reading a csv file using the
>> standard csv module."
>> Is there a standard way of encoding?
> I suppose the standard way is to find all the boundaries of your
> application (where you accept strings from files or user input) and
> convert it all to unicode then deal with it everywhere internally as
> unicode.  Whenever you need to send output to stdout, a file,
> whatever, then you encode it.
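
A rough sketch of that boundary pattern, using the csv case mentioned above. It assumes Python 3, where the csv module works on text directly, so the encode/decode wrapper reduces to naming an encoding when the byte stream is wrapped. The sample data and variable names are made up for illustration, and io.BytesIO stands in for a real file opened in binary mode.

```python
import csv
import io

# Bytes as they might sit on disk (hypothetical sample data, UTF-8 encoded).
raw = "name,city\nZo\u00eb,K\u00f6ln\n".encode("utf-8")

# Input boundary: decode while reading, so csv only ever sees text.
text_in = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline="")
rows = list(csv.reader(text_in))

# Internal work: plain text operations, no bytes involved.
shouted = [[cell.upper() for cell in row] for row in rows]

# Output boundary: encode only when writing back out.
out_bytes = io.BytesIO()
text_out = io.TextIOWrapper(out_bytes, encoding="utf-8", newline="")
csv.writer(text_out).writerows(shouted)
text_out.flush()
result = out_bytes.getvalue()
```

Everything between the two wrappers handles text only; the choice of encoding shows up in exactly two places.
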
>> A string (unicode or not) is a bunch of bytes.  unicode chars may use more than
>> one byte.
> unicode is actually represented internally as "code points;" it's not
> stored in bytes while it's "unicode."

Um, what's a "code point"? And what are you calling "bytes"? In my
vocabulary, everything is stored as a set of bytes, those 8-bit things that the
CPU reads and writes to RAM and disk drives.
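
One way to see the distinction, as a small sketch rather than a full answer: a code point is the number Unicode assigns to a character (U+00E9, i.e. 233, for "é"). The interpreter does keep those numbers in memory somehow, but that layout is an implementation detail; a defined sequence of bytes only exists once an encoding is chosen. The snippet assumes Python 3, where str holds text and bytes holds bytes (in the Python 2 of this thread those were the unicode and str types).

```python
text = "h\u00e9llo"   # 'é' written as an escape so the source stays ASCII

code_points = [ord(ch) for ch in text]   # one integer per character
utf8_bytes = text.encode("utf-8")        # bytes exist only after encoding

print(len(text))          # 5 code points
print(code_points)        # [104, 233, 108, 108, 111]
print(len(utf8_bytes))    # 6 bytes: 'é' takes two bytes in UTF-8
print(list(utf8_bytes))   # [104, 195, 169, 108, 108, 111]
```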

>> What I don't understand:  Why do I need to encode / decode?
> Because you can't write unicode to a file, for example.  A file
> contains bytes, and the same unicode text has a different byte
> representation in each encoding.  When you encode unicode as UTF-8 the
> bytestring will look different than if you encode it as LATIN-1.  The
> reason this is so confusing is that Python will **try** to do the
> encoding/decoding for you automatically.  This is also why the errors
> you see are often very confusing (if you don't know Python is doing
> this under the hood).
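
A short illustration of the "looks different" point, assuming Python 3. Note one difference from the Python 2 behaviour described above: Python 3 no longer converts between text and bytes automatically, so mixing them raises a TypeError instead of triggering a hidden ASCII encode or decode.

```python
text = "\u00e9"                       # 'é', a single code point

as_utf8 = text.encode("utf-8")        # b'\xc3\xa9' (two bytes)
as_latin1 = text.encode("latin-1")    # b'\xe9'     (one byte)

# Decoding with the wrong codec either fails outright...
try:
    as_latin1.decode("utf-8")
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True

# ...or silently produces the wrong text ("mojibake").
mojibake = as_utf8.decode("latin-1")  # 'Ã©'
```

The bytes alone do not say which encoding produced them, which is why the encoding has to be named explicitly at each boundary.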

This will make more sense once I get a grip on what a byte is.

>>  I get
>> the feeling the error caused is a reminder "so that you know that you need to do
>> the other operation later."
> if you post a little bit more of the error I can try and give some
> specific suggestions for solving it.  I wasn't clear exactly what code
> was raising the exception you posted earlier.

The code that errored wasn't mine. It was Paul's, and I think he fixed it.  I am
back to helping flesh out your unicode talk :)

Carl K