[Chicago] understanding unicode problems

Fri Nov 16 17:36:14 CET 2007

Kumar McMillan wrote:
> On Nov 16, 2007 9:07 AM, Carl Karsten <carl at personnelware.com> wrote:
>> Kumar McMillan wrote:
>>> I wrote up a little something about it when it finally clicked for me:
>>> http://farmdev.com/thoughts/23/what-i-thought-i-knew-about-unicode-in-python-amounted-to-nothing/
>>> (I was in the same spot, I knew I *should* use UTF-8 but wasn't sure
>>> how or why or what that even implied)
>> "However, it's not always possible to work with unicode all the time because not
>> everything supports it. As just one example, you'll need to create a wrapper
>> that temporarily encodes / decodes data when reading a csv file using the
>> standard csv module."
>>
>> Is there a standard way of encoding?
> 
> I suppose the standard way is to find all the boundaries of your
> application (where you accept strings from files or user input) and
> convert it all to unicode then deal with it everywhere internally as
> unicode.  Whenever you need to send output to stdout, a file,
> whatever, then you encode it.
> 
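
The boundary pattern described above might look roughly like this. It is a
minimal sketch, assuming Python 3 (which this thread predates), where str is
unicode text and bytes is a bytestring. The shout() helper and the BytesIO
object standing in for a file are hypothetical, used only for illustration.

```python
import io


def shout(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Hypothetical helper: decode on the way in, encode on the way out."""
    text = raw.decode(encoding)    # boundary in: bytes -> unicode text
    text = text.upper()            # internal work happens on text only
    return text.encode(encoding)   # boundary out: unicode text -> bytes


# BytesIO stands in for a real file or socket that hands us raw bytes.
incoming = io.BytesIO("caf\u00e9\n".encode("utf-8"))
result = shout(incoming.read())
```

Nothing inside shout() other than its first and last lines needs to know
which encoding the outside world uses.
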
>> A string (unicode or not) is a bunch of bytes.  unicode chars may use more than
>> one byte.
> 
> unicode is actually represented internally as "code points" (abstract
> character numbers); it has no particular byte encoding while it's "unicode."

Um, what's a "code point"?  And what are you calling "bytes"?  In my
vocabulary, everything is stored as a set of bytes, those 8-bit things that the
CPU reads and writes to RAM and disk drives.

> 
>> What I don't understand:  Why do I need to encode / decode?
> 
> Because you can't write unicode to a file, for example.  A file
> contains bytes, and unicode can be turned into bytes in many different
> ways, one per encoding.  When you encode unicode as UTF-8 the bytestring
> will look different than if you encode it as LATIN-1.  The reason this is
> so confusing is that Python will **try** to do the encoding/decoding for
> you automatically, using its default ASCII codec.  This is also why the
> errors you see are often very confusing (if you don't know Python is
> doing this under the hood).
> 

This will make more sense once I get a grip on what a byte is.
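
Here is a minimal sketch of the points above, assuming Python 3. Unlike the
Python of this thread, Python 3 does no automatic conversion, so the failures
are explicit. BytesIO stands in for a file opened in binary mode.

```python
import io

text = "\u00e9"                      # e with an acute accent

as_utf8 = text.encode("utf-8")       # two bytes
as_latin1 = text.encode("latin-1")   # one byte, and a different one

buf = io.BytesIO()
try:
    buf.write(text)                  # unicode text cannot go in directly
    wrote_text = True
except TypeError:
    wrote_text = False
buf.write(as_utf8)                   # encoded bytes can

try:
    text.encode("ascii")             # ASCII has no byte for this character
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Decoding with the wrong codec does not fail; it silently gives wrong text.
mojibake = as_utf8.decode("latin-1")
```

The UnicodeEncodeError here is the same kind of error that the automatic
ASCII conversion would raise behind the scenes.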

>>  I get
>> the feeling the error caused is a reminder "so that you know that you need to do
>> the other operation later."
> 
> If you post a little bit more of the error I can try to give some
> specific suggestions for solving it.  I wasn't clear exactly what code
> was raising the exception you posted earlier.

The code that errored wasn't mine; it was Paul's, and I think he fixed it.
I am back to helping flesh out your unicode talk :)

Carl K