[Chicago] understanding unicode problems
Carl Karsten
carl at personnelware.com
Fri Nov 16 17:36:14 CET 2007
Kumar McMillan wrote:
> On Nov 16, 2007 9:07 AM, Carl Karsten <carl at personnelware.com> wrote:
>> Kumar McMillan wrote:
>>> I wrote up a little something about it when it finally clicked for me:
>>> http://farmdev.com/thoughts/23/what-i-thought-i-knew-about-unicode-in-python-amounted-to-nothing/
>>> (I was in the same spot, I knew I *should* use UTF-8 but wasn't sure
>>> how or why or what that even implied)
>> "However, it's not always possible to work with unicode all the time because not
>> everything supports it. As just one example, you'll need to create a wrapper
>> that temporarily encodes / decodes data when reading a csv file using the
>> standard csv module."
>>
>> Is there a standard way of encoding?
>
> I suppose the standard way is to find all the boundaries of your
> application (where you accept strings from files or user input) and
> convert it all to unicode then deal with it everywhere internally as
> unicode. Whenever you need to send output to stdout, a file,
> whatever, then you encode it.
>
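The "decode at the boundary, encode on the way out" idea can be sketched roughly as below. This is only an illustration in Python 3 spelling (where str is unicode text and bytes is the bytestring); the helper names read_text and write_text are made up for this example, and an in-memory io.BytesIO stands in for a real file.

```python
import io

def read_text(raw_stream, encoding="utf-8"):
    """Input boundary: turn incoming bytes into text (hypothetical helper)."""
    return raw_stream.read().decode(encoding)

def write_text(raw_stream, text, encoding="utf-8"):
    """Output boundary: turn text back into bytes (hypothetical helper)."""
    raw_stream.write(text.encode(encoding))

# Stand-in for a file on disk that holds UTF-8 encoded bytes.
incoming = io.BytesIO("caf\u00e9".encode("utf-8"))

text = read_text(incoming)   # decoded once, at the edge
shouted = text.upper()       # all internal work happens on text

outgoing = io.BytesIO()
write_text(outgoing, shouted)  # encoded once, on the way out
```

The point of the design is that only the two helpers need to know which encoding is in use; everything between them handles text.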
>> A string (unicode or not) is a bunch of bytes. unicode chars may use more than
>> one byte.
>
> unicode is actually represented internally as "code points;" it's not
> stored in bytes while it's "unicode."
Um, what's a "code point"? And what are you calling "bytes"? In my
vocabulary, everything is stored as a set of bytes, those 8-bit things that the
CPU reads and writes to RAM and disk drives.
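A small sketch may help separate the two terms. A code point is the number Unicode assigns to a character; the bytes are whatever a particular encoding turns that number into. The interpreter does of course keep text in RAM as bytes of some kind, but that internal layout is an implementation detail that the program does not see. The example uses Python 3 spelling and only built-in calls.

```python
s = "\u00e9"  # one character: LATIN SMALL LETTER E WITH ACUTE

# The code point is an abstract number, independent of any encoding.
code_points = [ord(ch) for ch in s]

# The same code point becomes different bytes under different encodings.
utf8 = s.encode("utf-8")       # two bytes
latin1 = s.encode("latin-1")   # one byte
utf16 = s.encode("utf-16-le")  # two bytes, different from the UTF-8 ones

print(code_points, utf8, latin1, utf16)
```

One character, one code point (233), and three different byte sequences depending on the encoding chosen.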
>
>> What I don't understand: Why do I need to encode / decode?
>
> Because you can't write unicode to a file, for example. A file
> contains bytes, and unicode text has no single byte representation;
> the bytes depend on the encoding you choose. When you encode unicode
> as UTF-8 the bytestring will look different than if you encode it as
> LATIN-1. The reason this is so confusing is that
> Python will **try** to do the encoding/decoding for you automatically.
> This is also why the errors you see are often very confusing (if you
> don't know Python is doing this under the hood).
>
This will make more sense once I get a grip on what a byte is.
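For illustration, the errors in question can be reproduced explicitly. In Python 2, mixing unicode and byte strings triggers an implicit conversion through the ascii codec, which is where the surprising exceptions come from. The sketch below uses Python 3 spelling and makes the same conversions by hand; the helper names try_ascii_encode and try_ascii_decode are invented for this example.

```python
text = "caf\u00e9"           # unicode text
data = text.encode("utf-8")  # the same text as UTF-8 bytes

def try_ascii_encode(value):
    """Encode text with the ascii codec; report the error name on failure."""
    try:
        return value.encode("ascii")
    except UnicodeEncodeError as exc:
        return type(exc).__name__

def try_ascii_decode(value):
    """Decode bytes with the ascii codec; report the error name on failure."""
    try:
        return value.decode("ascii")
    except UnicodeDecodeError as exc:
        return type(exc).__name__

encode_result = try_ascii_encode(text)  # text -> bytes fails on the accented e
decode_result = try_ascii_decode(data)  # bytes -> text fails on byte 0xc3
print(encode_result, decode_result)
```

Pure ASCII input passes through both helpers without complaint, which is why this kind of bug tends to stay hidden until the first accented character shows up.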
>> I get
>> the feeling the error caused is a reminder "so that you know that you need to do
>> the other operation later."
>
> if you post a little bit more of the error I can try and give some
> specific suggestions for solving it. I wasn't clear exactly what code
> was raising the exception you posted earlier.
The code that errored wasn't mine; it was Paul's, and I think he fixed it. I am
back to helping flesh out your unicode talk :)
Carl K