[Chicago] understanding unicode problems

Fri Nov 16 17:29:05 CET 2007

Feihong Hsu wrote:
> Carl, that's a pretty good question.
> 
> There's probably no good, complete answer that can be given in a short 
> email post. 

I'll bet you can do it in a PyCon session :)

> Basically, there's supposed to be a standard encoding for
> unicode: UTF-8. However, go to google.cn for instance and you'll see 
> that it's GB2312. That alone tells you there are millions (billions?) of 
> people using other encodings. So no, in practice there's no single 
> encoding that is the standard. I think UTF-8 is the way to go because 
> it's the closest to being the standard right now, and also it's 
> identical to ASCII when you aren't dealing with multibyte characters.
> 
> So we have to encode/decode because there is no standard encoding yet. 

I am not sure I understand why that means we have to.

Hmm, OK, so maybe it is more than a reminder; it is a safety net: "you are about
to put raw bytes into a place that normally only has printable characters. This
may cause someone else a headache."
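
To make that safety net concrete for myself, here is a minimal sketch. It is
written in Python 3 syntax (an assumption on my part, chosen so it runs on a
current interpreter), and the variable names are only illustrative. It takes
the name from the original report and shows that the same bytes turn into
different text depending on which codec is used to decode them, which seems to
be why nothing can pick the codec for me:

```python
# The text itself: a sequence of code points. U+0107 is the accented "c".
name = "Ivan Krsti\u0107"

# One concrete byte representation of that text.
utf8_bytes = name.encode("utf-8")        # b'Ivan Krsti\xc4\x87'

# Decoding with the matching codec recovers the original text.
round_trip = utf8_bytes.decode("utf-8")

# Decoding with the wrong codec raises no error here; it quietly yields
# different text, because latin-1 maps every byte to some character.
misread = utf8_bytes.decode("latin-1")

print(repr(round_trip))
print(repr(misread))
```

The bytes carry no label saying which encoding produced them, so the explicit
encode / decode step is where that information gets supplied.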

> That's why GB2312 and all those other bizarro encodings are packed into 
> the Python standard library.
> 
> Maybe someday a bunch of tech-savvy pop stars will get together and put 
> out a rendition of "We Are the World" to raise awareness about this 
> whole text encoding mess, and we will finally unite under one glorious 
> standard. Along that note, does anyone have Bono's cell phone number?

Encodings are all just different ways of representing the same values, right?

Kinda like "number of months in the year" can be shown as 12, 1100, c, XII, right?
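
Here is a rough sketch of that analogy, again in Python 3 syntax (my
assumption) with made-up variable names. The first part parses the decimal,
binary, and hex notations for twelve. The second part encodes one Chinese
character with three codecs that ship with Python. The last part shows where
the analogy strains: unlike number notations, not every encoding can represent
every character.

```python
# Three notations, one number: decimal, binary, and hexadecimal.
months = [int("12"), int("1100", 2), int("c", 16)]

# Three encodings, one character: each codec picks different bytes for U+4E2D.
text = "\u4e2d"
codecs_to_try = ("utf-8", "gb2312", "utf-16-be")
encoded = {codec: text.encode(codec) for codec in codecs_to_try}

# Each byte string decodes back to the same text, but only with its own codec.
decoded = {codec: data.decode(codec) for codec, data in encoded.items()}

# Where the analogy strains: latin-1 has no way to write this character at all.
try:
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(months)
print(encoded)
print(latin1_ok)
```

So the values are the same, but the reader has to know which notation was used
before the bytes mean anything.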

Carl K

> 
> - Feihong
> 
> Carl Karsten <carl at personnelware.com> wrote:
> 
>     Kumar McMillan wrote:
>      > On Nov 15, 2007 4:13 PM, Carl Karsten wrote:
>      >> of course now a unicode problem just hit me.
>      >>
>      >> i use the django admin to enter Ivan Krstic'
>      >> and reportlab spits out:
>      >> http://dev.personnelware.com/carl/a/IvanK1.pdf
>      >>
>      >> so pretty much 100% python.
>      >>
>      >> I am told:
>      >>
>      >> > Make sure that you are using utf-8 and not some other encoding,
>      >> > such as latin-1.
>      >>
>      >> But I really don't know what that means, nor do I even know how
>      >> to debug this.
>      >
>      > I wrote up a little something about it when it finally clicked for me:
>      > http://farmdev.com/thoughts/23/what-i-thought-i-knew-about-unicode-in-python-amounted-to-nothing/
>      > (I was in the same spot, I knew I *should* use UTF-8 but wasn't sure
>      > how or why or what that even implied)
> 
>     "However, it's not always possible to work with unicode all the time
>     because not everything supports it. As just one example, you'll need
>     to create a wrapper that temporarily encodes / decodes data when
>     reading a csv file using the standard csv module."
> 
>     Is there a standard way of encoding?
> 
>     A string (unicode or not) is a bunch of bytes. unicode chars may use
>     more than one byte. What I don't understand: Why do I need to encode /
>     decode? I get the feeling the error caused is a reminder "so that you
>     know that you need to do the other operation later."
> 
>     Carl K
>     _______________________________________________
>     Chicago mailing list
>     Chicago at python.org
>     http://mail.python.org/mailman/listinfo/chicago
> 
> 