[Chicago] understanding unicode problems

Fri Nov 16 19:05:44 CET 2007

On Nov 16, 2007 11:56 AM, Carl Karsten <carl at personnelware.com> wrote:
> Pete wrote:
> > On Friday November 16 2007 10:07:40 am Carl Karsten wrote:
> >> A string (unicode or not) is a bunch of bytes.  unicode chars may use more
> >> than one byte.  What I don't understand:  Why do I need to encode / decode?
> >>  I get the feeling the error caused is a reminder "so that you know that
> >> you need to do the other operation later."
> >
> > I like to think of it this way:
> >
> > "Unicode is the Platonic Ideal of text; Strings are the shadows on the wall."
> >
> > Feel free to quote me.
> >
> > Unicode characters exist in some abstract heavenly place.  They are about as
> > pure a representation of text as you could conceive - no glyphs (fonts), no
> > bytes, no memory representation (that you care about), nada.  Unicode can
> > contain all characters now in existence or that will ever be.
> >
> > Here on earth, we can't write such things to disk or email them around.  This
> > is where strings come in.  Strings (more properly termed bytestrings) are
> > simply that - a sequence of bytes.  They have an associated encoding, which
> > is basically the alphabet of legal bytes.  The string itself doesn't know its
> > encoding; you either need to be told that by an external mechanism or guess
> > (often both).
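
To make the "string doesn't know its encoding" point concrete, here is a minimal sketch. It assumes Python 3, where what this thread calls str is spelled bytes and what it calls unicode is spelled str. The variable names are made up for illustration.

```python
# Python 3 naming: bytes is the thread's "str", str is the thread's "unicode".
raw = b"caf\xe9"  # four bytes; nothing in the object records an encoding

as_latin1 = raw.decode("latin-1")  # 0xE9 read as e-acute
as_cp437 = raw.decode("cp437")     # same byte, read as a different character

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # 0xE9 announces a multi-byte UTF-8 sequence that never arrives
    utf8_ok = False

print(repr(as_latin1), repr(as_cp437), utf8_ok)
```

Same four bytes, three different outcomes, and nothing in raw itself says which reading is the right one.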
>
> You made a leap from an abstract concept devoid of implementation details (I
> hope I got my terms right) to implementation details that don't seem to include
> how Unicode characters are handled by Python, in/by objects, in memory.
>
> "no memory representation (that you care about)" - um, I do care.  From what I
> am reading, there is some representation that can only exist in ... ram?  and
> can not be written to disk.  What, will it conjure up demons or suck my drive
> into a black hole?

seriously, print this out and read it on the plane:
http://www.joelonsoftware.com/articles/Unicode.html

you will arrive at your destination all clean and snuggly feeling.

>
> If my python app has a bunch of Unicode stuff going on, and I hibernate my OS,
> all that Unicode stuff gets written to a disk file.  So this "it can't be
> written to a disk file" needs to be clarified.  And yes, this is all
> hair-splitting semantics, but I think terminology assumptions are a huge part
> of the problem.
>
> One of my favorite topics is Feynman diagrams.  I have 0.1 clue what the hell
> they represent, but I can appreciate the leap they made in that field: they
> greatly improved communication of concepts between people.
>
> >
> > encode() takes a unicode and produces a str
> > decode() takes a str and produces a unicode
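
A small round trip may help pin down the two directions. This is a sketch assuming Python 3, so the names are swapped relative to the thread: encode() goes from str (the thread's unicode) to bytes (the thread's str), and decode() goes back.

```python
text = "na\u00efve"          # 5 characters of text (the thread's "unicode")
data = text.encode("utf-8")  # bytes (the thread's "str")

# The i-with-diaeresis needs two bytes in UTF-8, so the lengths differ.
print(len(text), len(data))

back = data.decode("utf-8")  # decoding with the same encoding restores the text
print(back == text)
```

The length mismatch is the giveaway that characters and bytes are different things: one counts letters, the other counts storage units under a particular encoding.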
>
> For this to help, unicode and str need to be defined better.  but I think I made
> that clear :)
>
> We need to start a Trac project to log the loose ends.  I'll get right on
> that.  Someday.
>
>
> >
> > You need to supply the source/destination encoding that you're working under.
> > The fact that both str & unicode objects have both methods in python doesn't
> > help things.  There's a reason, but it's not very good.
> >
> > Note that most encodings have a limited alphabet and are therefore not capable
> > of representing the full range of unicode characters.
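
Here is what a "limited alphabet" looks like in practice. A sketch assuming Python 3 (str is the thread's unicode, bytes is the thread's str); the variable names are illustrative only.

```python
text = "\u20ac5"  # euro sign followed by the digit 5

as_utf8 = text.encode("utf-8")     # UTF-8 covers the whole Unicode range
as_cp1252 = text.encode("cp1252")  # windows-1252 happens to include the euro sign

failures = []
for name in ("ascii", "latin-1"):
    try:
        text.encode(name)
    except UnicodeEncodeError:
        # the euro sign is not in this encoding's alphabet
        failures.append(name)

# Asking for replacement instead of an error silently loses the character.
lossy = text.encode("ascii", errors="replace")

print(as_utf8, as_cp1252, failures, lossy)
```

The error is the encoding telling you it has no byte sequence for that character. The "replace" handler makes the error go away, but only by throwing information out.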
>
> OK, now you threw me again.  the stuff is in memory.  how about we invent a hex
> encoding that does a hexdump of whatever is in memory?
>
> What happens if you pickle one of these suckers?
>
> yeah, this all revolves around my dis-understanding.
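
On the hexdump and pickle questions, a rough sketch, assuming Python 3 (str is the thread's unicode). The hexdump you get depends on which encoding you ask for, so there is no single "the bytes" for a piece of text. Whatever the interpreter keeps in RAM is an implementation detail that this sketch does not try to show. pickle chooses a byte representation for you and reverses it on load.

```python
import pickle

text = "\u00e9"  # one character: e with acute accent

# One character, three different byte sequences, depending on the encoding.
dumps = {
    "utf-8": text.encode("utf-8").hex(),
    "utf-16-le": text.encode("utf-16-le").hex(),
    "utf-32-le": text.encode("utf-32-le").hex(),
}
for name, hexdump in dumps.items():
    print(name, hexdump)

# pickle serializes the text to bytes and restores it unchanged.
restored = pickle.loads(pickle.dumps(text))
print(restored == text)
```

So nothing summons demons when text hits the disk; something just has to pick an encoding first, whether that is you, pickle, or the OS dumping RAM on hibernate.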
>
> We should just save this for March.  I am going to submit a proposal: "Watch
> ChiPy teach Unicode to the dumbest person on the planet.  If Carl can be taught,
> you can too."
>
> > utf8 (sometimes
> > referred to incorrectly & unhelpfully as 'unicode') is a particular encoding
> > for bytestrings.  It's the most comprehensive and most widely used, but it's
> > not the only one. Other commonly seen encodings are us-ascii, latin-1,
> > windows-1252.
> >
> > When coding text handling apps, I find it's best to do all of your processing
> > on unicode.  This means *decoding* as *soon* as possible (right after
> > reading) and *encoding* as *late* as possible (just before writing).
> >
> > Here's a little picture:
> >
> > network => str => decode => unicode => munge => encode => disk
> >
> > Hope this helps.  I've got some bookmarks at http://del.icio.us/pfein/unicode
> > if it's still not clear.
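
The picture above can be sketched in a few lines. This assumes Python 3 (bytes in, str in the middle, bytes out) and uses in-memory io.BytesIO objects as stand-ins for the network and the disk. The process function and its parameter names are hypothetical, not from any library.

```python
import io

def process(source, sink, in_encoding, out_encoding):
    """Decode early, work on text, encode late (hypothetical helper)."""
    raw = source.read()                      # bytes arrive from outside
    text = raw.decode(in_encoding)           # decode as soon as possible
    munged = text.upper()                    # all processing happens on text
    sink.write(munged.encode(out_encoding))  # encode just before writing

# Stand-ins for "network" and "disk"; the input happens to be latin-1.
incoming = io.BytesIO("caf\u00e9".encode("latin-1"))
outgoing = io.BytesIO()

process(incoming, outgoing, "latin-1", "utf-8")
result = outgoing.getvalue()
print(result)
```

The munge step never sees a byte, so it works the same whatever encodings sit at either end. Only the two edges need to know about latin-1 or utf-8.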
>
> Now your picture makes sense.
>
>
> _______________________________________________
> Chicago mailing list
> Chicago at python.org
> http://mail.python.org/mailman/listinfo/chicago
>