[Chicago] understanding unicode problems
Kumar McMillan
kumar.mcmillan at gmail.com
Fri Nov 16 19:05:44 CET 2007
On Nov 16, 2007 11:56 AM, Carl Karsten <carl at personnelware.com> wrote:
> Pete wrote:
> > On Friday November 16 2007 10:07:40 am Carl Karsten wrote:
> >> A string (unicode or not) is a bunch of bytes. unicode chars may use more
> >> than one byte. What I don't understand: why do I need to encode / decode?
> >> I get the feeling the error caused is a reminder "so that you know that
> >> you need to do the other operation later."
> >
> > I like to think of it this way:
> >
> > "Unicode is the Platonic Ideal of text; Strings are the shadows on the wall."
> >
> > Feel free to quote me.
> >
> > Unicode characters exist in some abstract heavenly place. They are about as
> > pure a representation of text as you could conceive - no glyphs (fonts), no
> > bytes, no memory representation (that you care about), nada. Unicode can
> > contain all characters now in existence or that will ever be.
> >
> > Here on earth, we can't write such things to disk or email them around. This
> > is where strings come in. Strings (more properly termed bytestrings) are
> > simply that - a sequence of bytes. They have an associated encoding, which
> > is basically the alphabet of legal bytes. The string itself doesn't know its
> > encoding; you either need to be told that by an external mechanism or guess
> > (often both).
>
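To make the "string doesn't know its encoding" point concrete, here is a minimal sketch. It is written for Python 3, where the bytestring type is called bytes and the type this thread calls unicode is called str. The sample bytes and the choice of cp437 as the "wrong guess" are just illustrations I picked.

```python
# Python 3 naming: bytes is the bytestring, str is the unicode text type.
raw = b"caf\xe9"  # four bytes; nothing in the object records an encoding

# The same bytes read under two different single-byte encodings.
as_latin1 = raw.decode("latin-1")  # 0xE9 is e-acute in latin-1
as_cp437 = raw.decode("cp437")     # 0xE9 is Greek capital theta in cp437

# A wrong guess can also fail outright: these bytes are not valid UTF-8.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(repr(as_latin1), repr(as_cp437), utf8_ok)
```

Same bytes, two different texts, and one encoding that rejects them entirely. That is why something outside the bytes has to tell you which encoding was used.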
> You made a leap from an abstract concept devoid of implementation details (I
> hope I got my terms right) to implementation details that don't seem to include
> how Unicode characters are handled by python, in/by objects, in memory.
>
> "no memory representation (that you care about)" - um, I do care. From what I
> am reading, there is some representation that can only exist in ... RAM? and
> cannot be written to disk. What, will it conjure up demons or suck my drive
> into a black hole?
Seriously, print this out and read it on the plane:
http://www.joelonsoftware.com/articles/Unicode.html
You will arrive at your destination all clean and snuggly feeling.
>
> If my python app has a bunch of Unicode stuff going on, and I hibernate my OS,
> all that Unicode stuff gets written to a disk file. So this "it can't be
> written to a disk file" needs to be clarified. And yes, this is all
> hair-splitting semantics, but I think terminology assumptions are a huge part of
> the problem.
>
> One of my favorite topics is Feynman Diagrams. I have 0.1 clue what the hell
> they represent, but I can appreciate the leap it made in that field: it greatly
> improved communication of concepts between people.
>
> >
> > encode() takes a unicode and produces a str
> > decode() takes a str and produces a unicode
>
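A small round trip may help pin down the two terms. This is a sketch in Python 3 naming, where the thread's unicode type is str and the thread's str type is bytes; the sample word is my own choice.

```python
# Text: a sequence of characters (the thread's "unicode").
text = "na\u00efve"  # five characters: n, a, i-diaeresis, v, e

# encode(): text -> bytes, using a named encoding.
data = text.encode("utf-8")

# decode(): bytes -> text, using the same encoding.
back = data.decode("utf-8")

print(type(text).__name__, len(text))  # characters
print(type(data).__name__, len(data))  # bytes; the accented letter takes two
print(back == text)
```

The character count and the byte count differ because UTF-8 spends two bytes on the one accented letter.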
> For this to help, unicode and str need to be defined better. But I think I made
> that clear :)
>
> We need to start a trac project to log the loose ends. I'll get right on that.
> someday.
>
>
> >
> > You need to supply the source/destination encoding that you're working under.
> > The fact that both str & unicode objects have both methods in python doesn't
> > help things. There's a reason, but it's not very good.
> >
> > Note that most encodings have a limited alphabet and are therefore not capable
> > of representing the full range of unicode characters.
>
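The "limited alphabet" point can be seen by trying one short text against several encodings. A minimal sketch in Python 3; try_encode is a hypothetical helper name and the euro-sign sample is my own choice.

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or None if the encoding lacks a character."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

text = "\u20ac5"  # euro sign followed by the digit 5

results = {
    enc: try_encode(text, enc)
    for enc in ("ascii", "latin-1", "cp1252", "utf-8")
}

for enc, encoded in results.items():
    print(enc, encoded)
```

ascii and latin-1 have no euro sign, so the encode fails for both. cp1252 (windows-1252) has one at a single byte, and utf-8 spends three bytes on it.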
> OK, now you threw me again. The stuff is in memory. How about we invent a hex
> encoding that does a hexdump of whatever is in memory?
>
> What happens if you pickle one of these suckers?
>
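On the hexdump and pickle questions, here is a rough sketch in Python 3 (str is the unicode text type, bytes the bytestring). It doesn't show the interpreter's internal memory layout, which is an implementation detail. It only shows that one character turns into different bytes depending on the encoding you ask for, and that pickle hands back bytes too.

```python
import pickle

text = "\u00e9"  # a single character, e-acute

# "Hexdumps" of the same character under three encodings.
hex_utf8 = text.encode("utf-8").hex()
hex_utf16 = text.encode("utf-16-le").hex()
hex_utf32 = text.encode("utf-32-le").hex()
print(hex_utf8, hex_utf16, hex_utf32)

# Pickling produces bytes; unpickling gives back equal text.
blob = pickle.dumps(text)
restored = pickle.loads(blob)
print(type(blob).__name__, restored == text)
```

So there is no single "the bytes" for a piece of text until an encoding is chosen. pickle makes that choice for you and reverses it on load.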
> Yeah, this all revolves around my dis-understanding.
>
> We should just save this for March. I am going to submit a proposal: "Watch
> Chipy teach Unicode to the dumbest person on the planet. If Carl can be taught,
> you can too."
>
> > utf8 (sometimes
> > referred to incorrectly & unhelpfully as 'unicode') is a particular encoding
> > for bytestrings. It's the most comprehensive and most widely used, but it's
> > not the only one. Other commonly seen encodings are us-ascii, latin-1, and
> > windows-1252.
> >
> > When coding text handling apps, I find it's best to do all of your processing
> > on unicode. This means *decoding* as *soon* as possible (right after
> > reading) and *encoding* as *late* as possible (just before writing).
> >
> > Here's a little picture:
> >
> > network => str => decode => unicode => munge => encode => disk
> >
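The picture above can be sketched in a few lines of Python 3. Everything named here is a stand-in: process is a hypothetical function, io.BytesIO objects play the network and the disk, upper() plays the "munge" step, and the latin-1 input and utf-8 output encodings are assumptions made for the example.

```python
import io

def process(source, sink, in_encoding="latin-1", out_encoding="utf-8"):
    """Decode early, work on text, encode late."""
    raw = source.read()                      # bytes from the outside world
    text = raw.decode(in_encoding)           # decode right after reading
    munged = text.upper()                    # all processing happens on text
    sink.write(munged.encode(out_encoding))  # encode just before writing

source = io.BytesIO(b"caf\xe9")  # pretend this arrived over the network
sink = io.BytesIO()              # pretend this is a file on disk
process(source, sink)

print(sink.getvalue())
```

The munging step never sees bytes, so it doesn't care which encoding came in or which one goes out.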
> > Hope this helps. I've got some bookmarks at http://del.icio.us/pfein/unicode
> > if it's still not clear.
>
> Now your picture makes sense.
>
>