[Chicago] understanding unicode problems
Carl Karsten
carl at personnelware.com
Fri Nov 16 18:56:37 CET 2007
Pete wrote:
> On Friday November 16 2007 10:07:40 am Carl Karsten wrote:
>> A string (unicode or not) is a bunch of bytes. unicode chars may use more
>> than one byte. What I don't understand: Why do I need to encode / decode?
>> I get the feeling the error caused is a reminder "so that you know that
>> you need to do the other operation later."
>
> I like to think of it this way:
>
> "Unicode is the Platonic Ideal of text; Strings are the shadows on the wall."
>
> Feel free to quote me.
>
> Unicode characters exist in some abstract heavenly place. They are about as
> pure a representation of text as you could conceive - no glyphs (fonts), no
> bytes, no memory representation (that you care about), nada. Unicode can
> contain all characters now in existence or that will ever be.
>
> Here on earth, we can't write such things to disk or email them around. This
> is where strings come in. Strings (more properly termed bytestrings) are
> simply that - a sequence of bytes. They have an associated encoding, which
> is basically the alphabet of legal bytes. The string itself doesn't know its
> encoding; you either need to be told that by an external mechanism or guess
> (often both).
You made a leap from an abstract concept devoid of implementation details (I
hope I got my terms right) to implementation details that don't seem to include
how Unicode characters are handled by Python, in/by objects, in memory.
"no memory representation (that you care about)" - um, I do care. From what I
am reading, there is some representation that can only exist in ... RAM? and
cannot be written to disk. What, will it conjure up demons or suck my drive
into a black hole?
If my Python app has a bunch of Unicode stuff going on, and I hibernate my OS,
all that Unicode stuff gets written to a disk file. So this "it can't be
written to a disk file" needs to be clarified. And yes, this is all
hair-splitting semantics, but I think terminology assumptions are a huge part
of the problem.
One of my favorite topics is Feynman Diagrams. I have 0.1 clue what the hell
they represent, but I can appreciate the leap they made in that field: they
greatly improved communication of concepts between people.
>
> encode() takes a unicode and produces a str
> decode() takes a str and produces a unicode
For this to help, unicode and str need to be defined better, but I think I made
that clear :)
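For what it's worth, here is a minimal sketch of the two types and the two
methods. It is written for Python 3, which renamed things: the unicode type
discussed in this thread is spelled `str` there, and the old bytestring str is
spelled `bytes`. The sample word is only an illustration.

```python
# Python 3 naming: str = text (this thread's "unicode"),
# bytes = bytestring (this thread's "str").
text = "caf\u00e9"                   # 4 characters; the last one is e-acute

# encode(): text -> bytes, under a named encoding
raw = text.encode("utf-8")
print(type(raw).__name__, len(raw))  # bytes 5 (e-acute takes two bytes in UTF-8)

# decode(): bytes -> text, using the same encoding
print(raw.decode("utf-8") == text)   # True

# The bytes do not record their encoding. Decoding with the wrong one
# raises no error here; it just yields different (wrong) text.
wrong = raw.decode("latin-1")
print(len(wrong))                    # 5
```

So "unicode" is a sequence of characters, "str" is a sequence of bytes, and the
encoding name is the only thing that connects the two.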
We need to start a trac project to log the loose ends. I'll get right on that.
Someday.
>
> You need to supply the source/destination encoding that you're working under.
> The fact that both str & unicode objects have both methods in python doesn't
> help things. There's a reason, but it's not very good.
>
> Note that most encodings have a limited alphabet and are therefore not capable
> of representing the full range of unicode characters.
OK, now you threw me again. The stuff is in memory. How about we invent a hex
encoding that does a hexdump of whatever is in memory?
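Something close to that hex dump already exists if you look at the encoded
bytes rather than at the interpreter's memory (what CPython holds internally is
an implementation detail that has changed between versions). A small sketch in
Python 3, where text is `str` and the bytestring is `bytes`; the character and
the encodings are picked only for illustration. It shows one character coming
out as different bytes under different encodings, and one encoding whose
alphabet does not include it.

```python
ch = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE

# One character, three encodings, three different byte sequences.
for name in ("utf-8", "latin-1", "utf-16-le"):
    print(name, ch.encode(name).hex())
# utf-8 c3a9
# latin-1 e9
# utf-16-le e900

# us-ascii has no byte for this character, so encoding fails.
try:
    ch.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)  # False
```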
What happens if you pickle one of these suckers?
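Pickling turns out to be undramatic: pickle chooses a byte representation for
the text and undoes it on load, so no explicit encode/decode is needed. A quick
Python 3 sketch (text is `str` there, bytestrings are `bytes`), with an
arbitrary sample string:

```python
import pickle

text = "caf\u00e9"

blob = pickle.dumps(text)        # pickle serializes the text to bytes
print(type(blob).__name__)       # bytes

restored = pickle.loads(blob)    # and rebuilds the same text on load
print(restored == text)          # True
```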
Yeah, this all revolves around my dis-understanding.
We should just save this for March. I am going to submit a proposal: "Watch
Chipy teach Unicode to the dumbest person on the planet. If Carl can be taught,
you can too."
> utf8 (sometimes
> referred to incorrectly & unhelpfully as 'unicode') is a particular encoding
> for bytestrings. It's the most comprehensive and most widely used, but it's
> not the only one. Other commonly seen encodings are us-ascii, latin-1,
> windows-1252.
>
> When coding text handling apps, I find it's best to do all of your processing
> on unicode. This means *decoding* as *soon* as possible (right after
> reading) and *encoding* as *late* as possible (just before writing).
>
> Here's a little picture:
>
> network => str => decode => unicode => munge => encode => disk
>
> Hope this helps. I've got some bookmarks at http://del.icio.us/pfein/unicode
> if it's still not clear.
Now your picture makes sense.
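Here is the picture as a rough Python 3 sketch (text is `str`, bytestrings are
`bytes`). The `munge` and `process` names, the sample bytes, and the choice of
latin-1 in and utf-8 out are all made up for illustration; a real program would
learn the input encoding from a header or some other outside source.

```python
def munge(text):
    # Stand-in for real processing. It only ever sees text, never raw bytes.
    return text.upper()

def process(raw, in_encoding, out_encoding):
    text = raw.decode(in_encoding)       # decode as soon as possible
    text = munge(text)                   # do all the work on text
    return text.encode(out_encoding)     # encode as late as possible

incoming = b"caf\xe9"                    # pretend this arrived as latin-1 bytes
outgoing = process(incoming, "latin-1", "utf-8")
print(outgoing)                          # b'CAF\xc3\x89'
```

Keeping decode and encode at the edges means the middle of the program never
has to ask which encoding a given value is in.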