[Chicago] understanding unicode problems

Fri Nov 16 18:56:37 CET 2007

Pete wrote:
> On Friday November 16 2007 10:07:40 am Carl Karsten wrote:
>> A string (unicode or not) is a bunch of bytes.  unicode chars may use more
>> than one byte.  What I don't understand:  Why do I need to encode / decode?
>>  I get the feeling the error caused is a reminder "so that you know that
>> you need to do the other operation later."
> 
> I like to think of it this way:
> 
> "Unicode is the Platonic Ideal of text; Strings are the shadows on the wall."
> 
> Feel free to quote me.
> 
> Unicode characters exist in some abstract heavenly place.  They are about as 
> pure a representation of text as you could conceive - no glyphs (fonts), no 
> bytes, no memory representation (that you care about), nada.  Unicode can 
> contain all characters now in existence or that will ever be.
> 
> Here on earth, we can't write such things to disk or email them around.  This 
> is where strings come in.  Strings (more properly termed bytestrings) are 
> simply that - a sequence of bytes.  They have an associated encoding, which 
> is basically the alphabet of legal bytes.  The string itself doesn't know its 
> encoding; you either need to be told that by an external mechanism or guess 
> (often both).
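
A small sketch of the point that a bytestring does not know its own encoding. It uses Python 3 names, which is an assumption relative to this thread: Python 3's `bytes` is what the thread calls `str`, and Python 3's `str` is what the thread calls `unicode`. The sample bytes are made up for illustration.

```python
# Python 3 naming: bytes == the thread's "str", str == the thread's "unicode".
raw = b"caf\xc3\xa9"  # five bytes; nothing in the object records an encoding

# The same bytes give different text depending on which encoding we assume.
as_utf8 = raw.decode("utf-8")      # four characters, ending in e-acute
as_latin1 = raw.decode("latin-1")  # five characters, ending in two junk symbols

print(len(raw), len(as_utf8), len(as_latin1))
print(repr(as_utf8), repr(as_latin1))
```

Both calls succeed; only outside knowledge (a header, a file format, a guess) says which result is the intended text.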

You made a leap from an abstract concept devoid of implementation details (I 
hope I got my terms right) to implementation details that don't seem to include 
how Unicode characters are handled by Python, in/by objects, in memory.

"no memory representation (that you care about)" - um, I do care.  From what I 
am reading, there is some representation that can only exist in ... ram?  and 
can not be written to disk.  What, will it conjure up demons or suck my drive 
into a black hole?

If my python app has a bunch of Unicode stuff going on, and I hibernate my OS, 
all that Unicode stuff gets written to a disk file.  So this "it can't be 
written to a disk file" needs to be clarified.  And yes, this is all hair 
splitting semantics, but I think terminology assumptions are a huge part of 
the problem.

One of my favorite topics is Feynman Diagrams.  I have 0.1 clue what the hell 
they represent, but I can appreciate the leap they made in that field: they 
greatly improved communication of concepts between people.

> 
> encode() takes a unicode and produces a str
> decode() takes a str and produces a unicode

For this to help, unicode and str need to be defined better.  But I think I made 
that clear :)
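
One way to pin the two terms down is to look at the objects directly. This is a minimal sketch using Python 3 names, which is an assumption relative to this thread: the thread's `unicode` is Python 3's `str` (a sequence of code points) and the thread's `str` is Python 3's `bytes` (a sequence of bytes). The sample word is arbitrary.

```python
text = "na\u00efve"          # text: 5 code points, no encoding attached
data = text.encode("utf-8")  # encode: text -> bytes
back = data.decode("utf-8")  # decode: bytes -> text

print(type(text).__name__, len(text))  # str 5
print(type(data).__name__, len(data))  # bytes 6
print(back == text)                    # True

# In Python 3 each type keeps only the method that makes sense for it.
print(hasattr(text, "decode"), hasattr(data, "encode"))  # False False
```

The length difference is the whole story in miniature: one of the five characters needs two bytes in UTF-8, so the byte count and the character count disagree.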

We need to start a trac project to log the loose ends.  I'll get right on that. 
Someday.

> 
> You need to supply the source/destination encoding that you're working under. 
> The fact that both str & unicode objects have both methods in python doesn't 
> help things.  There's a reason, but it's not very good.
> 
> Note that most encodings have a limited alphabet and are therefore not capable 
> of representing the full range of unicode characters. 

OK, now you threw me again.  The stuff is in memory.  How about we invent a hex 
encoding that does a hexdump of whatever is in memory?
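
A rough sketch of both points, the limited alphabet and the hexdump idea, using Python 3 names (an assumption relative to this thread: the thread's `unicode` is Python 3's `str`). It only shows behaviour visible through the Python API; how the interpreter lays the text out in RAM is an implementation detail this example does not touch. The sample text is arbitrary.

```python
text = "\u20ac5"  # a euro sign followed by the digit 5

# A limited alphabet: ASCII has no euro sign, so encoding fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# A "hexdump" only exists once a byte layout has been chosen,
# and choosing a byte layout is exactly what picking an encoding means.
hex_utf8 = text.encode("utf-8").hex()       # 'e282ac35'
hex_utf16 = text.encode("utf-16-be").hex()  # '20ac0035'
hex_cp1252 = text.encode("cp1252").hex()    # '8035'

print(ascii_ok, hex_utf8, hex_utf16, hex_cp1252)
```

Three encodings, three different dumps of the same two characters, which is why "just hexdump it" still ends up naming an encoding.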

What happens if you pickle one of these suckers?
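
A quick sketch of the pickle question, using Python 3 names (an assumption relative to this thread: the thread's `unicode` is Python 3's `str`). The sample text is arbitrary.

```python
import pickle

text = "\u65e5\u672c\u8a9e"    # three non-ASCII characters

blob = pickle.dumps(text)      # pickling produces bytes
restored = pickle.loads(blob)  # unpickling gives the text back

print(type(blob).__name__)     # bytes
print(restored == text)        # True
```

Nothing dramatic happens: pickle has to settle on some byte form internally in order to produce `blob`, but the caller never has to name it, and the text comes back unchanged.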

Yeah, this all revolves around my dis-understanding.

We should just save this for March.  I am going to submit a proposal: "Watch 
Chipy teach Unicode to the dumbest person on the planet.  If Carl can be taught, 
you can too."

> utf8 (sometimes
> referred to incorrectly & unhelpfully as 'unicode') is a particular encoding 
> for bytestrings.  It's the most comprehensive and most widely used, but it's 
> not the only one. Other commonly seen encodings are us-ascii, latin-1, 
> windows-1252.
> 
> When coding text handling apps, I find it's best to do all of your processing 
> on unicode.  This means *decoding* as *soon* as possible (right after 
> reading) and *encoding* as *late* as possible (just before writing).
> 
> Here's a little picture:
> 
> network => str => decode => unicode => munge => encode => disk
> 
> Hope this helps.  I've got some bookmarks at http://del.icio.us/pfein/unicode 
> if it's still not clear.

Now your picture makes sense.
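
The picture can be sketched in a few lines. This uses Python 3 names (an assumption relative to this thread: the thread's `str` is Python 3's `bytes`, and its `unicode` is Python 3's `str`). The `munge` function, the sample bytes, and the choice of latin-1 in and utf-8 out are all hypothetical stand-ins.

```python
def munge(text: str) -> str:
    """Hypothetical processing step; it only ever sees text, never bytes."""
    return text.upper()

# network => bytes (assumed here to have arrived as latin-1)
incoming = b"caf\xe9"

# decode as soon as possible
text = incoming.decode("latin-1")

# munge works on text only
result = munge(text)

# encode as late as possible, just before writing to disk
outgoing = result.encode("utf-8")

print(repr(result), outgoing)
```

Keeping the encodings at the two edges means the middle of the program never has to ask which bytes it is holding.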