[Chicago] understanding unicode problems

Fri Nov 16 19:25:55 CET 2007

On Friday November 16 2007 12:56:37 pm Carl Karsten wrote:
> "no memory representation (that you care about)" - um, I do care.  From
> what I am reading, there is some representation that can only exist in ...
> ram?  and can not be written to disk.  What, will it conjure up demons or
> suck my drive into a black hole?

Really, you'll be much happier if you just close your eyes & think of England.

As Kumar notes, the whole reason we use Python is so that we don't have to 
think about memory layout issues.  You don't go around worrying about how 
your Python/Java/C++/VB classes are laid out in memory, do you?  This is no 
different.

> > encode() takes a unicode and produces a str
> > decode() takes a str and produces a unicode
>
> For this to help, unicode and str need to be defined better.  but I think I
> made that clear :)

str == bytes. Like, the same bytes you grew up with writing BASIC on your 
Amiga, though dressed up sexy in OO. 

unicode == text. With all the nastiness of character sets & memory 
representation hidden so you don't have to worry about it.
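
A minimal sketch of the encode()/decode() round trip quoted above. The sample
string and the choice of UTF-8 are assumptions made for illustration. The
snippet is written to run on Python 2.6+ as well as Python 3.3+; note that
Python 3 renamed the types, so 2.x unicode is 3.x str and 2.x str is 3.x bytes.

```python
# Sample text is hypothetical: "cafe" with an e-acute as the last character.
text = u"caf\u00e9"

data = text.encode("utf-8")   # encode(): text -> bytes
back = data.decode("utf-8")   # decode(): bytes -> text

print(repr(data))             # the bytes, as they would go to disk or a socket
print(back == text)           # the round trip gives back the same text
```

The only place an encoding is named is at the boundary between text and
bytes; everything done with `text` itself is encoding-free.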

You can treat a str like text if you want, but that's your business. And doing 
so will give you encoding errors as soon as non-ASCII data shows up.
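
The kind of failure meant here can be reproduced with an explicit ASCII
decode, which is what Python 2 attempts implicitly when a str is mixed with a
unicode (Python 3 refuses to mix the two types at all and raises TypeError
instead). This is a sketch under that assumption; the sample bytes are made up
for illustration.

```python
# Bytes holding UTF-8 encoded text with one non-ASCII character.
data = u"caf\u00e9".encode("utf-8")

try:
    # Pretend the bytes are plain ASCII text.
    data.decode("ascii")
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print("decode failed: %s" % exc)
```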

> OK, now you threw me again.  the stuff is in memory.  how about we invent a
> hex encoding that does a hexdump of whatever is in memory?

Try GDB.  Seriously, it doesn't matter how it's represented internally by 
Python.  Heck, a Python int isn't a C int either.
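
One way to see why a "hexdump of whatever is in memory" is not a meaningful
question at the Python level: the same text yields different bytes depending
on which encoding you ask for. A small sketch, with the sample string and the
three codecs chosen arbitrarily for illustration:

```python
import binascii

text = u"caf\u00e9"   # hypothetical sample text

# Hex dump of the same text under three different encodings.
hexes = {}
for codec in ("latin-1", "utf-8", "utf-16-le"):
    raw = text.encode(codec)
    hexes[codec] = binascii.hexlify(raw).decode("ascii")
    print("%-10s %s" % (codec, hexes[codec]))
```

None of these dumps is "the" representation of the text; each one exists only
once an encoding has been picked.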

> What happens if you pickle one of these suckers?

It gets written in some internal binary format that pickle understands.  How 
does a list get pickled?
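
A quick sketch of the pickle round trip, assuming the default pickle protocol
and a made-up sample string. The point is that the pickled bytes are opaque
and the text comes back unchanged:

```python
import pickle

text = u"caf\u00e9"            # hypothetical sample text

blob = pickle.dumps(text)      # bytes in pickle's own format
restored = pickle.loads(blob)  # the original text again

print(restored == text)
```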

From within Python, a unicode is a sequence of 1-character unicodes.  That's 
all you need to know to get your work done.
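
A short sketch of that sequence behaviour, using a made-up four-character
string. Lengths and indexes count characters; byte counts only appear after
encoding (UTF-8 here, chosen for illustration):

```python
text = u"caf\u00e9"                 # hypothetical sample text

chars = list(text)                  # iterating yields 1-character strings
last = text[3]                      # indexing yields a 1-character string too

n_chars = len(text)                 # number of characters
n_bytes = len(text.encode("utf-8")) # number of bytes once encoded

print(n_chars)
print(n_bytes)
```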

-- 
Peter Fein   ||   773-575-0694   ||   pfein at pobox.com
http://www.pobox.com/~pfein/   ||   PGP: 0xCCF6AE6B
irc: pfein at freenode.net   ||   jabber: peter.fein at gmail.com