[Chicago] understanding unicode problems

Fri Nov 16 17:53:01 CET 2007

>
> >> A string (unicode or not) is a bunch of bytes.  unicode chars may use
> >> more than one byte.
> >
> > unicode is actually represented internally as "code points;" it's not
> > stored in bytes while it's "unicode."
>
> Um, what's a "code point"?  and what are you calling "bytes", cuz in my
> vocabulary, everything is stored as a set of bytes, those 8 bit things
> that the CPU reads and writes to ram and disk drives.

Traditionally a string is stored in memory as an array of bytes, and code
treats it that way: it loops over the array one byte at a time and maps each
byte to an ASCII character. This is what gets people in trouble with Unicode,
because all of a sudden a CHARACTER can take anywhere between 1 and 4 bytes
(in UTF-8, for example). You now need to care about how many bytes make up a
character and, for encodings with multi-byte units such as UTF-16, about big
vs. little endian byte order.
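
To make the byte-counting problem concrete, here is a small sketch. It
assumes Python 3, where str holds code points and bytes holds raw bytes (the
Python current when this thread was written spelled these unicode and str).
The sample text and variable names are just for illustration.

```python
# Assumes Python 3: str is a sequence of code points, bytes is raw bytes.
text = "naïve €"

# Encoding turns the code points into bytes. In UTF-8, "ï" takes 2 bytes
# and "€" takes 3, so the byte count is larger than the character count.
utf8 = text.encode("utf-8")

# Looping over the encoded form yields one byte (an int 0-255) at a time,
# so the multi-byte characters come out in pieces.
byte_values = list(utf8)

# The same character under the two UTF-16 byte orders.
euro_le = "€".encode("utf-16-le")
euro_be = "€".encode("utf-16-be")

print(len(text), len(utf8))
print(euro_le, euro_be)
```

Code that assumes "one byte == one character" would report 10 characters
here instead of 7, and would mangle the euro sign if it guessed the wrong
byte order.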

This is why a unicode string is best thought of as a sequence of code points
(a code point is simply the number Unicode assigns to a character, e.g.
U+0041 for "A"), and code that operates on strings operates on one code point
at a time (as opposed to one byte at a time). How the code points themselves
are represented in memory is something you don't need to worry about (the
same way you don't really need to care how an instance of a python class is
stored in memory).
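
As a rough illustration of working one code point at a time, the sketch
below again assumes Python 3 (where str plays the role of the unicode type);
the sample text and the names code_points and labels are made up for the
example.

```python
# Assumes Python 3, where iterating a str yields one code point at a time.
text = "naïve €"

# ord() gives the code point (an integer) behind each character.
code_points = [ord(ch) for ch in text]

# Conventional "U+XXXX" notation for the same numbers.
labels = ["U+%04X" % cp for cp in code_points]

# Indexing is by code point, not by byte: position 2 is the whole "ï".
third = text[2]

print(labels)
```

Nothing here says how many bytes each code point occupies in memory; that
only becomes visible once the string is encoded.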

> >
> >> What I don't understand:  Why do I need to encode / decode?
> >
> > Because you can't write unicode to a file, for example.  A file
> > contains bytes and unicode has arbitrary byte representations.  When
> > you encode unicode as UTF-8 the bytestring will look different than if
> > you encode it as LATIN-1.  The reason this is so confusing is that
> > Python will **try** to do the encoding/decoding for you automatically.
> >  This is also why the errors you see are often very confusing (if you
> > don't know Python is doing this under the hood).
> >
>
> This will make more sense once I get a grip on what a byte is.

A byte is a group of 8 bits. In the days of ASCII one byte was enough to
represent each character. With Unicode that's no longer the case: there are
far more than 256 code points, so most of them need more than one byte. There
is more than one way to map code points to bytes, which is why there are
multiple encodings. Personally I like UTF-8 because ASCII characters still
take only one byte (other Unicode characters take up to 4).
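
A short sketch of what "multiple encodings" means in practice, assuming
Python 3 and its built-in "utf-8" and "latin-1" codecs. The sample
characters are arbitrary.

```python
# Assumes Python 3. Four characters chosen to need 1, 2, 3 and 4 bytes
# in UTF-8 respectively.
samples = ["A", "é", "€", "😀"]
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

# The same code point becomes different bytes under different encodings.
e_utf8 = "é".encode("utf-8")
e_latin1 = "é".encode("latin-1")

# Decoding with the encoding that was used to encode round-trips cleanly...
round_trip = e_utf8.decode("utf-8")

# ...while decoding with a different one silently produces the wrong text.
wrong = e_utf8.decode("latin-1")

print(utf8_lengths)
print(e_utf8, e_latin1)
```

This is also why you have to encode before writing to a file and decode
after reading: the file only holds the bytes, and the reader must know which
encoding produced them.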

-- 
Cosmin Stejerean
http://blog.offbytwo.com