Ascii to Unicode.

John Nagle nagle at animats.com
Thu Jul 29 15:17:14 EDT 2010


On 7/28/2010 3:58 PM, Joe Goldthwaite wrote:
> This still seems odd to me.  I would have thought that the unicode function
> would return a properly encoded byte stream that could then simply be
> written to disk. Instead it seems like you have to re-encode the byte stream
> to some kind of escaped Ascii before it can be written back out.

    Here's what's really going on.

    Unicode strings within Python have to be indexable.  So the internal
representation uses a fixed number of bytes for each character (usually
two, or four on "wide" builds), and the strings work like arrays.

    UTF-8 is a stream format for Unicode.  It's a variable-length
encoding; each character occupies 1 to 4 bytes, and the base ASCII
characters (0..127 only, not 128..255) occupy one byte each.  The format
is described in "http://en.wikipedia.org/wiki/UTF-8".  Because the
characters vary in width, finding the Nth character in a UTF-8 file or
stream means scanning from the beginning and counting where each
Unicode character begins.  So it's not a suitable format for
data being actively worked on in memory; it can't be easily indexed.
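
    A small sketch of the width issue, not taken from the original
thread.  The sample string is made up for illustration, and the code is
written so it should run on both Python 2 and Python 3 (3.3 or later).

```python
# Hypothetical sample: 'a', e with acute accent, euro sign.
text = u"a\u00e9\u20ac"

# Number of bytes each character needs once encoded as UTF-8.
widths = [len(ch.encode("utf-8")) for ch in text]
print(widths)  # [1, 2, 3]

data = text.encode("utf-8")
print(len(text), len(data))  # 3 characters, 6 bytes

# Indexing the text yields a whole character.  Slicing the bytes at the
# same position lands on the first byte of the two-byte e-acute sequence.
print(repr(text[1]))
print(repr(data[1:2]))
```

    The character index and the byte index only agree while the text is
pure ASCII, which is why the encoded form is awkward to index directly.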

    That's why the in-memory form and the on-disk form differ, and why
it's necessary to convert to UTF-8 (or some other byte encoding) before
writing to a file or socket.
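
    Roughly, the write step can look like the sketch below.  The sample
text and the temporary file name are assumptions for illustration only.
It uses io.open, which exists in Python 2.6+ and Python 3, so the same
code should work on either.

```python
import io
import os
import tempfile

# Hypothetical sample text with two non-ASCII characters.
text = u"na\u00efve caf\u00e9"

# Option 1: encode explicitly, then write the raw bytes.
data = text.encode("utf-8")
path = os.path.join(tempfile.mkdtemp(), "out.txt")
with open(path, "wb") as f:
    f.write(data)

# Option 2: let io.open do the encoding on the way out.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reading back with the same encoding recovers the original text.
with io.open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()
```

    Either way, what lands on disk is a UTF-8 byte sequence, not the
in-memory array of characters.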

				John Nagle


