why isn't Unicode the default encoding?

John Salerno johnjsal at NOSPAMgmail.com
Mon Mar 20 22:53:41 CET 2006


Martin v. Löwis wrote:

> The real problem is that the Python string type is used to represent
> two very different concepts: bytes, and characters. You can't just drop
> the current Python string type, and use the Unicode type instead - then
> you would have no good way to represent sequences of bytes anymore.
> Byte sequences occur more often than you might think: a ZIP file, an
> MS Word file, a PDF file, and even an HTTP conversation are represented
> through byte sequences.
> 
> So for a byte sequence, internal representation is important; for a
> character string, it is not. Now, for historical reasons, the Python
> string literals create byte strings, not character strings. Since we
> cannot know whether a certain string literal is meant to denote bytes
> or characters, we can't just change the interpretation.

Interesting. So would the read() method, when given a numeric argument
for the number of bytes to read, behave differently depending on whether
you were using Unicode? As it is now, it seems to equate the number of
bytes with the number of characters, but if the document were encoded
with multi-byte Unicode characters, could read(2) return only one
character?
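
To make the question concrete, here is a rough sketch of the case I have in
mind. It assumes UTF-8 as the encoding and uses io.BytesIO and
io.TextIOWrapper as stand-ins for a real file; those choices and the sample
text are mine, not anything from Martin's post, so treat it as an
illustration rather than a description of how any particular file object
behaves.

```python
import io

# Sample text (hypothetical): two accented letters followed by two ASCII letters.
text = u"\u00e9\u00e8ab"
data = text.encode("utf-8")  # each accented letter takes 2 bytes in UTF-8

# Reading from a byte stream: the argument to read() counts bytes.
raw = io.BytesIO(data)
first_two_bytes = raw.read(2)
decoded = first_two_bytes.decode("utf-8")  # those 2 bytes form a single character

# Reading through a decoding layer: the argument to read() counts characters.
text_stream = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8")
first_two_chars = text_stream.read(2)

# Cutting a multi-byte character in half leaves bytes that cannot be decoded.
try:
    data[:1].decode("utf-8")
    partial_decodes = True
except UnicodeDecodeError:
    partial_decodes = False

print(len(text), len(data))
print(len(first_two_bytes), len(decoded))
print(len(first_two_chars))
print(partial_decodes)
```

If this sketch is right, then read(2) on the byte stream yields one
character's worth of data here, while read(2) on the decoding stream yields
two characters, and a read that stops in the middle of a character returns
bytes that do not decode on their own.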


