why isn't Unicode the default encoding?
matt at pollenation.net
Mon Mar 20 23:24:38 CET 2006
John Salerno wrote:
> Martin v. Löwis wrote:
>> The real problem is that the Python string type is used to represent
>> two very different concepts: bytes, and characters. You can't just drop
>> the current Python string type, and use the Unicode type instead - then
>> you would have no good way to represent sequences of bytes anymore.
>> Byte sequences occur more often than you might think: a ZIP file, a
>> MS Word file, a PDF file, and even an HTTP conversation are represented
>> through byte sequences.
>> So for a byte sequence, internal representation is important; for a
>> character string, it is not. Now, for historical reasons, the Python
>> string literals create byte strings, not character strings. Since we
>> cannot know whether a certain string literal is meant to denote bytes
>> or characters, we can't just change the interpretation.
> Interesting. So then the read() method, if given a numeric argument for
> bytes to read, would act differently depending on if you were using
> Unicode or not? As it is now, it seems to equate the bytes with number
> of characters, but if the document was written using Unicode characters,
> is it possible that read(2) might only pull out one character?
Exactly. read(2) might pull out one character, or only half a character.
It all depends on the encoding of the data you're reading.
If you're reading or writing text to a file (or anywhere, for that
matter) you need to know the character encoding of the file's content
(UTF-8, UTF-16, Latin-1, ...) to read it correctly.
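
To make the "half a character" case concrete, here is a rough sketch. It
assumes Python 3, where bytes and text are separate types, and uses an
in-memory io.BytesIO as a stand-in for a file opened in binary mode. The
sample text is made up for illustration.

```python
import io

# Hypothetical sample: "é" is 1 character but 2 bytes in UTF-8.
data = "é!".encode("utf-8")            # 3 bytes: 0xC3 0xA9 0x21

raw = io.BytesIO(data)                 # stands in for a file opened in binary mode
two_bytes = raw.read(2)                # both bytes of "é"
one_char = two_bytes.decode("utf-8")   # exactly one character

raw.seek(0)
half = raw.read(1)                     # only the first byte of "é"
try:
    half.decode("utf-8")
    half_decodes = True
except UnicodeDecodeError:
    half_decodes = False               # an incomplete sequence cannot be decoded

# Same text, different encoding: 2 bytes per character here.
utf16 = "é!".encode("utf-16-le")
```

So read(2) on the raw bytes yields one character for this UTF-8 data, and
read(1) yields something that is not a character at all. With UTF-16 the
same two characters take four bytes.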
Fortunately, the codecs module makes the whole process relatively painless:
>>> import codecs
>>> f = open("a_utf8_encoded_file.txt", "rb")
>>> stream = codecs.getreader('utf-8')(f)
>>> c = stream.read(1)
The 'stream' decodes the file's bytes into unicode characters, so 'c' is
a unicode instance holding one whole textual character, however many
bytes that character took up in the file.
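
The same idea as a self-contained sketch, again assuming Python 3 and
using io.BytesIO in place of a real file (the sample string is
hypothetical). The reader buffers incomplete byte sequences, so each
read(1) should hand back a whole character:

```python
import codecs
import io

# Hypothetical in-memory stand-in for a UTF-8 encoded file.
raw = io.BytesIO("héllo".encode("utf-8"))   # 5 characters, 6 bytes

# Wrap the byte stream in a decoding reader, as in the snippet above.
stream = codecs.getreader("utf-8")(raw)

first = stream.read(1)     # "h", one byte in the file
second = stream.read(1)    # "é", two bytes in the file
rest = stream.read()       # whatever is left, decoded
```

In Python 3 the built-in open() with an encoding argument, e.g.
open(path, encoding="utf-8"), gives you a text stream that decodes in the
same way.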
Matt Goodall, Pollenation Internet Ltd
w: http://www.pollenation.net
e: matt at pollenation.net
t: +44 (0)113 2252500
Any views expressed are my own and do not necessarily
reflect the views of my employer.