[Python-Dev] codecs question
Fred L. Drake, Jr.
fdrake@beopen.com
Fri, 29 Sep 2000 12:15:09 -0400 (EDT)
Jeremy was just playing with the xml.sax package, and decided to
print the string returned from parsing "û" (the copyright
symbol). Sure enough, he got a traceback:
>>> print u'\251'
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)
and asked me about it. I was a little surprised myself. First, that
anyone would use "print" in a SAX handler to start with, and second,
that it was so painful.
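To make the failure concrete, here is a minimal sketch of the same conversion done explicitly. The helper name try_encode is hypothetical and only there for illustration; the point is that U+00A9 has no ASCII representation but does fit in a single Latin-1 byte.

```python
text = u'\251'  # octal escape for U+00A9, the copyright sign


def try_encode(s, encoding):
    """Return the encoded bytes, or None if the codec cannot represent s.

    Hypothetical helper for illustration only.
    """
    try:
        return s.encode(encoding)
    except UnicodeError:
        # The same class of error that "print" raised in the traceback above.
        return None


ascii_result = try_encode(text, 'ascii')     # ASCII cannot represent U+00A9
latin1_result = try_encode(text, 'latin-1')  # Latin-1 can, as one byte
```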
Now, I can chalk this up to not using a reasonable stdout that
understands that Unicode needs to be translated to Latin-1 given my
font selection. So I looked at the codecs module to provide a usable
output stream. The EncodedFile class provides a nice wrapper around
another file object, and supports encoding in both directions.
Unfortunately, I can't see what "encoding" I should use if I want to
read & write Unicode string objects to it. ;( (Marc-Andre, please
tell me I've missed something!) I also don't think I
can use it with "print", extended or otherwise.
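For comparison, a stream writer obtained from the codecs module is one way to get an object whose write() accepts Unicode strings. This is a swapped-in technique, not EncodedFile, and the sketch assumes a current Python where codecs.getwriter() and io.BytesIO exist; it makes no claim about what the release under discussion provided. The BytesIO object merely stands in for a byte-oriented stdout.

```python
import codecs
import io

raw = io.BytesIO()  # stand-in for a byte-oriented output stream

# getwriter() returns the codec's StreamWriter class; instantiating it
# wraps the underlying stream so that write() takes Unicode text and
# emits Latin-1 bytes.
writer = codecs.getwriter('latin-1')(raw)
writer.write(u'\251 2000\n')

encoded = raw.getvalue()  # the bytes that reached the underlying stream
```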
The PRINT_ITEM opcode calls PyFile_WriteObject() with whatever it
gets, so that's fine. PyFile_WriteObject() then converts the object
using PyObject_Str() or PyObject_Repr(). For Unicode objects, the tp_str
handler attempts conversion to the default encoding ("ascii" in this
case), and raises the traceback we see above.
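The order of operations described above can be modelled in a few lines of Python. This is only a hypothetical model of the control flow, not the C source: write_object and DEFAULT_ENCODING are made-up names, and the encode() call stands in for the str()-style conversion. What it shows is that the conversion fails before the file object is ever handed any data.

```python
DEFAULT_ENCODING = 'ascii'  # assumed default encoding, as in the message


def write_object(obj, write):
    """Hypothetical model: convert first, then hand the result to the file."""
    data = obj.encode(DEFAULT_ENCODING)  # stands in for the tp_str conversion
    write(data)                          # reached only if conversion succeeds


chunks = []  # records everything the "file" actually receives

write_object(u'abc', chunks.append)  # pure ASCII text converts cleanly

try:
    write_object(u'\251', chunks.append)
    failed = False
except UnicodeError:
    # The conversion raised, so the write callback was never invoked.
    failed = True
```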
Perhaps a little extra work is needed in PyFile_WriteObject() to
allow Unicode objects to pass through if the file is merely file-like,
and let the next layer handle the conversion? This would probably
break code, and therefore not be acceptable.
On the other hand, it's annoying that I can't create a file-object
that takes Unicode strings from "print", and that doesn't seem intuitive.
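As a sketch of the kind of file-like object wished for here: the following assumes Python 3, where print is a function that passes text straight to the write() method of its file argument. That is not the behaviour described in this message. EncodingWriter is a hypothetical class name, and BytesIO again stands in for a byte-oriented stdout.

```python
import io


class EncodingWriter(object):
    """Hypothetical file-like wrapper: takes Unicode text, writes bytes."""

    def __init__(self, stream, encoding):
        self.stream = stream
        self.encoding = encoding

    def write(self, text):
        # The encoding decision lives here, in the file-like layer,
        # rather than in the caller.
        self.stream.write(text.encode(self.encoding))


raw = io.BytesIO()
out = EncodingWriter(raw, 'latin-1')

# Python 3 only: print() hands the string and the trailing newline
# to out.write() without converting them first.
print(u'\251', file=out)

printed = raw.getvalue()
```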
-Fred
--
Fred L. Drake, Jr. <fdrake at beopen.com>
BeOpen PythonLabs Team Member