[Python-Dev] codecs question

Fred L. Drake, Jr. fdrake@beopen.com
Fri, 29 Sep 2000 12:15:09 -0400 (EDT)

  Jeremy was just playing with the xml.sax package, and decided to
print the string returned from parsing "&#169;" (the copyright
symbol).  Sure enough, he got a traceback:

>>> print u'\251'

Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)

and asked me about it.  I was a little surprised myself.  First, that
anyone would use "print" in a SAX handler to start with, and second,
that it was so painful.
  Now, I can chalk this up to not using a reasonable stdout that
understands that Unicode needs to be translated to Latin-1 given my
font selection.  So I looked at the codecs module to provide a usable
output stream.  The EncodedFile class provides a nice wrapper around
another file object, and supports encoding in both directions.
  Unfortunately, I can't see what "encoding" I should use if I want to
read & write Unicode string objects to it.  ;(  (Marc-Andre, please
tell me I've missed something!)  I also don't think I
can use it with "print", extended or otherwise.
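
  As a rough illustration of why no single "encoding" fits here:
EncodedFile() recodes between two byte encodings, so both ends of the
wrapper deal in encoded data rather than Unicode objects.  The sketch
below assumes a present-day Python 3 interpreter (not the 2.0 one
discussed in this message); the helper name transcode_utf8_to_latin1
and the in-memory io.BytesIO target are made up for the example.

```python
import codecs
import io


def transcode_utf8_to_latin1(data):
    """Write UTF-8 bytes through EncodedFile; return the Latin-1 bytes stored."""
    raw = io.BytesIO()  # stands in for the real output file (assumption)
    # Second argument: encoding of the data the caller writes.
    # Third argument: encoding of the bytes that reach the wrapped file.
    wrapped = codecs.EncodedFile(raw, "utf-8", "latin-1")
    wrapped.write(data)
    return raw.getvalue()


# The caller still has to encode the Unicode string before writing it.
copyright_utf8 = u"\251".encode("utf-8")
print(repr(transcode_utf8_to_latin1(copyright_utf8)))
```
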
  The PRINT_ITEM opcode calls PyFile_WriteObject() with whatever it
gets, so that's fine.  PyFile_WriteObject() then converts the object
using PyObject_Str() or PyObject_Repr().  For Unicode objects, the
tp_str handler attempts conversion to the default encoding ("ascii"
in this case), and raises the error shown in the traceback above.
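
  The failing conversion can be reproduced without "print" at all.  A
minimal sketch, assuming Python 3 syntax and using an explicit
.encode() call as a stand-in for the implicit default-encoding step;
encode_with_default is a hypothetical helper, not part of any API.

```python
def encode_with_default(text, encoding="ascii"):
    """Return the encoded bytes, or the UnicodeError raised while encoding."""
    try:
        return text.encode(encoding)
    except UnicodeError as exc:
        # Same family of error as in the traceback: the ordinal of the
        # copyright sign (169) is outside the ASCII range 0..127.
        return exc


result = encode_with_default(u"\251")
print(type(result).__name__, result)

# With Latin-1 as the target the same character encodes to one byte.
print(repr(encode_with_default(u"\251", "latin-1")))
```
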
  Perhaps a little extra work is needed in PyFile_WriteObject() to
allow Unicode objects to pass through if the file is merely file-like,
and let the next layer handle the conversion?  This would probably
break code, and would therefore not be acceptable.
  On the other hand, it's annoying that I can't create a file object
that takes Unicode strings from "print", and that doesn't seem
intuitive.
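
  For comparison, a sketch of the kind of stream being asked for,
assuming a present-day Python 3 interpreter, where print() hands text
to the file's write() method unchanged.  It uses codecs.getwriter() to
build a Latin-1 writer around a byte stream; make_latin1_stream and
the io.BytesIO target are hypothetical names for the example.  It does
not help with the 2.0 "print" statement, which converts the object
before the wrapper ever sees it.

```python
import codecs
import io


def make_latin1_stream(raw):
    """Wrap a byte stream so that write() accepts Unicode text."""
    # codecs.getwriter() returns the StreamWriter class for the codec;
    # calling it with a byte stream gives a file-like text writer.
    return codecs.getwriter("latin-1")(raw)


raw = io.BytesIO()  # stands in for a real stdout (assumption)
out = make_latin1_stream(raw)
print(u"\251 2000", file=out)  # text goes in, Latin-1 bytes land in raw
print(repr(raw.getvalue()))
```
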


Fred L. Drake, Jr.  <fdrake at beopen.com>
BeOpen PythonLabs Team Member