Jeremy was just playing with the xml.sax package, and decided to print the string returned from parsing "©" (the copyright symbol). Sure enough, he got a traceback:
    >>> print u'\251'
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeError: ASCII encoding error: ordinal not in range(128)

and asked me about it. I was a little surprised myself. First, that anyone would use "print" in a SAX handler to start with, and second, that it was so painful.

Now, I can chalk this up to not using a reasonable stdout that understands that Unicode needs to be translated to Latin-1 given my font selection. So I looked at the codecs module to provide a usable output stream. The EncodedFile class provides a nice wrapper around another file object, and supports encoding both ways. Unfortunately, I can't see what "encoding" I should use if I want to read & write Unicode string objects to it. ;( (Marc-Andre, please tell me I've missed something!)

I also don't think I can use it with "print", extended or otherwise. The PRINT_ITEM opcode calls PyFile_WriteObject() with whatever it gets, so that's fine. PyFile_WriteObject() then converts the object using PyObject_Str() or PyObject_Repr(). For Unicode objects, the tp_str handler attempts conversion to the default encoding ("ascii" in this case), and raises the traceback we see above.

Perhaps a little extra work is needed in PyFile_WriteObject() to allow Unicode objects to pass through if the file is merely file-like, and let the next layer handle the conversion? This would probably break code, and therefore not be acceptable. On the other hand, it's annoying that I can't create a file-object that takes Unicode strings from "print", and that doesn't seem intuitive.

  -Fred

--
Fred L. Drake, Jr. <fdrake at beopen.com>
BeOpen PythonLabs Team Member
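For readers following along, the failure described above can be reproduced without "print" at all, since it comes down to the implicit conversion to the default "ascii" encoding. The following is only a minimal sketch, not code from the thread; the helper name try_encode is hypothetical, and it does the encoding step explicitly so that it runs on current Python versions as well.

```python
# u'\251' uses an octal escape: 0o251 == 169 == U+00A9, the copyright sign.
text = u'\251'


def try_encode(value, encoding):
    """Hypothetical helper: return the encoded bytes, or None on failure."""
    try:
        return value.encode(encoding)
    except UnicodeError:
        # ASCII only covers ordinals 0..127, so U+00A9 lands here.
        return None


# The conversion "print" attempted implicitly, done by hand.
ascii_result = try_encode(text, 'ascii')

# Latin-1 covers ordinals 0..255 and so can represent the character.
latin1_result = try_encode(text, 'latin-1')
```

Here ascii_result ends up as None because the ordinal is outside range(128), while latin1_result holds the single byte 0xA9.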
"Fred L. Drake, Jr." wrote:
That's a consequence of defaulting to ASCII for all platforms instead of choosing the encoding depending on the current locale (the site.py file has code which does the latter).
That depends on what you want to see as output ;-) E.g. in Europe you'd use Latin-1 (which also contains the copyright symbol).
Right.
The problem is that the .write() method of a file-like object will most probably only work with string objects. If it uses "s#" or "t#" it's lucky, because then the argument parser will apply the necessary magic to the input object to get out some object ready for writing to the file. Otherwise it will simply fail with a type error.

Simply allowing PyObject_Str() to return Unicode objects too is not an alternative either, since that would certainly break tons of code. Implementing tp_print for Unicode wouldn't get us anything either.

Perhaps we'll need to fix PyFile_WriteObject() to special-case Unicode and allow calling .write() with a Unicode object, and fix those .write() methods which don't do the right thing?! This is a project for 2.1. In 2.0 only explicitly calling the .write() method will do the trick, and EncodedFile() helps with this.

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
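As an illustration of "explicitly calling the .write() method", here is a rough sketch of two ways to push a Unicode string through a Latin-1 encoding layer. It is written against current Python versions, not 2.0. The in-memory BytesIO objects stand in for a real byte-oriented stdout, the sample text "2000" is made up, and codecs.getwriter() is a swapped-in convenience that may not match what was available at the time of this thread.

```python
import codecs
import io

text = u'\251 2000\n'   # illustrative sample text, U+00A9 followed by ASCII

# Variant 1: a codec StreamWriter accepts Unicode strings directly and
# encodes them on the way to the underlying byte stream.
raw_writer = io.BytesIO()
writer = codecs.getwriter('latin-1')(raw_writer)
writer.write(text)
via_writer = raw_writer.getvalue()

# Variant 2: EncodedFile recodes between two byte encodings. The caller
# passes data in data_encoding (UTF-8 here, an arbitrary choice) and the
# wrapped file receives it in file_encoding (Latin-1).
raw_recoder = io.BytesIO()
recoder = codecs.EncodedFile(raw_recoder, 'utf-8', 'latin-1')
recoder.write(text.encode('utf-8'))
via_recoder = raw_recoder.getvalue()
```

Both variants leave the same Latin-1 bytes in the underlying stream. The difference is which side does the encoding of the Unicode string: the wrapper in the first case, the caller in the second.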