Printing and __unicode__
In c.l.p, Henry Thompson wondered why printing would ignore __unicode__. Consider this:

    import codecs
    stream = codecs.open("/tmp/bla", "w", encoding="cp1252")

    class Foo:
        def __unicode__(self):
            return u"\N{EURO SIGN}"

    foo = Foo()
    print >>stream, foo

This succeeds, but /tmp/bla now contains

    <__main__.Foo instance at 0x4026e68c>

He argues that it should instead invoke __unicode__, similar to the automatic invocation of __str__ when writing to a byte stream.

I agree that this is desirable, but I wonder what the best approach would be:

A. Printing tries __str__, __unicode__, and __repr__, in this order.

B. A file indicates "unicode-awareness" somehow. For a Unicode-aware file, it tries __unicode__, __str__, and __repr__, in order.

C. A file indicates that it is "unicode-requiring" somehow. For a unicode-requiring file, it tries __unicode__; if that fails, it tries __repr__ and converts the result to Unicode.

Which of these, if any, would be most Pythonish?

Regards,
Martin
In c.l.p, Henry Thompson wondered why printing would ignore __unicode__. Consider this:
    import codecs
    stream = codecs.open("/tmp/bla", "w", encoding="cp1252")

    class Foo:
        def __unicode__(self):
            return u"\N{EURO SIGN}"

    foo = Foo()
    print >>stream, foo
This succeeds, but /tmp/bla now contains
<__main__.Foo instance at 0x4026e68c>
He argues that it should instead invoke __unicode__, similar to the automatic invocation of __str__ when writing to a byte stream.
I agree that this is desirable, but I wonder what the best approach would be:
A. Printing tries __str__, __unicode__, and __repr__, in this order.
If you try __str__ before __unicode__, you'll always get the default __str__ for all new-style classes.
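Guido's objection can be checked directly: every new-style class inherits a __str__ from object, so a lookup order that tries __str__ first always succeeds before __unicode__ is ever consulted. A small sketch (the class C is just an illustration):

```python
class C(object):
    def __unicode__(self):
        return u"\N{EURO SIGN}"

c = C()
# __str__ is always found, inherited from object ...
assert hasattr(c, "__str__")
# ... and yields the default "<module.C object at 0x...>" form,
# so __unicode__ would never be reached under ordering A.
assert str(c).startswith("<")
```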
B. A file indicates "unicode-awareness" somehow. For a Unicode-aware file, it tries __unicode__, __str__, and __repr__, in order.
I like this.
C. A file indicates that it is "unicode-requiring" somehow. For a unicode-requiring file, it tries __unicode__; if that fails, it tries __repr__ and converts the result to Unicode.
Falling back to __repr__ without __str__ doesn't make sense.
Which of these, if any, would be most Pythonish?
B.

--Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum
B. A file indicates "unicode-awareness" somehow. For a Unicode-aware file, it tries __unicode__, __str__, and __repr__, in order.
I like this.
Ok, then the question is: How can a file indicate its unicode-awareness?

I propose that the presence of an attribute "encoding" is taken as such an indication; this would cover all existing cases with no change to the file-like objects.

In case the stream is "natively" Unicode (i.e. doesn't ever convert to byte strings), setting encoding to None should be allowed (this actually indicates that StringIO should have the encoding attribute).

Regards,
Martin
B. A file indicates "unicode-awareness" somehow. For a Unicode-aware file, it tries __unicode__, __str__, and __repr__, in order.
I like this.
Ok, then the question is: How can a file indicate its unicode-awareness? I propose that presence of an attribute "encoding" is taken as such an indication; this would cover all existing cases with no change to the file-like objects.
+1
In case the stream is "natively" Unicode (i.e. doesn't ever convert to byte strings), setting encoding to None should be allowed (this actually indicates that StringIO should have the encoding attribute).
+1

--Guido van Rossum
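To make the mechanism concrete, here is a rough sketch of the dispatch being discussed. Everything in it (print_to, AwareStream, PlainStream, Doc) is a hypothetical illustration of option B plus the .encoding convention, not the actual PyFile_WriteObject() code:

```python
class Doc:
    """Toy object with both a Unicode and a byte-string form."""
    def __unicode__(self):
        return u"\N{EURO SIGN}"
    def __str__(self):
        return "EUR"

class AwareStream:
    """A stream that advertises unicode-awareness via .encoding."""
    def __init__(self, encoding):
        self.encoding = encoding
        self.parts = []
    def write(self, data):
        self.parts.append(data)

class PlainStream:
    """A byte-oriented stream: no .encoding attribute."""
    def __init__(self):
        self.parts = []
    def write(self, data):
        self.parts.append(data)

def print_to(stream, obj):
    # Unicode-aware stream: try __unicode__, __str__, __repr__ in order.
    if hasattr(stream, "encoding"):
        for name in ("__unicode__", "__str__", "__repr__"):
            method = getattr(obj, name, None)
            if method is not None:
                stream.write(method())
                return
    # Plain stream: the usual str() conversion.
    stream.write(str(obj))

aware, plain = AwareStream("cp1252"), PlainStream()
print_to(aware, Doc())   # picks __unicode__
print_to(plain, Doc())   # picks str()
assert aware.parts == [u"\N{EURO SIGN}"]
assert plain.parts == ["EUR"]
```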
Martin v. Loewis wrote:
Guido van Rossum
writes: B. A file indicates "unicode-awareness" somehow. For a Unicode-aware file, it tries __unicode__, __str__, and __repr__, in order.
I like this.
Ok, then the question is: How can a file indicate its unicode-awareness? I propose that presence of an attribute "encoding" is taken as such an indication; this would cover all existing cases with no change to the file-like objects.
Thanks to the time machine, this attribute is already available on stream objects created with codecs.open().

+1
In case the stream is "natively" Unicode (i.e. doesn't ever convert to byte strings), setting encoding to None should be allowed (this actually indicates that StringIO should have the encoding attribute).
-1

The presence of .encoding should indicate that it is safe to write Unicode objects to .write(). Let the stream decide what to do with the Unicode object (e.g. it would probably encode the Unicode object using the .encoding and only then write it to the outside world).

--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
eGenix.com -- Makers of the Python mx Extensions: mxDateTime, mxODBC, ...
Python Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/
"M.-A. Lemburg"
In case the stream is "natively" Unicode (i.e. doesn't ever convert to byte strings), setting encoding to None should be allowed (this actually indicates that StringIO should have the encoding attribute).
-1
The presence of .encoding should indicate that it is safe to write Unicode objects to .write(). Let the stream decide what to do with the Unicode object (e.g. it would probably encode the Unicode object using the .encoding and only then write it to the outside world).
So should a StringIO object have an .encoding attribute or not? If not, should

    f = StringIO.StringIO()
    print >>f, x

try to invoke Unicode conversion or not? If it should, how should it find out that this is safe to do?

Regards,
Martin
Martin v. Loewis wrote:
"M.-A. Lemburg"
writes: In case the stream is "natively" Unicode (i.e. doesn't ever convert to byte strings), setting encoding to None should be allowed (this actually indicates that StringIO should have the encoding attribute).
-1
The presence of .encoding should indicate that it is safe to write Unicode objects to .write(). Let the stream decide what to do with the Unicode object (e.g. it would probably encode the Unicode object using the .encoding and only then write it to the outside world).
So should a StringIO object have an .encoding attribute or not?
If not, should
    f = StringIO.StringIO()
    print >>f, x
try to invoke Unicode conversion or not?
StringIO should be considered a non-Unicode-aware stream, so it should not implement .encoding. Instead, PyFile_WriteObject() will simply call __str__ on the Unicode object and thus use the default encoding for conversion (this is what StringIO does currently).

If somebody wants to use a StringIO object as a Unicode-aware stream, the tools in codecs.py can be used for this (basically by doing the same kind of wrapping as codecs.open() does).
If it should, how should it find out that this is safe to do?
--
Marc-Andre Lemburg
"M.-A. Lemburg"
StringIO should be considered a non-Unicode aware stream, so it should not implement .encoding. Instead, PyFile_WriteObject() will simply call __str__ on the Unicode object and thus use the default encoding for conversion (this is what StringIO does currently).
This is not what StringIO does currently:
    >>> s = StringIO.StringIO()
    >>> print >>s, u"Hallo"
    >>> s.getvalue()
    u'Hallo\n'
print special-cases Unicode objects and passes them to the stream. So printing Unicode objects on a StringIO builds up a Unicode value.
If somebody wants to use a StringIO object as Unicode aware stream
StringIO *is* Unicode-aware.

Regards,
Martin
StringIO *is* Unicode-aware.
Though it acts somewhat as if its default encoding is "ascii". This is somewhat inconsistent: you can write arbitrary Unicode strings, but the Unicode won't be converted to ASCII; ASCII is converted to Unicode, though. And of course cStringIO doesn't support Unicode at all.

--Guido van Rossum
Guido van Rossum
Though it acts somewhat as if its default encoding is "ascii". This is somewhat inconsistent: you can write arbitrary Unicode strings, but the Unicode won't be converted to ASCII. ASCII is converted to Unicode though.
It is the only case of a "pure Unicode" stream in Python, where the underlying "native" sequence is not one of bytes, but one of Unicode characters.

The real problem is that the "orientation" (wide or narrow strings) is determined by the things written into the stream. It might be more reasonable to have StringIO.ByteIO and StringIO.UnicodeIO constructors, which both accept an encoding= argument, and will convert objects of the wrong "orientation" using that encoding (defaulting to the system encoding).

Regards,
Martin
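What Martin proposes might look roughly like this. The names UnicodeIO and ByteIO come from his message, but the implementation below is a hypothetical sketch (taking an explicit encoding argument rather than consulting the system default):

```python
import io

class UnicodeIO:
    """Unicode-oriented buffer: byte strings are decoded on the way in."""
    def __init__(self, encoding="ascii"):
        self.encoding = encoding
        self._buf = io.StringIO()
    def write(self, data):
        if isinstance(data, bytes):
            data = data.decode(self.encoding)
        self._buf.write(data)
    def getvalue(self):
        return self._buf.getvalue()

class ByteIO:
    """Byte-oriented buffer: text is encoded on the way in."""
    def __init__(self, encoding="ascii"):
        self.encoding = encoding
        self._buf = io.BytesIO()
    def write(self, data):
        if isinstance(data, str):
            data = data.encode(self.encoding)
        self._buf.write(data)
    def getvalue(self):
        return self._buf.getvalue()

u = UnicodeIO("latin-1")
u.write(b"abc")          # narrow input, decoded to Unicode
u.write(u"def")          # wide input, kept as-is
assert u.getvalue() == u"abcdef"

b = ByteIO("latin-1")
b.write(u"abc")          # wide input, encoded to bytes
b.write(b"def")
assert b.getvalue() == b"abcdef"
```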
Though it acts somewhat as if its default encoding is "ascii". This is somewhat inconsistent: you can write arbitrary Unicode strings, but the Unicode won't be converted to ASCII. ASCII is converted to Unicode though.
It is the only case of a "pure Unicode" stream in Python, where the underlying "native" sequence is not one of bytes, but one of Unicode characters.
The real problem is that the "orientation" (wide or narrow strings) is determined by the things written into the stream.
It might be more reasonable to have StringIO.ByteIO and StringIO.UnicodeIO constructors, which both accept an encoding= argument, and will convert objects of the wrong "orientation" using that encoding (defaulting to the system encoding).
I'm not sure about those names, but I agree that the encoding should be forced when the StringIO instance is created.

Given that using Unicode with these is currently fragile at best, maybe we should say that unless you give an encoding argument, it's a byte stream and doesn't allow Unicode at all? That would be consistent with cStringIO.

--Guido van Rossum
Guido van Rossum wrote:
Though it acts somewhat as if its default encoding is "ascii". This is somewhat inconsistent: you can write arbitrary Unicode strings, but the Unicode won't be converted to ASCII. ASCII is converted to Unicode though.
It is the only case of a "pure Unicode" stream in Python, where the underlying "native" sequence is not one of bytes, but one of Unicode characters.
The real problem is that the "orientation" (wide or narrow strings) is determined by the things written into the stream.
It might be more reasonable to have StringIO.ByteIO and StringIO.UnicodeIO constructors, which both accept an encoding= argument, and will convert objects of the wrong "orientation" using that encoding (defaulting to the system encoding).
I'm not sure about those names, but I agree that the encoding should be forced when the StringIO instance is created. Given that using Unicode with these is currently fragile at best, maybe we should say that unless you give an encoding argument, it's a byte stream and doesn't allow Unicode at all? That would be consistent with cStringIO.
+1

The fact that StringIO works with Unicode (and then only in the case where you *only* pass Unicode to it) is more an implementation detail than a true feature. cStringIO doesn't have this implementation detail, so porting from StringIO to the much faster cStringIO doesn't work at all for Unicode.

I think that StringIO and cStringIO should be regarded as binary streams without any encoding knowledge. It is easy enough to wrap these into Unicode-aware streams using the codecs.StreamReaderWriter class, as is done in codecs.open(). That API already adds the .encoding attribute to the stream object, BTW.

--
Marc-Andre Lemburg
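The wrapping MAL mentions can be seen with the stock codecs machinery. The sketch below uses codecs.getwriter, a close relative of the StreamReaderWriter used by codecs.open(); cp1252 is chosen only because it can encode the euro sign:

```python
import codecs
import io

raw = io.BytesIO()                        # plain byte stream, no encoding knowledge
writer = codecs.getwriter("cp1252")(raw)  # Unicode-aware wrapper around it
writer.write(u"\N{EURO SIGN}")            # Unicode in ...
assert raw.getvalue() == b"\x80"          # ... cp1252 bytes out
```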
"M.-A. Lemburg"
The fact that StringIO works with Unicode (and then only in the case where you *only* pass Unicode to it) is more an implementation detail than a true feature.
It's a true feature. You explicitly fixed that feature in

    revision 1.20
    date: 2002/01/06 17:15:05;  author: lemburg;  state: Exp;  lines: +8 -5
    Restore Python 2.1 StringIO.py behaviour: support concatenating
    Unicode string snippets to larger Unicode strings.

    This fix should also go into Python 2.2.1.

after you broke it in

    revision 1.19
    date: 2001/09/24 17:34:52;  author: lemburg;  state: Exp;  lines: +4 -1
    branches:  1.19.12;
    StringIO patch #462596: let's [c]StringIO accept read buffers on
    input to .write() too.
cStringIO doesn't have this implementation detail, so porting from StringIO to the much faster cStringIO doesn't work at all for Unicode.
Correct. That still doesn't make it an implementation detail.
I think that StringIO and cStringIO should be regarded as binary streams without any encoding knowledge. It is easy enough to wrap these into Unicode aware streams using the codecs.StreamReaderWriter class as is done in codecs.open().
Then why did you fix that behaviour after you broke it?

Regards,
Martin
Martin v. Loewis wrote:
"M.-A. Lemburg"
writes: The fact that StringIO works with Unicode (and then only in the case where you *only* pass Unicode to it) is more an implementation detail than a true feature.
It's a true feature. You explicitly fixed that feature in
    revision 1.20
    date: 2002/01/06 17:15:05;  author: lemburg;  state: Exp;  lines: +8 -5
    Restore Python 2.1 StringIO.py behaviour: support concatenating
    Unicode string snippets to larger Unicode strings.
This fix should also go into Python 2.2.1.
after you broke it in
    revision 1.19
    date: 2001/09/24 17:34:52;  author: lemburg;  state: Exp;  lines: +4 -1
    branches:  1.19.12;
    StringIO patch #462596: let's [c]StringIO accept read buffers on
    input to .write() too.
I doubt that it's a true feature. The fact that I broke it in the above patch by introducing the str(data) call in StringIO.py suggests that whoever complained about this change was using an implementation detail rather than a documented and originally intended feature of StringIO.

If you need something like StringIO for Unicode then I would suggest to create a similar object which then only deals with Unicode, e.g. UnicodeIO. cStringIO could then be extended to also support such an object by using the same trick as SRE does to support two native types (putting the code into a .h file and then including it twice).

Back to the original question. I don't have a problem with leaving in the Unicode support in StringIO's .write() method, but the introduction of the Unicode print support should not rely on this detail. Instead, someone wanting to write Unicode only to a StringIO-like object should be directed to UnicodeIO.

Now, to satisfy the request of the poster who wanted support for __unicode__ in PyFile_WriteObject(), we need to add something which lets PyFile_WriteObject() determine whether to look for __unicode__ or not (per default, it passes through Unicode objects as-is and applies str() to all other objects).

I like the idea of using the .encoding attribute as a flag for this. What I don't like is that setting it to None should be used for Unicode-only streams (ones that take Unicode on input and use Unicode on output). To me, .encoding = None would signal: this stream doesn't do anything to the input data and passes it to the output stream as-is. Much better, IMHO, would be to use .encoding = 'unicode' on Unicode-only streams such as the mentioned UnicodeIO object.

In summary, StringIO objects should not implement .encoding, while a new Unicode-only stream-like object UnicodeIO should have .encoding = 'unicode'. The same could then be done with the corresponding cStringIO objects.
PS: Some may not know, but the obvious way of fixing printing of Unicode by adding a tp_print slot implementation does not work, since that slot takes a FILE* pointer as file "object", which, of course, cannot include any additional information such as the encoding.

--
Marc-Andre Lemburg
Martin v. Loewis wrote:
"M.-A. Lemburg"
writes: The fact that StringIO works with Unicode (and then only in the case where you *only* pass Unicode to it) is more an implementation detail than a true feature.
It's a true feature. You explicitly fixed that feature in
    revision 1.20
    date: 2002/01/06 17:15:05;  author: lemburg;  state: Exp;  lines: +8 -5
    Restore Python 2.1 StringIO.py behaviour: support concatenating
    Unicode string snippets to larger Unicode strings.
This fix should also go into Python 2.2.1.
after you broke it in
    revision 1.19
    date: 2001/09/24 17:34:52;  author: lemburg;  state: Exp;  lines: +4 -1
    branches:  1.19.12;
    StringIO patch #462596: let's [c]StringIO accept read buffers on
    input to .write() too.
I doubt that it's a true feature. The fact that I broke it in the above patch by introducing the str(data) call in StringIO.py suggests that whoever complained about this change was using an implementation detail rather than a documented and originally intended feature of StringIO.
If you need something like StringIO for Unicode then I would suggest to create a similar object which then only deals with Unicode, e.g. UnicodeIO.
But since StringIO already works for Unicode, why bother?
cStringIO could then be extended to also support such an object by using the same trick as SRE does to support two native types (putting the code into a .h file and then including it twice).
(Off-topic: each time I fix a bug twice, once in stringobject.c and once in unicodeobject.c, I wish we'd done that for string and unicode objects. But it's too late now, and also may not be realistic given some different implementation choices.)
Back to the original question. I don't have a problem with leaving in the Unicode support in StringIO's .write() method, but the introduction of the Unicode print support should not rely on this detail.
Agreed.
Instead, someone wanting to write Unicode only to a StringIO-like object should be directed to UnicodeIO.
Now, to satisfy the request of the poster who wanted support for __unicode__ in PyFile_WriteObject(), we need to add something which lets PyFile_WriteObject() determine whether to look for __unicode__ or not (per default, it passes through Unicode objects as-is and applies str() to all other objects).
I like the idea of using the .encoding attribute as flag for this. What I don't like is that setting it to None should be used for Unicode-only streams (ones that take Unicode on input and use Unicode on output). To me, .encoding = None would signal: this stream doesn't do anything to the input data and passes it to the output stream as-is.
But I'm not sure that's a useful feature. Maybe encoding=None could mean the current StringIO behavior. <0.5 wink>
Much better, IMHO, would be to use .encoding = 'unicode' on Unicode-only streams such as the mentioned UnicodeIO object.
Yes. (Except 'unicode' is not an encoding name, right? Maybe it should be?)
In summary, StringIO objects should not implement .encoding while a new Unicode-only stream-like object UnicodeIO should have .encoding = 'unicode'.
The same could then be done with the corresponding cStringIO objects.
PS: Some may not know, but the obvious way of fixing printing of Unicode by adding a tp_print slot implementation does not work, since that slot takes a FILE* pointer as file "object" which, of course, cannot include any additional information such as the encoding.
Yes, tp_print is only an optimization for tp_repr and tp_str when writing to a "real" file object.

--Guido van Rossum
Guido van Rossum
I'm not sure about those names, but I agree that the encoding should be forced when the StringIO instance is created. Given that using Unicode with these is currently fragile at best, maybe we should say that unless you give an encoding argument, it's a byte stream and doesn't allow Unicode at all? That would be consistent with cStringIO.
But it would break compatibility, at least with xml.dom.minidom.Node.write, which supports StringIO currently and will collect Unicode strings in it.

Regards,
Martin
I'm not sure about those names, but I agree that the encoding should be forced when the StringIO instance is created. Given that using Unicode with these is currently fragile at best, maybe we should say that unless you give an encoding argument, it's a byte stream and doesn't allow Unicode at all? That would be consistent with cStringIO.
But it would break compatibility, at least with xml.dom.minidom.Node.write, which supports StringIO currently and will collect Unicode strings in it.
Would it be acceptable if StringIO required you to be consistent, i.e. write only Unicode *or* only 8-bit strings, and never mix them?

That would be some kind of magical behavior; the encoding attribute should be set to reflect the mode after the first write, and should be None initially (or some other way to indicate the magic).

--Guido van Rossum
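A sketch of that magical behavior, with hypothetical names (ModalIO and the 'unicode'/'bytes' markers are illustrations, not a proposal for the actual values):

```python
class ModalIO:
    """First write fixes the orientation; mixing str and bytes then fails."""
    def __init__(self):
        self.encoding = None          # None: orientation not yet decided
        self._parts = []
    def write(self, data):
        if self.encoding is None:
            # The first write decides the stream's mode.
            self.encoding = "unicode" if isinstance(data, str) else "bytes"
        expected = str if self.encoding == "unicode" else bytes
        if not isinstance(data, expected):
            raise TypeError("cannot mix text and bytes in one ModalIO")
        self._parts.append(data)
    def getvalue(self):
        joiner = u"" if self.encoding != "bytes" else b""
        return joiner.join(self._parts)

m = ModalIO()
m.write(u"Hallo")
m.write(u" Welt")
assert m.encoding == "unicode"
assert m.getvalue() == u"Hallo Welt"
try:
    m.write(b"oops")          # wrong orientation after the first write
except TypeError:
    pass
```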
Guido van Rossum
But it would break compatibility, atleast with xml.dom.minidom.Node.write, which support StringIO currently, and will collect Unicode strings in it.
Would it be acceptable if StringIO required you to be consistent, i.e. write only Unicode *or* only 8-bit strings, and never mix them?
OK, I give up. Let's just keep StringIO exactly as it was. The current behavior is relied upon too much to be able to change it.

--Guido van Rossum
participants (3)
- Guido van Rossum
- M.-A. Lemburg
- martin@v.loewis.de