why doesn't print pass unicode strings on to the file object?
is there any reason why "print" cannot pass unicode strings on to the underlying write method?
Mostly because there is no guarantee that every .write method will support Unicode objects. I see two options: either a stream might declare itself as supporting unicode on output (by, say, providing a unicode attribute), or all streams are required by BDFL pronouncement to accept Unicode objects.

BTW, your wrapper example can be rewritten as

    import sys, codecs
    sys.stdout = codecs.lookup("iso-8859-1")[3](sys.stdout)

I wish codecs.lookup returned a record with named fields, instead of a list, so I could write

    sys.stdout = codecs.lookup("iso-8859-1").writer(sys.stdout)

(the other field names would be encode, decode, and reader).

Regards,
Martin
Martin von Loewis wrote:
is there any reason why "print" cannot pass unicode strings on to the underlying write method?
Mostly because there is no guarantee that every .write method will support Unicode objects. I see two options: either a stream might declare itself as supporting unicode on output (by, say, providing a unicode attribute), or all streams are required by BDFL pronouncement to accept Unicode objects.
I think the latter option would go a long way: many file-like objects are written in C and will use the C parser markers. These can handle Unicode without problem (issuing an exception in case the conversion to ASCII fails). The only notable exception is the cStringIO module -- but this could probably be changed to be buffer interface compliant too.
BTW, your wrapper example can be rewritten as
import sys,codecs sys.stdout = codecs.lookup("iso-8859-1")[3](sys.stdout)
I wish codecs.lookup returned a record with named fields, instead of a list, so I could write
sys.stdout = codecs.lookup("iso-8859-1").writer(sys.stdout)
(the other field names would be encode,decode, and reader).
Why don't you write a small helper function for the codecs.py module?! E.g. codecs.info("iso-8859-1") could provide an alternative interface which returns a CodecInfo instance with attributes instead of tuple entries.

Note that the tuple interface was chosen for sake of speed and better handling at C level (tuples can be cached and are easily parseable in C).

--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company: http://www.egenix.com/
Python Software: http://www.lemburg.com/python/
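[Editor's note: the codecs.info() helper Marc-Andre describes can be sketched in a few lines of modern Python. The class name `CodecInfoRecord` is made up for illustration; the field names follow the thread, and the shipped interface ended up different, so treat this as a sketch, not the real API.]

```python
import codecs

class CodecInfoRecord:
    """Illustrative record wrapper: looks up a codec and exposes the
    first four entries of codecs.lookup()'s result as named attributes."""
    def __init__(self, encoding):
        # codecs.lookup() yields (encode, decode, reader, writer)
        # as its first four tuple entries.
        codec = codecs.lookup(encoding)
        self.encode, self.decode, self.reader, self.writer = codec[:4]

info = CodecInfoRecord("iso-8859-1")
writer_factory = info.writer  # usable as info.writer(some_stream)
```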
Martin von Loewis wrote:
is there any reason why "print" cannot pass unicode strings on to the underlying write method?
Mostly because there is no guarantee that every .write method will support Unicode objects. I see two options: either a stream might declare itself as supporting unicode on output (by, say, providing a unicode attribute), or all streams are required by BDFL pronouncement to accept Unicode objects.
I think the latter option would go a long way: many file-like objects are written in C and will use the C parser markers. These can handle Unicode without problem (issuing an exception in case the conversion to ASCII fails).
Agreed, but BDFL pronouncement doesn't make it so: individual modules still have to be modified if they don't do the right thing (especially 3rd party modules -- we have no control there). And then, what's the point of handling Unicode if we only accept Unicode-encoded ASCII strings?
The only notable exception is the cStringIO module -- but this could probably be changed to be buffer interface compliant too.
Sure, just submit a patch.

--Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum wrote:
Martin von Loewis wrote:
is there any reason why "print" cannot pass unicode strings on to the underlying write method?
Mostly because there is no guarantee that every .write method will support Unicode objects. I see two options: either a stream might declare itself as supporting unicode on output (by, say, providing a unicode attribute), or all streams are required by BDFL pronouncement to accept Unicode objects.
I think the latter option would go a long way: many file-like objects are written in C and will use the C parser markers. These can handle Unicode without problem (issuing an exception in case the conversion to ASCII fails).
Agreed, but BDFL pronouncement doesn't make it so: individual modules still have to be modified if they don't do the right thing (especially 3rd party modules -- we have no control there).
True, but we are only talking about file objects which are used for sys.stdout -- I don't think that allowing Unicode to be passed to their .write() methods will break a whole lot of code.
And then, what's the point of handling Unicode if we only accept Unicode-encoded ASCII strings?
I was under the impression that Fredrik wants to let Unicode pass through from the print statement to the .write method of sys.stdout. If the sys.stdout object knows about Unicode then things will work just fine; if not, the internal Python machinery will either try to convert it to an ASCII string (e.g. if the file object uses "s#") or the file object will raise a TypeError (this is what cStringIO does). Currently, Python forces conversion to 8-bit strings for all printed objects (at least this is what it did last time I looked into this problem a long while ago).
The only notable exception is the cStringIO module -- but this could probably be changed to be buffer interface compliant too.
Sure, just submit a patch.
-- Marc-Andre Lemburg
Mostly because there is no guarantee that every .write method will support Unicode objects. I see two options: either a stream might declare itself as supporting unicode on output (by, say, providing a unicode attribute), or all streams are required by BDFL pronouncement to accept Unicode objects.
I think the latter option would go a long way: many file-like objects are written in C and will use the C parser markers. These can handle Unicode without problem (issuing an exception in case the conversion to ASCII fails).
Agreed, but BDFL pronouncement doesn't make it so: individual modules still have to be modified if they don't do the right thing (especially 3rd party modules -- we have no control there).
And then, what's the point of handling Unicode if we only accept Unicode-encoded ASCII strings?
By accepting Unicode, I would specifically require that they, at least:

- do not crash the interpreter when being passed Unicode objects
- attempt to perform some conversion if they do not support Unicode directly; if they don't know any specific conversion, the default conversion should be used (i.e. that they don't give a TypeError).

With these assumptions, it is possible to allow print to pass Unicode objects to the file's write method, instead of converting Unicode itself. This, in turn, enables users to replace sys.stdout with something that supports a different encoding.

Of course, you still may get Unicode errors, since some streams may not support all Unicode characters (e.g. since the terminal does not support them).

Regards,
Martin
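[Editor's note: the replacement-stream behaviour Martin describes can be tried directly with the codec machinery. A minimal sketch, using an in-memory byte stream as a stand-in for a terminal and the modern codecs.getwriter() spelling, which postdates this thread:]

```python
import codecs
import io

# Stand-in for a byte-oriented stream such as a terminal or file.
raw = io.BytesIO()

# Wrap it so that .write() accepts text and encodes it as ISO-8859-1,
# the way a replaced sys.stdout would.
stream = codecs.getwriter("iso-8859-1")(raw)
stream.write("Grüße")  # text in, Latin-1 bytes out

assert raw.getvalue() == "Grüße".encode("iso-8859-1")

# Characters outside the stream's encoding still raise, as Martin notes:
try:
    stream.write("\u20ac")  # EURO SIGN is not in ISO-8859-1
except UnicodeEncodeError:
    pass
```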
Martin von Loewis wrote:
Mostly because there is no guarantee that every .write method will support Unicode objects. I see two options: either a stream might declare itself as supporting unicode on output (by, say, providing a unicode attribute), or all streams are required by BDFL pronouncement to accept Unicode objects.
I think the latter option would go a long way: many file-like objects are written in C and will use the C parser markers. These can handle Unicode without problem (issuing an exception in case the conversion to ASCII fails).
Agreed, but BDFL pronouncement doesn't make it so: individual modules still have to be modified if they don't do the right thing (especially 3rd party modules -- we have no control there).
And then, what's the point of handling Unicode if we only accept Unicode-encoded ASCII strings?
By accepting Unicode, I would specifically require that they, at least:

- do not crash the interpreter when being passed Unicode objects
I don't see how this could happen. At worst users will see a TypeError or UnicodeError when passing a Unicode object to print with some sys.stdout hook in place which doesn't know about Unicode objects.
- attempt to perform some conversion if they do not support Unicode directly; if they don't know any specific conversion, the default conversion should be used (i.e. that they don't give a TypeError).
That's what happens if the hook uses "s#" or "t#". Otherwise they'll raise a TypeError.
With these assumptions, it is possible to allow print to pass Unicode objects to the file's write method, instead of converting Unicode itself. This, in turn, enables users to replace sys.stdout with something that supports a different encoding.
Of course, you still may get Unicode errors, since some streams may not support all Unicode characters (e.g. since the terminal does not support them).
Right. Now who will write the patch? :-)

-- Marc-Andre Lemburg
And then, what's the point of handling Unicode if we only accept Unicode-encoded ASCII strings?
By accepting Unicode, I would specifically require that they, at least:

- do not crash the interpreter when being passed Unicode objects
- attempt to perform some conversion if they do not support Unicode directly; if they don't know any specific conversion, the default conversion should be used (i.e. that they don't give a TypeError).
With these assumptions, it is possible to allow print to pass Unicode objects to the file's write method, instead of converting Unicode itself. This, in turn, enables users to replace sys.stdout with something that supports a different encoding.
Of course, you still may get Unicode errors, since some streams may not support all Unicode characters (e.g. since the terminal does not support them).
OK. That's very reasonable.

What do we need to change to make this happen?

--Guido van Rossum
OK. That's very reasonable.
What do we need to change to make this happen?
Applying patch #462849 may be sufficient for the moment. If there are any file-like objects that are used in print but fail to convert Unicode objects, this will hopefully be found before the final release.

Regards,
Martin
Guido van Rossum wrote:
The only notable exception is the cStringIO module -- but this could probably be changed to be buffer interface compliant too.
Sure, just submit a patch.
Done. See #462596.

-- Marc-Andre Lemburg
I wish codecs.lookup returned a record with named fields, instead of a list, so I could write
sys.stdout = codecs.lookup("iso-8859-1").writer(sys.stdout)
(the other field names would be encode,decode, and reader).
Why don't you write a small helper function for the codecs.py module?!
Because I'd like to avoid an inflation of functions. Instead, I'd prefer codecs.lookup to return an object that has the needed fields, but behaves like a tuple for backwards compatibility.
Note that the tuple interface was chosen for sake of speed and better handling at C level (tuples can be cached and are easily parseable in C).
It may be that an inherited tuple class might achieve the same effect. Can you identify the places where codecs.lookup is assumed to return tuples?

Regards,
Martin
Martin von Loewis wrote:
I wish codecs.lookup returned a record with named fields, instead of a list, so I could write
sys.stdout = codecs.lookup("iso-8859-1").writer(sys.stdout)
(the other field names would be encode,decode, and reader).
Why don't you write a small helper function for the codecs.py module?!
Because I'd like to avoid an inflation of functions. Instead, I'd prefer codecs.lookup to return an object that has the needed fields, but behaves like a tuple for backwards compatibility.
That won't be possible without breaking user code since it is well documented that codecs.lookup() returns a tuple. BTW, I don't think that adding a class CodecInfo which takes the encoding name as constructor argument would introduce much inflation of functions here. You will have to provide such a class anyway to achieve what you are looking for, so I guess this is the way to go.
Note that the tuple interface was chosen for sake of speed and better handling at C level (tuples can be cached and are easily parseable in C).
It may be that an inherited tuple class might achieve the same effect. Can you identify the places where codecs.lookup is assumed to return tuples?
I'd rather not make the interface more complicated. The C side certainly cannot be changed for the reasons given above, and Python users could choose your new CodecInfo class to get access to a nicer interface.

-- Marc-Andre Lemburg
That won't be possible without breaking user code since it is well documented that codecs.lookup() returns a tuple.
Suppose codecs.lookup would return an instance of

    _fields = {'encode': 0, 'decode': 1, 'reader': 2, 'writer': 3}

    class CodecInfo(tuple):
        __dynamic__ = 0
        def __getattr__(self, name):
            try:
                return self[_fields[name]]
            except KeyError:
                raise AttributeError, name

What user code exactly would break? Would that be a serious problem?
BTW, I don't think that adding a class CodecInfo which takes the encoding name as constructor argument would introduce much inflation of functions here. You will have to provide such a class anyway to achieve what you are looking for, so I guess this is the way to go.
I guess not. If the codecs.lookup return value is changed, then I can write

    encoder = codecs.lookup("latin-1").encode

Without that, I have to write

    encoder = codecs.CodecInfo(codecs.lookup("latin-1")).encode

This is overly complicated.
It may be that an inherited tuple class might achieve the same effect. Can you identify the places where codecs.lookup is assumed to return tuples?
I'd rather not make the interface more complicated. The C side certainly cannot be changed for the reasons given above and Python uses could choose your new CodecInfo class to get access to a nicer interface.
What code exactly would have to change if I wanted lookup to return a CodecInfo object? Regards, Martin
martin wrote:
I guess not. If the codecs.lookup return value is changed, then I can write
encoder = codecs.lookup("latin-1").encode
Without that, I have to write
encoder = codecs.CodecInfo(codecs.lookup("latin-1")).encode
or you could make things more readable, and add explicit get-functions for each property:

    encoder = codecs.getencoder("latin-1")

much easier to understand, especially for casual users (cf. os.path.getsize etc)

</F>
Fredrik Lundh wrote:
martin wrote:
I guess not. If the codecs.lookup return value is changed, then I can write
encoder = codecs.lookup("latin-1").encode
Without that, I have to write
encoder = codecs.CodecInfo(codecs.lookup("latin-1")).encode
or you could make things more readable, and add explicit get-functions for each property:
encoder = codecs.getencoder("latin-1")
much easier to understand, especially for casual users (cf. os.path.getsize etc)
+1 Funny, the C API provides APIs for this already ;-) -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Consulting & Company: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/
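[Editor's note: Fredrik's get-functions are the interface that did land in the codecs module (getencoder, getdecoder, getreader, getwriter). A quick demonstration with today's codecs module:]

```python
import codecs

# Fetch just the encoder function for an encoding, no tuple indexing.
encoder = codecs.getencoder("latin-1")

# An encoder function returns an (encoded_bytes, input_length) pair.
data, consumed = encoder("häagen")
assert data == b"h\xe4agen"
assert consumed == 6
```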
    _fields = {'encode': 0, 'decode': 1, 'reader': 2, 'writer': 3}

    class CodecInfo(tuple):
        __dynamic__ = 0
        def __getattr__(self, name):
            try:
                return self[_fields[name]]
            except KeyError:
                raise AttributeError, name
You want to change that raise statement into

    return tuple.__getattr__(self, name)

Remember, new-style __getattr__ *replaces* the built-in getattr operation; it doesn't only get invoked when the built-in getattr doesn't find the attribute, like it does for classic classes.

(I'm still considering whether this is too backwards incompatible; we could have two getattr hooks, one old-style and one new-style.)

--Guido van Rossum
    _fields = {'encode': 0, 'decode': 1, 'reader': 2, 'writer': 3}

    class CodecInfo(tuple):
        __dynamic__ = 0
        def __getattr__(self, name):
            try:
                return self[_fields[name]]
            except KeyError:
                raise AttributeError, name
You want to change that raise statement into
return tuple.__getattr__(self, name)
Remember, new-style __getattr__ *replaces* the built-in getattr operation
Originally, I had _fields as a class attribute, and was using self._fields inside __getattr__, which caused a stack overflow. I could not figure out why it wouldn't just find _fields in the class before invoking __getattr__...
(I'm still considering whether this is too backwards incompatible; we could have two getattr hooks, one old-style and one new-style.)
So far, it is not really incompatible, since it only applies to new-style classes. It is confusing to long-time users, so it deserves documentation.

Regards,
Martin
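[Editor's note: in released Python the classic behaviour won out — __getattr__ on a new-style class is only called after normal lookup fails, and the replace-everything hook became __getattribute__. Under that rule, the tuple-subclass record from the thread works as intended. A sketch with placeholder field contents:]

```python
class CodecRecord(tuple):
    """Tuple subclass exposing the four codecs.lookup() slots by name."""
    _fields = {"encode": 0, "decode": 1, "reader": 2, "writer": 3}

    def __getattr__(self, name):
        # Only invoked when normal attribute lookup fails, so the
        # _fields class attribute above is found without recursion.
        try:
            return self[CodecRecord._fields[name]]
        except KeyError:
            raise AttributeError(name)

rec = CodecRecord(("enc", "dec", "rdr", "wtr"))
assert rec.writer == "wtr"   # attribute access...
assert rec[3] == "wtr"       # ...and tuple indexing both work
```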
(Changing subject)
    _fields = {'encode': 0, 'decode': 1, 'reader': 2, 'writer': 3}

    class CodecInfo(tuple):
        __dynamic__ = 0
        def __getattr__(self, name):
            try:
                return self[_fields[name]]
            except KeyError:
                raise AttributeError, name
You want to change that raise statement into
return tuple.__getattr__(self, name)
Remember, new-style __getattr__ *replaces* the built-in getattr operation
Originally, I had _fields as a class attribute, and was using self._fields inside __getattr__, which caused a stack overflow. I could not figure out why it wouldn't just find _fields in the class before invoking __getattr__...
(I'm still considering whether this is too backwards incompatible; we could have two getattr hooks, one old-style and one new-style.)
So far, it is not really incompatible, since it only applies to new-style classes. It is confusing to long-time users, so it deserves documentation.
It seems to be the number one point of confusion. It's also one of the few places where you have to change your code when porting a class to the new style -- which is otherwise very simple: either place a ``__metaclass__ = type'' line in your module or make all your base classes derive from object.

--Guido van Rossum
Martin von Loewis wrote:
That won't be possible without breaking user code since it is well documented that codecs.lookup() returns a tuple.
Suppose codecs.lookup would return an instance of
    _fields = {'encode': 0, 'decode': 1, 'reader': 2, 'writer': 3}

    class CodecInfo(tuple):
        __dynamic__ = 0
        def __getattr__(self, name):
            try:
                return self[_fields[name]]
            except KeyError:
                raise AttributeError, name
What user code exactly would break? Would that be a serious problem?
All code which assumes a tuple as return value. It's hard to say how much code makes such an assumption. Most Python code probably only uses the sequence interface, but the C interface was deliberately designed to return tuples so that C programmers can easily access the data.
BTW, I don't think that adding a class CodecInfo which takes the encoding name as constructor argument would introduce much inflation of functions here. You will have to provide such a class anyway to achieve what you are looking for, so I guess this is the way to go.
I guess not. If the codecs.lookup return value is changed, then I can write
encoder = codecs.lookup("latin-1").encode
Without that, I have to write
encoder = codecs.CodecInfo(codecs.lookup("latin-1")).encode
This is overly complicated.
No...

    codecs.CodecInfo("latin-1").encode

The __init__ constructor can do the call to lookup() and apply the needed initialization of the attributes.
It may be that an inherited tuple class might achieve the same effect. Can you identify the places where codecs.lookup is assumed to return tuples?
I'd rather not make the interface more complicated. The C side certainly cannot be changed for the reasons given above and Python uses could choose your new CodecInfo class to get access to a nicer interface.
What code exactly would have to change if I wanted lookup to return a CodecInfo object?
I don't see a need to argue over this. It's no use putting a lot of work into inventing some overly complex (subclassing types, etc.) strategy to maintain backwards compatibility when an easy solution is so close at hand.

-- Marc-Andre Lemburg
What user code exactly would break? Would that be a serious problem?
All code which assumes a tuple as return value. It's hard to say how much code makes such an assumption. Most Python code probably only uses the sequence interface, but the C interface was deliberately designed to return tuples so that C programmers can easily access the data.
I believe that code would continue to work if you got an instance of a tuple subtype.
I don't see a need to argue over this. It's no use putting a lot of work into inventing some overly complex (subclassing types, etc.) strategy to maintain backwards compatibility when an easy solution is so close at hand.
Subclassing tuples is not at all overly complex.

Regards,
Martin
Martin von Loewis wrote:
What user code exactly would break? Would that be a serious problem?
All code which assumes a tuple as return value. It's hard to say how much code makes such an assumption. Most Python code probably only uses the sequence interface, but the C interface was deliberately designed to return tuples so that C programmers can easily access the data.
I believe that code would continue to work if you got an instance of a tuple subtype.
It might work at Python level, maybe even at C level, but I really don't see the point in trying to hack up a new type just for this purpose. Here's an implementation which pretty much solves the "problem":

    ### Helpers for codec lookup

    def getencoder(encoding):
        """ Look up the codec for the given encoding and return
            its encoder function.

            Raises a LookupError in case the encoding cannot be found.
        """
        return lookup(encoding)[0]

    def getdecoder(encoding):
        """ Look up the codec for the given encoding and return
            its decoder function.

            Raises a LookupError in case the encoding cannot be found.
        """
        return lookup(encoding)[1]

    def getreader(encoding):
        """ Look up the codec for the given encoding and return
            its StreamReader class or factory function.

            Raises a LookupError in case the encoding cannot be found.
        """
        return lookup(encoding)[2]

    def getwriter(encoding):
        """ Look up the codec for the given encoding and return
            its StreamWriter class or factory function.

            Raises a LookupError in case the encoding cannot be found.
        """
        return lookup(encoding)[3]

If no one objects, I'll check these into CVS along with some docs for libcodecs.tex.

-- Marc-Andre Lemburg
Martin von Loewis wrote:
If no one objects, I'll check these into CVS along with some docs for libcodecs.tex.
Sounds good to me.
Great. I just checked them in...

-- Marc-Andre Lemburg
Because I'd like to avoid an inflation of functions. Instead, I'd prefer codecs.lookup to return an object that has the needed fields, but behaves like a tuple for backwards compatibility.
Here's how you can do that in 2.2a3 (I wrote this incomplete example for another situation :-):

    class Stat(tuple):
        def __new__(cls, t):
            assert len(t) == 9
            self = tuple.__new__(cls, t[:7])
            self.st_seven = t[7]
            self.st_eight = t[8]
            return self
        st_zero = property(lambda x: x[0])
        st_one = property(lambda x: x[1])
        # etc.

    t = (0, 1, 2, 3, 4, 5, 6, 7, 8)
    s = Stat(t)
    a, b, c, d, e, f, g = s
    assert (a, b, c, d, e, f, g) == t[:7]
    assert t == s + (s.st_seven, s.st_eight)
Note that the tuple interface was chosen for sake of speed and better handling at C level (tuples can be cached and are easily parseable in C).
Alas, a tuple subclass loses some of the speed and size advantage -- the additional instance variables require allocation of a dictionary. (And no, you cannot use __slots__ here -- the slot access mechanism doesn't jibe well with the variable-length tuple structure. If we were to subclass a list, we could add __slots__ = ["st_seven", "st_eight"] to the class. But that's not fully tuple-compatible.)

--Guido van Rossum
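[Editor's note: Guido's point about __slots__ and variable-length tuples is easy to verify; CPython still rejects nonempty slots on a tuple subclass, while a list subclass accepts them. A quick check (the exact error message wording may vary across versions):]

```python
# Nonempty __slots__ on a tuple subclass fails at class creation time,
# because slot storage cannot be placed after the variable-length
# item array.
try:
    class SlottedTuple(tuple):
        __slots__ = ("extra",)
    slots_rejected = False
except TypeError:
    slots_rejected = True

assert slots_rejected

# A list subclass, by contrast, accepts slots (lists store their items
# out-of-line), matching Guido's suggestion:
class SlottedList(list):
    __slots__ = ("st_seven", "st_eight")

sl = SlottedList([1, 2, 3])
sl.st_seven = 7
assert sl.st_seven == 7 and sl[0] == 1
```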
participants (5)
- Fredrik Lundh
- Fredrik Lundh
- Guido van Rossum
- M.-A. Lemburg
- Martin von Loewis