Allowing u.encode() to return non-strings

As you may know, the method u"abc".encode(encoding) currently guarantees that the return value will always be an 8-bit string value. Now that more and more codecs become available and the scope of those codecs goes far beyond only encoding from Unicode to strings and back, I am tempted to open up that restriction, thereby opening up u.encode() for applications that wish to use other codecs that return e.g. Unicode objects as well.
There are several applications for this, such as character escaping, remapping characters (much like you would use string.translate() on 8-bit strings), compression, etc. Note that codecs are not restricted in what they can return for their .encode() or .decode() method, so any object type is acceptable, including subclasses of str or unicode, buffers, mmapped files, etc.
The needed code change is a one-liner. What do you think?
-- Marc-Andre Lemburg
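
To make the proposal concrete, here is a minimal sketch (in Python 2-era code) of a codec whose encoder maps unicode to unicode. The codec name 'xml-escape' and its behaviour are hypothetical, invented purely for illustration:

    import codecs

    def xml_escape_encode(input, errors='strict'):
        # Remap characters, much like string.translate() on 8-bit strings;
        # the result is a unicode object, not an 8-bit string.
        output = input.replace(u'&', u'&amp;').replace(u'<', u'&lt;')
        return (output, len(input))

    def xml_escape_decode(input, errors='strict'):
        output = input.replace(u'&lt;', u'<').replace(u'&amp;', u'&')
        return (output, len(input))

    def _search(name):
        # Stream reader/writer entries omitted for brevity.
        if name == 'xml-escape':
            return (xml_escape_encode, xml_escape_decode, None, None)
        return None

    codecs.register(_search)

    # The codec machinery itself already allows this:
    #     codecs.lookup('xml-escape')[0](u'a < b')[0]   ->  u'a &lt; b'
    # Lifting the restriction would additionally make this legal:
    #     u'a < b'.encode('xml-escape')                 ->  u'a &lt; b'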

On Wed, 2004-06-16 at 05:55, M.-A. Lemburg wrote:
Now that more and more codecs become available and the scope of those codecs goes far beyond only encoding from Unicode to strings and back, I am tempted to open up that restriction, thereby opening up u.encode() for applications that wish to use other codecs that return e.g. Unicode objects as well.
+1 -Barry

M.-A. Lemburg wrote:
Now that more and more codecs become available and the scope of those codecs goes far beyond only encoding from Unicode to strings and back, I am tempted to open up that restriction, thereby opening up u.encode() for applications that wish to use other codecs that return e.g. Unicode objects as well. [...] Note that codecs are not restricted in what they can return for their .encode() or .decode() method, so any object type is acceptable, including subclasses of str or unicode, buffers, mmapped files, etc.
+1. I find it surprising that the restriction exists. I would have thought u.encode('foo') would pretty transparently wrap the foo codec's .encode(). This is also a good reminder that type checking of the result of codec or unicode .encode() calls is prudent, anytime.
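
A sketch of the defensive check suggested above; 'some-codec' is a placeholder codec name, not a real codec:

    import codecs

    result, consumed = codecs.lookup('some-codec')[0](u'data')
    if not isinstance(result, (str, unicode)):
        raise TypeError('codec returned %s, expected a string type'
                        % type(result).__name__)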

M.-A. Lemburg wrote:
Now that more and more codecs become available and the scope of those codecs goes far beyond only encoding from Unicode to strings and back, I am tempted to open up that restriction, thereby opening up u.encode() for applications that wish to use other codecs that return e.g. Unicode objects as well. [...] Note that codecs are not restricted in what they can return for their .encode() or .decode() method, so any object type is acceptable, including subclasses of str or unicode, buffers, mmapped files, etc.
+1. I find it surprising that the restriction exists. I would have thought u.encode('foo') would pretty transparently wrap the foo codec's .encode().
This is also a good reminder that type checking of the result of codec or unicode .encode() calls is prudent, anytime.
May I make one tiny objection? I don't know if it's enough to stop this (I value it at -0.5 at most), but this will make reasoning about types harder. Given that approaches like StarKiller and IronPython are likely the best way to get near-C speed for Python, I'd like the standard library at least to make life easy for their approach.
The issue is that currently the type inferencer can know that the return type of u.encode(s) is 'unicode', assuming u's type is 'unicode'. But with the proposed change, the return type will depend on the *value* of s, and I don't know how easy it is for the type inferencers to handle that case -- likely, a type inferencer will have to give up and say it returns 'object'.
(I've never liked functions whose return type depends on the value of an argument -- I guess my intuition has always anticipated type inferencing. :-)
--Guido van Rossum (home page: http://www.python.org/~guido/)
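
Restated as code, the concern is that the result type varies with the argument's *value* rather than its type ('some-codec' again being a hypothetical codec that returns unicode):

    def f(u, name):
        return u.encode(name)     # the result type depends on the value of name

    f(u'abc', 'utf-8')            # an 8-bit str under the current rules
    f(u'abc', 'some-codec')       # could be unicode (or anything) if opened up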

Guido van Rossum wrote:
M.-A. Lemburg wrote:
Now that more and more codecs become available and the scope of those codecs goes far beyond only encoding from Unicode to strings and back, I am tempted to open up that restriction, thereby opening up u.encode() for applications that wish to use other codecs that return e.g. Unicode objects as well. [...] Note that codecs are not restricted in what they can return for their .encode() or .decode() method, so any object type is acceptable, including subclasses of str or unicode, buffers, mmapped files, etc.
+1. I find it surprising that the restriction exists. I would have thought u.encode('foo') would pretty transparently wrap the foo codec's .encode().
This is also a good reminder that type checking of the result of codec or unicode .encode() calls is prudent, anytime.
May I make one tiny objection? I don't know if it's enough to stop this (I value it at -0.5 at most), but this will make reasoning about types harder. Given that approaches like StarKiller and IronPython are likely the best way to get near-C speed for Python, I'd like the standard library at least to make life easy for their approach.
The issue is that currently the type inferencer can know that the return type of u.encode(s) is 'unicode', assuming u's type is 'unicode'. But with the proposed change, the return type will depend on the *value* of s, and I don't know how easy it is for the type inferencers to handle that case -- likely, a type inferencer will have to give up and say it returns 'object'.
Ok, how about a compromise: .encode() and .decode() of string and unicode objects may return string or unicode objects only (limiting the set of types to two base types). I think those would cover 90% of all cases. For the remaining cases we could add codecs.encode() and codecs.decode() which then do allow arbitrary return types.
(I've never liked functions whose return type depends on the value of an argument -- I guess my intuition has always anticipated type inferencing. :-)
-- Marc-Andre Lemburg
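
A sketch of what such helpers might look like, assuming the Python 2.4-era registry API where codecs.lookup() returns an (encoder, decoder, streamreader, streamwriter) tuple and each encoder/decoder returns an (output, length_consumed) pair:

    import codecs

    def encode(obj, encoding, errors='strict'):
        # Unlike u.encode(), places no restriction on the result type.
        encoder = codecs.lookup(encoding)[0]
        return encoder(obj, errors)[0]

    def decode(obj, encoding, errors='strict'):
        decoder = codecs.lookup(encoding)[1]
        return decoder(obj, errors)[0]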

Ok, how about a compromise: .encode() and .decode() of string and unicode objects may return string or unicode objects only (limiting the set of types to two base types).
This works for me, especially since I expect type inferencers to collapse the two types (just as they should collapse int and long). --Guido van Rossum (home page: http://www.python.org/~guido/)

On Sat, 2004-06-19 at 11:29, Guido van Rossum wrote:
This works for me, especially since I expect type inferencers to collapse the two types (just as they should collapse int and long).
And it's historical baggage anyway right? IOW, eventually <wink> we're just going to have a single string type, right? -Barry

Guido van Rossum wrote:
Ok, how about a compromise: .encode() and .decode() of string and unicode objects may return string or unicode objects only (limiting the set of types to two base types).
This works for me, especially since I expect type inferencers to collapse the two types (just as they should collapse int and long).
Ok, I'll make the necessary changes next week.
-- Marc-Andre Lemburg

Martin v. Löwis wrote:
M.-A. Lemburg wrote:
I think those would cover 90% of all cases. For the remaining cases we could add codecs.encode() and codecs.decode() which then do allow arbitrary return types.
Can you give examples for the remaining cases?
A codec might want to return a buffer object, an mmapped file, a home-grown object, an array, a PIL Image object, a WAV audio file object, etc.
-- Marc-Andre Lemburg

M.-A. Lemburg wrote:
Can you give examples for the remaining cases?
A codec might want to return a buffer object, an mmapped file, a home-grown object, an array, a PIL Image object, a WAV audio file object, etc.
Which specific encoding would return an mmapped file? Regards, Martin

On Thu, 17 Jun 2004 08:43:15 -0700, Guido van Rossum <guido@python.org> wrote:
The issue is that currently the type inferencer can know that the return type of u.encode(s) is 'unicode', assuming u's type is 'unicode'. But with the proposed change, the return type will depend on the *value* of s, and I don't know how easy it is for the type inferencers to handle that case -- likely, a type inferencer will have to give up and say it returns 'object'.
Who cares about the type inference <0.2 wink>. It's harder for the reader of the program to understand if encode() returns a different type. Would there be some common property that all encode() return values would share? Can't think of one myself. Jeremy

Jeremy Hylton wrote:
On Thu, 17 Jun 2004 08:43:15 -0700, Guido van Rossum <guido@python.org> wrote:
The issue is that currently the type inferencer can know that the return type of u.encode(s) is 'unicode', assuming u's type is 'unicode'. But with the proposed change, the return type will depend on the *value* of s, and I don't know how easy it is for the type inferencers to handle that case -- likely, a type inferencer will have to give up and say it returns 'object'.
Who cares about the type inference <0.2 wink>. It's harder for the reader of the program to understand if encode() returns a different type. Would there be some common property that all encode() return values would share? Can't think of one myself.
In my reply to Guido's post I mentioned that it would be reasonable to limit the number of types to 2 (basically types.StringTypes and subclasses). We could then add two new helpers codecs.encode() and codecs.decode() to do more general codec work without this type restriction.
-- Marc-Andre Lemburg

At 09:59 PM 6/17/04 -0400, Jeremy Hylton wrote:
On Thu, 17 Jun 2004 08:43:15 -0700, Guido van Rossum <guido@python.org> wrote:
The issue is that currently the type inferencer can know that the return type of u.encode(s) is 'unicode', assuming u's type is 'unicode'. But with the proposed change, the return type will depend on the *value* of s, and I don't know how easy it is for the type inferencers to handle that case -- likely, a type inferencer will have to give up and say it returns 'object'.
Who cares about the type inference <0.2 wink>. It's harder for the reader of the program to understand if encode() returns a different type. Would there be some common property that all encode() return values would share? Can't think of one myself.
Indeed. What does this proposal offer that writing 'somefunc(u)' in place of 'u.encode("somecodec")' doesn't? Unicode streams aren't going to work with this, right? And anything else that already uses '.encode()' is going to expect a string.
In the former case, you know you have to look at 'somefunc' to know what's returned, but in the latter, you are encouraged to think that it's a string, and tempted to worry about the details of the actual encoding later, even if you don't recognize the codec name.
Anyway, it seems to me that things returned from u.encode() should either be strings or "stringlike". Maybe implementing the read character buffer interface should suffice? But I don't think this should be opened up to any old objects without some kind of defined invariant that they should satisfy.

Jeremy Hylton wrote:
Who cares about the type inference <0.2 wink>. It's harder for the reader of the program to understand if encode() returns a different type. Would there be some common property that all encode() return values would share? Can't think of one myself.
No. In addition, the stream codec classes become meaningless (StreamReader, StreamWriter), as they are supposed to return a concatenation of encoding results -- however, there is no guarantee that the encoding results can even be concatenated. Regards, Martin
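
Martin's invariant can be sketched like so: a stream writer is essentially a loop that appends per-call encoding results to a byte stream, which only makes sense if every result is string-like. This is a drastic simplification for illustration, not the actual StreamWriter code:

    class Writer:
        def __init__(self, stream, encode):
            self.stream = stream
            self.encode = encode

        def write(self, obj):
            data, consumed = self.encode(obj)
            # Implicitly assumes data is string-like; "writing" a PIL
            # Image or an mmapped file to a byte stream is meaningless.
            self.stream.write(data)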

On Thursday 2004-06-17 16:43, Guido van Rossum wrote: [MAL proposed that restrictions on the "encode" method should be lifted...]
May I make one tiny objection? I don't know if it's enough to stop this (I value it at -0.5 at most), but this will make reasoning about types harder. Given that approaches like StarKiller and IronPython are likely the best way to get near-C speed for Python, I'd like the standard library at least to make life easy for their approach.
The issue is that currently the type inferencer can know that the return type of u.encode(s) is 'unicode', assuming u's type is 'unicode'.
Um, you don't mean that. u"foo".encode() == "foo", of type str.
But with the proposed change, the return type will depend on the *value* of s, and I don't know how easy it is for the type inferencers to handle that case -- likely, a type inferencer will have to give up and say it returns 'object'.
When looking for near-C speed, type inferencing is most important for a relatively small set of particularly efficiently manipulable types: most notably, smallish integers. Being able to prove that something is a Unicode object just isn't all that useful for efficiency, because most of the things you can do to Unicode objects aren't all that cheap relative to the cost of finding out what they are. Likewise, though perhaps a bit less so, for being able to prove that something is a string.
At least, so it seems to me. Maybe I'm wrong. I suppose the extract-one-character operation might be used quite a bit, and that could be cheap. But I can't help feeling that occasions where (1) the compiler can prove that something is a string because it comes from calling an "encode" method, (2) it can't prove that any other way, (3) this makes an appreciable difference to the speed of the code, and (4) there isn't any less-rigorous (Psyco-like, say) way for the type to be discovered and efficient code used, are likely to be pretty rare, and in particular rare enough that supplying some sort of optional type declaration won't be unacceptable to users. (I bet that any version of Python that achieves near-C speed by doing extensive type inference will have optional type declarations.)
The above paragraph, of course, presupposes that we keep the restriction on the return value of u.encode(s), and start enforcing it so that the compiler can take advantage.
(I've never liked functions whose return type depends on the value of an argument -- I guess my intuition has always anticipated type inferencing. :-)
def f(x): return x+x
has that property, even if you pretend that "+" only works on numbers.
-- g

The issue is that currently the type inferencer can know that the return type of u.encode(s) is 'unicode', assuming u's type is 'unicode'.
Um, you don't mean that. u"foo".encode() == "foo", of type str.
Yes, my mistake in haste.
But with the proposed change, the return type will depend on the *value* of s, and I don't know how easy it is for the type inferencers to handle that case -- likely, a type inferencer will have to give up and say it returns 'object'.
When looking for near-C speed, type inferencing is most important for a relatively small set of particularly efficiently manipulable types: most notably, smallish integers.
If type inferencing only worked for *smallish* ints it would be a waste of time. You don't want the program to run 50x faster but compute the wrong result if some intermediate result is larger than 32 bits.
Being able to prove that something is a Unicode object just isn't all that useful for efficiency, because most of the things you can do to Unicode objects aren't all that cheap relative to the cost of finding out what they are. Likewise, though perhaps a bit less so, for being able to prove that something is a string.
Hm, strings are so fundamental as arguments to other things (used as keys etc.) that my intuition tells me that it actually would matter. And there are quite a few fast operations on strings: len(), "iftrue", even slicing: slices with a fixed size are O(1). Also, the type gets propagated to other function calls, so now you have to analyze those with nothing more than 'object' for some argument type.
At least, so it seems to me. Maybe I'm wrong. I suppose the extract-one-character operation might be used quite a bit, and that could be cheap. But I can't help feeling that occasions where (1) the compiler can prove that something is a string because it comes from calling an "encode" method, (2) it can't prove that any other way, (3) this makes an appreciable difference to the speed of the code, and (4) there isn't any less-rigorous (Psyco-like, say) way for the type to be discovered and efficient code used, are likely to be pretty rare, and in particular rare enough that supplying some sort of optional type declaration won't be unacceptable to users. (I bet that any version of Python that achieves near-C speed by doing extensive type inference will have optional type declarations.)
Don't forget all those other uses of type inferencing, e.g. for pointing out latent bugs in programs (pychecker etc.).
The above paragraph, of course, presupposes that we keep the restriction on the return value of u.encode(s), and start enforcing it so that the compiler can take advantage.
(I've never liked functions whose return type depends on the value of an argument -- I guess my intuition has always anticipated type inferencing. :-)
def f(x): return x+x
has that property, even if you pretend that "+" only works on numbers.
No, the type of f depends on the *type* of x (unless x has a type whose '+' operation has a type that depends on the value of x). --Guido van Rossum (home page: http://www.python.org/~guido/)

On Tuesday 2004-06-22 03:37, Guido van Rossum wrote:
But with the proposed change, the return type will depend on the *value* of s, and I don't know how easy it is for the type inferencers to handle that case -- likely, a type inferencer will have to give up and say it returns 'object'.
When looking for near-C speed, type inferencing is most important for a relatively small set of particularly efficiently manipulable types: most notably, smallish integers.
If type inferencing only worked for *smallish* ints it would be a waste of time. You don't want the program to run 50x faster but compute the wrong result if some intermediate result is larger than 32 bits.
Either I'm misunderstanding you, or that's a straw man. I'm not saying type inference is useful if it gives the wrong answer when non-smallish ints occur. I'm saying it's useful if it stops providing major speedups when non-smallish ints occur. Which is what happens in, say, modern Lisp systems when their type inferencing can prove that some important intermediate value is an integer but not that it's small enough to fit in a single word.
Being able to prove that something is a Unicode object just isn't all that useful for efficiency, because most of the things you can do to Unicode objects aren't all that cheap relative to the cost of finding out what they are. Likewise, though perhaps a bit less so, for being able to prove that something is a string.
Hm, strings are so fundamental as arguments to other things (used as keys etc.) that my intuition tells me that it actually would matter.
As a Python user I am required by law to have great respect for your intuition :-), and I would do anyway, so you may be right here. But surely most places where strings are used so very heavily almost always *do* get strings, so their type-checking is just a matter of, um, checking the type (i.e., no dynamic dispatch is needed in the common case), so if you then need to do something non-trivial like a dict lookup the cost of the type check is relatively rather small.
And there are quite a few fast operations on strings: len(), "iftrue", even slicing: slices with a fixed size are O(1).
Yes, though it's O(1) with a rather large constant. (Except maybe for single-character slices.) I'll agree about len and iftrue, though.
At least, so it seems to me. Maybe I'm wrong. I suppose the extract-one-character operation might be used quite a bit, and that could be cheap. But I can't help feeling that occasions where (1) the compiler can prove that something is a string because it comes from calling an "encode" method, (2) it can't prove that any other way, (3) this makes an appreciable difference to the speed of the code, and (4) there isn't any less-rigorous (Psyco-like, say) way for the type to be discovered and efficient code used, are likely to be pretty rare, and in particular rare enough that supplying some sort of optional type declaration won't be unacceptable to users. (I bet that any version of Python that achieves near-C speed by doing extensive type inference will have optional type declarations.)
Don't forget all those other uses of type inferencing, e.g. for pointing out latent bugs in programs (pychecker etc.).
Sure, and I think that's a better argument. If you'd said "We'll probably do heavy type inferencing eventually for speed, and it's really helpful for finding bugs too, so it would be a shame to do anything that interferes with it" then I'd probably just have agreed :-).
(I've never liked functions whose return type depends on the value of an argument -- I guess my intuition has always anticipated type inferencing. :-)
def f(x): return x+x
has that property, even if you pretend that "+" only works on numbers.
No, the type of f depends on the *type* of x (unless x has a type whose '+' operation has a type that depends on the value of x).
Oh, I see. I misunderstood you; sorry about that. How do you feel about the "eval" function? :-)
Slightly more seriously, and digressing a little: my "f" still has that property if you consider Python's 'int' and 'long' to be different types (which you certainly need to do if you're doing type inference for the sake of speed). It is (or will be) better for most purposes to consider them a single type with two internal representations; I wonder whether sooner or later it will be appropriate to take the same view of string and unicode objects... Probably later rather than sooner, for various reasons.
-- g

Guido van Rossum wrote:
M.-A. Lemburg wrote:
Now that more and more codecs become available and the scope of those codecs goes far beyond only encoding from Unicode to strings and back, I am tempted to open up that restriction, thereby opening up u.encode() for applications that wish to use other codecs that return e.g. Unicode objects as well. [...] Note that codecs are not restricted in what they can return for their .encode() or .decode() method, so any object type is acceptable, including subclasses of str or unicode, buffers, mmapped files, etc.
+1. I find it surprising that the restriction exists. I would have thought u.encode('foo') would pretty transparently wrap the foo codec's .encode().
This is also a good reminder that type checking of the result of codec or unicode .encode() calls is prudent, anytime.
May I make one tiny objection? I don't know if it's enough to stop this (I value it at -0.5 at most), but this will make reasoning about types harder. Given that approaches like StarKiller and IronPython are likely the best way to get near-C speed for Python, I'd like the standard library at least to make life easy for their approach.
The issue is that currently the type inferencer can know that the return type of u.encode(s) is 'unicode', assuming u's type is 'unicode'. But with the proposed change, the return type will depend on the *value* of s, and I don't know how easy it is for the type inferencers to handle that case -- likely, a type inferencer will have to give up and say it returns 'object'.
If you use something like the Cartesian product algorithm (what StarKiller uses), then a new return type is inferred for a method for each distinct call signature. But this pretty much only works with Python code, since you have full access to the source to redo the analysis. With the Unicode stuff being done in C, you would have to just take the lowest-common-denominator result, which would be 'object', since you can't reanalyze the execution path for different call signatures unless someone wants to take on the pain of type-inferring C code. Otherwise this type of case can be taken into consideration when developing a type inferencing framework that deals with C code, but that just seems painful and overly complicated. -Brett

Martin v. Löwis wrote:
M.-A. Lemburg wrote:
What do you think?
-1. I find it unfortunate that there are encodings which don't convert between Unicode and byte strings; this direction should not be followed.
Instead, text processing utilities should be proper libraries.
I don't understand... codecs are not limited to only text processing. It's a completely independent framework from the Unicode sub-system.
-- Marc-Andre Lemburg

M.-A. Lemburg wrote:
-1. I find it unfortunate that there are encodings which don't convert between Unicode and byte strings; this direction should not be followed.
Instead, text processing utilities should be proper libraries.
I don't understand... codecs are not limited to only text processing. It's a completely independent framework from the Unicode sub-system.
I know this is viewed, and perhaps even documented, as a framework independent of Unicode. I think this is a mistake, and it should have been constrained to character encodings (i.e. conversions to and from Unicode, using character tables or similar algorithms) right from the beginning. Regards, Martin

Martin v. Löwis wrote:
M.-A. Lemburg wrote:
-1. I find it unfortunate that there are encodings which don't convert between Unicode and byte strings; this direction should not be followed.
Instead, text processing utilities should be proper libraries.
I don't understand... codecs are not limited to only text processing. It's a completely independent framework from the Unicode sub-system.
I know this is viewed, and perhaps even documented, as a framework independent of Unicode. I think this is a mistake, and it should have been constrained to character encodings (i.e. conversions to and from Unicode, using character tables or similar algorithms) right from the beginning.
Ok, noted.
-- Marc-Andre Lemburg

While we're talking about this, Martin, what is the encoding of the "string" returned by
struct.pack("bbb", 0xFF, 0x00, 0x83)
And what should it be?
Bill

Bill Janssen wrote:
While we're talking about this, Martin, what is the encoding of the "string" returned by
struct.pack("bbb", 0xFF, 0x00, 0x83)
And what should it be?
It's a byte string, so it doesn't have an encoding. Its MIME type might be 'application/octet-stream'. Regards, Martin
participants (10)
- "Martin v. Löwis"
- Barry Warsaw
- Bill Janssen
- Brett C.
- Gareth McCaughan
- Guido van Rossum
- Jeremy Hylton
- M.-A. Lemburg
- Mike Brown
- Phillip J. Eby