[Python-Dev] Allowing u.encode() to return non-strings

Mon Jun 21 18:21:09 EDT 2004

Guido van Rossum wrote:

>>M.-A. Lemburg wrote:
>>
>>>Now that more and more codecs become available and the scope
>>>of those codecs goes far beyond only encoding from Unicode to
>>>strings and back, I am tempted to open up that restriction,
>>>thereby opening up u.encode() for applications that wish to
>>>use other codecs that return e.g. Unicode objects as well.
>>>[...]
>>>Note that codecs are not restricted in what they can return
>>>for their .encode() or .decode() method, so any object
>>>type is acceptable, including subclasses of str or
>>>unicode, buffers, mmapped files, etc.
>>
>>+1. I find it surprising that the restriction exists. I would have
>>thought u.encode('foo') would pretty transparently wrap the foo
>>codec's .encode().
>>
>>This is also a good reminder that type checking of the result of
>>codec or unicode .encode() calls is prudent, anytime.
> 
> 
> May I make one tiny objection?  I don't know if it's enough to stop
> this (I value it at -0.5 at most), but this will make reasoning about
> types harder.  Given that approaches like StarKiller and IronPython
> are likely the best way to get near-C speed for Python, I'd like the
> standard library at least to make life easy for their approach.
> 
> The issue is that currently the type inferencer can know that the
> return type of u.encode(s) is 'unicode', assuming u's type is
> 'unicode'.  But with the proposed change, the return type will depend
> on the *value* of s, and I don't know how easy it is for the type
> inferencers to handle that case -- likely, a type inferencer will have
> to give up and say it returns 'object'.
> 
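To make the value dependence concrete, here is a minimal sketch. It is
an assumption-laden stand-in: it uses codecs.encode() from a modern
Python 3 interpreter rather than the u.encode() method under discussion,
because codecs.encode() already passes through whatever the codec
returns. The variable names are illustrative only.

```python
import codecs

# Same callable, same argument *types* (str, str); only the codec name differs.
as_bytes = codecs.encode("abc", "utf-8")   # text -> bytes codec
as_text = codecs.encode("abc", "rot_13")   # text -> text codec

# The result type follows the *value* of the second argument.
print(type(as_bytes).__name__, type(as_text).__name__)
```

An inferencer that only sees "str, str" at the call site cannot tell
these two calls apart.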

If you use something like the Cartesian product algorithm (which
StarKiller uses), a separate return type is inferred for a method for
each distinct call signature.  But that pretty much only works for
Python code, since you need full access to the source to redo the
analysis.  With the Unicode stuff being done in C, you would have to
take the lowest-common-denominator result, which would be 'object',
since you can't reanalyze the execution path for different call
signatures unless someone wants to take on the pain of type
inferencing C code.  Alternatively, this type of case could be taken
into consideration when developing a type inferencing framework that
deals with C code, but that just seems painful and overly complicated.
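
A toy sketch of the idea, not StarKiller's actual implementation: types
are plain strings, analyse_add is a hypothetical stand-in for
re-analysing a Python function body, and the PYTHON_SOURCE / OPAQUE_C
tables are made up for illustration.

```python
from itertools import product

# Hypothetical "transfer function": given concrete argument types, return
# the result type. Stands in for re-analysing a Python function body.
def analyse_add(a, b):
    if a == b and a in ("int", "str"):
        return a
    return "TypeError"

PYTHON_SOURCE = {"add": analyse_add}   # functions whose source we can analyse
OPAQUE_C = {"unicode.encode"}          # functions implemented in C (assumed opaque)

def infer_return(func, arg_type_sets, cache):
    """Return the set of possible result types for one call site."""
    if func in OPAQUE_C:
        # No body to re-analyse: fall back to the most general type.
        return {"object"}
    results = set()
    # Cartesian product: one monomorphic analysis per tuple of argument types.
    for signature in product(*arg_type_sets):
        key = (func, signature)
        if key not in cache:
            cache[key] = PYTHON_SOURCE[func](*signature)
        results.add(cache[key])
    return results

cache = {}
add_types = infer_return("add", [{"int"}, {"int"}], cache)
mixed_types = infer_return("add", [{"int", "str"}, {"int", "str"}], cache)
encode_types = infer_return("unicode.encode", [{"unicode"}, {"str"}], cache)
print(add_types, mixed_types, encode_types)
```

The Python-level function gets a precise answer per signature, while
the C-level one collapses to 'object' no matter what is passed in.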

-Brett