[docs] [issue26369] unicode.decode and str.encode are unnecessarily confusing for non-ascii

Ben Spiller report at bugs.python.org
Thu May 12 06:14:07 EDT 2016


Ben Spiller added the comment:

Thanks, that's really helpful.

Having thought about it some more, I think it would be much better, if possible, to actually 'fix' the behaviour of the standard unicode<->str codecs (i.e. not base64) rather than just documenting around it. The current behaviour is not only confusing but leads to bugs that are very easy to miss, since the methods work correctly as long as they are given only 7-bit ASCII characters and fail only when non-ASCII data shows up.
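
For example, in a Python 2 session (a minimal illustration of the failure mode; the byte values here are just the UTF-8 encoding of u'café'):

    >>> 'hello'.encode('utf-8')        # pure-ASCII str: appears to work
    'hello'
    >>> 'caf\xc3\xa9'.encode('utf-8')  # non-ASCII str: blows up
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)

Note that a call to encode() raises a *decode* error, which is exactly the confusing part.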

I had a poke around in the Python source but couldn't quite identify where it happens - presumably somewhere in the str.encode('utf-8') implementation the string is first "decoded", using the ascii codec. If that implicit decode instead used the same encoding that was passed in (e.g. utf-8), it would end up being a no-op, and there would be no unpleasant bugs that only appear when the input contains non-ASCII characters.
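
To illustrate the observable semantics (this is a sketch of what appears to happen, not a pointer to the actual C code):

    >>> import sys
    >>> sys.getdefaultencoding()   # the codec used for the implicit decode
    'ascii'
    >>> s = 'caf\xc3\xa9'
    >>> # what s.encode('utf-8') effectively does today:
    >>> s.decode(sys.getdefaultencoding()).encode('utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
    >>> # with the suggested behaviour the round-trip is a no-op:
    >>> s.decode('utf-8').encode('utf-8')
    'caf\xc3\xa9'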

It would also allow X.encode('utf-8') to be called successfully whether X is already a str or a unicode object, which would save callers having to explicitly check what kind of string they have been passed.
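
Today that check looks something like the following (to_utf8 is a hypothetical helper of my own, not a stdlib function):

    >>> def to_utf8(x):
    ...     # hypothetical helper: return UTF-8 bytes whether x is a
    ...     # unicode object or assumed to already be a UTF-8 byte str -
    ...     # the explicit isinstance check the change would make unnecessary
    ...     if isinstance(x, unicode):
    ...         return x.encode('utf-8')
    ...     return x
    ...
    >>> to_utf8(u'caf\xe9')      # unicode in, bytes out
    'caf\xc3\xa9'
    >>> to_utf8('caf\xc3\xa9')   # bytes in, bytes out unchanged
    'caf\xc3\xa9'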

Is anyone able to look into the code to see where this would need to be fixed and how difficult it would be? I have a feeling that once the relevant line is located it might be quite a straightforward fix.

Many thanks

----------
components: +Interpreter Core -Documentation
title: doc for unicode.decode and str.encode is unnecessarily confusing -> unicode.decode and str.encode are unnecessarily confusing for non-ascii

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue26369>
_______________________________________

