[Python-Dev] bytes / unicode

Tue Jun 22 22:46:45 CEST 2010

On Tue, Jun 22, 2010 at 1:07 PM, James Y Knight <foom at fuhm.net> wrote:

> The surrogateescape method is a nice workaround for this, but I can't help
> thinking that it might've been better to just treat stuff as
> possibly-invalid-but-probably-utf8 byte-strings from input, through
> processing, to output. It seems kinda too late for that, though: next time
> someone designs a language, they can try that. :)
>

surrogateescape does help a lot; my only problem with it is that it relies
on out-of-band information.  That is, if you have data that went through
data.decode('utf8', 'surrogateescape'), you can restore it to bytes or
transcode it to another encoding, but you have to know that it was decoded
specifically that way.  And of course if you did have to transcode it (e.g.,
text.encode('utf8', 'surrogateescape').decode('latin1')), then any handling
you had already done on the text may have broken it; you don't *really*
have valid text.  A lazier solution feels like it would be easier and more
transparent to work with.
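
To make the out-of-band point concrete, here is a small sketch.  The sample
bytes and variable names are made up for illustration; only the
decode/encode calls themselves come from the discussion above.

```python
# Hypothetical input: 0xE9 is Latin-1 "é" and is not valid UTF-8 in this
# position; the last three bytes are valid UTF-8 for the euro sign.
raw = b"caf\xe9 \xe2\x82\xac"

text = raw.decode("utf8", "surrogateescape")
# The undecodable byte 0xE9 is carried along as the lone surrogate U+DCE9.
assert text == "caf\udce9 \u20ac"

# The round trip works only if you remember to pass the same error handler.
assert text.encode("utf8", "surrogateescape") == raw

# Nothing on the str object records how it was decoded, so a plain encode
# fails on the lone surrogate.
try:
    text.encode("utf8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# The transcode from the message: restore the bytes, then reinterpret them.
as_latin1 = text.encode("utf8", "surrogateescape").decode("latin1")
```

The str in the middle looks like ordinary text, but it only behaves if every
consumer knows which error handler produced it.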

But... I also don't see any major language constraint against having another
kind of string that is bytes+encoding.  I think PJE brought up a problem with
a couple of coercion aspects.
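
For the sake of discussion, a bytes+encoding string might look roughly like
the sketch below.  Everything here is hypothetical (the EncodedBytes name,
its methods, and the rule for mixing encodings); it is not an existing or
proposed API, just one way the idea could be spelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncodedBytes:
    """Hypothetical 'bytes + encoding' string; the raw bytes stay untouched."""

    data: bytes
    encoding: str

    def text(self) -> str:
        # Decode lazily and strictly: invalid input raises UnicodeDecodeError
        # only when someone actually asks for text.
        return self.data.decode(self.encoding)

    def transcode(self, encoding: str) -> "EncodedBytes":
        # Transcoding needs real text, so this is where bad bytes surface.
        return EncodedBytes(self.text().encode(encoding), encoding)

    def __add__(self, other):
        # Assumed coercion rule: refuse to mix encodings (or plain str/bytes)
        # rather than guess.  Encoding names are compared literally here.
        if not isinstance(other, EncodedBytes) or other.encoding != self.encoding:
            return NotImplemented
        return EncodedBytes(self.data + other.data, self.encoding)


name = EncodedBytes(b"caf\xe9", "latin1")
as_utf8 = name.transcode("utf8")
```

The __add__ rule is exactly the kind of coercion question that would have to
be settled: refusing mixed operands is the conservative choice, not the only
one.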

-- 
Ian Bicking  |  http://blog.ianbicking.org