[Python-Dev] Multilingual programming article on the Red Hat Developer blog

Tue Sep 16 17:00:32 CEST 2014

On Tue, 16 Sep 2014 13:51:23 +1000, Chris Angelico <rosuav at gmail.com> wrote:
> On Tue, Sep 16, 2014 at 1:34 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> > Jim J. Jewett writes:
> >
> > > In terms of best-effort, it is reasonable to treat the smuggled bytes
> > > as representing a character outside of your unicode repertoire
> >
> > I have to disagree. If you ever end up passing them to something that
> > validates or tries to reencode them without surrogateescape, BOOM!
> > These things are the text equivalent of IEEE NaNs.  If all you know
> > (as in the stdlib) is that you have "generic text", the only fairly
> > safe things to do with them are (1) delete them, (2) substitute an
> > appropriate replacement character for them, (3) pass the text
> > containing them verbatim to other code, and (4) reencode them using
> > the same codec they were read with.
> 
> Don't forget, these are *errors*. These are bytes that cannot be
> correctly decoded. That's not something that has any meaning
> whatsoever in text; so by definition, the only things you can do are
> the four you list there (as long as "codec" means both the choice of
> encoding and the use of the surrogateescape flag). It's like dealing
> with control characters when you need to print something visually,
> except that they have an official solution [1] and surrogateescape is
> unofficial. They're not real text, so you have to deal with them
> somehow.

That isn't the case in the email package.  The smuggled bytes are not
errors[*], they are literally smuggled bytes.  But, as Stephen said, the
only things email does with them are the last three of the four he
listed (if you read (3) as passing it between parts of the email
package): the data comes in as text mixed with binary, and the email
package parses it until it knows what the binary is supposed to be,
turns it back into bytes, and decodes it properly.  The goal is to never
let the smuggled bytes escape out the email APIs as surrogateescape
encoded text; though, in practice, this being consenting-adults Python
and code not being bug free, there are places where people have used the
knowledge of how surrogateescape is used by email to work around both
API and code bugs.

--David

[*] Some of the encoded bytes *are* errors (non-ascii in headers or
undecodable bytes in whatever the CTE/charset is), and in that case
email may just turn them back into error bytes in the output, but only
*some* of the smuggled bytes are actually errors (and none are if the
message is RFC compliant).