[Python-Dev] Multilingual programming article on the Red Hat Developer blog

Tue Sep 16 05:51:23 CEST 2014

On Tue, Sep 16, 2014 at 1:34 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> Jim J. Jewett writes:
>
> > In terms of best-effort, it is reasonable to treat the smuggled bytes
> > as representing a character outside of your unicode repertoire
>
> I have to disagree. If you ever end up passing them to something that
> validates or tries to reencode them without surrogateescape, BOOM!
> These things are the text equivalent of IEEE NaNs.  If all you know
> (as in the stdlib) is that you have "generic text", the only fairly
> safe things to do with them are (1) delete them, (2) substitute an
> appropriate replacement character for them, (3) pass the text
> containing them verbatim to other code, and (4) reencode them using
> the same codec they were read with.

Don't forget, these are *errors*. These are bytes that cannot be
correctly decoded. That's not something that has any meaning
whatsoever in text; so by definition, the only things you can do are
the four you list there (as long as "codec" means both the choice of
encoding and the use of the surrogateescape flag). It's like dealing
with control characters when you need to print something visually,
except that they have an official solution [1] and surrogateescape is
unofficial. They're not real text, so you have to deal with them
somehow.

The bytes might each represent one character. Several of them together
might represent a single character. Or maybe they don't mean anything
at all, and they're just part of a chunked data format... like I was
finding in the .cwk files that I was reading this weekend (it's mostly
MacRoman encoding, but the text is divided into chunks separated by
\0\0 and two more bytes - turns out the bytes are chunk lengths, so
they don't mean any sort of characters at all). You can't know.

ChrisA

[1] http://www.unicode.org/charts/PDF/U2400.pdf