[Python-Dev] Multilingual programming article on the Red Hat Developer blog

Tue Sep 16 05:34:36 CEST 2014

Jim J. Jewett writes:

 > In terms of best-effort, it is reasonable to treat the smuggled bytes
 > as representing a character outside of your unicode repertoire

I have to disagree.  If you ever end up passing them to something that
validates or tries to reencode them without surrogateescape, BOOM!
These things are the text equivalent of IEEE NaNs.  If all you know
(as in the stdlib) is that you have "generic text", the only fairly
safe things to do with them are (1) delete them, (2) substitute an
appropriate replacement character for them, (3) pass the text
containing them verbatim to other code, and (4) reencode them using
the same codec they were read with.
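
To make the NaN comparison concrete, here is a minimal sketch of the
failure and of the four fairly safe operations.  The sample bytes and
the helper name is_escaped are made up for illustration; only
str.encode, bytes.decode, and the surrogateescape handler come from
Python itself.

```python
# Hypothetical input: a Latin-1 filename, which is not valid ASCII.
raw = b"caf\xe9.txt"
text = raw.decode("ascii", errors="surrogateescape")  # 'caf\udce9.txt'

def is_escaped(ch):
    """True if ch is a lone surrogate produced by surrogateescape."""
    return 0xDC80 <= ord(ch) <= 0xDCFF

# BOOM: re-encoding without surrogateescape rejects the lone surrogate.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# (1) delete the smuggled bytes
deleted = "".join(ch for ch in text if not is_escaped(ch))

# (2) substitute a replacement character (U+FFFD here)
replaced = "".join("\ufffd" if is_escaped(ch) else ch for ch in text)

# (3) pass the text along verbatim: nothing to do

# (4) re-encode with the same codec and error handler it was read with
roundtrip = text.encode("ascii", errors="surrogateescape")
```

Only (4) recovers the original bytes, and only because the codec and
error handler match the ones used for decoding.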

 > -- so it won't ever match entirely valid strings, except perhaps
 > via a wildcard.  And it should still work for .endswith(<the same
 > invalid characters>).

Incorrect, I'm pretty sure, unless you know that both texts containing
<the same invalid code points> were read with the same codec.  E.g.,
consider two filenames encoded in ISO Cyrillic and ISO Hebrew, read
with (encoding='ascii', errors='surrogateescape').  The same escaped
bytes compare equal even though they stand for different characters.
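
A small sketch of that case, assuming ISO Cyrillic and ISO Hebrew mean
ISO-8859-5 and ISO-8859-8 (Python's iso8859_5 and iso8859_8 codecs).
The two three-letter names are invented for illustration.

```python
# Two different names that happen to share the same byte values.
cyrillic_name = "\u0440\u0441\u0442"   # Cyrillic er, es, te
hebrew_name = "\u05d0\u05d1\u05d2"     # Hebrew alef, bet, gimel

cyrillic_bytes = cyrillic_name.encode("iso8859_5")
hebrew_bytes = hebrew_name.encode("iso8859_8")

# Read both as "generic text", the way the stdlib might.
a = cyrillic_bytes.decode("ascii", errors="surrogateescape")
b = hebrew_bytes.decode("ascii", errors="surrogateescape")

# The smuggled code points are identical, so the strings "match" ...
same_escaped = (a == b)
suffix_match = a.endswith(b)

# ... although the texts they came from have nothing in common.
same_text = (cyrillic_name == hebrew_name)
```

Here same_escaped and suffix_match are both true while same_text is
false, so a match on the escaped code points says nothing about the
characters unless the source encoding is known.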

Apps that know the semantics of the text may DWIM/DTRT if they want
to, but FWIW-IMHO-YMMV-and-any-other-4-letter-caveat-acronyms-that-
may-apply Python and the stdlib shouldn't try to guess.

Guessing may be unavoidable, of course.