[Python-ideas] Processing surrogates in

Fri May 15 03:02:26 CEST 2015

Andrew Barnert writes:

 > > >>> '\U0000d834\U0000dd1e'
 > > '\ud834\udd1e'
 > > 
 > > Isn't that disgusting?  
 > 
 > No; if the former gave you surrogates, the latter pretty much has to.

That, of course.  What I was referring to as "disgusting" was using
32-bit syntax for Unicode literals to create surrogates.
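
For illustration, a small sketch of what the 8-digit escape does in that
case.  This assumes Python 3 semantics; the variable names are mine and
not from the thread.

```python
# Both spellings produce the same two-character str of lone surrogates.
wide_spelling = '\U0000d834\U0000dd1e'
narrow_spelling = '\ud834\udd1e'
same_string = (wide_spelling == narrow_spelling)
surrogate_length = len(wide_spelling)

# The character those surrogates would encode in UTF-16 is a single
# code point when written directly.
real_char = '\U0001d11e'
real_length = len(real_char)

# A strict codec refuses the surrogate version.
try:
    wide_spelling.encode('utf-8')
    strict_accepts = True
except UnicodeEncodeError:
    strict_accepts = False
```

The point is only that the 8-digit form does not combine anything: it
names one code point per escape, surrogate or not.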

 > But meanwhile: if you're intentionally writing literals for invalid
 > strings to test for invalid string handling, is that an argument
 > for this proposal?

No.  I see three cases:

(1) Problem: You created a Python string which is invalid Unicode
    using literals or chr().
    Solution: You know why you did that, we don't.  You deal with it.
    (aka, "consenting adults")

(2) Problem: You used surrogateescape or surrogatepass because you want
    the invalid Unicode to get to the other side sometimes.
    Solution: That's not a problem, that's a solution.
    Advice:  Handle with care, like radioactives.  Use strict error
    handling everywhere except the "out" door for invalid Unicode.  If
    you can't afford a UnicodeError if such a string inadvertently
    gets mixed with other stuff, use "try".
    (aka, "consenting adults")

(3) Problem: Code you can't or won't fix buggily passes you Unicode
    that might have surrogates in it.
    Solution: text-to-text codecs (but I don't see why they can't be
    written as encode-decode chains).
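
A minimal sketch of what such an encode-decode chain for case (3) might
look like.  The helper names are hypothetical, and the policy chosen
here (join valid surrogate pairs, replace lone surrogates with U+FFFD)
is just one assumption about what "fixed" should mean.

```python
def has_surrogates(s):
    """Return True if s contains any code point in U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)


def scrub_surrogates(s):
    """Hypothetical str -> str cleanup built from two ordinary codecs.

    surrogatepass lets surrogates through as raw UTF-16 code units;
    decoding then joins valid pairs, and the replace handler takes
    care of anything that is still unpaired.
    """
    data = s.encode('utf-16-le', 'surrogatepass')
    return data.decode('utf-16-le', 'replace')


paired = scrub_surrogates('\ud834\udd1e')   # a valid pair, split in two
clean = scrub_surrogates('plain text')      # nothing to fix
lone = scrub_surrogates('a\udc80b')         # an unpaired low surrogate
```

Whether silently replacing the lone surrogate is acceptable depends on
the application; swapping 'replace' for 'strict' turns it into an error
instead.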

As I've written before, I think text-to-text codecs are an attractive
nuisance.  The temptation to use them in most cases should be refused,
because it's a better solution to deal with the problem at the
incoming boundary or the outgoing boundary (using str<->bytes codecs).
Dealing with them elsewhere and reintroducing the corrupted str into
the data flow is likely to cause issues with correctness.  (If altered
data is actually OK, why didn't you use a replace error handler in the
first place?)  And most likely, unless you do a complete analysis of
all the ways str can get into or out of your module, you've just
started a game of whack-a-mole.
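
To make the boundary idea concrete, here is a rough sketch using only
str<->bytes codecs.  The function names and the sample bytes are
assumptions for illustration; UTF-8 is assumed as the wire encoding.

```python
def read_lossy(raw):
    """Incoming boundary when altered data is OK: replace bad bytes."""
    return raw.decode('utf-8', 'replace')


def read_preserving(raw):
    """Incoming boundary when the bad bytes must survive the trip."""
    return raw.decode('utf-8', 'surrogateescape')


def write_out_door(text):
    """The one exit allowed to emit the smuggled bytes again."""
    return text.encode('utf-8', 'surrogateescape')


def write_strict(text):
    """Every other exit: raises UnicodeEncodeError on surrogates."""
    return text.encode('utf-8')


raw = b'caf\xe9'                  # Latin-1 bytes, not valid UTF-8
lossy = read_lossy(raw)
preserved = read_preserving(raw)
round_trip = write_out_door(preserved)

# If a UnicodeError is unaffordable here, guard the strict exit.
try:
    write_strict(preserved)
    strict_rejected = False
except UnicodeEncodeError:
    strict_rejected = True
```

With this arrangement the str in the middle of the program is never
"repaired" in place; the decision is made once, where the bytes come in
or go out.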

I could very easily be wrong about my assessment of where the majority
of these Unicode handling defects get injected: it's possible the
great majority comes from assorted legacy modules, and whack-a-mole is
the most cost-effective way to deal with them for most programs.  I
hope not, though. :-/