[Python-ideas] Processing surrogates in

Sat May 9 08:40:48 CEST 2015

Serhiy Storchaka writes:
 > On 05.05.15 11:23, Stephen J. Turnbull wrote:
 > > Serhiy Storchaka writes:
 > >
 > >   > Use cases include programs that use tkinter (common builds of Tcl/Tk
 > >   > don't accept non-BMP characters), email or wsgiref.
 > >
 > > So, consider Tcl/Tk.  If you use it for input, no problem, it *can't*
 > > produce non-BMP characters.  So you're using it for output.  If
 > > knowing that your design involves tkinter, you deduce you must not
 > > accept non-BMP characters on input, where's your problem?
 > 
 > With Tcl/Tk all is not so easy.

I didn't claim *all* was easy; IME Tcl is just easy to break, and not
only in its Unicode handling.  But dealing with the problem you
mentioned at the interface between Python and Tcl/Tk can be done this
way.

 > The main issue is with translating from Tcl to Python. Tcl uses at
 > least two representations for strings (UCS-2 and modified UTF-8,
 > and Latin1 in some cases),

These are not represented *in Tcl* as Python str, are they?  If not,
they need to be converted with a regular byte-oriented codec, no?
Once again, a regular codec with appropriate error handler can deal
with it early, and better.  So fix Tkinter; it's probably not much
harder than documenting the correct use of these functions in dealing
with Tkinter.
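
To make "a regular codec with appropriate error handler" concrete, here
is a minimal sketch using only the standard UTF-8 codec and three of the
built-in error handlers. The sample bytes and the helper name try_strict
are made up for illustration; nothing here is Tkinter's actual code.

```python
# Sketch: what the standard error handlers do with undecodable input.
raw = b"na\xefve"  # Latin-1 bytes; \xef followed by "v" is not valid UTF-8


def try_strict(data):
    """Return the decoded text, or None if the bytes are not valid UTF-8."""
    try:
        return data.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        return None


# Reject the input outright.
strict_result = try_strict(raw)
# Lossy: the bad byte becomes U+FFFD, but the result is clean text.
replaced = raw.decode("utf-8", errors="replace")
# Lossless: the bad byte is smuggled through as a lone surrogate.
escaped = raw.decode("utf-8", errors="surrogateescape")
```

Which handler is right depends on the application, but in each case the
decision is made once, at the point where the bytes come in.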

 > > And ... you looked twice at your proposal?  You have basically
 > > reproduced the codec error handling API for .decode and .encode in a
 > > bunch of str2str "rehandle" functions.
 > 
 > Yes, this is the main advantage of proposed functions. They reuse
 > existing error handlers and are extensible by writing new error
 > handlers.

They also violate TOOWTDI.  In fact, that's their whole purpose.<wink/>

 > > In other words, you need to know as much to use "rehandle_*"
 > > properly as you do to use .decode and .encode.  I do not see a
 > > win for the programmer who is mostly innocent of encoding
 > > knowledge.
 > 
 > Is it a problem? These functions are for experienced users. Perhaps 
 > mostly for authors of libraries and frameworks.

Yes, it's a problem.  You say they're "for" experienced users, but
that's a null concept. You intend to make them *available* to all
users.  Very few users have experience in I18N technology, and those
are generally able to chain .encode().decode() correctly, which is
conceptually what you're doing anyway (in fact, that's the
*implementation* *you* published in issue18814!)
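
For readers who have not seen the idiom, a rough sketch of such a chain
follows. The function name rehandle_surrogateescape_sketch is
hypothetical, and this is only in the spirit of the proposal, not a copy
of the code attached to the issue.

```python
def rehandle_surrogateescape_sketch(text, errors="replace"):
    """Hypothetical helper: re-handle surrogateescape'd bytes in a str."""
    # Step 1: turn the escaped lone surrogates back into the raw bytes.
    raw = text.encode("utf-8", errors="surrogateescape")
    # Step 2: decode again and let the chosen handler deal with bad bytes.
    return raw.decode("utf-8", errors=errors)


# A str that carries an undecodable byte as a lone surrogate.
dirty = b"na\xefve".decode("utf-8", errors="surrogateescape")
clean = rehandle_surrogateescape_sketch(dirty)
dropped = rehandle_surrogateescape_sketch(dirty, errors="ignore")
```

The str-to-str function is just the two codec calls glued together, so
using it well takes the same knowledge as using the codecs directly.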

OTOH, *most* experienced users have experienced I18N headaches.  "To a
man with a hammer, every problem looks like a nail" but with this
hammer, mostly it's actually a thumb.  These functions should only
ever be used on input, but in practice programmers under time pressure
(and who isn't?) tend to apply bandaids at the point where the problem
is detected -- which is output, since Python itself has no problems
with lone surrogates or astral characters.

As for authors of libraries and frameworks, *they* *really* should
be handling these problems at the external bytes -> internal
Unicode interface when the original data, and often metadata or even a
human user, is available for interrogation.  Not later, when all you
have is the resulting radioactive garbage, which you'll end up passing
on to the framework users.
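
As an illustration of handling it at the boundary, here is a sketch of a
decoder that consults metadata while it is still available. The name
decode_at_boundary, its parameters, and the fallback policy are all
assumptions made for the example; a real framework would pick its own
policy.

```python
def decode_at_boundary(raw, declared_charset=None, fallback="latin-1"):
    """Hypothetical helper: decode external bytes once, using metadata.

    Returns (text, charset_used) so the caller keeps a record of which
    decision was made.
    """
    # Prefer the charset declared in the metadata, then try UTF-8.
    candidates = [c for c in (declared_charset, "utf-8") if c]
    for charset in candidates:
        try:
            return raw.decode(charset), charset
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset: try the next candidate
    # The default fallback, latin-1, maps every byte to a code point,
    # so this last step cannot fail with the default argument.
    return raw.decode(fallback), fallback


text, used = decode_at_boundary(b"na\xefve", declared_charset="utf-8")
```

The point is only that the choice is made where the declared charset (or
a human) can still be asked, and that the choice is recorded.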

 > > If we apply these rehandle_* thumbs to the holes in the I18N dike,
 > > it's just going to spring more leaks elsewhere.
 > 
 > There are a lot of batteries included in Python. They can explode
 > if you use them incorrectly.

I think a better analogy is explosives, which can be useful if used
safely. :-)

If you have to add these functions, *please* do not put them anywhere
near the codecs.  They are not codecs, they do not transform the
representation of data.  They change the semantics of the data.  Put
them in a "validation" submodule of the unicodedata package, or create
a new unicodetools package or something like that to hold them.

And they should be documented as dangerous because the transformations
they perform cannot be inverted to get the original input once the
strings produced are passed to other code (unless you also pass the
history of transformations as metadata).  This matters in applications
where the input bytes may have been digitally signed, for example.
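
A small sketch of why the transformation cannot be inverted. A SHA-256
digest stands in for a real digital signature here, and the sample bytes
are invented; both are assumptions for the sake of the example.

```python
import hashlib

original = b"na\xefve"  # the bytes that were (hypothetically) signed
digest = hashlib.sha256(original).hexdigest()

# Lossless path: surrogateescape round-trips to exactly the same bytes.
escaped = original.decode("utf-8", errors="surrogateescape")
roundtrip = escaped.encode("utf-8", errors="surrogateescape")

# Lossy path: once the bad byte is replaced by U+FFFD, it is gone.
replaced = original.decode("utf-8", errors="replace")
reencoded = replaced.encode("utf-8")
digest_after = hashlib.sha256(reencoded).hexdigest()

# A different input collapses to the same string, so no inverse exists.
other = b"na\xeeve".decode("utf-8", errors="replace")
```

After the lossy step, any code that re-encodes the string and checks it
against the original digest will see a mismatch.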

(I've posted the last two paragraphs in somewhat more precise form to
issue18814.)