[Python-ideas] Processing surrogates in

Sat May 16 11:47:02 CEST 2015

On May 16, 2015, at 00:50, Nick Coghlan <ncoghlan at gmail.com> wrote:
> 
>> On 15 May 2015 at 22:21, Paul Moore <p.f.moore at gmail.com> wrote:
>>> On 15 May 2015 at 02:02, Stephen J. Turnbull <stephen at xemacs.org> wrote:
>>> (3) Problem: Code you can't or won't fix buggily passes you Unicode
>>>    that might have surrogates in it.
>>>    Solution: text-to-text codecs (but I don't see why they can't be
>>>    written as encode-decode chains).
>>> 
>>> As I've written before, I think text-to-text codecs are an attractive
>>> nuisance.  The temptation to use them in most cases should be refused,
>>> because it's a better solution to deal with the problem at the
>>> incoming boundary or the outgoing boundary (using str<->bytes codecs).
>> 
>> One case I'd found a need for text->text handling (although not
>> related to surrogates) was taking arbitrary Unicode and applying an
>> error handler to it before writing it to a stream with "strict"
>> encoding. (So something like "arbitrary text".encode('latin1',
>> errors='backslashreplace').decode('latin1')).
>> 
>> The encode/decode pair seemed ugly, although it was the only way I
>> could find. I could easily imagine using a "rehandle" type of function
>> for this (although I wouldn't use the actual proposed functions here,
>> as the use of "surrogate" and "astral" in the names would lead me to
>> assume they were inappropriate).
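
For reference, the encode/decode round trip described above can be sketched as below. This is only an illustration: the helper name rehandle is hypothetical (borrowed from the wording of the quoted text, not an existing function), and it assumes the standard 'backslashreplace' error handler and a latin-1 target stream.

```python
def rehandle(text, encoding, errors="backslashreplace"):
    """Hypothetical helper: apply an encode-side error handler to text
    and return text again, so it can later be written with errors='strict'."""
    return text.encode(encoding, errors).decode(encoding)


# U+00E9 fits in latin-1; U+20AC and U+1F600 do not and get escaped.
sample = "caf\u00e9 \u20ac \U0001F600"
escaped = rehandle(sample, "latin-1")

# The result now survives a strict encode to the target encoding.
strict_bytes = escaped.encode("latin-1", "strict")
```

The target encoding has to be passed in explicitly, which is the point made in the reply below: the set of characters that need escaping depends on the stream you are writing to.
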
> 
> That's a different case, as you need to know the encoding of the
> target stream in order to know which code points that codec can't
> handle. Even when you do know the target encoding, Python itself has
> no idea which code points a given text encoding can and can't handle,
> so the only way to find out is to try it and see what happens.
> 
> The unique thing about the surrogate case is that *no* codec is
> supposed to encode them, not even the universal ones:

Python doesn't have a CESU-8 codec (or "JNI UTF-8" or any of the other near-equivalent abominations), right? Because IIRC, CESU-8 says that (in Python terms) '\U00010400' and '\uD801\uDC00' should both encode to b'\xED\xA0\x81\xED\xB0\x80', instead of the former encoding to b'\xF0\x90\x90\x80' and the latter not being encodable at all, because a string containing surrogate code points isn't valid Unicode text.

Anyway, I don't know if that counts as a Unicode encoding, since it's only described in a TR, not the standard itself. And Python is probably right to ignore it (assuming I'm remembering right and Python does ignore it...), even if that makes problems for Jython or Oracle DB-API libs or whatever.
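
To make the byte sequences above concrete, here is a rough sketch of what I mean. The cesu8_encode function is a hypothetical illustration written for this message (not a stdlib codec, and not a complete CESU-8 implementation); it just encodes each UTF-16 code unit separately, relying on the 'surrogatepass' error handler.

```python
import codecs

astral = "\U00010400"      # one non-BMP code point
pair = "\ud801\udc00"      # the same character spelled as two surrogates

# Strict UTF-8: the astral character becomes a 4-byte sequence...
utf8_astral = astral.encode("utf-8")

# ...and the surrogate pair is rejected outright.
try:
    pair.encode("utf-8")
    strict_rejects_surrogates = False
except UnicodeEncodeError:
    strict_rejects_surrogates = True


def cesu8_encode(text):
    """Hypothetical CESU-8-style encoder: split the text into UTF-16 code
    units, then encode each unit on its own as a UTF-8-style sequence."""
    utf16 = text.encode("utf-16-be", "surrogatepass")
    units = [int.from_bytes(utf16[i:i + 2], "big")
             for i in range(0, len(utf16), 2)]
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)


# Check whether this interpreter registers a codec under the name "cesu-8".
try:
    codecs.lookup("cesu-8")
    has_cesu8 = True
except LookupError:
    has_cesu8 = False
```

With this sketch, cesu8_encode(astral) and cesu8_encode(pair) produce the same six bytes, which is exactly the ambiguity that strict UTF-8 avoids by refusing to encode surrogates.
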