[issue18814] Add tools for "cleaning" surrogate escaped strings

Ezio Melotti report at bugs.python.org
Sun Aug 24 09:58:21 CEST 2014


Ezio Melotti added the comment:

I think similar functions should be added in the unicodedata module rather than the string module or as str methods.  If I'm not mistaken this was already proposed in another issue.
In C we already added macros like IS_{HIGH|LOW|}_SURROGATE and possibly others to help deal with surrogates, but AFAIK there's no Python equivalent yet.
As for the specific constants/functions/methods you propose, IMHO the name escaped_surrogates is not too clear.  If it's a string of lone surrogates I would just call it unicodedata.surrogates (and .high_surrogates/.low_surrogates).  These can also be used to build one-liners that check whether a string contains surrogates and/or remove them.
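To illustrate the kind of one-liners meant here, a rough sketch follows. The three constants are hypothetical: unicodedata does not provide them today, so they are built by hand as module-level names, and the two helper function names are made up for the example as well.

```python
# Hypothetical constants: unicodedata has no such attributes at present,
# so they are built here from the surrogate code point ranges.
high_surrogates = ''.join(map(chr, range(0xD800, 0xDC00)))  # U+D800..U+DBFF
low_surrogates = ''.join(map(chr, range(0xDC00, 0xE000)))   # U+DC00..U+DFFF
surrogates = high_surrogates + low_surrogates


def contains_surrogates(s):
    """Return True if s contains at least one surrogate code point."""
    return any(c in surrogates for c in s)


def remove_surrogates(s):
    """Return s with every surrogate code point deleted."""
    # Mapping a code point to None makes str.translate drop it.
    return s.translate(dict.fromkeys(map(ord, surrogates)))


escaped = b'abc\x80'.decode('utf-8', 'surrogateescape')
print(contains_surrogates(escaped))
print(remove_surrogates(escaped))
```

Bytes escaped by surrogateescape always land in U+DC80..U+DCFF, so a check restricted to low_surrogates would be enough for that specific case.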
clean has a very generic name with no hints about surrogates, and its purpose is quite specific.
I'm also not a big fan of redecode.  The equivalent calls to encode/decode are not much longer and are more explicit.  Also, having to redecode often indicates that there's a bug earlier in the code that should be fixed instead (if possible).
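For comparison, the explicit encode/decode round trip could look roughly like this. It is only a sketch: the sample bytes and the choice of Latin-1 as the "real" encoding are assumptions made for the example, and the variable names are made up.

```python
# Assumed scenario: the data is really Latin-1, but it was decoded as
# UTF-8 with the surrogateescape error handler somewhere upstream.
raw = b'caf\xe9'

# The undecodable byte 0xE9 is smuggled through as the lone surrogate U+DCE9.
mangled = raw.decode('utf-8', 'surrogateescape')

# Explicit "redecode": recover the original bytes with the same codec and
# error handler, then decode them with the codec that was actually meant.
fixed = mangled.encode('utf-8', 'surrogateescape').decode('latin-1')

print(ascii(mangled))
print(ascii(fixed))
```

Spelling out both codecs and the error handler makes it visible where the wrong decoding happened, which is usually the place to fix.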

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue18814>
_______________________________________

