Cleaning up surrogate escaped strings (was Bytes path related questions for Guido)
On 26 Aug 2014 21:34, "MRAB" <python@mrabarnett.plus.com> wrote:
On 2014-08-26 03:11, Stephen J. Turnbull wrote:
Nick Coghlan writes:
"purge_surrogate_escapes" was the other term that occurred to me.
"purge" suggests removal, not replacement. That may be useful too.
neutralize_surrogate_escapes(s, remove=False, replacement='\uFFFD')
How about:
replace_surrogate_escapes(s, replacement='\uFFFD')
If you want them removed, just pass an empty string as the replacement.
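A minimal sketch of the replace_surrogate_escapes signature suggested above, assuming the function should only touch the code points U+DC80..U+DCFF that the surrogateescape handler produces. The regular-expression approach and the _SURROGATE_ESCAPES name are illustrative assumptions, not code from this thread or the issue tracker.

```python
import re

# Code points produced by the 'surrogateescape' handler for undecodable bytes.
_SURROGATE_ESCAPES = re.compile('[\udc80-\udcff]')


def replace_surrogate_escapes(s, replacement='\ufffd'):
    """Replace each surrogate-escaped byte in s with replacement."""
    # A callable is used so that backslashes in replacement are taken literally.
    return _SURROGATE_ESCAPES.sub(lambda match: replacement, s)


text = b'caf\xe9'.decode('utf-8', 'surrogateescape')
replaced = replace_surrogate_escapes(text)
removed = replace_surrogate_escapes(text, replacement='')
questioned = replace_surrogate_escapes(text, replacement='?')
```

Passing an empty string removes the escapes, as described above, and any other replacement string works the same way.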
The current proposal on the issue tracker is to instead take advantage of the existing error handlers:

    def convert_surrogateescape(data, errors='replace'):
        return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)

That code is short, but semantically dense - it took a few iterations to come up with that version. (Added bonus: once you're alerted to the possibility, it's trivial to write your own version for existing Python 3 versions. The standard name just makes it easier to look up when you come across it in a piece of code, and provides the option of optimising it later if it ever seems worth the extra work.)

I also filed a separate RFE to make backslashreplace usable on input, since that allows the option of separating the replacement operation from the encoding operation.

Cheers,
Nick.
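A short usage sketch of the function quoted above, assuming the input string came from decoding bytes with the surrogateescape handler. The sample bytes are made up for illustration.

```python
def convert_surrogateescape(data, errors='replace'):
    # Re-encode to recover the original bytes, then decode again with a
    # different error handler applied to the undecodable ones.
    return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)


raw = b'caf\xe9'                               # Latin-1 bytes, not valid UTF-8
text = raw.decode('utf-8', 'surrogateescape')  # the bad byte becomes U+DCE9
replaced = convert_surrogateescape(text)       # bad byte becomes U+FFFD
dropped = convert_surrogateescape(text, 'ignore')

# Surrogates outside U+DC80..U+DCFF are not surrogate escapes, so the
# encoding step rejects them.
try:
    convert_surrogateescape('\ud800')
    rejected = False
except UnicodeEncodeError:
    rejected = True
```

Note that the function only accepts strings whose surrogates are all surrogate escapes; other lone surrogates raise UnicodeEncodeError.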
Nick Coghlan writes:
The current proposal on the issue tracker is to instead take advantage of the existing error handlers:
    def convert_surrogateescape(data, errors='replace'):
        return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)
That code is short, but semantically dense
And it doesn't implement your original suggestion of replacement with '?' (and another possibility for history buffs is 0x1A, ASCII SUB). At least, AFAICT from the docs there's no way to specify the replacement character; decoding always uses U+FFFD. (If I knew how to do that, I would have suggested this.)
(Added bonus: once you're alerted to the possibility, it's trivial to write your own version for existing Python 3 versions.
I'm not sure that's true. At least, to me that code was obvious -- I got the exact definition (except for the function name) on the first try -- but I ruled it out because it didn't implement your suggestion of replacement with '?', even as an option.

OTOH, I think a lot of the resistance to codec-based solutions is the misconception that en/decoding streams is expensive, or the misconception that Python's internal representation of text as an array of code points (rather than an array of "characters" or "grapheme clusters") is somehow insufficient for text processing.

Steve
In the process of booking up for my other post in this thread, I noticed the 'surrogatepass' handler.

Is there a real use case for the 'surrogatepass' error handler? It seems like a horrible break in the abstraction. IMHO, if there's a need, the application should handle this. Python shouldn't provide it on encoding as the resulting streams are not Unicode conformant, nor on decoding UTF-16, as conversion of surrogate pairs is a requirement of all Unicode versions since about 1995.

Steve
On 29 August 2014 10:32, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Nick Coghlan writes:
The current proposal on the issue tracker is to instead take advantage of the existing error handlers:
    def convert_surrogateescape(data, errors='replace'):
        return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)
That code is short, but semantically dense
And it doesn't implement your original suggestion of replacement with '?' (and another possibility for history buffs is 0x1A, ASCII SUB). At least, AFAICT from the docs there's no way to specify the replacement character; decoding always uses U+FFFD. (If I knew how to do that, I would have suggested this.)
If that actually matters in a given context, I can do an ordinary string replacement later. I couldn't think of a case where it actually mattered though - if "must be ASCII" was a requirement, then backslashreplace was a suitable alternative that lost less information (hence the RFE to make that also usable on input).
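The two options mentioned here can be sketched as follows. This assumes Python 3.5 or later, where the backslashreplace handler is accepted when decoding; the sample bytes are made up for illustration.

```python
def convert_surrogateescape(data, errors='replace'):
    return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)


text = b'caf\xe9'.decode('utf-8', 'surrogateescape')

# Option 1: accept U+FFFD from the codec, then do an ordinary string
# replacement. This also rewrites any U+FFFD already present in the text.
with_question_marks = convert_surrogateescape(text).replace('\ufffd', '?')

# Option 2: keep the byte value visible as an ASCII-only escape sequence.
with_escapes = convert_surrogateescape(text, 'backslashreplace')

# The same handler can be used directly when decoding the original bytes.
decoded_directly = b'caf\xe9'.decode('utf-8', 'backslashreplace')
```

The second option loses less information, since the original byte value can still be read from the output.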
(Added bonus: once you're alerted to the possibility, it's trivial to write your own version for existing Python 3 versions.
I'm not sure that's true. At least, to me that code was obvious -- I got the exact definition (except for the function name) on the first try -- but I ruled it out because it didn't implement your suggestion of replacement with '?', even as an option.
Yeah, part of the tracker discussion involved me realising that part wasn't a necessary requirement - the key is being able to get rid of the surrogates, or replace them with something readily identifiable, and less about being able to control exactly what they get replaced by.
OTOH, I think a lot of the resistance to codec-based solutions is the misconception that en/decoding streams is expensive, or the misconception that Python's internal representation of text as an array of code points (rather than an array of "characters" or "grapheme clusters") is somehow insufficient for text processing.
We don't actually have any technical deep dives into how Python 3's text handling works readily available online, so there's a lot of speculation and misinformation floating around. My recent article gives the high level context, but it really needs to be paired up with a piece (or pieces) that go deep into the details of codec optimisation, the UTF-8 caching, how it integrates with the UTF-16-LE Windows APIs, how the internal storage structure is determined at allocation time, how it maintains compatibility with the legacy C extension APIs, etc.

The only current widely distributed articles on those topics are written from a perspective that assumes we don't know anything about Unicode, and are just making things unnecessarily complicated (rather than solving hard cross platform compatibility and text processing performance problems). That perspective is incorrect, but "trust me, they're wrong" doesn't work very well with people that are already angry.

Text manipulation is one of the most sophisticated subsystems in the interpreter, though, so it's hard to know where to start on such a series (and easy to get intimidated by the sheer magnitude of the work involved in doing it right).

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 29.08.2014 02:41, Stephen J. Turnbull wrote:
In the process of booking up for my other post in this thread, I noticed the 'surrogatepass' handler.
Is there a real use case for the 'surrogatepass' error handler? It seems like a horrible break in the abstraction. IMHO, if there's a need, the application should handle this. Python shouldn't provide it on encoding as the resulting streams are not Unicode conformant, nor on decoding UTF-16, as conversion of surrogate pairs is a requirement of all Unicode versions since about 1995.
This error handler allows applications to reactivate the Python 2 style behavior of the UTF codecs in Python 3, which allow reading lone surrogates on input.

Since Python allows working with lone surrogates in Unicode (they are valid code points) and we're using UTF-8 for marshal, we needed a way to make sure that Python 3 also optionally supports working with lone surrogates in such UTF-8 streams (nowadays called CESU-8: http://en.wikipedia.org/wiki/CESU-8).

See
http://bugs.python.org/issue3672
http://bugs.python.org/issue12892
for discussions.

--
Marc-Andre Lemburg
eGenix.com
Professional Python Services directly from the Source (#1, Aug 29 2014)
Python Projects, Consulting and Support ... http://www.egenix.com/ mxODBC.Zope/Plone.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
2014-08-27: Released eGenix PyRun 2.0.1 ... http://egenix.com/go62 2014-09-19: PyCon UK 2014, Coventry, UK ... 21 days to go 2014-09-27: PyDDF Sprint 2014 ... 29 days to go eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/
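The behaviour described in this message can be observed with a short sketch. The variable names are illustrative; the calls are the standard str.encode and bytes.decode methods.

```python
lone = '\ud800'  # a lone high surrogate, which is a valid code point in a str

# The strict UTF-8 codec refuses to encode a lone surrogate.
try:
    lone.encode('utf-8')
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# With 'surrogatepass' the surrogate is encoded like any other code point.
encoded = lone.encode('utf-8', 'surrogatepass')

# The strict decoder likewise rejects the resulting byte sequence.
try:
    encoded.decode('utf-8')
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

# 'surrogatepass' on decoding restores the lone surrogate.
round_tripped = encoded.decode('utf-8', 'surrogatepass')
```

Without the handler, both directions fail, so the default utf-8 codec stays strict.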
On Fri, 29 Aug 2014, M.-A. Lemburg wrote:
Since Python allows working with lone surrogates in Unicode (they are valid code points) and we're using UTF-8 for marshal, we needed a way to make sure that Python 3 also optionally supports working with lone surrogates in such UTF-8 streams (nowadays called CESU-8: http://en.wikipedia.org/wiki/CESU-8).
If I want that, wouldn't I specify "cesu-8" as the encoding?

i.e., instead of .decode ('utf-8') I would use .decode ('cesu-8'). Right now, trying this I get that cesu-8 is an unknown encoding, but that could be changed without affecting the behaviour of the utf-8 codec.

It seems to me that .decode ('utf-8') should decode exactly and only valid utf-8, including the non-use of surrogate pairs as an intermediate encoding step.

Isaac Morland
CSCF Web Guru
DC 2554C, x36650
WWW Software Specialist
On 29.08.2014 13:22, Isaac Morland wrote:
On Fri, 29 Aug 2014, M.-A. Lemburg wrote:
Since Python allows working with lone surrogates in Unicode (they are valid code points) and we're using UTF-8 for marshal, we needed a way to make sure that Python 3 also optionally supports working with lone surrogates in such UTF-8 streams (nowadays called CESU-8: http://en.wikipedia.org/wiki/CESU-8).
If I want that, wouldn't I specify "cesu-8" as the encoding?
i.e., instead of .decode ('utf-8') I would use .decode ('cesu-8'). Right now, trying this I get that cesu-8 is an unknown encoding but that could be changed without affecting the behaviour of the utf-8 codec.
Why write a new codec that's almost identical to the utf-8 codec, if you can get the same functionality by explicitly using a special error handler?
From a maintenance POV that does not sound like a good approach.
It seems to me that .decode ('utf-8') should decode exactly and only valid utf-8, including the non-use of surrogate pairs as an intermediate encoding step.
It does in Python 3.

--
Marc-Andre Lemburg
M.-A. Lemburg wrote:
we needed a way to make sure that Python 3 also optionally supports working with lone surrogates in such UTF-8 streams (nowadays called CESU-8: http://en.wikipedia.org/wiki/CESU-8).
I don't think CESU-8 is the same thing. According to the wiki page, CESU-8 *requires* all code points above 0xffff to be split into surrogate pairs before encoding. It also doesn't say that lone surrogates are valid -- it doesn't mention lone surrogates at all, only pairs. Neither does the linked technical report.

The technical report also says that CESU-8 forbids any UTF-8 sequences of more than three bytes, so it's definitely not "UTF-8 plus lone surrogates".

--
Greg
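To make the difference concrete, here is a rough sketch of an encoder that follows the rule described above: code points above 0xFFFF are split into a surrogate pair and each half is written as a three-byte sequence. The cesu8_encode function is a hypothetical helper written for this illustration; Python does not ship a cesu-8 codec.

```python
def cesu8_encode(text):
    """Encode text following the CESU-8 rule described above (sketch only)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP code points are encoded exactly as in UTF-8. A lone
            # surrogate in the input raises UnicodeEncodeError here.
            out += ch.encode('utf-8')
        else:
            # Split into a UTF-16 surrogate pair, then encode each half
            # as its own three-byte sequence.
            cp -= 0x10000
            pair = chr(0xD800 + (cp >> 10)) + chr(0xDC00 + (cp & 0x3FF))
            out += pair.encode('utf-8', 'surrogatepass')
    return bytes(out)


sample = '\U0001F600'                # a code point above 0xFFFF
as_utf8 = sample.encode('utf-8')     # one four-byte sequence
as_cesu8 = cesu8_encode(sample)      # two three-byte sequences
```

The same character takes four bytes in UTF-8 and six in this encoding, so the two byte streams are not interchangeable.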
Greg Ewing writes:
M.-A. Lemburg wrote:
we needed a way to make sure that Python 3 also optionally supports working with lone surrogates in such UTF-8 streams (nowadays called CESU-8: http://en.wikipedia.org/wiki/CESU-8).
Besides what Greg says, CESU-8 is a UTF, and therefore encodes valid Unicode. Speaking imprecisely, CESU-8 is UTF-16 with variable-width code units (i.e., each 16-bit code point is represented using the UTF-8 variable-width representation).[1] I think you are thinking of Markus Kuhn's utf-8b (which I believe is exactly what is implemented by the surrogateescape handler).

As far as the goal of "working with lone surrogates in such UTF-8 streams", the surrogateescape handler already permits that, and does so consistently across streams in the sense that lone surrogates in the UTF-8 stream cannot be mixed with garbage bytes decoded by surrogateescape in another stream, which produces an unencodable mess.

I still don't see a justification for the surrogatepass handler. What applications are producing (not merely passing through) UTF-8-encoded surrogates these days?

Footnotes:
[1] For the curious, it's imprecise because in Unicode code units are fixed-width by definition.
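The mixing problem mentioned above can be illustrated with a small sketch. The byte values are made up for the example, and this is only one reading of the scenario described.

```python
# An encoded lone surrogate read with surrogateescape: every byte is escaped
# individually, so the stream round-trips byte for byte.
stream = b'\xed\xa0\x80'
escaped = stream.decode('utf-8', 'surrogateescape')
round_trip = escaped.encode('utf-8', 'surrogateescape')

# Mixing the two handlers: one string from surrogatepass, one from
# surrogateescape, concatenated.
passed = stream.decode('utf-8', 'surrogatepass')       # a single U+D800
garbage = b'\xe9'.decode('utf-8', 'surrogateescape')   # a single U+DCE9
mixed = passed + garbage

# surrogateescape cannot encode U+D800 ...
try:
    mixed.encode('utf-8', 'surrogateescape')
    escape_ok = True
except UnicodeEncodeError:
    escape_ok = False

# ... and surrogatepass encodes U+DCE9 as three bytes instead of restoring
# the original byte 0xE9.
via_pass = mixed.encode('utf-8', 'surrogatepass')
```

Neither handler can turn the mixed string back into the original bytes, whereas using surrogateescape alone round-trips cleanly.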
On 30.08.2014 01:37, Greg Ewing wrote:
M.-A. Lemburg wrote:
we needed a way to make sure that Python 3 also optionally supports working with lone surrogates in such UTF-8 streams (nowadays called CESU-8: http://en.wikipedia.org/wiki/CESU-8).
I don't think CESU-8 is the same thing. According to the wiki page, CESU-8 *requires* all code points above 0xffff to be split into surrogate pairs before encoding. It also doesn't say that lone surrogates are valid -- it doesn't mention lone surrogates at all, only pairs. Neither does the linked technical report.
The technical report also says that CESU-8 forbids any UTF-8 sequences of more than three bytes, so it's definitely not "UTF-8 plus lone surrogates".
You're right, it's not the same as UTF-8 plus lone surrogates. CESU-8 does encode surrogates as individual code points using the UTF-8 encoding, which is what probably caused it to be mentioned in discussions when talking about having UTF-8 streams do the same for lone surrogates.

So let's call the encoding UTF-8-py so that everyone knows what we're talking about :-)

--
Marc-Andre Lemburg
participants (5)
- Greg Ewing
- Isaac Morland
- M.-A. Lemburg
- Nick Coghlan
- Stephen J. Turnbull