Bytes path related questions for Guido
At Guido's request, splitting out two specific questions from Serhiy's thread where I believe we could do with an explicit "yes or no" from him.
1. Should we accept patches adding support for the direct use of bytes paths in lower level filesystem manipulation APIs? (i.e. everything that isn't pathlib)
This was Serhiy's original question (due to some open issues [1,2]). I think the answer is yes, as we already do in some cases, and the "pathlib doesn't support binary paths" design decision is a high level platform independent API vs low level potentially platform dependent API one rather than being about disallowing the use of bytes paths in general.
[1] http://bugs.python.org/issue19997
[2] http://bugs.python.org/issue20797
2. Should we add some additional helpers to the string module for dealing with surrogate escaped bytes and other techniques for smuggling arbitrary binary data as text?
My proposal [3] is to add:
* string.escaped_surrogates (constant with the 128 escaped code points)
* string.clean(s): replaces surrogates with '\ufffd' or another specified code point
* string.redecode(s, encoding): encodes a string back to bytes and then decodes it again using the specified encoding (the old encoding defaults to 'latin-1' to match the assumptions in WSGI)
"s != string.clean(s)" would then serve as a check for "does this string contain any surrogate escaped bytes?"
[3] http://bugs.python.org/issue18814#msg225791
Regards, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
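As a rough illustration of question 1, the sketch below shows the str/bytes symmetry that some low-level APIs already offer: os.listdir() returns names in the same type as its argument, and os.fsencode()/os.fsdecode() convert between the two forms. The file name and the sample byte string are made up for the example, and the round trip of an undecodable byte assumes a POSIX system where the filesystem error handler is surrogateescape.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one file with a plain ASCII name (hypothetical example name).
    with open(os.path.join(tmp, "example.txt"), "w") as f:
        f.write("x")
    str_names = os.listdir(tmp)                  # str path in, str names out
    bytes_names = os.listdir(os.fsencode(tmp))   # bytes path in, bytes names out

# A name that is not valid UTF-8 can still be carried as text and recovered.
# Assumption: POSIX filesystem encoding with the surrogateescape handler.
raw = b"report-\xff.txt"
as_text = os.fsdecode(raw)
round_tripped = os.fsencode(as_text)
```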
On 24 August 2014 14:44, Nick Coghlan <ncoghlan@gmail.com> wrote:
2. Should we add some additional helpers to the string module for dealing with surrogate escaped bytes and other techniques for smuggling arbitrary binary data as text?
My proposal [3] is to add:
* string.escaped_surrogates (constant with the 128 escaped code points)
* string.clean(s): replaces surrogates with '\ufffd' or another specified code point
* string.redecode(s, encoding): encodes a string back to bytes and then decodes it again using the specified encoding (the old encoding defaults to 'latin-1' to match the assumptions in WSGI)
Serhiy & Ezio convinced me to scale this one back to a proposal for "codecs.clean_surrogate_escapes(s)", which replaces surrogates that may be produced by surrogateescape (that's what string.clean() above was supposed to be, but my description was not correct, and the name was too vague for that error to be obvious to the reader)
"s != codecs.clean_surrogate_escapes(s)" would then become the check for "does this string contain any surrogate escaped bytes?"
Regards, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
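A minimal sketch of what the scaled-back helper might look like, written as a standalone function rather than anything that exists in the codecs module. The function names and the fixed U+FFFD replacement are assumptions for illustration. The surrogateescape handler maps undecodable bytes 0x80-0xFF to the code points U+DC80-U+DCFF, so a translation table over that range is enough.

```python
# Hypothetical helper: not part of the codecs module.
# surrogateescape maps bytes 0x80..0xFF to code points U+DC80..U+DCFF.
_ESCAPE_TABLE = dict.fromkeys(range(0xDC80, 0xDD00), "\ufffd")


def clean_surrogate_escapes(s):
    """Replace every escaped byte in s with U+FFFD."""
    return s.translate(_ESCAPE_TABLE)


def has_surrogate_escapes(s):
    """The "s != clean(s)" check described in the message above."""
    return s != clean_surrogate_escapes(s)


text = b"caf\xe9".decode("ascii", "surrogateescape")   # 'caf\udce9'
cleaned = clean_surrogate_escapes(text)
```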
Le 24/08/2014 09:04, Nick Coghlan a écrit :
On 24 August 2014 14:44, Nick Coghlan <ncoghlan@gmail.com> wrote:
2. Should we add some additional helpers to the string module for dealing with surrogate escaped bytes and other techniques for smuggling arbitrary binary data as text?
My proposal [3] is to add:
* string.escaped_surrogates (constant with the 128 escaped code points)
* string.clean(s): replaces surrogates with '\ufffd' or another specified code point
* string.redecode(s, encoding): encodes a string back to bytes and then decodes it again using the specified encoding (the old encoding defaults to 'latin-1' to match the assumptions in WSGI)
Serhiy & Ezio convinced me to scale this one back to a proposal for "codecs.clean_surrogate_escapes(s)", which replaces surrogates that may be produced by surrogateescape (that's what string.clean() above was supposed to be, but my description was not correct, and the name was too vague for that error to be obvious to the reader)
"clean" conveys the wrong meaning. It should use a scary word such as "trap". "Cleaning" surrogates is unlikely to be the right procedure when dealing with surrogates produced by undecodable byte sequences. Regards Antoine.
On 25 August 2014 00:23, Antoine Pitrou <antoine@python.org> wrote:
Le 24/08/2014 09:04, Nick Coghlan a écrit :
Serhiy & Ezio convinced me to scale this one back to a proposal for "codecs.clean_surrogate_escapes(s)", which replaces surrogates that may be produced by surrogateescape (that's what string.clean() above was supposed to be, but my description was not correct, and the name was too vague for that error to be obvious to the reader)
"clean" conveys the wrong meaning. It should use a scary word such as "trap". "Cleaning" surrogates is unlikely to be the right procedure when dealing with surrogates produced by undecodable byte sequences.
"purge_surrogate_escapes" was the other term that occurred to me. Either way, my use case is to filter them out when I *don't* want to pass them along to other software, but would prefer the Unicode replacement character to the ASCII question mark created by using the "replace" filter when encoding. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Nick Coghlan writes:
"purge_surrogate_escapes" was the other term that occurred to me.
"purge" suggests removal, not replacement. That may be useful too. neutralize_surrogate_escapes(s, remove=False, replacement='\uFFFD') maybe? (Of course the remove argument is feature creep, so I'm only about +0.5 myself. And the name is long, but I can't think of any better synonyms for "make safe" in English right now).
Either way, my use case is to filter them out when I *don't* want to pass them along to other software, but would prefer the Unicode replacement character to the ASCII question mark created by using the "replace" filter when encoding.
I think it would be preferable to be unicodely correct here by default, since this is a str -> str function.
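The difference between the two outcomes discussed above can be seen directly. This is only a sketch with a made-up sample string: encoding with the "replace" handler turns an escaped byte into an ASCII question mark, while replacing it at the str level first keeps the result as U+FFFD.

```python
text = b"caf\xe9".decode("utf-8", "surrogateescape")   # 'caf\udce9'

# Encoding with "replace": the lone surrogate becomes an ASCII '?'.
with_question_mark = text.encode("utf-8", "replace")

# Replacing at the str level first yields U+FFFD REPLACEMENT CHARACTER.
table = dict.fromkeys(range(0xDC80, 0xDD00), "\ufffd")
with_replacement_char = text.translate(table).encode("utf-8")
```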
On 2014-08-26 03:11, Stephen J. Turnbull wrote:
Nick Coghlan writes:
"purge_surrogate_escapes" was the other term that occurred to me.
"purge" suggests removal, not replacement. That may be useful too.
neutralize_surrogate_escapes(s, remove=False, replacement='\uFFFD')
How about: replace_surrogate_escapes(s, replacement='\uFFFD') If you want them removed, just pass an empty string as the replacement.
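One way the suggested signature could be sketched, using a regular expression so that the replacement may be empty or longer than one character. The function name comes from the message above; the implementation is an assumption and no such function exists in the standard library.

```python
import re

# Matches the 128 code points that surrogateescape can produce.
_ESCAPE_RE = re.compile("[\udc80-\udcff]")


def replace_surrogate_escapes(s, replacement="\ufffd"):
    """Hypothetical helper: substitute each escaped byte with replacement."""
    # A callable avoids backslash processing in the replacement string.
    return _ESCAPE_RE.sub(lambda match: replacement, s)


sample = "a\udcffb"
replaced = replace_surrogate_escapes(sample)
removed = replace_surrogate_escapes(sample, "")
```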
maybe? (Of course the remove argument is feature creep, so I'm only about +0.5 myself. And the name is long, but I can't think of any better synonyms for "make safe" in English right now).
Either way, my use case is to filter them out when I *don't* want to pass them along to other software, but would prefer the Unicode replacement character to the ASCII question mark created by using the "replace" filter when encoding.
I think it would be preferable to be unicodely correct here by default, since this is a str -> str function.
On 8/26/2014 4:31 AM, MRAB wrote:
On 2014-08-26 03:11, Stephen J. Turnbull wrote:
Nick Coghlan writes:
"purge_surrogate_escapes" was the other term that occurred to me.
"purge" suggests removal, not replacement. That may be useful too.
neutralize_surrogate_escapes(s, remove=False, replacement='\uFFFD')
How about:
replace_surrogate_escapes(s, replacement='\uFFFD')
If you want them removed, just pass an empty string as the replacement.
And further, replacement could be a vector of 128 characters, to do immediate transcoding, or a single character to do wholesale replacement with some gibberish character, or None to remove (or an empty string).
Glenn Linderman writes:
On 8/26/2014 4:31 AM, MRAB wrote:
On 2014-08-26 03:11, Stephen J. Turnbull wrote:
Nick Coghlan writes:
How about:
replace_surrogate_escapes(s, replacement='\uFFFD')
If you want them removed, just pass an empty string as the replacement.
That seems better to me (I had too much C for breakfast, I think).
And further, replacement could be a vector of 128 characters, to do immediate transcoding,
Using what encoding? If you knew that much, why didn't you use (write, if necessary) an appropriate codec? I can't envision this being useful. OTOH, I could see using replace_surrogate_escapes(s, replacement='&#xFFFD;') in HTML. (Actually, probably not; if it makes sense to use Unicode features you're probably using Unicode as the external encoding, so a character entity is silly. But there might be contexts with useful multicharacter replacements.)
or a single character to do wholesale replacement with some gibberish character, or None to remove (or an empty string).
Not None, that means default (which should be the Unicode standard REPLACEMENT CHARACTER U+FFFD). Steve
On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote:
Glenn Linderman writes:
On 8/26/2014 4:31 AM, MRAB wrote:
On 2014-08-26 03:11, Stephen J. Turnbull wrote:
Nick Coghlan writes:
How about:
replace_surrogate_escapes(s, replacement='\uFFFD')
If you want them removed, just pass an empty string as the replacement.
That seems better to me (I had too much C for breakfast, I think).
And further, replacement could be a vector of 128 characters, to do immediate transcoding,
Using what encoding?
The vector would contain the transcoding. Each lone surrogate would map to a character in the vector.
If you knew that much, why didn't you use (write, if necessary) an appropriate codec? I can't envision this being useful.
If the data format describes its encoding, possibly containing data from several encodings in various spots, then perhaps it is best read as binary, and processed as binary until those definitions are found. But an alternative would be to read with surrogate escapes, and then when the encoding is determined, to transcode the data. Previously, a proposal was made to reverse the surrogate escapes to the original bytes, and then apply the (now known) appropriate codec. There are not appropriate codecs that can convert directly from surrogate escapes to the desired end result. This technique could be used instead, for single-byte, non-escaped encodings. On the other hand, writing specialty codecs for the purpose would be more general.
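The "vector of 128 characters" idea can be sketched with str.translate(). Here the vector is built from Latin-1 purely as an example of a single-byte encoding that defines all 128 high bytes; the choice of encoding and the variable names are assumptions.

```python
# Entry i of the vector is the character for byte 0x80 + i.
vector = bytes(range(0x80, 0x100)).decode("latin-1")

# Escaped byte 0x80 + i is stored as code point U+DC80 + i.
table = {0xDC80 + i: ch for i, ch in enumerate(vector)}

text = b"caf\xe9".decode("ascii", "surrogateescape")   # 'caf\udce9'
transcoded = text.translate(table)
```

This only works for single-byte encodings, which is the limitation acknowledged in the message above.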
Glenn Linderman writes:
On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote:
Glenn Linderman writes:
And further, replacement could be a vector of 128 characters, to do immediate transcoding,
Using what encoding?
The vector would contain the transcoding. Each lone surrogate would map to a character in the vector.
Yes, that's obvious. The question is where do you get the vector?
If you knew that much, why didn't you use (write, if necessary) an appropriate codec? I can't envision this being useful.
If the data format describes its encoding, possibly containing data from several encodings in various spots, then perhaps it is best read as binary, and processed as binary until those definitions are found.
Exactly. That's precisely why bytes have a .decode method.
But an alternative would be to read with surrogate escapes, and then when the encoding is determined, to transcode the data.
Not every one-line expression needs to be in the stdlib: data = data[:start] + data[start:end].encode('utf-8', errors='surrogateescape').decode('DTRT-now') + data[end:] Note that you *do* need to know start and end, because of the possibility of "several encodings", where once you apply this technique to the whole text, you can't recover the surrogates when you get the encoding wrong.
Previously, a proposal was made to reverse the surrogate escapes to the original bytes, and then apply the (now known) appropriate codec.
Sure. And in fact I do this kind of thing all the time in Emacs, using the decode(encode(slice)) approach. The only times in 25 years of working with the insanity of digitized Japanese I've had a use for anything other than that is when I don't have a round-tripping codec. In that case I have to preserve the bytes or suffer lossy conversion anyway, regardless of the method used to reconvert. But surrogateescape is necessarily round-tripping (maybe with a few exceptions in Chinese and a very small number in other languages, but those failures are due to Unicode, not to surrogateescape).
There are not appropriate codecs that can convert directly from surrogate escapes to the desired end result.
And there currently cannot be. codecs are bytes<->str, not str->str.
This technique could be used instead, for single-byte, non-escaped encodings.
That's pure theory, not a use case. We have codecs for all the encodings with significant numbers of users, and writing a new one simply isn't that hard. Steve
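The decode(encode(slice)) approach described above can be written out as a small function. This is a sketch under stated assumptions: the helper name is made up, the text was originally decoded as UTF-8 with surrogateescape, and Shift JIS stands in for the encoding that is discovered later.

```python
def redecode_slice(data, start, end, encoding):
    """Hypothetical helper: re-decode data[start:end] with a known codec."""
    # Recover the original bytes of the slice, then decode them properly.
    raw = data[start:end].encode("utf-8", errors="surrogateescape")
    return data[:start] + raw.decode(encoding) + data[end:]


# An ASCII prefix followed by bytes in another encoding.
original = b"name=" + "\u65e5\u672c".encode("shift_jis")
text = original.decode("utf-8", errors="surrogateescape")

fixed = redecode_slice(text, len("name="), len(text), "shift_jis")
```

The start and end offsets matter for the reason given in the message: applying the wrong codec to a region cannot be undone afterwards.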
On 2014-08-28 05:56, Glenn Linderman wrote:
On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote:
Glenn Linderman writes:
On 8/26/2014 4:31 AM, MRAB wrote:
On 2014-08-26 03:11, Stephen J. Turnbull wrote:
Nick Coghlan writes:
How about:
replace_surrogate_escapes(s, replacement='\uFFFD')
If you want them removed, just pass an empty string as the replacement.
That seems better to me (I had too much C for breakfast, I think).
And further, replacement could be a vector of 128 characters, to do immediate transcoding,
Using what encoding?
The vector would contain the transcoding. Each lone surrogate would map to a character in the vector.
If you knew that much, why didn't you use (write, if necessary) an appropriate codec? I can't envision this being useful.
If the data format describes its encoding, possibly containing data from several encodings in various spots, then perhaps it is best read as binary, and processed as binary until those definitions are found.
But an alternative would be to read with surrogate escapes, and then when the encoding is determined, to transcode the data. Previously, a proposal was made to reverse the surrogate escapes to the original bytes, and then apply the (now known) appropriate codec. There are not appropriate codecs that can convert directly from surrogate escapes to the desired end result. This technique could be used instead, for single-byte, non-escaped encodings. On the other hand, writing specialty codecs for the purpose would be more general.
There'll be a surrogate escape if a byte couldn't be decoded, but just because a byte could be decoded, it doesn't mean that it's correct. If you picked the wrong encoding, the other codepoints could be wrong too.
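A short demonstration of this point, with a made-up sample: UTF-8 bytes decoded as Latin-1 produce no surrogate escapes at all, yet the text is still wrong.

```python
raw = "caf\xe9".encode("utf-8")                      # b'caf\xc3\xa9'
wrong = raw.decode("latin-1", "surrogateescape")     # decodes "successfully"

# No escaped bytes are present, so a surrogate check reports nothing.
has_escapes = any(0xDC80 <= ord(ch) <= 0xDCFF for ch in wrong)
```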
On 8/28/2014 12:30 AM, MRAB wrote:
On 2014-08-28 05:56, Glenn Linderman wrote:
On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote:
Glenn Linderman writes:
On 8/26/2014 4:31 AM, MRAB wrote:
On 2014-08-26 03:11, Stephen J. Turnbull wrote:
Nick Coghlan writes:
How about:
replace_surrogate_escapes(s, replacement='\uFFFD')
If you want them removed, just pass an empty string as the replacement.
That seems better to me (I had too much C for breakfast, I think).
And further, replacement could be a vector of 128 characters, to do immediate transcoding,
Using what encoding?
The vector would contain the transcoding. Each lone surrogate would map to a character in the vector.
If you knew that much, why didn't you use (write, if necessary) an appropriate codec? I can't envision this being useful.
If the data format describes its encoding, possibly containing data from several encodings in various spots, then perhaps it is best read as binary, and processed as binary until those definitions are found.
But an alternative would be to read with surrogate escapes, and then when the encoding is determined, to transcode the data. Previously, a proposal was made to reverse the surrogate escapes to the original bytes, and then apply the (now known) appropriate codec. There are not appropriate codecs that can convert directly from surrogate escapes to the desired end result. This technique could be used instead, for single-byte, non-escaped encodings. On the other hand, writing specialty codecs for the purpose would be more general.
There'll be a surrogate escape if a byte couldn't be decoded, but just because a byte could be decoded, it doesn't mean that it's correct.
If you picked the wrong encoding, the other codepoints could be wrong too.
Aha! Thanks for pointing out the flaw in my reasoning. But that means it is also pretty useless to "replace_surrogate_escapes" at all, because it only cleans out the non-decodable characters, not the incorrectly decoded characters.
On Thu, 28 Aug 2014 10:15:40 -0700, Glenn Linderman <v+python@g.nevcal.com> wrote:
On 8/28/2014 12:30 AM, MRAB wrote:
On 2014-08-28 05:56, Glenn Linderman wrote:
On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote:
Glenn Linderman writes:
On 8/26/2014 4:31 AM, MRAB wrote:
On 2014-08-26 03:11, Stephen J. Turnbull wrote:
Nick Coghlan writes:
How about:
replace_surrogate_escapes(s, replacement='\uFFFD')
If you want them removed, just pass an empty string as the replacement.
That seems better to me (I had too much C for breakfast, I think).
And further, replacement could be a vector of 128 characters, to do immediate transcoding,
Using what encoding?
The vector would contain the transcoding. Each lone surrogate would map to a character in the vector.
If you knew that much, why didn't you use (write, if necessary) an appropriate codec? I can't envision this being useful.
If the data format describes its encoding, possibly containing data from several encodings in various spots, then perhaps it is best read as binary, and processed as binary until those definitions are found.
But an alternative would be to read with surrogate escapes, and then when the encoding is determined, to transcode the data. Previously, a proposal was made to reverse the surrogate escapes to the original bytes, and then apply the (now known) appropriate codec. There are not appropriate codecs that can convert directly from surrogate escapes to the desired end result. This technique could be used instead, for single-byte, non-escaped encodings. On the other hand, writing specialty codecs for the purpose would be more general.
There'll be a surrogate escape if a byte couldn't be decoded, but just because a byte could be decoded, it doesn't mean that it's correct.
If you picked the wrong encoding, the other codepoints could be wrong too.
Aha! Thanks for pointing out the flaw in my reasoning. But that means it is also pretty useless to "replace_surrogate_escapes" at all, because it only cleans out the non-decodable characters, not the incorrectly decoded characters.
Well, replace would still be useful for ASCII+surrogateescape. Also for cases where the data stream is *supposed* to be in a given encoding, but contains undecodable bytes. Showing the stuff that incorrectly decodes as whatever it decodes to is generally what you want in that case. --David
On 8/28/2014 10:41 AM, R. David Murray wrote:
On Thu, 28 Aug 2014 10:15:40 -0700, Glenn Linderman <v+python@g.nevcal.com> wrote:
On 8/28/2014 12:30 AM, MRAB wrote:
On 2014-08-28 05:56, Glenn Linderman wrote:
On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote:
Glenn Linderman writes:
On 8/26/2014 4:31 AM, MRAB wrote:
On 2014-08-26 03:11, Stephen J. Turnbull wrote:
Nick Coghlan writes:
How about:
replace_surrogate_escapes(s, replacement='\uFFFD')
If you want them removed, just pass an empty string as the replacement.
That seems better to me (I had too much C for breakfast, I think).
And further, replacement could be a vector of 128 characters, to do immediate transcoding,
Using what encoding?
The vector would contain the transcoding. Each lone surrogate would map to a character in the vector.
If you knew that much, why didn't you use (write, if necessary) an appropriate codec? I can't envision this being useful.
If the data format describes its encoding, possibly containing data from several encodings in various spots, then perhaps it is best read as binary, and processed as binary until those definitions are found.
But an alternative would be to read with surrogate escapes, and then when the encoding is determined, to transcode the data. Previously, a proposal was made to reverse the surrogate escapes to the original bytes, and then apply the (now known) appropriate codec. There are not appropriate codecs that can convert directly from surrogate escapes to the desired end result. This technique could be used instead, for single-byte, non-escaped encodings. On the other hand, writing specialty codecs for the purpose would be more general.
There'll be a surrogate escape if a byte couldn't be decoded, but just because a byte could be decoded, it doesn't mean that it's correct.
If you picked the wrong encoding, the other codepoints could be wrong too.
Aha! Thanks for pointing out the flaw in my reasoning. But that means it is also pretty useless to "replace_surrogate_escapes" at all, because it only cleans out the non-decodable characters, not the incorrectly decoded characters.
Well, replace would still be useful for ASCII+surrogateescape.
How?
Also for cases where the data stream is *supposed* to be in a given encoding, but contains undecodable bytes. Showing the stuff that incorrectly decodes as whatever it decodes to is generally what you want in that case.
Sure, people can learn to recognize mojibake for what it is, and maybe even learn to recognize it for what it was intended to be, in limited domains. But suppressing/replacing the surrogates doesn't help with that... would it not be better to replace the surrogates with an escape sequence that shows the original, undecodable, byte value? Like \xNN ?
On Thu, 28 Aug 2014 10:54:44 -0700, Glenn Linderman <v+python@g.nevcal.com> wrote:
On 8/28/2014 10:41 AM, R. David Murray wrote:
On Thu, 28 Aug 2014 10:15:40 -0700, Glenn Linderman <v+python@g.nevcal.com> wrote:
On 8/28/2014 12:30 AM, MRAB wrote:
There'll be a surrogate escape if a byte couldn't be decoded, but just because a byte could be decoded, it doesn't mean that it's correct.
If you picked the wrong encoding, the other codepoints could be wrong too. Aha! Thanks for pointing out the flaw in my reasoning. But that means it is also pretty useless to "replace_surrogate_escapes" at all, because it only cleans out the non-decodable characters, not the incorrectly decoded characters. Well, replace would still be useful for ASCII+surrogateescape.
How?
Because there "can't" be any incorrectly decoded bytes in the ASCII part, so all undecodable bytes turning into 'unrecognized character' glyphs is useful. "can't" is in quotes because of course if you decode random binary data as ASCII+surrogate escape you could get a mess just like any other encoding, so this is really a "more *likely* to be useful" version of my second point, because "real" ASCII with some junk bytes mixed in is much more likely to be encountered in the wild than, say, utf-8 with some junk bytes mixed in (although that is probably changing as use of utf-8 becomes more widespread, so this point applies to utf-8 as well).
Also for cases where the data stream is *supposed* to be in a given encoding, but contains undecodable bytes. Showing the stuff that incorrectly decodes as whatever it decodes to is generally what you want in that case.
Sure, people can learn to recognize mojibake for what it is, and maybe even learn to recognize it for what it was intended to be, in limited domains. But suppressing/replacing the surrogates doesn't help with
Well, it does if the alternative is not being able to display the string to the user at all. And yeah, people being able to recognize mojibake in specific problem domains is what I'm talking about...not perhaps a great use case, but it is a use case.
that... would it not be better to replace the surrogates with an escape sequence that shows the original, undecodable, byte value? Like \xNN ?
Yeah, that idea has been floated as well, and I think it would indeed be more useful than the 'unknown character' glyph. I've also seen fonts that display the hex code inside a box character when the code point is unknown, which would be cool...but that can hardly be part of unicode, can it? :) --David
On 28 Aug 2014, at 19:54, Glenn Linderman wrote:
On 8/28/2014 10:41 AM, R. David Murray wrote:
On Thu, 28 Aug 2014 10:15:40 -0700, Glenn Linderman <v+python@g.nevcal.com> wrote:
[...]
Also for cases where the data stream is *supposed* to be in a given encoding, but contains undecodable bytes. Showing the stuff that incorrectly decodes as whatever it decodes to is generally what you want in that case.
Sure, people can learn to recognize mojibake for what it is, and maybe even learn to recognize it for what it was intended to be, in limited domains. But suppressing/replacing the surrogates doesn't help with that... would it not be better to replace the surrogates with an escape sequence that shows the original, undecodable, byte value? Like \xNN ?
For that we could extend the "backslashreplace" codec error callback, so that it can be used for decoding too, not just for encoding. I.e. b"a\xffb".decode("utf-8", "backslashreplace") would return "a\\xffb" Servus, Walter
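At the time of this message the idea was only a proposal. Python 3.5 and later do accept "backslashreplace" when decoding, so on those versions the example above can be run as written:

```python
# Requires Python 3.5 or later, where backslashreplace also works for decoding.
decoded = b"a\xffb".decode("utf-8", "backslashreplace")

# The undecodable byte is shown as a literal backslash escape in the text.
printable = decoded.encode("ascii")
```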
Yes on #1 -- making the low-level functions more usable for edge cases by supporting bytes seems fine (as long as the support for strings, where it exists, is not compromised).
The status of pathlib is a little unclear to me -- is there a plan to eventually support bytes or not?
For #2 I think you should probably just work with the others you have mentioned.
On Sat, Aug 23, 2014 at 9:44 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
At Guido's request, splitting out two specific questions from Serhiy's thread where I believe we could do with an explicit "yes or no" from him.
1. Should we accept patches adding support for the direct use of bytes paths in lower level filesystem manipulation APIs? (i.e. everything that isn't pathlib)
This was Serhiy's original question (due to some open issues [1,2]). I think the answer is yes, as we already do in some cases, and the "pathlib doesn't support binary paths" design decision is a high level platform independent API vs low level potentially platform dependent API one rather than being about disallowing the use of bytes paths in general.
[1] http://bugs.python.org/issue19997 [2] http://bugs.python.org/issue20797
2. Should we add some additional helpers to the string module for dealing with surrogate escaped bytes and other techniques for smuggling arbitrary binary data as text?
My proposal [3] is to add:
* string.escaped_surrogates (constant with the 128 escaped code points)
* string.clean(s): replaces surrogates with '\ufffd' or another specified code point
* string.redecode(s, encoding): encodes a string back to bytes and then decodes it again using the specified encoding (the old encoding defaults to 'latin-1' to match the assumptions in WSGI)
"s != string.clean(s)" would then serve as a check for "does this string contain any surrogate escaped bytes?"
[3] http://bugs.python.org/issue18814#msg225791
Regards, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia _______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/guido%40python.org
-- --Guido van Rossum (python.org/~guido)
On 25 Aug 2014 03:55, "Guido van Rossum" <guido@python.org> wrote:
Yes on #1 -- making the low-level functions more usable for edge cases by supporting bytes seems fine (as long as the support for strings, where it exists, is not compromised).
Thanks!
The status of pathlib is a little unclear to me -- is there a plan to eventually support bytes or not?
It's text only and Antoine plans to keep it that way - the concatenation operations, etc, are really only safe if you decode first.
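A small sketch of the "decode first" rule for pathlib. The sample path is made up; the behaviour shown is that pathlib rejects bytes with a TypeError, while os.fsdecode() gives a str that it accepts.

```python
import os
import pathlib

raw = b"data/report.txt"   # hypothetical bytes path

# pathlib is text only: passing bytes raises TypeError.
try:
    pathlib.PurePosixPath(raw)
except TypeError:
    rejected = True
else:
    rejected = False

# Decode first, then hand the str to pathlib.
path = pathlib.PurePosixPath(os.fsdecode(raw))
joined = path.parent / "summary.txt"
```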
For #2 I think you should probably just work with the others you have mentioned.
Yes, that sounds like a good idea. There's been some good progress on the issue tracker, so I think we can thrash out some workable (and comprehensible!) utilities that will be useful in their own right while also serving as aids to understanding for the underlying mechanisms.
Cheers, Nick.
participants (8)
- Antoine Pitrou
- Glenn Linderman
- Guido van Rossum
- MRAB
- Nick Coghlan
- R. David Murray
- Stephen J. Turnbull
- Walter Dörwald