Bytes path related questions for Guido - Python-Dev


Nick Coghlan

24 Aug 2014

4:44 a.m.

At Guido's request, splitting out two specific questions from Serhiy's thread where I believe we could do with an explicit "yes or no" from him.

1. Should we accept patches adding support for the direct use of bytes paths in lower level filesystem manipulation APIs? (i.e. everything that isn't pathlib)

This was Serhiy's original question (due to some open issues [1,2]). I think the answer is yes, as we already do in some cases, and the "pathlib doesn't support binary paths" design decision is a high level platform independent API vs low level potentially platform dependent API one rather than being about disallowing the use of bytes paths in general.

[1] http://bugs.python.org/issue19997
[2] http://bugs.python.org/issue20797

2. Should we add some additional helpers to the string module for dealing with surrogate escaped bytes and other techniques for smuggling arbitrary binary data as text?

My proposal [3] is to add:

* string.escaped_surrogates (constant with the 128 escaped code points)
* string.clean(s): replaces surrogates with '\ufffd' or another specified code point
* string.redecode(s, encoding): encodes a string back to bytes and then decodes it again using the specified encoding (the old encoding defaults to 'latin-1' to match the assumptions in WSGI)

"s != string.clean(s)" would then serve as a check for "does this string contain any surrogate escaped bytes?"

[3] http://bugs.python.org/issue18814#msg225791

Regards, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
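A minimal sketch of the mechanism behind question 2, using only the standard library. The `surrogateescape` error handler maps each undecodable byte 0x80..0xFF to a lone surrogate U+DC80..U+DCFF, which is where the "128 escaped code points" come from. `ESCAPED_SURROGATES` and `redecode` are hypothetical names that mirror the proposal; neither exists in the string module.

```python
# Hypothetical stand-ins for the proposed helpers; not part of the stdlib.

# The 128 code points that surrogateescape can produce: U+DC80..U+DCFF.
ESCAPED_SURROGATES = "".join(chr(cp) for cp in range(0xDC80, 0xDD00))


def redecode(s, encoding, old_encoding="latin-1"):
    """Encode s back to bytes with old_encoding, then decode with encoding."""
    raw = s.encode(old_encoding, errors="surrogateescape")
    return raw.decode(encoding, errors="surrogateescape")


# Every byte >= 0x80 is undecodable as ASCII and becomes one escaped surrogate.
escaped = bytes(range(0x80, 0x100)).decode("ascii", errors="surrogateescape")

# WSGI-style text: UTF-8 bytes that were decoded as latin-1.
wsgi_text = "caf\u00e9".encode("utf-8").decode("latin-1")
fixed = redecode(wsgi_text, "utf-8")
```

The `old_encoding` default follows the 'latin-1' assumption stated in the proposal; any other default would be an equally valid choice for a sketch like this.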


Nick Coghlan

24 Aug 2014

1:04 p.m.

On 24 August 2014 14:44, Nick Coghlan wrote:

...

2. Should we add some additional helpers to the string module for dealing with surrogate escaped bytes and other techniques for smuggling arbitrary binary data as text?

My proposal [3] is to add:

* string.escaped_surrogates (constant with the 128 escaped code points)
* string.clean(s): replaces surrogates with '\ufffd' or another specified code point
* string.redecode(s, encoding): encodes a string back to bytes and then decodes it again using the specified encoding (the old encoding defaults to 'latin-1' to match the assumptions in WSGI)

Serhiy & Ezio convinced me to scale this one back to a proposal for "codecs.clean_surrogate_escapes(s)", which replaces surrogates that may be produced by surrogateescape (that's what string.clean() above was supposed to be, but my description was not correct, and the name was too vague for that error to be obvious to the reader)

"s != codecs.clean_surrogate_escapes(s)" would then become the check for "does this string contain any surrogate escaped bytes?"

Regards, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
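The scaled-back proposal could look roughly like the sketch below. The function name follows the thread, but no such function exists in the codecs module; `has_surrogate_escapes` is an extra hypothetical name for the inequality check.

```python
# Hypothetical sketch of the proposed helper; not a real codecs function.
# Only U+DC80..U+DCFF are touched, since those are the only code points
# that the surrogateescape error handler can produce.
_ESCAPES_TO_REPLACEMENT = {cp: "\ufffd" for cp in range(0xDC80, 0xDD00)}


def clean_surrogate_escapes(s):
    """Replace each surrogate-escaped byte in s with U+FFFD."""
    return s.translate(_ESCAPES_TO_REPLACEMENT)


def has_surrogate_escapes(s):
    """The check from the proposal: s != clean_surrogate_escapes(s)."""
    return s != clean_surrogate_escapes(s)


dirty = b"abc\xff".decode("utf-8", errors="surrogateescape")
cleaned = clean_surrogate_escapes(dirty)
```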

Antoine Pitrou

2:23 p.m.

Le 24/08/2014 09:04, Nick Coghlan a écrit :

...

On 24 August 2014 14:44, Nick Coghlan wrote:

...
2. Should we add some additional helpers to the string module for dealing with surrogate escaped bytes and other techniques for smuggling arbitrary binary data as text?

My proposal [3] is to add:

* string.escaped_surrogates (constant with the 128 escaped code points)
* string.clean(s): replaces surrogates with '\ufffd' or another specified code point
* string.redecode(s, encoding): encodes a string back to bytes and then decodes it again using the specified encoding (the old encoding defaults to 'latin-1' to match the assumptions in WSGI)

Serhiy & Ezio convinced me to scale this one back to a proposal for "codecs.clean_surrogate_escapes(s)", which replaces surrogates that may be produced by surrogateescape (that's what string.clean() above was supposed to be, but my description was not correct, and the name was too vague for that error to be obvious to the reader)

"clean" conveys the wrong meaning. It should use a scary word such as "trap". "Cleaning" surrogates is unlikely to be the right procedure when dealing with surrogates produced by undecodable byte sequences.

Regards

Antoine.

Nick Coghlan

3:26 p.m.

On 25 August 2014 00:23, Antoine Pitrou wrote:

...

Le 24/08/2014 09:04, Nick Coghlan a écrit :

...
Serhiy & Ezio convinced me to scale this one back to a proposal for "codecs.clean_surrogate_escapes(s)", which replaces surrogates that may be produced by surrogateescape (that's what string.clean() above was supposed to be, but my description was not correct, and the name was too vague for that error to be obvious to the reader)

"clean" conveys the wrong meaning. It should use a scary word such as "trap". "Cleaning" surrogates is unlikely to be the right procedure when dealing with surrogates produced by undecodable byte sequences.

"purge_surrogate_escapes" was the other term that occurred to me.

Either way, my use case is to filter them out when I *don't* want to pass them along to other software, but would prefer the Unicode replacement character to the ASCII question mark created by using the "replace" filter when encoding.

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Stephen J. Turnbull

26 Aug 2014

2:11 a.m.

Nick Coghlan writes:

...

"purge_surrogate_escapes" was the other term that occurred to me.

"purge" suggests removal, not replacement. That may be useful too.

neutralize_surrogate_escapes(s, remove=False, replacement='\uFFFD')

maybe? (Of course the remove argument is feature creep, so I'm only about +0.5 myself. And the name is long, but I can't think of any better synonyms for "make safe" in English right now).

...

Either way, my use case is to filter them out when I *don't* want to pass them along to other software, but would prefer the Unicode replacement character to the ASCII question mark created by using the "replace" filter when encoding.

I think it would be preferable to be unicodely correct here by default, since this is a str -> str function.
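The difference between the two markers can be seen with the existing error handlers. This is only an illustration of the behaviour being discussed, not a proposed API: encoding with the "replace" handler yields an ASCII question mark, while substituting U+FFFD at the str level keeps the marker distinct from a genuine "?" in the data.

```python
# A latin-1 byte (0xEF) inside otherwise ASCII data, read with surrogateescape.
text = b"na\xefve".decode("ascii", errors="surrogateescape")  # 'na\udcefve'

# The "replace" error handler on encoding substitutes an ASCII question mark.
as_bytes = text.encode("utf-8", errors="replace")

# Substituting U+FFFD in the str first keeps the marker unambiguous.
with_fffd = text.replace("\udcef", "\ufffd").encode("utf-8")
```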

MRAB

11:31 a.m.

On 2014-08-26 03:11, Stephen J. Turnbull wrote:

...

Nick Coghlan writes:

...
"purge_surrogate_escapes" was the other term that occurred to me.

"purge" suggests removal, not replacement. That may be useful too.

neutralize_surrogate_escapes(s, remove=False, replacement='\uFFFD')

How about:

replace_surrogate_escapes(s, replacement='\uFFFD')

If you want them removed, just pass an empty string as the replacement.
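A minimal sketch of that signature, assuming a regular expression over the surrogateescape range is an acceptable implementation. The function is hypothetical and does not exist in the stdlib.

```python
import re

# Matches exactly the code points surrogateescape can produce (U+DC80..U+DCFF).
_SURROGATE_ESCAPE = re.compile("[\udc80-\udcff]")


def replace_surrogate_escapes(s, replacement="\ufffd"):
    """Hypothetical helper: substitute every surrogate-escaped byte in s."""
    # A function is used as repl so backslashes in replacement stay literal.
    return _SURROGATE_ESCAPE.sub(lambda match: replacement, s)


sample = b"a\x80b\xffc".decode("ascii", errors="surrogateescape")
replaced = replace_surrogate_escapes(sample)
removed = replace_surrogate_escapes(sample, "")
```

Passing an empty string removes the escapes, so no separate `remove` flag is needed, and a multi-character replacement works the same way.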

...

maybe? (Of course the remove argument is feature creep, so I'm only about +0.5 myself. And the name is long, but I can't think of any better synonyms for "make safe" in English right now).

...
Either way, my use case is to filter them out when I *don't* want to pass them along to other software, but would prefer the Unicode replacement character to the ASCII question mark created by using the "replace" filter when encoding.

I think it would be preferable to be unicodely correct here by default, since this is a str -> str function.

Glenn Linderman

27 Aug 2014

6:21 p.m.

On 8/26/2014 4:31 AM, MRAB wrote:

...

On 2014-08-26 03:11, Stephen J. Turnbull wrote:

...
Nick Coghlan writes:

...
"purge_surrogate_escapes" was the other term that occurred to me.

"purge" suggests removal, not replacement. That may be useful too.

neutralize_surrogate_escapes(s, remove=False, replacement='\uFFFD')

How about:

replace_surrogate_escapes(s, replacement='\uFFFD')

If you want them removed, just pass an empty string as the replacement.

And further, replacement could be a vector of 128 characters, to do immediate transcoding, or a single character to do wholesale replacement with some gibberish character, or None to remove (or an empty string).

Stephen J. Turnbull

28 Aug 2014

1:08 a.m.

Glenn Linderman writes:

...

On 8/26/2014 4:31 AM, MRAB wrote:

...
On 2014-08-26 03:11, Stephen J. Turnbull wrote:

...
Nick Coghlan writes:

...

...
How about:

replace_surrogate_escapes(s, replacement='\uFFFD')

If you want them removed, just pass an empty string as the replacement.

That seems better to me (I had too much C for breakfast, I think).

...

And further, replacement could be a vector of 128 characters, to do immediate transcoding,

Using what encoding? If you knew that much, why didn't you use (write, if necessary) an appropriate codec? I can't envision this being useful.

OTOH, I could see using

replace_surrogate_escapes(s, replacement='&#xFFFD;')

in HTML. (Actually, probably not; if it makes sense to use Unicode features you're probably using Unicode as the external encoding, so a character entity is silly. But there might be contexts with useful multicharacter replacements.)

...

or a single character to do wholesale replacement with some gibberish character, or None to remove (or an empty string).

Not None, that means default (which should be the Unicode standard REPLACEMENT CHARACTER U+FFFD).

Steve

Glenn Linderman

4:56 a.m.

On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote:

...

Glenn Linderman writes:

...
On 8/26/2014 4:31 AM, MRAB wrote:

...
On 2014-08-26 03:11, Stephen J. Turnbull wrote:

...
Nick Coghlan writes:

...
...
How about:

replace_surrogate_escapes(s, replacement='\uFFFD')

If you want them removed, just pass an empty string as the replacement.

That seems better to me (I had too much C for breakfast, I think).

...
And further, replacement could be a vector of 128 characters, to do immediate transcoding,

Using what encoding?

The vector would contain the transcoding. Each lone surrogate would map to a character in the vector.

...

If you knew that much, why didn't you use (write, if necessary) an appropriate codec? I can't envision this being useful.

If the data format describes its encoding, possibly containing data from several encodings in various spots, then perhaps it is best read as binary, and processed as binary until those definitions are found. But an alternative would be to read with surrogate escapes, and then when the encoding is determined, to transcode the data. Previously, a proposal was made to reverse the surrogate escapes to the original bytes, and then apply the (now known) appropriate codec. There are not appropriate codecs that can convert directly from surrogate escapes to the desired end result. This technique could be used instead, for single-byte, non-escaped encodings. On the other hand, writing specialty codecs for the purpose would be more general.
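As an illustration of the "vector of 128 characters" idea, here is a sketch that assumes the text was read as ASCII with surrogateescape, so that every byte >= 0x80 is an escape, and that the real encoding turns out to be the single-byte KOI8-R. The names `vector`, `table` and `transcode_escapes` are hypothetical. The last line shows the alternative mentioned above: restore the original bytes and apply the real codec.

```python
# Hypothetical illustration of the "vector of 128 characters" idea.
# Entry i is the character for the escaped byte 0x80 + i in KOI8-R.
vector = bytes(range(0x80, 0x100)).decode("koi8-r", errors="surrogateescape")
table = {0xDC80 + i: ch for i, ch in enumerate(vector)}


def transcode_escapes(s):
    """Map each surrogate escape straight to its character in the vector."""
    return s.translate(table)


# KOI8-R bytes for a Cyrillic word, read as ASCII with surrogateescape.
raw = "\u041f\u0440\u0438\u0432\u0435\u0442".encode("koi8-r")
escaped = raw.decode("ascii", errors="surrogateescape")

via_vector = transcode_escapes(escaped)
# The alternative from the thread: restore the bytes, then decode properly.
via_bytes = escaped.encode("ascii", errors="surrogateescape").decode("koi8-r")
```

Both routes give the same result here only because the first decode was ASCII and the target encoding is single-byte; neither condition holds in general.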

Stephen J. Turnbull

6:30 a.m.

Glenn Linderman writes:

...

On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote:

...
Glenn Linderman writes:

...

...
...
And further, replacement could be a vector of 128 characters, to do immediate transcoding,

Using what encoding?

The vector would contain the transcoding. Each lone surrogate would map to a character in the vector.

Yes, that's obvious. The question is where do you get the vector?

...

...
If you knew that much, why didn't you use (write, if necessary) an appropriate codec? I can't envision this being useful.

If the data format describes its encoding, possibly containing data from several encodings in various spots, then perhaps it is best read as binary, and processed as binary until those definitions are found.

Exactly. That's precisely why bytes have a .decode method.

...

But an alternative would be to read with surrogate escapes, and then when the encoding is determined, to transcode the data.

Not every one-line expression needs to be in the stdlib:

data = data[:start] + data[start:end].encode('utf-8', errors='surrogateescape').decode('DTRT-now') + data[end:]

Note that you *do* need to know start and end, because of the possibility of "several encodings", where once you apply this technique to the whole text, you can't recover the surrogates when you get the encoding wrong.
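A runnable version of that one-liner, as a sketch. 'DTRT-now' is a placeholder, so a real codec name has to be supplied; `redecode_slice` is a hypothetical helper name, and it returns a new string because str objects are immutable.

```python
def redecode_slice(data, start, end, encoding):
    """Re-decode data[start:end] with encoding and return a new str."""
    raw = data[start:end].encode("utf-8", errors="surrogateescape")
    return data[:start] + raw.decode(encoding) + data[end:]


# An ASCII header followed by a latin-1 payload, read as UTF-8 with escapes.
blob = b"name=" + "Jos\u00e9".encode("latin-1")
data = blob.decode("utf-8", errors="surrogateescape")  # 'name=Jos\udce9'

# Only the payload (offset 5 onwards) is known to be latin-1.
fixed = redecode_slice(data, 5, len(data), "latin-1")
```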

...

Previously, a proposal was made to reverse the surrogate escapes to the original bytes, and then apply the (now known) appropriate codec.

Sure. And in fact I do this kind of thing all the time in Emacs, using the decode(encode(slice)) approach. The only times in 25 years of working with the insanity of digitized Japanese I've had a use for anything other than that is when I don't have a round-tripping codec. In that case I have to preserve the bytes or suffer lossy conversion anyway, regardless of the method used to reconvert. But surrogateescape is necessarily round-tripping (maybe with a few exceptions in Chinese and a very small number in other languages, but those failures are due to Unicode, not to surrogateescape).

...

There are not appropriate codecs that can convert directly from surrogate escapes to the desired end result.

And there currently cannot be. codecs are bytes<->str, not str->str.

...

This technique could be used instead, for single-byte, non-escaped encodings.

That's pure theory, not a use case. We have codecs for all the encodings with significant numbers of users, and writing a new one simply isn't that hard.

Steve

MRAB

7:30 a.m.

On 2014-08-28 05:56, Glenn Linderman wrote:

...

On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote:

...
Glenn Linderman writes:

...
On 8/26/2014 4:31 AM, MRAB wrote:

...
On 2014-08-26 03:11, Stephen J. Turnbull wrote:

...
Nick Coghlan writes:

...
...
How about:

replace_surrogate_escapes(s, replacement='\uFFFD')

If you want them removed, just pass an empty string as the replacement.

That seems better to me (I had too much C for breakfast, I think).

...
And further, replacement could be a vector of 128 characters, to do immediate transcoding,

Using what encoding?

The vector would contain the transcoding. Each lone surrogate would map to a character in the vector.

...
If you knew that much, why didn't you use (write, if necessary) an appropriate codec? I can't envision this being useful.

If the data format describes its encoding, possibly containing data from several encodings in various spots, then perhaps it is best read as binary, and processed as binary until those definitions are found.

But an alternative would be to read with surrogate escapes, and then when the encoding is determined, to transcode the data. Previously, a proposal was made to reverse the surrogate escapes to the original bytes, and then apply the (now known) appropriate codec. There are not appropriate codecs that can convert directly from surrogate escapes to the desired end result. This technique could be used instead, for single-byte, non-escaped encodings. On the other hand, writing specialty codecs for the purpose would be more general.

There'll be a surrogate escape if a byte couldn't be decoded, but just because a byte could be decoded, it doesn't mean that it's correct. If you picked the wrong encoding, the other codepoints could be wrong too.
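A small example of that failure mode: bytes written as latin-1 that also happen to be valid UTF-8 decode without any surrogate escape, yet the result is still wrong.

```python
# Bytes written as latin-1 by the producer: the two characters U+00C3 U+00A9.
raw = "\u00c3\u00a9".encode("latin-1")  # b'\xc3\xa9'

# Decoding with the wrong codec succeeds, so no surrogate escape appears.
wrong = raw.decode("utf-8", errors="surrogateescape")
intended = raw.decode("latin-1")

has_escape = any(0xDC80 <= ord(ch) <= 0xDCFF for ch in wrong)
```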

Glenn Linderman

5:15 p.m.

On 8/28/2014 12:30 AM, MRAB wrote:

...

On 2014-08-28 05:56, Glenn Linderman wrote:

...
On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote:

...
Glenn Linderman writes:

...
On 8/26/2014 4:31 AM, MRAB wrote:

...
On 2014-08-26 03:11, Stephen J. Turnbull wrote:

...
Nick Coghlan writes:

...
...
How about:

replace_surrogate_escapes(s, replacement='\uFFFD')

If you want them removed, just pass an empty string as the replacement.

That seems better to me (I had too much C for breakfast, I think).

...
And further, replacement could be a vector of 128 characters, to do immediate transcoding,

Using what encoding?

The vector would contain the transcoding. Each lone surrogate would map to a character in the vector.

...
If you knew that much, why didn't you use (write, if necessary) an appropriate codec? I can't envision this being useful.

If the data format describes its encoding, possibly containing data from several encodings in various spots, then perhaps it is best read as binary, and processed as binary until those definitions are found.

But an alternative would be to read with surrogate escapes, and then when the encoding is determined, to transcode the data. Previously, a proposal was made to reverse the surrogate escapes to the original bytes, and then apply the (now known) appropriate codec. There are not appropriate codecs that can convert directly from surrogate escapes to the desired end result. This technique could be used instead, for single-byte, non-escaped encodings. On the other hand, writing specialty codecs for the purpose would be more general.

There'll be a surrogate escape if a byte couldn't be decoded, but just because a byte could be decoded, it doesn't mean that it's correct.

If you picked the wrong encoding, the other codepoints could be wrong too.

Aha! Thanks for pointing out the flaw in my reasoning. But that means it is also pretty useless to "replace_surrogate_escapes" at all, because it only cleans out the non-decodable characters, not the incorrectly decoded characters.

R. David Murray

5:41 p.m.

On Thu, 28 Aug 2014 10:15:40 -0700, Glenn Linderman wrote:

...

On 8/28/2014 12:30 AM, MRAB wrote:

...
On 2014-08-28 05:56, Glenn Linderman wrote:

...
On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote:

...
Glenn Linderman writes:

...
On 8/26/2014 4:31 AM, MRAB wrote:

...
On 2014-08-26 03:11, Stephen J. Turnbull wrote:

...
Nick Coghlan writes:

...
...
How about:

replace_surrogate_escapes(s, replacement='\uFFFD')

If you want them removed, just pass an empty string as the replacement.

That seems better to me (I had too much C for breakfast, I think).

...
And further, replacement could be a vector of 128 characters, to do immediate transcoding,

Using what encoding?

The vector would contain the transcoding. Each lone surrogate would map to a character in the vector.

...
If you knew that much, why didn't you use (write, if necessary) an appropriate codec? I can't envision this being useful.

If the data format describes its encoding, possibly containing data from several encodings in various spots, then perhaps it is best read as binary, and processed as binary until those definitions are found.

But an alternative would be to read with surrogate escapes, and then when the encoding is determined, to transcode the data. Previously, a proposal was made to reverse the surrogate escapes to the original bytes, and then apply the (now known) appropriate codec. There are not appropriate codecs that can convert directly from surrogate escapes to the desired end result. This technique could be used instead, for single-byte, non-escaped encodings. On the other hand, writing specialty codecs for the purpose would be more general.

There'll be a surrogate escape if a byte couldn't be decoded, but just because a byte could be decoded, it doesn't mean that it's correct.

If you picked the wrong encoding, the other codepoints could be wrong too.

Aha! Thanks for pointing out the flaw in my reasoning. But that means it is also pretty useless to "replace_surrogate_escapes" at all, because it only cleans out the non-decodable characters, not the incorrectly decoded characters.

Well, replace would still be useful for ASCII+surrogateescape.

Also for cases where the data stream is *supposed* to be in a given encoding, but contains undecodable bytes. Showing the stuff that incorrectly decodes as whatever it decodes to is generally what you want in that case.

--David

Glenn Linderman

5:54 p.m.

On 8/28/2014 10:41 AM, R. David Murray wrote:

...

On Thu, 28 Aug 2014 10:15:40 -0700, Glenn Linderman wrote:

...
On 8/28/2014 12:30 AM, MRAB wrote:

...
On 2014-08-28 05:56, Glenn Linderman wrote:

...
On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote:

...
Glenn Linderman writes:

...
On 8/26/2014 4:31 AM, MRAB wrote:

...
On 2014-08-26 03:11, Stephen J. Turnbull wrote:

...
Nick Coghlan writes:

...
...
How about:

replace_surrogate_escapes(s, replacement='\uFFFD')

If you want them removed, just pass an empty string as the replacement.

That seems better to me (I had too much C for breakfast, I think).

...
And further, replacement could be a vector of 128 characters, to do immediate transcoding,

Using what encoding?

The vector would contain the transcoding. Each lone surrogate would map to a character in the vector.

...
If you knew that much, why didn't you use (write, if necessary) an appropriate codec? I can't envision this being useful.

If the data format describes its encoding, possibly containing data from several encodings in various spots, then perhaps it is best read as binary, and processed as binary until those definitions are found.

But an alternative would be to read with surrogate escapes, and then when the encoding is determined, to transcode the data. Previously, a proposal was made to reverse the surrogate escapes to the original bytes, and then apply the (now known) appropriate codec. There are not appropriate codecs that can convert directly from surrogate escapes to the desired end result. This technique could be used instead, for single-byte, non-escaped encodings. On the other hand, writing specialty codecs for the purpose would be more general.

There'll be a surrogate escape if a byte couldn't be decoded, but just because a byte could be decoded, it doesn't mean that it's correct.

If you picked the wrong encoding, the other codepoints could be wrong too.

Aha! Thanks for pointing out the flaw in my reasoning. But that means it is also pretty useless to "replace_surrogate_escapes" at all, because it only cleans out the non-decodable characters, not the incorrectly decoded characters.

Well, replace would still be useful for ASCII+surrogateescape.

How?

...

Also for cases where the data stream is *supposed* to be in a given encoding, but contains undecodable bytes. Showing the stuff that incorrectly decodes as whatever it decodes to is generally what you want in that case.

Sure, people can learn to recognize mojibake for what it is, and maybe even learn to recognize it for what it was intended to be, in limited domains. But suppressing/replacing the surrogates doesn't help with that... would it not be better to replace the surrogates with an escape sequence that shows the original, undecodable, byte value? Like \xNN ?

R. David Murray

6:43 p.m.

On Thu, 28 Aug 2014 10:54:44 -0700, Glenn Linderman wrote:

...

On 8/28/2014 10:41 AM, R. David Murray wrote:

...
On Thu, 28 Aug 2014 10:15:40 -0700, Glenn Linderman wrote:

...
On 8/28/2014 12:30 AM, MRAB wrote:

...
There'll be a surrogate escape if a byte couldn't be decoded, but just because a byte could be decoded, it doesn't mean that it's correct.

If you picked the wrong encoding, the other codepoints could be wrong too.

Aha! Thanks for pointing out the flaw in my reasoning. But that means it is also pretty useless to "replace_surrogate_escapes" at all, because it only cleans out the non-decodable characters, not the incorrectly decoded characters.

Well, replace would still be useful for ASCII+surrogateescape.

How?

Because there "can't" be any incorrectly decoded bytes in the ASCII part, so all undecodable bytes turning into 'unrecognized character' glyphs is useful.

"can't" is in quotes because of course if you decode random binary data as ASCII+surrogate escape you could get a mess just like any other encoding, so this is really a "more *likely* to be useful" version of my second point, because "real" ASCII with some junk bytes mixed in is much more likely to be encountered in the wild than, say, utf-8 with some junk bytes mixed in (although that is probably changing as use of utf-8 becomes more widespread, so this point applies to utf-8 as well).
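The ASCII+surrogateescape case, sketched with a made-up header line. Strict ASCII decoding turns every byte >= 0x80 into an escape, so whatever is left really was ASCII, and swapping the escapes for U+FFFD gives something displayable.

```python
# Illustrative input: mostly ASCII with a few stray high bytes mixed in.
raw = b"Subject: caf\xc3\xa9 \xff menu"

# Every byte >= 0x80 becomes an escape; the rest is genuine ASCII.
text = raw.decode("ascii", errors="surrogateescape")

# Swap each escape for the 'unrecognized character' glyph for display.
displayable = "".join(
    "\ufffd" if 0xDC80 <= ord(ch) <= 0xDCFF else ch for ch in text
)
```

The original bytes remain recoverable from `text`, but not from `displayable`, so the replacement belongs at the display boundary only.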

...

...
Also for cases where the data stream is *supposed* to be in a given encoding, but contains undecodable bytes. Showing the stuff that incorrectly decodes as whatever it decodes to is generally what you want in that case.

Sure, people can learn to recognize mojibake for what it is, and maybe even learn to recognize it for what it was intended to be, in limited domains. But suppressing/replacing the surrogates doesn't help with

Well, it does if the alternative is not being able to display the string to the user at all. And yeah, people being able to recognize mojibake in specific problem domains is what I'm talking about...not perhaps a great use case, but it is a use case.

...

that... would it not be better to replace the surrogates with an escape sequence that shows the original, undecodable, byte value? Like \xNN ?

Yeah, that idea has been floated as well, and I think it would indeed be more useful than the 'unknown character' glyph. I've also seen fonts that display the hex code inside a box character when the code point is unknown, which would be cool...but that can hardly be part of unicode, can it? :)

--David
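The \xNN idea can be sketched at the str level as follows. `show_escaped_bytes` is a hypothetical name; it relies on the fact that U+DC80..U+DCFF carry the original byte value in their low 8 bits.

```python
import re

# Matches exactly the code points surrogateescape can produce (U+DC80..U+DCFF).
_SURROGATE_ESCAPE = re.compile("[\udc80-\udcff]")


def show_escaped_bytes(s):
    """Hypothetical helper: render each surrogate-escaped byte as \\xNN."""
    # Subtracting 0xDC00 recovers the original byte value (0x80..0xFF).
    return _SURROGATE_ESCAPE.sub(
        lambda m: "\\x{:02x}".format(ord(m.group()) - 0xDC00), s
    )


text = b"a\xffb".decode("utf-8", errors="surrogateescape")
shown = show_escaped_bytes(text)
```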

Walter Dörwald

29 Aug 2014

10:09 a.m.

On 28 Aug 2014, at 19:54, Glenn Linderman wrote:

...

...
On Thu, 28 Aug 2014 10:15:40 -0700, Glenn Linderman wrote: [...] Also for cases where the data stream is *supposed* to be in a given encoding, but contains undecodable bytes. Showing the stuff that incorrectly decodes as whatever it decodes to is generally what you want in that case. Sure, people can learn to recognize mojibake for what it is, and maybe even learn to recognize it for what it was intended to be, in limited domains. But suppressing/replacing the surrogates doesn't help with

On 8/28/2014 10:41 AM, R. David Murray wrote: that... would it not be better to replace the surrogates with an escape sequence that shows the original, undecodable, byte value? Like \xNN ?

For that we could extend the "backslashreplace" codec error callback, so that it can be used for decoding too, not just for encoding. I.e.

b"a\xffb".decode("utf-8", "backslashreplace")

would return

"a\\xffb"

Servus, Walter
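For readers on Python 3.5 or later, where "backslashreplace" is documented as working for decoding as well, the behaviour can be checked directly. The second step is an assumption about usage rather than something from the thread: a str that already contains escapes is turned back into bytes first.

```python
# Requires Python 3.5+, where backslashreplace also works when decoding.
decoded = b"a\xffb".decode("utf-8", "backslashreplace")

# For a str that already carries surrogate escapes, restore the bytes first.
escaped = b"a\xffb".decode("utf-8", "surrogateescape")
shown = escaped.encode("utf-8", "surrogateescape").decode(
    "utf-8", "backslashreplace"
)
```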

Guido van Rossum

24 Aug 2014

5:55 p.m.

Yes on #1 -- making the low-level functions more usable for edge cases by supporting bytes seems fine (as long as the support for strings, where it exists, is not compromised).

The status of pathlib is a little unclear to me -- is there a plan to eventually support bytes or not?

For #2 I think you should probably just work with the others you have mentioned.

On Sat, Aug 23, 2014 at 9:44 PM, Nick Coghlan wrote:

...

At Guido's request, splitting out two specific questions from Serhiy's thread where I believe we could do with an explicit "yes or no" from him.

1. Should we accept patches adding support for the direct use of bytes paths in lower level filesystem manipulation APIs? (i.e. everything that isn't pathlib)

This was Serhiy's original question (due to some open issues [1,2]). I think the answer is yes, as we already do in some cases, and the "pathlib doesn't support binary paths" design decision is a high level platform independent API vs low level potentially platform dependent API one rather than being about disallowing the use of bytes paths in general.

[1] http://bugs.python.org/issue19997 [2] http://bugs.python.org/issue20797

2. Should we add some additional helpers to the string module for dealing with surrogate escaped bytes and other techniques for smuggling arbitrary binary data as text?

My proposal [3] is to add:

* string.escaped_surrogates (constant with the 128 escaped code points) * string.clean(s): replaces surrogates with '\ufffd' or another specified code point * string.redecode(s, encoding): encodes a string back to bytes and then decodes it again using the specified encoding (the old encoding defaults to 'latin-1' to match the assumptions in WSGI)

"s != string.clean(s)" would then serve as a check for "does this string contain any surrogate escaped bytes?"

[3] http://bugs.python.org/issue18814#msg225791

Regards, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia _______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/guido%40python.org

-- --Guido van Rossum (python.org/~guido)

Nick Coghlan

11:19 p.m.

On 25 Aug 2014 03:55, "Guido van Rossum" wrote:

...

Yes on #1 -- making the low-level functions more usable for edge cases by supporting bytes seems fine (as long as the support for strings, where it exists, is not compromised).

Thanks!

...

The status of pathlib is a little unclear to me -- is there a plan to eventually support bytes or not?

It's text only and Antoine plans to keep it that way - the concatenation operations, etc, are really only safe if you decode first.
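A short sketch of the split between the two API levels, assuming a POSIX system where `os.fsdecode` and `os.fsencode` use the surrogateescape handler. The file name is made up for illustration.

```python
import os
from pathlib import PurePosixPath

raw_name = b"report-\xff.txt"  # not valid UTF-8

# Low-level APIs such as os.path.join accept bytes directly.
joined = os.path.join(b"/tmp", raw_name)

# pathlib is str-only, so decode first. On POSIX, os.fsdecode uses the
# surrogateescape handler, so os.fsencode restores the original bytes.
path = PurePosixPath("/tmp") / os.fsdecode(raw_name)
round_tripped = os.fsencode(str(path))

# Passing bytes straight to pathlib is rejected with a TypeError.
try:
    PurePosixPath(raw_name)
    bytes_accepted = True
except TypeError:
    bytes_accepted = False
```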

...

For #2 I think you should probably just work with the others you have mentioned.

Yes, that sounds like a good idea. There's been some good progress on the issue tracker, so I think we can thrash out some workable (and comprehensible!) utilities that will be useful in their own right while also serving as aids to understanding for the underlying mechanisms.

Cheers, Nick.

...

On Sat, Aug 23, 2014 at 9:44 PM, Nick Coghlan wrote:

...
At Guido's request, splitting out two specific questions from Serhiy's thread where I believe we could do with an explicit "yes or no" from him.

1. Should we accept patches adding support for the direct use of bytes paths in lower level filesystem manipulation APIs? (i.e. everything that isn't pathlib)

This was Serhiy's original question (due to some open issues [1,2]). I think the answer is yes, as we already do in some cases, and the "pathlib doesn't support binary paths" design decision is a high level platform independent API vs low level potentially platform dependent API one rather than being about disallowing the use of bytes paths in general.

[1] http://bugs.python.org/issue19997 [2] http://bugs.python.org/issue20797

2. Should we add some additional helpers to the string module for dealing with surrogate escaped bytes and other techniques for smuggling arbitrary binary data as text?

My proposal [3] is to add:

* string.escaped_surrogates (constant with the 128 escaped code points) * string.clean(s): replaces surrogates with '\ufffd' or another specified code point * string.redecode(s, encoding): encodes a string back to bytes and then decodes it again using the specified encoding (the old encoding defaults to 'latin-1' to match the assumptions in WSGI)

"s != string.clean(s)" would then serve as a check for "does this string contain any surrogate escaped bytes?"

[3] http://bugs.python.org/issue18814#msg225791

Regards, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

...

-- --Guido van Rossum (python.org/~guido)
