Re: [Python-ideas] Fix default encodings on Windows

Steve Dower writes:
I plan to use only Unicode to interact with the OS and then utf8 within Python if the caller wants bytes.
This doesn't answer Victor's questions, or mine. This proposal requires identifying and transcoding bytes that represent text in encodings other than UTF-8.

1. How do you propose to identify "bytes that represent text (and might be filenames)" if they did *not* originate in a filesystem or console API?

2. How do you propose to identify the non-UTF-8 encoding, if you have forced all variables signifying bytes encodings to UTF-8?

Additional considerations: As far as I can see, this is just a recipe for a different way to get mojibake. *The* way to avoid mojibake is to "let text be text" *internally*. Developers who insist on processing text as bytes are going to get what they deserve *in edge cases*. But mostly (ie, in the mono-encoding environments of most users) it just (barely ;-) works.

And there are many use cases where you *can* process bytes that happen to encode text as "just bytes" (eg, low-level networking code). These cases have performance issues if the bytes-text-bytes-text-bytes double-round-trip implied for *stream content* (vs the OS APIs you're concerned with, which effectively round-trip text-bytes-text) is imposed on them.

I guess I'm not sure what your question is then. Using text internally is of course the best way to deal with it. But for those who insist on using bytes, this change at least makes Windows a feasible target without requiring manual encoding/decoding at every boundary.

Top-posted from my Windows Phone

-----Original Message-----
From: "Stephen J. Turnbull" <turnbull.stephen.fw@u.tsukuba.ac.jp>
Sent: 8/14/2016 22:06
To: "Steve Dower" <steve.dower@python.org>
Cc: "Victor Stinner" <victor.stinner@gmail.com>; "python-ideas" <python-ideas@python.org>; "Random832" <random832@fastmail.com>
Subject: RE: [Python-ideas] Fix default encodings on Windows
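For context, the "manual encoding/decoding at every boundary" that bytes-path callers do today is what os.fsencode() and os.fsdecode() exist for: they convert between str and bytes using the filesystem encoding. A minimal sketch of that existing boundary (an illustration only, not the proposed patch):

```python
import os

# str paths go straight to the OS's Unicode APIs; a caller working in
# bytes must transcode at the boundary with the filesystem encoding.
for entry in os.listdir("."):            # str in, str out
    raw = os.fsencode(entry)             # str -> bytes (filesystem encoding)
    assert os.fsdecode(raw) == entry     # bytes -> str round-trips cleanly
    print(raw)
```

The proposal's effect, as I understand it, is to pin the encoding these helpers use on Windows to UTF-8 so that the round trip above is lossless for all valid filenames.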

participants (2)
- Stephen J. Turnbull
- Steve Dower