Re: [Python-ideas] Fix default encodings on Windows

18 Aug 2016


      On Thu, Aug 18, 2016 at 4:07 PM, Steve Dower <steve.dower@python.org> wrote:
...
On 18Aug2016 0900, Chris Angelico wrote:
...
On Fri, Aug 19, 2016 at 1:54 AM, Steve Dower <steve.dower@python.org> wrote:
...
On 18Aug2016 0829, Chris Angelico wrote:
...
The second call to glob doesn't have any Unicode characters at all,
the way I see it - it's all bytes. Am I completely misunderstanding
this?
You're not the only one - I think this has been the most common
misunderstanding.
On Windows, the paths as stored in the filesystem are actually all text -
more precisely, UTF-16-LE encoded bytes, represented as strings of 16-bit
characters.
Converting to an 8-bit character representation only exists for
compatibility with code written for other platforms (either Linux, or much
older versions of Windows). The operating system has one way to do the
conversion to bytes, which Python currently uses, but since we control that
transformation I'm proposing an alternative conversion that is more reliable
than compatible (with Windows 3.1... it shouldn't affect compatibility with
code that properly handles multibyte encodings, which should include
anything developed for Linux in the last decade or two).
Does that help? I tried to keep the explanation short and focused :)
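A minimal sketch of the difference described above, runnable on any platform. The filename is hypothetical, cp1252 with errors="replace" only approximates what a legacy 8-bit conversion does, and UTF-8 is assumed as the alternative conversion (the message above does not name one):

```python
# Hypothetical filename containing U+AB00, which cp1252 cannot represent.
name = "test\uab00.txt"

# The stored form per the description above: UTF-16-LE code units.
stored = name.encode("utf-16-le")
assert stored.decode("utf-16-le") == name  # lossless

# Approximation of a legacy 8-bit conversion: characters the code page
# cannot represent are replaced with '?'.
legacy = name.encode("cp1252", errors="replace")

# Assumed alternative conversion: UTF-8 can represent every character.
utf8 = name.encode("utf-8")

legacy_roundtrip = legacy.decode("cp1252")
utf8_roundtrip = utf8.decode("utf-8")

print(legacy)                    # the U+AB00 character is gone
print(legacy_roundtrip == name)  # original name cannot be recovered
print(utf8_roundtrip == name)    # original name survives
```

The point of the sketch is only that the 8-bit path is where information is lost; the stored name itself is always intact.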
Ah, I think I see what you mean. There's a slight ambiguity in the
word "missing" here.
1) The Unicode character in the result lacks some of the information
it should have
2) The Unicode character in the file name is information that has now been
lost.
My reading was the first, but AIUI you actually meant the second. If
so, I'd be inclined to reword it very slightly, eg:
"The Unicode character in the second call to glob is now lost
information."
Is that a correct interpretation?
I think so, though I find the wording a little awkward (and on rereading, my
original wording was pretty bad). How about:
"The second call to glob has replaced the Unicode character with '?', which
means the actual filename cannot be recovered and the path is no longer
valid."
They're all just characters in the context of Unicode, so I think it's
clearest to use the character code, e.g.:

    The second call to glob has replaced the U+AB00 character with '?',
    which means ...
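As a small illustration of naming characters by code point, here is a sketch that labels the positions where a lossy name differs from the original. The helper name and both filenames are hypothetical:

```python
def code_point_label(ch):
    """Return the conventional U+XXXX label for a single character."""
    return f"U+{ord(ch):04X}"


original = "test\uab00.txt"  # hypothetical filename
lossy = "test?.txt"          # the same name after a lossy conversion

# Collect (original, replacement) labels wherever the two names differ.
differences = [
    (code_point_label(a), code_point_label(b))
    for a, b in zip(original, lossy)
    if a != b
]

print(differences)
```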

eryk sun