On Thu, Aug 18, 2016 at 4:07 PM, Steve Dower <steve.dower@python.org> wrote:
On 18Aug2016 0900, Chris Angelico wrote:
On Fri, Aug 19, 2016 at 1:54 AM, Steve Dower <steve.dower@python.org> wrote:
On 18Aug2016 0829, Chris Angelico wrote:
The second call to glob doesn't have any Unicode characters at all, the way I see it - it's all bytes. Am I completely misunderstanding this?
You're not the only one - I think this has been the most common misunderstanding.
On Windows, the paths as stored in the filesystem are actually all text - more precisely, UTF-16-LE encoded bytes, represented as strings of 16-bit characters.
Converting to an 8-bit character representation exists only for compatibility with code written for other platforms (either Linux or much older versions of Windows). The operating system has one way to do the conversion to bytes, which Python currently uses, but since we control that transformation I'm proposing an alternative conversion that favors reliability over compatibility. (The compatibility in question is with Windows 3.1; the change shouldn't affect code that properly handles multibyte encodings, which should include anything developed for Linux in the last decade or two.)
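A minimal sketch of the difference, which runs on any platform. It rests on two assumptions that are not stated above: cp1252 with errors="replace" stands in for the operating system's conversion to the active 8-bit code page, and UTF-8 stands in for the alternative conversion. The file name is hypothetical.

```python
# Hypothetical file name containing U+AB00, which has no cp1252 equivalent.
name = "test\uAB00.txt"

# What the filesystem stores: UTF-16-LE, two bytes per character here.
stored = name.encode("utf-16-le")

# Assumption: cp1252 with errors="replace" imitates the OS conversion to an
# 8-bit code page, which substitutes '?' for characters it cannot map.
lossy = name.encode("cp1252", errors="replace")

# Assumption: UTF-8 stands in for the proposed alternative conversion.
lossless = name.encode("utf-8")

print(lossy)     # b'test?.txt'
print(lossless)  # b'test\xea\xac\x80.txt'

# Decoding the lossy bytes cannot restore the original name ...
recovered_lossy = lossy.decode("cp1252")
# ... while the UTF-8 bytes round-trip exactly.
recovered_lossless = lossless.decode("utf-8")
```

Once the '?' has been substituted, nothing in the bytes says which character was there, so the resulting path no longer names the file.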
Does that help? I tried to keep the explanation short and focused :)
Ah, I think I see what you mean. There's a slight ambiguity in the word "missing" here.
1) The Unicode character in the result lacks some of the information it should have.
2) The Unicode character in the file name is information that has now been lost.
My reading was the first, but AIUI you actually meant the second. If so, I'd be inclined to reword it very slightly, eg:
"The Unicode character in the second call to glob is now lost information."
Is that a correct interpretation?
I think so, though I find the wording a little awkward (and on rereading, my original wording was pretty bad). How about:
"The second call to glob has replaced the Unicode character with '?', which means the actual filename cannot be recovered and the path is no longer valid."
They're all just characters in the context of Unicode, so I think it's clearest to use the character code, e.g.:
"The second call to glob has replaced the U+AB00 character with '?', which means ..."