On 3/17/2021 7:34 PM, Ivan Pozdeev via Python-Dev wrote:
On 17.03.2021 20:30, Steve Dower wrote:
On 3/17/2021 8:00 AM, Michał Górny wrote:
How about writing paths as bytestrings in the long term? I think this should eliminate the necessity of knowing the correct encoding for the filesystem.
That's what we're trying to do, the problem is that they start as strings, and so we need to convert them to a bytestring.
That conversion is the encoding ;)
And yeah, for reading, I'd use a UTF-8 reader that falls back to locale on failure (and restarts reading the file). But for writing, we need the tools that create these files (including Notepad!) to use the encoding we want.
I don't see a problem with using a file encoding specification like in Python source files. Since site.py is under our control, we can introduce it easily.
We can opt to allow only UTF-8 here -- then we wait out a transitional period and disallow anything else than UTF-8 (then the specification can be removed, too).
The only thing we can introduce *easily* is an error when the (exclusively third-party) tools that create them aren't up to date. Getting everyone to specify the encoding we want is a much bigger problem with a much slower solution.
This particular file is probably the worst case scenario, but preferring UTF-8 and handling existing files with a fallback is the best we can do (especially since an assumption of UTF-8 can be invalidated on a particular file, whereas most locale encodings cannot). Once we openly document that it should be UTF-8, tools will have a chance to catch up, and eventually the fallback will become harmless.