Hi, all.

I found that .pth files are decoded with the default (i.e. locale-specific) encoding.
https://github.com/python/cpython/blob/0269ce87c9347542c54a653dd78b9f60bb9fa...

pth files contain:

* import statements
* paths

For import statements, UTF-8 is the default Python source encoding. For paths, the fsencoding is the right encoding. It is UTF-8 on Windows (except when PYTHONLEGACYWINDOWSFSENCODING is set), and the locale-specific encoding on Linux.

What encoding should we use?

* UTF-8
* sys.getfilesystemencoding()
* Keep the status quo.

Regards,
-- 
Inada Naoki <songofacandy@gmail.com>
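A minimal sketch of the problem (the file name, path, and module below are made up for illustration, and this is not the actual site.py code): a .pth file written as UTF-8 is read back with open()'s default, locale-dependent encoding.

    import locale

    pth_text = "import foo\n/home/user/\u00e9l\u00e9ments/src\n"   # an import line plus a non-ASCII path

    # What a packaging tool might write:
    with open("example.pth", "w", encoding="utf-8") as f:
        f.write(pth_text)

    # What happens on read today: no encoding argument, so the locale's
    # preferred encoding is used.
    print(locale.getpreferredencoding(False))
    with open("example.pth") as f:
        print(f.read())   # mojibake or a decode error on a non-UTF-8 locale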
On Mon, Mar 15, 2021 at 7:53 PM Inada Naoki <songofacandy@gmail.com> wrote:
Hi, all.
I found that .pth files are decoded with the default (i.e. locale-specific) encoding.
https://github.com/python/cpython/blob/0269ce87c9347542c54a653dd78b9f60bb9fa...
pth files contain:
* import statements
* paths
For import statements, UTF-8 is the default Python source encoding. For paths, the fsencoding is the right encoding. It is UTF-8 on Windows (except when PYTHONLEGACYWINDOWSFSENCODING is set), and the locale-specific encoding on Linux.
What encoding should we use?
* UTF-8
* sys.getfilesystemencoding()
* Keep the status quo.
What are packaging tools like pip and setuptools writing .pth files out as?
OK. setuptools doesn't specify an encoding at all, so the locale-specific encoding is used. We cannot fix that in the short term.

On Wed, Mar 17, 2021 at 4:56 AM Brett Cannon <brett@python.org> wrote:
On Mon, Mar 15, 2021 at 7:53 PM Inada Naoki <songofacandy@gmail.com> wrote:
Hi, all.
I found that .pth files are decoded with the default (i.e. locale-specific) encoding. https://github.com/python/cpython/blob/0269ce87c9347542c54a653dd78b9f60bb9fa...
pth files contain:
* import statements
* paths
For import statements, UTF-8 is the default Python source encoding. For paths, the fsencoding is the right encoding. It is UTF-8 on Windows (except when PYTHONLEGACYWINDOWSFSENCODING is set), and the locale-specific encoding on Linux.
What encoding should we use?
* UTF-8
* sys.getfilesystemencoding()
* Keep the status quo.
What are packaging tools like pip and setuptools writing .pth files out as?
-- Inada Naoki <songofacandy@gmail.com>
On Wed, 2021-03-17 at 13:55 +0900, Inada Naoki wrote:
OK. setuptools doesn't specify an encoding at all, so the locale-specific encoding is used. We cannot fix that in the short term.
How about writing paths as bytestrings in the long term? I think this should eliminate the necessity of knowing the correct encoding for the filesystem.

-- 
Best regards,
Michał Górny
On Wed, 17 Mar 2021 at 08:13, Michał Górny <mgorny@gentoo.org> wrote:
On Wed, 2021-03-17 at 13:55 +0900, Inada Naoki wrote:
OK. setuptools doesn't specify an encoding at all, so the locale-specific encoding is used. We cannot fix that in the short term.
How about writing paths as bytestrings in the long term? I think this should eliminate the necessity of knowing the correct encoding for the filesystem.
If I have a path in my Python program that is "a£b" (a unicode string) and I want to write it to a .pth file, what encoding should I use to "write it as a bytestring"? I don't understand what you're trying to suggest here.

Paul
On Wed, Mar 17, 2021 at 5:33 PM Paul Moore <p.f.moore@gmail.com> wrote:
On Wed, 17 Mar 2021 at 08:13, Michał Górny <mgorny@gentoo.org> wrote:
On Wed, 2021-03-17 at 13:55 +0900, Inada Naoki wrote:
OK. setuptools doesn't specify an encoding at all, so the locale-specific encoding is used. We cannot fix that in the short term.
How about writing paths as bytestrings in the long term? I think this should eliminate the necessity of knowing the correct encoding for the filesystem.
If I have a path in my Python program that is "a£b" (a unicode string) and I want to write it to a .pth file, what encoding should I use to "write it as a bytestring"? I don't understand what you're trying to suggest here.

Paul
On Windows, it must be UTF-8. For example, we use `chcp 65001` in `activate.bat` to support unicode paths.

On Unix, a raw path is a bytestring, so paths can be written as-is and Python decodes them with the fsencoding.

So I think this is the ideal solution. But this solution requires platform-specific code in site.py. I don't think pth files are important enough for this complexity.

A sub-optimal idea is using UTF-8. It is the best encoding for Windows, and most Unix systems use UTF-8 too.

Regards,
-- 
Inada Naoki <songofacandy@gmail.com>
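A rough sketch of the Unix side of that idea (illustrative only; the file name and byte string are made up): a tool writes the raw bytes it got from the OS, and the reader decodes each line with the filesystem encoding, so even bytes that are not valid UTF-8 survive.

    import os

    # Hypothetical: write the path bytes exactly as the kernel reported them.
    raw_path = b"/home/user/\xe9tudes"                 # Latin-1 bytes, not valid UTF-8
    with open("demo.pth", "wb") as f:
        f.write(raw_path + b"\n")

    # A reader could then decode each line with the filesystem encoding;
    # surrogateescape keeps the odd bytes intact.
    with open("demo.pth", "rb") as f:
        for line in f:
            print(os.fsdecode(line.rstrip(b"\n")))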
On Wed, 17 Mar 2021 at 08:52, Inada Naoki <songofacandy@gmail.com> wrote:
On Windows, it must be UTF-8. For example, we use `chcp 65001` in `activate.bat` to support unicode paths. On Unix, a raw path is a bytestring, so paths can be written as-is and Python decodes them with the fsencoding.
Remember that .pth files contain executable code as well as paths, so fsencoding is not correct for a .pth file as a whole.
So I think this is the ideal solution. But this solution requires platform-specific code in site.py. I don't think pth files are important enough for this complexity.
.pth files are pretty important in the packaging community. I'd strongly support making their format and behaviour more precisely defined.
A sub-optimal idea is using UTF-8. It is the best encoding for Windows, and most Unix systems use UTF-8 too.
+1. IMO, UTF-8 is the only reasonable choice here.

The problem is with the transition - we need to find a way to deal with existing `.pth` files, and with people using older versions of tools (like setuptools and pipx) that write `.pth` files (so we can't assume, for example, that Python 3.12 will never see a .pth file using the old-style encoding).

It's worth noting that using the default encoding is the *correct* way of writing .pth files at the moment (as that's how site.py reads them - see https://github.com/python/cpython/blob/master/Lib/site.py#L173), so this is technically a file format change - tools writing .pth files will *have* to include version-specific code if they want to support multiple versions of Python. We need to be very clear about this - it's not just a case of "tools need to specify the encoding".

Paul
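To make that burden concrete, a hedged sketch of what a .pth-writing tool would need to do (the function name and the version cut-off are hypothetical, not anything that has been decided): pick the encoding that the target interpreter's site.py will use when it reads the file back.

    import locale
    import sys

    def write_pth(path, lines, target=sys.version_info):
        if target >= (3, 13):                           # hypothetical version that reads .pth as UTF-8
            enc = "utf-8"
        else:
            enc = locale.getpreferredencoding(False)    # what site.py currently assumes
        with open(path, "w", encoding=enc) as f:
            f.write("\n".join(lines) + "\n")

The point is the branch itself: no single encoding choice can be correct for both old and new interpreters at once.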
On Wed, 17 Mar 2021 at 09:26, Paul Moore <p.f.moore@gmail.com> wrote:
The problem is with the transition - we need to find a way to deal with existing `.pth` files, and with people using older versions of tools (like setuptools and pipx) that write `.pth` files (so we can't assume, for example, that Python 3.12 will never see a .pth file using the old-style encoding).
Hmm, I just checked and pipx uses UTF-8 when writing .pth files. See https://github.com/pipxproject/pipx/blob/master/src/pipx/venv.py#L176 (and lol, it was my mistake, I wrote that code - https://github.com/pipxproject/pipx/pull/168). I'm inclined to report that as a bug, even though it appears no-one has complained about it. But that seems counter-productive given the context here. Paul
On 3/17/2021 8:00 AM, Michał Górny wrote:
How about writing paths as bytestrings in the long term? I think this should eliminate the necessity of knowing the correct encoding for the filesystem.
That's what we're trying to do; the problem is that they start as strings, and so we need to convert them to a bytestring. That conversion is the encoding ;)

And yeah, for reading, I'd use a UTF-8 reader that falls back to locale on failure (and restarts reading the file). But for writing, we need the tools that create these files (including Notepad!) to use the encoding we want.

Cheers,
Steve
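A minimal sketch of that fallback reader, assuming the fallback is the locale encoding and the whole file is re-read on the first decode error (this is not actual site.py code):

    import locale

    def read_pth_lines(path):
        try:
            with open(path, encoding="utf-8") as f:
                return f.read().splitlines()
        except UnicodeDecodeError:
            # Restart from the beginning with the locale encoding.
            with open(path, encoding=locale.getpreferredencoding(False)) as f:
                return f.read().splitlines()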
On Wed, Mar 17, 2021 at 6:37 PM Steve Dower <steve.dower@python.org> wrote:
On 3/17/2021 8:00 AM, Michał Górny wrote:
How about writing paths as bytestrings in the long term? I think this should eliminate the necessity of knowing the correct encoding for the filesystem.
That's what we're trying to do; the problem is that they start as strings, and so we need to convert them to a bytestring.
That conversion is the encoding ;)
And yeah, for reading, I'd use a UTF-8 reader that falls back to locale on failure (and restarts reading the file). But for writing, we need the tools that create these files (including Notepad!) to use the encoding we want.
A somewhat radical idea carrying this to the extreme would be to use UTF-16 (LE) on Windows. After all, this _is_ the native file system encoding, and Notepad will happily read and write it.
On 3/17/2021 6:08 PM, Stefan Ring wrote:
A somewhat radical idea carrying this to the extreme would be to use UTF-16 (LE) on Windows. After all, this _is_ the native file system encoding, and Notepad will happily read and write it.
I'm not opposed to detecting a BOM by default (when no other encoding is specified), but that won't help most UTF-8 files, which these days come with no marker at all. I wouldn't change the default file encoding for writing though (except to unmarked UTF-8, and only with the compatibility approach Inada is working on).

Everyone has basically come around to the idea that UTF-8 is the only needed encoding, and I'm sure if it had existed when Windows decided to support a universal character set, it would have been chosen. But with what we have now, UTF-16-LE is not a good choice for anything apart from compatibility with Windows.

Cheers,
Steve
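A sketch of the BOM detection mentioned above (purely illustrative; the helper name and the default are made up):

    import codecs

    def sniff_pth_encoding(path, default="utf-8"):
        with open(path, "rb") as f:
            head = f.read(4)
        if head.startswith(codecs.BOM_UTF8):
            return "utf-8-sig"
        if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
            return "utf-16"          # the utf-16 codec consumes the BOM itself
        return default               # unmarked files fall through to whatever default is chosen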
On 17.03.2021 20:30, Steve Dower wrote:
On 3/17/2021 8:00 AM, Michał Górny wrote:
How about writing paths as bytestrings in the long term? I think this should eliminate the necessity of knowing the correct encoding for the filesystem.
That's what we're trying to do; the problem is that they start as strings, and so we need to convert them to a bytestring.
That conversion is the encoding ;)
And yeah, for reading, I'd use a UTF-8 reader that falls back to locale on failure (and restarts reading the file). But for writing, we need the tools that create these files (including Notepad!) to use the encoding we want.
I don't see a problem with using a file encoding specification like in Python source files. Since site.py is under our control, we can introduce it easily.

We can opt to allow only UTF-8 here -- then we wait out a transitional period and disallow anything other than UTF-8 (then the specification can be removed, too).
-- Regards, Ivan
On 3/17/2021 7:34 PM, Ivan Pozdeev via Python-Dev wrote:
On 17.03.2021 20:30, Steve Dower wrote:
On 3/17/2021 8:00 AM, Michał Górny wrote:
How about writing paths as bytestrings in the long term? I think this should eliminate the necessity of knowing the correct encoding for the filesystem.
That's what we're trying to do; the problem is that they start as strings, and so we need to convert them to a bytestring.
That conversion is the encoding ;)
And yeah, for reading, I'd use a UTF-8 reader that falls back to locale on failure (and restarts reading the file). But for writing, we need the tools that create these files (including Notepad!) to use the encoding we want.
I don't see a problem with using a file encoding specification like in Python source files. Since site.py is under our control, we can introduce it easily.
We can opt to allow only UTF-8 here -- then we wait out a transitional period and disallow anything other than UTF-8 (then the specification can be removed, too).
The only thing we can introduce *easily* is an error when the (exclusively third-party) tools that create them aren't up to date. Getting everyone to specify the encoding we want is a much bigger problem with a much slower solution.

This particular file is probably the worst case scenario, but preferring UTF-8 and handling existing files with a fallback is the best we can do (especially since an assumption of UTF-8 can be invalidated on a particular file, whereas most locale encodings cannot). Once we openly document that it should be UTF-8, tools will have a chance to catch up, and eventually the fallback will become harmless.

Cheers,
Steve
On 17.03.2021 23:04, Steve Dower wrote:
On 3/17/2021 7:34 PM, Ivan Pozdeev via Python-Dev wrote:
On 17.03.2021 20:30, Steve Dower wrote:
On 3/17/2021 8:00 AM, Michał Górny wrote:
How about writing paths as bytestrings in the long term? I think this should eliminate the necessity of knowing the correct encoding for the filesystem.
That's what we're trying to do; the problem is that they start as strings, and so we need to convert them to a bytestring.
That conversion is the encoding ;)
And yeah, for reading, I'd use a UTF-8 reader that falls back to locale on failure (and restarts reading the file). But for writing, we need the tools that create these files (including Notepad!) to use the encoding we want.
I don't see a problem with using a file encoding specification like in Python source files. Since site.py is under our control, we can introduce it easily.
We can opt to allow only UTF-8 here -- then we wait out a transitional period and disallow anything other than UTF-8 (then the specification can be removed, too).
The only thing we can introduce *easily* is an error when the (exclusively third-party) tools that create them aren't up to date. Getting everyone to specify the encoding we want is a much bigger problem with a much slower solution.
I don't see a problem with either. If we want to standardize something, we have to encourage, then ultimately enforce compliance, one way or another.
This particular file is probably the worst case scenario, but preferring UTF-8 and handling existing files with a fallback is the best we can do (especially since an assumption of UTF-8 can be invalidated on a particular file, whereas most locale encodings cannot). Once we openly document that it should be UTF-8, tools will have a chance to catch up, and eventually the fallback will become harmless.
Cheers, Steve
-- Regards, Ivan
On Wed, Mar 17, 2021 at 1:11 AM Michał Górny <mgorny@gentoo.org> wrote:
On Wed, 2021-03-17 at 13:55 +0900, Inada Naoki wrote:
OK. setuptools doesn't specify an encoding at all, so the locale-specific encoding is used. We cannot fix that in the short term.
How about writing paths as bytestrings in the long term? I think this should eliminate the necessity of knowing the correct encoding for the filesystem.
On Linux and many Unixes, there is no "correct" filesystem encoding. ASCII and UTF-8 are probably the most common encodings for individual files, maybe even large collections of files, but nevertheless, paths are bytestrings.

Treating paths as UTF-8 works fine for most files, but once in a while there'll be a filename that fails to convert, and that's not the fault of the filename. For example, what happens if you need a file whose name was created with:

    touch "Ma$(echo | tr '\012' '\361')ana"

For a presentation application (for example), assuming UTF-8 is probably fine, maybe even a good thing. But for a filesystem backup tool, it's important not to assume an encoding, so you can back up and restore all filenames irrespective of what the files' creators intended, encoding-wise.
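For what it's worth, a sketch of how Python copes with such a name (the byte string is just the one the shell command above would create): the undecodable byte survives as a surrogate, so the path round-trips, but it cannot be written into a strictly UTF-8 encoded file.

    import os

    name = b"Ma\xf1ana"                    # the Latin-1 bytes from the shell example
    as_str = os.fsdecode(name)             # 'Ma\udcf1ana' on a UTF-8 POSIX system
    assert os.fsencode(as_str) == name     # the odd byte round-trips via surrogateescape

    try:
        as_str.encode("utf-8")             # strict UTF-8, as a .pth file would need
    except UnicodeEncodeError as exc:
        print("not representable in UTF-8:", exc)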
On Tue, 16 Mar 2021 11:44:13 +0900 Inada Naoki <songofacandy@gmail.com> wrote:
Hi, all.
I found that .pth files are decoded with the default (i.e. locale-specific) encoding. https://github.com/python/cpython/blob/0269ce87c9347542c54a653dd78b9f60bb9fa...
pth files contain:
* import statements
* paths
For import statements, UTF-8 is the default Python source encoding. For paths, the fsencoding is the right encoding. It is UTF-8 on Windows (except when PYTHONLEGACYWINDOWSFSENCODING is set), and the locale-specific encoding on Linux.
What encoding should we use?
* UTF-8
* sys.getfilesystemencoding()
* Keep the status quo.
You could add special markup to specify utf8 encoding:

    # -*- encoding: utf8 -*-

If no markup is present, use the locale encoding. If markup is present, use utf8 encoding. Bail out if the markup specifies something other than utf8.

Then update all pth-producing tools to write utf8-encoded pth files (at least on the Python versions that support the encoding markup). In 15 years, you can switch to utf8 by default.

Regards

Antoine.
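A rough sketch of that markup check (the regex, helper name, and exact rules are assumptions, not a spec): look only at the first line, accept only utf8, bail out on anything else.

    import locale
    import re

    _COOKIE = re.compile(rb"-\*-\s*encoding:\s*([-\w.]+)\s*-\*-")

    def pth_encoding(path):
        with open(path, "rb") as f:
            first_line = f.readline()
        m = _COOKIE.search(first_line)
        if m is None:
            return locale.getpreferredencoding(False)   # no markup: locale encoding
        name = m.group(1).decode("ascii")
        if name.lower().replace("-", "") != "utf8":
            raise ValueError("only utf8 is allowed in a .pth encoding markup")
        return "utf-8"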
participants (9)

- Antoine Pitrou
- Brett Cannon
- Dan Stromberg
- Inada Naoki
- Ivan Pozdeev
- Michał Górny
- Paul Moore
- Stefan Ring
- Steve Dower