string types for paths in PEP 517
Hi all,

Quick question about an arcane topic: currently, PEP 517 says that paths are always represented as unicode strings. For example, when the frontend calls build_wheel, it has to create a temporary dir to hold the output wheel, and it passes this in as an absolute path represented as a unicode string.

In Python 3 I think this is totally fine, because the surrogate-escape system means that all paths can be represented as unicode strings, even on systems like Linux where you can have paths that are invalid according to Python's idea of the filesystem encoding.

In Python 2, if I understand correctly (and I'm not super confident that I do), then there is no surrogate-escape, and it's possible to have paths that can't be represented as a unicode object. For example, if someone's home directory is /home/stéfan in UTF-8 but Python thinks that the locale is C, and a frontend tries to make a tmpdir in $HOME/.local/tmp/ and pass it to a backend then... everything blows up, I guess?

So I guess this is a question for those unfortunate souls who understand these details better than me (hi Nick!): is this actually a problem, and is there anything we can/should do differently?

-n

-- Nathaniel J. Smith -- https://vorpus.org
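A minimal Python 3 sketch of the difference described above. It assumes an ASCII (C locale) filesystem encoding, which is passed explicitly here rather than read from the interpreter. The helper names `path_to_text`, `text_to_path` and `strict_decode_fails` are hypothetical and exist only for this illustration.

```python
# Bytes of "/home/stéfan" as a UTF-8 Linux filesystem would store them.
RAW_HOME = b"/home/st\xc3\xa9fan"


def path_to_text(raw, encoding="ascii"):
    """Decode a bytes path; undecodable bytes become lone surrogates (PEP 383)."""
    return raw.decode(encoding, "surrogateescape")


def text_to_path(text, encoding="ascii"):
    """Reverse path_to_text, turning the surrogates back into the original bytes."""
    return text.encode(encoding, "surrogateescape")


def strict_decode_fails(raw, encoding="ascii"):
    """Return True if a strict decode (no surrogateescape) rejects the path."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return True
    return False
```

With surrogateescape the path survives a round trip through a unicode string even though the assumed encoding cannot represent it. The strict decode is roughly the situation on Python 2, which has no surrogateescape handler.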
I considered this. It's *potentially* a problem, but I think we should not try to deal with it for now:

- Normally, temp files will go in /tmp - so it should be fine to construct paths of entirely ascii characters.
- Frontends that want the wheel to end up elsewhere can ask for it in a tmp directory first and then move it, so there's a workaround if it becomes an issue.
- We already have workarounds for the commonest case of UTF-8 paths + C locale: ignore the locale and treat paths as UTF-8.
- The 'right' way to deal with it on Unix is to make all paths bytes, which would introduce a similar issue on Windows. If paths have to be bytes in some situations and unicode in others, both frontends and backends need extra complexity to handle that.
- If your non-ascii username breaks stuff on Python 2... Python 3 is ready to make your life easier.

Thomas

On Tue, Sep 5, 2017, at 07:33 AM, Nathaniel Smith wrote:
> [...] is this actually a problem, and is there anything we can/should do differently?
On 5 September 2017 at 09:00, Thomas Kluyver <thomas@kluyver.me.uk> wrote:
> I considered this. It's *potentially* a problem, but I think we should not try to deal with it for now: [...]

+1 on this

Paul
+1

2017-09-05 3:21 GMT-05:00 Paul Moore <p.f.moore@gmail.com>:
> +1 on this
On Tue, Sep 5, 2017 at 1:00 AM, Thomas Kluyver <thomas@kluyver.me.uk> wrote:
> I considered this. It's *potentially* a problem, but I think we should not try to deal with it for now:
>
> - Normally, temp files will go in /tmp - so it should be fine to construct paths of entirely ascii characters.

Does pip in fact use /tmp for temporary directories? (It's not always the right choice, because /tmp has limited space on some systems, e.g. b/c it's on a ramdisk. If we still had build_directory= then this could be an issue, since build directories can be arbitrarily large; maybe it's not a big deal now that we only need the tmpdir to handle a single sdist/wheel/dist-info.)

> - Frontends that want the wheel to end up elsewhere can ask for it in a tmp directory first and then move it, so there's a workaround if it becomes an issue.
> - We already have workarounds for the commonest case of UTF-8 paths + C locale: ignore the locale and treat paths as UTF-8.

Only in 3.7, I think? Or do you mean, backends should be doing this manually on Python 2?

(To be clear, I think the current text is potentially fine, I just want to make sure I/we understand the full consequences instead of discovering them a year from now when we're stuck with them :-).)

-n

-- Nathaniel J. Smith -- https://vorpus.org
On Sep 5, 2017, at 4:36 AM, Nathaniel Smith <njs@pobox.com> wrote:
> Does pip in fact use /tmp for temporary directories? (It's not always the right choice, because /tmp has limited space on some systems, e.g. b/c it's on a ramdisk. If we still had build_directory= then this could be an issue, since build directories can be arbitrarily large; maybe it's not a big deal now that we only need the tmpdir to handle a single sdist/wheel/dist-info.)

It does by default, yes. It just uses the tempfile module, so it respects $TMPDIR and everything if people want to point it to a different directory.
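As a rough illustration of the tempfile behaviour mentioned above (this is not pip's code), the sketch below shows the standard library picking up $TMPDIR. The helper name `tempdir_for` is hypothetical. It resets the module-level cache `tempfile.tempdir`, because `tempfile.gettempdir()` only reads the environment on its first call.

```python
import os
import tempfile


def tempdir_for(tmpdir_value):
    """Return the directory tempfile would choose with TMPDIR set to tmpdir_value.

    The directory must already exist and be writable; otherwise tempfile
    falls through to its other candidates.
    """
    old_env = os.environ.get("TMPDIR")
    old_cached = tempfile.tempdir
    os.environ["TMPDIR"] = tmpdir_value
    tempfile.tempdir = None  # drop the cached choice so TMPDIR is re-read
    try:
        return tempfile.gettempdir()
    finally:
        # Restore the previous environment and cache so callers are unaffected.
        tempfile.tempdir = old_cached
        if old_env is None:
            os.environ.pop("TMPDIR", None)
        else:
            os.environ["TMPDIR"] = old_env
```

In normal use nothing needs resetting: a user exports TMPDIR before starting the frontend, and every later `tempfile.mkdtemp()` call lands under that directory.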
On 5 September 2017 at 01:36, Nathaniel Smith <njs@pobox.com> wrote:
> On Tue, Sep 5, 2017 at 1:00 AM, Thomas Kluyver <thomas@kluyver.me.uk> wrote:
>> - We already have workarounds for the commonest case of UTF-8 paths + C locale: ignore the locale and treat paths as UTF-8.
>
> Only in 3.7, I think? Or do you mean, backends should be doing this manually on Python 2?

The frontend controls the locale that the backend runs in, so the frontend can set C.UTF-8 even if the frontend itself is launched in the C locale.

Cheers,
Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
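One way a frontend might do this, sketched under assumptions: the backend hook runs in a subprocess, and the platform actually provides a `C.UTF-8` locale (not every system does). The function names `backend_env` and `run_backend` are hypothetical and are not part of PEP 517 or of any existing frontend.

```python
import os
import subprocess
import sys


def backend_env(base_env=None, locale_name="C.UTF-8"):
    """Copy the environment and force the locale the backend will see."""
    env = dict(os.environ if base_env is None else base_env)
    # LC_ALL takes precedence over LANG and the individual LC_* variables.
    env["LC_ALL"] = locale_name
    return env


def run_backend(code, base_env=None):
    """Run a snippet of Python in a child process under the forced locale."""
    return subprocess.check_output(
        [sys.executable, "-c", code],
        env=backend_env(base_env),
    )
```

The frontend's own environment is left untouched; only the child process sees the overridden locale.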
participants (6)
- Donald Stufft
- Nathaniel Smith
- Nick Coghlan
- Paul Moore
- Thomas Kluyver
- xoviat