[issue11230] "Full unicode import system" not in 3.2

New submission from John <jh45dev@gmail.com>:

A few months ago I read that in 3.2 it will be possible to import modules that are located on paths containing any unicode character. (more precisely, with chars not in the local code page)

After an hour or two trying to get this to work in 3.2rc3, I went looking for clues, and found these 2 messages in which Victor Stinner says this feature is delayed until Python 3.3:

http://bugs.python.org/issue3080#msg126514
http://bugs.python.org/issue10828#msg125787

Could you please make it clear in documentation and web pages, that this feature is not working yet.

The Python 3.2 download page includes this: "countless fixes regarding bytes/string issues; among them full support for a bytes environment (filenames, environment variables)" and I guessed this must cover importing from any unicode path, as there was no mention that such importing had been abandoned for this version.

-- jh

----------
assignee: docs@python
components: Documentation
messages: 128711
nosy: docs@python, jh45
priority: normal
severity: normal
status: open
title: "Full unicode import system" not in 3.2
versions: Python 3.2

_______________________________________
Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue11230>
_______________________________________

Changes by Antoine Pitrou <pitrou@free.fr>: ---------- nosy: +haypo

STINNER Victor <victor.stinner@haypocalc.com> added the comment:

Short answer: In Python 3.2, « import héhé » doesn't work on Windows, but you can have non-ASCII paths in sys.path.

Longer answer: I fixed the import machinery to correctly handle non-ASCII characters in module *paths*. But the import machinery is unable to handle non-ASCII characters in module *names*: it fails if the filesystem encoding is not UTF-8 (e.g. it fails on Windows).

There is another exception: Python doesn't (yet) support non-encodable module paths on Windows. On Windows, you can use any character in directory names, but Python 3.2 encodes paths to the filesystem encoding (ANSI code page), which is a smaller charset. In practice, this Windows-specific limitation on module paths doesn't really matter.

I plan to fix all these issues in Python 3.3: see #3080.

> Could you please make it clear in documentation and web pages, that this feature is not working yet.

The What's New in Python 3.2 documentation has this sentence: "Python’s import mechanism can now load modules installed in directories with non-ASCII characters in the path name. This solved an aggravating problem with home directories for users with non-ASCII characters in their usernames." which is correct.

Which web page should be updated/fixed?
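
For readers on Python 3.3 or later, where the thread says these limitations were lifted, the distinction between non-ASCII module *paths* and non-ASCII module *names* can be tried out with a small sketch like the one below. It is only an illustration and not code from the tracker: the helper name `import_from_unicode_dir`, the directory name and the module contents are all hypothetical, and it assumes the filesystem encoding of the machine can represent "é".

```python
import importlib
import os
import sys
import tempfile


def import_from_unicode_dir(dir_name, module_name, source):
    """Write <tmp>/<dir_name>/<module_name>.py, import it, and return the module.

    Hypothetical helper for illustration only; both dir_name (the path part)
    and module_name (the name part) may contain non-ASCII characters.
    """
    print("filesystem encoding:", sys.getfilesystemencoding())
    with tempfile.TemporaryDirectory() as base:
        module_dir = os.path.join(base, dir_name)
        os.mkdir(module_dir)
        file_path = os.path.join(module_dir, module_name + ".py")
        with open(file_path, "w", encoding="utf-8") as handle:
            handle.write(source)
        # Non-ASCII entry in sys.path: the "path" case from the message above.
        sys.path.insert(0, module_dir)
        try:
            importlib.invalidate_caches()
            # Non-ASCII module name: the "name" case from the message above.
            return importlib.import_module(module_name)
        finally:
            sys.path.remove(module_dir)


if __name__ == "__main__":
    module = import_from_unicode_dir("répertoire", "héhé", "VALUE = 42\n")
    print(module.__name__, module.VALUE)
```

The temporary directory is removed once the module object has been loaded, so the sketch leaves nothing behind apart from the entry in `sys.modules`.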

John <jh45dev@gmail.com> added the comment:

Victor asked "Which web page should be updated/fixed?"

Answer: The Python 3.2 download page. But what should it say? The main point is that people like me, who remember seeing a statement about this a few months ago, may expect Unicode to work in every conceivable situation. A prominent warning that it's not *all* fixed yet, with a link to details in the documentation, would save them from trying things that don't work.

By the way, I hadn't grasped a simple point from issue 3080: I tested on *English* Windows by putting a Greek character in the path to some Python modules. But the usual situation is where a *Greek* version of Windows has some Greek characters in the path, and from what you just wrote, that's OK now.

-- jh
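
The English-versus-Greek Windows point can be illustrated without a Windows machine by asking whether a path is encodable in a given code page. The sketch below makes some assumptions: Python's `cp1252` and `cp1253` codecs stand in for the Western European and Greek ANSI code pages, the path is made up, and the helper name `encodable` is hypothetical. It does not reproduce what Python 3.2 did internally.

```python
def encodable(text, encoding):
    """Return True if every character of text can be encoded with encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Hypothetical module directory whose name contains one Greek letter.
greek_path = "C:\\modules\\" + "\N{GREEK SMALL LETTER ALPHA}"

if __name__ == "__main__":
    # cp1252 ~ Western European code page, cp1253 ~ Greek code page (assumed stand-ins).
    for codec in ("cp1252", "cp1253", "utf-8"):
        print(codec, encodable(greek_path, codec))
```

Under these assumptions the path is rejected by `cp1252` and accepted by `cp1253` and UTF-8, which matches the observation that a Greek path is a problem on English Windows but not on Greek Windows.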

STINNER Victor <victor.stinner@haypocalc.com> added the comment:

> Victor asked "Which web page should be updated/fixed?" Answer: The Python 3.2 download page.

Sorry, but I don't see which page says that Python 3.2 has full Unicode support for import.

In http://www.python.org/download/releases/3.2/, I can read "countless fixes regarding bytes/string issues; among them full support for a bytes environment (filenames, environment variables)". "full support for a bytes environment" means that Python 3.2 has been fixed on UNIX to support undecodable filenames, but not that Python 3.2 supports unencodable filenames on Windows.

Can you propose a sentence which is more clear about bytes/Unicode?

Python 3.3 will have full Unicode support for modules: issue #3080 is already fixed, and I think that #11619 can be fixed (maybe not easily).

Changes by Éric Araujo <merwok@netwok.org>: ---------- nosy: +eric.araujo

John <jh45dev@gmail.com> added the comment:

Sorry for the long delay.

haypo wrote:
> Can you propose a sentence which is more clear about bytes/Unicode?

On this page: http://www.python.org/download/releases/3.2/ is this line:

"- countless fixes regarding bytes/string issues; among them full support for a bytes environment (filenames, environment variables)"

How about adding to that line something like:

" on UNIX; but on Windows the path to and name of each module you import can contain only characters that are in the ANSI codepage that your Windows is using"

and maybe

" (will be fixed in Python 3.3)"

and maybe (or not) also something like:

" (ANSI codepage = basic latin + other characters of only your own language group)"

-- jh

Changes by Ned Deily <nad@acm.org>: ---------- nosy: +georg.brandl

Tom Christiansen <tchrist@perl.com> added the comment:

How does this work for modules that have filesystem names different from the one used for import? The issue I'm thinking about is that the Mac HFS+ filesystem keeps its Unicode filenames in (close to) NFD form. That means that a module named "caf\N{LATIN SMALL LETTER E WITH ACUTE}" with 4 graphemes and 4 code points in its name winds up in the filesystem as "cafe\N{COMBINING ACUTE ACCENT}", still with 4 graphemes but now with 5 code points.

I believe (well, suspect; I have empirical evidence, not proof) Python stores its own identifiers in NFD, so this may not be quite as much of a problem as it might otherwise be.

Nonetheless, I have had users complain about what HFS+ does with such filenames, although I am not quite sure why. I think it’s because they access a file with 4 chars but they need a 5-char fileglob to wildcard it: touch "caf\N{LATIN SMALL LETTER E WITH ACUTE}" and then you need a wildcard of "?????" with an extra ? to find it. Kinda weird.

----------
nosy: +tchrist

Tom Christiansen <tchrist@perl.com> added the comment: Whoops, I meant that it appears that Python runs its identifiers through NFC. How that gets along with a filesystem that has quasi-NFD filenames I'm not sure, but it seems like it might be a variant of the case-insensitivity issue in filenames.
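
The code-point counts and the identifier normalization discussed in the two messages above can be checked with the standard `unicodedata` module. This is a sketch for illustration only, and the helper name `defined_name` is hypothetical. One detail goes beyond the thread: PEP 3131 specifies NFKC rather than NFC for identifiers, but for this particular string the two forms give the same result.

```python
import unicodedata

# Composed spelling: 4 code points, as in the message above.
composed = "caf\N{LATIN SMALL LETTER E WITH ACUTE}"
# Decomposed spelling, close to what HFS+ is described as storing: 5 code points.
decomposed = unicodedata.normalize("NFD", composed)


def defined_name(identifier):
    """Execute `<identifier> = 1` and return the name Python actually stored."""
    namespace = {}
    exec(identifier + " = 1", namespace)
    # exec() also adds "__builtins__" to the namespace, so skip it.
    return next(name for name in namespace if name != "__builtins__")


if __name__ == "__main__":
    print(len(composed), len(decomposed))
    print(decomposed == "cafe\N{COMBINING ACUTE ACCENT}")
    # Whichever spelling appears in the source, the stored name is the composed one.
    print(defined_name(decomposed) == composed)
```

Both spellings end up as the same composed identifier inside Python, while the file on disk may use a different normalization form; the reply that follows addresses that mismatch.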

STINNER Victor <victor.stinner@haypocalc.com> added the comment:

> The issue I'm thinking about is that the Mac HFS+ filesystem

There is no issue with HFS+ normalization. The kernel "normalizes" filenames to its own variant, so Python doesn't have to care about this. When you write "import h<é normalized to NFC>" or "import h<é normalized to NFD>", Python tries to open "h<é normalized to NFC>.py"; the HFS+ filesystem then does its own normalization (=> "h<é normalized to its variant of NFD>.py").

Serhiy Storchaka added the comment:

Python 3.2 is out of maintenance. Full Unicode path support was added in Python 3.3 by issue3080.

----------
nosy: +serhiy.storchaka
resolution: -> out of date
stage: -> resolved
status: open -> closed
