Re: [Python-Dev] unicode imports
I don't have specific information on the machines. We didn't try very hard to get things to work with 2.3, since we simply assumed it would work automatically when we upgraded to a more mature 2.4. I could try to get more info, but it would be 2.3-specific. Have there been any changes since then?

Note that it may not go into Program Files at all. Someone may want to install his modules in a folder named in honour of his mother. Also, I really would like to see a general solution that doesn't assume that the path name can somehow be transmuted to an ASCII name. Users are unpredictable. When you have a wide distribution, you come up against all kinds of problems. (Currently we have around 500,000 users in China.)

Also, relying on some locale settings is not acceptable. My machine here has the Icelandic locale. Yet I need to be able to set up and use a Chinese install. Likewise, many machines in China will have an English locale. A default encoding and locale is essentially an evil hack in our increasingly global environment. We have converted more or less our entire code base to Unicode because keeping track of encoded strings is simply unworkable in a large project.

Funny that no other platforms could benefit from a unicode import path. Does that mean that windows will reign supreme? Please explain.

Cheers,
Kristján

-----Original Message-----
From: "Martin v. Löwis" [mailto:martin@v.loewis.de]
Sent: 17. júní 2006 08:42
To: Kristján V. Jónsson
Cc: Python Dev
Subject: Re: [Python-Dev] unicode imports

Kristján V. Jónsson wrote:
The standard install path in chinese distributions can be with a non-ANSI path, and installing an embedded python application there will break it.
I very much doubt this. On a Chinese system, the Program Files folder likely has a non-*ASCII* name, but it will have a fine *ANSI* name, as the ANSI code page on that system should be either 936 (simplified chinese) or 950 (traditional chinese) - unless the system is misconfigured. Can you please report what the path is, what the precise name of the operating system is, and what the system locale and the system code page are?
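The details requested above can be gathered with a short script. This is only a sketch, written for a current Python 3 rather than the 2.3/2.4 interpreters discussed in this thread; the function name `describe_encoding_environment` is made up for illustration, and the code-page lookup runs only on Windows.

```python
import locale
import platform
import sys


def describe_encoding_environment():
    """Collect the OS name, locale and code-page details for a bug report."""
    info = {
        "os": platform.platform(),
        # Calling setlocale() with only a category queries the current setting.
        "lc_ctype": locale.setlocale(locale.LC_CTYPE),
        "preferred_encoding": locale.getpreferredencoding(False),
        "filesystem_encoding": sys.getfilesystemencoding(),
    }
    if sys.platform == "win32":
        import ctypes

        # GetACP/GetOEMCP return the active ANSI and OEM code page numbers,
        # for example 936 on a Simplified Chinese system.
        info["ansi_code_page"] = ctypes.windll.kernel32.GetACP()
        info["oem_code_page"] = ctypes.windll.kernel32.GetOEMCP()
    return info


if __name__ == "__main__":
    for key, value in describe_encoding_environment().items():
        print(f"{key}: {value}")
```

Running it on the affected machine and pasting the output, together with the install path, would answer most of the questions in one go.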
A completely parallel implementation on the sys.path[i] level?
You should also take a look at what the 8.3 name of the path is. I really cannot believe that the path is inaccessible to DOS programs.
Are there other platforms beside Windows that would profit from this?
No. Regards, Martin
It should be noted that I once started to convert the import machinery to be fully unicode aware. As far as I can tell, a *lot* has to be changed to make this work. I started with refactoring Python/import.c, but nobody responded to the question whether such a refactoring patch would be accepted or not. Thomas
Thomas Heller wrote:
It should be noted that I once started to convert the import machinery to be fully unicode aware. As far as I can tell, a *lot* has to be changed to make this work.
I started with refactoring Python/import.c, but nobody responded to the question whether such a refactoring patch would be accepted or not.
Perhaps someone should start a PEP on this subject?! (not me, though :-)

--
Marc-Andre Lemburg
eGenix.com Professional Python Services directly from the Source (#1, Jun 19 2006)
Thomas Heller wrote:
It should be noted that I once started to convert the import machinery to be fully unicode aware. As far as I can tell, a *lot* has to be changed to make this work.
Is that code available somewhere still? Does it still work?
I started with refactoring Python/import.c, but nobody responded to the question whether such a refactoring patch would be accepted or not.
I would like to see minimal changes only. I don't see why massive refactoring would be necessary: the structure of the code should persist - only the data types should change from char* to PyObject*. Calls like stat() and open() should be generalized to accept PyObject*, and otherwise keep their interface. Regards, Martin
Martin v. Löwis schrieb:
Thomas Heller wrote:
It should be noted that I once started to convert the import machinery to be fully unicode aware. As far as I can tell, a *lot* has to be changed to make this work.
Is that code available somewhere still? Does it still work?
Available as patch 1093253. I have not tried whether it still works.
I started with refactoring Python/import.c, but nobody responded to the question whether such a refactoring patch would be accepted or not.
I would like to see minimal changes only. I don't see why massive refactoring would be necessary: the structure of the code should persist - only the data types should change from char* to PyObject*. Calls like stat() and open() should be generalized to accept PyObject*, and otherwise keep their interface.
To be really useful, wide char versions of other things must also be made available: command line arguments, environment variables (PYTHONPATH), and maybe other stuff. Thomas
Thomas Heller wrote:
Is that code available somewhere still? Does it still work?
Available as patch 1093253. I have not tried whether it still works.
I see. It's quite a huge change, that's probably why nobody found the time to review it, yet.
To be really useful, wide char versions of other things must also be made available: command line arguments, environment variables (PYTHONPATH), and maybe other stuff.
While I think these things should eventually be done, I don't think they are that related to import.c. If W9x support gets dropped, we can rewrite PC/getpathp.c to use the Unicode API throughout; that would allow putting non-ANSI path names onto PYTHONPATH. Making os.environ support Unicode is an entirely different issue. I would like to see os.environ return Unicode if the key is Unicode; another option would be to introduce os.uenviron. Regards, Martin
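The "value type follows key type" idea for os.environ can be illustrated with a small wrapper. This is a hypothetical sketch in Python 3 terms (bytes and str standing in for the 2.x str and unicode types); the class name `TypedEnviron` is invented here and is not an API from the patch or from the standard library.

```python
import os


class TypedEnviron:
    """Read-only view of the environment whose value type follows the key type."""

    def __getitem__(self, key):
        if isinstance(key, bytes):
            # Bytes key: look up the text entry and hand back bytes,
            # using the file system encoding in both directions.
            return os.fsencode(os.environ[os.fsdecode(key)])
        # Text key: return the text value unchanged.
        return os.environ[key]

    def get(self, key, default=None):
        try:
            return self[key]
        except KeyError:
            return default


env = TypedEnviron()
os.environ["DEMO_UNICODE_IMPORTS"] = "value"   # illustrative variable name
print(type(env["DEMO_UNICODE_IMPORTS"]), type(env[b"DEMO_UNICODE_IMPORTS"]))
```

The alternative mentioned above, a separate mapping alongside os.environ, would avoid the type dispatch at the cost of a second name to learn.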
Kristján V. Jónsson wrote:
Funny that no other platforms could benefit from a unicode import path. Does that mean that windows will reign supreme? Please explain.
As near as I can tell, other platforms use encoded strings with the normal (byte-based) posix file API, so the Python interpreter and the file system simply need to agree on the encoding (typically utf-8) in order for both filesystem access and importing from non-ASCII paths to work.

On Windows, though, most of the file system interaction code has had to be updated to use the wide-character API where possible. import.c is one of the few holdouts that relies entirely on the byte-based posix API.

If I had to put money on what's currently happening on your test machine, it's that import.c is trying to do u'c:/tmp/\u814c'.encode('mbcs'), getting 'c:/tmp/?' and proceeding to do nothing useful with that path entry. Checking the result of sys.getfilesystemencoding() should be able to confirm that.

So it looks like it ain't really gonna work properly on Windows unless import.c is rewritten to use the Unicode-aware platform independent IO implementation in posixmodule.c. Until that happens (hopefully by Python 2.6), I like MvL's suggestion - look at the 8.3 DOS name on the command prompt and put that into sys.path. ctypes and/or pywin32 should let you get at that information programmatically.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
---------------------------------------------------------------
http://www.boredomandlaziness.org
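Both points in the message above can be sketched in a few lines of modern Python 3. The first helper reproduces the lossy encoding step, using cp1252 as a stand-in for a Western ANSI code page, since the mbcs codec exists only on Windows. The second asks Windows for the 8.3 name through ctypes and the Win32 call GetShortPathNameW. The helper names are made up for illustration, the path must exist for the lookup to succeed, and a volume with short-name generation disabled may simply return the long name.

```python
import sys


def lossy_encode(path, encoding):
    """Encode a path as a byte-based API would, replacing unmappable characters."""
    return path.encode(encoding, errors="replace")


def short_path_name(path):
    """Return the 8.3 form of an existing path on Windows; elsewhere return it unchanged."""
    if sys.platform != "win32":
        return path
    import ctypes
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    get_short = kernel32.GetShortPathNameW
    get_short.argtypes = [wintypes.LPCWSTR, wintypes.LPWSTR, wintypes.DWORD]
    get_short.restype = wintypes.DWORD

    # First call with no buffer: returns the required size, terminator included.
    size = get_short(path, None, 0)
    if size == 0:
        raise ctypes.WinError(ctypes.get_last_error())
    buffer = ctypes.create_unicode_buffer(size)
    if get_short(path, buffer, size) == 0:
        raise ctypes.WinError(ctypes.get_last_error())
    return buffer.value


print(lossy_encode("c:/tmp/\u814c", "cp1252"))  # b'c:/tmp/?'
```

On an affected Windows machine, something like `sys.path.append(short_path_name(install_dir))` would then apply the workaround, with `install_dir` being whatever directory the application was installed into.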
Kristján V. Jónsson wrote:
I don't have specific information on the machines. We didn't try very hard to get things to work with 2.3 since we simply assumed it would work automatically when we upgraded to a more mature 2.4. I could try to get more info, but it would be 2.3 specific. Have there been any changes since then?
Not in that respect, no.
Note that it may not go into program files at all. Someone may want to install his modules in a folder named in the honour of his mother.
It's certainly possible to set this up in a way that it won't work, on any localized version: just use a path name that isn't supported in the ANSI code page. However, that should rarely happen: the name of his mother should still be expressible in the ANSI code page, if the system is set up correctly.
Also, I really would like to see a general solution that doesn't assume that the path name can somehow be transmuted to an ascii name.
(Please don't say ASCII here. Windows *A APIs are named that way because Microsoft Windows has the notion of an "ANSI code page", which, in turn, is just a code page indirection to some selected code page meant to support the characters of the user's locale.)
Users are unpredictable. When you have a wide distribution, you come up against all kinds of problems. (Currently we have around 500,000 users in China.) Also, relying on some locale settings is not acceptable.
Sure, but stating that doesn't really help. Code contributions would help, but that part of Python has been left out of using the *W API, because it is particularly messy to fix.
Funny that no other platforms could benefit from a unicode import path. Does that mean that windows will reign supreme?
That is the case, more or less. Or, more precisely:

- On Linux, Solaris, and most other Unices, file names are bytes on the system API, and are expected to be encoded in the user's locale. So if your locale does not support a character, you can't name a file that way, on Unix. There is a trend towards using UTF-8 locales, so that the locale contains all Unicode characters.
- On Mac OS X, all file names are UTF-8, always (unless the user managed to mess it up), so you can have arbitrary Unicode file names.

That means that the approach of converting a Unicode sys.path element to the file system encoding will always do the right thing on Linux and OS X: the file system encoding will be the locale's encoding on Linux, and will be UTF-8 on OS X. It's only Windows which has valid file names that cannot be represented in the current locale.

Regards,
Martin
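The distinction drawn above comes down to whether a Unicode path survives a round trip through the file system encoding. A minimal sketch in Python 3, with hypothetical helper names and example paths, checks which sys.path-style entries cannot be represented in a given encoding:

```python
import sys


def representable(path, encoding):
    """Return True if every character of path can be encoded in encoding."""
    try:
        path.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


def unrepresentable_entries(paths, encoding=None):
    """List the entries that a byte-based API using encoding could not name."""
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    return [p for p in paths if not representable(p, encoding)]


entries = ["c:/python24/lib", "c:/tmp/\u814c"]  # illustrative paths only
# cp1252 stands in for a Western ANSI code page: the second entry is lost.
print(unrepresentable_entries(entries, "cp1252"))
# With UTF-8, as on OS X or a UTF-8 locale, nothing is lost.
print(unrepresentable_entries(entries, "utf-8"))
```

An empty result for the platform's own file system encoding is the situation described for Linux and OS X; a non-empty one is the Windows case where the name is valid on disk but not in the current code page.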
participants (5)
- "Martin v. Löwis"
- Kristján V. Jónsson
- M.-A. Lemburg
- Nick Coghlan
- Thomas Heller