Inconsistencies if locale and filesystem encodings are different
Hi,

A PYTHONFSENCODING environment variable was added to Python 3.2: issue #8622. This variable introduces an inconsistency because the filesystem and the locale encodings can now be different.

There are (at least) four issues related to this problem. We have 2 choices to fix these issues:

(a) use the same encoding to encode and decode values (it can be different for each issue)
(b) remove the PYTHONFSENCODING variable and raise an error if the locale and filesystem encodings are different (ensure that both encodings are the same)

Even if choice (a) is not easy to implement, it is feasible and I have already written some patches. I don't understand how Python interacts with other programs that ignore the PYTHONFSENCODING environment variable. It's as if Python used its own "locale".

Choice (b) looks easy to implement, but... there is the problem of Mac OS X. Mac OS X uses the utf-8 encoding for the filesystem (and not the locale encoding), whereas it looks like the locale encoding is used for the command line arguments. See issue #4388 for more information.

There may also be a useful use case for PYTHONFSENCODING, but I don't remember which one :-)

Issues
------

sys.argv:
 - decoded from the locale encoding
 - subprocess encodes process arguments to the filesystem encoding
 => issue #9992

sys.path:
 - decoded from the locale encoding
 - import encodes paths to the filesystem encoding
 => issue #10014

The script name, read from the command line (e.g. python script.py), is decoded using the locale encoding, whereas it is used to fill sys.path[0] (without any encoding conversion) and import encodes paths to the filesystem encoding.
 => issue #10039

PYTHONWARNINGS environment variable:
 - decoded from the locale encoding
 - subprocess encodes environment variables to the filesystem encoding
 => issue #9988

--
Victor Stinner
http://www.haypocalc.com/
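To make the sys.argv case concrete, here is a minimal sketch of what happens when a value is decoded with one encoding and encoded with another. It does not touch the real sys.argv or subprocess; the reencode() helper, the sample filename and the choice of KOI8-R for the "locale" and UTF-8 for the "filesystem" are all assumptions made for illustration.

```python
def reencode(raw: bytes, decode_with: str, encode_with: str) -> bytes:
    """Decode raw bytes with one codec, then encode the text with another."""
    return raw.decode(decode_with).encode(encode_with)


# Bytes as they might arrive from a KOI8-R terminal (hypothetical argument).
raw_arg = "файл".encode("koi8-r")

# Choice (a): the same codec is used in both directions, bytes survive.
same = reencode(raw_arg, "koi8-r", "koi8-r")

# The problem case: decoded with the "locale" codec, encoded with a
# different "filesystem" codec, so a child process gets different bytes.
mixed = reencode(raw_arg, "koi8-r", "utf-8")

print(same == raw_arg)   # True
print(mixed == raw_arg)  # False
```

The text is the same in both cases; only the bytes handed to the next program differ, which is why a program that ignores PYTHONFSENCODING may see a different name.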
Victor Stinner wrote:
Hi,
A PYTHONFSENCODING environment variable was added to Python 3.2: issue #8622. This variable introduces an inconsistency because the filesystem and the locale encodings can now be different.
There are (at least) four issues related to this problem. We have 2 choices to fix these issues:
(a) use the same encoding to encode and decode values (it can be different for each issue)
(b) remove PYTHONFSENCODING variable and raise an error if locale and filesystem encodings are different (ensure that both encodings are the same)
Even if choice (a) is not easy to implement, it is feasible and I already wrote some patches.
I don't understand how Python interacts with other programs that ignore the PYTHONFSENCODING environment variable. It's as if Python used its own "locale".
Choice (b) looks easy to implement, but... there is the problem of Mac OS X. Mac OS X uses utf-8 encoding for the filesystem (and not the locale encoding), whereas it looks like the locale encoding is used for the command line arguments. See issue #4388 for more information.
There may also be a useful use case for PYTHONFSENCODING, but I don't remember which one :-)
You have to differentiate between the meaning of a file system encoding and the locale:

A file system encoding defines how the applications interact with the file system.

A locale defines how the user expects to interact with the application.

It is well possible that the two are different. Mac OS X is just one example. Another common example is having a Unix account using the C locale (=ASCII) while working on a UTF-8 file system.

BTW: We added the variable because, just like the I/O encoding, you need to be able to override the setting determined by Python via locale introspection, which may be wrong.

The env var is only meant as a way to solve encoding problems in special situations where the locale cannot be used to determine the file system or input/output encoding.

--
Marc-Andre Lemburg
eGenix.com (Oct 07 2010)
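For readers who want to check whether the two encodings differ on their own machine, the following is a small sketch using only the standard library. The same_codec() helper is hypothetical; it merely normalizes codec names so that spellings such as "UTF-8" and "utf8" compare equal.

```python
import codecs
import locale
import sys


def same_codec(a: str, b: str) -> bool:
    """Compare two encoding names after normalizing them via the codec registry."""
    return codecs.lookup(a).name == codecs.lookup(b).name


# Encoding Python uses to convert file names to and from bytes.
fs_encoding = sys.getfilesystemencoding()

# Encoding reported for the user's locale (False: do not call setlocale()).
locale_encoding = locale.getpreferredencoding(False)

print("filesystem encoding:", fs_encoding)
print("locale encoding:    ", locale_encoding)
print("same codec:         ", same_codec(fs_encoding, locale_encoding))
```

On a Unix account running under the C locale with a UTF-8 file system, the two values may differ, which is the situation described above.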
On Thu, Oct 07, 2010 at 06:35:09PM +0200, M.-A. Lemburg wrote:
It is well possible that the two are different. Mac OS X is just one example. Another common example is having a Unix account using the C locale (=ASCII) while working on a UTF-8 file system.
My filesystems are always koi8-r, but sometimes I work with programs in utf-8 locale. Just an example...

Oleg.
--
Oleg Broytman http://phd.pp.ru/ phd@phd.pp.ru
Programmers don't die, they just GOSUB without RETURN.
On Thursday 07 October 2010 18:44:19, Oleg Broytman wrote:
On Thu, Oct 07, 2010 at 06:35:09PM +0200, M.-A. Lemburg wrote:
It is well possible that the two are different. Mac OS X is just one example. Another common example is having a Unix account using the C locale (=ASCII) while working on a UTF-8 file system.
My filesystems are always koi8-r, but sometimes I work with programs in utf-8 locale. Just an example...
Are programs able to correctly display non-ASCII filenames if your locale encoding is different from your filesystem encoding?

--
Victor Stinner
http://www.haypocalc.com/
On Thu, Oct 07, 2010 at 09:12:13PM +0200, Victor Stinner wrote:
On Thursday 07 October 2010 18:44:19, Oleg Broytman wrote:
My filesystems are always koi8-r, but sometimes I work with programs in utf-8 locale. Just an example...
Are programs able to correctly display non-ASCII filenames if your locale encoding is different from your filesystem encoding?
Most of them don't because, as you say, most programs assume the filesystem encoding is the same as the stdio locale encoding. But some programs are cleverer: for example, one can set the G_FILENAME_ENCODING environment variable to guide GTK2/GLib programs; it can be a fixed encoding or the special value "@locale".

On the other hand, there are programs that ignore the locale completely and read/write filenames using their own fixed encoding; for example, the Transmission BitTorrent client reads and writes files in the encoding defined in the .torrent metafile.

Oleg.
--
Oleg Broytman http://phd.pp.ru/ phd@phd.pp.ru
Programmers don't die, they just GOSUB without RETURN.
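The G_FILENAME_ENCODING convention can be sketched roughly as follows. This is an illustration only, not GLib's actual implementation: the resolve_filename_encoding() function is hypothetical, it looks only at the first entry of the variable, and the UTF-8 fallback used when the variable is unset is an assumption of this sketch.

```python
def resolve_filename_encoding(environ, locale_encoding, default="utf-8"):
    """Pick a filename encoding from a G_FILENAME_ENCODING-style variable.

    environ:         mapping of environment variables (e.g. os.environ)
    locale_encoding: encoding of the current locale, supplied by the caller
    default:         fallback when the variable is unset (assumed here)
    """
    value = environ.get("G_FILENAME_ENCODING", "")
    # Only the first comma-separated entry is considered in this sketch.
    first = value.split(",")[0].strip()
    if not first:
        return default
    if first == "@locale":
        # Special value: follow whatever the locale says.
        return locale_encoding
    # Otherwise the value is taken as a fixed encoding name.
    return first


print(resolve_filename_encoding({"G_FILENAME_ENCODING": "koi8-r"}, "utf-8"))
print(resolve_filename_encoding({"G_FILENAME_ENCODING": "@locale"}, "iso-8859-1"))
print(resolve_filename_encoding({}, "iso-8859-1"))
```

The point of such a setting is that the filename encoding becomes an explicit choice, with "@locale" as the opt-in for programs that really do want the two to match.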
On Thursday 07 October 2010 18:35:09, M.-A. Lemburg wrote:
You have to differentiate between the meaning of a file system encoding and the locale:
A file system encoding defines how the applications interact with the file system.
A locale defines how the user expects to interact with the application.
What is the encoding of the command line arguments? Locale or filesystem encoding? Is it different if an argument is a filename or a path?

--
Victor Stinner
http://www.haypocalc.com/
participants (3): M.-A. Lemburg, Oleg Broytman, Victor Stinner