Filename as byte string in python 2.6 or 3.0?
Hi,

I read that Python 2.6 is planned for Wednesday. One bug is still open and important for me: Python 2.6/3.0 are unable to use filenames as byte strings. http://bugs.python.org/issue3187

The problem
===========

On Windows, all filenames are unicode strings (I guess UTF-16-LE), but on UNIX, for historical reasons, filenames are byte strings. On Linux you can expect UTF-8-valid filenames, but sometimes (eg. after a copy from a FAT32 USB key to an ext3 filesystem) you get an invalid filename: a byte string in a charset different from your filesystem encoding (utf-8).

Python functions using filenames
================================

In Python you have (incomplete list):
 - filename producers: os.listdir(), os.walk(), glob.glob()
 - filename manipulation: os.path.*()
 - file access: open(), os.unlink(), shutil.rmtree()

If you give unicode to a producer, it returns unicode _or_ byte strings (the type may change for each filename :-/). Guido proposed to break this behaviour: raise an exception if the unicode conversion fails. We may consider an option like "skip invalid". If you give bytes to a producer, it only returns byte strings. Great.

Filename manipulation: in Python 2.6/3.0, os.path.*() is not compatible with the type "bytes". So you cannot use os.path.join(<your unicode path>, <bytes filename>) *nor* os.path.join(<your bytes path>, <bytes filename>), because os.path.join() (eg. the posix version) uses path.endswith('/').

File access: open() rejects the type bytes (it's just a check; open() supports bytes if you remove it). As I remember, unlink() is compatible with bytes. But rmtree() fails because it uses os.path.join() (even if you give a bytes directory, join() fails).
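A quick sketch of how this looks on a modern Python 3 interpreter (the paths here are illustrative): the mixed str/bytes case is still rejected today, while the all-bytes posixpath case described above was eventually fixed.

```python
import os.path

# Mixing str and bytes path components raises TypeError in Python 3:
try:
    os.path.join('/tmp', b'file')
    mixed_ok = True
except TypeError:
    mixed_ok = False
print("mixed str/bytes join allowed:", mixed_ok)

# The all-bytes case (broken in 3.0's posixpath) was later fixed:
print(os.path.join(b'/tmp', b'file'))
```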
Solutions
=========

 - producers: unicode => *only* unicode // bytes => bytes
 - manipulation: support both unicode and bytes, but avoid (where possible) mixing bytes and characters
 - open(): allow bytes

I implemented these solutions as a patch set attached to issue #3187:
 * posix_path_bytes.patch: fix posixpath.join() to support bytes
 * io_byte_filename.patch: open() allows bytes filenames
 * fnmatch_bytes.patch: patch fnmatch.filter() to accept bytes filenames
 * glob1_bytes.patch: fix glob.glob() to accept invalid directory names

Mmmh, there is no patch yet to stop os.listdir() on an invalid filename.

Priority
========

I think the problem is important because it's a regression from 2.5 to 2.6/3.0. Python 2.5 uses byte-string filenames, so it was possible to open/unlink files with "invalid" names (since they're not unicode but bytes). Well, if it's too late for the final versions, this problem should at least be fixed quickly.

Test the problem
================

Example to create invalid filenames on Linux:

$ mkdir /tmp/test
$ cd /tmp/test
$ touch $(echo -e "a\xffb")
$ mkdir $(echo -e "dir\xffname")
$ touch $(echo -e "dir\xffname/file")
$ find .
./a?b
./dir?name
./dir?name/file

Python 2.5:
>>> import os, shutil
>>> os.listdir('.')
['a\xffb', 'dir\xffname']
>>> open(os.listdir('.')[0]).close()   # open file: ok
>>> os.unlink(os.listdir('.')[0])      # remove file: ok
>>> os.listdir('.')
['dir\xffname']
>>> shutil.rmtree(os.listdir('.')[0])  # remove dir: ok
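The same setup can be reproduced from Python itself with the bytes APIs. A sketch, assuming a POSIX filesystem that accepts arbitrary non-NUL bytes in names (e.g. ext3/ext4); filesystems that enforce UTF-8 names may refuse the creation:

```python
import os
import tempfile

d = tempfile.mkdtemp().encode()        # handle the directory as bytes
try:
    open(os.path.join(d, b'a\xffb'), 'wb').close()
except OSError:
    pass                               # the filesystem refused the raw bytes
entries = os.listdir(d)                # bytes argument -> bytes results
print(entries)                         # e.g. [b'a\xffb']
for e in entries:                      # clean up using the bytes APIs
    os.unlink(os.path.join(d, e))
os.rmdir(d)
```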
Wrong solutions
===============

New type
--------

I proposed an ugly type "InvalidFilename" mixing bytes and characters. As everybody using unicode knows, it's a bad idea :-) (and it introduces a new type).

Convert bytes to unicode (replace)
----------------------------------

unicode_filename = unicode(bytes_filename, charset, "replace")

Ok, you get valid unicode strings which can be used with os.path.join() & friends, but open() or unlink() will fail because this filename doesn't exist!

-- Victor Stinner aka haypo http://www.haypocalc.com/blog/
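The loss in the second "wrong solution" is easy to demonstrate: once a byte has been replaced by U+FFFD, re-encoding no longer yields the original name, so the resulting string cannot be used to reach the file.

```python
raw = b'a\xffb'                            # the invalid filename from above
decoded = raw.decode('utf-8', 'replace')   # valid unicode: 'a\ufffdb'
reencoded = decoded.encode('utf-8')        # but the original 0xff byte is gone
print(ascii(decoded), reencoded)
```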
Hello Victor,
2008/9/27 Victor Stinner
Hi,
I read that Python 2.6 is planned for Wednesday. One bug is still open and important for me: Python 2.6/3.0 are unable to use filenames as byte strings. http://bugs.python.org/issue3187 [...]
Is it really a 2.6 problem? I could not find any difference with 2.5 in this regard, the tests you propose still pass.
Filename manipulation: in python 2.6/3.0, os.path.*() is not compatible with the type "bytes"
With python 2.6,
>>> bytes is str
True
But I agree that this is THE unresolved issue of python 3.0. -- Amaury Forgeot d'Arc
On Saturday 27 September 2008 14:04:25, Victor Stinner wrote:
I read that Python 2.6 is planned for Wednesday. One bug is still open and important for me: Python 2.6/3.0 are unable to use filenames as byte strings. http://bugs.python.org/issue3187
Ooops, as Amaury noticed, the problem is specific to Python 3.0. My example works correctly with Python 2.6:
----------
$ find .
./a?b
./dir?name
./dir?name/file
$ ~/prog/python-trunk/python
Python 2.6rc2+ (trunk:66627M, Sep 26 2008, 19:03:31)
>>> import os, shutil
>>> os.listdir('.')
['a\xffb', 'dir\xffname']
>>> open(os.listdir('.')[0]).close()
>>> os.unlink(os.listdir('.')[0])
>>> os.listdir('.')
['dir\xffname']
>>> shutil.rmtree(os.listdir('.')[0])
Same test with Python 3.0:
----------
$ pwd
/tmp/test
$ find .
./a?b
./dir?name
./dir?name/file
$ ~/prog/py3k/python
Python 3.0rc1+ (py3k:66627M, Sep 26 2008, 18:10:03)
>>> import os, shutil
>>> os.listdir('.')
[b'a\xffb', b'dir\xffname']
>>> open(os.listdir('.')[0]).close()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NOT FOUND
>>> os.unlink(os.listdir('.')[0])
>>> os.listdir('.')
[b'dir\xffname']
>>> shutil.rmtree(os.listdir('.')[0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NOT FOUND
Results:
 * open() doesn't support bytes
 * unlink() supports bytes
 * shutil.rmtree() doesn't support bytes

Another example to test chdir()/getcwd():
----------
$ pwd
/tmp/test
$ ~/prog/py3k/python
Python 3.0rc1+ (py3k:66627M, Sep 26 2008, 18:10:03)
>>> import os, shutil
>>> os.getcwd()
'/tmp/test'
>>> os.chdir(b'/tmp/test/dir\xffname')
>>> os.getcwd()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NOT FOUND
Results:
 * chdir() supports byte filenames
 * getcwd() fails

-- Victor Stinner aka haypo http://www.haypocalc.com/blog/
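For the getcwd() half of this, CPython ended up providing a bytes variant, os.getcwdb(), which returns the working directory as raw bytes and therefore never has to decode:

```python
import os

# os.getcwd() decodes the path and so could fail for an undecodable cwd
# in 3.0; os.getcwdb() returns the raw bytes instead.
print(type(os.getcwd()), type(os.getcwdb()))
```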
I think that the problem is important because it's a regression from 2.5 to 2.6/3.0. Python 2.5 uses bytes filename, so it was possible to open/unlink "invalid" unicode strings (since it's not unicode but bytes).
I'd like to stress that the problem is *not* a regression from 2.5 to 2.6.

As for 3.0, I'd like to argue that the problem is a minor issue. Even though you may run into file names that can't be decoded, that happening really indicates some bigger problem in the management of the system where it happens, and the proper solution (IMO) should be to change the system (leaving open the question whether or not Python should also be changed to work with such broken systems).

Regards, Martin
On Sat, Sep 27, 2008 at 7:41 PM, "Martin v. Löwis"
As for 3.0, I'd like to argue that the problem is a minor issue. Even though you may run into file names that can't be decoded, that happening really indicates some bigger problem in the management of the system where this happens, and the proper solution (IMO) should be to change the system (leaving open the question whether or not Python should be also changed to work with such broken systems).
I can't agree here. File handling is a fundamental operation and I would expect something like:
for fname in os.listdir('.'):
    if os.path.isfile(fname):
        file(fname)
to work for all files. To have to know to put in special handling for certain corner case filenames or worse to not be able to open some files at all would be a serious loss. It would also complicate migrating code correctly to 3.0. Regardless of whose fault the underlying issue is, someone has to deal with the problem and if core Python doesn't, each developer who encounters the problem will have to come up with his/her own solution.
I can't agree here. File handling is a fundamental operation and I would expect something like:
for fname in os.listdir('.'):
    if os.path.isfile(fname):
        file(fname)
to work for all files.
I agree. However, if it fails: is it a bug in Python, or in the administration of the system it runs on?
To have to know to put in special handling for certain corner case filenames or worse to not be able to open some files at all would be a serious loss. It would also complicate migrating code correctly to 3.0.
I agree completely. Unfortunately, all proposed solutions *do* require special handling for certain corner cases.
Regardless of whose fault the underlying issue is, someone has to deal with the problem and if core Python doesn't, each developer who encounters the problem will have to come up with his/her own solution.
This is quite in the abstract. Can you be more specific? Regards, Martin
On Saturday 27 September 2008 19:41:50, Martin v. Löwis wrote:
I think that the problem is important because it's a regression from 2.5 to 2.6/3.0. Python 2.5 uses bytes filename, so it was possible to open/unlink "invalid" unicode strings (since it's not unicode but bytes).
I'd like to stress that the problem is *not* a regression from 2.5 to 2.6.
Sorry, 2.6 has no problem. This issue is a regression from Python 2 to Python 3.
Even though you may run into file names that can't be decoded, that happening really indicates some bigger problem in the management of the system where this happens, and the proper solution (IMO) should be to change the system
In the *real world*, people use different file systems and different operating systems, and some broken programs and/or operating systems create invalid filenames. It could be a configuration problem (wrong charset definition in /etc/fstab) or a charset autodetection failure, but who cares? Sometimes you don't care that your music directory contains some strange filenames; you just want to hear the music. Or maybe you would like to *fix* the encoding problem, which is not possible using the Python3 trunk.

People having this problem are, for example, people who write or use a backup program. This week someone asked me (on IRC) how to manage filenames in pure unicode with Python 2.5 and Linux... which was impossible because one of his filenames was invalid (maybe a file from a Windows system). So he switched to raw (bytes) filenames.

In a perfect world, everybody uses Linux with utf-8 filenames, and only programs in Python using space indentation :-D

-- Victor Stinner aka haypo http://www.haypocalc.com/blog/
On 9/27/08, "Martin v. Löwis"
I think that the problem is important because it's a regression from 2.5 to 2.6/3.0. Python 2.5 uses bytes filename, so it was possible to open/unlink "invalid" unicode strings (since it's not unicode but bytes).
I'd like to stress that the problem is *not* a regression from 2.5 to 2.6.
As for 3.0, I'd like to argue that the problem is a minor issue. Even though you may run into file names that can't be decoded, that happening really indicates some bigger problem in the management of the system where this happens, and the proper solution (IMO) should be to change the system (leaving open the question whether or not Python should be also changed to work with such broken systems).
Regards, Martin
Note: bcc python-dev, cc python-3000.

"Broken" systems will always exist. Code to deal with them must be possible to write in Python 3.0. Since any given path (not just a filesystem path) can have its own encoding, it makes the most sense to me to let the OS deal with the errors and not try to enforce a bytes-vs-string encoding type at the Python library level.

-gps
* Gregory P. Smith
since any given path (not just fs) can have its own encoding it makes the most sense to me to let the OS deal with the errors and not try to enforce bytes vs string encoding type at the python lib. level.
But the underlying APIs differ; Linux uses bytestrings for filenames, whereas I believe the native Windows APIs take "wide" (ie. Unicode) strings. -- mithrandi, i Ainil en-Balandor, a faer Ambar
On Sunday 28 September 2008, Gregory P. Smith wrote:
"broken" systems will always exist. Code to deal with them must be possible to write in python 3.0.
since any given path (not just fs) can have its own encoding it makes the most sense to me to let the OS deal with the errors and not try to enforce bytes vs string encoding type at the python lib. level.
Actually I'm afraid that that isn't really useful. I, too, would like to kick people's backsides to get them to fix their systems or use the proper codepage while mounting etc., but that is not going to happen soon. Just ignoring those broken systems is tempting, but alienating a large group of users isn't IMHO worth it.

Instead, I'd like to present a different approach:

1. For POSIX platforms (using a byte string for the path): here, the first step is to convert the path to Unicode according to the locale's CTYPE category. Hopefully it will be UTF-8, but codepages should work too. If there is a segment (a byte sequence between two path separators) where this doesn't work, it uses an ASCII mapping where possible and codepoints from the "Private Use Area" (PUA) of Unicode for the non-decodable bytes. In order to pass this path to fopen(), each segment would be converted back to a byte string using the locale's CTYPE category, except for segments which use the PUA, where it simply emits the original bytes.

2. For win32 platforms, the path is already Unicode (UTF-16) and the whole problem is solved or not solved by the OS.

In the end, both approaches yield a path represented by a Unicode string for intermediate use, which provides maximum flexibility. Further, it preserves "broken" encodings by simply mapping their byte values to the PUA of Unicode.

Maybe not using a string to represent a path would be a good idea, too. At least it would make it very clear that the string is not completely free-form.

Uli

-- Sator Laser GmbH, Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932
Visit our website at http://www.satorlaser.de/
This e-mail, including all attachments, is intended only for the addressee and may contain confidential information.
Please notify the sender immediately if you are not the intended recipient. In that case the e-mail must be deleted and may not be read, forwarded, published or otherwise used. E-mails can be read by third parties and may contain viruses as well as unauthorized modifications. Sator Laser GmbH is not liable for these consequences.
On 2008-09-29 12:50, Ulrich Eckhardt wrote:
On Sunday 28 September 2008, Gregory P. Smith wrote:
"broken" systems will always exist. Code to deal with them must be possible to write in python 3.0.
since any given path (not just fs) can have its own encoding it makes the most sense to me to let the OS deal with the errors and not try to enforce bytes vs string encoding type at the python lib. level.
Actually I'm afraid that that isn't really useful. I, too, would like to kick people's backsides to get them to fix their systems or use the proper codepage while mounting etc., but that is not going to happen soon. Just ignoring those broken systems is tempting, but alienating a large group of users isn't IMHO worth it.
Instead, I'd like to present a different approach:
1. For POSIX platforms (using a byte string for the path): Here, the first approach is to convert the path to Unicode, according to the locale's CTYPE category. Hopefully, it will be UTF-8, but also codepages should work. If there is a segment (a byte sequence between two path separators) where it doesn't work, it uses an ASCII mapping where possible and codepoints from the "Private Use Area" (PUA) of Unicode for the non-decodable bytes. In order to pass this path to fopen(), each segment would be converted to a byte string again, using the locale's CTYPE category except for segments which use the PUA where it simply encodes the original bytes.
I'm not sure how this would work. How would you map the private use code points back to bytes? Using a special codec that knows about these code points? How would fopen() know to use that special codec instead of e.g. the UTF-8 codec?

BTW: Private use areas in Unicode are meant for e.g. company-specific code points. Using them for escaping purposes is likely to cause problems due to assignment clashes.

Regarding the subject of file names: on Unix, it's well possible to have to deal with 2-3 different file systems mounted on a machine. Each of those may use a different file name encoding or not support file name encoding at all. If the OS doesn't guarantee a consistent file name encoding, then why should Python try to emulate this on top of the OS?

I think it's more important to be able to open a file than to have a readable file name when printing it to stdout. E.g. I wouldn't be able to tell whether some Chinese file name makes sense or not, but if I know that all files in a directory are meant for processing, I should be able to iterate over them regardless of whether they make sense or not.
2. For win32 platforms, the path is already Unicode (UTF-16) and the whole problem is solved or not solved by the OS.
In the end, both approaches yield a path represented by a Unicode string for intermediate use, which provides maximum flexibility. Further, it preserves "broken" encodings by simply mapping their byte-values to the PUA of Unicode. Maybe not using a string to represent a path would be a good idea, too. At least it would make it very clear that the string is not completely free-form.
-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 29 2008)
Python/Zope Consulting and Support ... http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
:::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611
On Monday 29 September 2008, M.-A. Lemburg wrote:
On 2008-09-29 12:50, Ulrich Eckhardt wrote:
1. For POSIX platforms (using a byte string for the path): Here, the first approach is to convert the path to Unicode, according to the locale's CTYPE category. Hopefully, it will be UTF-8, but also codepages should work. If there is a segment (a byte sequence between two path separators) where it doesn't work, it uses an ASCII mapping where possible and codepoints from the "Private Use Area" (PUA) of Unicode for the non-decodable bytes. In order to pass this path to fopen(), each segment would be converted to a byte string again, using the locale's CTYPE category except for segments which use the PUA where it simply encodes the original bytes.
I'm not sure how this would work. How would you map the private use code points back to bytes ? Using a special codec that knows about these code points ? How would the fopen() know to use that special codec instead of e.g. the UTF-8 codec ?
Sorry, I wasn't clear enough. I'll try to explain further... Let's assume we have a filename like this:

  0xc2 0xa9 0x2f 0x7f

The first two bytes are the copyright sign encoded in UTF-8, followed by a slash (0x2f, path separator) and a character encoded in an unknown codepage (0x7f is not printable ASCII!).

The first thing when receiving that path from the system would be to split it into segments; here we get two of them, one with 0xc2 0xa9 and the other with 0x7f. This relies on the fact that the separator (slash/0x2f) is rather universal. (Note: I'm not sure about encodings like BIG5, i.e. ones that are neither UTF-8 nor derived from ASCII.)

For each segment, we apply the locale's CTYPE facet: the first segment converts to the Unicode codepoint 0xa9, while the second one fails to convert. So, for the second one, we simply check each byte: if it is valid and printable ASCII (0x7f isn't), we emit the byte as a Unicode codepoint; otherwise we map it to the PUA. The PUA reserves 0xe000 to 0xf8ff for private uses; I would simply encode the byte 0x7f as 0xe07f, i.e. map it to the beginning of that range. Eventually, we end up with the following Unicode codepoints: 0xa9, 0x2f, 0xe07f.

When converting to a byte string for use with fopen(), we simply inspect the supplied string again. If a segment contains elements of the PUA, we reverse the mapping for those and leave the other bytes in that segment as-is. For all other segments, we apply the CTYPE conversion.

Notes:
 * This effectively converts the current path representation (a string) into a sequence of segments, where each segment can be either a fully Unicode-capable string or a raw byte string without any known interpretation. However, instead of using an array for that, it uses a string, which is what most people's code expects anyway.
 * You could also work on a byte basis instead of splitting the path into segments first.
I just assumed that a single segment will not contain valid UTF-8 sequences mixed with invalid ones. A path however can contain both correctly and incorrectly encoded segments.
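A rough sketch of the per-segment mapping described above. The helper names and the choice of PUA block are illustrative; this is not an existing Python API:

```python
# Map undecodable bytes into the BMP Private Use Area and back.
PUA_BASE = 0xE000

def decode_segment(seg, encoding='utf-8'):
    try:
        return seg.decode(encoding)
    except UnicodeDecodeError:
        # Fallback: printable ASCII passes through unchanged,
        # everything else is escaped into the PUA.
        return ''.join(chr(b) if 0x20 <= b < 0x7F else chr(PUA_BASE + b)
                       for b in seg)

def encode_segment(text, encoding='utf-8'):
    if any(PUA_BASE <= ord(c) <= PUA_BASE + 0xFF for c in text):
        # Reverse the PUA mapping; plain ASCII characters stay as-is.
        return bytes(ord(c) - PUA_BASE if ord(c) >= PUA_BASE else ord(c)
                     for c in text)
    return text.encode(encoding)

# 0xc2 0xa9 decodes normally; 0x7f / 0xff fall back to the PUA mapping:
print(ascii(decode_segment(b'\xc2\xa9')))
print(ascii(decode_segment(b'\x7f\xff')))
print(encode_segment(decode_segment(b'\x7f\xff')))   # round-trips exactly
```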
BTW: Private use areas in Unicode are meant for e.g. company specific code points. Using them for escaping purposes is likely to cause problems due to assignment clashes.
I'm not sure if the use I proposed is correct according to the intended use of the PUA. I know that ideally no such string would escape from Python, i.e. it should only be visible internally. I would guess that that is something the PUA was intended for.

Uli
On 11:59 am, eckhardt@satorlaser.com wrote:
Sorry, I wasn't clear enough. I'll try to explain further...
Let's assume we have a filename like this:
0xc2 0xa9 0x2f 0x7f
The first two bytes are the copyright sign encoded in UTF-8, followed by a slash (0x2f, path separator) and a character encoded in an unknown codepage (0x7f is not ASCII!).
Originally I thought that this was a valid idea, but then it became clear that this could be a problem. Consider a filename which includes a UTF-8 encoding of a PUA code point.
I'm not sure if the use I proposed is correct according to the intended use of the PUA. I know that ideally no such string would escape from Python, i.e. it should only be visible internally. I would guess that that is something the PUA was intended for.
Viewing the PUA with GNOME charmap, I can see that many code points there have character renderings on my Ubuntu system. I have to assume, therefore, that there are other (and potentially conflicting) uses for this unicode feature.
Originally I thought that this was a valid idea, but then it became clear that this could be a problem. Consider a filename which includes a UTF-8 encoding of a PUA code point.
I still think it's a valid idea. For non-UTF-8 file system encodings, use PUA characters, and generate them through an error handler. If the file system encoding is UTF-8, use UTF-8b instead as the file system encoding.
Viewing the PUA with GNOME charmap, I can see that many code points there have character renderings on my Ubuntu system. I have to assume, therefore, that there are other (and potentially conflicting) uses for this unicode feature.
Depends on how you use it. If you use the PUA block 1 (i.e. U+E000..U+F8FF), there is a realistic chance of collision. If you use the Plane 15 or Plane 16 PUA blocks, there is currently zero chance of collision (AFAIK). The PUA is widely used for additional characters in TrueType, but I don't think many tools even support planes 15 and 16 for generating or rendering fonts (it may even be that the TrueType/OpenType format doesn't support them in the first place). However, Python can make use of these planes fairly easily, even in 2-byte mode (through UTF-16).

Regards, Martin
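Martin's "UTF-8b" suggestion is essentially what later shipped (in Python 3.1, via PEP 383) as the 'surrogateescape' error handler: undecodable bytes become lone surrogates in U+DC80..U+DCFF and round-trip exactly.

```python
raw = b'dir\xffname'
text = raw.decode('utf-8', 'surrogateescape')
print(ascii(text))                       # the 0xff byte became '\udcff'
roundtrip = text.encode('utf-8', 'surrogateescape')
print(roundtrip == raw)                  # exact round-trip
```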
On Mon, Sep 29, 2008 at 4:49 PM, "Martin v. Löwis"
Originally I thought that this was a valid idea, but then it became clear that this could be a problem. Consider a filename which includes a UTF-8 encoding of a PUA code point.
I still think it's a valid idea. For non-UTF-8 file system encodings, use PUA characters, and generate them through an error handler.
If the file system encoding is UTF-8, use UTF-8b instead as the file system encoding.
Viewing the PUA with GNOME charmap, I can see that many code points there have character renderings on my Ubuntu system. I have to assume, therefore, that there are other (and potentially conflicting) uses for this unicode feature.
Depends on how you use it. If you use the PUA block 1 (i.e. U+E000..U+F8FF), there is a realistic chance of collision.
If you use the Plane 15 or Plane 16 PUA blocks, there is currently zero chance of collision (AFAIK). The PUA is widely used for additional characters in TrueType, but I don't think many tools even support planes 15 and 16 for generating or rendering fonts (it may even be that the TrueType/OpenType format doesn't support them in the first place). However, Python can make use of these planes fairly easily, even in 2-byte mode (through UTF-16).
An example where lossy conversion fails:

1) Create a file using a UTF-8 app, with a PUA character (or ambiguous scalar of choice) in the filename.
2) List the dir in Python; the file name is now a unicode object containing the PUA character.
3) Attempt to open it: the file name gets converted to a malformed UTF-8 sequence which doesn't match the name on disk, so opening fails.

Lossy conversion just moves around what gets treated as garbage. As all valid unicode scalars can be round-tripped, there's no way to create a valid unicode file name without being lossy. The alternative is to not be valid unicode, but since we can't use such objects with external libs, can't even print them, we might as well call them something else. We already have a name for that: bytes.

-- Adam Olsen, aka Rhamphoryncus
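Adam's collision can be shown concretely with the naive BMP-PUA escape discussed upthread (PUA_BASE here is an illustrative choice): a genuine PUA character in a filename is indistinguishable from the escape of a raw byte, so the reverse mapping would re-encode it to the wrong bytes.

```python
PUA_BASE = 0xE000                   # illustrative escape block

real_char = '\ue0ff'                # a genuine PUA character in a filename
escaped = chr(PUA_BASE + 0xFF)      # the escape for the raw byte 0xff
print(real_char == escaped)         # the two strings are identical

# Reversing the escape would yield b'\xff', but the genuine character's
# on-disk UTF-8 form is three bytes long:
print(real_char.encode('utf-8'), bytes([0xFF]))
```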
On Tuesday 30 September 2008 01:31:45, Adam Olsen wrote:
The alternative is not be valid unicode, but since we can't use such objects with external libs, can't even print them, we might as well call them something else. We already have a name for that: bytes.
:-)
Adam Olsen wrote:
Lossy conversion just moves around what gets treated as garbage. As all valid unicode scalars can be round tripped, there's no way to create a valid unicode file name without being lossy. The alternative is not be valid unicode, but since we can't use such objects with external libs, can't even print them, we might as well call them something else. We already have a name for that: bytes.
To my mind, there are two kinds of app in the world when it comes to file paths:

1) "Normal" apps (e.g. a word processor), that are only interested in files with sane, well-formed file names that can be properly decoded to Unicode with the filesystem encoding identified by Python. If there is invalid data on the filesystem, they don't care and don't want to see it or have to deal with it.

2) "Filesystem" apps (e.g. a filesystem explorer), that need to be able to deal with malformed filenames that may not decode properly using the identified filesystem encoding.

For the former category of apps, the presence of a malformed filename should NOT disrupt the processing of well-formed files and directories. Those applications should "just work", even if the underlying filesystem has a few broken filenames.

The latter category of applications need some way of defining their own application-specific handling of malformed names. That screams "callback" to me, and one mechanism to achieve it would be to expose the unicode "errors" argument for filesystem operations that return file paths (e.g. os.getcwd(), os.listdir(), os.readlink(), os.walk()). Once that was exposed, the existing error handling machinery in the codecs module could be used to allow applications to define their own custom error handling for Unicode decode errors in these operations (e.g. set "codecs.register_error('bad_filepath', handle_filepath_error)", then use "errors='bad_filepath'" in the relevant os API calls).

The default handling could be left at "strict", with os.listdir() and os.walk() specifically ignoring path entries that trigger UnicodeDecodeError. getcwd() and readlink() could just propagate the exception, since they have no other information to return.

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia | http://www.boredomandlaziness.org
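The callback half of Nick's idea can be sketched with today's codecs machinery. The handler name and the replacement policy below are made up for illustration; the os-level "errors" parameter he describes does not exist.

```python
import codecs

def handle_filepath_error(exc):
    # Replace each undecodable byte with U+FFFD and resume decoding
    # after it; a real handler could log or escape instead.
    return ('\ufffd', exc.end)

codecs.register_error('bad_filepath', handle_filepath_error)
print(ascii(b'dir\xffname'.decode('utf-8', 'bad_filepath')))
```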
On Tue, 2008-09-30 at 19:45 +1000, Nick Coghlan wrote:
To my mind, there are two kinds of app in the world when it comes to file paths: 1) "Normal" apps (e.g. a word processor), that are only interested in files with sane, well-formed file names that can be properly decoded to Unicode with the filesystem encoding identified by Python. If there is invalid data on the filesystem, they don't care and don't want to see it or have to deal with it.
I am not convinced that a word processor can just ignore files with (what it thinks are) undecodable file names. In countries with a history of incompatible national encodings, such file names crop up very often, sometimes as a natural consequence of data migrating from older systems to newer ones. You can and do encounter "invalid" file names in the filesystems of mainstream users even without them using buggy or obsolete software.
On Tue, Sep 30, 2008 at 3:52 AM, Hrvoje Nikšić
On Tue, 2008-09-30 at 19:45 +1000, Nick Coghlan wrote:
To my mind, there are two kinds of app in the world when it comes to file paths: 1) "Normal" apps (e.g. a word processor), that are only interested in files with sane, well-formed file names that can be properly decoded to Unicode with the filesystem encoding identified by Python. If there is invalid data on the filesystem, they don't care and don't want to see it or have to deal with it.
I am not convinced that a word processor can just ignore files with (what it thinks are) undecodable file names. In countries with a history of incompatible national encodings, such file names crop up very often, sometimes as a natural consequence of data migrating from older systems to newer ones. You can and do encounter "invalid" file names in the filesystems of mainstream users even without them using buggy or obsolete software.
This is a quality of implementation issue. Either the word processor is written to support "undecodable" files, or it isn't. If it isn't, there's nothing that can be done about it (short of buying another wordprocessor) and it shouldn't be crippled by the mere *presence* of an undecodable file in a directory. I can think of lots of apps that have a sufficiently small or homogeneous audience (e.g. lots of in-house apps) that they don't need to care about such files, and these shouldn't break when they are used in the vicinity of an undecodable filename -- it's enough if they just ignore it. -- --Guido van Rossum (home page: http://www.python.org/~guido/)
On Tue, 2008-09-30 at 07:26 -0700, Guido van Rossum wrote:
I am not convinced that a word processor can just ignore files with (what it thinks are) undecodable file names. In countries with a history of incompatible national encodings, such file names crop up very often, sometimes as a natural consequence of data migrating from older systems to newer ones. You can and do encounter "invalid" file names in the filesystems of mainstream users even without them using buggy or obsolete software.
This is a quality of implementation issue. Either the word processor is written to support "undecodable" files, or it isn't. If it isn't, there's nothing that can be done about it (short of buying another wordprocessor)
I agree with this. I just believe the underlying Python APIs shouldn't make it impossible (or unnecessarily hard) for the word processor to implement the display of files with undecodable names. For example, implementing os.listdir to return the file names as Unicode subclasses with the ability to access the underlying bytes (automatically recognized by open and friends) sounds like a good compromise that allows the word processor to both have its cake and eat it.
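A minimal sketch of what such a hybrid might look like (hypothetical — the `FileName` class, its `raw` attribute, and the `'replace'` fallback are my illustration of the idea, not an API that was ever implemented):

```python
# Hypothetical sketch of Hrvoje's idea: a str subclass that still
# carries the original undecoded bytes for open() and friends.
class FileName(str):
    def __new__(cls, raw: bytes, encoding: str = 'utf-8'):
        text = raw.decode(encoding, 'replace')   # human-readable form
        self = super().__new__(cls, text)
        self.raw = raw                           # the bytes the OS actually has
        return self

fn = FileName(b'a\xffb')
assert isinstance(fn, str)
assert fn == 'a\ufffdb'      # displayable, if mangled
assert fn.raw == b'a\xffb'   # and still round-trippable
```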
On 02:39 pm, hrvoje.niksic@avl.com wrote:
For example, implementing os.listdir to return the file names as Unicode subclasses with ability to access the underlying bytes (automatically recognized by open and friends) sounds like a good compromise that allows the word processor to both have the cake and eat it.
It really seems like the strategy of the current patch (which I believe Guido proposed) makes the most sense. Programs pass different arguments for different things:

- listdir(text) -> "I am thinking in unicode and I do not know about encodings, please give me only things that are proper unicode, because I don't want to deal with that."
- listdir(bytes) -> "I am thinking about bytes, I know about encodings. Just give me filenames as bytes and I will decode them myself or do other fancy things."

You can argue about whether this should really be 'listdiru' or 'globu' for explicitness, but when such a simple strategy with unambiguous types works, there's no reason to introduce some weird hybrid bytes/text type that will inevitably be a bug attractor.

Python's path abstractions have never been particularly high level, nor do I think they necessarily should be - at least, not until there's some community consensus about what a "high level path abstraction" really looks like. We're still wrestling with it in Twisted, and I can think of at least three ways that ours is wrong. And ours is the one that's doing the best, as far as I can tell :).

This proposal gives higher level software the information that it needs to construct appropriate paths. The one thing it doesn't do is expose the decoding rules for the higher-level applications to deal with. I am pretty sure I don't understand how the interaction between filesystem encoding and user locale works in that case, though, so I can't immediately recommend a way to do it.
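The type contract Glyph describes is observable in any released Python 3 — the argument type selects the result type. (Modern Python additionally decodes undecodable names with surrogateescape rather than skipping them, so this only demonstrates the str-in/str-out, bytes-in/bytes-out half of the proposal.)

```python
import os
import tempfile

d = tempfile.mkdtemp()
open(os.path.join(d, 'hello.txt'), 'w').close()

text_names = os.listdir(d)              # str in  -> str out
byte_names = os.listdir(os.fsencode(d)) # bytes in -> bytes out

assert all(isinstance(n, str) for n in text_names)
assert all(isinstance(n, bytes) for n in byte_names)
```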
On Tue, Sep 30, 2008 at 11:12 AM,
The one thing it doesn't do is expose the decoding rules for the higher-level applications to deal with. I am pretty sure I don't understand how the interaction between filesystem encoding and user locale works in that case, though, so I can't immediately recommend a way to do it.
You can ask what the filesystem encoding is with sys.getfilesystemencoding(). On my Linux box I can make this return anything I like by setting LC_CTYPE=en_US.<whatever> (as long as <whatever> is a recognized encoding). There are probably 5 other environment variables to influence this. :-( Of course that doesn't help for undecodable filenames, and in that case I don't think *anything* can help you unless you have a lot of additional knowledge about what the user might be doing, e.g. you know a few other encodings to try that make sense for their environment. -- --Guido van Rossum (home page: http://www.python.org/~guido/)
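A small helper built on the call Guido mentions — checking whether a given byte filename would survive decoding with the current filesystem encoding (the helper name is my own):

```python
import sys

fs_encoding = sys.getfilesystemencoding()

def is_decodable(raw: bytes) -> bool:
    """True if this byte filename decodes cleanly with the fs encoding."""
    try:
        raw.decode(fs_encoding)
        return True
    except UnicodeDecodeError:
        return False

# b'a\xffb' is the canonical "invalid" name from earlier in the thread;
# whether it decodes depends on the locale (it would under latin-1).
```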
On 06:16 pm, guido@python.org wrote:
On Tue, Sep 30, 2008 at 11:12 AM,
wrote: The one thing it doesn't do is expose the decoding rules for the higher-level applications to deal with. I am pretty sure I don't understand how the interaction between filesystem encoding and user locale works in that case, though, so I can't immediately recommend a way to do it.
You can ask what the filesystem encoding is with sys.getfilesystemencoding(). On my Linux box I can make this return anything I like by setting LC_CTYPE=en_US.<whatever> (as long as <whatever> is a recognized encoding). There are probably 5 other environment variables to influence this. :-(
Only 5? Great! :-)
Of course that doesn't help for undecodable filenames, and in that case I don't think *anything* can help you unless you have a lot of additional knowledge about what the user might be doing, e.g. you know a few other encodings to try that make sense for their environment.
There are other ways to glean this knowledge; for example, looking at the 'iocharset' or 'nls' mount options supplied to mount various filesystems. I thought maybe Python (or some C library call) might be invoking some logic that did something with data like that; if not, great, one day when I have some free time (meaning: never) I can implement that logic myself without duplicating a bunch of work.
On Tue, Sep 30, 2008 at 11:42 AM,
There are other ways to glean this knowledge; for example, looking at the 'iocharset' or 'nls' mount options supplied to mount various filesystems. I thought maybe Python (or some C library call) might be invoking some logic that did something with data like that; if not, great, one day when I have some free time (meaning: never) I can implement that logic myself without duplicating a bunch of work.
I know we could do a better job, but absent anyone who knows what they're doing we've chosen a fairly conservative approach. I certainly hope that someone will contribute some mean encoding-guessing code to the stdlib that users can use. I'm not sure if I'll ever endorse doing this automatically in io.open(), though I'd be fine with a convention like passing encoding="guess". -- --Guido van Rossum (home page: http://www.python.org/~guido/)
On 30 Sep, 09:37 pm, guido@python.org wrote:
On Tue, Sep 30, 2008 at 11:42 AM,
wrote: There are other ways to glean this knowledge; for example, looking at the 'iocharset' or 'nls' mount options supplied to mount various filesystems.
I know we could do a better job, but absent anyone who knows what they're doing we've chosen a fairly conservative approach. I certainly hope that someone will contribute some mean encoding-guessing code to the stdlib that users can use. I'm not sure if I'll ever endorse doing this automatically in io.open(), though I'd be fine with a convention like passing encoding="guess".
I think the conservative approach is actually correct, or rather, as close to correct as it is possible to get in this mess. Inspecting these fantastically obscure options is only likely to be helpful in a tool which tries to correct filesystem encoding errors on legacy data. I wouldn't even know about them if I hadn't written several such tools (well, just little scripts, really) in the past. I was just verifying that I wasn't missing some "right way" which would let someone else do the guesswork for me.

In reality, you have two options for filesystem encoding on Linux:

* UTF-8
* fall in a well and die

The OS will happily let you create a completely nonsensical environment where no application can possibly do anything reasonable: set LC_ALL to KOI8R, mount your USB keychain as Shift_JIS and your windows partition as ISO-8859-8. Of course nobody would actually _do_ this, because they want things to work, so everything is gradually evolving to a default of UTF-8 everywhere.

In practice, however, there are still problems with CIFS/SMB shares where other clients have different ideas about encoding. I've experienced this most commonly when sharing with Macs, which have very particular and different ideas about normalization, as has already been discussed in this thread.
On Tue, Sep 30, 2008 at 2:45 AM, Nick Coghlan
Adam Olsen wrote:
Lossy conversion just moves around what gets treated as garbage. As all valid unicode scalars can be round tripped, there's no way to create a valid unicode file name without being lossy. The alternative is not be valid unicode, but since we can't use such objects with external libs, can't even print them, we might as well call them something else. We already have a name for that: bytes.
To my mind, there are two kinds of app in the world when it comes to file paths: 1) "Normal" apps (e.g. a word processor), that are only interested in files with sane, well-formed file names that can be properly decoded to Unicode with the filesystem encoding identified by Python. If there is invalid data on the filesystem, they don't care and don't want to see it or have to deal with it. 2) "Filesystem" apps (e.g. a filesystem explorer), that need to be able to deal with malformed filenames that may not decode properly using the identified filesystem encoding.
For the former category of apps, the presence of a malformed filename should NOT disrupt the processing of well-formed files and directories. Those applications should "just work", even if the underlying filesystem has a few broken filenames.
Right. Totally agreed.
The latter category of applications need some way of defining their own application-specific handling of malformed names.
Agreed again.
That screams "callback" to me - and one mechanism to achieve that would be to expose the unicode "errors" argument for filesystem operations that return file paths (e.g. os.getcwd(), os.listdir(), os.readlink(), os.walk()).
Hm. This doesn't scream callback to me at all. I would never have thought of callbacks for this use case -- and I don't think it's a good idea. The callback would either be an extra argument to all system calls (bad, ugly etc., and why not go with the existing unicode encoding and error flags if we're adding extra args?) or would be global, where I'd be worried that it might interfere with the proper operation of library code that is several abstractions away from whoever installed the callback, not under their control, and not expecting the callback. I suppose I may have totally misunderstood your proposal, but in general I find callbacks unwieldy.
Once that was exposed, the existing error handling machinery in the codecs module could be used to allow applications to define their own custom error handling for Unicode decode errors in these operations. (e.g. set "codecs.register_error('bad_filepath', handle_filepath_error)", then use "errors='bad_filepath'" in the relevant os API calls)
The default handling could be left at "strict", with os.listdir() and os.walk() specifically ignoring path entries that trigger UnicodeDecodeError.
getcwd() and readlink() could just propagate the exception, since they have no other information to return.
-- --Guido van Rossum (home page: http://www.python.org/~guido/)
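Nick's errors-argument proposal builds on the error-handler registry that already exists in the codecs module; a minimal sketch of what registering a 'bad_filepath' handler would look like (the replace-with-U+FFFD policy here is my illustrative choice — a real handler could do anything):

```python
import codecs

def handle_filepath_error(exc):
    # Replace each undecodable byte with U+FFFD and resume decoding.
    if isinstance(exc, UnicodeDecodeError):
        return ('\ufffd' * (exc.end - exc.start), exc.end)
    raise exc

codecs.register_error('bad_filepath', handle_filepath_error)

# The thread's example of an invalid UTF-8 name:
name = b'dir\xffname'.decode('utf-8', errors='bad_filepath')
assert name == 'dir\ufffdname'
```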
Guido van Rossum wrote:
The callback would either be an extra argument to all system calls (bad, ugly etc., and why not go with the existing unicode encoding and error flags if we're adding extra args?) or would be global, where I'd be worried that it might interfere with the proper operation of library code that is several abstractions away from whoever installed the callback, not under their control, and not expecting the callback.
I suppose I may have totally misunderstood your proposal, but in general I find callbacks unwieldy.
Not really - later in the email, I actually pointed out that exposing the unicode errors flag for the implicit PyUnicode_Decode invocations would be enough to enable a callback mechanism.

However, James's post pointing out that this is a problem that also affects environment variables and command line arguments, not just file paths, completely kills any hope of a purely callback-based approach - that processing needs to "just work" without any additional intervention from the application.

Of the suggestions I've seen so far, I like Marcin's Mono-inspired NULL-escape codec idea the best. Since these strings all come from parts of the environment where NULLs are not permitted, a simple "'\0' in text" check will immediately identify any strings where decoding failed (for applications which care about the difference and want to try to do better), while applications which don't care will receive perfectly valid Python strings that can be passed around and manipulated as if the decoding error never happened.

Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org
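The NULL-escape idea can be prototyped today as an error handler (the escape format — a NUL followed by the raw byte's code point — is my guess at Marcin's scheme, not a specification):

```python
import codecs

def null_escape(exc):
    # Escape each undecodable byte as NUL + the byte's code point.
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        return (''.join('\x00' + chr(b) for b in bad), exc.end)
    raise exc

codecs.register_error('nullescape', null_escape)

name = b'a\xffb'.decode('utf-8', 'nullescape')
assert '\x00' in name        # the advertised "decoding failed" check
assert name == 'a\x00\xffb'
```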
On Tue, Sep 30, 2008 at 3:43 PM, Nick Coghlan
Guido van Rossum wrote:
The callback would either be an extra argument to all system calls (bad, ugly etc., and why not go with the existing unicode encoding and error flags if we're adding extra args?) or would be global, where I'd be worried that it might interfere with the proper operation of library code that is several abstractions away from whoever installed the callback, not under their control, and not expecting the callback.
I suppose I may have totally misunderstood your proposal, but in general I find callbacks unwieldy.
Not really - later in the email, I actually pointed out that exposing the unicode errors flag for the implicit PyUnicode_Decode invocations would be enough to enable a callback mechanism.
However, James's post pointing out that this is a problem that also affects environment variables and command line arguments, not just file paths completely kills any hope of purely callback based approach - that processing needs to "just work" without any additional intervention from the application.
Of the suggestions I've seen so far, I like Marcin's Mono-inspired NULL-escape codec idea the best. Since these strings all come from parts of the environment where NULLs are not permitted, a simple "'\0' in text" check will immediately identify any strings where decoding failed (for applications which care about the difference and want to try to do better), while applications which don't care will receive perfectly valid Python strings that can be passed around and manipulated as if the decoding error never happened.
It avoids the technical problems, but it's still magical behaviour that users have to learn, whereas bytes/unicode polymorphism uses the distinctions you should already know about. There's also a problem of how to turn it on. I'm against Python automatically changing the filesystem encoding, no matter how well intentioned. Better to let the app do that, which is easy and could be done for all apps (not just Python!) if someone defined a libc encoding of "null-escaped UTF-8". On the whole I'm only -0 on it (compared to -1 for UTF-8b). -- Adam Olsen, aka Rhamphoryncus
Adam Olsen wrote:
On Tue, Sep 30, 2008 at 3:43 PM, Nick Coghlan
wrote: Of the suggestions I've seen so far, I like Marcin's Mono-inspired NULL-escape codec idea the best. Since these strings all come from parts of the environment where NULLs are not permitted, a simple "'\0' in text" check will immediately identify any strings where decoding failed (for applications which care about the difference and want to try to do better), while applications which don't care will receive perfectly valid Python strings that can be passed around and manipulated as if the decoding error never happened.
It avoids the technical problems, but it's still magical behaviour that users have to learn, whereas bytes/unicode polymorphism uses the distinctions you should already know about.
There's also a problem of how to turn it on. I'm against automatically Python changing the filesystem encoding, no matter how well intentioned. Better to let the app do that, which is easy and could be done for all apps (not just python!) if someone defined a libc encoding of "null-escaped UTF-8".
On the whole I'm only -0 on it (compared to -1 for UTF-8b).
For the decoding side, you wouldn't need to do it as a codec - you could do it as a 'nullescape' error handler (since NULLs can't be present in the byte sequences being decoded, there is no need to worry about escaping anything when decoding is successful). Converting those NULL escaped strings back into something the filesystem can understand would obviously need a custom codec though, but some kind of application level handling of bad filenames is going to be needed no matter how we deal with bad encoding on the input side. That said, I don't think this is something we (or, more to the point, Guido) need to make a decision on right now - for 3.0, having bytes-level APIs that can see everything, and Unicode APIs that ignore badly encoded filenames is worth trying. If it proves inadequate, then we can revisit the idea of some kind of implicit escaping mechanism in the Unicode APIs for 3.1 when there is more time for a proper PEP. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org
On Tue, Sep 30, 2008 at 3:18 PM, Nick Coghlan
That said, I don't think this is something we (or, more to the point, Guido) need to make a decision on right now - for 3.0, having bytes-level APIs that can see everything, and Unicode APIs that ignore badly encoded filenames is worth trying. If it proves inadequate, then we can revisit the idea of some kind of implicit escaping mechanism in the Unicode APIs for 3.1 when there is more time for a proper PEP.
Right. Given that most syscalls already support both bytes and (unicode) str, the simplest thing to do is to take this a bit further, along the lines of Victor's patches, which I'm reviewing in Rietveld right now: http://codereview.appspot.com/3055 -- --Guido van Rossum (home page: http://www.python.org/~guido/)
On Tue, Sep 30, 2008 at 2:43 PM, Nick Coghlan
Of the suggestions I've seen so far, I like Marcin's Mono-inspired NULL-escape codec idea the best. Since these strings all come from parts of the environment where NULLs are not permitted, a simple "'\0' in text" check will immediately identify any strings where decoding failed (for applications which care about the difference and want to try to do better), while applications which don't care will receive perfectly valid Python strings that can be passed around and manipulated as if the decoding error never happened.
I'm not so sure. While it maintains *internal* consistency, printing and displaying those filenames isn't likely going to give useful results. E.g. on the terminal emulator I happen to be using right now null bytes are simply ignored. Another danger might be that the null character may be seen as the end of a string by some other library. -- --Guido van Rossum (home page: http://www.python.org/~guido/)
On 10:50 am, eckhardt@satorlaser.com wrote:
On Sunday 28 September 2008, Gregory P. Smith wrote:
"broken" systems will always exist. Code to deal with them must be possible to write in python 3.0.
since any given path (not just fs) can have its own encoding it makes the most sense to me to let the OS deal with the errors and not try to enforce bytes vs string encoding type at the python lib. level.
Actually I'm afraid that that isn't really useful. I, too, would like to kick people's backs in order to get them to fix their systems or use the proper codepage while mounting, etc., but that is not going to happen soon. Just ignoring those broken systems is tempting, but alienating a large group of users isn't IMHO worth it.
Instead, I'd like to present a different approach:
1. For POSIX platforms (using a byte string for the path): Here, the first approach is to convert the path to Unicode, according to the locale's CTYPE category. Hopefully, it will be UTF-8, but also codepages should work. If there is a segment (a byte sequence between two path separators) where it doesn't work, it uses an ASCII mapping where possible and codepoints from the "Private Use Area" (PUA) of Unicode for the non-decodable bytes.
In order to pass this path to fopen(), each segment would be converted to a byte string again, using the locale's CTYPE category except for segments which use the PUA where it simply encodes the original bytes.
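Ulrich's scheme, sketched per path segment (the specific mapping — byte b to U+E000 + b — is my assumption; his message doesn't pin down the PUA layout):

```python
PUA_BASE = 0xE000  # assumed mapping: undecodable byte b -> chr(PUA_BASE + b)

def decode_segment(raw: bytes, encoding: str = 'utf-8') -> str:
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # ASCII bytes map to themselves, everything else goes to the PUA
        return ''.join(chr(b) if b < 0x80 else chr(PUA_BASE + b) for b in raw)

def encode_segment(text: str, encoding: str = 'utf-8') -> bytes:
    if any(PUA_BASE <= ord(c) < PUA_BASE + 0x100 for c in text):
        return bytes(ord(c) - PUA_BASE if ord(c) >= PUA_BASE else ord(c)
                     for c in text)
    return text.encode(encoding)

raw = b'dir\xffname'
assert encode_segment(decode_segment(raw)) == raw   # broken names round-trip
assert decode_segment(b'hello') == 'hello'          # clean names untouched
```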
That's a cool idea, but this encoding hack would need to be clearly documented and exposed for when you need to talk to another piece of software about pathnames.

Consider a Python implementation of "xargs". Right now this can be implemented as a pretty simple for loop which eventually invokes 'subprocess.call' or similar. http://docs.python.org/dev/3.0/library/os.html#process-management doesn't say what the type of the arguments to the various 'exec' variants are - one presumes they'd have to be bytes. Not all arguments to subprocesses need to be filenames, but when they are they need to be encoded appropriately.

Also, consider the following nightmare scenario: a system which has two users with incompatible locales. One wishes to write a "text" (ha ha) file with a list of pathnames in it to share with the other. What encoding should that file be in? How should the other user know how to interpret it? (And of course: what if that user is going to be piping that file to "xargs", or the original file came out of "find"?)

I don't think that you can do encoding a segment at a time here, at least not at the API level; however, the whole file could be written in the py-posix-paths encoding which does exactly what you propose.
2. For win32 platforms, the path is already Unicode (UTF-16) and the whole problem is solved or not solved by the OS.
If the "or not solved" part of that is true then this probably bears further investigation. I suspect that the OS *always* provides some solution, even if it's the wrong solution, though. Also, what about MacOS X?
In the end, both approaches yield a path represented by a Unicode string for intermediate use, which provides maximum flexibility. Further, it preserves "broken" encodings by simply mapping their byte-values to the PUA of Unicode. Maybe not using a string to represent a path would be a good idea, too. At least it would make it very clear that the string is not completely free-form.
Personally, I plan to use this: http://twistedmatrix.com/documents/8.1.0/api/twisted.python.filepath.FilePat... for all of my file I/O in the future. For what it's worth, this object _doesn't_ handle unicode properly and it's been a thorn in our side for quite a while. We have plans to implement some kind of unicode-friendly API which is compatible with 2.6; if we have any brilliant ideas I'll let you know, but I doubt they'll be in time. The general idea right now is that we'll keep around the original bytes returned from filesystem inspection and provide some context-sensitive encoding/decoding APIs for different applications. The PUA approach would allow us to maintain an API compatible with that. I would not actually mind if there were a POSIX-specific module we had to use to get every arcane nuance of brokenness of writing pathnames into text files to be correct, since Windows needs to come up with _some_ valid unicode filename for every file in the system (even if it's improperly decoded).
On Monday 29 September 2008, glyph@divmod.com wrote:
Also, what about MacOS X?
AFAIK, OS X guarantees UTF-8 for filesystem encodings. So the OS also provides Unicode filenames and how it deals with broken or legacy media is left up to the OS. Uli -- Sator Laser GmbH
On Mon, 29 Sep 2008 14:34:07 +0200, Ulrich Eckhardt
On Monday 29 September 2008, glyph@divmod.com wrote:
Also, what about MacOS X?
AFAIK, OS X guarantees UTF-8 for filesystem encodings. So the OS also provides Unicode filenames and how it deals with broken or legacy media is left up to the OS.
Read Jack Jansen's recent email about NFC vs NFD. Jean-Paul
Ulrich Eckhardt wrote:
AFAIK, OS X guarantees UTF-8 for filesystem encodings. So the OS also provides Unicode filenames and how it deals with broken or legacy media is left up to the OS.
Does this mean that the OS always returns valid utf-8 strings from filesystem calls, even if the media is broken or legacy? -- Greg
Greg Ewing writes:
Ulrich Eckhardt wrote:
AFAIK, OS X guarantees UTF-8 for filesystem encodings. So the OS also provides Unicode filenames and how it deals with broken or legacy media is left up to the OS.
Does this mean that the OS always returns valid utf-8 strings from filesystem calls, even if the media is broken or legacy?
No, this means Ulrich is wrong. NFD-normalized UTF-8 is more or less enforced by the default filesystem, but Mac OS X up to 10.4 at least also supports the FreeBSD filesystems, and some of those can have any encoding you like or none at all (ie, KOI8-R and Shift JIS in the same directory is possible). If you have a Mac it's easy enough to test by creating a disk image with a non-default file system.
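The NFC/NFD distinction behind this is easy to see with the stdlib:

```python
import unicodedata

nfc = 'caf\u00e9'                                   # e-acute as one code point
nfd = unicodedata.normalize('NFD', nfc)             # 'e' + combining acute

assert nfc != nfd                                   # distinct code point (and
                                                    # byte) sequences on disk,
assert unicodedata.normalize('NFC', nfd) == nfc     # yet the same text
```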
Le Monday 29 September 2008 12:50:03 Ulrich Eckhardt, vous avez écrit :
(...) uses an ASCII mapping where possible and codepoints from the "Private Use Area" (PUA) of Unicode for the non-decodable bytes.
That sounds to me like a very *ugly* hack. It reminds me of my earlier proposal to create an object with the API of both the bytes and str types: str(<Filename object>) = human representation of the filename, bytes(<Filename>) = original bytes filename. As I wrote in the first email of this thread, it's not a good idea to mix bytes and characters. Why try to convert bytes to characters when the operating system expects bytes? For the best compatibility, we have to use the native types, at least when str(filename, fs_encoding) fails and/or str(filename, fs_encoding).encode(fs_encoding) != filename. -- Victor Stinner aka haypo http://www.haypocalc.com/blog/
I'm a bit late to join in this discussion, but if unicode filenames are going to be the normal mode, how about this whole normalized/canonical business?

This is a headache that has shown up on the Mac a couple of times, because MacOS prefers filenames to be NFC, whereas Python prefers its Unicode to be NFD (or the other way around, I keep forgetting the details).

To make the problem worse, even though MacOS prefers its filenames in the one form, it will allow filenames in the other form (which can happen if you mount a foreign filesystem, for example over the net). The fact that "incorrect" filenames can exist means that the simple solution of converting NFC<->NFD in Python's open() and friends won't work (or, at least, it'll make some filenames inaccessible, and listdir() may return filenames that don't exist).

--
Jack Jansen,
Jack Jansen wrote:
I'm a bit late to join in this discussion, but if unicode filenames are going to be the normal mode, how about this whole normalized/canonical business?
I don't think there is a change in the current implementation. Users interested in this issue should contribute code that normalizes file names appropriately on systems that require such normalization. Regards, Martin
participants (17)

- "Martin v. Löwis"
- Adam Olsen
- Amaury Forgeot d'Arc
- glyph@divmod.com
- Greg Ewing
- Gregory P. Smith
- Guido van Rossum
- Hrvoje Nikšić
- Jack Jansen
- Jean-Paul Calderone
- M.-A. Lemburg
- Nick Coghlan
- Simon Cross
- Stephen J. Turnbull
- Tristan Seligmann
- Ulrich Eckhardt
- Victor Stinner