[Twisted-Python] Python 3: bytes vs. str in twisted.python.filepath
![](https://secure.gravatar.com/avatar/7ea0679fb517f5c674e8456f4c34e272.jpg?s=120&d=mm&r=g)
Hi all,

My name is Harry Bock. I'm interested in helping out porting Twisted to Python 3, and I've popped in IRC a few times to introduce myself and ask a few questions. A few developers agreed that working on trial dependencies would be a big help.

In doing some porting work on trial, I stumbled upon a previous porting effort (possibly by Itamar?) for twisted.python.filepath and related modules. It seemed like the porting effort included forcing all pathname inputs to be byte strings instead of native strings. After some investigation, I believe this is the wrong approach, but I wanted to start a discussion here first. Some thoughts:

(a) As of Python 3.3, use of the ANSI API in Windows is deprecated[1], so many functions in os and os.path raise DeprecationWarning when given byte strings as input. Although win32 is not an initial target of the porting effort, we'll have to support it eventually and the API should be supported before then.

(b) Misunderstandings at the application level about the underlying filesystem's path encoding are not the problem of the Twisted API. Correct me if I'm wrong, but it is the responsibility of the system administrator or individual user (at least on UNIX) to set the LANG environment variable, or of the application to call setlocale(3) to explicitly override it.

(c) If we do not allow unicode strings, we will be forcing the application developer to decide how to encode paths when using the FilePath API. Per (b) above, the user will have to call sys.getfilesystemencoding()[2] to divine what encoding to use before using the API at all, which to me is terribly annoying and would just add str.encode calls everywhere.

Thus, my vote is that on Python 2.x, Twisted should accept either the native str or unicode types for path names, and on Python 3.x, only accept the str type to prevent deprecation issues with system calls.

I have a patch set that will make this happen, including unittest modifications; if there's a consensus I'm happy to open a ticket and submit the patches.

Thanks!

[1] http://bugs.python.org/issue13374
[2] http://docs.python.org/3/library/sys.html#sys.getfilesystemencoding
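To make point (c) concrete, here is a minimal sketch of what a caller has to do by hand when an API only accepts bytes. It uses only the standard library; the path is hypothetical and deliberately ASCII-only so the result does not depend on the locale. `os.fsencode` and `os.fsdecode` (available since Python 3.2) are shown as the stdlib helpers that wrap the same lookup.

```python
import os
import sys

# Hypothetical, ASCII-only path chosen so the example behaves the same
# under any locale.
text_path = "/tmp/example.txt"

# What a bytes-only API pushes onto every caller:
# look up the filesystem encoding, then encode by hand.
encoding = sys.getfilesystemencoding()
manual_bytes = text_path.encode(encoding)

# The standard library wraps the same lookup (Python 3.2+).
helper_bytes = os.fsencode(text_path)
round_tripped = os.fsdecode(helper_bytes)

print(encoding, manual_bytes, helper_bytes, round_tripped)
```

With a non-ASCII name the hand-written `str.encode` call can raise UnicodeEncodeError under a restrictive locale, which is part of why scattering such calls through application code is unattractive.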
![](https://secure.gravatar.com/avatar/869963bdf99b541c9f0bbfb04b0320f1.jpg?s=120&d=mm&r=g)
On Sun, Jul 14, 2013 at 4:00 AM, Harry Bock <bock.harryw@gmail.com> wrote:
There is no way to enforce a particular setting of the LANG environment variable globally; multiple users could use filenames encoded in different encodings (in fact even a single user could do this), and files could be transferred from other systems using different encodings. While a reasonable person might insist on the use of UTF-8 everywhere, there is no way to guarantee that UNIX filenames are all in the same encoding, or are even in any particular encoding at all (they might be binary non-text garbage), and the inability to deal with filenames like this would be somewhat of a serious defect. On Windows, the reverse situation obtains, of course. -- mithrandi, i Ainil en-Balandor, a faer Ambar
![](https://secure.gravatar.com/avatar/d7875f8cfd8ba9262bfff2bf6f6f9b35.jpg?s=120&d=mm&r=g)
On 07/13/2013 10:00 PM, Harry Bock wrote:
You imply that this was a change, somehow, but it wasn't. The API was *always* bytes and it continues to be bytes on Python 3. It's a common Python 3 porting mistake to change everything from bytes to unicode just because. E.g. Python standard library does this in many places for no good reason, resulting in bugs that are still being fixed (http://bugs.python.org/issue12411) or APIs that are less useful (zipfile docs explicitly state that there is no standard encoding in zip files, but Python 3 zipfile module only supports one specific encoding because they switched to Unicode and didn't bother reading the module's own docs). Our goal in porting was backwards compatibility with Python 2 code, so porters don't have to change everything, and correctness. And, in this particular case, to get something working in the minimal amount of time - *adding* Unicode support is useful and should be done.
It is indeed a problem that we only support bytes in FilePath on Python 3. As I mentioned above, Unicode support is missing only due to lack of time in the initial port.
The ideal situation would be to support bytes and Unicode on Python 2 *and* Python 3, for maximum compatibility. Even if deprecated on Windows, filesystem operations on Python 3 still do accept bytes (and they're not deprecated elsewhere). Given existing code that already takes bytes, switching to only doing Unicode on Python 3 would not be backwards compatible, so we can't really do that without a bunch of deprecation warnings and a few releases. Instead we should just do what Python does: if you start with bytes path you always get back bytes, if you start with Unicode path you always get back Unicode.
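As a rough illustration of "do what Python does", the sketch below shows the standard library's own convention: `os.listdir` and `os.path.join` return the same type they were given. The temporary directory and file name are made up for the example; this says nothing about how FilePath itself would implement the rule.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as text_dir:
    # Create one file so there is something to list.
    with open(os.path.join(text_dir, "example.txt"), "w"):
        pass

    # Text path in, text names out.
    text_names = os.listdir(text_dir)

    # Bytes path in, bytes names out.
    bytes_dir = os.fsencode(text_dir)
    bytes_names = os.listdir(bytes_dir)

    # os.path.join follows the same rule and refuses to mix the two types.
    joined_text = os.path.join(text_dir, text_names[0])
    joined_bytes = os.path.join(bytes_dir, bytes_names[0])

print(text_names, bytes_names)
```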
![](https://secure.gravatar.com/avatar/7ea0679fb517f5c674e8456f4c34e272.jpg?s=120&d=mm&r=g)
On Sun, Jul 14, 2013 at 8:16 AM, Itamar Turner-Trauring <itamar@itamarst.org> wrote:
Ah, I understand now. Since the native string type was used in Python 2, it follows that in Python 3 the API should be bytes.
This is very true and I didn't consider it in my initial investigation. While I think it would be uncommon to have files in multiple encodings on the same filesystem, it certainly would not be rare - to Tristan's point, copying names from filesystem to filesystem could easily result in multiple encodings. The operating system may not need to understand the encodings, but applications do in order to display them correctly. Which leads to your last point...
Yes, you're right, that's probably the best solution. It would not be terribly hard to do so - then application developers can choose whether to defer to the local user's interpretation of the setting, or explicitly use byte paths. Thanks so much for your input! Is this something I can open a ticket for?
![](https://secure.gravatar.com/avatar/d7875f8cfd8ba9262bfff2bf6f6f9b35.jpg?s=120&d=mm&r=g)
On 07/14/2013 10:18 AM, Harry Bock wrote:
Is this something I can open a ticket for?
I believe there's already a ticket of sorts, with an old defunct branch that started working on this - https://twistedmatrix.com/trac/ticket/2366 - it would be really great if you could revive it and add support for this feature. Using FilePath is definitely annoying on Python 3, and in general Unicode makes more sense in many (most?) situations. -Itamar
![](https://secure.gravatar.com/avatar/e1554622707bedd9202884900430b838.jpg?s=120&d=mm&r=g)
First off, hi Harry! I am super glad that someone has taken an interest in this. Please let me know if I can be helpful in your effort to fix this. FilePath totally has the right sort of shape to handle all these problems very gracefully, but its current implementation is (as you have noticed!) a disaster: regardless of Python 2/3 issues, it doesn't handle text/bytes correctly on Python 2. Also, sorry for being a bit late to the party, been on vacation for a week :-).

On Jul 14, 2013, at 7:18 AM, Harry Bock <bock.harryw@gmail.com> wrote:
It doesn't really make sense to talk about "native strings" unless you're talking about Python code objects; __doc__ and func_name are "native strings"; the inputs to FilePath are bytes, pure and simple. This is mostly just because FilePath was designed way back when I only really knew about the way path names worked on Linux. Among several of the design errors in Python 3's allegedly superior unicode support was to call the text type "str", when this was a confusing name in the first place, and is now ambiguous, confusing, and arguably wrong all at once; at the cost of one additional letter, it could have been "text", which is both a whole word and a more accurate description of what it does. I generally use "text" rather than "string" to describe the text type anyway, because it's a lot less ambiguous and requires less backtracking ("oh I was talking about python 2 there, let me rephrase").
This is not really true. This is how Linux and BSD handle file names; it is not how OS X handles file names. (Nor is it how Windows works, as you've mentioned above.) On OS X, file names are normalized (I forget the normalization at the moment, but you can look it up) UTF-8. They _must_ be normalized UTF-8; it doesn't matter what $LANG is. If you try to deal with filenames that are invalid UTF-8 byte sequences, the OS will URL-encode portions of the filename for you and _force_ its name (as returned by listdir() at least) to be a valid UTF-8 sequence. If you give it something non-normalized, it will normalize it for you.
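For readers unfamiliar with normalization, a small sketch of why it matters for file names: the same visible text can have more than one UTF-8 byte sequence. The choice of NFD below is an assumption for illustration (the message above does not name the form, and HFS+ is usually described as using a variant of decomposed form); the file name is made up.

```python
import unicodedata

# One visible name, two Unicode spellings.
composed = "caf\u00e9"                               # precomposed U+00E9
decomposed = unicodedata.normalize("NFD", composed)  # 'e' + combining U+0301

composed_bytes = composed.encode("utf-8")      # b'caf\xc3\xa9'
decomposed_bytes = decomposed.encode("utf-8")  # b'cafe\xcc\x81'

# The strings and their bytes differ until both are normalized the same way.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

print(composed_bytes, decomposed_bytes, same_after_nfc)
```

A name written in one form and read back in the other will not compare equal byte-for-byte, which is one reason a plain bytes comparison is not enough on a normalizing filesystem.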
The design should not be as naive as "support bytes" or "support unicode", or even "support both". In order to deal with some of these nastier edge-cases, you need a method that can give you a name to display to a user that's "human readable", a weird-Python-broken-surrogates-trick unicode object, and some bytes. Then there's possibly some extra methods that could be added which are only sometimes available, like "driveLetter()" or somesuch. (Maybe we could do better and have some kind of general mount-point object, but I digress.) In other words, we need to give the developer an expressive enough API to clearly indicate their intent, and then have clear enough API documentation for them to figure out what their intent is :). At the implementation level, these potential methods are both platform-specific and subtly distinctive. For example, the "human readable name" implementation of a broken FilePath should include replacement characters rather than broken-surrogate hacks. Replacement characters have a defined method for displaying them; since broken surrogates are just invalid garbage, some software might elect not to display the string at all, or throw an error. It might also be sensible (as a future enhancement, this is not something we should try to do as a basic part of proper unicode support) to do some encoding-guessing and mojibake detection when trying to compute the human-readable name, since this name is just for display and it makes sense to work as hard as possible to display something sensible, since it does NOT need to be able to be fed back in to FilePath. But of course on OS X, the thing to do would just be to convert to the percent-escaped version, since that's what the platform presents. 
And on Windows, it might be sensible for the thing that gives you bytes to give you a faithful UTF-8 version of the filename rather than some platform-dependent ANSI junk, since as far as I can tell there's no need to ever get a byte sequence you could pass back to some other ANSI API. If it were, that could be an explicitly separate API. Finally, the fact that FilePath exposes the internal representation of the path (as ".path") is sort of a design error, and we should eventually deprecate that attribute, since there are multiple use-cases you might want that string for and we should return the appropriate version depending on which one you want. I wouldn't worry about getting that attribute to do anything useful beyond a very rudimentary level of compatibility; in fact it would be great if the internal storage of the path were always unicode on Windows and always bytes on UNIX-ish platforms, and ".path" were just a proxy that always gave you bytes. (Although possibly the internal representation should just be unicode too on OS X, I keep finding myself on the fence about that.)
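The three views of a name described above (raw bytes, a round-trippable text object using the surrogate trick, and a display-only text) can be sketched with the standard codecs. This is only an illustration under stated assumptions: the file name is made up, UTF-8 is hard-coded as the assumed filesystem encoding to keep the result deterministic, and none of these variable names correspond to an actual FilePath API.

```python
# A name that is valid Latin-1 but invalid UTF-8 (hypothetical example).
raw_name = b"caf\xe9.txt"

# Round-trippable text: undecodable bytes become lone surrogates (PEP 383).
round_trip_text = raw_name.decode("utf-8", "surrogateescape")

# Display-only text: undecodable bytes become U+FFFD REPLACEMENT CHARACTER.
display_text = raw_name.decode("utf-8", "replace")

# The surrogate form can be turned back into the original bytes...
restored = round_trip_text.encode("utf-8", "surrogateescape")

# ...but a strict encoder rejects it, which is why it is a poor
# choice for anything shown to a user.
try:
    round_trip_text.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True

print(repr(round_trip_text), repr(display_text), restored, strict_encode_failed)
```

The display form is lossy by design: it cannot be fed back in to recover the original bytes, which matches the point that the human-readable name does not need to round-trip.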
Is this something I can open a ticket for?
Hopefully the existing ticket is sufficient, but, open as many as you need :). There might be a bunch of methods that need modification here, and at least e.g. the ZipPath work could be done separately. -glyph
![](https://secure.gravatar.com/avatar/8ebdb2638dbd7849787b9edb6e3f3509.jpg?s=120&d=mm&r=g)
Hello, Harry! I just noticed this thread. I opened a ticket for this a while back: https://twistedmatrix.com/trac/ticket/5203 ("FilePath.children() should return FilePath objects with unicodes in them instead of strs"). There is some discussion on that ticket.

For what it is worth, I agree with Itamar that porting to Python3 shouldn't be combined with changing the functionality or API, but I also agree with Harry (at least what Harry originally said) that FilePath objects should not carry around a "path" that is just bytes and doesn't specify what encoding those bytes are in. I know this is a subtle topic, in the sense that I can see the argument on the other side, too, and I don't think either approach can satisfy all users, but I still think it is a better idea to require unicode-only, and so I'd like to try to explain why a little bit, below, in addition to the discussion that is recorded on #5203.

Here's my basic argument: a sequence of bytes without an accompanying encoding is an *insufficiently typed* thing. That is, there is no way to use it safely without first restoring a type, and that being the *correct* type. The traditional way to handle pathnames in Linux has been to let them be under-typed, and then restore the type heuristically. This traditionally worked most of the time, because the most common thing you would do with a sequence of bytes like that is plug it back into the same filesystem from which it came. However, I make two claims:

1. In the modern world, it is very common to send it over the network instead of to plug it back into the same filesystem from which it came, and
2. there's not very much need for this "forget what type it was, guess the type later, and guess correctly" hack! We can instead *require* the user to supply a type with the bytestring originally, and then remember the type that the user supplied.
This breaks only a few use cases that are probably very rare, and in fact might be unfixable anyway, but it prevents failures which are very common, which is what happens when you guess the wrong type during the restore. This is what we've done in Tahoe-LAFS, and we've had few or no complaints from users about it. Certainly if there were any, it was in the early days of Tahoe-LAFS, around 5 years ago, when ill-typed Linux filesystems hadn't quite finished dying out (i.e. the bytes on there are actually encoded in iso8859, but sys.getfilesystemencoding() returns 'utf-8'). We wrote unit tests and did careful code-review when we converted Tahoe-LAFS from bytes to unicode-only a few years ago, and so I'd be happy to share the knowledge I gleaned from that experience. Regards, Zooko
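A tiny sketch of the "insufficiently typed" problem: the same two bytes are a perfectly valid name under two encodings, and nothing in the bytes says which one was meant. The byte string is made up for the example.

```python
# Two bytes with no encoding attached.
raw = b"\xc3\xa9"

# Both decodings succeed; only one of them is what the author meant.
as_utf8 = raw.decode("utf-8")         # 'é'
as_latin1 = raw.decode("iso-8859-1")  # 'Ã©'

# Guessing wrong does not raise an error, it silently yields mojibake.
print(repr(as_utf8), repr(as_latin1))
```

Because the wrong guess succeeds silently, the failure only shows up later, for example when the name is displayed or sent to another system.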
![](https://secure.gravatar.com/avatar/e1554622707bedd9202884900430b838.jpg?s=120&d=mm&r=g)
On Sep 11, 2013, at 10:48 AM, Zooko Wilcox-OHearn <zooko@leastauthority.com> wrote:
Just to be specific about this, the use-case that it breaks is the notion that you have a USB key formatted on a Linux machine in KOI-8 and you plug it into a system where the host encoding is Shift-JIS. You can then have a path which is partially in one encoding and partially in another. The problem with the "bytes-with-encoding" idea is that it doesn't apply to paths, it applies to path segments - which is why FilePath is (well, ought to be) a data *structure*, and not just some methods around existing data (a string). -glyph
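The mixed-encoding case can be sketched as follows. This is an illustration only: the segment names are made up, `koi8_r` and `shift_jis` are the Python codec names chosen to stand in for "KOI-8" and "Shift-JIS", and the list-of-pairs layout is a hypothetical stand-in for whatever structure FilePath would actually use.

```python
# Each segment carries its own encoding (hypothetical structure).
segments = [
    ("диск".encode("koi8_r"), "koi8_r"),     # directory created on the USB key
    ("文書".encode("shift_jis"), "shift_jis"),  # file created on the host
]

# The on-disk path is just the bytes joined together.
whole_path = b"/".join(raw for raw, _ in segments)

# Decoding segment by segment recovers every name.
per_segment = [raw.decode(enc) for raw, enc in segments]

# Decoding the whole path with a single encoding cannot recover both names.
single_guess = whole_path.decode("koi8_r", "replace")

print(per_segment, repr(single_guess))
```

A single "bytes plus one encoding" pair has nowhere to record the second encoding, whereas a per-segment structure does.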
participants (5)
- Glyph
- Harry Bock
- Itamar Turner-Trauring
- Tristan Seligmann
- Zooko Wilcox-OHearn