Mailman 3 Unicode Imports - Python-Dev

Unicode Imports

Kristján V. Jónsson

7 Sep 2006

10:26 p.m.

Hello All. I just added patch 1552880 to SourceForge. It is a patch for 2.6 (and 2.5) which allows Unicode paths in sys.path and uses the Unicode file API on Windows. This is tried and tested on 2.5, has been backported to 2.3, and is currently running on clients in China and elsewhere. It is minimally intrusive to the importing mechanism, at the cost of some string conversion overhead (to UTF-8 and then back to Unicode).

Cheers, Kristján


Anthony Baxter

7 Sep

11:23 p.m.

On Friday 08 September 2006 02:56, Kristján V. Jónsson wrote:

...

Hello All. I just added patch 1552880 to SourceForge. It is a patch for 2.6 (and 2.5) which allows Unicode paths in sys.path and uses the Unicode file API on Windows. This is tried and tested on 2.5, has been backported to 2.3, and is currently running on clients in China and elsewhere. It is minimally intrusive to the importing mechanism, at the cost of some string conversion overhead (to UTF-8 and then back to Unicode).

As this can't be considered a bugfix (that I can see), I'd be against it being checked into 2.5.

Steve Holden

8 Sep

1:54 p.m.

Anthony Baxter wrote:

...

On Friday 08 September 2006 02:56, Kristján V. Jónsson wrote:

...
Hello All. I just added patch 1552880 to SourceForge. It is a patch for 2.6 (and 2.5) which allows Unicode paths in sys.path and uses the Unicode file API on Windows. This is tried and tested on 2.5, has been backported to 2.3, and is currently running on clients in China and elsewhere. It is minimally intrusive to the importing mechanism, at the cost of some string conversion overhead (to UTF-8 and then back to Unicode).

As this can't be considered a bugfix (that I can see), I'd be against it being checked into 2.5.

Are you suggesting that Python's inability to correctly handle Unicode path elements isn't a bug? Or simply that this inability isn't currently described in a bug report on Sourceforge?

I agree it's a relatively large patch for a release candidate but if prudence suggests deferring it, it should be a *definite* for 2.5.1 and subsequent releases.

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://holdenweb.blogspot.com
Recent Ramblings http://del.icio.us/steve.holden

Anthony Baxter

2:28 p.m.

On Friday 08 September 2006 18:24, Steve Holden wrote:

...

...
As this can't be considered a bugfix (that I can see), I'd be against it being checked into 2.5.

Are you suggesting that Python's inability to correctly handle Unicode path elements isn't a bug? Or simply that this inability isn't currently described in a bug report on Sourceforge?

I'm suggesting that adding the ability to handle unicode paths is a *new* *feature*. If people actually want to see 2.5 final ever released, they're going to have to accept that "oh, but just this _one_ _more_ _thing_" is not going to fly. We're _well_ past beta1, where new features should have been added. At this point, we have to cut another release candidate. This is far too much to add during the release candidate stage.

...

I agree it's a relatively large patch for a release candidate but if prudence suggests deferring it, it should be a *definite* for 2.5.1 and subsequent releases.

Possibly. I remain unconvinced.

--
Anthony Baxter
It's never too late to have a happy childhood.

Steve Holden

2:49 p.m.

Anthony Baxter wrote:

...

On Friday 08 September 2006 18:24, Steve Holden wrote:

...
...
As this can't be considered a bugfix (that I can see), I'd be against it being checked into 2.5.

Are you suggesting that Python's inability to correctly handle Unicode path elements isn't a bug? Or simply that this inability isn't currently described in a bug report on Sourceforge?

I'm suggesting that adding the ability to handle unicode paths is a *new* *feature*.

That's certainly true.

...

If people actually want to see 2.5 final ever released, they're going to have to accept that "oh, but just this _one_ _more_ _thing_" is not going to fly.

We're _well_ past beta1, where new features should have been added. At this point, we have to cut another release candidate. This is far too much to add during the release candidate stage.

Right. I couldn't argue for putting this in to 2.5 - it would certainly represent unwarranted feature creep at the rc2 stage.

...

...
I agree it's a relatively large patch for a release candidate but if prudence suggests deferring it, it should be a *definite* for 2.5.1 and subsequent releases.

Possibly. I remain unconvinced.

But it *is* a desirable, albeit new, feature, so I'm surprised that you don't appear to perceive it as such for a downstream release.

regards
Steve

Anthony Baxter

3:18 p.m.

On Friday 08 September 2006 19:19, Steve Holden wrote:

...

But it *is* a desirable, albeit new, feature, so I'm surprised that you don't appear to perceive it as such for a downstream release.

Point releases (2.x.1 and suchlike) are absolutely not for new features. They're for bugfixes, only. It's possible that this could be considered a bugfix, but as I said right now I'm dubious.

Anthony

Steve Holden

3:58 p.m.

Anthony Baxter wrote:

...

On Friday 08 September 2006 19:19, Steve Holden wrote:

...
But it *is* a desirable, albeit new, feature, so I'm surprised that you don't appear to perceive it as such for a downstream release.

Point releases (2.x.1 and suchlike) are absolutely not for new features. They're for bugfixes, only. It's possible that this could be considered a bugfix, but as I said right now I'm dubious.

OK, in that case I'm going to argue that the current behaviour is buggy.

I suppose your point is that, assuming the patch is correct (and it seems the authors are relying on it for production purposes in tens of thousands of installations), it doesn't change the behaviour of the interpreter in existing cases, and therefore it is providing a new feature.

I don't regard this as the provision of a new feature but as the removal of an unnecessary restriction (which I would prefer to call a bug). If it was *documented* somewhere that Unicode paths aren't legal I would find your arguments more convincing. As things stand new Python users would, IMHO, be within their rights to assume that arbitrary directories could be added to the path without breakage.

Ultimately, your call, I guess. Would it help if I added "inability to import from Unicode directories" as a bug? Or would you prefer to change the documentation to state that some directories can't be used as path elements <0.3 wink>?

regards
Steve

Guido van Rossum

9:59 p.m.

On 9/8/06, Steve Holden wrote:

...

Anthony Baxter wrote:

...
On Friday 08 September 2006 19:19, Steve Holden wrote:

...
But it *is* a desirable, albeit new, feature, so I'm surprised that you don't appear to perceive it as such for a downstream release.

Point releases (2.x.1 and suchlike) are absolutely not for new features. They're for bugfixes, only. It's possible that this could be considered a bugfix, but as I said right now I'm dubious.

OK, in that case I'm going to argue that the current behaviour is buggy.

I suppose your point is that, assuming the patch is correct (and it seems the authors are relying on it for production purposes in tens of thousands of installations), it doesn't change the behaviour of the interpreter in existing cases, and therefore it is providing a new feature.

I don't regard this as the provision of a new feature but as the removal of an unnecessary restriction (which I would prefer to call a bug). If it was *documented* somewhere that Unicode paths aren't legal I would find your arguments more convincing. As things stand new Python users would, IMHO, be within their rights to assume that arbitrary directories could be added to the path without breakage.

Ultimately, your call, I guess. Would it help if I added "inability to import from Unicode directories" as a bug? Or would you prefer to change the documentation to state that some directories can't be used as path elements <0.3 wink>?

We've all heard the arguments for both sides enough times I think. IMO it's the call of the release managers. Board members ought to trust the release managers and not apply undue pressure.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)

skip＠pobox.com

10:11 p.m.

Guido> IMO it's the call of the release managers. Board members ought to
Guido> trust the release managers and not apply undue pressure.

Indeed. Let's not go whacking people with boards. The Perl people would just laugh at us...

Skip

Giovanni Bajo

9 Sep

12:21 a.m.

Guido van Rossum wrote:

...

IMO it's the call of the release managers. Board members ought to trust the release managers and not apply undue pressure.

+1, but I would love to see a more formal definition of what a "bugfix" is, which would reduce the ambiguous cases, and thus reduce the number of times the release managers are called to pronounce.

Other projects, for instance, describe point releases as "open for regression fixes only", which means that a patch, to be eligible for a point release, must fix a regression (something which used to work before, and doesn't anymore).

Regressions are important because they affect people wanting to upgrade Python. If something never worked before (like this unicode path thingie), surely existing Python users are not affected by the bug (or they already have workarounds in place), so that NOT having the bug fixed in a point release is not a problem.

Anyway, I'm not pushing for this specific policy (even if I like it): I'm just suggesting that the release managers define more formally what should and what should not go in a point release.

Giovanni Bajo

Raymond Hettinger

12:30 a.m.

Giovanni Bajo wrote:

...

+1, but I would love to see a more formal definition of what a "bugfix" is, which would reduce the ambiguous cases, and thus reduce the number of times the release managers are called to pronounce.

Sorry, that is just a pipe-dream. To some degree, all bug-fixes are new features in that there is some behavioral difference, something will now work that wouldn't work before. While some cases are clear-cut (such as API changes), the ones that are interesting will defy definition and need a human judgment call as to whether a given change will help more than it hurts. The RMs are also strongly biased against extensive patches that haven't had a chance to go through a beta-cycle -- they don't want their releases mucked-up.

Raymond

"Martin v. Löwis"

2:29 a.m.

Giovanni Bajo schrieb:

...

+1, but I would love to see a more formal definition of what a "bugfix" is, which would reduce the ambiguous cases, and thus reduce the number of times the release managers are called to pronounce.

Other projects, for instance, describe point releases as "open for regression fixes only", which means that a patch, to be eligible for a point release, must fix a regression (something which used to work before, and doesn't anymore).

In Python, the tradition has accepted bug fixes beyond that. For example, fixing a memory leak would also count as a bug fix.

In general, I think a "bug" is a deviation from the specification (it might be necessary to interpret the specification first to find out whether the implementation deviates). A bug fix is then a behavior change so that the new behavior follows the specification, or a specification change so that it correctly describes the behavior.

Regards, Martin

"Martin v. Löwis"

2:26 a.m.

Steve Holden schrieb:

...

I don't regard this as the provision of a new feature but as the removal of an unnecessary restriction (which I would prefer to call a bug).

You got the definition of "bug" wrong. Primarily, a bug is a deviation from the specification. Extending the domain of an argument to an existing function is a new feature. Regards, Martin

Nick Coghlan

8 Sep

3:26 p.m.

Steve Holden wrote:

...

Anthony Baxter wrote:

...
On Friday 08 September 2006 18:24, Steve Holden wrote:

...
I agree it's a relatively large patch for a release candidate but if prudence suggests deferring it, it should be a *definite* for 2.5.1 and subsequent releases.

Possibly. I remain unconvinced.

But it *is* a desirable, albeit new, feature, so I'm surprised that you don't appear to perceive it as such for a downstream release.

And unlike 2.2's True/False problem, it is an *environmental* feature, rather than a programmatic one. So while it's a new feature, it would merely mean that 2.5.1 works correctly in more environments than 2.5.

Cheers, Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
---------------------------------------------------------------
http://www.boredomandlaziness.org

"Martin v. Löwis"

9 Sep

2:24 a.m.

Nick Coghlan schrieb:

...

...
But it *is* a desirable, albeit new, feature, so I'm surprised that you don't appear to perceive it as such for a downstream release.

And unlike 2.2's True/False problem, it is an *environmental* feature, rather than a programmatic one.

Not sure what you mean by that; if you mean "thus existing applications cannot break": this is not true. In fact, it seems that some applications are extremely susceptible to the types of objects on sys.path. Some applications apparently know exactly what you can and cannot find on sys.path; changing that might break them. Regards, Martin

"Martin v. Löwis"

2:22 a.m.

Steve Holden schrieb:

...

...
...
I agree it's a relatively large patch for a release candidate but if prudence suggests deferring it, it should be a *definite* for 2.5.1 and subsequent releases.

Possibly. I remain unconvinced.

But it *is* a desirable, albeit new, feature, so I'm surprised that you don't appear to perceive it as such for a downstream release.

Because 2.5.1 shouldn't include any new features. If it is a new feature (which it is), it should go into 2.6. Regards, Martin

"Martin v. Löwis"

2:21 a.m.

Steve Holden schrieb:

...

...
As this can't be considered a bugfix (that I can see), I'd be against it being checked into 2.5.

Are you suggesting that Python's inability to correctly handle Unicode path elements isn't a bug?

Not sure whether Anthony suggests it, but I do.

...

Or simply that this inability isn't currently described in a bug report on Sourceforge?

No: sys.path is specified (originally) as containing a list of byte strings; it was extended to also support path importers (or whatever that PEP calls them). It was never extended to support Unicode strings. That other PEP e

...

I agree it's a relatively large patch for a release candidate but if prudence suggests deferring it, it should be a *definite* for 2.5.1 and subsequent releases.

I'm not so sure it should. It *is* a new feature: it makes applications possible which aren't possible today, and the documentation does not ever suggest that these applications should have been possible. In fact, it is common knowledge that this currently isn't supported. Regards, Martin

Nick Coghlan

11:25 a.m.

Martin v. Löwis wrote:

...

Steve Holden schrieb:

...
Or simply that this inability isn't currently described in a bug report on Sourceforge?

No: sys.path is specified (originally) as containing a list of byte strings; it was extended to also support path importers (or whatever that PEP calls them). It was never extended to support Unicode strings. That other PEP e

That other PEP being PEP 302. That said, Unicode strings *are* permitted on sys.path - the import system will automatically encode them to an 8-bit string using the default filesystem encoding as part of the import process. This works fine on Unix systems that use UTF-8 encoded strings to handle Unicode paths at the C API level, but is screwed on Windows because the default mbcs filesystem encoding can't handle the full range of possible Unicode path names (such as the Chinese directories that originally gave Kristján grief). To get Unicode path names to work on Windows, you have to use the Windows-specific wide character API instead of the normal C API, and the import machinery doesn't do that. So this is taking something that *already works properly on POSIX systems* and making it work on Windows as well.
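The encoding failure described above can be sketched in a few lines. This is only an illustration in Python 3 syntax, not the import machinery itself: `cp1252` is an assumed stand-in for a Western "mbcs" ANSI code page (the real `mbcs` codec exists only on Windows), and the helper name `can_encode` and the path are made up for the example.

```python
# Hypothetical sys.path entry containing Chinese characters.
path_entry = "C:\\game\\模块"

def can_encode(path, encoding):
    """Return True if every character of `path` is representable in `encoding`."""
    try:
        path.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

# UTF-8 (typical on POSIX systems) can represent any Unicode path.
utf8_ok = can_encode(path_entry, "utf-8")
# cp1252 is an assumed stand-in for a Western "mbcs" code page on Windows.
ansi_ok = can_encode(path_entry, "cp1252")

print(utf8_ok, ansi_ok)
```

Under these assumptions the same entry is usable where the file system encoding is UTF-8 and unusable where it is a narrow code page, which is the asymmetry between POSIX and Windows being discussed.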

...

...
I agree it's a relatively large patch for a release candidate but if prudence suggests deferring it, it should be a *definite* for 2.5.1 and subsequent releases.

I'm not so sure it should. It *is* a new feature: it makes applications possible which aren't possible today, and the documentation does not ever suggest that these applications should have been possible. In fact, it is common knowledge that this currently isn't supported.

It should already work fine on POSIX filesystems that use the default filesystem encoding for path names. As far as I am aware, it is only Windows where it doesn't work.

Cheers, Nick.

"Martin v. Löwis"

12:53 p.m.

Nick Coghlan schrieb:

...

So this is taking something that *already works properly on POSIX systems* and making it work on Windows as well.

I doubt it does without side effects. For example, an application that would go through sys.path, and encode everything with sys.getfilesystemencoding() currently works, but will break if the patch is applied and non-mbcs strings are put on sys.path. Also, what will be the effect on __file__? What value will it have if the module originates from a sys.path entry that is a non-mbcs unicode string? I haven't tested the patch, but it looks like __file__ becomes a unicode string on Windows, and remains a byte string encoded with the file system encoding elsewhere. That's also a change in behavior. Regards, Martin

Steve Holden

6:03 p.m.

Martin v. Löwis wrote:

...

Nick Coghlan schrieb:

...
So this is taking something that *already works properly on POSIX systems* and making it work on Windows as well.

I doubt it does without side effects. For example, an application that would go through sys.path, and encode everything with sys.getfilesystemencoding() currently works, but will break if the patch is applied and non-mbcs strings are put on sys.path.

Also, what will be the effect on __file__? What value will it have if the module originates from a sys.path entry that is a non-mbcs unicode string? I haven't tested the patch, but it looks like __file__ becomes a unicode string on Windows, and remains a byte string encoded with the file system encoding elsewhere. That's also a change in behavior.

Just to summarise my feeling having read the words of those more familiar with the issues than me: it looks like this should be a 2.6 enhancement if it's included at all. I'd like to see it go in, but there do seem to be problems ensuring consistent behaviour across inconsistent platforms.

regards
Steve

David Hopwood

8:56 p.m.

Martin v. Löwis wrote:

...

Nick Coghlan schrieb:

...
So this is taking something that *already works properly on POSIX systems* and making it work on Windows as well.

I doubt it does without side effects. For example, an application that would go through sys.path, and encode everything with sys.getfilesystemencoding() currently works, but will break if the patch is applied and non-mbcs strings are put on sys.path.

Huh? It won't break on any path for which it is not already broken. You seem to be saying "Paths with non-mbcs strings shouldn't work on Windows, because they haven't worked in the past." -- David Hopwood

"Martin v. Löwis"

9:04 p.m.

David Hopwood schrieb:

...

...
I doubt it does without side effects. For example, an application that would go through sys.path, and encode everything with sys.getfilesystemencoding() currently works, but will break if the patch is applied and non-mbcs strings are put on sys.path.

Huh? It won't break on any path for which it is not already broken.

You seem to be saying "Paths with non-mbcs strings shouldn't work on Windows, because they haven't worked in the past."

That's not what I'm saying. I'm saying that it shouldn't work in 2.5.x, because it didn't in 2.5.0. Changing it in 2.6 is fine, along with the incompatibilities it causes. Regards, Martin

Nick Coghlan

10:35 p.m.

David Hopwood wrote:

...

Martin v. Löwis wrote:

...
Nick Coghlan schrieb:

...
So this is taking something that *already works properly on POSIX systems* and making it work on Windows as well.

I doubt it does without side effects. For example, an application that would go through sys.path, and encode everything with sys.getfilesystemencoding() currently works, but will break if the patch is applied and non-mbcs strings are put on sys.path.

Huh? It won't break on any path for which it is not already broken.

You seem to be saying "Paths with non-mbcs strings shouldn't work on Windows, because they haven't worked in the past."

I think MvL is looking at it from the point of view of consumers of the list of strings in sys.path, such as PEP 302 importer and loader objects, and tools like module_finder. Currently, the list of values in sys.path is limited to:

1. 8-bit strings
2. Unicode strings containing only characters which can be encoded using the default file system encoding

For PEP 302 loaders, it is currently correct for them to take the 8-bit string they receive and do "path.decode(sys.getfilesystemencoding())".

Kristján's patch works nicely for his application because he doesn't have to worry about compatibility with existing loaders and utilities. The core doesn't have that luxury.

We *might* be able to find a backwards compatible way to do it that could be put into 2.5.x, but that is effort that could more profitably be spent elsewhere, particularly since the state of the import system in Py3k will be for it to be based entirely on Unicode (as GvR pointed out last time this topic came up [1]).

Cheers, Nick.

[1] http://mail.python.org/pipermail/python-dev/2006-June/066225.html
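A consumer of sys.path that wants to cope with both kinds of entry might look something like the sketch below. It uses Python 3's `bytes` and `str` as stand-ins for the 2.x 8-bit and Unicode string types, and `entry_as_text` is a hypothetical helper, not something taken from the patch or from PEP 302.

```python
import sys

def entry_as_text(entry, fs_encoding=None):
    """Return a path entry as text, decoding it only if it is an 8-bit string."""
    if fs_encoding is None:
        fs_encoding = sys.getfilesystemencoding()
    if isinstance(entry, bytes):
        # The traditional case: an 8-bit string in the file system encoding.
        return entry.decode(fs_encoding)
    # Already text: leave it alone rather than assuming it can be encoded.
    return entry

# Example entries (both invented): one 8-bit path and one text path.
paths = [b"/usr/lib/python", "C:\\模块"]
texts = [entry_as_text(p, "utf-8") for p in paths]
print(texts)
```

The design choice here is to normalise towards text instead of towards bytes, so that an entry which cannot be encoded never has to be. Existing tools that unconditionally decode or encode would still need to be changed, which is the compatibility cost raised in this thread.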

"Martin v. Löwis"

11:12 p.m.

Nick Coghlan schrieb:

...

I think MvL is looking at it from the point of view of consumers of the list of strings in sys.path, such as PEP 302 importer and loader objects, and tools like module_finder. Currently, the list of values in sys.path is limited to:

That, and all kinds of inspection tools. For example, when __file__ of a module object changes to be a Unicode string (which it does under the proposed patch), then these tools break. They currently don't break in that way because putting arbitrary Unicode strings on sys.path doesn't work in the first place. Regards, Martin

David Hopwood

10 Sep

12:22 a.m.

Nick Coghlan wrote:

...

David Hopwood wrote:

...
Martin v. Löwis wrote:

...
Nick Coghlan schrieb:

...
So this is taking something that *already works properly on POSIX systems* and making it work on Windows as well.

I doubt it does without side effects. For example, an application that would go through sys.path, and encode everything with sys.getfilesystemencoding() currently works, but will break if the patch is applied and non-mbcs strings are put on sys.path.

Huh? It won't break on any path for which it is not already broken.

You seem to be saying "Paths with non-mbcs strings shouldn't work on Windows, because they haven't worked in the past."

I think MvL is looking at it from the point of view of consumers of the list of strings in sys.path, such as PEP 302 importer and loader objects, and tools like module_finder. Currently, the list of values in sys.path is limited to:

1. 8-bit strings
2. Unicode strings containing only characters which can be encoded using the default file system encoding

On Windows, file system pathnames can contain arbitrary Unicode characters (well, almost). Despite the existence of "ANSI" filesystem APIs, and regardless of what 'sys.getfilesystemencoding()' returns, the underlying file system encoding for NTFS and FAT filesystems is UTF-16LE.

Thus, either:

- the fact that sys.getfilesystemencoding() returns a non-Unicode encoding on Windows is a bug, or
- any program that relies on sys.getfilesystemencoding() being able to encode arbitrary Windows pathnames has a bug.

We need to decide which of these is the case.

-- David Hopwood
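To make the UTF-16LE point concrete, here is a small sketch in Python 3 syntax showing a name outside any Western code page surviving a round trip through UTF-16LE. The directory name is invented for the example, and the snippet only exercises Python's codec; it does not call any Windows API.

```python
# Hypothetical directory name made of two Chinese characters.
name = "模块"

# Each of these characters fits in one 16-bit code unit, stored little-endian.
raw = name.encode("utf-16-le")
restored = raw.decode("utf-16-le")

print(len(raw), restored == name)
```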

"Martin v. Löwis"

12:46 a.m.

David Hopwood schrieb:

...

On Windows, file system pathnames can contain arbitrary Unicode characters (well, almost). Despite the existence of "ANSI" filesystem APIs, and regardless of what 'sys.getfilesystemencoding()' returns, the underlying file system encoding for NTFS and FAT filesystems is UTF-16LE.

Thus, either:

- the fact that sys.getfilesystemencoding() returns a non-Unicode encoding on Windows is a bug, or
- any program that relies on sys.getfilesystemencoding() being able to encode arbitrary Windows pathnames has a bug.

We need to decide which of these is the case.

There is a third option:

- the operating system has a bug

It is actually this option that rules out the other two. sys.getfilesystemencoding() returns "mbcs" on Windows, which means CP_ACP. The file system encoding is an encoding that converts a file name into a byte string. Unfortunately, on Windows, there are file names which cannot be converted into a byte string in a standard manner. This is an operating system bug (or mis-design; they should have chosen UTF-8 as the byte encoding of file names, instead of making it depend on the system locale, but they of course did so for backwards compatibility with Windows 3.1 and 9x).

As a side note: every encoding in Python is a Unicode encoding; so there aren't any "non-Unicode encodings".

Programs that rely on sys.getfilesystemencoding() being able to represent arbitrary file names on Windows might have a bug; programs that rely on sys.getfilesystemencoding() being able to encode all elements of sys.path do not (at least not for Python 2.5 and earlier).

Regards, Martin
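What "cannot be converted into a byte string in a standard manner" means in practice can be sketched as follows (Python 3 syntax). As an assumption, `cp1252` again stands in for the ANSI code page behind "mbcs", and the file name is invented; the point is only that a lossy conversion yields bytes that no longer identify the original file.

```python
# Hypothetical module file name with two Chinese characters.
name = "模块.py"

# With errors="replace", each unencodable character becomes "?".
lossy = name.encode("cp1252", errors="replace")
round_tripped = lossy.decode("cp1252")

print(lossy, round_tripped == name)
```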

David Hopwood

2:52 a.m.

Martin v. Löwis wrote:

...

David Hopwood schrieb:

...
On Windows, file system pathnames can contain arbitrary Unicode characters (well, almost). Despite the existence of "ANSI" filesystem APIs, and regardless of what 'sys.getfilesystemencoding()' returns, the underlying file system encoding for NTFS and FAT filesystems is UTF-16LE.

Thus, either:

- the fact that sys.getfilesystemencoding() returns a non-Unicode encoding on Windows is a bug, or
- any program that relies on sys.getfilesystemencoding() being able to encode arbitrary Windows pathnames has a bug.

We need to decide which of these is the case.

There is a third option:

- the operating system has a bug

This behaviour is by design. If it is a bug, then it is a "won't ever fix -- no way, no how" bug, that Python must accommodate if it is to properly support Unicode on Windows.

...

It is actually this option that rules out the other two. sys.getfilesystemencoding() returns "mbcs" on Windows, which means CP_ACP. The file system encoding is an encoding that converts a file name into a byte string. Unfortunately, on Windows, there are file names which cannot be converted into a byte string in a standard manner. This is an operating system bug (or mis-design; they should have chosen UTF-8 as the byte encoding of file names, instead of making it depend on the system locale, but they of course did so for backwards compatibility with Windows 3.1 and 9x).

Although UTF-8 was invented (in September 1992) technically before the release of the first version of NT supporting NTFS (NT 3.1 in July 1993), it had not been invented before the decision to use Unicode in NTFS, or in Windows NT's file APIs, had been made. (I believe OS/2 HPFS had not supported Unicode, even though NTFS was otherwise almost identical to it.) At that time, the decision to use Unicode at all was quite forward-looking; the final version of Unicode 1.0 had only been published in June 1992 (although it had been approved earlier; see http://www.unicode.org/history/). UTF-8 was only officially added to the Unicode standard in an appendix of Unicode 2.0 (published July 1996), and only given essentially equal status to UTF-16 and UTF-32 in Unicode 3.0 (September 1999).

...

As a side note: every encoding in Python is a Unicode encoding; so there aren't any "non-Unicode encodings".

It was clear from context that I meant "encoding capable of representing all Unicode characters".

...

Programs that rely on sys.getfilesystemencoding() being able to represent arbitrary file names on Windows might have a bug; programs that rely on sys.getfilesystemencoding() being able to encode all elements of sys.path do not (at least not for Python 2.5 and earlier).

Elements of sys.path can be Unicode strings in Python 2.5, and should be pathnames supported by the underlying OS. Where is it documented that there is any further restriction on them? And why should there be any further restriction on them? -- David Hopwood

"Martin v. Löwis"

3:25 a.m.

David Hopwood schrieb:

...

Elements of sys.path can be Unicode strings in Python 2.5, and should be pathnames supported by the underlying OS. Where is it documented that there is any further restriction on them? And why should there be any further restriction on them?

It's not documented in that detail; if people think it should be documented more thoroughly, that should be done (contributions are welcome). Changing the import machinery to deal with Unicode strings differently cannot be done for Python 2.5, though: it cannot be done for 2.5.0 as the release candidate has already been published, and there is no acceptable patch available at this moment. It cannot be added to 2.5.x as it may reasonably break existing applications. Regards, Martin

Nick Coghlan

7:54 a.m.

David Hopwood wrote:

...

Martin v. Löwis wrote:

...
Programs that rely on sys.getfilesystemencoding() being able to represent arbitrary file names on Windows might have a bug; programs that rely on sys.getfilesystemencoding() being able to encode all elements of sys.path do not (at least not for Python 2.5 and earlier).

Elements of sys.path can be Unicode strings in Python 2.5, and should be pathnames supported by the underlying OS. Where is it documented that there is any further restriction on them? And why should there be any further restriction on them?

There's no suggestion that this limitation shouldn't be fixed - merely that fixing it is likely to break some applications which rely on sys.path for importing or introspection purposes. A 2.5.x maintenance release typically shouldn't break anything that worked correctly on 2.5.0, hence fixing this becomes a project for either 2.6 or 3.0.

To put it another way: fixing this is likely to require changes to more than just the interpreter core. It will also potentially require changes to all applications which currently expect to be able to use 's.encode(sys.getfilesystemencoding())' to convert any Unicode path entry or __file__ attribute to an 8-bit string. Doing that qualifies as correcting a language design error or limitation, but it would require a real stretch of the definition to qualify as a bug fix.

Cheers, Nick.

M.-A. Lemburg

9 Sep

12:42 a.m.

Kristján V. Jónsson wrote:

...

Hello All. I just added patch 1552880 to SourceForge. It is a patch for 2.6 (and 2.5) which allows Unicode paths in sys.path and uses the Unicode file API on Windows. This is tried and tested on 2.5, has been backported to 2.3, and is currently running on clients in China and elsewhere. It is minimally intrusive to the importing mechanism, at the cost of some string conversion overhead (to UTF-8 and then back to Unicode).

+1 on adding it to Python 2.6.

-0 for Python 2.5.x: Applications/modules written for Python 2.4 and 2.5 won't be expecting Unicode strings in sys.path with all the consequences that go with it, so this is a true change in semantics, not just a nice to have additional feature or "bug" fix.

OTOH, those applications will just break in a different place with the patch applied :-)

--
Marc-Andre Lemburg
eGenix.com
Professional Python Services directly from the Source (#1, Sep 08 2006)



29 comments

11 participants:

"Martin v. Löwis"
Anthony Baxter
David Hopwood
Giovanni Bajo
Guido van Rossum
Kristján V. Jónsson
M.-A. Lemburg
Nick Coghlan
Raymond Hettinger
skip＠pobox.com
Steve Holden
