Mailman 3 Some RFE for review - Python-Dev

Some RFE for review

Reinhold Birkenfeld

June 26, 2005

4:57 p.m.

Hi,

while bugs and patches are sometimes tricky to close, RFE can be very easy to decide whether to implement in the first place. So what about working a bit on this front? Here are several RFE reviewed, perhaps some can be closed ("should" is always from submitter's point of view):

1193128:
str.translate(None, delchars) should be allowed and only delete delchars from the string.
Review: may be a good feature, although string._idmap can be passed as the first parameter too.

1226256:
The "path" module by Jason Orendorff should be in the standard library. http://www.jorendorff.com/articles/python/path/
Review: the module is great and seems to have a large user base. On c.l.py there are frequent praises about it.

1216944:
urllib(2) should gain a dict mapping HTTP status codes to the corresponding status/error text.
Review: I can't see anything speaking against it.

1214675:
warnings should get a removefilter() method. An alternative would be to fully document the "filters" attribute to allow direct tinkering with it.
Review: As mwh said in a comment, this isn't Java, so the latter may be the way to go.

1205239:
Shift operands should be allowed to be negative integers, so e.g. a << -2 is the same as a >> 2.
Review: Allowing this would open a source of bugs previously well identifiable.

1152248:
In order to read "records" separated by something other than newline, file objects should either support an additional parameter (the separator) to (x)readlines(), or gain an additional method which does this.
Review: The former is a no-go, I think, because what is read won't be lines. The latter is further complicating the file interface, so I would follow the principle that not every 3-line function should be builtin.

1110010:
A function "attrmap" should be introduced which is used as follows:

    attrmap(x)['att'] == getattr(x, 'att')

The submitter mentions the use case of new-style classes without a __dict__ used at the right of %-style string interpolation.
Review: I don't know whether this is worth it.

1052098:
An environment variable should be supported to set the default encoding.
Review: If one wants this for a single application, he can still implement it himself.

985094:
getattr should support a callable as the second argument, used as follows:

    getattr(obj, func) == func(obj)

Review: Highly unintuitive to me.

That's all for today; sorry if it was too much ;)

Reinhold

--
Mail address is perfectly valid!
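
To make the review of RFE 1152248 concrete, here is a rough sketch of the kind of small helper the "not every 3-line function should be builtin" argument has in mind. It is written for present-day Python 3 rather than the 2005 interpreter, and the names read_records and chunk_size are invented for this illustration; they are not taken from the RFE.

```python
import io


def read_records(stream, separator, chunk_size=4096):
    """Yield records from a text stream, split on an arbitrary separator."""
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last separator is complete; keep the tail,
        # which may be a partial record (or a partial separator).
        *complete, buffer = buffer.split(separator)
        yield from complete
    if buffer:
        # Final record that was not followed by a separator.
        yield buffer


# Hypothetical usage: records separated by ";" instead of newlines.
records = list(read_records(io.StringIO("alpha;beta;gamma"), ";", chunk_size=4))
```

Reading in chunks keeps memory use bounded for large files, which is the main thing a naive stream.read().split(separator) would give up.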

Phillip J. Eby

June 2005

6:17 p.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

At 06:57 PM 6/26/2005 +0200, Reinhold Birkenfeld wrote:

...

1226256: The "path" module by Jason Orendorff should be in the standard library. http://www.jorendorff.com/articles/python/path/ Review: the module is great and seems to have a large user base. On c.l.py there are frequent praises about it.

I would note that there are some things in the interface that should be cleaned up before it becomes a stdlib module. It has many ways to do the same thing, and many of its property and method names are confusing because they either do the same thing as a standard function, but have a different name (like the 'parent' property that is os.path.dirname in disguise), or they have the same name as a standard function but do something different (like the 'listdir()' method that returns full paths rather than just filenames).

I'm also not keen on the fact that it makes certain things properties whose value can change over time; i.e. ctime/mtime/atime and size really shouldn't be properties, but rather methods.

I'm also not sure how I feel about all the read/write methods that hide the use of file objects; these seem like they should remain file object methods, especially since PEP 343 will allow easy closing with something like:

    with closing(some_path.open('w')) as f:
        f.write(data)

Granted, this is more verbose than:

    some_path.write_bytes(data)

But brevity isn't always everything. If these methods are kept I would suggest using different names, like "set_bytes()", "set_text()", and "set_lines()", because "write" sounds like something you do on an ongoing basis to a stream, while these methods just replace the file's entire contents.

Aside from all these concerns, I'm +1 on adding the module.

Here's my list of suggested changes:

* path.joinpath(*args) -> path.subpath(*args)
* path.listdir() -> path.subpaths()
* path.splitall() -> path.parts()
* path.parent -> path.dirname (and drop dirname() method)
* path.name -> path.filename (and drop basename() method)
* path.namebase -> path.filebase (maybe something more descriptive?)
* path.atime/mtime/ctime -> path.atime(), path.mtime(), path.ctime()
* path.size -> path.filesize()
* drop getcwd(); it makes no sense on a path instance
* add a samepath() method that compares the absolute, case and path-normalized versions of two paths, and a samerealpath() method that does the same but with symlinks resolved.

And, assuming these file-content methods are kept:

* path.bytes() -> path.get_file_bytes()
* path.write_bytes() -> path.set_file_bytes() and path.append_file_bytes()
* path.text() -> path.get_file_text()
* path.write_text() -> path.set_file_text() and path.append_file_text()
* path.lines() -> path.get_file_lines()
* path.write_lines() -> path.set_file_lines() and path.append_file_lines()
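
The samepath() and samerealpath() suggestions in the list above are described only in words, so a minimal sketch may help. This is one possible reading built on existing os.path functions, written as plain functions rather than methods; it is an assumption about what the proposal means, not code from the path module.

```python
import os


def samepath(a, b):
    """Compare two paths after making them absolute and normalizing
    case and separators. Symlinks are not resolved."""
    def norm(p):
        # abspath() also collapses ".." and redundant separators.
        return os.path.normcase(os.path.abspath(p))
    return norm(a) == norm(b)


def samerealpath(a, b):
    """Like samepath(), but with symlinks resolved first."""
    def norm(p):
        return os.path.normcase(os.path.realpath(p))
    return norm(a) == norm(b)


# Hypothetical usage: two spellings of the same location.
same = samepath(os.path.join("a", os.pardir, "b"), "b")
```

Note that os.path.normcase() only folds case on platforms where the file system convention is case-insensitive, so on POSIX "A" and "a" still compare as different.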

Reinhold Birkenfeld

7 p.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

Phillip J. Eby wrote:

...

At 06:57 PM 6/26/2005 +0200, Reinhold Birkenfeld wrote:

...
1226256: The "path" module by Jason Orendorff should be in the standard library. http://www.jorendorff.com/articles/python/path/ Review: the module is great and seems to have a large user base. On c.l.py there are frequent praises about it.

[...]

...

Aside from all these concerns, I'm +1 on adding the module.

Here's my list of suggested changes:

[...]

I agree with your changes list.

One more issue is open: the one of naming. As "path" is already the name of a module, what would the new object be called to avoid confusion? pathobj? objpath? Path?

Reinhold

--
Mail address is perfectly valid!

Phillip J. Eby

12:46 a.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

At 09:00 PM 6/26/2005 +0200, Reinhold Birkenfeld wrote:

...

One more issue is open: the one of naming. As "path" is already the name of a module, what would the new object be called to avoid confusion? pathobj? objpath? Path?

I was thinking os.Path, myself.

Michael Hoffman

7:19 p.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

On Sun, 26 Jun 2005, Phillip J. Eby wrote:

...

* drop getcwd(); it makes no sense on a path instance

Personally I use path.getcwd() as a class method all the time. It makes as much sense as fromkeys() does on a dict instance, which is technically possible but non-sensical.

...

And, assuming these file-content methods are kept:

* path.bytes() -> path.get_file_bytes() * path.write_bytes() -> path.set_file_bytes() and path.append_file_bytes() * path.text() -> path.get_file_text() * path.write_text() -> path.set_file_text() and path.append_file_text() * path.lines() -> path.get_file_lines() * path.write_lines() -> path.set_file_lines() and path.append_file_lines()

I don't know how often these are used. I don't use them myself. I am mainly interested in this module so that I don't have to use os.path anymore.

Reinhold Birkenfeld wrote:

...

One more issue is open: the one of naming. As "path" is already the name of a module, what would the new object be called to avoid confusion? pathobj? objpath? Path?

I would argue for Path. It fits with the recent cases of:

    from sets import Set
    from decimal import Decimal

--
Michael Hoffman <hoffman@ebi.ac.uk>
European Bioinformatics Institute

Phillip J. Eby

12:49 a.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

At 08:19 PM 6/26/2005 +0100, Michael Hoffman wrote:

...

On Sun, 26 Jun 2005, Phillip J. Eby wrote:

...
* drop getcwd(); it makes no sense on a path instance

Personally I use path.getcwd() as a class method all the time. It makes as much sense as fromkeys() does on a dict instance, which is technically possible but non-sensical.

It's also duplication with os.path; I'm -1 on creating a new staticmethod for it.

...

Reinhold Birkenfeld wrote:

...
One more issue is open: the one of naming. As "path" is already the name of a module, what would the new object be called to avoid confusion? pathobj? objpath? Path?

I would argue for Path. It fits with the recent cases of:

    from sets import Set
    from decimal import Decimal

I like it too, as a class in the os module.

Michael Hoffman

7:20 a.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

On Sun, 26 Jun 2005, Phillip J. Eby wrote:

...

At 08:19 PM 6/26/2005 +0100, Michael Hoffman wrote:

...
On Sun, 26 Jun 2005, Phillip J. Eby wrote:

...
* drop getcwd(); it makes no sense on a path instance

Personally I use path.getcwd() as a class method all the time. It makes as much sense as fromkeys() does on a dict instance, which is technically possible but non-sensical.

It's also duplication with os.path; I'm -1 on creating a new staticmethod for it.

os.getcwd() returns a string, but path.getcwd() returns a new path object. Almost everything in path is a duplication of os.path--the difference is that the path methods start and end with path objects.

--
Michael Hoffman <hoffman@ebi.ac.uk>
European Bioinformatics Institute

Reinhold Birkenfeld

7:29 a.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

Michael Hoffman wrote:

...

On Sun, 26 Jun 2005, Phillip J. Eby wrote:

...
At 08:19 PM 6/26/2005 +0100, Michael Hoffman wrote:

...
On Sun, 26 Jun 2005, Phillip J. Eby wrote:

...
* drop getcwd(); it makes no sense on a path instance

Personally I use path.getcwd() as a class method all the time. It makes as much sense as fromkeys() does on a dict instance, which is technically possible but non-sensical.

It's also duplication with os.path; I'm -1 on creating a new staticmethod for it.

os.getcwd() returns a string, but path.getcwd() returns a new path object. Almost everything in path is a duplication of os.path--the difference is that the path methods start and end with path objects.

+1.

Reinhold

--
Mail address is perfectly valid!

Phillip J. Eby

3:06 p.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

At 08:20 AM 6/27/2005 +0100, Michael Hoffman wrote:

...

os.getcwd() returns a string, but path.getcwd() returns a new path object.

In that case, I'd expect it to be 'path.fromcwd()' or 'path.cwd()'; i.e. a constructor classmethod by analogy with 'dict.fromkeys()' or 'datetime.now()'. 'getcwd()' looks like it's getting a property of a path instance, and doesn't match stdlib conventions for constructors. So, +1 as long as it's called cwd() or something better (i.e. clearer and/or more consistent with stdlib constructor conventions).
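
The "constructor classmethod" convention referred to here can be sketched in a few lines. The Path class below is a hypothetical stand-in (a bare str subclass), not the actual path module; only os.getcwd() is a real API.

```python
import os


class Path(str):
    """Hypothetical string-based path class, for illustration only."""

    @classmethod
    def cwd(cls):
        # Alternate constructor, in the style of dict.fromkeys() or
        # datetime.now(): called on the class, returns a new instance.
        return cls(os.getcwd())


here = Path.cwd()
```

Because cwd() receives the class as cls, calling it on a subclass returns an instance of that subclass, which is the usual reason to prefer a classmethod over a staticmethod for constructors.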

Reinhold Birkenfeld

3:10 p.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

Phillip J. Eby wrote:

...

At 08:20 AM 6/27/2005 +0100, Michael Hoffman wrote:

...
os.getcwd() returns a string, but path.getcwd() returns a new path object.

In that case, I'd expect it to be 'path.fromcwd()' or 'path.cwd()'; i.e. a constructor classmethod by analogy with 'dict.fromkeys()' or 'datetime.now()'. 'getcwd()' looks like it's getting a property of a path instance, and doesn't match stdlib conventions for constructors.

So, +1 as long as it's called cwd() or something better (i.e. clearer and/or more consistent with stdlib constructor conventions).

You're right. +1 for calling it fromcwd().

With that settled, should I rewrite the module? Should I write a PEP?

Reinhold

--
Mail address is perfectly valid!

Phillip J. Eby

3:35 p.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

At 05:10 PM 6/27/2005 +0200, Reinhold Birkenfeld wrote:

...

Phillip J. Eby wrote:

...
At 08:20 AM 6/27/2005 +0100, Michael Hoffman wrote:

...
os.getcwd() returns a string, but path.getcwd() returns a new path object.

In that case, I'd expect it to be 'path.fromcwd()' or 'path.cwd()'; i.e. a constructor classmethod by analogy with 'dict.fromkeys()' or 'datetime.now()'. 'getcwd()' looks like it's getting a property of a path instance, and doesn't match stdlib conventions for constructors.

So, +1 as long as it's called cwd() or something better (i.e. clearer and/or more consistent with stdlib constructor conventions).

You're right. +1 for calling it fromcwd().

I'm leaning slightly towards .cwd() for symmetry with datetime.now(), but not enough to argue about it if nobody has objections to fromcwd().

...

With that settled, should I rewrite the module? Should I write a PEP?

I think the only questions remaining open are where to put it and what to call the class. I think we should put it in os.path, such that 'from os.path import path' gives you the path class for your platform, and using one of the path modules directly (e.g. 'from posixpath import path') gives you the specific platform's version.

This is useful because sometimes you need to manipulate paths that are foreign to your current OS. For example, the distutils and other packages sometimes use POSIX paths for input and then convert them to local OS paths. Also, POSIX path objects would be useful for creating or parsing the "path" portion of many kinds of URLs, and I have often used functions from posixpath for that myself.

As for a PEP, I doubt a PEP is really required for something this simple; I have never seen anyone say, "no, we shouldn't have this in the stdlib". I think it would be more important to write reference documentation and a complete test suite.

By the way, it also occurs to me that for the sake of subclassability, the methods should not return 'path(somestr)' when creating new objects, but instead use self.__class__(somestr).
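
The subclassability point at the end of this message is easy to show in code. Both classes below are hypothetical sketches, not the real path module, and the method names joinpath and parent are borrowed from the earlier discussion purely for illustration.

```python
import os.path


class Path(str):
    """Hypothetical string-based path class, for illustration only."""

    def joinpath(self, *parts):
        # self.__class__ (rather than a hard-coded Path) means a
        # subclass gets instances of itself back.
        return self.__class__(os.path.join(self, *parts))

    @property
    def parent(self):
        return self.__class__(os.path.dirname(self))


class ProjectPath(Path):
    """Hypothetical subclass adding an application-specific helper."""

    def is_python_source(self):
        return self.endswith(".py")


p = ProjectPath("src").joinpath("pkg", "mod.py")
```

Had joinpath() returned Path(...) directly, p would be a plain Path and p.is_python_source() would raise AttributeError.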

Guido van Rossum

6:43 a.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

On 6/27/05, Phillip J. Eby <pje@telecommunity.com> wrote:

...

I think the only questions remaining open are where to put it and what to call the class.

Whoa! Do we really need a completely different mechanism for doing the same stuff we can already do? The path module seems mostly useful for folks coming from Java who are used to the Java Path class. With the massive duplication of functionality we should also consider what to recommend for the future: will the old os.path module be deprecated, or are we going to maintain both alternatives forever? (And what about all the duplication with the os module itself, like the cwd() constructor?) Remember TOOWTDI.

...

I think we should put it in os.path, such that 'from os.path import path' gives you the path class for your platform, and using one of the path modules directly (e.g. 'from posixpath import path') gives you the specific platform's version.

Aargh! Call it anything except path. Having two things nested inside each other with the same name is begging for confusion forever. We have a few of these in the stdlib now (StringIO, UserDict etc.) and they were MISTAKES.

...

This is useful because sometimes you need to manipulate paths that are foreign to your current OS. For example, the distutils and other packages sometimes use POSIX paths for input and then convert them to local OS paths. Also, POSIX path objects would be useful for creating or parsing the "path" portion of many kinds of URLs, and I have often used functions from posixpath for that myself.

Right. That's why posixpath etc. always exists, not only when os.name == "posix".

...

As for a PEP, I doubt a PEP is really required for something this simple; I have never seen anyone say, "no, we shouldn't have this in the stdlib". I think it would be more important to write reference documentation and a complete test suite.

"No, we shouldn't have this in the stdlib." At least, not without more motivation than "it gets high praise".

...

By the way, it also occurs to me that for the sake of subclassability, the methods should not return 'path(somestr)' when creating new objects, but instead use self.__class__(somestr).

Clearly it needs a PEP.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
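
The remark that posixpath always exists, whatever the host OS, can be illustrated with the existing function-based modules. The example paths below are made up for the illustration; posixpath and ntpath are the real stdlib modules.

```python
import ntpath
import posixpath

# POSIX rules work for the "path" portion of a URL on any platform.
url_path = "/articles/python/path/index.html"
directory, filename = posixpath.split(url_path)
sibling = posixpath.join(directory, "path.py")

# Windows rules are likewise available everywhere via ntpath.
windows_style = ntpath.join("C:\\docs", "path", "index.html")
```

os.path is simply whichever of these modules matches the running platform, so code that must follow one convention regardless of the host can import that module by name.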

Gerrit Holl

10:15 a.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

Guido van Rossum wrote:

...

...
By the way, it also occurs to me that for the sake of subclassability, the methods should not return 'path(somestr)' when creating new objects, but instead use self.__class__(somestr).

Clearly it needs a PEP.

I haven't read the rest of the thread.

Once, there was a path-related pre-PEP. http://topjaklont.student.utwente.nl/creaties/path/pep-xxxx.html It was never finished, nor am I working on it, but it's public domain. Just wanted to remind potential PEP-authors.

regards,
Gerrit.

--
Weather in Twenthe, Netherlands 29/06 10:55: 18.0°C Few clouds mostly cloudy wind 1.3 m/s None (57 m above NAP)
--
In the councils of government, we must guard against the acquisition of unwarranted influence, whether sought or unsought, by the military-industrial complex. The potential for the disastrous rise of misplaced power exists and will persist. -Dwight David Eisenhower, January 17, 1961

Neil Hodgson

11:46 p.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

Guido van Rossum:

...

Whoa! Do we really need a completely different mechanism for doing the same stuff we can already do?

One benefit I see for the path module is that it makes it easier to write code that behaves correctly with unicode paths on Windows. Currently, to implement code that may see unicode paths, you must first understand that unicode paths may be an issue, then write conditional code that uses either a string or unicode string to hold paths whenever a new path is created.

Neil

Thomas Heller

July 2005

6:45 a.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

...

Guido van Rossum:

...
Whoa! Do we really need a completely different mechanism for doing the same stuff we can already do?

Neil Hodgson <nyamatongwe@gmail.com> writes:

...

One benefit I see for the path module is that it makes it easier to write code that behaves correctly with unicode paths on Windows. Currently, to implement code that may see unicode paths, you must first understand that unicode paths may be an issue, then write conditional code that uses either a string or unicode string to hold paths whenever a new path is created.

Indeed. This would probably handle the cases where you have to manipulate file paths in code.

OTOH, Python is lacking a lot when you have to handle unicode strings on sys.path, in command line arguments, environment variables and maybe other places. See, for example http://mail.python.org/pipermail/python-list/2004-December/256969.html

I had started to work on the sys.path unicode issues, but it seems a considerable rewrite of (not only) Python/import.c is required. But I fear the patch http://python.org/sf/1093253 is slowly getting out of date.

Thomas

Neil Hodgson

2:08 a.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

Thomas Heller:

...

OTOH, Python is lacking a lot when you have to handle unicode strings on sys.path, in command line arguments, environment variables and maybe other places.

A new patch #1231336 "Add unicode for sys.argv, os.environ, os.system" is now in SourceForge. New parallel features sys.argvu and os.environu are provided and os.system accepts unicode arguments similar to PEP 277. A screenshot showing why the existing features are inadequate and the new features an enhancement is at http://www.scintilla.org/pyunicode.png

One problem is that when using "python -c cmd args", sys.argvu includes the "cmd" but sys.argv does not. They both contain the "-c".

os.system was changed to make it easier to add some test cases but then that looked like too much trouble. There are far too many variants on exec*, spawn* and popen* to write a quick patch for these.

Neil

Thomas Heller

6:58 p.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

Neil Hodgson <nyamatongwe@gmail.com> writes:

...

Thomas Heller:

...
OTOH, Python is lacking a lot when you have to handle unicode strings on sys.path, in command line arguments, environment variables and maybe other places.

A new patch #1231336 "Add unicode for sys.argv, os.environ, os.system" is now in SourceForge. New parallel features sys.argvu and os.environu are provided and os.system accepts unicode arguments similar to PEP 277. A screenshot showing why the existing features are inadequate and the new features an enhancement are at http://www.scintilla.org/pyunicode.png One problem is that when using "python -c cmd args", sys.argvu includes the "cmd" but sys.argv does not. They both contain the "-c".

Not only that, all the other flags like -O and -E are also in sys.argvu but not in sys.argv.

...

os.system was changed to make it easier to add some test cases but then that looked like too much trouble. There are far too many variants on exec*, spawn* and popen* to write a quick patch for these.

Those are nearly obsoleted by the subprocess module (although I do not know how that handles unicode).

Thomas

Neil Hodgson

4:40 a.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

Thomas Heller:

...

Not only that, all the other flags like -O and -E are also in sys.argvu but not in sys.argv.

OK, new patch fixes these and the "-c" issue.

...

Those are nearly obsoleted by the subprocess module (although I do not know how that handles unicode).

...

    >>> z = subprocess.Popen(u"cmd /c echo \u0417")
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "c:\zed\python\dist\src\lib\subprocess.py", line 600, in __init__
        errread, errwrite)
      File "c:\zed\python\dist\src\lib\subprocess.py", line 791, in _execute_child
        startupinfo)
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u0417' in position 12: ordinal not in range(128)

It breaks. The argspec is zzOOiiOzO:CreateProcess.

Neil

Guido van Rossum

6:28 p.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

On 6/30/05, Neil Hodgson <nyamatongwe@gmail.com> wrote:

...

One benefit I see for the path module is that it makes it easier to write code that behaves correctly with unicode paths on Windows. Currently, to implement code that may see unicode paths, you must first understand that unicode paths may be an issue, then write conditional code that uses either a string or unicode string to hold paths whenever a new path is created.

Then maybe the code that handles Unicode paths in arguments should be fixed rather than adding a module that encapsulates a work-around...

--
--Guido van Rossum (home page: http://www.python.org/~guido/)

Neil Hodgson

12:18 a.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

Guido van Rossum:

...

Then maybe the code that handles Unicode paths in arguments should be fixed rather than adding a module that encapsulates a work-around...

It isn't clear whether you are saying this should be fixed by the user or in the library. For a quick example, say someone wrote some code for counting lines in a directory:

    import os
    root = "docs"
    lines = 0
    for p in os.listdir(root):
        lines += len(file(os.path.join(root,p)).readlines())
    print lines, "document lines"

Quite common code. Running it now with one file "abc" in the directory yields correct behaviour:

    pythonw -u "xlines.py"
    1 document lines

Now copy the file "Здравствуйте" into the directory and run it again:

    pythonw -u "xlines.py"
    Traceback (most recent call last):
      File "xlines.py", line 5, in ?
        lines += len(file(os.path.join(root,p)).readlines())
    IOError: [Errno 2] No such file or directory: 'docs\\????????????'

Changing line 2 to [root = u"docs"] will make the code work. If this is the correct fix then all file handling code should be written using unicode names.

Contrast this to using path:

    import path
    root = "docs"
    lines = 0
    for p in path.path(root).files():
        lines += len(file(p).readlines())
    print lines, "document lines"

The obvious code works with only "abc" in the directory and also when "Здравствуйте" is added.

Now, if you are saying it is a library failure, then there are multiple ways to fix it.

1) os.listdir should always return unicode. The problem with this is that people will see breakage of existing scripts because of promotion issues. Much existing code assumes a fixed locale, often 8859-1 and combining unicode and accented characters will raise UnicodeDecodeError.

2) os.listdir should not return "???????" garbage, instead promoting to unicode whenever it sees garbage. This may also lead to UnicodeDecodeError as in (1).

3) This is an exceptional situation but the exception should be more explicit and raised earlier when os.listdir first encounters name garbage.

Neil
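
For readers on a current interpreter, the line-counting example can be restated as a rough Python 3 sketch. It uses pathlib, a later standard-library module that is not the path module under discussion, and the function name count_lines and the temporary "docs" directory are assumptions made for the illustration. It also assumes the file system accepts non-ASCII names.

```python
import tempfile
from pathlib import Path


def count_lines(root):
    """Total number of lines across the regular files directly in root."""
    total = 0
    for entry in Path(root).iterdir():
        if entry.is_file():
            # Binary mode: counting lines needs no text decoding.
            with entry.open("rb") as f:
                total += sum(1 for _ in f)
    return total


# Recreate the two-file scenario from the message in a scratch directory.
with tempfile.TemporaryDirectory() as docs:
    Path(docs, "abc").write_text("one line\n", encoding="utf-8")
    Path(docs, "Здравствуйте").write_text("first\nsecond\n", encoding="utf-8")
    total = count_lines(docs)
    print(total, "document lines")
```

In Python 3 the plain string literal is already Unicode, so the narrow versus wide distinction that trips up the original script does not arise in this form.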

Guido van Rossum

7:44 a.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

...

Guido van Rossum:

...
Then maybe the code that handles Unicode paths in arguments should be fixed rather than adding a module that encapsulates a work-around...

On 7/3/05, Neil Hodgson <nyamatongwe@gmail.com> wrote:

...

It isn't clear whether you are saying this should be fixed by the user or in the library.

I meant the library.

...

For a quick example, say someone wrote some code for counting lines in a directory: [deleted]

Ah, sigh. I didn't know that os.listdir() behaves differently when the argument is Unicode. Does os.listdir(".") really behave differently than os.listdir(u".")? Bah! I don't think that's a very good design (although I see where it comes from). Promoting only those entries that need it seems the right solution -- user code that can't deal with the Unicode entries shouldn't be used around directories containing unicode -- if it needs to work around unicode it should be fixed to support that! Mapping Unicode names to "?????" seems the wrong behavior (and doesn't work very well once you try to do anything with those names except for printing).

Face it. Unicode stinks (from the programmer's POV). But we'll have to live with it. In Python 3.0 I want str and unicode to be the same data type (like String in Java) and I want a separate data type to hold a byte array.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
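
Python 3 did end up with a single Unicode str type and a separate bytes type, and os.listdir() there still returns names of the same type as its argument. A small sketch of that behaviour follows; the scratch directory and the file name "abc" are chosen only for the illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    # Create one empty file so the listing is not empty.
    open(os.path.join(d, "abc"), "w").close()

    text_names = os.listdir(d)               # str argument -> list of str
    byte_names = os.listdir(os.fsencode(d))  # bytes argument -> list of bytes
```

The difference from the 2005 behaviour is that the default (str) form is the Unicode one, so the lossy "?????" substitution no longer shows up in ordinary code.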

Neil Hodgson

11:11 a.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

Guido van Rossum:

...

Ah, sigh. I didn't know that os.listdir() behaves differently when the argument is Unicode. Does os.listdir(".") really behave differently than os.listdir(u".")?

Yes:

...

    >>> os.listdir(".")
    ['abc', '????????????']
    >>> os.listdir(u".")
    [u'abc', u'\u0417\u0434\u0440\u0430\u0432\u0441\u0442\u0432\u0443\u0439\u0442\u0435']

...

Bah! I don't think that's a very good design (although I see where it comes from).

Partly my fault. At the time I was more concerned with making functionality possible rather than convenient.

...

Promoting only those entries that need it seems the right solution -- user code that can't deal with the Unicode entries shouldn't be used around directories containing unicode -- if it needs to work around unicode it should be fixed to support that!

OK, I'll work on a patch for that but I'd like to see the opinions of the usual unicode guys as this will produce more opportunities for UnicodeDecodeError. The modification will probably work in the opposite way, asking for all the names in unicode and then attempting to convert to the default code page with failures retaining the unicode name.

Neil
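
The "opposite way" strategy described here can be sketched as follows. This is not the patch itself: it is a Python 3 illustration in which bytes and str stand in for Python 2's str and unicode, the function name demote_names is invented, and the encoding argument stands in for the Windows default code page.

```python
def demote_names(unicode_names, encoding):
    """Convert each name to a narrow (encoded) string where possible,
    keeping the Unicode name when the encoding cannot represent it."""
    result = []
    for name in unicode_names:
        try:
            result.append(name.encode(encoding))
        except UnicodeEncodeError:
            # Not representable in the code page: retain the Unicode name
            # instead of substituting "?" characters.
            result.append(name)
    return result


# Hypothetical directory listing on a system whose code page is cp1252.
mixed = demote_names(["abc", "Здравствуйте"], "cp1252")
```

The result is a list of mixed types, which is exactly why the message expects more opportunities for UnicodeDecodeError in code that later combines the entries.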

Thomas Heller

9:54 a.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

Neil Hodgson <nyamatongwe@gmail.com> writes:

...

Guido van Rossum:

...
Ah, sigh. I didn't know that os.listdir() behaves differently when the argument is Unicode. Does os.listdir(".") really behave differently than os.listdir(u".")?

Yes:

...
    >>> os.listdir(".")
    ['abc', '????????????']
    >>> os.listdir(u".")
    [u'abc', u'\u0417\u0434\u0440\u0430\u0432\u0441\u0442\u0432\u0443\u0439\u0442\u0435']

...
Bah! I don't think that's a very good design (although I see where it comes from).

Partly my fault. At the time I was more concerned with making functionality possible rather than convenient.

...
Promoting only those entries that need it seems the right solution -- user code that can't deal with the Unicode entries shouldn't be used around directories containing unicode -- if it needs to work around unicode it should be fixed to support that!

I'm sorry but that's not my opinion. Code that can't deal with unicode entries is broken, imo. The programmer does not know where the user runs this code or what he throws at it. I think that this will hide bugs.

When I installed the first game written in Python with pygame on my daughter's PC it didn't run, simply because there was a font listed in the registry which contained umlauts somewhere.

OTOH, I once had a bug report from a py2exe user who complained that the program didn't start when installed in a path with japanese characters on it. I tried this out, the bug existed (and still exists), but I was astonished how many programs behaved the same: On a PC with english language settings, you cannot start WinZip or Acrobat Reader (to give just some examples) on a .zip or .pdf file contained in such a directory.

...

OK, I'll work on a patch for that but I'd like to see the opinions of the usual unicode guys as this will produce more opportunities for UnicodeDecodeError. The modification will probably work in the opposite way, asking for all the names in unicode and then attempting to convert to the default code page with failures retaining the unicode name.

Thomas

Neil Hodgson

12:43 p.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

Thomas Heller:

...

OTOH, I once had a bug report from a py2exe user who complained that the program didn't start when installed in a path with japanese characters on it. I tried this out, the bug existed (and still exists), but I was astonished how many programs behaved the same: On a PC with english language settings, you cannot start WinZip or Acrobat Reader (to give just some examples) on a .zip or .pdf file contained in such a directory.

Much of the time these sorts of bugs don't make themselves too hard to live with because most non-ASCII names that any user encounters are still in the user's locale and so get mapped by Windows.

It can be a lot of work supporting wide file names. I have just added wide file name support to my editor, SciTE, for the second time and am about to rip it out again as it complicates too much code for too few beneficiaries. (I want one executable for both Windows NT+ and 9x, so wide file names have to be a runtime choice leading to maybe 50 new branches in the code).

If returning a mixture of unicode and narrow strings from os.listdir is the right thing to do then maybe it is better for sys.argv and os.environ to also be mixtures. In patch #1231336 I added parallel attributes, sys.argvu and os.environu to hold unicode versions of this information. The alternative, placing unicode items in the existing attributes minimises API size.

One question here is whether unicode items should be added only when the element is outside the user's locale (the CP_ACP code page) or whenever the item is outside ASCII. The former is more similar to existing behaviour but the latter is safer as it makes it harder to implicitly treat the data as being in an incorrect encoding.

Neil

Thomas Heller

2:48 p.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

Neil Hodgson <nyamatongwe@gmail.com> writes:

...

Thomas Heller:

...
OTOH, I once had a bug report from a py2exe user who complained that the program didn't start when installed in a path with Japanese characters in it. I tried this out, the bug existed (and still exists), but I was astonished how many programs behaved the same: On a PC with English language settings, you cannot start WinZip or Acrobat Reader (to give just some examples) on a .zip or .pdf file contained in such a directory.

Much of the time these sorts of bugs don't make themselves too hard to live with because most non-ASCII names that any user encounters are still in the user's locale and so get mapped by Windows.

...

It can be a lot of work supporting wide file names. I have just added wide file name support to my editor, SciTE, for the second time and am about to rip it out again as it complicates too much code for too few beneficiaries. (I want one executable for both Windows NT+ and 9x, so wide file names has to be a runtime choice leading to maybe 50 new branches in the code).

In Python, the basic support for unicode file and path names is already there. It is no problem to open a file named u'\u5b66\u6821\u30c7\u30fc\\blah.py' on WinXP with a German locale. But adding u'\u5b66\u6821\u30c7\u30fc' to sys.path won't allow importing this file as a module. Internally, Python/import.c converts everything to strings. I started to refactor import.c to work with PyStringObjects instead of char buffers as a first step (PyUnicodeObjects could have been added later), but I gave up because there seems to be absolutely zero interest in it. OK, it makes no sense to have Python modules in directories with these filenames, but Python itself (especially when frozen or py2exe'd) could easily live in such a directory.

...

If returning a mixture of unicode and narrow strings from os.listdir is the right thing to do then maybe it is better for sys.argv and os.environ to also be mixtures. In patch #1231336 I added parallel attributes, sys.argvu and os.environu, to hold unicode versions of this information. The alternative, placing unicode items in the existing attributes, minimises API size.

One question here is whether unicode items should be added only when the element is outside the user's locale (the CP_ACP code page) or whenever the item is outside ASCII. The former is more similar to existing behaviour but the latter is safer as it makes it harder to implicitly treat the data as being in an incorrect encoding.

I can't judge on this - but it's easy to experiment with it, even in current Python releases since sys.argvu, os.environu can also be provided by extension modules.

But thanks that you care about this stuff - I'm a little bit worried because all the other folks seem to think everything's ok (?).

Thomas

Neil Hodgson

3:50 a.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

Thomas Heller:

...

But adding u'\u5b66\u6821\u30c7\u30fc' to sys.path won't allow importing this file as a module. Internally, Python/import.c converts everything to strings. I started to refactor import.c to work with PyStringObjects instead of char buffers as a first step (PyUnicodeObjects could have been added later), but I gave up because there seems to be absolutely zero interest in it.

Well, most people when confronted with this will rename the directory to something simple like "ulib" and continue.

...

I can't judge on this - but it's easy to experiment with it, even in current Python releases since sys.argvu, os.environu can also be provided by extension modules.

It is the effect of this on the non-unicode-savvy that is important: if os.environu goes into prereleases of 2.5 then the only people that will use it are likely to be those who already try to keep their code unicode compliant. There is only likely to be (negative) feedback if existing features are made unicode-only or use unicode for non-ASCII.

...

But thanks that you care about this stuff - I'm a little bit worried because all the other folks seem to think everything's ok (?).

Unicode is becoming more of an issue: many Linux distributions now install by default with a UTF-8 locale and other tools are starting to use this: GCC 4 now delivers error messages using Unicode quote characters like ‘these’ rather than `these'. There are 131 threads found by Google Groups for (UnicodeEncodeError OR UnicodeDecodeError) and 21 of these were in this June. A large proportion of the threads are in language-specific groups so are not as visible to core developers.

Neil

M.-A. Lemburg

11:06 a.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

Neil Hodgson wrote:

...

Thomas Heller:

...
But adding u'\u5b66\u6821\u30c7\u30fc' to sys.path won't allow importing this file as a module. Internally, Python/import.c converts everything to strings. I started to refactor import.c to work with PyStringObjects instead of char buffers as a first step (PyUnicodeObjects could have been added later), but I gave up because there seems to be absolutely zero interest in it.

Well, most people when confronted with this will rename the directory to something simple like "ulib" and continue.

I don't really buy this "trick": what if you happen to have a home directory with Unicode characters in it ?

...

...
I can't judge on this - but it's easy to experiment with it, even in current Python releases since sys.argvu, os.environu can also be provided by extension modules.

It is the effect of this on the non-unicode-savvy that is important: if os.environu goes into prereleases of 2.5 then the only people that will use it are likely to be those who already try to keep their code unicode compliant. There is only likely to be (negative) feedback if existing features are made unicode-only or use unicode for non-ASCII.

I don't like the idea of creating a parallel universe for Unicode - OSes are starting to integrate Unicode filenames rather quickly (UTF-8 on Unix, UTF-16-LE on Windows), so it's much better to follow them and start accepting Unicode in sys.path. Wouldn't it be easy to have the import logic convert Unicode entries in sys.path to whatever the OS uses internally (UTF-8 or UTF-16-LE) and then keep the char buffers in place ?

...

...
But thanks that you care about this stuff - I'm a little bit worried because all the other folks seem to think everything's ok (?).

Unicode is becoming more of an issue: many Linux distributions now install by default with a UTF-8 locale and other tools are starting to use this: GCC 4 now delivers error messages using Unicode quote characters like ‘these’ rather than `these'. There are 131 threads found by Google Groups for (UnicodeEncodeError OR UnicodeDecodeError) and 21 of these were in this June. A large proportion of the threads are in language-specific groups so are not as visible to core developers.

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jul 09 2005)

...

...
...
Python/Zope Consulting and Support ... http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

Neil Hodgson

1:04 p.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

M.-A. Lemburg:

...

I don't really buy this "trick": what if you happen to have a home directory with Unicode characters in it ?

Most people choose account names and thus home directory names that are compatible with their preferred locale settings: German users are unlikely to choose an account name that uses Japanese characters. Unicode is only necessary for file names that are outside your default locale. An administration utility may need to visit multiple users' home directories and so is more likely to encounter files with names that cannot be represented in its default locale.

I think it would be better if sys.path could include unicode entries but expect the code will rarely be exercised.

Neil

M.-A. Lemburg

5:43 p.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

Neil Hodgson wrote:

...

M.-A. Lemburg:

...
I don't really buy this "trick": what if you happen to have a home directory with Unicode characters in it ?

Most people choose account names and thus home directory names that are compatible with their preferred locale settings: German users are unlikely to choose an account name that uses Japanese characters.

It's naive to assume that all people in Germany using the German locale have German names ;-) E.g. we have a large Japanese community living here in Düsseldorf. If that example does not convince you, just have a look at all the Chinese restaurants in cities around the world - I'm sure that quite a few of the owners will want to use their correctly written name as account name. Unicode makes this possible and while it may not be in wide-spread use nowadays, things will definitely change over the next few years as more and more OSes and platforms will introduce native Unicode support.

...

Unicode is only necessary for file names that are outside your default locale. An administration utility may need to visit multiple users' home directories and so is more likely to encounter files with names that cannot be represented in its default locale.

I'm not sure why you bring up an administration tool: isn't the discussion about being able to load Python modules from directories with Unicode path components ?

...

I think it would be better if sys.path could include unicode entries but expect the code will rarely be exercised.

I think that sys.path should always use Unicode for non-ASCII path names - this would make it locale setting independent, which is what we should strive for in Py3k: locales are much easier to handle at the application level and only introduce portability problems if used at the OS or C lib level. -- Marc-Andre Lemburg eGenix.com Professional Python Software directly from the Source


Neil Hodgson

1:55 p.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

M.-A. Lemburg:

...

It's naive to assume that all people in Germany using the German locale have German names ;-)

That is not an assumption I would make. The assumption I would make is that if it is important to you to have your account name in a particular character set then you will normally set your locale to enable easy use of that account.

...

I'm not sure why you bring up an administration tool: isn't the discussion about being able to load Python modules from directories with Unicode path components ?

The discussion has moved between various aspects of unicode support in Python. There are many areas of the Python library which are not compatible with unicode and having an idea of the incidence of particular situations helps define where effort is most effectively spent. My experience has been that because of the way Windows handles character set conversions, problems are less common on individuals' machines than they are on servers.

Neil

Guido van Rossum

6:21 p.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

On 7/9/05, Neil Hodgson <nyamatongwe@gmail.com> wrote:

...

M.-A. Lemburg:

...
I don't really buy this "trick": what if you happen to have a home directory with Unicode characters in it ?

Most people choose account names and thus home directory names that are compatible with their preferred locale settings: German users are unlikely to choose an account name that uses Japanese characters. Unicode is only necessary for file names that are outside your default locale. An administration utility may need to visit multiple users' home directories and so is more likely to encounter files with names that cannot be represented in its default locale.

I think it would be better if sys.path could include unicode entries but expect the code will rarely be exercised.

Another problem is that if you can return 8-bit strings encoded in the local code page, and also Unicode, combining the two using string operations (e.g. a directory using the local code page containing a file using Unicode, and then combining the two using os.path.join()) will fail unless the local code page is also Python's global default encoding (which it usually isn't -- we really try hard to keep the default encoding 'ascii' at all times).

In some sense the safest approach from this POV would be to return Unicode as soon as it can't be encoded using the global default encoding. IOW normally this would return Unicode for all names containing non-ASCII characters.

The problem is of course that while the I/O functions will handle this fine, *printing* Unicode still doesn't work by default. :-( I can't wait until we switch everything to Unicode and have encoding on all streams...

-- --Guido van Rossum (home page: http://www.python.org/~guido/)
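The failure mode described above can be reproduced in a small sketch. It is written for present-day Python 3, where byte strings and text never mix implicitly, so the helper `join_like_python2` (a hypothetical name, not a library function) performs explicitly the decode that Python 2 did implicitly with its default encoding. The cp1252 code page and the sample names are assumptions chosen for illustration.

```python
def join_like_python2(directory: bytes, name: str,
                      default_encoding: str = "ascii") -> str:
    """Imitate Python 2 joining an 8-bit directory name with a Unicode name.

    Python 2 decoded the byte string with the global default encoding
    before concatenating; this helper makes that step explicit.
    """
    return directory.decode(default_encoding) + "\\" + name


# A directory name as the local code page (assumed cp1252) would store it.
local_dir = "caf\u00e9".encode("cp1252")   # b'caf\xe9'
unicode_name = "\u5b66\u6821.txt"

try:
    join_like_python2(local_dir, unicode_name)
    joined_with_ascii_default = True
except UnicodeDecodeError:
    # 0xE9 is not ASCII, so the decode with the 'ascii' default fails.
    joined_with_ascii_default = False

# The join only succeeds when the default encoding equals the code page.
joined = join_like_python2(local_dir, unicode_name, default_encoding="cp1252")
```

With the default left at 'ascii' the join fails; it only works once the default encoding is made to match the code page, which is exactly the configuration the core developers discourage.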

Neil Hodgson

1:55 p.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

Guido van Rossum:

...

In some sense the safest approach from this POV would be to return Unicode as soon as it can't be encoded using the global default encoding. IOW normally this would return Unicode for all names containing non-ASCII characters.

On unicode versions of Windows, for attributes like os.listdir, os.getcwd, sys.argv, and os.environ, which can usefully return unicode strings, there are 4 options I see:

1) Always return unicode. This is the option I'd be happiest to use, myself, but expect this choice would change the behaviour of existing code too much and so produce much unhappiness.

2) Return unicode when the text can not be represented in ASCII. This will cause a change of behaviour for existing code which deals with non-ASCII data.

3) Return unicode when the text can not be represented in the default code page. While this change can lead to breakage because of combining byte string and unicode strings, it is reasonably safe from the point of view of data integrity as current code is returning garbage strings that look like '?????'.

4) Provide two versions of the attribute, one with the current name returning byte strings and a second with a "u" suffix returning unicode. This is the least intrusive, requiring explicit changes to code to receive unicode data. For patch #1231336 I chose this approach producing sys.argvu and os.environu.

For os.listdir the current behaviour of returning unicode when its argument is unicode can be retained but that is not extensible to, for example, sys.argv.

Since this issue may affect many attributes a common approach should be chosen.

For experimenting with os.listdir, there is a patch for posixmodule.c at http://www.scintilla.org/difft.txt which implements (2). To specify the US-ASCII code page, the number 20127 is used as there is no definition for this in the system headers. To change to (3) comment out the line with 20127 and uncomment the line with CP_ACP. Unicode arguments produce unicode results.

Neil
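The difference between options (2) and (3) can be sketched in a few lines of modern Python 3. This is only an illustration of the selection rule, not the posixmodule.c patch: `narrow_or_unicode` is a hypothetical helper, and cp1252 stands in for whatever CP_ACP happens to be on a given machine.

```python
from typing import List, Union


def narrow_or_unicode(name: str, narrow_encoding: str = "ascii") -> Union[bytes, str]:
    """Return a byte string when `name` fits `narrow_encoding`, else the text itself."""
    try:
        return name.encode(narrow_encoding)
    except UnicodeEncodeError:
        # The name cannot be represented narrowly, so keep it as unicode.
        return name


names: List[str] = ["readme.txt", "\u20ac.txt", "\u5b66\u6821.txt"]

# Option (2): promote every name that is not pure ASCII.
option2 = [narrow_or_unicode(n, "ascii") for n in names]

# Option (3): promote only names outside the code page (cp1252 assumed).
option3 = [narrow_or_unicode(n, "cp1252") for n in names]
```

Under option (2) the Euro name stays unicode; under option (3) it becomes the code-page byte string b'\x80.txt', and only the Japanese name is promoted.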

M.-A. Lemburg

2:29 p.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

Neil Hodgson wrote:

...

On unicode versions of Windows, for attributes like os.listdir, os.getcwd, sys.argv, and os.environ, which can usefully return unicode strings, there are 4 options I see:

1) Always return unicode. This is the option I'd be happiest to use, myself, but expect this choice would change the behaviour of existing code too much and so produce much unhappiness.

Would be nice, but will likely break too much code - if you let Unicode objects enter non-Unicode aware code, it is likely that you'll end up getting stuck in tons of UnicodeErrors. If you want to get a feeling for this, try running Python with the -U command line switch.

...

2) Return unicode when the text can not be represented in ASCII. This will cause a change of behaviour for existing code which deals with non-ASCII data.

+1 on this one (s/ASCII/Python's default encoding).

...

3) Return unicode when the text can not be represented in the default code page. While this change can lead to breakage because of combining byte string and unicode strings, it is reasonably safe from the point of view of data integrity as current code is returning garbage strings that look like '?????'.

-1: code pages are evil and the reason why Unicode was invented in the first place. This would be a step back in history.

...

4) Provide two versions of the attribute, one with the current name returning byte strings and a second with a "u" suffix returning unicode. This is the least intrusive, requiring explicit changes to code to receive unicode data. For patch #1231336 I chose this approach producing sys.argvu and os.environu.

-1 - this is what Microsoft did for many of their APIs. The result is two parallel universes with two sets of features, bugs, documentation, etc.

...

For os.listdir the current behaviour of returning unicode when its argument is unicode can be retained but that is not extensible to, for example, sys.argv.

I don't think that using the parameter type as a "parameter" to the function is a good idea. However, accepting both strings and Unicode will make it easier to maintain backwards compatibility.

...

Since this issue may affect many attributes a common approach should be chosen.

Indeed. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jul 11 2005)


Guido van Rossum

4:06 p.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

I'm in full agreement with Marc-Andre below, except I don't like (1) at all -- having used other APIs that always return Unicode (like the Python XML parsers) it bothers me to get Unicode for no reason at all. OTOH I think Python 3.0 should be using a Unicode model closer to Java's.

On 7/11/05, M.-A. Lemburg <mal@egenix.com> wrote:

...

Neil Hodgson wrote:

...
On unicode versions of Windows, for attributes like os.listdir, os.getcwd, sys.argv, and os.environ, which can usefully return unicode strings, there are 4 options I see:

1) Always return unicode. This is the option I'd be happiest to use, myself, but expect this choice would change the behaviour of existing code too much and so produce much unhappiness.

Would be nice, but will likely break too much code - if you let Unicode objects enter non-Unicode aware code, it is likely that you'll end up getting stuck in tons of UnicodeErrors. If you want to get a feeling for this, try running Python with the -U command line switch.

...
2) Return unicode when the text can not be represented in ASCII. This will cause a change of behaviour for existing code which deals with non-ASCII data.

+1 on this one (s/ASCII/Python's default encoding).

...
3) Return unicode when the text can not be represented in the default code page. While this change can lead to breakage because of combining byte string and unicode strings, it is reasonably safe from the point of view of data integrity as current code is returning garbage strings that look like '?????'.

-1: code pages are evil and the reason why Unicode was invented in the first place. This would be a step back in history.

...
4) Provide two versions of the attribute, one with the current name returning byte strings and a second with a "u" suffix returning unicode. This is the least intrusive, requiring explicit changes to code to receive unicode data. For patch #1231336 I chose this approach producing sys.argvu and os.environu.

-1 - this is what Microsoft did for many of their APIs. The result is two parallel universes with two sets of features, bugs, documentation, etc.

...
For os.listdir the current behaviour of returning unicode when its argument is unicode can be retained but that is not extensible to, for example, sys.argv.

I don't think that using the parameter type as a "parameter" to the function is a good idea. However, accepting both strings and Unicode will make it easier to maintain backwards compatibility.

...
Since this issue may affect many attributes a common approach should be chosen.

Indeed.

-- Marc-Andre Lemburg eGenix.com

Professional Python Services directly from the Source (#1, Jul 11 2005)


-- --Guido van Rossum (home page: http://www.python.org/~guido/)

Neil Hodgson

6:53 a.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

M.-A. Lemburg:

...

...
2) Return unicode when the text can not be represented in ASCII. This will cause a change of behaviour for existing code which deals with non-ASCII data.

+1 on this one (s/ASCII/Python's default encoding).

I assume you mean the result of sys.getdefaultencoding() here. Unless much of the Python library is modified to use the default encoding, this will break. The problem is that different implicit encodings are being used for reading data and for accessing files. When calling a function, such as open, with a byte string, Python passes that byte string through to Windows which interprets it as being encoded in CP_ACP. When this differs from sys.getdefaultencoding() there will be a mismatch.

Say I have been working on a machine set up for Australian English (or other Western European locale) but am working with Russian data so have set Python's default encoding to cp1251. With this simple script, g.py:

    import sys
    print file(sys.argv[1]).read()

I process a file called '€.txt' with contents "European Euro" to produce

    C:\zed>python_d g.py €.txt
    European Euro

With the proposed modification, sys.argv[1] u'\u20ac.txt' is converted through cp1251 to '\x88.txt' as the Euro is located at 0x88 in CP1251. The operating system is then asked to open '\x88.txt' which it interprets through CP_ACP to be u'\u02c6.txt' ('ˆ.txt') which then fails. If you are very unlucky there will be a file called 'ˆ.txt' so the call will succeed and produce bad data.

Simulating with str(sys.argvu[1]):

    C:\zed>python_d g.py €.txt
    Traceback (most recent call last):
      File "g.py", line 2, in ?
        print file(str(sys.argvu[1])).read()
    IOError: [Errno 2] No such file or directory: '\x88.txt'
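The two conversion steps in this example can be checked directly with the codecs shipped in present-day Python 3. The sketch below only replays the byte-level arithmetic; it assumes cp1252 as the CP_ACP code page of the Western European machine and does not touch the file system.

```python
argument = "\u20ac.txt"   # the name typed on the command line

# Step 1: the proposed sys.argv would hold the name encoded with
# Python's default encoding, which the user has set to cp1251.
as_bytes = argument.encode("cp1251")

# Step 2: Windows interprets that byte string through CP_ACP
# (cp1252 is assumed here for a Western European locale).
seen_by_os = as_bytes.decode("cp1252")

# The name the OS looks for is no longer the name the user typed.
mismatch = seen_by_os != argument
```

Here `as_bytes` is b'\x88.txt' and `seen_by_os` is the modifier-circumflex name u'\u02c6.txt', matching the IOError shown in the traceback.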

...

-1: code pages are evil and the reason why Unicode was invented in the first place. This would be a step back in history.

Features used to specify files (sys.argv, os.environ, ...) should match functions used to open and perform other operations with files as they do currently. This means their encodings should match. Neil

M.-A. Lemburg

8:37 a.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

Hi Neil,

...

...
...
2) Return unicode when the text can not be represented in ASCII. This will cause a change of behaviour for existing code which deals with non-ASCII data.

+1 on this one (s/ASCII/Python's default encoding).

I assume you mean the result of sys.getdefaultencoding() here.

Yes. The default encoding is the encoding that Python assumes when auto-converting a string to Unicode. It is normally set to ASCII, but a user may want to use a different encoding. However, we've always made it very clear that the user is on his own when changing the ASCII default to something else.

...

Unless much of the Python library is modified to use the default encoding, this will break. The problem is that different implicit encodings are being used for reading data and for accessing files. When calling a function, such as open, with a byte string, Python passes that byte string through to Windows which interprets it as being encoded in CP_ACP. When this differs from sys.getdefaultencoding() there will be a mismatch.

As I said: code pages are evil :-)

...

Say I have been working on a machine set up for Australian English (or other Western European locale) but am working with Russian data so have set Python's default encoding to cp1251. With this simple script, g.py:

    import sys
    print file(sys.argv[1]).read()

I process a file called '€.txt' with contents "European Euro" to produce

    C:\zed>python_d g.py €.txt
    European Euro

With the proposed modification, sys.argv[1] u'\u20ac.txt' is converted through cp1251

Actually, it is not: if you pass in a Unicode argument to one of the file I/O functions and the OS supports Unicode directly or at least provides the notion of a file system encoding, then the file I/O should use the Unicode APIs of the OS or convert the Unicode argument to the file system encoding. AFAIK, this is how posixmodule.c already works (more or less).

I was suggesting that OS filename output APIs such as os.listdir() should return strings, if the filename matches the default encoding, and Unicode, if not. On input, file I/O APIs should accept both strings using the default encoding and Unicode. How these inputs are then converted to suit the OS is up to the OS abstraction layer, e.g. posixmodule.c.

Note that the posixmodule currently does not recode string arguments: it simply passes them to the OS as-is, assuming that they are already encoded using the file system encoding. Changing this is easy, though: instead of using the "et" getargs format specifier, you'd have to use "es". The latter recodes strings based on the default encoding assumption to whatever other encoding you specify.

...

to '\x88.txt' as the Euro is located at 0x88 in CP1251. The operating system is then asked to open '\x88.txt' which it interprets through CP_ACP to be u'\u02c6.txt' ('ˆ.txt') which then fails. If you are very unlucky there will be a file called 'ˆ.txt' so the call will succeed and produce bad data.

Simulating with str(sys.argvu[1]):

    C:\zed>python_d g.py €.txt
    Traceback (most recent call last):
      File "g.py", line 2, in ?
        print file(str(sys.argvu[1])).read()
    IOError: [Errno 2] No such file or directory: '\x88.txt'

See above: this is what I'd consider a bug in posixmodule.c

...

...
-1: code pages are evil and the reason why Unicode was invented in the first place. This would be a step back in history.

Features used to specify files (sys.argv, os.environ, ...) should match functions used to open and perform other operations with files as they do currently. This means their encodings should match.

Right. However, most of these APIs currently either don't make any assumption about the strings' contents and simply pass them around, or they assume that these strings use the file system encoding - which, like in the example you gave above, can be different from the default encoding.

To untie this Gordian Knot, we should use strings and Unicode like they are supposed to be used (in the context of text data):

* strings are fine for text data that is encoded using the default encoding
* Unicode should be used for all text data that is not or cannot be encoded in the default encoding

Later on in Py3k, all text data should be stored in Unicode and all binary data in some new binary type.

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jul 12 2005)


Neil Hodgson

12:57 a.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

Hi Marc-Andre,

...

...
With the proposed modification, sys.argv[1] u'\u20ac.txt' is converted through cp1251

Actually, it is not: if you pass in a Unicode argument to one of the file I/O functions and the OS supports Unicode directly or at least provides the notion of a file system encoding, then the file I/O should use the Unicode APIs of the OS or convert the Unicode argument to the file system encoding. AFAIK, this is how posixmodule.c already works (more or less).

Yes it is. The initial stage is reading the command line arguments. The proposed modification is to change behaviour when constructing sys.argv, os.environ or when calling os.listdir to "Return unicode when the text can not be represented in Python's default encoding". I take this to mean that when the value can be represented in Python's default encoding then it is returned as a byte string in the default encoding. Therefore, for the example, the code that sets up sys.argv has to encode the unicode command line argument into cp1251.

...

On input, file I/O APIs should accept both strings using the default encoding and Unicode. How these inputs are then converted to suit the OS is up to the OS abstraction layer, e.g. posixmodule.c.

This looks to me to be insufficiently compatible with current behaviour which accepts byte strings outside the default encoding. Existing code may call open("€.txt"). This is perfectly legitimate current Python (with a coding declaration) as "€.txt" is a byte string and file systems will accept byte string names. Since the standard default encoding is ASCII, should such code raise UnicodeDecodeError?

...

Changing this is easy, though: instead of using the "et" getargs format specifier, you'd have to use "es". The latter recodes strings based on the default encoding assumption to whatever other encoding you specify.

Don't you want to convert these into unicode rather than another byte string encoding? It looks to me as though the "es" format always produces byte strings and the only byte string format that can be passed to the operating system is the file system encoding which may not contain all the characters in the default encoding. Neil

M.-A. Lemburg

11:04 a.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

Hi Neil,

...

...
...
With the proposed modification, sys.argv[1] u'\u20ac.txt' is converted through cp1251

Actually, it is not: if you pass in a Unicode argument to one of the file I/O functions and the OS supports Unicode directly or at least provides the notion of a file system encoding, then the file I/O should use the Unicode APIs of the OS or convert the Unicode argument to the file system encoding. AFAIK, this is how posixmodule.c already works (more or less).

Yes it is. The initial stage is reading the command line arguments. The proposed modification is to change behaviour when constructing sys.argv, os.environ or when calling os.listdir to "Return unicode when the text can not be represented in Python's default encoding". I take this to mean that when the value can be represented in Python's default encoding then it is returned as a byte string in the default encoding.

Therefore, for the example, the code that sets up sys.argv has to encode the unicode command line argument into cp1251.

Ok, I missed your point about sys.argv *not* returning Unicode in this particular case. However, with the modification of having posixmodule and fileobject recode string input via Unicode (based on the default encoding) into the file system encoding by basically just changing the parser marker from "et" to "es", you get correct behaviour - even in the above case. Both posixmodule and fileobject would then take the cp1251 default encoded string, convert it to Unicode and then to the file system encoding before opening the file.
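The recoding described here (default-encoded byte string, then Unicode, then file system encoding) can be sketched in present-day Python 3. `recode_for_filesystem` is a hypothetical helper that mirrors the idea only; it is not the getargs machinery, and cp1251 and cp1252 are the encodings assumed from the running example.

```python
def recode_for_filesystem(name: bytes, default_encoding: str,
                          fs_encoding: str) -> bytes:
    """Recode a byte-string file name via Unicode into the file system encoding."""
    # Interpret the bytes using Python's default encoding ...
    text = name.decode(default_encoding)
    # ... then encode the resulting text for the operating system.
    # This raises UnicodeEncodeError if fs_encoding lacks a character.
    return text.encode(fs_encoding)


# sys.argv entry from the running example: the Euro sign in cp1251.
recoded = recode_for_filesystem(b"\x88.txt", "cp1251", "cp1252")
```

The recoded name is b'\x80.txt', which the OS reads back through cp1252 as '€.txt', so the file from the example would be found. The encode step can still fail when the default encoding contains characters that the file system encoding lacks.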

...

...
On input, file I/O APIs should accept both strings using the default encoding and Unicode. How these inputs are then converted to suit the OS is up to the OS abstraction layer, e.g. posixmodule.c.

This looks to me to be insufficiently compatible with current behaviour which accepts byte strings outside the default encoding. Existing code may call open("€.txt"). This is perfectly legitimate current Python (with a coding declaration) as "€.txt" is a byte string and file systems will accept byte string names. Since the standard default encoding is ASCII, should such code raise UnicodeDecodeError?

Yes. The above proposed change is indeed more restrictive than the current pass-through approach. I'm not sure whether we can impose such a change on the users in the 2.x series... perhaps we should have a two phase approach:

Phase 1: try "es" and if this fails with a UnicodeDecodeError, revert back to the old "et" pass-through approach, issuing a warning as a non-disruptive signal to the user

Phase 2: move to "es" for good and issue decode errors
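The two-phase idea can be sketched in plain Python. This is only an illustration written for a modern Python 3 interpreter, where byte strings are `bytes`; the function name `recode_filename` and its parameters are hypothetical and do not correspond to anything in posixmodule.c or the getargs machinery.

```python
import warnings


def recode_filename(raw, default_encoding="ascii", fs_encoding="utf-8"):
    """Recode a byte-string file name from the default encoding to the
    file system encoding, falling back to pass-through (phase 1).

    Hypothetical helper for illustration only.
    """
    try:
        # Interpret the bytes using the default encoding ...
        text = raw.decode(default_encoding)
    except UnicodeDecodeError:
        # Phase 1: warn and hand the bytes to the OS unchanged.
        # Phase 2 would let the UnicodeDecodeError propagate instead.
        warnings.warn(
            "file name is not valid in the default encoding; "
            "passing it through unchanged",
            UnicodeWarning,
            stacklevel=2,
        )
        return raw
    # ... then encode the text for the file system. An encoding error
    # here is not caught and propagates to the caller.
    return text.encode(fs_encoding)
```

With a cp1251 default encoding the name is recoded; with an ASCII default, a non-ASCII byte string such as the "€.txt" example triggers the warning and is returned untouched.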

...

...
Changing this is easy, though: instead of using the "et" getargs format specifier, you'd have to use "es". The latter recodes strings based on the default encoding assumption to whatever other encoding you specify.

Don't you want to convert these into unicode rather than another byte string encoding? It looks to me as though the "es" format always produces byte strings and the only byte string format that can be passed to the operating system is the file system encoding which may not contain all the characters in the default encoding.

If the OS supports Unicode directly, we can (and do) have a special case that bypasses the recoding altogether. However, this currently only appears to be available on Windows versions NT, XP and up, where we already support this.

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jul 14 2005)

...

...
...
Python/Zope Consulting and Support ... http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

"Martin v. Löwis"

5:30 a.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

Guido van Rossum wrote:

...

Ah, sigh. I didn't know that os.listdir() behaves differently when the argument is Unicode. Does os.listdir(".") really behave differently than os.listdir(u".")? Bah! I don't think that's a very good design (although I see where it comes from). Promoting only those entries that need it seems the right solution

Unfortunately, this solution is hard to implement (I don't know whether it is implementable at all correctly; at least on Windows, I see no way to implement it efficiently). Here are a number of problems/questions:

- On Windows, should listdir use the narrow or the wide API? Obviously the wide API, since it is not Python which returns the question marks, but the Windows API.
- But then, the wide API gives all results as Unicode. If you want to promote only those entries that need it, it really means that you only want to "demote" those that don't need it. But how can you tell whether an entry needs it? There is no API to find out. You could declare that anything with characters >128 needs it, but that would be an incompatible change: If a character >128 in the system code page is in a file name, listdir currently returns it in the system code page. It then would return a Unicode string. Applications relying on the old behaviour would break.
- On Unix, all file names come out as byte strings. Again, how do you know which ones to promote, and using what encoding? Python currently guesses an encoding, but that may or may not be the one intended for the file name.

So the general "Bah!" doesn't really help much: when it comes to a specific algorithm to implement, the options are scarce.

Regards, Martin

M.-A. Lemburg

10:21 a.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

Martin v. Löwis wrote:

...

Guido van Rossum wrote:

...
Ah, sigh. I didn't know that os.listdir() behaves differently when the argument is Unicode. Does os.listdir(".") really behave differently than os.listdir(u".")? Bah! I don't think that's a very good design (although I see where it comes from). Promoting only those entries that need it seems the right solution

Unfortunately, this solution is hard to implement (I don't know whether it is implementable at all correctly; at least on Windows, I see no way to implement it efficiently).

Here are a number of problems/questions: - On Windows, should listdir use the narrow or the wide API? Obviously the wide API, since it is not Python which returns the question marks, but the Windows API.

Right.

...

- But then, the wide API gives all results as Unicode. If you want to promote only those entries that need it, it really means that you only want to "demote" those that don't need it. But how can you tell whether an entry needs it? There is no API to find out. You could declare that anything with characters >128 needs it, but that would be an incompatible change: If a character >128 in the system code page is in a file name, listdir currently returns it in the system code page. It then would return a Unicode string. Applications relying on the old behaviour would break.

We will need a Python C API that returns:

* a string if the Unicode value is representable in the default encoding (usually ASCII)
* Unicode if it is not

The file system encoding should be hidden in the OS layer (e.g. posixmodule). Python should only return strings with the default encoding and Unicode otherwise. See my suggestion to Neil about making the transition to this new strategy less painful.

...

- On Unix, all file names come out as byte strings. Again, how do you know which ones to promote, and using what encoding? Python currently guesses an encoding, but that may or may not be the one intended for the file name.

This is a tough one: AFAIK the file system encoding in Unix was never really specified; in fact most file systems just stored the names as-is without any encoding information attached to them. Things are moving in the direction of using UTF-8 for filenames, though. To solve this issue, various applications have come up with ways around the problem, e.g. GTK uses the following strategy to find the encoding (in the given order and adjustable using an environment variable):

1. locale based encoding, if given (UTF-8 on most modern Unixes)
2. UTF-8
3. Latin-1
4. CP1252 (Windows Latin-1 version)

Perhaps we should add similar support to Python? We should probably use a file system encoding default of Latin-1 on Unix if no other information can be found. That way we will ensure that things don't change on Unix unless explicitly set up by the user (Latin-1 is round-trip safe when converting it to Unicode and back). os.listdir() would then continue to return plain strings and file() will open them just as it does now.

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jul 15 2005)
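A rough sketch of such a fallback chain, written for Python 3 where file names from the OS can be handled as `bytes`. The names `decode_filename` and `latin1_round_trips` are hypothetical; this mirrors only the ordering described above and is not GTK's or Python's actual implementation.

```python
def decode_filename(raw, candidates=("utf-8", "latin-1", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes `raw`.

    Hypothetical helper. Python's latin-1 codec accepts every byte
    value, so with this default order the cp1252 entry is never
    reached; it is listed only to mirror the order described above.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding could decode %r" % (raw,))


def latin1_round_trips(raw):
    """Latin-1 maps each byte to exactly one code point and back."""
    return raw.decode("latin-1").encode("latin-1") == raw
```

A caller wanting the locale-based step could prepend `locale.getpreferredencoding(False)` to `candidates`. The second function illustrates why Latin-1 is a safe last resort: decoding and re-encoding returns the original bytes for any input.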


Neil Hodgson

7:30 a.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

Martin v. Löwis:

...

- But then, the wide API gives all results as Unicode. If you want to promote only those entries that need it, it really means that you only want to "demote" those that don't need it. But how can you tell whether an entry needs it? There is no API to find out.

I wrote a patch for os.listdir at http://www.scintilla.org/difft.txt that uses WideCharToMultiByte to check if a wide name can be represented in a particular code page and only uses that representation if it fits. This is good for Windows code pages including ASCII and "mbcs" but since Python's sys.getdefaultencoding() can be something that has no code page equivalent, it would have to try converting using strict mode and interpret failure as leaving the name as unicode.
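The "strict mode" fallback mentioned at the end can be sketched as follows. This is a Python 3 illustration with a hypothetical function name, not the code from the patch: it returns a byte string when the name fits the target encoding and leaves the name as text otherwise.

```python
import sys


def demote_if_representable(name, encoding=None):
    """Return `name` encoded as bytes if `encoding` can represent it
    exactly, otherwise return the original text unchanged.

    Hypothetical helper for illustration only.
    """
    if encoding is None:
        # Note: on Python 3 this is 'utf-8', unlike the ASCII default
        # discussed in this thread, so pass "ascii" to reproduce it.
        encoding = sys.getdefaultencoding()
    try:
        return name.encode(encoding, "strict")
    except UnicodeEncodeError:
        # Failure means the name stays unicode.
        return name
```

For example, `demote_if_representable("spam.txt", "ascii")` yields bytes, while a name containing "ä" is returned as text under ASCII but as bytes under cp1252.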

...

You could declare that anything with characters >128 needs it, but that would be an incompatible change: If a character >128 in the system code page is in a file name, listdir currently returns it in the system code page. It then would return a Unicode string.

I now quite like returning unicode for anything non-ASCII on Windows as there is no ambiguity in what the result means and there will be no need to change all the system calls to translate from the default encoding. It is a change to the API which can lead to code breaking but it should break with an exception. Assuming that byte string arguments are using Python's default encoding looks more dangerous with a behavioural change but no notification. Neil

"Martin v. Löwis"

7:41 a.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

Neil Hodgson wrote:

...

...
- But then, the wide API gives all results as Unicode. If you want to promote only those entries that need it, it really means that you only want to "demote" those that don't need it. But how can you tell whether an entry needs it? There is no API to find out.

I wrote a patch for os.listdir at http://www.scintilla.org/difft.txt that uses WideCharToMultiByte to check if a wide name can be represented in a particular code page and only uses that representation if it fits.

This appears to be based on the usedDefault return value of WideCharToMultiByte. I believe this is insufficient: WideCharToMultiByte might convert Unicode characters to codepage characters in a lossy way, without using the default character. For example, it converts U+0308 (combining diaeresis) to U+00A8 (diaeresis) (or something like that, I forgot the exact details). So if you have, say, "p-umlaut" (i.e. U+0070 U+0308), it converts it to U+0070 U+00A8 (in the local code page). Trying to use this as a filename later fails. Regards, Martin

Neil Hodgson

8:26 a.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

Martin v. Löwis:

...

This appears to be based on the usedDefault return value of WideCharToMultiByte. I believe this is insufficient: WideCharToMultiByte might convert Unicode characters to codepage characters in a lossy way, without using the default character. For example, it converts U+0308 (combining diaeresis) to U+00A8 (diaeresis) (or something like that, I forgot the exact details). So if you have, say, "p-umlaut" (i.e. U+0070 U+0308), it converts it to U+0070 U+00A8 (in the local code page). Trying to use this as a filename later fails.

There is WC_NO_BEST_FIT_CHARS to defeat that. It says that it will use the default character if the translation can't be round-tripped. Available on Windows 2000 and XP but not NT4. We could compare the original against the round-tripped as described at http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicod... Neil
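The "compare the original against the round-tripped" check can be illustrated in Python 3. This is only a sketch of the comparison itself: Python's codecs do not perform Windows-style best-fit mapping, so the "replace" error handler stands in for a lossy conversion here, and the function name is hypothetical.

```python
def survives_round_trip(name, encoding):
    """Return True if encoding `name` and decoding it again gives back
    exactly the original text.

    Hypothetical helper; "replace" substitutes '?' for characters the
    code page cannot represent, imitating a lossy conversion.
    """
    encoded = name.encode(encoding, "replace")
    return encoded.decode(encoding, "replace") == name
```

With cp1252, "späm.txt" survives the round trip, while "p" followed by U+0308 (combining diaeresis) does not, so such a name would have to stay unicode.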

"Martin v. Löwis"

5:06 p.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

Neil Hodgson wrote:

...

There is WC_NO_BEST_FIT_CHARS to defeat that. It says that it will use the default character if the translation can't be round-tripped. Available on Windows 2000 and XP but not NT4.

Ah, ok, that's a useful feature. Of course, limited availability of the feature means that we either need to drop support for some systems, or provide yet another layer of fallback routines. Regards, Martin

Trent Mick

June 2005

6:28 p.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

...

...
os.getcwd() returns a string, but path.getcwd() returns a new path object.

In that case, I'd expect it to be 'path.fromcwd()' or 'path.cwd()'; i.e. a constructor classmethod by analogy with 'dict.fromkeys()' or 'datetime.now()'. 'getcwd()' looks like it's getting a property of a path instance, and doesn't match stdlib conventions for constructors.

So, +1 as long as it's called cwd() or something better (i.e. clearer and/or more consistent with stdlib constructor conventions).

What about having it just be the default empty constructor?

    assert path.Path() == os.getcwd() \
        or path.Path() == os.getcwdu()

Dunno if that causes other weirdnesses with the API, though.

Trent

-- Trent Mick TrentM@ActiveState.com

Skip Montanaro

8:45 p.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

We're getting enough discussion about various aspects of Jason's path module that perhaps a PEP is warranted. All this discussion on python-dev is just going to get lost. Skip

Phillip J. Eby

9:25 p.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

At 03:45 PM 6/27/2005 -0500, Skip Montanaro wrote:

...

We're getting enough discussion about various aspects of Jason's path module that perhaps a PEP is warranted. All this discussion on python-dev is just going to get lost.

AFAICT, the only unresolved issue outstanding is a compromise or Pronouncement regarding the atime/ctime/mtime members' datatype. This is assuming, of course, that making the "empty path" be os.curdir doesn't receive any objections, and that nobody strongly prefers 'path.fromcwd()' over 'path.cwd()' as the alternate constructor name. Apart from these fairly minor issues, there is a very short to-do list, small enough to do an implementation patch in an evening or two. Documentation might take a similar amount of time after that; mostly it'll be copy-paste from the existing os.path docs, though. As for the open issues, if we can't reach some sane compromise about atime/ctime/mtime, I'd suggest just providing the stat() method and let people use stat().st_mtime et al. Alternately, I'd be okay with creating last_modified(), last_accessed(), and created_on() methods that return datetime objects, as long as there's also atime()/mtime()/ctime() methods that return timestamps.

Andrew Durdin

12:42 a.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

On 6/28/05, Phillip J. Eby <pje@telecommunity.com> wrote:

...

AFAICT, the only unresolved issue outstanding is a compromise or Pronouncement regarding the atime/ctime/mtime members' datatype. This is assuming, of course, that making the "empty path" be os.curdir doesn't receive any objections, and that nobody strongly prefers 'path.fromcwd()' over 'path.cwd()' as the alternate constructor name.

Apart from these fairly minor issues, there is a very short to-do list, small enough to do an implementation patch in an evening or two. Documentation might take a similar amount of time after that; mostly it'll be copy-paste from the existing os.path docs, though.

While we're discussing outstanding issues: In a related discussion of the path module on c.l.py, Thomas Heller pointed out that the path module doesn't correctly handle unicode paths:

| I have never used the path module before, although I've heard good
| things about it. But, it seems to have problems with unicode pathnames,
| at least on windows:
|
| C:\>mkdir späm
|
| C:\späm>py24
| Python 2.4.1 (#65, Mar 30 2005, 09:13:57) [MSC v.1310 32 bit (Intel)] on win32
| Type "help", "copyright", "credits" or "license" for more information.
| >>> from path import path
| >>> path.getcwd()
|
| Traceback (most recent call last):
|   File "<stdin>", line 1, in ?
|   File "C:\TSS5\components\_Pythonlib\path.py", line 97, in getcwd
|     return path(os.getcwd())
| UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 5: ordinal not in range(128)

He suggests a possible fix in his message:
http://mail.python.org/pipermail/python-list/2005-June/287372.html
http://groups-beta.google.com/group/comp.lang.python/msg/b3795a2a0c52b93f

Neil Hodgson

1:19 a.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

Andrew Durdin:

...

While we're discussing outstanding issues: In a related discussion of the path module on c.l.py, Thomas Heller pointed out that the path module doesn't correctly handle unicode paths: ...

Here is a patch that avoids failure when paths can not be represented in a single 8 bit encoding. It adds a _cwd variable in the initialisation and then calls this rather than os.getcwd. I sent the patch to Jason as well.

    _base = str
    _cwd = os.getcwd
    try:
        if os.path.supports_unicode_filenames:
            _base = unicode
            _cwd = os.getcwdu
    except AttributeError:
        pass

    #...

    def getcwd():
        """ Return the current working directory as a path object. """
        return path(_cwd())

Neil

Donovan Baarda

12:48 a.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

On Mon, 2005-06-27 at 14:25, Phillip J. Eby wrote: [...]

...

As for the open issues, if we can't reach some sane compromise about atime/ctime/mtime, I'd suggest just providing the stat() method and let people use stat().st_mtime et al. Alternately, I'd be okay with creating last_modified(), last_accessed(), and created_on() methods that return datetime objects, as long as there's also atime()/mtime()/ctime() methods that return timestamps.

+1 for atime/mtime/ctime being timestamps
-1 for redundant duplicates that return DateTimes
+1 for a stat() method (there are lots of other goodies in a stat).

-- Donovan Baarda <abo@minkirri.apana.org.au>

Just van Rossum

9:41 a.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

Phillip J. Eby wrote:

...

At 03:45 PM 6/27/2005 -0500, Skip Montanaro wrote:

...
We're getting enough discussion about various aspects of Jason's path module that perhaps a PEP is warranted. All this discussion on python-dev is just going to get lost.

AFAICT, the only unresolved issue outstanding is a compromise or Pronouncement regarding the atime/ctime/mtime members' datatype. This is assuming, of course, that making the "empty path" be os.curdir doesn't receive any objections, and that nobody strongly prefers 'path.fromcwd()' over 'path.cwd()' as the alternate constructor name.

Apart from these fairly minor issues, there is a very short to-do list, small enough to do an implementation patch in an evening or two. Documentation might take a similar amount of time after that; mostly it'll be copy-paste from the existing os.path docs, though.

As for the open issues, if we can't reach some sane compromise about atime/ctime/mtime, I'd suggest just providing the stat() method and let people use stat().st_mtime et al. Alternately, I'd be okay with creating last_modified(), last_accessed(), and created_on() methods that return datetime objects, as long as there's also atime()/mtime()/ctime() methods that return timestamps.

My issues with the 'path' module (basically recapping what I've said on the subject in the past):

- It inherits from str/unicode, so path objects have many str methods that make no sense for paths.
- On OSX, it inherits from str instead of unicode, due to http://python.org/sf/767645
- I don't like __div__ overloading for join().

Just

Michael Hoffman

7:24 p.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

On Mon, 27 Jun 2005, Phillip J. Eby wrote:

...

At 08:20 AM 6/27/2005 +0100, Michael Hoffman wrote:

...
os.getcwd() returns a string, but path.getcwd() returns a new path object.

In that case, I'd expect it to be 'path.fromcwd()' or 'path.cwd()'; i.e. a constructor classmethod by analogy with 'dict.fromkeys()' or 'datetime.now()'. 'getcwd()' looks like it's getting a property of a path instance, and doesn't match stdlib conventions for constructors.

So, +1 as long as it's called cwd() or something better (i.e. clearer and/or more consistent with stdlib constructor conventions).

+1 on cwd().

-1 on making this the default constructor. Essentially the default constructor returns a path object that will reflect the CWD at the time that further instance methods are called. path.cwd() will return a path object that reflects the path at the time of construction. This example may be instructive:

>>> from path import path
>>> import os
>>> default_path = path()
>>> getcwd_path = path.getcwd()
>>> default_path.abspath()
path('/home/hoffman')
>>> getcwd_path.abspath()
path('/home/hoffman')
>>> os.chdir("etc")
>>> default_path.abspath()
path('/home/hoffman/etc')
>>> getcwd_path.abspath()
path('/home/hoffman')

Unfortunately only some of the methods work on paths created with the default constructor:

>>> path().listdir()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.4/site-packages/path.py", line 297, in listdir
    names = os.listdir(self)
OSError: [Errno 2] No such file or directory: ''

Is there support to have all of the methods work when the path is the empty string? Among other benefits, this would mean that sys.path could be turned into useful path objects with a simple list comprehension. -- Michael Hoffman <hoffman@ebi.ac.uk> European Bioinformatics Institute

Phillip J. Eby

7:42 p.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

At 08:24 PM 6/27/2005 +0100, Michael Hoffman wrote:

...

On Mon, 27 Jun 2005, Phillip J. Eby wrote:

...
At 08:20 AM 6/27/2005 +0100, Michael Hoffman wrote:

...
os.getcwd() returns a string, but path.getcwd() returns a new path object.

In that case, I'd expect it to be 'path.fromcwd()' or 'path.cwd()'; i.e. a constructor classmethod by analogy with 'dict.fromkeys()' or 'datetime.now()'. 'getcwd()' looks like it's getting a property of a path instance, and doesn't match stdlib conventions for constructors.

So, +1 as long as it's called cwd() or something better (i.e. clearer and/or more consistent with stdlib constructor conventions).

+1 on cwd().

-1 on making this the default constructor. Essentially the default constructor returns a path object that will reflect the CWD at the time that further instance methods are called.

Only if we make the default argument to path() be os.curdir, which isn't a bad idea.

...

Unfortunately only some of the methods work on paths created with the default constructor:

>>> path().listdir()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.4/site-packages/path.py", line 297, in listdir
    names = os.listdir(self)
OSError: [Errno 2] No such file or directory: ''

This wouldn't be a problem if the default constructor arg were os.curdir (i.e. '.' for most platforms) instead of an empty string.
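To make the two constructor behaviours concrete, here is a toy stand-in for the class under discussion, written for Python 3. It is a hypothetical sketch, not Jason Orendorff's module: the default argument is assumed to be os.curdir as suggested above, and cwd() is the alternate constructor name proposed in this thread.

```python
import os


class path(str):
    """Toy str-based path class (illustration only)."""

    def __new__(cls, value=os.curdir):
        # Default to '.' rather than the empty string, so that
        # methods such as listdir() work on path().
        return super().__new__(cls, value)

    @classmethod
    def cwd(cls):
        # Alternate constructor: a snapshot of the working directory
        # at the time of construction.
        return cls(os.getcwd())

    def abspath(self):
        return type(self)(os.path.abspath(self))

    def listdir(self):
        # Returns path objects joined onto self, not bare file names.
        return [type(self)(os.path.join(self, name))
                for name in os.listdir(self)]
```

Here path() stays relative, so abspath() follows later os.chdir() calls, while path.cwd() keeps pointing at the directory that was current when it was created.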

...

Is there support to have all of the methods work when the path is the empty string? Among other benefits, this would mean that sys.path could be turned into useful path objects with a simple list comprehension.

Ugh. sys.path entries are not path objects, nor should they be. PEP 302 (implemented in Python 2.3 and up) allows sys.path to contain any strings you like, as interpreted by objects in sys.path_hooks. Programs that assume only filesystem paths appear in sys.path will break in the presence of PEP 302-sanctioned import hooks.

Skip Montanaro

7:31 p.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

Phillip> It has many ways to do the same thing, and many of its property
Phillip> and method names are confusing because they either do the same
Phillip> thing as a standard function, but have a different name (like
Phillip> the 'parent' property that is os.path.dirname in disguise), or
Phillip> they have the same name as a standard function but do something
Phillip> different (like the 'listdir()' method that returns full paths
Phillip> rather than just filenames).

To the extent that the path module tries to provide a uniform abstraction that's not saddled with a particular way of doing things (e.g., the Unix way or the Windows way), I don't think this is necessarily a bad thing.

Skip

Phillip J. Eby

12:52 a.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

At 02:31 PM 6/26/2005 -0500, Skip Montanaro wrote:

...

Phillip> It has many ways to do the same thing, and many of its property
Phillip> and method names are confusing because they either do the same
Phillip> thing as a standard function, but have a different name (like
Phillip> the 'parent' property that is os.path.dirname in disguise), or
Phillip> they have the same name as a standard function but do something
Phillip> different (like the 'listdir()' method that returns full paths
Phillip> rather than just filenames).

To the extent that the path module tries to provide a uniform abstraction that's not saddled with a particular way of doing things (e.g., the Unix way or the Windows way), I don't think this is necessarily a bad thing.

I'm confused by your statements. First, I didn't notice the path module providing any OS-abstractions that aren't already provided by os.path. Second, using inconsistent and confusing names is pretty much always a bad thing. :)

Skip Montanaro

1:29 a.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

Phillip> ... but have a different name (like the 'parent' property that
Phillip> is os.path.dirname in disguise) ...

Phillip> ... (like the 'listdir()' method that returns full paths rather
Phillip> than just filenames).

Skip> To the extent that the path module tries to provide a uniform
Skip> abstraction that's not saddled with a particular way of doing
Skip> things (e.g., the Unix way or the Windows way), I don't think this
Skip> is necessarily a bad thing.

Phillip> I'm confused by your statements. First, I didn't notice the
Phillip> path module providing any OS-abstractions that aren't already
Phillip> provided by os.path. Second, using inconsistent and confusing
Phillip> names is pretty much always a bad thing. :)

Sorry, let me be more explicit. "dirname" is the Unix name for "return the parent of this path". In the Windows and Mac OS9 worlds (ignore any possible Posix compatibility for a moment), my guess would be it's probably something else. I suspect listdir gets its "return individual filenames instead of full paths" from the semantics of the Posix opendir/readdir/closedir functions. If it makes more sense to return strings that represent full paths or new path objects that have been absolute-ified, then the minor semantic change going from os.listdir() to the listdir method of Jason's path objects isn't a big problem to me.

Skip

Phillip J. Eby

3:03 a.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

At 08:29 PM 6/26/2005 -0500, Skip Montanaro wrote:

...

Phillip> ... but have a different name (like the 'parent' property that
Phillip> is os.path.dirname in disguise) ...

Phillip> ... (like the 'listdir()' method that returns full paths rather
Phillip> than just filenames).

Skip> To the extent that the path module tries to provide a uniform
Skip> abstraction that's not saddled with a particular way of doing
Skip> things (e.g., the Unix way or the Windows way), I don't think this
Skip> is necessarily a bad thing.

Phillip> I'm confused by your statements. First, I didn't notice the
Phillip> path module providing any OS-abstractions that aren't already
Phillip> provided by os.path. Second, using inconsistent and confusing
Phillip> names is pretty much always a bad thing. :)

Sorry, let me be more explicit. "dirname" is the Unix name for "return the parent of this path". In the Windows and Mac OS9 worlds (ignore any possible Posix compatibility for a moment), my guess would be it's probably something else. I suspect listdir gets its "return individual filenames instead of full paths" from the semantics of the Posix opendir/readdir/closedir functions. If it makes more sense to return strings that represent full paths or new path objects that have been absolute-ified, then the minor semantic change going from os.listdir() to the listdir method of Jason's path objects isn't a big problem to me.

The semantics aren't the issue; it's fine and indeed quite useful to have a method that returns path objects. I'm just saying it shouldn't be called listdir(), since that's confusing when compared to what the existing listdir() function does. If you look at my original post, you'll see I suggested it be called 'subpaths()' instead, to help reflect that it returns paths, rather than filenames.

Dörwald Walter

10:22 p.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

Phillip J. Eby wrote:

...

[...] I'm also not keen on the fact that it makes certain things properties whose value can change over time; i.e. ctime/mtime/atime and size really shouldn't be properties, but rather methods.

I think ctime, mtime and atime should be (or return) datetime.datetime objects instead of integer timestamps. Bye, Walter Dörwald

Phillip J. Eby

12:54 a.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

At 12:22 AM 6/27/2005 +0200, Dörwald Walter wrote:

...

Phillip J. Eby wrote:

...
[...] I'm also not keen on the fact that it makes certain things properties whose value can change over time; i.e. ctime/mtime/atime and size really shouldn't be properties, but rather methods.

I think ctime, mtime and atime should be (or return) datetime.datetime objects instead of integer timestamps.

With what timezone? I don't think that can be done portably and unambiguously, so I'm -1 on that.

Bob Ippolito

1:26 a.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

On Jun 26, 2005, at 8:54 PM, Phillip J. Eby wrote:

...

At 12:22 AM 6/27/2005 +0200, Dörwald Walter wrote:

...
Phillip J. Eby wrote:

...
[...] I'm also not keen on the fact that it makes certain things properties whose value can change over time; i.e. ctime/mtime/atime and size really shouldn't be properties, but rather methods.

I think ctime, mtime and atime should be (or return) datetime.datetime objects instead of integer timestamps.

With what timezone? I don't think that can be done portably and unambiguously, so I'm -1 on that.

That makes no sense, timestamps aren't any better, and datetime objects have no time zone set by default anyway. datetime.fromtimestamp(time.time()) gives you the same thing as datetime.now(). -bob
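A small Python 3 sketch of the conversion being described. The helper name `mtime_as_datetime` is hypothetical; `datetime.fromtimestamp()`, `datetime.now()` and `os.stat()` are the standard library calls.

```python
import os
import time
from datetime import datetime


def mtime_as_datetime(filename):
    """Convert a file's st_mtime timestamp to a naive local datetime.

    Hypothetical helper for illustration only.
    """
    return datetime.fromtimestamp(os.stat(filename).st_mtime)


stamp = time.time()
from_stamp = datetime.fromtimestamp(stamp)  # naive local time, tzinfo is None
now = datetime.now()                        # also naive local time
```

Both values are naive datetimes in local time, so they differ only by the moment at which each call was made; the same conversion applies to a timestamp taken from stat().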

Phillip J. Eby

3:09 a.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

At 09:26 PM 6/26/2005 -0400, Bob Ippolito wrote:

...

On Jun 26, 2005, at 8:54 PM, Phillip J. Eby wrote:

...
At 12:22 AM 6/27/2005 +0200, Dörwald Walter wrote:

...
Phillip J. Eby wrote:

...
I'm also not keen on the fact that it makes certain things properties whose value can change over time; i.e. ctime/mtime/atime and size really shouldn't be properties, but rather methods.

I think ctime, mtime and atime should be (or return) datetime.datetime objects instead of integer timestamps.

With what timezone? I don't think that can be done portably and unambiguously, so I'm -1 on that.

That makes no sense, timestamps aren't any better,

Sure they are, if what you want is a timestamp. In any case, the most common use case I've seen for mtime and friends is just comparing against a previous value, or the value on another file, so it doesn't actually matter most of the time what the type of the value is.

...

and datetime objects have no time zone set by default anyway. datetime.fromtimestamp(time.time()) gives you the same thing as datetime.now().

In which case, it's also easy enough to get a datetime if you really want one. I personally would rather do that than complicate the use cases where a datetime isn't really needed. (i.e. most of the time, at least in my experience)

Walter Dörwald

6:52 p.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

Phillip J. Eby wrote:

...

At 09:26 PM 6/26/2005 -0400, Bob Ippolito wrote:

...
On Jun 26, 2005, at 8:54 PM, Phillip J. Eby wrote:

...
At 12:22 AM 6/27/2005 +0200, Dörwald Walter wrote:

...
Phillip J. Eby wrote:

...
I'm also not keen on the fact that it makes certain things properties whose value can change over time; i.e. ctime/mtime/atime and size really shouldn't be properties, but rather methods.

I think ctime, mtime and atime should be (or return) datetime.datetime objects instead of integer timestamps.

With what timezone? I don't think that can be done portably and unambiguously, so I'm -1 on that.

That makes no sense, timestamps aren't any better,

Sure they are, if what you want is a timestamp. In any case, the most common use case I've seen for mtime and friends is just comparing against a previous value, or the value on another file, so it doesn't actually matter most of the time what the type of the value is.

I find timestamp values to be somewhat opaque. So all things being equal, I'd prefer datetime objects.

...

...
and datetime objects have no time zone set by default anyway. datetime.fromtimestamp(time.time()) gives you the same thing as datetime.now().

In which case, it's also easy enough to get a datetime if you really want one. I personally would rather do that than complicate the use cases where a datetime isn't really needed. (i.e. most of the time, at least in my experience)

We should have one uniform way of representing time in Python. IMHO datetime objects are the natural choice. Bye, Walter Dörwald

Phillip J. Eby

7:37 p.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

At 08:52 PM 6/27/2005 +0200, Walter Dörwald wrote:

...

Phillip J. Eby wrote:

...
At 09:26 PM 6/26/2005 -0400, Bob Ippolito wrote:

...
On Jun 26, 2005, at 8:54 PM, Phillip J. Eby wrote:

...
At 12:22 AM 6/27/2005 +0200, Dörwald Walter wrote:

...
Phillip J. Eby wrote:

...
I'm also not keen on the fact that it makes certain things properties whose value can change over time; i.e. ctime/mtime/atime and size really shouldn't be properties, but rather methods.

I think ctime, mtime and atime should be (or return) datetime.datetime objects instead of integer timestamps.

With what timezone? I don't think that can be done portably and unambiguously, so I'm -1 on that.

That makes no sense, timestamps aren't any better,

Sure they are, if what you want is a timestamp. In any case, the most common use case I've seen for mtime and friends is just comparing against a previous value, or the value on another file, so it doesn't actually matter most of the time what the type of the value is.

I find timestamp values to be somewhat opaque. So all things being equal, I'd prefer datetime objects.

Their opaqueness is actually one of the reasons I prefer them. :)

...

...
...
and datetime objects have no time zone set by default anyway. datetime.fromtimestamp(time.time()) gives you the same thing as datetime.now().

In which case, it's also easy enough to get a datetime if you really want one. I personally would rather do that than complicate the use cases where a datetime isn't really needed. (i.e. most of the time, at least in my experience)

We should have one uniform way of representing time in Python.

Um, counting the various datetime variants (date, time, datetime), timestamps (float and long), and time tuples, Python has 6 ways now. The type chosen for a given API is largely dependent on the *source* of the time value. And the value supplied by most existing OS-level Python APIs is either a long or a float. However, there's also a practicality-beats-purity issue here; using a datetime produces an otherwise-unnecessary dependency on the datetime module for users of this functionality, despite use cases for an actual datetime value being very infrequent. Date arithmetic on file timestamps that can't be trivially done in terms of seconds is rare, as is display of file timestamps. None of these use cases will be significantly harmed by having to use datetime.fromtimestamp(); they will probably be importing datetime already. What I don't want is for simple scripts to need to import datetime (even indirectly by way of the path class) just to get easy access to stat() values.
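A minimal sketch of the two use cases weighed above, assuming Python 3 and the standard os.stat() interface. The helper names is_newer and mtime_as_datetime are hypothetical and are not part of the proposed path module; they only illustrate that comparison works on the raw timestamp, while a datetime can be derived on demand.

```python
import os
from datetime import datetime


def is_newer(path_a, path_b):
    """Return True if path_a was modified more recently than path_b.

    Only the raw timestamps are compared, so datetime is not needed here.
    """
    return os.stat(path_a).st_mtime > os.stat(path_b).st_mtime


def mtime_as_datetime(path):
    """Convert the raw timestamp to a naive local-time datetime on demand."""
    return datetime.fromtimestamp(os.stat(path).st_mtime)
```

With this split, the datetime conversion is paid for only by callers that want to display or do calendar arithmetic on the value.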

Skip Montanaro

1:16 a.m.

New subject: Adding the 'path' module (was Re: Some RFE for review)

Walter> I think ctime, mtime and atime should be (or return)
Walter> datetime.datetime objects instead of integer timestamps.

+1

Skip

Raymond Hettinger

4:25 a.m.

...

1193128: str.translate(None, delchars) should be allowed and only delete delchars from the string.

I had agreed to this one and it's on my todo list to implement.
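For illustration only, here is how the requested delete-only behaviour can be spelled in Python 3, where the third argument of str.maketrans lists the characters to delete. The wrapper name delete_chars is hypothetical; the RFE itself concerns the two-argument form of str.translate.

```python
def delete_chars(s, delchars):
    """Return s with every character that appears in delchars removed."""
    # Empty first and second arguments mean "translate nothing";
    # the third argument names the characters to drop.
    return s.translate(str.maketrans("", "", delchars))
```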

...

1214675: warnings should get a removefilter() method. An alternative would be to fully document the "filters" attribute to allow direct tinkering with it.

I'm concerned that removefilter() may not work well in the presence of multiple modules that use the warnings module. It may be difficult to make sure the one removed wasn't subsequently added by another module. Also, the issue is compounded because the order of filter application is important.
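A small sketch of why filter order matters, assuming Python 3's warnings module, where simplefilter() inserts at the front of the filter list by default and the first matching filter wins. The function name outcome is hypothetical; catch_warnings() is used so the global filter list is restored afterwards.

```python
import warnings


def outcome(actions):
    """Install one DeprecationWarning filter per action, in order,
    then report what happens when such a warning is issued."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.resetwarnings()  # start from an empty filter list
        for action in actions:
            # Each call inserts at the front, so the last one installed
            # is the first one consulted.
            warnings.simplefilter(action, DeprecationWarning)
        try:
            warnings.warn("old API", DeprecationWarning)
        except DeprecationWarning:
            return "raised"
        return "recorded" if caught else "ignored"
```

The same two filters give opposite results depending on installation order, which is the kind of interaction that makes removing a single entry hard to reason about.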

...

1205239: Shift operands should be allowed to be negative integers, so e.g. a << -2 is the same as a >> 2. Review: Allowing this would open a source of bugs previously well identifiable.

The OP is asking why it is different for negative sequence indices (why the added convenience was thought to outweigh the loss of error detection).
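To make the trade-off concrete: Python currently rejects a negative shift count with ValueError, and the requested behaviour is easy to obtain explicitly. The helper name shift below is hypothetical and is only a sketch of the semantics the RFE asks for.

```python
def shift(a, n):
    """Shift a left by n bits; a negative n shifts right by -n bits."""
    return a << n if n >= 0 else a >> -n
```

Keeping this in a helper preserves the built-in error for an accidental negative count while making the intentional case explicit at the call site.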

...

1152248: In order to read "records" separated by something other than newline, file objects should either support an additional parameter (the separator) to (x)readlines(), or gain an additional method which does this. Review: The former is a no-go, I think, because what is read won't be lines.

Okay, call it a record then. The OP's request is not a non-starter. There is a proven precedent in AWK, which allows programmer-specifiable record separators.

...

The latter is further complicating the file interface, so I would follow the principle that not every 3-line function should be builtin.

This is not a design principle. UserDict.DictMixin shows that most of the mapping API is easily expressible in terms of a few lines and a few primitives; however, the mapping API has long been proven valuable for its expressiveness. Likewise, Guido's any() and all() builtins can be expressed in a single line but were accepted anyway. A more nuanced version of the "principle" is: if a proposal can be easily expressed with a small grouping of existing constructs, then it must meet much higher standards of use frequency and expressiveness in order to be accepted.

...

1110010: A function "attrmap" should be introduced which is used as follows: attrmap(x)['att'] == getattr(x, 'att') The submitter mentions the use case of new-style classes without a __dict__ used at the right of %-style string interpolation. Review: I don't know whether this is worth it.

While potentially useful, the function is entirely unintuitive (it has to be studied a bit before being able to see what it is for). Also, the OP is short on use cases (none were presented). IMO, this belongs as a cookbook recipe. Raymond Hettinger
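For readers trying to picture the proposal, a cookbook-style sketch of attrmap might look like the following, assuming Python 3. Only the name attrmap and the attrmap(x)['att'] usage come from the RFE; the implementation and the Point class are assumptions made for illustration.

```python
class attrmap:
    """Expose an object's attributes through mapping-style lookup."""

    def __init__(self, obj):
        self._obj = obj

    def __getitem__(self, name):
        try:
            return getattr(self._obj, name)
        except AttributeError:
            # Mapping protocol: a missing key is a KeyError.
            raise KeyError(name) from None


class Point:
    """Example class without a __dict__, as in the submitter's use case."""
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y
```

Usage on the right-hand side of %-style interpolation would then be "%(x)s,%(y)s" % attrmap(Point(1, 2)).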

Nick Coghlan

9:34 a.m.

Reinhold Birkenfeld wrote:

...

1152248: In order to read "records" separated by something other than newline, file objects should either support an additional parameter (the separator) to (x)readlines(), or gain an additional method which does this. Review: The former is a no-go, I think, because what is read won't be lines. The latter is further complicating the file interface, so I would follow the principle that not every 3-line function should be builtin.

As Douglas Alan's sample implementation (and his second attempt [1]) show, getting this right (and reasonably efficient) is actually a non-trivial exercise. Leveraging the existing xreadlines infrastructure is an idea worth considering. I think it's worth leaving this one open, and see if someone comes up with a patch (obviously, this was my opinion from the start, or I wouldn't have raised the RFE in response to Douglas's query!) Cheers, Nick. [1] http://mail.python.org/pipermail/python-list/2005-February/268547.html -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia --------------------------------------------------------------- http://boredomandlaziness.blogspot.com

Paul Moore

12:22 p.m.

On 6/27/05, Nick Coghlan <ncoghlan@gmail.com> wrote:

...

As Douglas Alan's sample implementation (and his second attempt [1]) show, getting this right (and reasonably efficient) is actually a non-trivial exercise. Leveraging the existing xreadlines infrastructure is an idea worth considering.

I think it's worth leaving this one open, and see if someone comes up with a patch (obviously, this was my opinion from the start, or I wouldn't have raised the RFE in response to Douglas's query!)

As a more general approach, would it be worth considering an addition to itertools which took an iterator which generated "blocks" of items, and split them on a subsequence? It's a generalisation of the basic pattern here, and would be able to encapsulate the fiddly "what if a separator overlaps a block split" logic without locking it down to string manipulation... Or does that count as overgeneralisation? Paul.
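A rough sketch of the block-boundary handling mentioned above, assuming Python 3 and a text stream with a read(size) method. The name iter_records and the blocksize parameter are hypothetical, and this is not Douglas Alan's implementation; it favours clarity over efficiency, since the buffer is re-split on every block.

```python
def iter_records(stream, sep, blocksize=4096):
    """Yield records from a text stream, split on the separator sep."""
    if not sep:
        raise ValueError("separator must be non-empty")
    buf = ""
    while True:
        block = stream.read(blocksize)
        if not block:
            break
        buf += block
        parts = buf.split(sep)
        # The last piece may be an incomplete record or may end with
        # part of a separator, so keep it for the next round.
        buf = parts.pop()
        for record in parts:
            yield record
    if buf:
        yield buf  # final record without a trailing separator
```

Because the unfinished tail is carried over and re-scanned together with the next block, a separator that straddles two blocks is still found.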

Raymond Hettinger

2:32 p.m.

[Paul Moore on readline getting a record separator argument]

...

As a more general approach, would it be worth considering an addition to itertools which took an iterator which generated "blocks" of items, and split them on a subsequence?

Nope. Assign responsibility to the class that has all of the relevant knowledge (API for retrieving blocks, type of the retrieved data, how EOF is detected, etc).

...

It's a generalisation of the basic pattern here, and would be able to encapsulate the fiddly "what if a separator overlaps a block split" logic without locking it down to string manipulation...

How do you build, scan, and extract the buffer in a type independent manner? Are there any use cases for non-string data buffers, a stream of integers or somesuch? Raymond

Oren Tirosh

3:40 p.m.

On 6/27/05, Nick Coghlan <ncoghlan@gmail.com> wrote:

...

Reinhold Birkenfeld wrote:

...
1152248: In order to read "records" separated by something other than newline, file objects should either support an additional parameter (the separator) to (x)readlines(), or gain an additional method which does this. Review: The former is a no-go, I think, because what is read won't be lines. The latter is further complicating the file interface, so I would follow the principle that not every 3-line function should be builtin.

As Douglas Alan's sample implementation (and his second attempt [1]) show, getting this right (and reasonably efficient) is actually a non-trivial exercise. Leveraging the existing xreadlines infrastructure is an idea worth considering.

Do you mean the existing xreadlines infrastructure that no longer exists since 2.4? :-) An infrastructure that could be leveraged is the readahead buffer used by the file object's line iterator. Oren

Nick Coghlan

9:59 p.m.

Oren Tirosh wrote:

...

An infrastructure that could be leveraged is the readahead buffer used by the file object's line iterator.

That's the infrastructure I meant. I was just being sloppy with my terminology ;) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia --------------------------------------------------------------- http://boredomandlaziness.blogspot.com

Reinhold Birkenfeld

July 2005

4:34 p.m.

Reinhold Birkenfeld wrote:

...

Hi,

while bugs and patches are sometimes tricky to close, RFE can be very easy to decide whether to implement in the first place. So what about working a bit on this front? Here are several RFE reviewed, perhaps some can be closed ("should" is always from submitter's point of view):

Aren't there any opinions on the RFEs other than the "path module" one? Reinhold -- Mail address is perfectly valid!


