os.listdir iteration support

Hi to all, I would find very useful having a version of os.listdir returning a generator. If a directory has many files, say 20,000, it could take a long time getting all of them with os.listdir and this could be a problem in asynchronous environments (e.g. asynchronous servers). The only solution which comes to my mind in such case is using a thread/fork or having a non-blocking version of listdir() returning an iterator. What do you think about that?
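As a rough illustration of what is being asked for, here is a minimal sketch of a lazily evaluated directory listing. It assumes a modern Python (3.6 or later) where `os.scandir()` is available, which was not the case when this thread was written; the helper name `iter_names` is hypothetical.

```python
import os
import tempfile
from itertools import islice


def iter_names(path):
    """Yield entry names one at a time instead of building a full list."""
    # The with-block closes the directory handle when the generator
    # finishes or is closed early.
    with os.scandir(path) as entries:
        for entry in entries:
            yield entry.name


# Demo on a throwaway directory holding five empty files.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(5):
        open(os.path.join(tmp, "file%d.txt" % i), "w").close()

    gen = iter_names(tmp)
    first_two = list(islice(gen, 2))   # only two entries are consumed
    gen.close()                        # release the handle explicitly

    everything = sorted(iter_names(tmp))
```

A caller that only needs the first few entries never pays for the rest, which is the property the proposal is after.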

"Giampaolo Rodola'" <gnewsg@gmail.com> wrote in message news:d827975f-7c1e-471e-bac1-8d55262ab122@d27g2000prf.googlegroups.com...
| I would find very useful having a version of os.listdir returning a generator.
If there are no technical issues in the way, such a replacement (rather than addition) would be in line with other list -> iterator replacements in 3.0 (range, dict.items, etc). A list could then be obtained with list(os.listdir(path)).
tjr

On Nov 22, 2007 3:25 PM, Terry Reedy <tjreedy@udel.edu> wrote:
But how common is this use case really? -- --Guido van Rossum (home page: http://www.python.org/~guido/)

On Nov 23, 2007 6:06 AM, Giampaolo Rodola' <gnewsg@gmail.com> wrote:
But how many FTP servers are written in Python *and* have directories with 20,000 files in them?
--Guido
-- --Guido van Rossum (home page: http://www.python.org/~guido/)

On 23 Nov, 21:23, "Guido van Rossum" <gu...@python.org> wrote:
I sincerely don't know. Surely it's a rather specific use case, but it is one of the tasks which takes the longest amount of time on an FTP server. 20,000 is probably an exaggerated hypothetical situation, so I did a simple test with a more realistic scenario. On Windows a very crowded directory is C:\windows\system32. Currently the C:\windows\system32 of my Windows XP workstation contains 2201 files. I tried to run the code below, which is how an FTP server should properly respond to a "LIST" command issued by a client. It took 1.70300006866 seconds to complete the first time and 0.266000032425 the second one. I don't know if such a specific use case could justify adding listdir generator support to the stdlib, but having something like Greg Ewing's opendir module could have saved a lot of time in this specific case.
-- Giampaolo

    import os, stat, time
    from tarfile import filemode
    try:
        import pwd, grp
    except ImportError:
        pwd = grp = None

    def format_list(directory):
        """Return a directory listing emulating "/bin/ls -lA" UNIX
        command output.

        This is how output appears to client:

        -rw-rw-rw-   1 owner    group     7045120 Sep 02 3:47 music.mp3
        drwxrwxrwx   1 owner    group           0 Aug 31 18:50 e-books
        -rw-rw-rw-   1 owner    group         380 Sep 02 3:40 module.py
        """
        listing = os.listdir(directory)
        result = []
        for basename in listing:
            file = os.path.join(directory, basename)

            # if the file is a broken symlink, use lstat to get stat for
            # the link
            try:
                stat_result = os.stat(file)
            except (OSError,AttributeError):
                stat_result = os.lstat(file)

            perms = filemode(stat_result.st_mode)  # permissions
            nlinks = stat_result.st_nlink  # number of links to inode
            if not nlinks:  # non-posix system, let's use a bogus value
                nlinks = 1

            if pwd and grp:
                # get user and group name, else just use the raw uid/gid
                try:
                    uname = pwd.getpwuid(stat_result.st_uid).pw_name
                except KeyError:
                    uname = stat_result.st_uid
                try:
                    gname = grp.getgrgid(stat_result.st_gid).gr_name
                except KeyError:
                    gname = stat_result.st_gid
            else:
                # on non-posix systems the only chance we use default
                # bogus values for owner and group
                uname = "owner"
                gname = "group"

            size = stat_result.st_size  # file size

            # stat.st_mtime could fail (-1) if file's last modification
            # time is too old, in that case we return local time as last
            # modification time.
            try:
                mtime = time.strftime("%b %d %H:%M",
                                      time.localtime(stat_result.st_mtime))
            except ValueError:
                mtime = time.strftime("%b %d %H:%M")

            # if the file is a symlink, resolve it, e.g. "symlink -> real_file"
            if stat.S_ISLNK(stat_result.st_mode):
                basename = basename + " -> " + os.readlink(file)

            # formatting is matched with proftpd ls output
            result.append("%s %3s %-8s %-8s %8s %s %s\r\n" %(
                perms, nlinks, uname, gname, size, mtime, basename))
        return ''.join(result)

    if __name__ == '__main__':
        before = time.time()
        format_list(r'C:\windows\system32')
        print time.time() - before

On Fri, Nov 23, 2007, Giampaolo Rodola' wrote:
Your code calls os.stat() on each file. I know from past experience that os.stat() is *extremely* expensive. Because os.listdir() runs at C speed, it only gets slow when run against hundreds of thousands of entries. (One directory on a work server has over 200K entries, and it takes os.listdir() about twenty seconds. I believe that if we switched from ext3 to something more appropriate that would get reduced.)
Doubtful. -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ "Typing is cheap. Thinking is expensive." --Roy Smith
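One way to check where the time actually goes is to time the listing and the per-file stat calls separately. The following is a minimal sketch in modern Python; the function name `time_listing` and the demo directory are assumptions for illustration, and the numbers it produces depend entirely on the filesystem and cache state.

```python
import os
import tempfile
import time


def time_listing(path):
    """Return (entry count, seconds spent listing, seconds spent in lstat)."""
    start = time.perf_counter()
    names = os.listdir(path)
    list_seconds = time.perf_counter() - start

    start = time.perf_counter()
    for name in names:
        # lstat avoids failing on broken symlinks
        os.lstat(os.path.join(path, name))
    stat_seconds = time.perf_counter() - start

    return len(names), list_seconds, stat_seconds


# Demo on a throwaway directory holding 200 empty files.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(200):
        open(os.path.join(tmp, "f%03d" % i), "w").close()
    count, list_seconds, stat_seconds = time_listing(tmp)
    print("%d entries: listdir %.6fs, lstat loop %.6fs"
          % (count, list_seconds, stat_seconds))
```

Running it against a large real directory, once cold and once warm, separates the cost of enumerating names from the cost of fetching metadata.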

On Thu, Nov 22, 2007, Giampaolo Rodola' wrote:
-1 The problem is that reading a directory requires an open file handle; given a generator context, there's no clear mechanism for determining when to close the handle. Because the list needs to be created in the first place, why bother with a generator? -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ "Typing is cheap. Thinking is expensive." --Roy Smith

Greg Ewing wrote:
What about an os.iterdir() generator which uses opendir/readdir as proposed? The generator's close() could also call closedir(), and you could have a warning in the docs about making sure to have it closed at some point. One could even use an enclosing with closing(os.iterdir()) as d: block. Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out.
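The `with closing(...)` idea can be sketched as follows. `os.iterdir()` was only a proposal in this thread, so the `iterdir` function below is a hypothetical stand-in, built here on `os.scandir()` from modern Python rather than on opendir/readdir directly. The point is that a generator's `close()` runs the `finally` clause, which releases the directory handle even when iteration stops early.

```python
import os
import tempfile
from contextlib import closing


def iterdir(path):
    """Hypothetical stand-in for the proposed os.iterdir()."""
    handle = os.scandir(path)
    try:
        for entry in handle:
            yield entry.name
    finally:
        # Runs on normal exhaustion and also when close() is called
        # on a partially consumed generator.
        handle.close()


# Demo: stop after the first entry and let closing() clean up.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a", "b", "c"):
        open(os.path.join(tmp, name), "w").close()

    with closing(iterdir(tmp)) as d:
        first = next(d)

    # A closed generator yields nothing further.
    try:
        next(d)
        exhausted = False
    except StopIteration:
        exhausted = True
```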

Georg Brandl wrote:
What about an os.iterdir() generator which uses opendir/readdir as proposed?
I was feeling in the mood for a diversion, so I whipped up a Pyrex prototype of an opendir() object that can be used either as a file-like object or an iterator. Here's the docstring:

    """opendir(pathname) --> an open directory object

    Opens a directory and provides incremental access to
    the filenames it contains. May be used as a file-like
    object or as an iterator.

    When used as a file-like object, each call to read()
    returns one filename, or an empty string when the end
    of the directory is reached. The close() method should
    be called when finished with the directory.

    The close() method should also be called when used as
    an iterator and iteration is stopped prematurely. If
    iteration proceeds to completion, the directory is
    closed automatically."""

Source, setup.py and a brief test attached.
-- Greg

    ##############################################################
    #
    #   opendir.pyx - A class exposing the functionality of
    #   ===========   the opendir() family of C library functions.
    #
    #   By Gregory Ewing
    #   greg.ewing@canterbury.ac.nz
    #
    #   This software and derivative works created from it
    #   may be used and redistributed without restriction.
    #
    ##############################################################

    cdef extern from "sys/errno.h":
        int errno

    cdef extern from "stdio.h":
        char *strerror(int)

    cdef extern from "dirent.h":
        ctypedef struct DIR
        struct dirent:
            int d_namlen
            char d_name[1]
        DIR *c_opendir "opendir" (char *)
        int readdir_r(DIR *, dirent *, dirent **)
        long telldir(DIR *)
        void seekdir(DIR *, long)
        void rewinddir(DIR *)
        int closedir(DIR *)
        int dirfd(DIR *)

    #------------------------------------------------------------------

    cdef class opendir:
        """opendir(pathname) --> an open directory object

        Opens a directory and provides incremental access to
        the filenames it contains. May be used as a file-like
        object or as an iterator.

        When used as a file-like object, each call to read()
        returns one filename, or an empty string when the end
        of the directory is reached. The close() method should
        be called when finished with the directory.

        The close() method should also be called when used as
        an iterator and iteration is stopped prematurely. If
        iteration proceeds to completion, the directory is
        closed automatically."""

        cdef DIR *dir

        def __cinit__(self, char *path):
            self.dir = c_opendir(path)
            if not self.dir:
                raise IOError(errno, "%s: '%s'" % (strerror(errno), path))

        def __dealloc__(self):
            if self.dir:
                closedir(self.dir)

        def read(self):
            """read() --> filename or empty string

            Returns the next filename from the directory, or an empty
            string if the end of the directory has been reached."""

            cdef dirent entry, *result
            check_open(self)
            if readdir_r(self.dir, &entry, &result) < 0:
                raise IOError(errno)
            if result:
                return entry.d_name
            else:
                return ""

        def tell(self):
            """tell() --> position

            Returns a value representing the current position in the
            directory, suitable for passing to seek(). Only valid for
            this directory object as long as it remains open."""

            check_open(self)
            return telldir(self.dir)

        def seek(self, long pos):
            """seek(position)

            Returns the directory to the specified position, which
            should be a value previously returned by tell()."""

            check_open(self)
            seekdir(self.dir, pos)

        def rewind(self):
            """rewind()

            Resets the position to the beginning of the directory."""

            check_open(self)
            rewinddir(self.dir)

        def close(self):
            """close()

            Closes the directory and frees the underlying file
            descriptor."""

            if self.dir:
                if closedir(self.dir) < 0:
                    raise IOError(errno)
                self.dir = NULL

        # MacOSX doesn't seem to have dirfd, despite what the
        # man page says. :-(
        #
        # def fileno(self):
        #     """fileno() --> file descriptor
        #
        #     Returns the file descriptor associated with the open directory."""
        #
        #     check_open(self)
        #     return dirfd(self.dir)

        def __iter__(self):
            return self

        def __next__(self):
            """next() --> filename

            Returns the next filename from the directory. If the end of
            the directory has been reached, closes the directory and
            raises StopIteration."""

            if self.dir:
                result = self.read()
                if result:
                    return result
                self.close()
            raise StopIteration

    #------------------------------------------------------------------

    cdef int check_open(opendir d) except -1:
        if not d.dir:
            raise ValueError("Directory is closed")
        return 0

    from distutils.core import setup
    from Pyrex.Distutils.extension import Extension
    from Pyrex.Distutils import build_ext

    setup(
        name = 'opendir',
        ext_modules=[
            Extension("opendir", ["opendir.pyx"]),
        ],
        cmdclass = {'build_ext': build_ext}
    )

    #
    #   Test the opendir module
    #

    from opendir import opendir

    print "READ"
    d = opendir(".")
    while 1:
        name = d.read()
        if not name:
            break
        print "  ", name
    print "EOF"

    print "ITERATE"
    d = opendir(".")
    for name in d:
        print "  ", name
    print "STOP"

    print "TELL/SEEK"
    d = opendir(".")
    for i in range(3):
        name = d.read()
        print "  ", name
    pos = d.tell()
    for i in range(3):
        name = d.read()
        print "  ", name
    d.seek(pos)
    while 1:
        name = d.read()
        if not name:
            break
        print "  ", name
    print "EOF"

    print "REWIND"
    d = opendir(".")
    for i in range(3):
        name = d.read()
        print "  ", name
    d.rewind()
    while 1:
        name = d.read()
        if not name:
            break
        print "  ", name
    print "EOF"

    print "EXCEPTION"
    try:
        d = opendir("spanish_inquisition")
    except Exception, e:
        print e

    print "DONE"
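For readers without Pyrex, the read()/close()/iterator interface described in the docstring can be approximated in pure Python. This is only a sketch under assumptions: the class name `DirReader` is hypothetical, it wraps `os.scandir()` from modern Python instead of calling opendir/readdir, it omits tell(), seek() and rewind(), and unlike raw readdir it never returns the "." and ".." entries.

```python
import os
import tempfile


class DirReader:
    """Pure-Python approximation of the opendir object's interface."""

    def __init__(self, path):
        self._entries = os.scandir(path)

    def read(self):
        """Return the next filename, or "" at the end of the directory."""
        if self._entries is None:
            raise ValueError("Directory is closed")
        entry = next(self._entries, None)
        return entry.name if entry is not None else ""

    def close(self):
        """Close the directory; safe to call more than once."""
        if self._entries is not None:
            self._entries.close()
            self._entries = None

    def __iter__(self):
        return self

    def __next__(self):
        # Mirrors the prototype: close automatically on exhaustion.
        if self._entries is not None:
            name = self.read()
            if name:
                return name
            self.close()
        raise StopIteration


# Demo: iterate to completion, then confirm the object closed itself.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("one", "two"):
        open(os.path.join(tmp, name), "w").close()

    reader = DirReader(tmp)
    names = sorted(reader)

    try:
        reader.read()
        closed_after_iteration = False
    except ValueError:
        closed_after_iteration = True
```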

Adam Atlas wrote:
It doesn't, actually. On Windows, os.listdir uses FindFirstFile and FindNextFile, on OS/2 it's DosFindFirst and DosFindNext, and on everything else it's POSIX opendir and readdir. All of these are incremental, so a generator is the most natural way to expose the underlying API. That's just a set of facts and a single opinion. Past that I personally have no preference. Neil

On Fri, Nov 23, 2007, Adam Atlas wrote:
Enh. That is not reliable without work, and getting it reliable is a waste of work. The proposed idea for adding an opendir() function is workable, but it still doesn't solve the need for closing the handle within listdir(). No matter what, changing the semantics of listdir() to leave a handle lying around is going to cause problems for some people.
Because the list needs to be created in the first place
How so?
If you're going to ask a question, it would be nice to leave the entire original context in place, especially given that it's not a particularly long chunk of text. Anyway, the Windows case aside, if you don't have a reliable close() mechanism, you need to slurp the whole thing into a list in one swell foop so that you can just close the handle. Even in the Windows case, you need a handle, and I don't know what the consequences are of leaving it lying around. -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ "Typing is cheap. Thinking is expensive." --Roy Smith
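The "slurp it into a list, then close the handle" strategy can be sketched like this. The function name `listdir_eager` is hypothetical, and the sketch again assumes modern Python's `os.scandir()` as the incremental primitive. The handle never outlives the call, so the caller has nothing to clean up.

```python
import os
import tempfile


def listdir_eager(path):
    """Read every entry name, then close the handle before returning."""
    with os.scandir(path) as handle:
        names = [entry.name for entry in handle]
    # The directory handle is already closed at this point.
    return names


# Demo: the result matches os.listdir() up to ordering.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("x", "y", "z"):
        open(os.path.join(tmp, name), "w").close()
    eager = sorted(listdir_eager(tmp))
    reference = sorted(os.listdir(tmp))
```

The trade-off is the one debated above: no dangling handle, at the cost of reading the whole directory before the caller sees the first name.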
participants (8)
- Aahz
- Adam Atlas
- Georg Brandl
- Giampaolo Rodola'
- Greg Ewing
- Guido van Rossum
- Neil Toronto
- Terry Reedy