[Python-Dev] PEP 471 "scandir" accepted

Akira Li 4kir4.1i at gmail.com
Wed Jul 23 01:21:14 CEST 2014


Ben Hoyt <benhoyt at gmail.com> writes:

>> Note: listdir() accepts an integer path (an open file descriptor that
>> refers to a directory) that is passed to fdopendir() on POSIX [4] i.e.,
>> *you can't use scandir() to replace listdir() in this case* (as I've
>> already mentioned in [1]). See the corresponding tests from [2].
>>
>> [1] https://mail.python.org/pipermail/python-dev/2014-July/135296.html
>> [2] https://mail.python.org/pipermail/python-dev/2014-June/135265.html
>>
>> From os.listdir() docs [3]:
>>
>>> This function can also support specifying a file descriptor; the file
>>> descriptor must refer to a directory.
>>
>> [3] https://docs.python.org/3.4/library/os.html#os.listdir
>> [4] http://hg.python.org/cpython/file/3.4/Modules/posixmodule.c#l3736
>
> Fair point.
>
> Yes, I hadn't realized listdir supported dir_fd (must have been
> looking at 2.x docs), though you've pointed it out at [1] above, and I
> guess I wasn't thinking about implementation at the time.

FYI, dir_fd is related but *different*: compare "specifying a file
descriptor" [1] vs. "paths relative to directory descriptors" [2].

"NOTE: os.supports_fd and os.supports_dir_fd are different sets." [3]:

  >>> import os
  >>> os.listdir in os.supports_fd
  True
  >>> os.listdir in os.supports_dir_fd
  False


[1] https://docs.python.org/3/library/os.html#path-fd
[2] https://docs.python.org/3/library/os.html#dir-fd
[3] https://mail.python.org/pipermail/python-dev/2014-July/135296.html

To be clear: *listdir() does not support dir_fd* though it can be
emulated using os.open(dir_fd=..).

You can safely ignore the rest of the e-mail until you want to implement
path-fd [1] support for os.scandir() in several months.

Here's a code example that demonstrates both path-fd [1] and dir-fd [2]:

  import contextlib
  import os

  with contextlib.ExitStack() as stack:
      dir_fd = os.open('/etc', os.O_RDONLY)
      stack.callback(os.close, dir_fd)
      fd = os.open('init.d', os.O_RDONLY, dir_fd=dir_fd) # dir-fd [2]
      stack.callback(os.close, fd)
      print("\n".join(os.listdir(fd))) # path-fd [1]

It is the same as os.listdir('/etc/init.d') unless '/etc' is symlinked
to refer to another directory after the first os.open('/etc',..)
call. See also os.fwalk(dir_fd=..) [4].

[4] https://docs.python.org/3/library/os.html#os.fwalk
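
As a rough illustration of the dir-fd style with os.fwalk() [4], here is a
minimal sketch that sums file sizes by resolving each name relative to the
directory descriptor that os.fwalk() yields. The function name tree_size is
made up for this example, and skipping files that vanish mid-walk is my own
assumption about the desired behaviour:

```python
import os


def tree_size(top):
    """Sum the sizes of the non-directory entries found under top."""
    total = 0
    # os.fwalk() yields an open descriptor (rootfd) for each directory
    for root, dirs, files, rootfd in os.fwalk(top):
        for name in files:
            try:
                # dir-fd: name is resolved relative to rootfd, not to a path
                st = os.stat(name, dir_fd=rootfd, follow_symlinks=False)
            except FileNotFoundError:
                continue  # assumption: ignore files removed during the walk
            total += st.st_size
    return total
```

Calling tree_size('/etc/init.d') should work on platforms where os.fwalk()
is available (it is Unix-only).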

> However, given that we have to support this for listdir() anyway, I
> think it's worth reconsidering whether scandir()'s directory argument
> can be an integer FD.

What is entry.path in this case? If the input directory is a file
descriptor (an integer), then os.path.join(directory, entry.name) won't work.
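
A small sketch of the failure, assuming a current CPython 3: the integer
below is only a placeholder value (no descriptor is opened), and the point
is just that os.path.join() rejects a non-path first argument:

```python
import os

directory_fd = 3  # placeholder integer; it is never used as a real descriptor

try:
    os.path.join(directory_fd, "passwd")
except TypeError as exc:
    error = exc  # an int is not str, bytes or os.PathLike
else:
    error = None

print(type(error).__name__)
```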

"PEP 471 should explicitly reject the support for specifying a file
descriptor so that a code that uses os.scandir may assume that
entry.path attribute is always present (no exceptions due
to a failure to read /proc/self/fd/NNN or an error while calling
fcntl(F_GETPATH) or GetFileInformationByHandleEx() -- see
http://stackoverflow.com/q/1188757 )." [5]

[5] https://mail.python.org/pipermail/python-dev/2014-July/135441.html

On the other hand, os.fwalk() [4], which supports both path-fd [1] and
dir-fd [2], could be implemented without the entry.path property if
os.scandir() supports just path-fd [1]. os.fwalk() provides a safe way
to traverse a directory tree without symlink races, e.g., [6]:

  def get_tree_size(directory):
      """Return total size of files in directory and subdirs."""
      return sum(entry.lstat().st_size
                 for root, dirs, files, rootfd in fwalk(directory)
                 for entry in files)

[6] http://legacy.python.org/dev/peps/pep-0471/#examples

where fwalk() is an exact copy of os.fwalk() except that it uses
_fwalk(), which is defined in terms of scandir():

  import os

  # adapt os._fwalk() to use scandir() instead of os.listdir()
  def _fwalk(topfd, toppath, topdown, onerror, follow_symlinks):
      # Note: This uses O(depth of the directory tree) file descriptors:
      # if necessary, it can be adapted to only require O(1) FDs, see
      # http://bugs.python.org/issue13734

      entries = os.scandir(topfd)
      dirs, nondirs = [], []
      for entry in entries: #XXX call onerror on OSError on next() and return?
          # report symlinks to directories as directories (like os.walk)
          #  but no recursion into symlinked subdirectories unless
          #  follow_symlinks is true

          # add dangling symlinks as nondirs (DirEntry.is_dir() doesn't
          #  raise on broken links)
          try:
              (dirs if entry.is_dir() else nondirs).append(entry)
          except FileNotFoundError:
              continue # ignore disappeared files

      if topdown:
          yield toppath, dirs, nondirs, topfd

      for entry in dirs:
          try:
              orig_st = entry.stat(follow_symlinks=follow_symlinks)
              #XXX O_DIRECTORY, O_CLOEXEC, [? O_NOCTTY, O_SEARCH ?]
              dirfd = os.open(entry.name, os.O_RDONLY, dir_fd=topfd)
          except OSError as err:
              if onerror is not None:
                  onerror(err)
              return
          try:
              if follow_symlinks or os.path.samestat(orig_st, os.stat(dirfd)):
                  dirpath = os.path.join(toppath, entry.name) # entry.path
                  yield from _fwalk(dirfd, dirpath, topdown, onerror,
                                    follow_symlinks)
          finally:
              os.close(dirfd) # or use with entry.opendir() as dirfd: ...

      if not topdown:
          yield toppath, dirs, nondirs, topfd


i.e., if os.scandir() supports specifying file descriptors [1] then it
is relatively straightforward to define os.fwalk() in terms of it. Would
scandir() provide the same performance benefits as for os.walk()?

entry.stat() can be implemented without entry.path when entry._directory
(or whichever DirEntry attribute stores the first parameter to
os.scandir(fd)) is an open file descriptor that refers to a directory:

  def stat(self, *, follow_symlinks=True):
      return os.stat(self.name, #NOTE: ignore caching
          follow_symlinks=follow_symlinks, dir_fd=self._directory)
  lstat = lambda self: self.stat(follow_symlinks=False)
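
To make the idea concrete without a real scandir(), here is a minimal
sketch. FdEntry and entries() are hypothetical names invented for this
example (they are not part of PEP 471), and os.listdir(fd) stands in for
the directory iteration:

```python
import os


class FdEntry:
    """Hypothetical stand-in for DirEntry; for illustration only."""

    def __init__(self, name, directory_fd):
        self.name = name
        self._directory = directory_fd  # open descriptor of the parent dir

    def stat(self, *, follow_symlinks=True):
        # dir-fd: resolve self.name relative to the directory descriptor,
        # so no entry.path is needed (caching is ignored here)
        return os.stat(self.name, dir_fd=self._directory,
                       follow_symlinks=follow_symlinks)

    def lstat(self):
        return self.stat(follow_symlinks=False)


def entries(directory_fd):
    # path-fd: os.listdir() accepts an open directory descriptor
    return [FdEntry(name, directory_fd) for name in os.listdir(directory_fd)]
```

The caller stays responsible for opening and closing directory_fd, as in
the ExitStack example earlier in this message.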


--
Akira

