iglob performance no better than glob
Kyp
kyp at stsci.edu
Sun Jan 31 19:23:05 EST 2010
On Jan 31, 2:44 pm, Peter Otten <__pete... at web.de> wrote:
> Kyp wrote:
> > I have a dir with a large number of files that I need to perform
> > operations on, but I only need to access a subset of the files, i.e.
> > the first 100 files.
>
> > Using glob is very slow, so I ran across iglob, which returns an
> > iterator, which seemed just like what I wanted. I could iterate over
> > the files that I wanted, not having to read the entire dir.
>
> > The iglob call itself returned immediately, but accessing the first
> > file took about the same time as glob.glob.
>
> > Here's some code to compare glob vs. iglob performance. It prints the
> > time before and after a glob.iglob('*.*') / files.next() sequence and
> > a glob.glob('*.*') call.
>
> > #!/usr/bin/env python
>
> > import glob,time
> > print '\nTest of glob.iglob'
> > print 'before iglob:', time.asctime()
> > files = glob.iglob('*.*')
> > print 'after iglob:',time.asctime()
> > print files.next()
> > print 'after files.next():', time.asctime()
>
> > print '\nTest of glob.glob'
> > print 'before glob:', time.asctime()
> > files = glob.glob('*.*')
> > print 'after glob:',time.asctime()
>
> > Here are the results:
>
> > Test of glob.iglob
> > before iglob: Sun Jan 31 11:09:08 2010
> > after iglob: Sun Jan 31 11:09:08 2010
> > foo.bar
> > after files.next(): Sun Jan 31 11:09:59 2010
>
> > Test of glob.glob
> > before glob: Sun Jan 31 11:09:59 2010
> > after glob: Sun Jan 31 11:10:51 2010
>
> > The results are about the same for the 2 approaches, both took about
> > 51 seconds. Am I doing something wrong with iglob?
>
> No, but iglob() being lazy is pointless in your case because it uses
> os.listdir() and fnmatch.filter() underneath: os.listdir() reads the
> whole directory, and fnmatch.filter() processes that complete list,
> before anything is returned.
>
> > Is there a way to get the first X files from a dir with lots of
> > files that does not take a long time to run?
>
> Here's my attempt. It turned out to be more work than expected, so I cut a
> few corners. It's Linux-only "works on my machine" code, but may give you
> some hints on how to proceed.
>
> from ctypes import *
> import fnmatch
> import glob
> import os
> import re
> from itertools import ifilter, imap
>
> class dirent(Structure):
>     "works on my machine ;)"
>     _fields_ = [
>         ("d_ino", c_long),
>         ("d_off", c_long),
>         ("d_reclen", c_ushort),
>         ("d_type", c_ubyte),
>         ("d_name", c_char*256)]
>
> direntp = POINTER(dirent)
>
> LIBC = "libc.so.6"
> cdll.LoadLibrary(LIBC)
> libc = CDLL(LIBC)
> libc.readdir.restype = direntp
>
> def diriter(dir):
>     "lazy partial replacement for os.listdir()"
>     # errors? what errors?
>     dirp = libc.opendir(dir)
>     if not dirp:
>         return
>     try:
>         while True:
>             ep = libc.readdir(dirp)
>             if not ep:
>                 break
>             yield ep.contents.d_name
>     finally:
>         libc.closedir(dirp)
>
> def filter(names, pattern):
>     "lazy partial replacement for fnmatch.filter()"
>     import posixpath
>
>     pattern = os.path.normcase(pattern)
>     r = fnmatch.translate(pattern)
>     r = re.compile(r)
>
>     if os.path is not posixpath:
>         names = imap(os.path.normcase, names)
>
>     return ifilter(r.match, names)
>
> def globiter(path):
>     "lazy partial replacement for glob.glob()"
>     dir, filename = os.path.split(path)
>     if glob.has_magic(dir):
>         raise ValueError("wildcards in directory not supported")
>     return filter(diriter(dir), filename)
>
> if __name__ == "__main__":
>     import sys
>     [pattern] = sys.argv[1:]
>     for name in globiter(pattern):
>         print name
>
> Peter
I'll give it a try, thanx for the reply.
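
For later reference, here is a minimal sketch of the same lazy idea
without ctypes. It assumes Python 3.6 or newer, where os.scandir()
yields directory entries as they are read and can be used as a context
manager; that function is not available in the Python 2 used in this
thread. The helper name first_matches and its parameters are made up
for illustration. Unlike glob, this sketch does not hide names that
start with a dot.

```python
import fnmatch
import itertools
import os


def first_matches(directory, pattern, limit):
    """Return up to `limit` names in `directory` that match `pattern`.

    Hypothetical helper, not part of glob or any other library.
    """
    # os.scandir() is an iterator over directory entries, so the scan
    # can stop as soon as enough matching names have been collected.
    with os.scandir(directory) as entries:
        names = (entry.name for entry in entries)
        matching = (n for n in names if fnmatch.fnmatch(n, pattern))
        # islice() pulls at most `limit` items from the lazy pipeline.
        return list(itertools.islice(matching, limit))
```

Something like first_matches('.', '*.*', 100) would then return at most
100 names and stop scanning once the limit is reached. Which files come
first depends on the order the filesystem reports them in.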
mark