[Python-Dev] requirements for moving __import__ over to importlib?

PJ Eby pje at telecommunity.com
Sat Feb 11 01:23:07 CET 2012


On Feb 10, 2012 3:38 PM, "Brett Cannon" <brett at python.org> wrote:
> On Fri, Feb 10, 2012 at 15:07, PJ Eby <pje at telecommunity.com> wrote:
>> On Fri, Feb 10, 2012 at 1:05 PM, Brett Cannon <brett at python.org> wrote:
>>> First is that if this were used on Windows or OS X (i.e. the OSs we
support that typically have case-insensitive filesystems), then this
approach would be a massive gain as we already call os.listdir() when
PYTHONCASEOK isn't defined to check case-sensitivity; take your 5 stat
calls and add in 5 listdir() calls and that's what you get on Windows and
OS X right now. Linux doesn't have this check so you would still be
potentially paying a penalty there.
>>
>>
>> Wow.  That means it'd always be a win for pre-stdlib sys.path entries,
because any successful stdlib import equals a failed pre-stdlib lookup.
 (Of course, that's just saving some of the overhead that's been *added* by
importlib, not a new gain, but still...)
>
>
> How so? import.c does a listdir() as well (this is not special to
importlib).

IIRC, it does a FindFirstFile on Windows, which is not the same thing.
That's one system call into a preallocated buffer, not a series of system
calls and creation of Python string objects.

> Don't care about automatic reloaders. I'm just asking about the case
where the mtime granularity is coarse enough to allow for a directory
change, an import to execute, and then another directory change to occur
all within a single mtime increment. That would leave the cached set out of
date.

Ah.  Good point.  Well, if there's any way to know what the mtime
granularity is, we can avoid the race condition by never performing the
listdir when the current clock time is too close to the stat()'s mtime.  In effect,
we can bypass the optimization if the directory was just modified.

Something like:

    mtime = stat(dir).st_mtime
    if abs(time.time()-mtime)>unsafe_window:
         old_mtime, files = cache.get(dir, (-1, ()))
         if mtime!=old_mtime:
              files = frozenset(listdir(dir))
              cache[dir] = mtime, files
         # code to check for possibility of importing
         # and shortcut if found, or
         # exit with failure if no matching files

    # fallthrough to direct filesystem checking

The "unsafe window" is presumably filesystem and platform dependent, but
ISTR that even FAT filesystems have 2-second accuracy.  The other catch is
the relationship between st_mtime and time.time(); I assume they'd be the
same in any sane system, but what if you're working across a network and
there's clock skew?  Ugh.

Worst case example would be say, accessing a FAT device that's been shared
over a Windows network from a machine whose clock is several hours off.  So
it always looks safe to read, even if it's just been changed.

What's the downside in that case?  You're trying to import something that
just changed in the last fraction of a second...  why?

I mean, sure, the directory listing will be wrong, no question.  But it
only matters that it was wrong if you added, removed, or renamed importable
files.  Why are you trying to import one of them?

Ah, here's a use case: you're starting up IDLE, and while it's loading, you
save some .py files you plan to import later.  Your editor saves them all
at once, but IDLE does the listdir() midway through.  You then do an import
from the IDLE prompt, and it fails because the listdir() didn't catch
everything.

Okay, now I know how to fix this.  The problem isn't that there's a race
condition per se, the problem is that the race results in a broken cache
later.  After all, it could just as easily have been the case that the
import failed due to timing.  The problem is that all *future* imports
would fail in this circumstance.

So the fix is a time-to-live recheck: if TTL seconds have passed since the
last use of the cached frozenset, reload it, and reset the TTL to infinity.

In other words:

    mtime = stat(dir).st_mtime
    now = time.time()
    if abs(now-mtime)>unsafe_window:
         old_mtime, then, files = cache.get(dir, (-1, now, ()))
         if mtime!=old_mtime or (then is not None and now-then>TTL):
              files = frozenset(listdir(dir))
              cache[dir] = mtime, now if mtime!=old_mtime else None, files
         # code to check for possibility of importing
         # and shortcut if found, or
         # exit with failure if no matching files

    # fallthrough to direct filesystem checking

What this does (or should do) is handle the stale caches caused by the
clock-skew race condition, by reloading the listdir even if mtime hasn't
changed, as soon as TTL seconds have passed since the last snapshot was
taken.  However, if the
mtime stays the same, no subsequent listdirs will occur.  As long as the
TTL is set high enough that a full startup of Python can occur, but low
enough that it resets by the time a human can notice something's wrong, it
should be golden.  ;-)
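To make the sketch above concrete, here's a runnable rendering of the same
logic.  The tunable values are placeholders (the thread deliberately leaves
them open), and the function, cache, and constant names are mine, not
anything from importlib:

```python
import os
import time

# Tunables are hypothetical; the thread leaves their values open.
UNSAFE_WINDOW = 2.0   # worst-case mtime granularity (FAT is 2 seconds)
TTL = 60.0            # one extra re-listing, this long after a change

_cache = {}  # dir -> (mtime, snapshot_time_or_None, frozenset of names)

def cached_listing(path):
    """Return a frozenset of names in *path*, or None when the cache
    can't be trusted (the directory was modified too recently)."""
    mtime = os.stat(path).st_mtime
    now = time.time()
    if abs(now - mtime) <= UNSAFE_WINDOW:
        return None  # just modified: caller falls through to direct stat()s
    old_mtime, then, files = _cache.get(path, (-1, None, frozenset()))
    if mtime != old_mtime or (then is not None and now - then > TTL):
        files = frozenset(os.listdir(path))
        # Record the snapshot time after a change; once we re-list with
        # an unchanged mtime, store None ("TTL reset to infinity").
        _cache[path] = (mtime, now if mtime != old_mtime else None, files)
    return files
```

A caller would treat a None return as "do the normal per-file stat() dance",
and a frozenset return as authoritative for the shortcut check.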

The TTL approach could be used in place of the unsafe_window, actually;
there's probably not much need for both.   The pure unsafe_window approach
has the advantage of elegance: it slows down only when you've just written
to the directory, and only briefly.   It doesn't load the directory twice,
either.

I suppose ideally, we'd set unsafe_window fairly low, and TTL fairly high,
so that for command-line apps and such you'd be done with your entire script
at least all its importing) before reaching the TTL value.  But interactive
apps and servers wouldn't end up with a permanently stale cache just
because something was changed during startup.

Feh.  Screw it, just use a fairly high TTL and forget trying to tune the
unsafe_window, since if you're using a TTL you have to do the listdir() a
second time if there are any imports later.  It's also a single tunable
parameter at that point.

How high a TTL?  It's got to be at least as high as the worst-case mtime
granularity...  which is how high?  Yet, low enough that the human who
goes, "huh, that import should've worked", checks the directory listing and
tries it again will have it go through.  Hopefully, the worst-case mtime
granularity is shorter than that.  ;-)

>> Yep.  I was actually thinking this could be backported to 2.x, even
without importlib, as a module to be imported in sitecustomize or via a
.pth file.  All it needs is a path hook, after all, and a subclass of the
pkgutil importer to test it.  And if we can get some people with huge NFS
libraries and/or zillions of .egg directories on sys.path to test it, we
could find out whether it's a win, lose, or draw for those scenarios.
>
>
> You can do that if you want, obviously I don't want to bother since it
won't make it into Python 2.7.

Of course.  My thought wasn't to do this with a full version of importlib,
just to make a proof-of-concept importer.  All it really needs to do is
skip over the normal import processing in the case where the cache says
there's no way for the import to succeed; that's where all the speedup
really comes from.

