Re: [Python-Dev] requirements for moving __import__ over to importlib?
On Feb 9, 2012 9:58 AM, "Brett Cannon" <brett@python.org> wrote:
This actually depends on the type of ImportError. My current solution actually would trigger an ImportError at the import statement if no finder could locate the module. But if some ImportError was raised because of some other issue during load then that would come up at first use.
That's not really a lazy import then, or at least not as lazy as what Mercurial or PEAK use for general lazy importing. If you have a lot of them, that module-finding time really adds up. Again, the goal is fast startup of command-line tools that only use a small subset of the overall framework; doing disk access for lazy imports goes against that goal.
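To make the distinction concrete, here is a minimal sketch of the "fully lazy" style PJ Eby describes, where nothing touches the disk until first use. It is not the Mercurial or PEAK implementation; the `LazyModule` name is hypothetical, and it relies only on `importlib.import_module`. Because no finder runs when the proxy is created, an ImportError for a missing module surfaces at first attribute access rather than at the import site.

```python
import importlib


class LazyModule:
    """Hypothetical proxy that defers the real import until first use."""

    def __init__(self, name):
        # No finder is consulted here, so creating the proxy costs no
        # stat calls or other disk access.
        self._name = name
        self._module = None

    def _load(self):
        if self._module is None:
            # The actual search and load happen here, on first use.
            self._module = importlib.import_module(self._name)
        return self._module

    def __getattr__(self, attr):
        # Only reached for attributes not set in __init__, i.e. anything
        # that should be looked up on the real module.
        return getattr(self._load(), attr)


json = LazyModule("json")          # nothing imported yet
text = json.dumps({"lazy": True})  # the import happens on this line
```

The tradeoff is the one under discussion in the thread: a misspelled or missing module is only reported when the proxy is first used.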
On Thu, Feb 9, 2012 at 13:43, PJ Eby <pje@telecommunity.com> wrote:
On Feb 9, 2012 9:58 AM, "Brett Cannon" <brett@python.org> wrote:
This actually depends on the type of ImportError. My current solution actually would trigger an ImportError at the import statement if no finder could locate the module. But if some ImportError was raised because of some other issue during load then that would come up at first use.
That's not really a lazy import then, or at least not as lazy as what Mercurial or PEAK use for general lazy importing. If you have a lot of them, that module-finding time really adds up.
Again, the goal is fast startup of command-line tools that only use a small subset of the overall framework; doing disk access for lazy imports goes against that goal.
Depends if you consider stat calls the overhead vs. the actual disk read/write to load the data. Anyway, this is going to lead down to a discussion/argument over design parameters which I'm not up to having since I'm not actively working on a lazy loader for the stdlib right now.
On Thu, 9 Feb 2012 14:19:59 -0500 Brett Cannon <brett@python.org> wrote:
On Thu, Feb 9, 2012 at 13:43, PJ Eby <pje@telecommunity.com> wrote:
Again, the goal is fast startup of command-line tools that only use a small subset of the overall framework; doing disk access for lazy imports goes against that goal.
Depends if you consider stat calls the overhead vs. the actual disk read/write to load the data. Anyway, this is going to lead down to a discussion/argument over design parameters which I'm not up to having since I'm not actively working on a lazy loader for the stdlib right now.
For those of you not watching -ideas, or ignoring the "Python TIOBE -3%" discussion, this would seem to be relevant to any discussion of reworking the import mechanism: http://mail.scipy.org/pipermail/numpy-discussion/2012-January/059801.html
<mike
-- 
Mike Meyer <mwm@mired.org> http://www.mired.org/
Independent Software developer/SCM consultant, email for more information.
O< ascii ribbon campaign - stop html mail - www.asciiribbon.org
On 2/9/2012 11:53 AM, Mike Meyer wrote:
On Thu, 9 Feb 2012 14:19:59 -0500 Brett Cannon<brett@python.org> wrote:
On Thu, Feb 9, 2012 at 13:43, PJ Eby<pje@telecommunity.com> wrote:
Again, the goal is fast startup of command-line tools that only use a small subset of the overall framework; doing disk access for lazy imports goes against that goal.
Depends if you consider stat calls the overhead vs. the actual disk read/write to load the data. Anyway, this is going to lead down to a discussion/argument over design parameters which I'm not up to having since I'm not actively working on a lazy loader for the stdlib right now. For those of you not watching -ideas, or ignoring the "Python TIOBE -3%" discussion, this would seem to be relevant to any discussion of reworking the import mechanism:
http://mail.scipy.org/pipermail/numpy-discussion/2012-January/059801.html
<mike
So what is the implication here? That building a cache of module locations (cleared when a new module is installed) would be more effective than optimizing the search for modules on every invocation of Python?
On 2/9/2012 3:27 PM, Glenn Linderman wrote:
On 2/9/2012 11:53 AM, Mike Meyer wrote:
On Thu, 9 Feb 2012 14:19:59 -0500 Brett Cannon<brett@python.org> wrote:
On Thu, Feb 9, 2012 at 13:43, PJ Eby<pje@telecommunity.com> wrote:
Again, the goal is fast startup of command-line tools that only use a small subset of the overall framework; doing disk access for lazy imports goes against that goal.
Depends if you consider stat calls the overhead vs. the actual disk read/write to load the data. Anyway, this is going to lead down to a discussion/argument over design parameters which I'm not up to having since I'm not actively working on a lazy loader for the stdlib right now. For those of you not watching -ideas, or ignoring the "Python TIOBE -3%" discussion, this would seem to be relevant to any discussion of reworking the import mechanism:
http://mail.scipy.org/pipermail/numpy-discussion/2012-January/059801.html
"For 32k processes on BlueGene/P, importing 100 trivial C-extension modules takes 5.5 hours, compared to 35 minutes for all other interpreter loading and initialization. We developed a simple pure-Python module (based on knee.py, a hierarchical import example) that cuts the import time from 5.5 hours to 6 minutes."
So what is the implication here? That building a cache of module locations (cleared when a new module is installed) would be more effective than optimizing the search for modules on every invocation of Python?
-- Terry Jan Reedy
On Thu, Feb 9, 2012 at 2:53 PM, Mike Meyer <mwm@mired.org> wrote:
For those of you not watching -ideas, or ignoring the "Python TIOBE -3%" discussion, this would seem to be relevant to any discussion of reworking the import mechanism:
http://mail.scipy.org/pipermail/numpy-discussion/2012-January/059801.html
Interesting. This gives me an idea for a way to cut stat calls per sys.path entry per import by roughly 4x, at the cost of a one-time directory read per sys.path entry.
That is, an importer created for a particular directory could, upon first use, cache a frozenset(listdir()), and the stat().st_mtime of the directory. All the filename checks could then be performed against the frozenset, and the st_mtime of the directory only checked once per import, to verify whether the frozenset() needed refreshing.
Since a failed module lookup takes at least 5 stat checks (pyc, pyo, py, directory, and compiled extension (pyd/so)), this cuts it down to only 1, at the price of a listdir(). The big question is how long does a listdir() take, compared to a stat() or failed open()? That would tell us whether the tradeoff is worth making.
I did some crude timeit tests on frozenset(listdir()) and trapping failed stat calls. It looks like, for a Windows directory the size of the 2.7 stdlib, you need about four *failed* import attempts to overcome the initial caching cost, or about 8 successful bytecode imports. (For Linux, you might need to double these numbers; my tests showed a different ratio there, perhaps due to the Linux stdlib I tested having nearly twice as many directory entries as the directory I tested on Windows!)
However, the numbers are much better for application directories than for the stdlib, since they are located earlier on sys.path. Every successful stdlib import in an application is equal to one failed import attempt for every preceding directory on sys.path, so as long as the average directory on sys.path isn't vastly larger than the stdlib, and the average application imports at least four modules from the stdlib (on Windows, or 8 on Linux), there would be a net performance gain for the application as a whole. (That is, there'd be an improved per-sys.path entry import time for stdlib modules, even if not for any application modules.)
For smaller directories, the tradeoff actually gets better. A directory one seventh the size of the 2.7 Windows stdlib has a listdir() that's proportionately faster, but failed stats() in that directory are *not* proportionately faster; they're only somewhat faster. This means that it takes fewer failed module lookups to make caching a win - about 2 in this case, vs. 4 for the stdlib.
Now, these numbers are with actual disk or network access abstracted away, because the data's in the operating system cache when I run the tests. It's possible that this strategy could backfire if you used, say, an NFS directory with ten thousand files in it as your first sys.path entry. Without knowing the timings for listdir/stat/failed stat in that setup, it's hard to say how many stdlib imports you need before you come out ahead. When I tried a directory about 7 times larger than the stdlib, creating the frozenset took 10 times as long, but the cost of a failed stat didn't go up by very much.
This suggests that there's probably an optimal directory size cutoff for this trick; if only there were some way to check the size of a directory without reading it, we could turn off the caching for oversize directories, and get a major speed boost for everything else. On most platforms, the stat().st_size of the directory itself will give you some idea, but on Windows that's always zero. On Windows, we could work around that by using a lower-level API than listdir() and simply stop reading the directory if we hit the maximum number of entries we're willing to build a cache for, and then call it off.
(Another possibility would be to explicitly enable caching by putting a flag file in the directory, or perhaps by putting a special prefix on the sys.path entry, setting the cutoff in an environment variable, etc.)
In any case, this seems really worth a closer look: in non-pathological cases, it could make directory-based importing as fast as zip imports are. I'd be especially interested in knowing how the listdir/stat/failed stat ratios work on NFS - ISTM that they might be even *more* conducive to this approach, if setup latency dominates the cost of individual system calls.
If this works out, it'd be a good example of why importlib is a good idea; i.e., allowing us to play with ideas like this. Brett, wouldn't you love to be able to say importlib is *faster* than the old C-based importing? ;-)
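The caching idea in this message can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the thread: the `DirectoryCache` class, its `candidates` method, and the suffix list are all hypothetical names chosen for the example. It keeps a frozenset(listdir()) plus the directory's st_mtime, and spends one stat() per lookup to decide whether the set needs rebuilding.

```python
import os


class DirectoryCache:
    """Hypothetical per-directory cache: frozenset(listdir()) plus st_mtime."""

    def __init__(self, path):
        self.path = path
        self._mtime = None          # forces a listdir() on first use
        self._names = frozenset()

    def _refresh_if_stale(self):
        # One stat() per lookup replaces several failed stat() calls.
        # The stat is taken *before* the listdir so that a change landing
        # in between is picked up by the next mtime comparison.
        mtime = os.stat(self.path).st_mtime
        if mtime != self._mtime:
            self._names = frozenset(os.listdir(self.path))
            self._mtime = mtime

    def __contains__(self, filename):
        self._refresh_if_stale()
        return filename in self._names

    def candidates(self, modname, suffixes=(".py", ".pyc", ".so", ".pyd")):
        """Return the files (or package directory) that could supply modname."""
        self._refresh_if_stale()
        found = [modname + s for s in suffixes if modname + s in self._names]
        if modname in self._names:
            found.append(modname)   # possibly a package directory
        return found
```

A failed lookup then costs one stat() and a few set membership tests. Whether that beats the uncached path depends on how expensive the initial listdir() is for the directory in question, which is the open question raised above.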
On Thu, 9 Feb 2012 17:00:04 -0500 PJ Eby <pje@telecommunity.com> wrote:
On Thu, Feb 9, 2012 at 2:53 PM, Mike Meyer <mwm@mired.org> wrote:
For those of you not watching -ideas, or ignoring the "Python TIOBE -3%" discussion, this would seem to be relevant to any discussion of reworking the import mechanism:
http://mail.scipy.org/pipermail/numpy-discussion/2012-January/059801.html
Interesting. This gives me an idea for a way to cut stat calls per sys.path entry per import by roughly 4x, at the cost of a one-time directory read per sys.path entry.
Why do you even think this is a problem with "stat calls"?
On 2/9/12 10:15 PM, Antoine Pitrou wrote:
On Thu, 9 Feb 2012 17:00:04 -0500 PJ Eby<pje@telecommunity.com> wrote:
On Thu, Feb 9, 2012 at 2:53 PM, Mike Meyer<mwm@mired.org> wrote:
For those of you not watching -ideas, or ignoring the "Python TIOBE -3%" discussion, this would seem to be relevant to any discussion of reworking the import mechanism:
http://mail.scipy.org/pipermail/numpy-discussion/2012-January/059801.html
Interesting. This gives me an idea for a way to cut stat calls per sys.path entry per import by roughly 4x, at the cost of a one-time directory read per sys.path entry.
Why do you even think this is a problem with "stat calls"?
All he said is that reading about that problem and its solution gave him an idea about dealing with stat call overhead. The cost of stat calls has demonstrated itself to be a significant problem in other, more typical contexts.
-- 
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
On Thu, Feb 9, 2012 at 5:34 PM, Robert Kern <robert.kern@gmail.com> wrote:
On 2/9/12 10:15 PM, Antoine Pitrou wrote:
On Thu, 9 Feb 2012 17:00:04 -0500 PJ Eby<pje@telecommunity.com> wrote:
On Thu, Feb 9, 2012 at 2:53 PM, Mike Meyer<mwm@mired.org> wrote:
For those of you not watching -ideas, or ignoring the "Python TIOBE -3%" discussion, this would seem to be relevant to any discussion of reworking the import mechanism:
http://mail.scipy.org/pipermail/numpy-discussion/2012-January/059801.html
Interesting. This gives me an idea for a way to cut stat calls per sys.path entry per import by roughly 4x, at the cost of a one-time directory read per sys.path entry.
Why do you even think this is a problem with "stat calls"?
All he said is that reading about that problem and its solution gave him an idea about dealing with stat call overhead. The cost of stat calls has demonstrated itself to be a significant problem in other, more typical contexts.
Right. It was the part of the post that mentioned that all they sped up was knowing which directory the files were in, not the actual loading of bytecode. The thought then occurred to me that this could perhaps be applied to normal importing, as a zipimport-style speedup. (The zipimport module caches each zipfile directory it finds on sys.path, so failed import lookups are extremely fast.) It occurs to me, too, that applying the caching trick to *only* the stdlib directories would still be a win as soon as you have between four and eight site-packages (or user specific site-packages) imports in an application, so it might be worth applying unconditionally to system-defined stdlib (non-site) directories.
On 2/9/2012 7:19 PM, PJ Eby wrote:
Right. It was the part of the post that mentioned that all they sped up was knowing which directory the files were in, not the actual loading of bytecode. The thought then occurred to me that this could perhaps be applied to normal importing, as a zipimport-style speedup. (The zipimport module caches each zipfile directory it finds on sys.path, so failed import lookups are extremely fast.)
It occurs to me, too, that applying the caching trick to *only* the stdlib directories would still be a win as soon as you have between four and eight site-packages (or user specific site-packages) imports in an application, so it might be worth applying unconditionally to system-defined stdlib (non-site) directories.
It might be worthwhile to store a single file in the directory that contains /Lib with the info import needs to get files in /Lib and its subdirs, and check that it is not outdated relative to /Lib. Since in Python 3, .pyc files go in __pycache__, if /Lib included an empty __pycache__ on installation, /Lib would never be touched on most installations. Ditto for the non-__pycache__ subdirs.
-- 
Terry Jan Reedy
On Thu, Feb 9, 2012 at 17:00, PJ Eby <pje@telecommunity.com> wrote:
On Thu, Feb 9, 2012 at 2:53 PM, Mike Meyer <mwm@mired.org> wrote:
For those of you not watching -ideas, or ignoring the "Python TIOBE -3%" discussion, this would seem to be relevant to any discussion of reworking the import mechanism:
http://mail.scipy.org/pipermail/numpy-discussion/2012-January/059801.html
Interesting. This gives me an idea for a way to cut stat calls per sys.path entry per import by roughly 4x, at the cost of a one-time directory read per sys.path entry.
That is, an importer created for a particular directory could, upon first use, cache a frozenset(listdir()), and the stat().st_mtime of the directory. All the filename checks could then be performed against the frozenset, and the st_mtime of the directory only checked once per import, to verify whether the frozenset() needed refreshing.
I actually contemplated this back in 2006 when I first began importlib for use at Google to get around NFS's crappy stat performance. Never got around to it as compatibility with import.c turned out to be a little tricky. =) Your solution below, PJE, is more-or-less what I was considering (although I also considered variants that didn't stat the directory when you knew your code wasn't changing stuff behind your back).
Since a failed module lookup takes at least 5 stat checks (pyc, pyo, py, directory, and compiled extension (pyd/so)), this cuts it down to only 1, at the price of a listdir(). The big question is how long does a listdir() take, compared to a stat() or failed open()? That would tell us whether the tradeoff is worth making.
Actually it's pyc OR pyo, py, directory (which can lead to another set for __init__.py and __pycache__), .so, module.so (or whatever your platform uses for extensions).
I did some crude timeit tests on frozenset(listdir()) and trapping failed stat calls. It looks like, for a Windows directory the size of the 2.7 stdlib, you need about four *failed* import attempts to overcome the initial caching cost, or about 8 successful bytecode imports. (For Linux, you might need to double these numbers; my tests showed a different ratio there, perhaps due to the Linux stdlib I tested having nearly twice as many directory entries as the directory I tested on Windows!)
However, the numbers are much better for application directories than for the stdlib, since they are located earlier on sys.path. Every successful stdlib import in an application is equal to one failed import attempt for every preceding directory on sys.path, so as long as the average directory on sys.path isn't vastly larger than the stdlib, and the average application imports at least four modules from the stdlib (on Windows, or 8 on Linux), there would be a net performance gain for the application as a whole. (That is, there'd be an improved per-sys.path entry import time for stdlib modules, even if not for any application modules.)
Does this comment take into account the number of modules required to load the interpreter to begin with? That's already like 48 modules loaded by Python 3.2 as it is.
For smaller directories, the tradeoff actually gets better. A directory one seventh the size of the 2.7 Windows stdlib has a listdir() that's proportionately faster, but failed stats() in that directory are *not* proportionately faster; they're only somewhat faster. This means that it takes fewer failed module lookups to make caching a win - about 2 in this case, vs. 4 for the stdlib.
Now, these numbers are with actual disk or network access abstracted away, because the data's in the operating system cache when I run the tests. It's possible that this strategy could backfire if you used, say, an NFS directory with ten thousand files in it as your first sys.path entry. Without knowing the timings for listdir/stat/failed stat in that setup, it's hard to say how many stdlib imports you need before you come out ahead. When I tried a directory about 7 times larger than the stdlib, creating the frozenset took 10 times as long, but the cost of a failed stat didn't go up by very much.
This suggests that there's probably an optimal directory size cutoff for this trick; if only there were some way to check the size of a directory without reading it, we could turn off the caching for oversize directories, and get a major speed boost for everything else. On most platforms, the stat().st_size of the directory itself will give you some idea, but on Windows that's always zero. On Windows, we could work around that by using a lower-level API than listdir() and simply stop reading the directory if we hit the maximum number of entries we're willing to build a cache for, and then call it off.
(Another possibility would be to explicitly enable caching by putting a flag file in the directory, or perhaps by putting a special prefix on the sys.path entry, setting the cutoff in an environment variable, etc.)
In any case, this seems really worth a closer look: in non-pathological cases, it could make directory-based importing as fast as zip imports are. I'd be especially interested in knowing how the listdir/stat/failed stat ratios work on NFS - ISTM that they might be even *more* conducive to this approach, if setup latency dominates the cost of individual system calls.
If this works out, it'd be a good example of why importlib is a good idea; i.e., allowing us to play with ideas like this. Brett, wouldn't you love to be able to say importlib is *faster* than the old C-based importing? ;-)
Yes, that would be nice. =) Now there are a couple things to clarify/question here.
First is that if this were used on Windows or OS X (i.e. the OSs we support that typically have case-insensitive filesystems), then this approach would be a massive gain as we already call os.listdir() when PYTHONCASEOK isn't defined to check case-sensitivity; take your 5 stat calls and add in 5 listdir() calls and that's what you get on Windows and OS X right now. Linux doesn't have this check so you would still be potentially paying a penalty there.
Second is variance in filesystems. Are we guaranteed that the stat of a directory is updated before a file change is made? Else there is a small race condition there which would suck. We also have the issue of granularity; Antoine has already had to add the source file size to .pyc files in Python 3.3 to combat crappy mtime granularity when generating bytecode. If we get file mod -> import -> file mod -> import, are we guaranteed that the second import will know there was a modification if the first three steps occur fast enough to fit within the granularity of an mtime value?
I was going to say something about __pycache__, but it actually doesn't affect this. Since you would have to stat the directory anyway, you might as well just stat the directory for the file you want, to keep it simple. Only if you consider __pycache__ to be immutable except for what the interpreter puts in that directory during execution could you optimize that step (in which case you can stat the directory once and never care again as the set would be just updated by import whenever a new .pyc file was written).
Having said all of this, implementing this idea would be trivial using importlib if you don't try to optimize the __pycache__ case. It's just a question of whether people are comfortable with the semantic change to import. This could also be made into something that was in importlib for people to use when desired if we are too worried about semantic changes.
On Fri, Feb 10, 2012 at 1:05 PM, Brett Cannon <brett@python.org> wrote:
On Thu, Feb 9, 2012 at 17:00, PJ Eby <pje@telecommunity.com> wrote:
I did some crude timeit tests on frozenset(listdir()) and trapping failed stat calls. It looks like, for a Windows directory the size of the 2.7 stdlib, you need about four *failed* import attempts to overcome the initial caching cost, or about 8 successful bytecode imports. (For Linux, you might need to double these numbers; my tests showed a different ratio there, perhaps due to the Linux stdlib I tested having nearly twice as many directory entries as the directory I tested on Windows!)
However, the numbers are much better for application directories than for the stdlib, since they are located earlier on sys.path. Every successful stdlib import in an application is equal to one failed import attempt for every preceding directory on sys.path, so as long as the average directory on sys.path isn't vastly larger than the stdlib, and the average application imports at least four modules from the stdlib (on Windows, or 8 on Linux), there would be a net performance gain for the application as a whole. (That is, there'd be an improved per-sys.path entry import time for stdlib modules, even if not for any application modules.)
Does this comment take into account the number of modules required to load the interpreter to begin with? That's already like 48 modules loaded by Python 3.2 as it is.
I didn't count those, no. So, if they're loaded from disk *after* importlib is initialized, then they should pay off the cost of caching even fairly large directories that appear earlier on sys.path than the stdlib. We still need to know about NFS and other ratios, though... I still worry that people with more extreme directory sizes or slow-access situations will run into even worse trouble than they have now.
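For anyone who wants to collect the NFS (or other) ratios mentioned here, a crude harness along the following lines might do. It is a sketch, not the script used for the numbers quoted in the thread: the function names are hypothetical, the default of 5 stats per failed lookup is taken from the estimate earlier in the thread, and the cost of the single directory stat in the cached path is approximated by the cost of one failed stat.

```python
import os
import timeit


def time_listdir_cache(path, repeat=100):
    """Average seconds to build frozenset(os.listdir(path))."""
    return timeit.timeit(lambda: frozenset(os.listdir(path)),
                         number=repeat) / repeat


def time_failed_stat(path, repeat=100):
    """Average seconds for a stat() on a file that does not exist in path."""
    missing = os.path.join(path, "no_such_module_for_timing.py")

    def attempt():
        try:
            os.stat(missing)
        except OSError:
            pass

    return timeit.timeit(attempt, number=repeat) / repeat


def break_even_failed_lookups(path, stats_per_lookup=5, repeat=100):
    """Estimate how many failed lookups repay one cache build.

    Uncached, a failed lookup costs roughly stats_per_lookup failed stats;
    cached, it costs about one stat (the directory mtime check).
    """
    build_cost = time_listdir_cache(path, repeat)
    saved_per_lookup = (stats_per_lookup - 1) * time_failed_stat(path, repeat)
    return build_cost / saved_per_lookup
```

Running `break_even_failed_lookups(p)` for each directory `p` on sys.path gives a rough per-directory figure. As noted in the thread, results taken with a warm operating system cache say little about cold-disk or network behaviour.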
First is that if this were used on Windows or OS X (i.e. the OSs we support that typically have case-insensitive filesystems), then this approach would be a massive gain as we already call os.listdir() when PYTHONCASEOK isn't defined to check case-sensitivity; take your 5 stat calls and add in 5 listdir() calls and that's what you get on Windows and OS X right now. Linux doesn't have this check so you would still be potentially paying a penalty there.
Wow. That means it'd always be a win for pre-stdlib sys.path entries, because any successful stdlib import equals a failed pre-stdlib lookup. (Of course, that's just saving some of the overhead that's been *added* by importlib, not a new gain, but still...)
Second is variance in filesystems. Are we guaranteed that the stat of a directory is updated before a file change is made?
Not quite sure what you mean here. The directory stat is used to ensure that new files haven't been added, old ones removed, or existing ones renamed. Changes to the files themselves shouldn't factor in, should they?
Else there is a small race condition there which would suck. We also have the issue of granularity; Antoine has already had to add the source file size to .pyc files in Python 3.3 to combat crappy mtime granularity when generating bytecode. If we get file mod -> import -> file mod -> import, are we guaranteed that the second import will know there was a modification if the first three steps occur fast enough to fit within the granularity of an mtime value?
Again, I'm not sure how this relates. Automatic code reloaders monitor individual files that have been previously imported, so the directory timestamps aren't relevant.
Of course, I could be confused here. Are you saying that if somebody makes a new .py file and saves it, that it'll be possible to import it before it's finished being written? If so, that could happen already, and again caching the directory doesn't make any difference.
Alternately, you could have a situation where the file is deleted after we load the listdir(), but in that case the open will fail and we can fall back... heck, we can even force resetting the cache in that event.
I was going to say something about __pycache__, but it actually doesn't affect this. Since you would have to stat the directory anyway, you might as well just stat directory for the file you want to keep it simple. Only if you consider __pycache__ to be immutable except for what the interpreter puts in that directory during execution could you optimize that step (in which case you can stat the directory once and never care again as the set would be just updated by import whenever a new .pyc file was written).
Having said all of this, implementing this idea would be trivial using importlib if you don't try to optimize the __pycache__ case. It's just a question of whether people are comfortable with the semantic change to import. This could also be made into something that was in importlib for people to use when desired if we are too worried about semantic changes.
Yep. I was actually thinking this could be backported to 2.x, even without importlib, as a module to be imported in sitecustomize or via a .pth file. All it needs is a path hook, after all, and a subclass of the pkgutil importer to test it. And if we can get some people with huge NFS libraries and/or zillions of .egg directories on sys.path to test it, we could find out whether it's a win, lose, or draw for those scenarios.
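As a rough illustration of the path-hook approach, here is a sketch written against the Python 3 importlib API rather than the 2.x pkgutil importer mentioned above. The `ListingFinder` class is hypothetical: it keeps a cached set of name stems for one directory and only delegates to the standard `importlib.machinery.FileFinder` when the requested name could be present. Note that FileFinder in Python 3.3 and later keeps a similar directory cache internally, so on modern versions the wrapper is purely illustrative.

```python
import importlib.machinery as machinery
import os

# Loader/suffix pairs for extension, source and bytecode files.
_LOADER_DETAILS = (
    (machinery.ExtensionFileLoader, machinery.EXTENSION_SUFFIXES),
    (machinery.SourceFileLoader, machinery.SOURCE_SUFFIXES),
    (machinery.SourcelessFileLoader, machinery.BYTECODE_SUFFIXES),
)


class ListingFinder:
    """Hypothetical path-entry finder with a cached directory listing."""

    def __init__(self, path):
        if not os.path.isdir(path):
            # Raising ImportError tells the import system to try the
            # next hook in sys.path_hooks for this path entry.
            raise ImportError("only directories are supported", path=path)
        self._path = path
        self._mtime = None
        self._stems = frozenset()
        self._inner = machinery.FileFinder(path, *_LOADER_DETAILS)

    def _refresh(self):
        mtime = os.stat(self._path).st_mtime
        if mtime != self._mtime:
            # "spam.py", "spam.pyc" and a "spam" package all map to "spam".
            self._stems = frozenset(
                name.partition(".")[0] for name in os.listdir(self._path))
            self._mtime = mtime

    def find_spec(self, fullname, target=None):
        self._refresh()
        if fullname.rpartition(".")[2] not in self._stems:
            return None                 # fast failure: one stat, no probing
        return self._inner.find_spec(fullname, target)

    def invalidate_caches(self):
        self._mtime = None
        self._inner.invalidate_caches()


# To try it process-wide (for example from sitecustomize), one could run:
#   import sys
#   sys.path_hooks.insert(0, ListingFinder)
#   sys.path_importer_cache.clear()
```

The class itself serves as the hook because calling it with a path entry either returns a finder or raises ImportError, which is the contract sys.path_hooks expects.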
On Fri, Feb 10, 2012 at 15:07, PJ Eby <pje@telecommunity.com> wrote:
On Fri, Feb 10, 2012 at 1:05 PM, Brett Cannon <brett@python.org> wrote:
On Thu, Feb 9, 2012 at 17:00, PJ Eby <pje@telecommunity.com> wrote:
I did some crude timeit tests on frozenset(listdir()) and trapping failed stat calls. It looks like, for a Windows directory the size of the 2.7 stdlib, you need about four *failed* import attempts to overcome the initial caching cost, or about 8 successful bytecode imports. (For Linux, you might need to double these numbers; my tests showed a different ratio there, perhaps due to the Linux stdlib I tested having nearly twice as many directory entries as the directory I tested on Windows!)
However, the numbers are much better for application directories than for the stdlib, since they are located earlier on sys.path. Every successful stdlib import in an application is equal to one failed import attempt for every preceding directory on sys.path, so as long as the average directory on sys.path isn't vastly larger than the stdlib, and the average application imports at least four modules from the stdlib (on Windows, or 8 on Linux), there would be a net performance gain for the application as a whole. (That is, there'd be an improved per-sys.path entry import time for stdlib modules, even if not for any application modules.)
Does this comment take into account the number of modules required to load the interpreter to begin with? That's already like 48 modules loaded by Python 3.2 as it is.
I didn't count those, no. So, if they're loaded from disk *after* importlib is initialized, then they should pay off the cost of caching even fairly large directories that appear earlier on sys.path than the stdlib. We still need to know about NFS and other ratios, though... I still worry that people with more extreme directory sizes or slow-access situations will run into even worse trouble than they have now.
It's possible. No way to make it work for everyone. This is why I didn't worry about some crazy perf optimization.
First is that if this were used on Windows or OS X (i.e. the OSs we support that typically have case-insensitive filesystems), then this approach would be a massive gain as we already call os.listdir() when PYTHONCASEOK isn't defined to check case-sensitivity; take your 5 stat calls and add in 5 listdir() calls and that's what you get on Windows and OS X right now. Linux doesn't have this check so you would still be potentially paying a penalty there.
Wow. That means it'd always be a win for pre-stdlib sys.path entries, because any successful stdlib import equals a failed pre-stdlib lookup. (Of course, that's just saving some of the overhead that's been *added* by importlib, not a new gain, but still...)
How so? import.c does a listdir() as well (this is not special to importlib).
Second is variance in filesystems. Are we guaranteed that the stat of a directory is updated before a file change is made?
Not quite sure what you mean here. The directory stat is used to ensure that new files haven't been added, old ones removed, or existing ones renamed. Changes to the files themselves shouldn't factor in, should they?
Changes in any fashion to the directory. Do filesystems atomically update the mtime of a directory when they commit a change? Otherwise we have a potential race condition.
Else there is a small race condition there which would suck. We also have the issue of granularity; Antoine has already had to add the source file size to .pyc files in Python 3.3 to combat crappy mtime granularity when generating bytecode. If we get file mod -> import -> file mod -> import, are we guaranteed that the second import will know there was a modification if the first three steps occur fast enough to fit within the granularity of an mtime value?
Again, I'm not sure how this relates. Automatic code reloaders monitor individual files that have been previously imported, so the directory timestamps aren't relevant.
Don't care about automatic reloaders. I'm just asking about the case where the mtime granularity is coarse enough to allow for a directory change, an import to execute, and then another directory change to occur all within a single mtime increment. That would lead to the set cache to be out of date.
Of course, I could be confused here. Are you saying that if somebody makes a new .py file and saves it, that it'll be possible to import it before it's finished being written? If so, that could happen already, and again caching the directory doesn't make any difference.
Alternately, you could have a situation where the file is deleted after we load the listdir(), but in that case the open will fail and we can fall back... heck, we can even force resetting the cache in that event.
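The caching scheme being discussed (a set built from listdir(), validated against the directory's mtime, and reset when an open fails) could be sketched roughly as follows. The class name `DirectoryCache` and its methods are hypothetical, chosen only for illustration; this is not code from import.c or importlib. Note that it does not solve the granularity problem raised above: two directory changes inside one mtime tick still leave the set stale until an open fails.

```python
import os


class DirectoryCache:
    """Hypothetical cache of a directory's entry names, keyed on its mtime."""

    def __init__(self, path):
        self.path = path
        self._mtime_ns = None          # None means "nothing cached yet"
        self._names = frozenset()

    def refresh(self):
        # Stat *before* listing: if the directory changes between the two
        # calls, the recorded mtime is older than the real one, so the next
        # comparison triggers another (harmless) refresh.
        mtime_ns = os.stat(self.path).st_mtime_ns
        self._names = frozenset(os.listdir(self.path))
        self._mtime_ns = mtime_ns

    def names(self):
        # One stat() per lookup; listdir() only when the mtime has moved.
        if os.stat(self.path).st_mtime_ns != self._mtime_ns:
            self.refresh()
        return self._names

    def open(self, name):
        """Open `name` if the cache says it exists; reset the cache on failure."""
        if name not in self.names():
            raise FileNotFoundError(name)
        try:
            return open(os.path.join(self.path, name), "rb")
        except FileNotFoundError:
            # The file vanished after the listing was taken: force a
            # re-listing on the next lookup, then let the caller fall back.
            self._mtime_ns = None
            raise
```

A failed lookup costs one stat() of the directory and a set membership test, which is the saving being claimed for sys.path entries that do not contain the module.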
I was going to say something about __pycache__, but it actually doesn't affect this. Since you would have to stat the directory anyway, you might as well just stat the directory for the file you want, to keep it simple. Only if you consider __pycache__ to be immutable except for what the interpreter puts in that directory during execution could you optimize that step (in which case you can stat the directory once and never care again, as the set would just be updated by import whenever a new .pyc file was written).
Having said all of this, implementing this idea would be trivial using importlib if you don't try to optimize the __pycache__ case. It's just a question of whether people are comfortable with the semantic change to import. This could also be made into something that was in importlib for people to use when desired if we are too worried about semantic changes.
Yep. I was actually thinking this could be backported to 2.x, even without importlib, as a module to be imported in sitecustomize or via a .pth file. All it needs is a path hook, after all, and a subclass of the pkgutil importer to test it. And if we can get some people with huge NFS libraries and/or zillions of .egg directories on sys.path to test it, we could find out whether it's a win, lose, or draw for those scenarios.
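As a rough illustration of the "all it needs is a path hook" idea, here is a minimal Python 3 sketch. It is an assumption-laden stand-in, not the 2.x pkgutil-based module described above: it delegates to `importlib.machinery.FileFinder`, handles only source files and package directories, and assumes a case-sensitive filesystem. The class name `ListdirCachingFinder` is hypothetical.

```python
import importlib.machinery
import os


class ListdirCachingFinder:
    """Hypothetical path-entry finder that rejects misses from a cached listdir()."""

    def __init__(self, path):
        # A path hook signals "not mine" by raising ImportError.
        if not os.path.isdir(path):
            raise ImportError("only directories are supported", path=path)
        self.path = path
        self._mtime_ns = None
        self._names = frozenset()
        # Simplification: source modules only (no extensions or bare .pyc).
        details = (importlib.machinery.SourceFileLoader,
                   importlib.machinery.SOURCE_SUFFIXES)
        self._delegate = importlib.machinery.FileFinder(path, details)

    def _listing(self):
        # Re-list the directory only when its mtime has changed.
        mtime_ns = os.stat(self.path).st_mtime_ns
        if mtime_ns != self._mtime_ns:
            self._names = frozenset(os.listdir(self.path))
            self._mtime_ns = mtime_ns
            self._delegate.invalidate_caches()
        return self._names

    def find_spec(self, fullname, target=None):
        tail = fullname.rpartition(".")[2]
        # Candidate entries: a package directory or a source file.
        wanted = {tail} | {tail + s for s in importlib.machinery.SOURCE_SUFFIXES}
        if self._listing().isdisjoint(wanted):
            return None  # miss answered from the cached set
        return self._delegate.find_spec(fullname, target)

    def invalidate_caches(self):
        self._mtime_ns = None
        self._delegate.invalidate_caches()
```

To try it, one would insert the class at the front of `sys.path_hooks` and clear `sys.path_importer_cache` (for example from sitecustomize), so that new path entries are handed to it first.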
You can do that if you want; obviously I don't want to bother, since it won't make it into Python 2.7.
On 02/10/2012 03:38 PM, Brett Cannon wrote:
Changes in any fashion to the directory. Do filesystems atomically update the mtime of a directory when they commit a change? Otherwise we have a potential race condition.
Hmm, maybe I misunderstand you. In POSIX land, the only thing which changes the mtime of a directory is linking / unlinking / renaming a file: changes to individual files aren't detectable by examining their containing directory's stat().
Tres.
--
Tres Seaver +1 540-429-0999 tseaver@palladion.com
Palladion Software "Excellence by Design" http://palladion.com
On Fri, Feb 10, 2012 at 16:29, Tres Seaver <tseaver@palladion.com> wrote:
On 02/10/2012 03:38 PM, Brett Cannon wrote:
Changes in any fashion to the directory. Do filesystems atomically update the mtime of a directory when they commit a change? Otherwise we have a potential race condition.
Hmm, maybe I misunderstand you. In POSIX land, the only thing which changes the mtime of a directory is linking / unlinking / renaming a file: changes to individual files aren't detectable by examining their containing directory's stat().
Individual file changes are not important; either the module is already in sys.modules, so no attempt is made to detect a change, or it hasn't been loaded and so it will have to be read regardless. All I'm asking is whether filesystems typically apply a change such as a file deletion atomically with the update to the containing directory's mtime.
On 02/10/2012 04:42 PM, Brett Cannon wrote:
On Fri, Feb 10, 2012 at 16:29, Tres Seaver <tseaver@palladion.com> wrote:
On 02/10/2012 03:38 PM, Brett Cannon wrote:
Changes in any fashion to the directory. Do filesystems atomically update the mtime of a directory when they commit a change? Otherwise we have a potential race condition.
Hmm, maybe I misunderstand you. In POSIX land, the only thing which changes the mtime of a directory is linking / unlinking / renaming a file: changes to individual files aren't detectable by examining their containing directory's stat().
Individual file changes are not important; either the module is already in sys.modules, so no attempt is made to detect a change, or it hasn't been loaded and so it will have to be read regardless. All I'm asking is whether filesystems typically apply a change such as a file deletion atomically with the update to the containing directory's mtime.
In POSIX land, most certainly.
Tres.
--
Tres Seaver +1 540-429-0999 tseaver@palladion.com
Palladion Software "Excellence by Design" http://palladion.com
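The distinction drawn in this exchange (rewriting a file's contents leaves the directory's mtime alone, while linking, unlinking, or renaming updates it) can be checked with a small experiment. This is only a sketch: the helper name `directory_mtime_around` is made up for illustration, and the results depend on the platform and filesystem. In particular, on a filesystem with coarse timestamps the unlink may land in the same mtime tick and appear unchanged, which is exactly the granularity concern raised earlier in the thread.

```python
import os
import tempfile


def directory_mtime_around(action, directory):
    """Return the directory's st_mtime_ns before and after calling action()."""
    before = os.stat(directory).st_mtime_ns
    action()
    after = os.stat(directory).st_mtime_ns
    return before, after


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "mod.py")
    with open(path, "w") as f:
        f.write("x = 1\n")

    def rewrite_contents():
        # Overwrites an existing file; no directory entry is added or removed.
        with open(path, "w") as f:
            f.write("x = 2\n")

    before, after = directory_mtime_around(rewrite_contents, d)
    contents_changed_dir = before != after

    # Unlinking removes a directory entry, so the directory's mtime is updated
    # (though possibly to the same value if timestamps are coarse).
    before, after = directory_mtime_around(lambda: os.remove(path), d)
    unlink_changed_dir = before != after

print("rewrite changed directory mtime:", contents_changed_dir)
print("unlink changed directory mtime:", unlink_changed_dir)
```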
participants (8): Antoine Pitrou, Brett Cannon, Glenn Linderman, Mike Meyer, PJ Eby, Robert Kern, Terry Reedy, Tres Seaver