Re: [Python-Dev] Import redesign [LONG]

[ taking the liberty to CC: this back to python-dev ]

On Fri, 19 Nov 1999, David Ascher wrote:

Not at all. I thought of this last night after my email. Since the Importer can easily retain state, it can hold a cache of the directory listings. If it doesn't find the file in its cached state, it can reload the information from disk. If it finds the file in the cache but not on disk, it can remove the item from its cache.

The problem occurs when your path is [A, B], the file is in B, and you add something to A on the fly. The cache might direct the importer at B, missing your file.

Of course, with the appropriate caveats/warnings, the system would work quite well. It really only breaks during development (which is one reason why I didn't accept some caching changes to imputil from MAL; but that was for the Importer in there; Python's new Importer could have a cache).

I'm also not quite sure what the cost of reading a directory is, compared to issuing a bunch of stat() calls. Each directory read is an opendir/readdir(s)/closedir. Note that the DBM approach is kind of similar, but will amortize this cost over many processes.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/
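A minimal sketch of the listing cache described above, to make the [A, B] caveat concrete. The class name `DirCache` and its methods are hypothetical, invented for illustration; this is not code from imputil or from the proposed Importer.

```python
import os


class DirCache:
    """Hypothetical helper: caches the listing of each directory on a path."""

    def __init__(self, path):
        self.path = list(path)
        self.listings = {}  # directory -> set of file names

    def _load(self, directory):
        # Read the directory from disk and remember its contents.
        try:
            names = set(os.listdir(directory))
        except OSError:
            names = set()
        self.listings[directory] = names
        return names

    def find(self, filename):
        """Return the first directory on the path holding filename, or None."""
        # First pass: trust the cache, but confirm a cached hit on disk.
        for directory in self.path:
            names = self.listings.get(directory)
            if names is None:
                names = self._load(directory)
            if filename in names:
                if os.path.exists(os.path.join(directory, filename)):
                    return directory
                names.discard(filename)  # cached, but gone from disk
        # Cache miss: reload every listing from disk and retry once.
        for directory in self.path:
            if filename in self._load(directory):
                return directory
        return None
```

With the path [A, B] and the file cached as living in B, a copy added to A afterwards is not noticed: `find()` keeps returning B until the copy in B disappears or some other lookup misses and forces a reload.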

[David Ascher got involuntarily forwarded]
I posted something here about dircache not too long ago. Essentially, I found it completely unreliable on NT and on Linux to stat the directory. There was some test code attached.

- Gordon

Gordon> I posted something here about dircache not too long ago.
Gordon> Essentially, I found it completely unreliable on NT and on Linux
Gordon> to stat the directory. There was some test code attached.

The modtime of the directory's stat info should only change if you add or delete entries in the directory. Were you perhaps expecting changes when other operations took place, like rewriting an existing file?

Skip Montanaro | http://www.mojam.com/
skip@mojam.com | http://www.musi-cal.com/
847-971-7098 | Python: Programming the way Guido indented...

Gordon wrote:

Gordon> I posted something here about dircache not too long ago.
Gordon> Essentially, I found it completely unreliable on NT and on Linux
Gordon> to stat the directory. There was some test code attached.

to which I replied:

Skip> The modtime of the directory's stat info should only change if you
Skip> add or delete entries in the directory. Were you perhaps
Skip> expecting changes when other operations took place, like rewriting
Skip> an existing file?

I took a couple minutes to write a simple script to check things. It created a file, changed its mode, then unlinked it. I was a bit surprised that deleting a file didn't appear to change the directory's mod time. Then I realized that since file times are only recorded with one-second precision, you might see no change to the directory's mtime in some circumstances. Adding a sleep to the script between directory operations resolved the apparent inconsistency.

Still, as Gordon stated, you probably can't count on directory modtimes to tell you when to invalidate the cache. It's consistent, just not reliable...

if-we-slow-import-down-enough-we-can-use-this-trick-though-ly y'rs,

Skip Montanaro | http://www.mojam.com/
skip@mojam.com | http://www.musi-cal.com/
847-971-7098 | Python: Programming the way Guido indented...
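The original test script was not posted in this message, so the following is only a rough reconstruction of the experiment as described: create a file, change its mode, unlink it, and record the directory's mtime after each step. The function name `dir_mtime_after_ops` and the `pause` parameter are assumptions made for illustration.

```python
import os
import tempfile
import time


def dir_mtime_after_ops(pause=0.0):
    """Create, chmod, and unlink a file; return the directory mtime after each step."""
    with tempfile.TemporaryDirectory() as directory:
        target = os.path.join(directory, "probe.txt")
        mtimes = [os.stat(directory).st_mtime]  # baseline, before any operation

        time.sleep(pause)
        with open(target, "w") as f:
            f.write("x")
        mtimes.append(os.stat(directory).st_mtime)  # after creating the file

        time.sleep(pause)
        os.chmod(target, 0o600)
        mtimes.append(os.stat(directory).st_mtime)  # after changing its mode

        time.sleep(pause)
        os.unlink(target)
        mtimes.append(os.stat(directory).st_mtime)  # after unlinking it

    return mtimes
```

Comparing `dir_mtime_after_ops(0.0)` with `dir_mtime_after_ops(1.1)` on a filesystem with coarse timestamps should show the effect described above: without the pause, consecutive directory operations can share one mtime value.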

Skip Montanaro wrote:
[dir stat cache times]
Or two, on Windows with older (FAT, as opposed to VFAT) file systems.
If the dir stat time is less than 2 seconds ago, flush - always. If the dir stat time says it hasn't been changed for at least 2 seconds then you can cache all entries and trust that any change is detected. In other words: take the *current* time into account, then it can work.

I think. Maybe. Until you get into network drives and clock skew...

-- Jean-Claude
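One way to read the rule above, sketched under assumptions: the class name `SettledDirCache`, the `GRANULARITY` constant, and the injectable `clock` parameter are all hypothetical, and the 2-second figure is simply the worst case mentioned in this thread. Network drives and clock skew are not handled.

```python
import os
import time

GRANULARITY = 2.0  # seconds; assumed worst-case mtime resolution (FAT)


class SettledDirCache:
    """Hypothetical cache that only trusts directories whose mtime has settled."""

    def __init__(self, granularity=GRANULARITY, clock=time.time):
        self.granularity = granularity
        self.clock = clock
        self.entries = {}  # directory -> (mtime, frozenset of names)

    def listing(self, directory):
        mtime = os.stat(directory).st_mtime  # stat first, then list
        if self.clock() - mtime < self.granularity:
            # Modified too recently: flush, and do not cache the fresh listing,
            # because a later change could land on the same timestamp.
            self.entries.pop(directory, None)
            return frozenset(os.listdir(directory))
        cached = self.entries.get(directory)
        if cached is not None and cached[0] == mtime:
            return cached[1]  # settled and unchanged: trust the cache
        names = frozenset(os.listdir(directory))
        self.entries[directory] = (mtime, names)
        return names
```

The point of comparing against the current time is that a listing is only stored once the directory's mtime is at least one granularity interval old, so any later change is guaranteed to produce a different mtime.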

Jean-Claude wrote:
Oh lordy, it gets worse. With a time.sleep(1.0) between new files, Linux detects the change in the dir's mtime immediately. Cool. On NT, I get an average 2.0 sec delay. But sometimes it doesn't detect the change within 100 secs (and my script quits). Then I added a stat of some file in the directory (not the file I added) before the stat of the directory. Now it acts just like Linux - no delay (on both FAT and NTFS partitions). OK...

Jean-Claude> I think. Maybe. Until you get into network drives and clock skew...

No success whatsoever in either direction across Samba. In fact the mtime of my Linux home directory as seen from NT is Jan 1, 1980.

- Gordon

Gordon> No success whatsoever in either direction across Samba. In fact
Gordon> the mtime of my Linux home directory as seen from NT is Jan 1,
Gordon> 1980.

Ain't life grand? :-( Ah, well, it was a nice idea...

S

That's only the case for an NT mount point (something of the form \\host\name; I notice that os.stat() only believes it exists if you append a backslash: \\host\name\). For interior directories, at least with the Samba version that I'm using, os.stat() seems to give correct results.

I think that this whole issue (that doing a stat on a directory to find out whether files in it were modified doesn't give usable results) is wildly blown out of proportion. The only useful bit of info is that mtimes may have up to 2-second granularity, and that anything modified within the last 2 seconds should be considered newer than the cache, even if the cache is also less than 2 seconds old.

--Guido van Rossum (home page: http://www.python.org/~guido/)

[Gordon]
Correct (as I discovered not long after I posted). (I find that from NT I have to stat some file _in_ the directory to get an updated mtime from the stat _of_ the directory).
This has come up twice: re caching importers and dircache.py (used only by dircmp). We've arrived at the fact that it _can_ be made to work on Windows boxes. NFS? Andrew (anyone still use that)? IOW, do we want to trust it? Do we want to document that it might not be trustworthy in some situations? Make it optional-for-wizards? Kill it? IOOW, what's the proper proportion ;-)?

Greg> The problem occurs when your path is [A, B], the file is in B, and
Greg> you add something to A on-the-fly. The cache might direct the
Greg> importer at B, missing your file.

Typically your path will be relatively short (< 20 directories), right? Just stat the directories before consulting the cache. If any changed since the last time the cache was built, then invalidate the entire cache (or that portion of the cached information that is downstream from the first modified directory). It's still going to be cheaper than performing listdir for each directory in the path and, like you said, will only require flushes during development or installation actions.

Skip Montanaro | http://www.mojam.com/
skip@mojam.com | http://www.musi-cal.com/
847-971-7098 | Python: Programming the way Guido indented...
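A sketch of the stat-before-lookup idea, assuming a hypothetical `PathCache` class that remembers which path entry each file was found in. None of these names come from imputil or the proposed Importer, and the mtime granularity problem discussed elsewhere in the thread is deliberately ignored here.

```python
import os


class PathCache:
    """Hypothetical file-location cache validated by directory mtimes."""

    def __init__(self, path):
        self.path = list(path)
        self.mtimes = {}     # directory -> mtime seen at the last validation
        self.locations = {}  # filename -> index into self.path

    def _validate(self):
        """Stat each directory; drop cached hits at or after the first change."""
        first_changed = None
        for index, directory in enumerate(self.path):
            try:
                mtime = os.stat(directory).st_mtime
            except OSError:
                mtime = None
            known = directory in self.mtimes
            if known and self.mtimes[directory] != mtime and first_changed is None:
                first_changed = index
            self.mtimes[directory] = mtime
        if first_changed is not None:
            # Entries found in earlier directories still win, so keep them.
            self.locations = {
                name: i for name, i in self.locations.items() if i < first_changed
            }

    def find(self, filename):
        """Return the directory that provides filename, or None."""
        self._validate()
        index = self.locations.get(filename)
        if index is None:
            for index, directory in enumerate(self.path):
                if os.path.exists(os.path.join(directory, filename)):
                    self.locations[filename] = index
                    break
            else:
                return None
        return self.path[index]
```

Only the "downstream" part of the cache is discarded because a change in the directory at position k cannot affect a file already resolved to a directory before k. Each lookup costs one stat() per path entry, which is the trade-off proposed above against a listdir per directory.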

participants (5)
- Gordon McMillan
- Greg Stein
- Guido van Rossum
- Jean-Claude Wippler
- Skip Montanaro