Suggestion: stop trusting OS mtimes
Trusting OS-based mtimes for .pyc caching has some inherent problems (clock syncing and similar). Frankly, though I've never been bitten by this, it does give me an uncomfortable feeling. What if, instead, we used an MD5- or SHA-based approach? I'm willing to bet that the 1-in-2^128 chance of problems is minuscule compared to the real problems clock syncing has already caused. (I think I remember some problem with .pyc's on IIS, but I may just be hallucinating.)

Problems: .pyc size would increase by 24 bytes <wink>

-- Moshe Zadka <moshez@math.huji.ac.il>
http://www.oreilly.com/news/prescod_0300.html
http://www.linux.org.il -- we put the penguin in .com
Moshe Zadka wrote:
Trusting OS-based mtimes for .pyc caching has some inherent problems (clock syncing and similar). Frankly, though I've never been bitten by this, it does give me an uncomfortable feeling. What if, instead, we used an MD5- or SHA-based approach? I'm willing to bet that the 1-in-2^128 chance of problems is minuscule compared to the real problems clock syncing has already caused. (I think I remember some problem with .pyc's on IIS, but I may just be hallucinating.)
Problems: .pyc size would increase by 24 bytes <wink>
Much worse: you'd have to recalculate the MD5 sum every time you import the .pyc file... Frankly, I don't think this is needed at all ;-)

-- Marc-Andre Lemburg
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
hi, Moshe Zadka wrote:
Trusting OS-based mtimes for .pyc caching has some inherent problems (clock syncing and similar). Frankly, though I've never been bitten by this, it does give me an uncomfortable feeling. What if, instead, we used an MD5- or SHA-based approach? I'm willing to bet that the 1-in-2^128 chance of problems is minuscule compared to the real problems clock syncing has already caused. (I think I remember some problem with .pyc's on IIS, but I may just be hallucinating.)
The timestamp is returned by simply 'stat'ing the .py file. If you want more, you would actually have to open the .py files every time. That would trade a big performance penalty for a safety margin that is almost never needed.

On Unix, many subsystems (for example 'make') depend on a monotonic system clock. A randomly jumping clock would break many of them anyway.

Regards, Peter
-- Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)
On Fri, 2 Jun 2000, Peter Funk wrote:
Moshe Zadka wrote:
Trusting OS-based mtimes for .pyc caching has some inherent problems (clock syncing and similar). Frankly, though I've never been bitten by this, it does give me an uncomfortable feeling. What if, instead, we used an MD5- or SHA-based approach? I'm willing to bet that the 1-in-2^128 chance of problems is minuscule compared to the real problems clock syncing has already caused. (I think I remember some problem with .pyc's on IIS, but I may just be hallucinating.)
The timestamp is returned by simply 'stat'ing the .py file. If you want more, you would actually have to open the .py files every time. That would trade a big performance penalty for a safety margin that is almost never needed. On Unix, many subsystems (for example 'make') depend on a monotonic system clock. A randomly jumping clock would break many of them anyway.
He does have a point, but I think the wrong solution :-)

While the clock may be monotonically increasing on one system, it isn't always the case when things like NFS come into play.

I recall a case back in '95 when I was editing a .py over an NFS mount and running the code on the target machine. The clocks on the two boxes were off by about three seconds. I was going thru the edit/run/edit/run cycle so quickly that, at one point, I saved a .py file that was older than the associated .pyc file.

Needless to say, I was really confused that my recent edit didn't produce the desired result :-)

Cheers, -g

p.s. and no, I don't know why the internal timestamp didn't take effect

-- Greg Stein, http://www.lyra.org/
On Fri, 2 Jun 2000, Greg Stein wrote:
He does have a point, but I think the wrong solution :-)
In my defense, it was after spending the whole day on my feet giving a lecture, or driving (for 12 hours). But it does bother me, even if the solution is terrible. How about having, in addition to the time-stamp, the size of the file? At least on UNIX, it comes for free with the same stat call. -- Moshe Zadka <moshez@math.huji.ac.il> http://www.oreilly.com/news/prescod_0300.html http://www.linux.org.il -- we put the penguin in .com
[Moshe]
How about having, in addition to the time-stamp, the size of the file? At least on UNIX, it comes for free with the same stat call.
+1 from me. Note that, besides inter-machine clock skew, some filesystems have a timestamp granularity too coarse to distinguish close-in-time writes. For those (& related) reasons, the attentive Pythoneer will have noted that all of the winning 1st-round Software Carpentry "make"-replacement designs provide for alternatives to timestamps. Tom Tromey's has the clearest discussion of the problems with timestamps:

http://software-carpentry.codesourcery.com/entries/build/Tromey/Tromey.html

In my industrial experience, (timestamp, size) pairs have never failed in practice. However, "my industrial experience" has been entirely in projects where source-control wrappers add a checkin comment block to every checked-in file, and that alone made it exceedingly unlikely that any two successive versions of a file would have the same size.

In Python I'm still (a little) worried about cases like

    SOME_GLOBAL_CONFIG_OPTION = 0

where "0" gets replaced by "1" or "2" or ... there are lots of interesting things you can do to Python programs without changing their size. At Dragon, checked-in Python files were also subject to the "checkin comment block" rule, so no project under source control suffered from this. I suspect it burned people in their pre-source-controlled development projects, though! One group in particular had a project that involved acres of machine-generated Python modules, and I know they suffered from coarse timestamps on their flavor of Unix (so part of their "make" procedure was to nuke all .pyc's on each build).

it's-easy-to-laugh-at-problems-you-don't-have<wink>-ly y'rs - tim
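The same-size worry can be reproduced in a few lines. This is only a sketch: the helper names and the pinned timestamp are made up for illustration, and `os.utime` is used to imitate a filesystem whose timestamps are too coarse to separate two quick writes.

```python
import os
import tempfile

PINNED = 1_000_000_000  # arbitrary timestamp, chosen for this demo only


def mtime_size(path):
    """Return the (mtime, size) pair obtained from a single stat() call."""
    st = os.stat(path)
    return (int(st.st_mtime), st.st_size)


def same_size_edit_goes_unnoticed():
    """Edit a file without changing its size and compare the signatures."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "config.py")
        with open(path, "w") as f:
            f.write("SOME_GLOBAL_CONFIG_OPTION = 0\n")
        os.utime(path, (PINNED, PINNED))
        before = mtime_size(path)

        with open(path, "w") as f:
            f.write("SOME_GLOBAL_CONFIG_OPTION = 1\n")
        # Imitate a coarse filesystem clock: the second write gets the
        # same timestamp as the first.
        os.utime(path, (PINNED, PINNED))
        after = mtime_size(path)

        return before == after
```

Here the two signatures compare equal even though the file contents differ, which is exactly the case a (timestamp, size) pair cannot catch.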
On Sun, 4 Jun 2000, Tim Peters wrote:
[Moshe]
How about having, in addition to the time-stamp, the size of the file? At least on UNIX, it comes for free with the same stat call.
[Tim]
+1 from me. <even more reasons>
The big problem now is that this would change the .pyc header size. I thought this would be a good time for it anyway, since 1.5.2 pycs aren't compatible with 1.6, but changing the header size is a bigger thing.

so-this-waits-until-guido-comes-back-i-guess-ly y'rs, Z.

-- Moshe Zadka <moshez@math.huji.ac.il>
http://www.oreilly.com/news/prescod_0300.html
http://www.linux.org.il -- we put the penguin in .com
Greg> I recall a case back in '95 when I was editing a .py over an NFS
Greg> mount and running the code on the target machine. The clocks on
Greg> the two boxes were off by about three seconds. I was going thru
Greg> the edit/run/edit/run cycle so quickly that, at one point, I saved
Greg> a .py file that was older than the associated .pyc file.

One thing that would help, I think, is to compare the mtimes of the .py and .pyc files with the current system clock and squawk if either appears to have been created in the future. I believe this is what GNU make does.

Of course, the best solution to all of this is the non-Python solution: use NTP so your clocks stay sync'd. It's even available out-of-the-box on my iMac...

-- Skip Montanaro, skip@mojam.com, http://www.mojam.com/, http://www.musi-cal.com/
"We have become ... the stewards of life's continuity on earth. We did not ask for this role... We may not be suited to it, but here we are." - Stephen Jay Gould
On Sat, 3 Jun 2000, Skip Montanaro wrote:
Of course, the best solution to all of this is the non-Python solution: use NTP so your clocks stay sync'd. It's even available out-of-the-box on my iMac...
But the "Python Way" was always to adapt: not to require One True Way, but to use the facilities available wherever it finds itself.

In any case, is there any objection to storing the size of the .py alongside its mtime in the .pyc, and regenerating if either has changed? This is just as efficient, and much more reliable.

-- Moshe Zadka <moshez@math.huji.ac.il>
http://www.oreilly.com/news/prescod_0300.html
http://www.linux.org.il -- we put the penguin in .com
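A minimal sketch of that check, under stated assumptions: the 12-byte header layout, the `DEMO` magic value, and the function names below are hypothetical and invented for illustration. They are not the actual .pyc format under discussion.

```python
import os
import struct
import tempfile

# Hypothetical header for this sketch only: 4-byte magic, 4-byte source
# mtime, 4-byte source size, all little-endian.
HEADER = struct.Struct("<4sLL")
MAGIC = b"DEMO"  # placeholder, not a real Python magic number
MASK = 0xFFFFFFFF


def make_header(source_path):
    """Record the source's mtime and size; both come from one stat() call."""
    st = os.stat(source_path)
    return HEADER.pack(MAGIC, int(st.st_mtime) & MASK, st.st_size & MASK)


def is_stale(header, source_path):
    """Regenerate if the magic, the mtime, or the size no longer matches."""
    magic, mtime, size = HEADER.unpack(header)
    st = os.stat(source_path)
    return (
        magic != MAGIC
        or mtime != int(st.st_mtime) & MASK
        or size != st.st_size & MASK
    )


def demo():
    """Return (stale_when_untouched, stale_after_same_mtime_edit)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "mod.py")
        with open(path, "w") as f:
            f.write("x = 1\n")
        os.utime(path, (1_000_000_000, 1_000_000_000))
        header = make_header(path)
        untouched = is_stale(header, path)

        with open(path, "a") as f:
            f.write("y = 2\n")
        # Force the old timestamp back, as clock skew might: only the
        # size field can reveal the edit now.
        os.utime(path, (1_000_000_000, 1_000_000_000))
        edited = is_stale(header, path)
        return untouched, edited
```

In `demo()` the edit is caught by the size field alone, since the timestamp is deliberately left unchanged. No file contents are read at any point, so the cost stays at one stat() per source file.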
On Sat, 3 Jun 2000, Moshe Zadka wrote:
... In any case, is there any objection to storing the size of the .py alongside its mtime in the .pyc, and regenerating if either has changed? This is just as efficient, and much more reliable.
If we change the header size, then this would be fine. At the moment, I don't think that anybody is suggesting a change to the .pyc header format because of the number of tool breakages that would ensue. People have seemed interested in my idea to update the marshal format to record a version number -- it doesn't require a change to the .pyc header. Cheers, -g -- Greg Stein, http://www.lyra.org/
On Sat, 3 Jun 2000, Skip Montanaro wrote:
Greg> I recall a case back in '95 when I was editing a .py over an NFS
Greg> mount and running the code on the target machine. The clocks on
Greg> the two boxes were off by about three seconds. I was going thru
Greg> the edit/run/edit/run cycle so quickly that, at one point, I saved
Greg> a .py file that was older than the associated .pyc file.
One thing that would help, I think, is to compare the mtimes of the .py and .pyc files with the current system clock and squawk if either appears to have been created in the future. I believe this is what GNU make does.
Sure, but to the target machine, the .pyc was fine and the .py was in the past. :-) Of course, the proper solution is to introduce compile/link stages into Python so that we don't get bitten by 3-second clock differences. :-) -- Greg Stein, http://www.lyra.org/
Hi, Greg Stein:
He does have a point, but I think the wrong solution :-)
While the clock may be monotonically increasing on one system, it isn't always the case when things like NFS come into play.
That is a well known and common trap. Don't use NFS for Software development unless you've read and understood RFC 868. ;-) BTW.: Last year someone posted a pure Python implementation of RFC 868 time server and clients to c.l.p. This might be useful on those WinXX boxes.
I recall a case back in '95 when I was editing a .py over an NFS mount and running the code on the target machine. The clocks on the two boxes were off by about three seconds. I was going thru the edit/run/edit/run cycle so quickly that, at one point, I saved a .py file that was older than the associated .pyc file.
Needless to say, I was really confused that my recent edit didn't produce the desired result :-)
Sure. ;-) But the same would have happened if you had edited a .c source file and your target computer had a C compiler/linker fast enough to complete an edit/compile/run cycle in less time than the clock difference. This is not uncommon today. So the problem is not Python's fault, and I see no need to fix it there.

One thing could be added, though: if Python 'stat's a .py file that has a timestamp in the future, it could issue a warning similar to the one displayed by 'make':

*** Warning: File `%s' has modification time in the future (%ld > %ld)

Possibly this message could point the user to RFC 868 and the 'netdate' Unix command. But that would be icing on the cake.

Regards, Peter
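The suggested warning could look roughly like this in Python. It is a sketch, not existing interpreter behavior: `warn_if_from_future`, its `now` and `slack` parameters, and the message wording are all assumptions modelled on the 'make' message quoted above.

```python
import os
import tempfile
import time
import warnings


def warn_if_from_future(path, now=None, slack=0.0):
    """Warn and return True if the file's mtime lies after `now` + `slack`."""
    if now is None:
        now = time.time()
    mtime = os.stat(path).st_mtime
    if mtime > now + slack:
        warnings.warn(
            "File %r has modification time in the future (%d > %d)"
            % (path, mtime, now)
        )
        return True
    return False


def demo():
    """Return (flagged_as_future, flagged_as_past, number_of_warnings)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "mod.py")
        with open(path, "w") as f:
            f.write("pass\n")
        os.utime(path, (2_000_000_000, 2_000_000_000))
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            future = warn_if_from_future(path, now=1_000_000_000)
            past = warn_if_from_future(path, now=3_000_000_000)
        return future, past, len(caught)
```

A nonzero `slack` would tolerate a known small skew. As pointed out earlier in the thread, the check only helps when the file looks newer than the local clock; it is silent when the skew runs the other way.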
On Mon, 5 Jun 2000, Peter Funk wrote:
Greg Stein:
He does have a point, but I think the wrong solution :-)
While the clock may be monotonically increasing on one system, it isn't always the case when things like NFS come into play.
That is a well known and common trap. Don't use NFS for Software development unless you've read and understood RFC 868. ;-)
"Make"'s philosophy of using timestamps to decide which files need to be remade is not necessarily the best, but the user can replace make if it doesn't "do the right thing". Since Python takes on some of make's roles (regenerating files only if they need to be regenerated), it is subject to the same problems. So it is Python's fault, and that's where the problem should be fixed.

-- Moshe Zadka <moshez@math.huji.ac.il>
http://www.oreilly.com/news/prescod_0300.html
http://www.linux.org.il -- we put the penguin in .com
On Fri, 2 Jun 2000, Moshe Zadka wrote:
Trusting OS-based mtimes for .pyc caching has some inherent problems (clock syncing and similar). Frankly, though I've never been bitten by this, it does give me an uncomfortable feeling. What if, instead, we used an MD5- or SHA-based approach?
That is an expensive computation. You'd have to read the whole file in and compute the hash. Today, we simply stat() each file. If the .pyc looks valid, we open it and check the date stamp against one of those stats. You would be adding an open(), a read of the full file, and a hash computation to every import of a .pyc.

-1

Cheers, -g

-- Greg Stein, http://www.lyra.org/
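To make the cost difference concrete, here is a small sketch of the two kinds of check. The function names are hypothetical, and SHA-256 stands in for "MD5 or SHA" only because hashlib always provides it.

```python
import hashlib
import os
import tempfile


def stat_signature(path):
    """Cheap: a single stat() call; the file contents are never read."""
    return int(os.stat(path).st_mtime)


def hash_signature(path):
    """Costly: open the file, read every byte, and hash it."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def demo():
    """Return (hash_changed_after_edit, stat_signature_is_int)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "mod.py")
        with open(path, "w") as f:
            f.write("FLAG = 0\n")
        before = hash_signature(path)
        with open(path, "w") as f:
            f.write("FLAG = 1\n")
        after = hash_signature(path)
        return before != after, isinstance(stat_signature(path), int)
```

The hash catches every content change regardless of clocks, but `hash_signature` does work proportional to the file size on each check, while `stat_signature` does not touch the contents at all.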
participants (6)
- Greg Stein
- M.-A. Lemburg
- Moshe Zadka
- pf@artcom-gmbh.de
- Skip Montanaro
- Tim Peters