TLDR: In os.scandir directory entries, atime is always a copy of mtime rather than the actual access time.
Demo program: Windows 10, Python 3.8.3:
# osscandirtest.py
import time, os
with open('Test', 'w') as f: f.write('Anything\n')  # Write to a file
time.sleep(10)
with open('Test', 'r') as f: f.readline()  # Read the file
print(os.stat('Test'))
for DirEntry in os.scandir('.'):
    if DirEntry.name == 'Test':
        stat = DirEntry.stat()
        print(f'scandir DirEntry {stat.st_ctime=} {stat.st_mtime=} {stat.st_atime=}')
Sample output:
os.stat_result(st_mode=33206, st_ino=8162774324687317, st_dev=2230120362, st_nlink=1, st_uid=0, st_gid=0, st_size=10, st_atime=1600631381, st_mtime=1600631371, st_ctime=1600631262)
scandir DirEntry stat.st_ctime=1600631262.951019 stat.st_mtime=1600631371.7062848 stat.st_atime=1600631371.7062848
For os.stat, atime is 10 seconds more than mtime, as would be expected. But for os.scandir, atime is a copy of mtime. ISTM that this is a bug, and in fact recently it stopped me from using os.scandir in a program where I needed the access timestamp. No big deal, but ... If it is a feature for some reason, presumably it should be documented.
Best wishes Rob Cliffe
Could you please file this as an issue on bugs.python.org?
Thanks! -Greg
Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-leave@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/RIKQAXZV... Code of Conduct: http://python.org/psf/codeofconduct/
Interesting! Indeed, please create an issue and post a link here.
From a quick look at the code, I can't see any obvious bugs here; the info seems to be coming directly from FindNextFileW. This will likely require some more digging.
How do I do that, please? I can't see an obvious create option on that web page. Do I need to log in? Thanks Rob Cliffe
On 10/18/2020 12:25 PM, Rob Cliffe via Python-Dev wrote:
How do I do that, please? I can't see an obvious create option on that web page. Do I need to log in?
Yes, you need to log in before you can open an issue. You might need to create an account first if you don't have one: it's called "Register" on bpo. After you've logged in, there's a Create New button.
Eric
On 10/15/20, Rob Cliffe via Python-Dev python-dev@python.org wrote:
TLDR: In os.scandir directory entries, atime is always a copy of mtime rather than the actual access time.
There are inconsistencies in various scenarios between the stat info from the directory entry and the stat info from the File Control Block (FCB) -- the filesystem's in-memory record that's common to all opens for a file/directory.
The worst case is for an NTFS file with multiple hardlinks, for which the directory entry information is from the last time the file was opened using a particular hardlink. The accurate NTFS file information is in the file's Master File Table (MFT) record, which gets accessed to populate the FCB and update the particular link when a file is opened.
If you're looking for file times and file size, the only reliable information comes from directly opening the file and querying the info via GetFileInformationByHandle (called by os.stat), GetFileInformationByHandleEx (FileBasicInfo, FileStandardInfo), GetFileTime, and GetFileSizeEx.
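As a rough illustration of that advice from the Python side, the sketch below still lists the directory with os.scandir(), but takes sizes and times from a fresh os.stat() on each path rather than from the cached DirEntry.stat(). The helper name fresh_stats is made up for this example and is not part of the os module.

```python
import os

def fresh_stats(directory):
    """Map each regular file name in `directory` to (size, atime, mtime).

    Hypothetical helper: names come from os.scandir(), but every number
    comes from os.stat() on the path, not from the cached DirEntry.stat().
    """
    results = {}
    with os.scandir(directory) as entries:
        for entry in entries:
            # Only the entry kind is taken from the directory listing.
            if not entry.is_file(follow_symlinks=False):
                continue
            st = os.stat(entry.path)  # queries the file itself
            results[entry.name] = (st.st_size, st.st_atime, st.st_mtime)
    return results
```

This gives up the speed advantage of scandir() for the files it stats, so it only makes sense when the timestamps or sizes really matter.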
On 15Oct2020 2239, Rob Cliffe via Python-Dev wrote:
TLDR: In os.scandir directory entries, atime is always a copy of mtime rather than the actual access time.
Correction - os.stat() updates the access time to _now_, while os.scandir() returns the last access time without updating it.
Eryk replied with a deeper explanation of the cause, but fundamentally this is what you are seeing.
Feel free to file a bug, but we'll likely only add a vague note to the docs about how Windows works here rather than changing anything. If anything, we should probably fix os.stat() to avoid updating the access time so that both functions behave the same, but that might be too complicated.
Cheers, Steve
On 19Oct2020 1242, Steve Dower wrote:
On 15Oct2020 2239, Rob Cliffe via Python-Dev wrote:
TLDR: In os.scandir directory entries, atime is always a copy of mtime rather than the actual access time.
Correction - os.stat() updates the access time to _now_, while os.scandir() returns the last access time without updating it.
Let me correct myself first :)
Windows has decided not to update file access time metadata in directory entries on reads. os.stat() always[1] looks at the file entry metadata, while os.scandir() always looks at the directory entry metadata.
My suggested approach still applies, other than the bit where we might fix os.stat(). The best we can do is regress os.scandir() to have similarly poor performance, but the best you can do is use os.stat() for accurate timings when files might be being modified while your program is running, and don't do it when you just need names/kinds (and I'm okay adding that note to the docs).
Cheers, Steve
[1]: With some fallback to directory entries in exceptional cases that don't apply here.
On 19.10.2020 14:47, Steve Dower wrote:
On 19Oct2020 1242, Steve Dower wrote:
On 15Oct2020 2239, Rob Cliffe via Python-Dev wrote:
TLDR: In os.scandir directory entries, atime is always a copy of mtime rather than the actual access time.
Correction - os.stat() updates the access time to _now_, while os.scandir() returns the last access time without updating it.
Let me correct myself first :)
Windows has decided not to update file access time metadata in directory entries on reads. os.stat() always[1] looks at the file entry metadata, while os.scandir() always looks at the directory entry metadata.
Is this behavior documented somewhere?
Such weirdness is certainly something that needs to be documented, but I really don't like describing such quirks that are out of our control and may be subject to change in Python documentation. So we should only consider doing so if there are no other options.
Regards, Ivan
On Mon, Oct 19, 2020 at 6:28 AM Ivan Pozdeev via Python-Dev python-dev@python.org wrote:
Is this behavior documented somewhere?
Such weirdness is certainly something that needs to be documented, but I really don't like describing such quirks that are out of our control and may be subject to change in Python documentation. So we should only consider doing so if there are no other options.
I'm sure this is covered in MSDN. Linking to that if it has it in a concise explanation would make sense from a note in our docs.
If I'm understanding Steve correctly this is due to Windows/NTFS storing the access time potentially redundantly in two different places. One within the directory entry itself and one with the file's own metadata. Those of us with a traditional posix filesystem background may raise eyeballs at this duplication, seeing a directory as a place that merely maps names to inodes with the inode structure (equiv: file entry metadata) being the sole source of truth. Which ones get updated when and by what actions is up to the OS.
So yes, just document the "quirk" as an intended OS behavior. This is one reason scandir() can return additional information on windows vs what it can return on posix. The entire point of scandir() is to return as much as possible from the directory without triggering reads of the inodes/file-entry-metadata. :)
-gps
On 10/19/20 9:52 AM, Gregory P. Smith wrote:
I'm sure this is covered in MSDN. Linking to that if it has it in a concise explanation would make sense from a note in our docs.
If I'm understanding Steve correctly this is due to Windows/NTFS storing the access time potentially redundantly in two different places. One within the directory entry itself and one with the file's own metadata. Those of us with a traditional posix filesystem background may raise eyeballs at this duplication, seeing a directory as a place that merely maps names to inodes with the inode structure (equiv: file entry metadata) being the sole source of truth. Which ones get updated when and by what actions is up to the OS.
So yes, just document the "quirk" as an intended OS behavior. This is one reason scandir() can return additional information on windows vs what it can return on posix. The entire point of scandir() is to return as much as possible from the directory without triggering reads of the inodes/file-entry-metadata. :)
-gps
Depending on atimes isn't a consistently reliable mechanism anyway, since filesystems on Linux et al. are allowed to be mounted so as to not independently update access times.
On 19Oct2020 1652, Gregory P. Smith wrote:
I'm sure this is covered in MSDN. Linking to that if it has it in a concise explanation would make sense from a note in our docs.
Probably unlikely :) I'm pretty sure this started "perfect" and was then wound back to improve performance. But it's almost certainly an option somewhere, which means you can't rely on it being either true or false. You just have to be explicit for certain pieces of information.
If I'm understanding Steve correctly this is due to Windows/NTFS storing the access time potentially redundantly in two different places. One within the directory entry itself and one with the file's own metadata. Those of us with a traditional posix filesystem background may raise eyeballs at this duplication, seeing a directory as a place that merely maps names to inodes with the inode structure (equiv: file entry metadata) being the sole source of truth. Which ones get updated when and by what actions is up to the OS.
So yes, just document the "quirk" as an intended OS behavior. This is one reason scandir() can return additional information on windows vs what it can return on posix. The entire point of scandir() is to return as much as possible from the directory without triggering reads of the inodes/file-entry-metadata. :)
Yeah, I'd document it as a quirk of scandir. There's also a race where if you scandir(), then someone touches the file, then you look at the cached stat you get the wrong information too (on any platform). Making it clearer that it's for non-time-sensitive queries is most accurate, though we could also give an example of "access times may not be up to date depending on OS-level caching" without committing us to being responsible for OS decisions.
Cheers, Steve
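The caching race Steve mentions can be sketched in a few lines. This is only an illustration under assumptions: the function name, the file name and the two fixed timestamps are invented for the demo, and os.utime() stands in for "someone touches the file".

```python
import os
import tempfile

def cached_vs_fresh_mtime():
    """Return (first, cached, fresh) mtimes around a change made after scanning."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.txt")
        with open(path, "w") as f:
            f.write("x\n")
        # Give the file a known modification time before scanning.
        os.utime(path, (1_000_000_000, 1_000_000_000))
        with os.scandir(tmp) as entries:
            entry = next(e for e in entries if e.name == "example.txt")
            first = entry.stat().st_mtime    # result is cached on the DirEntry
            # Simulate another process touching the file after the scan.
            os.utime(path, (1_500_000_000, 1_500_000_000))
            cached = entry.stat().st_mtime   # served from the DirEntry cache
            fresh = os.stat(entry.path).st_mtime  # asks the filesystem again
    return first, cached, fresh
```

The point of the sketch is that the second DirEntry.stat() call does not go back to the filesystem, so only the explicit os.stat() sees the later change.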
On 20/10/20 4:52 am, Gregory P. Smith wrote:
Those of us with a traditional posix filesystem background may raise eyeballs at this duplication, seeing a directory as a place that merely maps names to inodes
This is probably a holdover from MS-DOS, where there was no separate inode-like structure -- it was all in the directory entry.
-- Greg
On 10/19/20, Greg Ewing greg.ewing@canterbury.ac.nz wrote:
On 20/10/20 4:52 am, Gregory P. Smith wrote:
Those of us with a traditional posix filesystem background may raise eyeballs at this duplication, seeing a directory as a place that merely maps names to inodes
This is probably a holdover from MS-DOS, where there was no separate inode-like structure -- it was all in the directory entry.
DOS implemented a find-first/find-next API (int 21h 4E/4F) that provided a file's name, attributes, size, and last write time/date. I think it's clear that the design was influenced by the readily-available contents of a FAT dirent. The Win32 API extended this to FindFirstFile/FindNextFile, with added support for the long filename, create and access times, and, in NT 5+, the reparse tag for a reparse point.
NTFS had to support this metadata in the directory index, else FindFirstFile/FindNextFile would be too expensive if the filesystem had to fetch the metadata from the MFT for every matching file in a listing. It tries to keep the duplicated metadata in sync -- such as when a file is opened, closed, manually extended in size, when the cache is flushed, or when metadata is explicitly set (e.g. SetFileInformationByHandle: FileBasicInfo). But for performance it doesn't update the duplicated data every time a file is read from or written to. And, in particular, if it's just the access time that changed, it updates the duplicated access time with a one-hour granularity. (There's also a registry value, as I mentioned previously, that disables updating access times completely -- in both the MFT record and the directory index.)
That said, if a file has multiple hardlinks the current NTFS implementation for updating duplicated data is totally unreliable. It only updates the accessed link. All other links go stale. We don't have any reasonable way to special case this situation because the directory entry doesn't include the number of links a file has. It has to be opened and queried directly, but then one might as well do a full stat() for every file.
I recommend relying on only the high-level is_dir(), is_file(), and is_symlink() methods of os.scandir() items, to quickly process a directory. inode() is reliable -- as much as is possible in Windows -- because the implementation gets the full stat info, but check to ensure it's not 0. It's based on the file ID, which Windows filesystems aren't required to support (or reliably support; it's not stable in FAT). NTFS and ReFS support reliable 64-bit file IDs, and opening by file ID.
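A minimal sketch of that recommendation, assuming nothing beyond the documented os.DirEntry methods: classify entries using only is_symlink(), is_dir() and is_file(), and treat an inode() of 0 as "no usable file ID". Both helper names are hypothetical.

```python
import os

def classify_entries(directory):
    """Sort entry names into kinds using only DirEntry's is_*() methods."""
    kinds = {"dirs": [], "files": [], "symlinks": [], "other": []}
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_symlink():
                kinds["symlinks"].append(entry.name)
            elif entry.is_dir(follow_symlinks=False):
                kinds["dirs"].append(entry.name)
            elif entry.is_file(follow_symlinks=False):
                kinds["files"].append(entry.name)
            else:
                kinds["other"].append(entry.name)
    return kinds

def inode_or_none(entry):
    """Return entry.inode(), or None when it is 0 (no file ID available)."""
    ino = entry.inode()
    return ino if ino != 0 else None
```

Nothing here touches sizes or timestamps, which is the part of the directory entry the thread says can go stale.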
On 10/19/20, Steve Dower steve.dower@python.org wrote:
On 19Oct2020 1242, Steve Dower wrote:
On 15Oct2020 2239, Rob Cliffe via Python-Dev wrote:
TLDR: In os.scandir directory entries, atime is always a copy of mtime rather than the actual access time.
Correction - os.stat() updates the access time to _now_, while os.scandir() returns the last access time without updating it.
Let me correct myself first :)
Windows has decided not to update file access time metadata in directory entries on reads. os.stat() always[1] looks at the file entry metadata, while os.scandir() always looks at the directory entry metadata.
My suggested approach still applies, other than the bit where we might fix os.stat(). The best we can do is regress os.scandir() to have similarly poor performance, but the best you can do is use os.stat() for accurate timings when files might be being modified while your program is running, and don't do it when you just need names/kinds (and I'm okay adding that note to the docs).
If this is the correction to which you're referring in the previous message, I assumed you stood by the claim that os.stat() may update st_atime. That shouldn't be the case, so there shouldn't be anything that needs to be fixed there, unless I'm missing what you think needs to be fixed. If it's actually a problem, then I'd really, really like a test case that reproduces it. If it was just a misinterpreted test case or mis-remembered fact, then that's good news for me. ;-)
Regarding updating the access time in the directory entry, in my previous reply I explained that NTFS should update it with a one-hour granularity. With FAT, it's daily.
Regarding the view that this is only about "accurate timings when files might be being modified while your program is running", in my previous messages I stressed that the directory entry for a hard link may have the wrong size, change time, write time, and access time if it wasn't the last link used to update the file. That has nothing to do with the file being modified while the program is running. It's a stale directory entry. If you call os.stat() on the stale link, NTFS will update it with the correct metadata.
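The hard-link case could be probed with something like the following sketch. The helper names and file names are invented for the example. Note the order of the two calls in the loop: the cached DirEntry value is read before os.stat(), since per the explanation above the os.stat() call itself may refresh a stale link.

```python
import os

def make_two_links(directory):
    """Create one file with two hard links, then grow it through the first link only."""
    first = os.path.join(directory, "link_a")
    second = os.path.join(directory, "link_b")
    with open(first, "w") as f:
        f.write("short\n")
    os.link(first, second)
    with open(first, "a") as f:
        f.write("more data written through link_a\n")
    return first, second

def stale_link_names(directory):
    """Return names whose cached DirEntry size differs from a fresh os.stat()."""
    stale = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue
            cached_size = entry.stat(follow_symlinks=False).st_size
            fresh_size = os.stat(entry.path).st_size
            if cached_size != fresh_size:
                stale.append(entry.name)
    return sorted(stale)
```

On POSIX systems, where DirEntry.stat() makes a real stat call, the returned list should be empty. On NTFS, going by the description above, link_b is the entry one would expect to possibly show up, though I haven't verified that on a Windows machine.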
On Mon, Oct 19, 2020, at 07:42, Steve Dower wrote:
On 15Oct2020 2239, Rob Cliffe via Python-Dev wrote:
TLDR: In os.scandir directory entries, atime is always a copy of mtime rather than the actual access time.
Correction - os.stat() updates the access time to _now_, while os.scandir() returns the last access time without updating it.
This is surprising - do we know why this happens?
Also, it doesn't seem true on my system with python 3.8.5 [and, yes, I checked that last access update is enabled for my test and updates normally when reading the file's contents].
On 10/19/20, Steve Dower steve.dower@python.org wrote:
On 15Oct2020 2239, Rob Cliffe via Python-Dev wrote:
TLDR: In os.scandir directory entries, atime is always a copy of mtime rather than the actual access time.
Correction - os.stat() updates the access time to _now_, while os.scandir() returns the last access time without updating it.
os.stat() shouldn't affect st_atime because it doesn't access the file data. That has me curious if it can be reproduced.
With NTFS in Windows 10, I'd expect the os.stat() st_atime to change immediately when the file data is read or modified. With other filesystems, it may not be updated until the kernel file object that was used to access the file's data is closed.
Note that updating the access time in NTFS can be disabled by the "NtfsDisableLastAccessUpdate" value in "HKLM\System\CurrentControlSet\Control\FileSystem". The default value in Windows 10 should be 0x80000002, which means the value is system managed and updating the access time is enabled.
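For anyone who wants to check that registry value from Python, here is a hedged sketch using the standard winreg module. The decoding is an ASSUMPTION that is only consistent with the one value given above (0x80000002 meaning system managed, updates enabled): it treats bit 0 as "updates disabled" and bit 1 as "system managed". Both function names are made up.

```python
import sys

KEY_PATH = r"System\CurrentControlSet\Control\FileSystem"
VALUE_NAME = "NtfsDisableLastAccessUpdate"

def describe_last_access_setting(raw):
    """Decode the DWORD, ASSUMING bit 0 = updates disabled, bit 1 = system managed."""
    return {
        "updates_enabled": not (raw & 0x1),
        "system_managed": bool(raw & 0x2),
    }

def read_last_access_setting():
    """Read and decode the value; None when not on Windows or the value is missing."""
    if sys.platform != "win32":
        return None
    import winreg  # Windows-only standard library module
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH) as key:
            raw, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
    except OSError:
        return None
    return describe_last_access_setting(raw)
```

If the bit layout assumed here turns out to be wrong for some Windows version, only describe_last_access_setting() needs to change.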
If it's only the access time that changes, the directory entry may be updated with a significant granularity such as hourly or daily. For NTFS, it's hourly. To confirm this, wait an hour from the current access time in the directory entry; open the file; read some data; and close the file. The access time in the directory entry should be updated.
For details, download the [MS-FSA] PDF [1] and look for all references to the following sections:
* 2.1.4.17 Algorithm for Noting That a File Has Been Modified
* 2.1.4.19 Algorithm for Noting That a File Has Been Accessed
* 2.1.4.18 Algorithm for Updating Duplicated Information
Also check the tables in Appendix A, which provide the update granularity of file time stamps (presumably for directory entries) for common Windows filesystems.
[1] https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-fsa/860b1516...
Going back to my initial message, I can't stress enough that this problem is at its worst when a file has multiple hardlinks. If a particular link in a directory wasn't the last link used to access the file, its duplicated metadata may have the wrong file size, access time, modify time, and change time (the latter is not reported by Python). As is, for the current implementation, I'd only rely on the basic attributes such as whether it's a directory or reparse point (symlink, mountpoint, etc) when using scandir() to quickly process a directory. For reliable stat information, call os.stat().
I do think, however, that os.scandir() can be improved in Windows without significant performance loss if it calls GetFileAttributesExW to get st_file_attributes, st_size, st_ctime (create time), st_mtime, and st_atime. This API call is relatively fast because it doesn't require opening the file via CreateFileW, which is one of the more expensive operations in os.stat(). But I haven't tried modifying scandir() to benchmark it.
Ultimately, I'm waiting for Windows 10 to provide a WinAPI function that calls the relatively new NTAPI function NtQueryInformationByName [2] (by name, not by handle!) to get the FileStatInformation, as well as for this information to be made available by handle via GetFileInformationByHandleEx. Compared to GetFileAttributesExW, the FileStatInformation additionally provides the file ID (if implemented by the filesystem), change time, reparse tag, number of links, and the effective access of the security context of the caller (i.e. process or thread access token). The latter is something that we've never implemented with os.stat(). It's not the same as POSIX owner-group-other permissions. It would need a new attribute such as st_effective_access. It could be used to provide a real implementation of os.access() in Windows.
[2] https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/ntifs/nf-ntifs...
On 19Oct2020 1846, Eryk Sun wrote:
On 10/19/20, Steve Dower steve.dower@python.org wrote:
On 15Oct2020 2239, Rob Cliffe via Python-Dev wrote:
TLDR: In os.scandir directory entries, atime is always a copy of mtime rather than the actual access time.
Correction - os.stat() updates the access time to _now_, while os.scandir() returns the last access time without updating it.
os.stat() shouldn't affect st_atime because it doesn't access the file data. That has me curious if it can be reproduced.
With NTFS in Windows 10, I'd expect the os.stat() st_atime to change immediately when the file data is read or modified. With other filesystems, it may not be updated until the kernel file object that was used to access the file's data is closed.
I thought I got my self-correction fired off quickly enough to save you from writing this :)
For details, download the [MS-FSA] PDF [1] and look for all references to the following sections:
[1] https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-fsa/860b1516...
Thanks for the detailed reference.
Going back to my initial message, I can't stress enough that this problem is at its worst when a file has multiple hardlinks. If a particular link in a directory wasn't the last link used to access the file, its duplicated metadata may have the wrong file size, access time, modify time, and change time (the latter is not reported by Python). As is, for the current implementation, I'd only rely on the basic attributes such as whether it's a directory or reparse point (symlink, mountpoint, etc) when using scandir() to quickly process a directory. For reliable stat information, call os.stat().
I do think, however, that os.scandir() can be improved in Windows without significant performance loss if it calls GetFileAttributesExW to get st_file_attributes, st_size, st_ctime (create time), st_mtime, and st_atime. This API call is relatively fast because it doesn't require opening the file via CreateFileW, which is one of the more expensive operations in os.stat(). But I haven't tried modifying scandir() to benchmark it.
Resolving the path is the most expensive part, even if the file is not opened (I've been working with the NTFS team on this area, and we've been benchmarking/analysing all of it). There are a few improvements coming across the board, but I'd much rather just emphasise that os.scandir() is as fast as we can manage using cached information (including as cached by the OS). Otherwise we prevent people from using the fastest available option when they can, if they don't need the additional information/accuracy.
Cheers, Steve
On 10/19/20, Steve Dower steve.dower@python.org wrote:
Resolving the path is the most expensive part, even if the file is not opened (I've been working with the NTFS team on this area, and we've been benchmarking/analysing all of it).
If you say it's been extensively benchmarked and there's no direct way around the speed bottleneck, then I take your word for it. To clarify what I had in mind, I was hoping that, because NTFS implements the fast I/O function FastIoQueryOpen [1] (via NtfsNetworkOpenCreate, as given by its FastIoDispatch table), IRP_MJ_CREATE would be bypassed and the filesystem would not incur a significant cost to parse the remaining path. I figured that most of the work would be in the ObOpenObjectByName and IopParseDevice executive calls that lead up to querying the filesystem.
Anyway, it's unfortunate that the Windows API doesn't support NT handle-relative names, except in the registry API. If we could call NTAPI NtQueryAttributesFile [2] directly, then the ObjectAttributes argument could be relative to a directory handle set in the RootDirectory field. That would eliminate the vast majority of the path-resolution cost. A handle-relative open or query goes straight to the filesystem device, which goes straight to the directory that contains the file.
To eliminate the cost of opening the directory handle, scandir() could be rewritten to use CreateFileW and GetFileInformationByHandleEx: FileIdBothDirectoryInfo [3] instead of FindFirstFileW / FindNextFileW. Just cache the directory handle in place of caching the find handle. scandir() would gain fd support in Windows. Opening a directory via os.open requires the flag _O_OBTAIN_DIR (0x2000), defined in fcntl.h.
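For comparison, here is roughly what descriptor-relative scanning looks like where Python already supports it. This is a sketch of the POSIX analogue only, not of the NTAPI mechanism described above; scan_relative is a hypothetical name, and on Windows (where os.scandir did not accept a file descriptor at the time of this thread) it simply falls back to full paths.

```python
import os

def scan_relative(directory):
    """Return {name: st_size}, using a directory descriptor when available."""
    can_use_fd = (
        os.scandir in os.supports_fd
        and os.stat in os.supports_dir_fd
    )
    if not can_use_fd:
        # Fallback: every os.stat() resolves the full path again.
        with os.scandir(directory) as entries:
            return {
                e.name: os.stat(e.path, follow_symlinks=False).st_size
                for e in entries
            }

    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        sizes = {}
        # scandir works on its own duplicate of the descriptor,
        # so dir_fd stays open and is closed by us below.
        with os.scandir(dir_fd) as entries:
            for entry in entries:
                # Looked up relative to dir_fd instead of from the root.
                st = os.stat(entry.name, dir_fd=dir_fd, follow_symlinks=False)
                sizes[entry.name] = st.st_size
        return sizes
    finally:
        os.close(dir_fd)
```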
FileIdBothDirectoryInfo provides the file ID, so the implementation would support the inode() method without calling stat(). It would still directly support is_dir() and is_file() based on the file attributes, and is_symlink() based on the file attributes and the EaSize field. The Windows Protocols document that the latter contains the reparse tag for a reparse point. The field is reused because a reparse point can't have extended attributes.
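To make the attribute logic concrete, here is a simplified sketch of classifying an entry from the raw attribute bits and reparse tag alone. classify_entry is a hypothetical helper, the tag value is copied from the Windows SDK so the code runs anywhere, and real code would have to consider other reparse tags (mount points and so on) as well.

```python
import stat

# Value from the Windows SDK; defined here so the sketch runs on any platform.
IO_REPARSE_TAG_SYMLINK = 0xA000000C

def classify_entry(file_attributes, reparse_tag=0):
    """Classify an entry as 'symlink', 'dir' or 'file' without calling stat().

    file_attributes: the FILE_ATTRIBUTE_* bit mask from the directory listing.
    reparse_tag: the tag carried in the EaSize field for reparse points.
    """
    is_reparse = bool(file_attributes & stat.FILE_ATTRIBUTE_REPARSE_POINT)
    if is_reparse and reparse_tag == IO_REPARSE_TAG_SYMLINK:
        return "symlink"
    if file_attributes & stat.FILE_ATTRIBUTE_DIRECTORY:
        return "dir"
    return "file"
```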
All that said, I don't prefer to call NtQueryAttributesFile or any other NTAPI function in Windows Python. I'd rather do the best possible with just the Windows API. I wish there were a new GetFileAttributesExExW function that supported handle-relative names. Even better would be a new function that calls NtQueryInformationByName -- something like GetFileInformationByName -- for FileStatInfo (and FileCaseSensitiveInfo as well, which is becoming more of an issue), also with support for handle-relative names.
[1] https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/wdm/ns-wdm-_fa...
[2] https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/wdm/nf-wdm-zwq...
[3] https://docs.microsoft.com/en-us/windows/win32/api/winbase/ns-winbase-file_i...
On 19/10/2020 12:42, Steve Dower wrote:
On 15Oct2020 2239, Rob Cliffe via Python-Dev wrote:
TLDR: In os.scandir directory entries, atime is always a copy of mtime rather than the actual access time.
Correction - os.stat() updates the access time to _now_, while os.scandir() returns the last access time without updating it.
Eryk replied with a deeper explanation of the cause, but fundamentally this is what you are seeing.
Feel free to file a bug, but we'll likely only add a vague note to the docs about how Windows works here rather than changing anything. If anything, we should probably fix os.stat() to avoid updating the access time so that both functions behave the same, but that might be too complicated.
Cheers, Steve
Sorry - what you say does not match the behaviour I observe, which is that
(1) Neither os.stat nor reading os.scandir directory entries updates any of the times on disk.
(2) The st_atime from os.stat is the "correct" time the file was last accessed.
(3) os.scandir always returns st_atime equal to st_mtime.
Modified demo program:
# osscandirtest.py
import time, os

print(f'[1] {time.time()=}')
with open('Test', 'w') as f:
    f.write('Anything\n')

time.sleep(20)

print(f'[2] {time.time()=}')
with open('Test', 'r') as f:
    f.readline() # Read the file

time.sleep(10)

print(f'[3] {time.time()=}')
print(os.stat('Test'))
for DirEntry in os.scandir('.'):
    if DirEntry.name == 'Test':
        stat = DirEntry.stat()
        print(f'scandir DirEntry {stat.st_ctime=} {stat.st_mtime=} {stat.st_atime=}')
print(os.stat('Test'))
for DirEntry in os.scandir('.'):
    if DirEntry.name == 'Test':
        stat = DirEntry.stat()
        print(f'scandir DirEntry {stat.st_ctime=} {stat.st_mtime=} {stat.st_atime=}')
print(f'[4] {time.time()=}')
Sample output:
[1] time.time()=1603166161.12121
[2] time.time()=1603166181.1306772
[3] time.time()=1603166191.1426473
os.stat_result(st_mode=33206, st_ino=9851624184951253, st_dev=2230120362, st_nlink=1, st_uid=0, st_gid=0, st_size=10, st_atime=1603166181, st_mtime=1603166161, st_ctime=1603166161)
scandir DirEntry stat.st_ctime=1603166161.12121 stat.st_mtime=1603166161.12121 stat.st_atime=1603166161.12121
os.stat_result(st_mode=33206, st_ino=9851624184951253, st_dev=2230120362, st_nlink=1, st_uid=0, st_gid=0, st_size=10, st_atime=1603166181, st_mtime=1603166161, st_ctime=1603166161)
scandir DirEntry stat.st_ctime=1603166161.12121 stat.st_mtime=1603166161.12121 stat.st_atime=1603166161.12121
[4] time.time()=1603166191.1426473
You will observe that
(1) The results from the two os.stat calls are the same, as are the results from the two scandir calls.
(2) The os.stat st_atime (1603166181) IS the time that the file was read with the
    with open('Test', 'r') as f: f.readline() # Read the file
line of code, as it matches the
    [2] time.time()=1603166181.1306772
line of output (apart from discarded fractions of a second) and is 20 seconds (not 30 seconds) after the file creation time, as expected.
(3) The os.scandir atime is a copy of mtime (and in this case, of ctime as well).
So it really does seem that the only thing "wrong" is that os.scandir returns atime as a copy of mtime, rather than the correct value. And since os.stat returns the "right" answer and os.scandir doesn't, it really seems that this is a bug, or at least a deficiency, in os.scandir.
Demo run on Windows 10 Home version 1903 OS build 18362.1139, Python version 3.8.3 (32-bit).
Best wishes
Rob Cliffe
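For anyone who wants to check a whole directory rather than one file, the comparison in the demo could be generalised along these lines. This is only a sketch; stat_discrepancies is a made-up name, and which fields differ (if any) will depend on the platform and filesystem.

```python
import os

def stat_discrepancies(directory, fields=("st_size", "st_mtime", "st_atime")):
    """Return {name: {field: (dir_entry_value, os_stat_value)}} for mismatches."""
    report = {}
    with os.scandir(directory) as entries:
        for entry in entries:
            cached = entry.stat(follow_symlinks=False)           # from the DirEntry
            fresh = os.stat(entry.path, follow_symlinks=False)   # from the file itself
            diffs = {
                name: (getattr(cached, name), getattr(fresh, name))
                for name in fields
                if getattr(cached, name) != getattr(fresh, name)
            }
            if diffs:
                report[entry.name] = diffs
    return report
```

An empty dict means the two sources agreed for every entry on the fields that were checked.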
On 20Oct2020 0520, Rob Cliffe wrote:
On 19/10/2020 12:42, Steve Dower wrote:
On 15Oct2020 2239, Rob Cliffe via Python-Dev wrote:
TLDR: In os.scandir directory entries, atime is always a copy of mtime rather than the actual access time.
Correction - os.stat() updates the access time to _now_, while os.scandir() returns the last access time without updating it.
Eryk replied with a deeper explanation of the cause, but fundamentally this is what you are seeing.
Feel free to file a bug, but we'll likely only add a vague note to the docs about how Windows works here rather than changing anything. If anything, we should probably fix os.stat() to avoid updating the access time so that both functions behave the same, but that might be too complicated.
Cheers, Steve
Sorry - what you say does not match the behaviour I observe, which is that
Yes, I posted a correction already (immediately after sending the first email).
What you are seeing is what Windows decided was the best approach. If you want to avoid that, os.stat() will get the latest available information. But I don't want to penalise people who don't need it by slowing down their scandir calls unnecessarily.
A documentation patch to make this difference between os.stat() and DirEntry even clearer would be fine.
Cheers, Steve
On Tue, Oct 20, 2020, at 07:42, Steve Dower wrote:
On 20Oct2020 0520, Rob Cliffe wrote:
On 19/10/2020 12:42, Steve Dower wrote:
On 15Oct2020 2239, Rob Cliffe via Python-Dev wrote:
TLDR: In os.scandir directory entries, atime is always a copy of mtime rather than the actual access time.
Correction - os.stat() updates the access time to _now_, while os.scandir() returns the last access time without updating it.
Eryk replied with a deeper explanation of the cause, but fundamentally this is what you are seeing.
Feel free to file a bug, but we'll likely only add a vague note to the docs about how Windows works here rather than changing anything. If anything, we should probably fix os.stat() to avoid updating the access time so that both functions behave the same, but that might be too complicated.
Cheers, Steve
Sorry - what you say does not match the behaviour I observe, which is that
Yes, I posted a correction already (immediately after sending the first email).
ok, see, the correction you posted doesn't address the part of your claim that people are taking issue with, which is that calling os.stat() causes the atime to be set to the time of the call to os.stat(). This is not the same thing as [correctly] saying that "calling os.stat() may return a more up-to-date atime, the time of the last read, write, or other operation", and the phrasing "updates the access time to _now_" certainly seemed unambiguous.
And at this point it's not clear to me whether you understand that people are reading your claim this way.
What correction, exactly, do you mean? The post I saw with the word "Correction" on it is the one that makes the claim people are taking issue with.
On Fri, Oct 23, 2020, at 02:14, Random832 wrote:
What correction, exactly, do you mean? The post I saw with the word "Correction" on it is the one that makes the claim people are taking issue with.
okay, sorry, I see the other correction post now...
My issue, I guess, was the same as Eryk Sun's: it wasn't clear which parts of the previous post you were correcting and which (if any) you stood by, since they were about the behavior of different parts of the system, so it didn't register as a correction to that part when I originally read it.
Le lun. 19 oct. 2020 à 13:50, Steve Dower steve.dower@python.org a écrit :
Feel free to file a bug, but we'll likely only add a vague note to the docs about how Windows works here rather than changing anything.
I agree that this surprising behavior can be documented. Attempting to provide accurate access time in os.scandir() is likely to slow down the function, which would defeat its whole purpose.
--
By the way, who relies on the access time? I don't understand why the creation and modification times are not enough for all usages. I would rather want to kill the whole concept of "access" time in operating systems (or just configure the OS to not update it anymore). I guess that it's really hard to make it efficient and accurate at the same time...
Linux has a "relatime" mount option (Fedora enables it by default): "With this option enabled, atime data is written to the disk only if the file has been modified since the atime data was last updated (mtime), or if the file was last accessed more than a certain amount of time ago (by default, one day)." Minor enhancement over always updating atime.
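The relatime rule as quoted can be written down in a few lines. This is a sketch of the description above rather than of the kernel code (which, as far as I know, also looks at ctime); the function name and the exact comparison operators are my own assumptions.

```python
ONE_DAY = 24 * 60 * 60  # default threshold in seconds

def relatime_should_update(atime, mtime, now, max_age=ONE_DAY):
    """Decide whether a read at time `now` should rewrite the stored atime."""
    # Rule 1: the file was modified at or after the recorded access time.
    if mtime >= atime:
        return True
    # Rule 2: the recorded access time is more than max_age old.
    return now - atime > max_age
```

So a file that is read many times within a day, and not modified in between, gets its atime written to disk only once.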
Night gathers, and now my watch begins. It shall not end until my death.
On 10/26/20, Victor Stinner vstinner@python.org wrote:
Le lun. 19 oct. 2020 à 13:50, Steve Dower steve.dower@python.org a écrit :
Feel free to file a bug, but we'll likely only add a vague note to the docs about how Windows works here rather than changing anything.
I agree that this surprising behavior can be documented. Attempting to provide accurate access time in os.scandir() is likely to slow down the function, which would defeat its whole purpose.
I don't think the access time (st_atime) is a significant concern. I'm concerned with the reliability of the file size (st_size) and last-write time (st_mtime) in stat() results. Developers are used to various filesystem policies on various platforms that limit when the access time gets updated, if at all. FAT32 filesystems only have an access date, and the driver in Windows fixes the access time at midnight. Updating the access time in NTFS and ReFS can be completely disabled at the system level; otherwise it's updated with a granularity of one hour if it's only the access time that would be updated.
The biggest concern for me is NTFS hardlinks, for which the st_size and st_mtime in the directory entry are unreliable. When a file with multiple hardlinks is modified, the filesystem only updates the duplicated information in the directory entry of the opened link. Because the entry in the directory doesn't include the link count or even a boolean value to indicate that a file has multiple hardlinks, if you don't know whether or not there's a possibility of hardlinks, then os.stat() is required in order to reliably determine st_size and st_mtime, to the extent that reliably knowing st_mtime is possible.
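As a rough illustration, the link count has to come from os.stat(), since the directory entry doesn't carry it. The helper below (files_with_multiple_links is a hypothetical name) lists the regular files in a directory that have more than one hard link.

```python
import os

def files_with_multiple_links(directory):
    """Return sorted names of regular files in directory with st_nlink > 1."""
    names = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue
            # The link count is not in the directory entry, so ask os.stat().
            if os.stat(entry.path, follow_symlinks=False).st_nlink > 1:
                names.append(entry.name)
    return sorted(names)
```

Note that this pays for an os.stat() call on every file, which is exactly the cost under discussion; it only shows where the information lives.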
A general problem that affects even os.stat() is that a modified file may only be noted by setting a flag (FO_FILE_MODIFIED) in the kernel file object of the particular open. Whether it's immediately noted in the last-write time of the shared FCB (file control block) is up to filesystem policy.
Starting with Windows 10 1809 (as noted in [MS-FSA]), NTFS immediately notes the modification time, so the st_mtime value from os.stat() is current. In prior versions of NTFS, and with other Microsoft filesystems such as FAT32, the last-write time is only noted when the file is flushed to disk via FlushFileBuffers (i.e. os.fsync) or when the open is closed.
This means that st_size may change without also changing st_mtime. I'm using Windows 10 2004 currently, so I can't show an NTFS example, but the following shows the behavior with FAT32:
import os, time

f = open('spam.txt', 'w')
st1 = os.stat('spam.txt')
time.sleep(10)
f.write('spam')
f.flush()  # pushes Python's buffer to the OS; FlushFileBuffers is not called
st2 = os.stat('spam.txt')
The above write was noted only by setting the FO_FILE_MODIFIED flag on the kernel file object. (The file object can be inspected with a local kernel debugger.) The write time wasn't noted in the FCB, i.e. st_mtime hasn't changed in st2:
>>> st2.st_size - st1.st_size
4
>>> st2.st_mtime - st1.st_mtime
0.0
The last-write time is noted when FlushFileBuffers (os.fsync) is called on the open:
>>> os.fsync(f.fileno())
>>> st3 = os.stat('spam.txt')
>>> st3.st_mtime - st1.st_mtime
10.0
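If a program needs the last-write time to be noted before it calls os.stat(), one option is to sync explicitly after writing. A minimal sketch, assuming the file is opened in append mode; append_and_sync is a made-up name.

```python
import os

def append_and_sync(path, text):
    """Append text, force it to disk, and return a fresh os.stat() result."""
    with open(path, "a") as f:
        f.write(text)
        f.flush()              # Python's buffer -> operating system
        os.fsync(f.fileno())   # operating system -> disk (FlushFileBuffers on Windows)
    return os.stat(path)
```

The explicit os.fsync() is redundant once the file is closed, but it matters if the file is kept open for a long time.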
Note also that, with NTFS, to the extent that the FCB metadata is current, calling os.stat() on a link updates the duplicated information in the directory entry. So calling os.stat() on a NTFS file may update the entry that's returned by a subsequent os.scandir() call.
On 27/10/20 8:24 am, Victor Stinner wrote:
I would rather want to kill the whole concept of "access" time in operating systems (or just configure the OS to not update it anymore). I guess that it's really hard to make it efficient and accurate at the same time...
Also it's kind of weird that just looking at data on the disk can change something about it. Sometimes it's an advantage to not have quantum computing!
-- Greg
On Tue, Oct 27, 2020 at 10:00 AM Greg Ewing greg.ewing@canterbury.ac.nz wrote:
On 27/10/20 8:24 am, Victor Stinner wrote:
I would rather want to kill the whole concept of "access" time in operating systems (or just configure the OS to not update it anymore). I guess that it's really hard to make it efficient and accurate at the same time...
Also it's kind of weird that just looking at data on the disk can change something about it. Sometimes it's an advantage to not have quantum computing!
And yet, it's of incredible value to be able to ask "now, where was that file... the one that I was looking at last week, called something about calendars, and it had a cat picture in it". Being able to answer that kinda depends on recording accesses one way or another, so the weirdnesses are bound to happen.
ChrisA
On Mon, Oct 26, 2020, 4:06 PM Chris Angelico rosuav@gmail.com wrote:
On Tue, Oct 27, 2020 at 10:00 AM Greg Ewing greg.ewing@canterbury.ac.nz wrote:
On 27/10/20 8:24 am, Victor Stinner wrote:
I would rather want to kill the whole concept of "access" time in operating systems (or just configure the OS to not update it anymore). I guess that it's really hard to make it efficient and accurate at the same time...
Also it's kind of weird that just looking at data on the disk can change something about it. Sometimes it's an advantage to not have quantum computing!
And yet, it's of incredible value to be able to ask "now, where was that file... the one that I was looking at last week, called something about calendars, and it had a cat picture in it". Being able to answer that kinda depends on recording accesses one way or another, so the weirdnesses are bound to happen.
scandir is never going to answer that. Neither is a simple blind "access" time stored in filesystem metadata.
ChrisA
Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-leave@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/ZMNVRGZ7... Code of Conduct: http://python.org/psf/codeofconduct/
Greg Ewing writes:
Also it's kind of weird that just looking at data on the disk can change something about it.
The "something about it" did change. The world is a dynamic entity, it does change. What you think is weird is that the metadata change is recorded.
Note: you can "fix" directory updates by mounting the filesystem r/o.
Sometimes it's an advantage to not have quantum computing!
I think effective encryption is a bigger one, myself. ;-)
On 10/28/20, Stephen J. Turnbull turnbull.stephen.fw@u.tsukuba.ac.jp wrote:
Note: you can "fix" directory updates by mounting the filesystem r/o.
Mounting the filesystem as readonly is the extreme case. Popular Unix systems support a "noatime" mount option that disables updating file access times, unless one of the other timestamps changes. In Windows, NTFS and ReFS support a system setting (but not per-volume) to disable updating access times -- "NtfsDisableLastAccessUpdate" and "RefsDisableLastAccessUpdate".
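On Unix the mount flags can be inspected from Python, which at least tells you which policy applies to a given path. This is a sketch that assumes os.statvfs and the ST_NOATIME / ST_RELATIME constants are available (they are on Linux); mount_atime_policy is a hypothetical name, and it does not cover the Windows registry settings mentioned above.

```python
import os

def mount_atime_policy(path):
    """Return 'noatime', 'relatime', 'other', or None if it can't be determined."""
    if not hasattr(os, "statvfs"):
        return None  # e.g. Windows, which has no os.statvfs
    flags = os.statvfs(path).f_flag
    # The constants are platform-specific, so fall back to 0 if missing.
    if flags & getattr(os, "ST_NOATIME", 0):
        return "noatime"
    if flags & getattr(os, "ST_RELATIME", 0):
        return "relatime"
    return "other"
```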