How to track files processed
Martin A. Brown
martin at linux-ip.net
Mon Apr 18 13:56:12 EDT 2016
Greetings,
>If you are parsing files in a directory what is the best way to
>record which files were actioned?
>
>So that if i re-parse the directory i only parse the new files in
>the directory?
How will you know that the files are new?
If a file has exactly the same content as another file, but a
different name, is it new?
Often this depends on the characteristics of the system in which
your (planned) software is operating.
Peter Otten has also asked for some more context, which would help
us give you some tips that are more targeted to the problem you are
trying to solve.
But, I'll just forge ahead and make some assumptions:
* You are watching a directory for new/changed files.
* New files are appearing regularly.
* Contents of old files get updated and you want to know.
Have you ever seen an MD5SUMS file? Do you know what a content hash
is? You could find a place to store the content hash (a.k.a.
digest) of each file that you process.
Below is a program that should work in Python2 and Python3. You
could use this sort of approach as part of your solution. In order
to make sure you have handled a file before, you should store and
compare two things.
1. The filename.
2. The content hash.
Note: If you are sure the content is not going to change, then just
use the filename to track whether you have handled something or not.
How would you use this tracking info?
* Create a dictionary (or a set), e.g.:
      handled = dict()
      handled[('410c35da37b9a25d9b5d701753b011e5', 'setup.py')] = time.time()
Lasts only as long as the program runs. But, you will know
that you have handled any file by the tuple of its content hash
and filename.
* Store the filename (and/or digest) in a database. So many
options: sqlite, pickle, anydbm, text file of your own
crafting, SQLAlchemy ...
* Create a file, hardlink or symlink in the filesystem (in the
same directory or another directory), e.g.:
      trackingfile = os.path.join('another-directory', 'setup.py')
      with open(trackingfile, 'w') as f:
          f.write('410c35da37b9a25d9b5d701753b011e5')

    OR

      os.symlink('setup.py', '410c35da37b9a25d9b5d701753b011e5-setup.py')
Now, you can also examine your little cache of handled files to
compare for when the content hash changes. If the system is an
automated system, then this can be perfectly fine. If humans
create the files, I would suggest not doing this. Humans tend
to be easily confused by such things (and then want to delete
the files or just be intimidated by them; scary hashes!).
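For the database option above, here is a minimal sketch using the
standard-library sqlite3 module.  The table name and schema are my
own illustrative choices, not anything prescribed in this thread:

```python
import sqlite3
import time

# Illustrative schema: one row per (digest, filename) pair we have
# handled, with a timestamp.  Use a real file path instead of
# ':memory:' to make the tracking survive program restarts.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE IF NOT EXISTS handled '
             '(digest TEXT, fname TEXT, seen REAL, '
             'PRIMARY KEY (digest, fname))')

def already_handled(conn, digest, fname):
    # True if we have seen exactly this content under this name.
    row = conn.execute('SELECT 1 FROM handled WHERE digest=? AND fname=?',
                       (digest, fname)).fetchone()
    return row is not None

def mark_handled(conn, digest, fname):
    # Record the pair; INSERT OR IGNORE makes this idempotent.
    conn.execute('INSERT OR IGNORE INTO handled VALUES (?, ?, ?)',
                 (digest, fname, time.time()))
    conn.commit()

mark_handled(conn, '410c35da37b9a25d9b5d701753b011e5', 'setup.py')
```

The composite primary key mirrors the (content hash, filename) tuple
from the dictionary example, so a renamed copy or a changed file both
show up as "new".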
There are lots of options, but without some more context, we can
only make generic suggestions. So, I'll stop with my generic
suggestions now.
Have fun and good luck!
-Martin
#! /usr/bin/python

from __future__ import print_function

import os
import sys
import logging
import hashlib

logformat = '%(levelname)-9s %(name)s %(filename)s#%(lineno)s ' \
            '%(funcName)s %(message)s'
logging.basicConfig(stream=sys.stderr, format=logformat, level=logging.ERROR)
logger = logging.getLogger(__name__)


def hashthatfile(fname):
    contenthash = hashlib.md5()
    try:
        with open(fname, 'rb') as f:
            contenthash.update(f.read())
        return contenthash.hexdigest()
    except IOError as e:
        logger.warning("See exception below; skipping file %s", fname)
        logger.exception(e)
        return None


def main(dirname):
    for fname in os.listdir(dirname):
        # os.listdir() returns bare names; join with dirname so this
        # works for directories other than the current one.
        fpath = os.path.join(dirname, fname)
        if not os.path.isfile(fpath):
            logger.debug("Skipping non-file %s", fname)
            continue
        logger.info("Found file %s", fname)
        digest = hashthatfile(fpath)
        logger.info("Computed MD5 hash digest %s", digest)
        print('%s %s' % (digest, fname,))
    return os.EX_OK


if __name__ == '__main__':
    if len(sys.argv) == 1:
        sys.exit(main(os.getcwd()))
    else:
        sys.exit(main(sys.argv[1]))

# -- end of file
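One caveat about the program above: hashthatfile() reads each whole
file into memory before hashing it.  For large files you can feed the
hash in fixed-size chunks instead and keep memory use flat; a sketch
(the 64 KiB chunk size is an arbitrary choice of mine):

```python
import hashlib

def hashthatfile_chunked(fname, chunksize=65536):
    # Produces the same digest as hashing the whole file at once,
    # but reads only `chunksize` bytes at a time.
    contenthash = hashlib.md5()
    with open(fname, 'rb') as f:
        for chunk in iter(lambda: f.read(chunksize), b''):
            contenthash.update(chunk)
    return contenthash.hexdigest()
```

iter() with a sentinel of b'' keeps calling f.read(chunksize) until
end of file, so the loop needs no explicit break.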
--
Martin A. Brown
http://linux-ip.net/