Using python to delta-load files into a central DB
Chris Nethery
gilcneth at earthlink.net
Fri Apr 13 10:21:15 EDT 2007
Gabriel,
I think that would work well. Thank you also for suggesting the filecmp
module. I have never used it, but it looks like a much better solution
than my previous approach--using os.stat and performing a DB lookup to
verify that the filename and timestamp existed in a 'file update' table.
And if the only limitation of difflib is that both files must reside in
memory, I should be fine: the largest of these files is just over 200 KB
and, if memory serves, they cannot exceed 4 MB. If I also spawn separate
processes to generate the delta files, I should be able to speed things
up even more.
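For the spawning part, I was thinking of something simple like this
(a rough, untested sketch; make_delta.py is a hypothetical worker
script that diffs one file against its old copy and writes the delta):

    import subprocess

    # One worker process per changed file (make_delta.py is a
    # hypothetical script); wait for all of them to finish before
    # bulk-loading the delta files into the DB.
    procs = [subprocess.Popen(['python', 'make_delta.py', name])
             for name in names_of_changed_files]
    for proc in procs:
        proc.wait()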
Thanks again for your help!
Best Regards,
Christopher Nethery
"Gabriel Genellina" <gagsl-py2 at yahoo.com.ar> wrote in message
news:mailman.6444.1176436788.32031.python-list at python.org...
> On Thu, 12 Apr 2007 23:51:22 -0300, Chris Nethery <gilcneth at earthlink.net>
> wrote:
>
>> Yes, they are tab-delimited text files that will change very little
>> throughout the day.
>> But this is messy, antiquated '80s junk, nonetheless.
>
> Ugh...
>
>> Rows are designated either by a row type or they contain a comment.
>> Each row type has an identity value, but the 'comment' rows do not.
>> The comment rows, however, are logically associated with the last
>> occurring row type. When I generate my bulk insert file, I add the
>> identity of the last occurring row type to the comment rows, and
>> generate and populate an additional identity column in order to
>> retain the order of the comments. Generally, rows will either be
>> added or changed, but sometimes rows will be removed. Typically,
>> only 1-5 new rows will be added to a file in a given day, but users
>> sometimes make manual corrections/deletions to older rows, and
>> sometimes certain column values are recalculated.
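For the record, the tagging step I described above looks roughly like
this (an untested sketch; is_comment and get_identity are stand-ins
for the real tab-delimited parsing):

    def tag_rows(rows):
        # Comment rows inherit the identity of the last row that
        # carried a row type; a running sequence number preserves
        # the original order of the comments.
        last_identity = None
        seq = 0
        for row in rows:
            fields = row.rstrip('\n').split('\t')
            if is_comment(fields):                    # hypothetical
                seq += 1
                yield fields + [last_identity, seq]
            else:
                last_identity = get_identity(fields)  # hypothetical
                seq = 0
                yield fields + [last_identity, None]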
>
> http://tgolden.sc.sabren.com/python/win32_how_do_i/watch_directory_for_changes.html
>
> You could keep a copy of all files - let's say, as they were yesterday.
> When you want to process the changes, iterate over all the files and
> see if they are newer than your copies. You could use the filecmp
> module:
> http://docs.python.org/lib/module-filecmp.html
> For each modified file: load it, and process the comments, adding the
> associated row type and the identity. Do the same with the "yesterday"
> file. (I assume they're not so big that you can't keep both in
> memory.) You then have two lists of lines; use the functions in the
> difflib module to detect the changed lines, and based on those
> results, generate your database inserts/deletes/updates.
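Here is roughly how I plan to wire those two pieces together (again an
untested sketch under your assumptions; the snapshot directory holds
yesterday's copies, refreshed with shutil.copy2 after each run so the
timestamps survive, and the actual SQL generation is elided):

    import difflib
    import filecmp
    import os

    def changed_files(live_dir, snapshot_dir):
        # filecmp.cmp(shallow=True) trusts matching os.stat()
        # signatures (type, size, mtime) and only reads file
        # contents when those disagree.
        for name in os.listdir(live_dir):
            live = os.path.join(live_dir, name)
            snap = os.path.join(snapshot_dir, name)
            if os.path.isfile(live) and (
                    not os.path.exists(snap)
                    or not filecmp.cmp(live, snap, shallow=True)):
                yield live, snap

    def delta(old_lines, new_lines):
        # SequenceMatcher opcodes map straight onto SQL statements:
        # 'delete' -> DELETE, 'insert' -> INSERT, 'replace' -> both.
        matcher = difflib.SequenceMatcher(None, old_lines, new_lines)
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op in ('replace', 'delete'):
                for line in old_lines[i1:i2]:
                    yield 'delete', line
            if op in ('replace', 'insert'):
                for line in new_lines[j1:j2]:
                    yield 'insert', line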
>
> This way you will not process the unchanged files, and within each
> file you will ignore the unchanged lines. At least in principle, it
> should be faster than redoing everything from scratch each time...
>
>> Did I mention that the header contains another implied hierarchy?
>> Fortunately, I can just ignore it and strip it off.
>
> Good - I imagine it's enough work as it is now...
>
> --
> Gabriel Genellina
>