Using python to delta-load files into a central DB

Gabriel Genellina gagsl-py2 at yahoo.com.ar
Thu Apr 12 23:59:38 EDT 2007


En Thu, 12 Apr 2007 23:51:22 -0300, Chris Nethery <gilcneth at earthlink.net>  
escribió:

> Yes, they are tab-delimited text files that will change very little
> throughout the day.
> But, this is messy, antiquated 80s junk, nonetheless.

Ugh...

> Rows are designated either by a row type or they contain a comment.  Each
> row type has an identity value, but the 'comment' rows do not.  The  
> comment
> rows, however, are logically associated with the last occurring row type.
> When I generate my bulk insert file, I add the identity of the last
> occurring row type to the comment rows, and generate and populate an
> additional identity column in order to retain the order of the comments.
> Generally rows will either be added or changed, but sometimes rows will  
> be
> removed.  Typically, only 1-5 new rows will be added to a file in a given
> day, but users sometimes make manual corrections/deletions to older rows  
> and
> sometimes certain column values are recalculated.

http://tgolden.sc.sabren.com/python/win32_how_do_i/watch_directory_for_changes.html

You could keep a copy of all files - let's say, as they were yesterday.
When you want to process the changes, iterate over all files and see if  
they are newer than your copy. You could use the filecmp module:  
http://docs.python.org/lib/module-filecmp.html
For each modified file: load it, and process the comments adding the  
associated row type and the identity. Just do the same with the  
"yesterday" file. (I assume they're not so big that you can keep both in  
memory). You have then two lists of lines; then, use the functions in  
module difflib to detect the changed lines; based on those results,  
generate your database inserts/deletes/updates.

This way you will not process the unchanged files, and inside each file,  
you will ignore unchanged lines. At least in principle it should be faster  
than redoing all from scratch each time...

> Did I mention that the header contains another implied hierarchy?
> Fortunately, I can just ignore it and strip it off.

good - I imagine it's enough work as it is now...

-- 
Gabriel Genellina




More information about the Python-list mailing list