searching and storing large quantities of xml!

Steve Holden steve at holdenweb.com
Sat Jan 16 18:10:41 EST 2010


dads wrote:
> I work in 1st line support and Python is one of my hobbies. We get
> quite a few requests for XML from our website and it's a long,
> drawn-out process. So I thought I'd try to create a system that deals
> with it, for fun.
> 
> I've been tidying up the archived XML and have been thinking about
> the best way to approach this, as it took a long time to deal with
> big quantities of XML. We have 5-6 years' worth, at 26000+ XML files
> of 5-20k each per year. The archived stuff is zipped, but which is
> better: 26000 files in one big zip file, 26000 files in one big zip
> file but in folders for months and days, or zip files in zip files?
> 
> I created an app in wxPython that searches the unzipped XML files by
> modified date, opens each one and uses something like
> l.find('>%s<' % fiveDigitNumber) != -1. Is this quicker than parsing
> the XML?
> 
> Generally the requests are less than 3 months old, so that got me
> thinking: should I create a script that finds all the file names and
> corresponding web numbers of the old XML and bungs them into a db
> table, one for each year, and another script that archives the XML
> after every day and, after 3 months, zips it up, bungs the info into
> the table, etc.? Sorry for the ramble, I just want other people's
> opinions on the matter. =)

The first question I'd ask is what library you are using for the XML
processing. If you aren't using cElementTree it would definitely be
worth checking to see if it improves your processing speed. You can test
with ElementTree if you want, but cElementTree is a C extension module
with the same API, and therefore much faster.
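
As a rough illustration of the two approaches being compared, here is a
minimal sketch, not a definitive implementation. It assumes the number
you search for is the complete text of some element, which is what the
'>%s<' search implies. The function names, the sample document and its
element names are all made up for the example. On current Python
versions xml.etree.ElementTree already uses the C accelerator, so the
separate cElementTree import only matters on older interpreters.

```python
import io
import xml.etree.ElementTree as ET  # on old Pythons: import xml.etree.cElementTree as ET


def contains_number_text(xml_text, number):
    """Plain substring test, equivalent to the find() call in the question."""
    return ('>%s<' % number) in xml_text


def contains_number_parsed(xml_source, number):
    """Parse incrementally and stop at the first element whose text matches.

    xml_source is a file name or a binary file object.
    """
    target = str(number)
    for _event, elem in ET.iterparse(xml_source):
        if elem.text is not None and elem.text.strip() == target:
            return True
        elem.clear()  # drop the content of elements already checked
    return False


# Hypothetical sample documents; real files will have a different layout.
SAMPLE = "<order><webnumber>12345</webnumber></order>"
SPACED = "<order><webnumber> 12345 </webnumber></order>"

if __name__ == "__main__":
    for doc in (SAMPLE, SPACED):
        as_file = io.BytesIO(doc.encode("utf-8"))
        print(contains_number_text(doc, 12345),
              contains_number_parsed(as_file, 12345))
```

The substring test misses the second sample because of the whitespace
around the number, while the parsed version still finds it. Which one is
faster on your data is something to measure, for example with timeit,
on a sample of the real files.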

Fredrik Lundh wrote it, so it's pretty solid stuff (he was one of the
minds behind Python's regular expression engine).

regards
 Steve
-- 
Steve Holden           +1 571 484 6266   +1 800 494 3119
PyCon is coming! Atlanta, Feb 2010  http://us.pycon.org/
Holden Web LLC                 http://www.holdenweb.com/
UPCOMING EVENTS:        http://holdenweb.eventbrite.com/



