searching and storing large quantities of xml!

Paul Rubin no.email at nospam.invalid
Sat Jan 16 20:14:05 EST 2010


dads <wayne.dads.bell at gmail.com> writes:
> I've been tidying up the archived XML and have been thinking about
> the best way to approach this, as it takes a long time to deal with
> big quantities of XML. I have 5-6 years' worth of 26000+ XML files
> per year, each 5-20k. The archived files are zipped, but what is
> better: 26000 files in one big zip file, 26000 files in one big zip
> file but organised into folders by month and day, or zip files
> inside zip files?

If I'm reading that properly, you have 5-6 years' worth of files, 26000
files per year, and 5-20k bytes per file?  At 10k bytes/file that's
about 1.3GB, which isn't all that much data by today's standards.
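For what it's worth, the back-of-the-envelope arithmetic behind that
figure (using the 10k-bytes-per-file average assumed above) is just:

>>> 5 * 26000 * 10000   # years * files/year * bytes/file
1300000000

i.e. roughly 1.3e9 bytes, or about 1.3 GB.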

> Generally the requests are for files less than 3 months old, so that
> got me thinking: should I create a script that finds all the file
> names and corresponding web numbers of the old XML and bungs them
> into a db table (one for each year), plus another script that
> archives the XML every day and, after 3 months, zips it up, bungs
> the info into a table, etc.? Sorry for the ramble; I just want other
> people's opinions on the matter. =)

Extract all the files and put them into some kind of indexed database
or search engine.  I've used Solr (http://lucene.apache.org/solr) for
this, and while it has limitations, it's fairly easy to set up and use
for basic purposes.
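
If a full search engine is more than you need, a minimal sketch of the
"indexed database" option, using only the standard library, might look
like the following.  The directory name, database path and lookup key
are hypothetical placeholders, and it simply stores the raw XML text
keyed by filename; for real full-text search you'd still want
something like Solr.

import os
import sqlite3

XML_DIR = "extracted_xml"   # hypothetical: directory of unzipped .xml files
DB_PATH = "xml_index.db"    # hypothetical: where the index database lives

conn = sqlite3.connect(DB_PATH)
conn.execute("""CREATE TABLE IF NOT EXISTS docs (
                    filename TEXT PRIMARY KEY,  -- primary key gives an index
                    body     TEXT)""")

for name in os.listdir(XML_DIR):
    if not name.endswith(".xml"):
        continue
    with open(os.path.join(XML_DIR, name)) as f:
        conn.execute("INSERT OR REPLACE INTO docs VALUES (?, ?)",
                     (name, f.read()))
conn.commit()

# Indexed lookup of a single document by filename.
row = conn.execute("SELECT body FROM docs WHERE filename = ?",
                   ("example.xml",)).fetchone()

From there it's easy to add columns for the date or the "web number"
mentioned above and query on those instead of the filename.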


