[Tutor] Trying to parse a HUGE(1gb) xml file in python
ashish makani
ashish.makani at gmail.com
Tue Dec 21 05:11:03 CET 2010
Chris
This block of code made my day - especially yummydataaddrs & "here's your
stupid data"
> for start,end in yummydataaddrs:
>     fd.seek(start)
>     print "here's your stupid data:", fd.read(end-start+1)
Nothing is more impressive than solid code, with a good sense of humor.
Thanks for the code. Since i am in a time crunch, this approach might get
me what i need more quickly.
Thanks also for Knuth's awesome quote. It reminded me of my Stanford friend
who told me that Prof. Knuth still holds a Christmas tree lecture every
year... unfortunately, in spite of being in the bay area this year, i missed
it :(
http://stanford-online.stanford.edu/seminars/knuth/101206-knuth-500.asx
Thanks a ton
cheers
ashish
p.s. To everybody
OT (off-topic): I moved to the bay area recently & am passionate about
technology in general & linux, python, c, embedded, mobile, wireless
stuff...
I was wondering if any of you are part of a bay area python (or other tech)
meetup, as in meeting up in person for tech talks / discussions /
brainstorming / hack nights?
If yes, i would love to know more & be a part of it.
On Mon, Dec 20, 2010 at 9:27 PM, Chris Fuller <cfuller084 at thinkingplanet.net> wrote:
>
> This isn't XML, it's an abomination of XML. Best to not treat it as XML.
> Good thing you're only after one class of tags. Here's what I'd do. I'll
> give a general solution, but there are two parameters / four cases that
> could make the code simpler; I'll just point them out at the end.
>
> Iterate over the file descriptor, reading in line-by-line. This will be
> slow on a huge file, but probably not so bad if you're only doing it once.
> It makes the rest easier. Knuth has some sage advice on this point (*) :)
> Some feedback on progress to the user can be helpful here, if it is slow.
>
> Keep track of your offset into the file. There are two ways: use the
> tell() method of the file descriptor (but you will have to subtract the
> length of the current line), or just add up the line lengths as you
> process them.
>
> Scan each line for the open tag. Add the offset of the tag within the line
> to the offset within the file of the current line, and push that onto a
> stack. Scan for the end tag; when you find one, pop an address from the
> stack, and put the two (start/end) addresses in a list for later. Keep
> doing this until you run out of file.
>
> Now, take that list, and pull off the address-pairs; seek() and read() them
> directly. Lather, rinse, repeat.
>
> Some off-the-cuff untested code:
>
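> stk = []
> yummydataaddrs = []
>
> fileoff = 0
>
> # start_tag and end_tag must already hold the tag strings you are after
> fd = open('ginormous.xml', 'rb')   # binary mode, so offsets are byte offsets
> for line in fd:
>     lineoff = line.find(start_tag)   # find() returns -1; index() would raise
>     if lineoff != -1:
>         stk.append(fileoff+lineoff)
>
>     lineoff = line.find(end_tag)
>     if lineoff != -1:
>         # inclusive address of the end tag's last character
>         yummydataaddrs.append( (stk.pop(-1), fileoff+lineoff+len(end_tag)-1) )
>
>     fileoff += len(line)
>
> for start,end in yummydataaddrs:
>     fd.seek(start)
>     print "here's your stupid data:", fd.read(end-start+1)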
>
>
> You can simplify a bit if the tags are on a line by themselves, since you
> don't have to keep track of the offset within the line of the tag. The
> other simplification is if they aren't nested. You don't need to mess
> around with a stack in this case.
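
For the non-nested case, the stack can be replaced by a single "currently open" start address. The following is a rough Python 3 sketch under stated assumptions, not the code from the thread: `find_spans`, `read_spans` and the sample data are hypothetical, and it assumes the tags are never nested and that at most one tagged region starts on any given line.

```python
import os
import tempfile


def find_spans(path, start_tag, end_tag):
    """Return (start, end) byte offsets of each start_tag...end_tag region.

    Assumes no nesting and at most one region starting per line.
    The end offset is exclusive (one past the end tag's last byte).
    """
    spans = []
    start = None   # address of the open tag we are currently inside, if any
    offset = 0     # byte offset of the current line within the file
    with open(path, "rb") as fd:
        for line in fd:
            search_from = 0
            if start is None:
                pos = line.find(start_tag)
                if pos != -1:
                    start = offset + pos
                    # Only look for the end tag after this open tag.
                    search_from = pos + len(start_tag)
            if start is not None:
                pos = line.find(end_tag, search_from)
                if pos != -1:
                    spans.append((start, offset + pos + len(end_tag)))
                    start = None
            offset += len(line)
    return spans


def read_spans(path, spans):
    """Seek to each recorded region and return its raw bytes."""
    chunks = []
    with open(path, "rb") as fd:
        for start, end in spans:
            fd.seek(start)
            chunks.append(fd.read(end - start))
    return chunks


# Tiny made-up sample standing in for the huge file.
sample = b"<root>\n<item>one</item>\njunk\n<item>\ntwo\n</item>\n</root>\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

spans = find_spans(sample_path, b"<item>", b"</item>")
chunks = read_spans(sample_path, spans)
os.remove(sample_path)
print(spans)
print(chunks)
```

Only the address pairs are held in memory, so the size of the file itself should not matter much; the regions are read back one at a time with seek() and read().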
>
>
> (*) "Premature optimization is the root of all evil."
>
>
> Cheers
*"We act as though comfort and luxury were the chief requirements of life,
when all that we need to make us happy is something to be enthusiastic
about."
-- Albert Einstein*