[Tutor] Trying to parse a HUGE(1gb) xml file in python

ashish makani ashish.makani at gmail.com
Tue Dec 21 05:11:03 CET 2010


This block of code made my day - especially yummydataaddrs & "here's your
stupid data"

> for start,end in yummydataaddrs:
>    fd.seek(start)
>    print "here's your stupid data:", fd.read(end-start+1)

Nothing is more impressive than solid code with a good sense of humor.

Thanks for the code. Since I am in a time crunch, this approach might get
me what I need more quickly.

Thanks also for Knuth's awesome quote. It reminded me of my Stanford friend
who told me that Prof. Knuth still holds a Christmas tree lecture every
year. Unfortunately, in spite of being in the Bay Area this year, I missed it.

Thanks a ton


p.s. To everybody

OT (off-topic): I moved to the Bay Area recently & am passionate about
technology in general & Linux, Python, C, embedded, mobile, wireless.
I was wondering if any of you are part of a Bay Area Python (or other
tech) meetup, as in, do you meet up in person for tech talks, discussions,
brainstorming, or hack nights?
If yes, I would love to know more & be a part of it.

On Mon, Dec 20, 2010 at 9:27 PM, Chris Fuller <cfuller084 at thinkingplanet.net> wrote:

> This isn't XML, it's an abomination of XML.  Best to not treat it as XML.
> Good thing you're only after one class of tags.  Here's what I'd do.  I'll
> give a general solution, but there are two parameters / four cases that
> could make the code simpler; I'll just point them out at the end.
> Iterate over the file descriptor, reading in line-by-line.  This will be
> slow on a huge file, but probably not so bad if you're only doing it once.
> It makes the rest easier.  Knuth has some sage advice on this point (*) :)
> Some feedback on progress to the user can be helpful here, if it is slow.
> Keep track of your offset into the file.  There are two ways: use the
> tell() method of the file descriptor (but you will have to subtract the
> length of the current line), or just add up the line lengths as you
> process them.
> Scan each line for the open tag.  Add the offset of the tag within the
> line to the offset of the current line within the file, and push that onto
> a stack.  Scan for the end tag; when you find one, pop an address from the
> stack, and put the two (start/end) addresses in a list for later.  Keep
> doing this until you run out of file.
> Now, take that list, and pull off the address-pairs; seek() and read() them
> directly.  Lather, rinse, repeat.
> Some off-the-cuff untested code:
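> start_tag = '<item>'   # hypothetical tag names; substitute the real ones
> end_tag = '</item>'
> stk = []
> yummydataaddrs = []
> fileoff = 0
> fd = open('ginormous.xml', 'rb')   # binary mode keeps offsets in bytes
> for line in fd:
>    # find() returns -1 when the tag is absent; index() would raise ValueError
>    lineoff = line.find(start_tag)
>    if lineoff != -1:
>        stk.append(fileoff+lineoff)
>    lineoff = line.find(end_tag)
>    if lineoff != -1:
>        # end address is the last byte of the end tag
>        yummydataaddrs.append( (stk.pop(-1), fileoff+lineoff+len(end_tag)-1) )
>    fileoff += len(line)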
> for start,end in yummydataaddrs:
>    fd.seek(start)
>    print "here's your stupid data:", fd.read(end-start+1)
> You can simplify a bit if the tags are on a line by themselves, since you
> don't have to keep track of the offset within the line of the tag.  The
> other simplification is if they aren't nested.  You don't need to mess
> around with a stack in this case.
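
A rough sketch of the non-nested simplification mentioned above, in Python 3.
It assumes the tags never nest, that they are matched as literal byte strings
(so an open tag carrying attributes would need a different match), and that
the file is opened in binary mode. The names find_spans and read_spans are
hypothetical. With no nesting, a single "current start" variable replaces the
stack.

```python
def find_spans(path, start_tag, end_tag):
    """Return (start, end) byte offsets for each start_tag ... end_tag span.

    `end` is the offset of the last byte of the end tag, so a span is
    end - start + 1 bytes long. Tags must be bytes and must not nest.
    """
    spans = []
    start = None      # offset of the open tag still waiting for its end tag
    fileoff = 0       # offset of the current line within the file
    with open(path, 'rb') as fd:
        for line in fd:
            pos = 0   # search position within the current line
            while True:
                if start is None:
                    i = line.find(start_tag, pos)
                    if i == -1:
                        break
                    start = fileoff + i
                    pos = i + len(start_tag)
                else:
                    i = line.find(end_tag, pos)
                    if i == -1:
                        break
                    spans.append((start, fileoff + i + len(end_tag) - 1))
                    start = None
                    pos = i + len(end_tag)
            fileoff += len(line)
    return spans


def read_spans(path, spans):
    """Yield the bytes of each span by seeking straight to it."""
    with open(path, 'rb') as fd:
        for start, end in spans:
            fd.seek(start)
            yield fd.read(end - start + 1)
```

Only the list of offset pairs is held in memory, so the size of the file
matters for run time but not for memory use.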
> (*) "Premature optimization is the root of all evil."
> Cheers

*"We act as though comfort and luxury were the chief requirements of life,
when all that we need to make us happy is something to be enthusiastic
-- Albert Einstein*