[Tutor] Trying to parse a HUGE(1gb) xml file in python

Chris Fuller cfuller084 at thinkingplanet.net
Tue Dec 21 03:27:44 CET 2010


This isn't XML, it's an abomination of XML.  Best to not treat it as XML.  
Good thing you're only after one class of tags.  Here's what I'd do.  I'll 
give a general solution, but there are two parameters / four cases that could 
make the code simpler, I'll just point them out at the end.

Iterate over the file descriptor, reading in line-by-line.  This will be slow 
on a huge file, but probably not so bad if you're only doing it once.  It makes 
the rest easier.  Knuth has some sage advice on this point (*) :)  Some 
feedback on progress to the user can be helpful here, if it is slow.

Keep track of your offset into the file.  There are two ways: use the tell() 
method of the file descriptor (but you will have to subtract the length of the 
current line), or just add up the line lengths as you process them.

Scan each line for the open tag.  Add the offset to the tag to the offset within 
the file of  the current line, and push that to a stack.  Scan for the end tag, 
when you find one, pop an address from the stack, and put the two (start/end) 
addresses a list for later.  Keep doing this until you run out of file.

Now, take that list, and pull off the address-pairs; seek() and read() them 
directly.  Lather, rinse, repeat.

Some off-the-cuff untested code:

stk = []
yummydataaddrs = []

fileoff = 0

fd = open('ginormous.xml', 'r')
for line in fd:
    lineoff = line.index(start_tag)
    if fileoff != -1:
        stk.append(fileoff+lineoff)

    lineoff = line.index(end_tag)
    if lineoff != -1:
        yummydataaddr.append( (stk.pop(-1), fileoff+lineoff) )

    fileoff += len(line)

for start,end in yummydataaddrs:
    fd.seek(start)
    print "here's your stupid data:", fd.read(end-start+1)


You can simplify a bit if the tags are one a line by themselves, since you 
don't have to keep track of the offset with the line of the tag.  The other 
simplification is if they aren't nested.  You don't need to mess around with a 
stack in this case.


(*) "Premature optimization is the root of all evil."


Cheers


More information about the Tutor mailing list