SAX/Python : read an xml from the end to the top

Peter Hansen peter at engcorp.com
Tue Mar 7 13:25:21 CET 2006


kepioo wrote:
> The input XML I am parsing is always well formed. It comes from
> another application that appends to this XML. I didn't see the source
> code of the application, but I know that it is not rewriting the whole
> XML. I think it is just removing the closing root tag, adding the new
> tags and writing the </root> tag again.

If the writers had a clue, they probably just seek to the end of the 
file minus len('</root>') (or whatever) and then overwrite with the new 
entry and a fresh </root> closing tag.  At least, that's what seemed like 
the obvious approach when I had to do this once.
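
For what it's worth, here is a minimal sketch of that append trick. It
is only a guess at what the other application does: the function name
append_entry, the root tag name, and the tolerance for a little trailing
whitespace are all assumptions of mine, not anything taken from that
application.

```python
import os

CLOSING_TAG = b"</root>"  # assumed name of the root element


def append_entry(path, entry):
    """Overwrite the trailing </root> with `entry` followed by a new </root>.

    `entry` is the raw bytes of the new element (hypothetical format).
    """
    with open(path, "r+b") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        # Read a small window at the end so trailing newlines are tolerated.
        window = min(size, len(CLOSING_TAG) + 16)
        f.seek(size - window)
        tail = f.read(window)
        pos = tail.rfind(CLOSING_TAG)
        if pos == -1:
            raise ValueError("closing tag not found near end of file")
        # Position on the old closing tag and write over it.
        f.seek(size - window + pos)
        f.write(entry + b"\n" + CLOSING_TAG + b"\n")
        # Drop any leftover bytes past the new end of the document.
        f.truncate()
```

Nothing before the old closing tag is read or rewritten, so the cost of
an append does not grow with the size of the file.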

Not that this is particularly relevant to the problem. ;-)

> I guess I will parse it till I find the last reported event and update
> the output XML from there, reporting only the events I am interested
> in... I hope SAX won't take too much time to do all this... (let's say 1
> event = 10 tags, 5 events/minute, XML file running for 1 month -->
> 5400 000 opening tags)...
> 
> What do you think?

I think (guessing wildly) you probably have a fairly restricted number 
of possibilities being written to this file, possibly as simple as the 
somewhat stereotypical '<entry text="blah blah"/>' type of thing which 
I've seen lots of times.

If so, you can simply treat this as a text file which you process 
manually, in whatever direct and crude fashion works best, such as by 
seeking 1000 chars back from the end (assuming new entries are always 
less than that length), scanning for the last "<entry" string, and 
slicing and dicing till you find the stuff you need.
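
A rough sketch of that tail-scanning approach, assuming every entry is a
self-closing element like the '<entry .../>' example above and is
shorter than the tail window. The function name last_entry and its
parameters are made up for illustration.

```python
import os


def last_entry(path, marker=b"<entry", tail_size=1000):
    """Return the last complete '<entry .../>' found in the file's tail.

    Assumes self-closing entries shorter than `tail_size` bytes; returns
    None if no complete entry is found in the tail.
    """
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        # Back up at most tail_size bytes from the end of the file.
        f.seek(max(0, size - tail_size))
        tail = f.read()
    start = tail.rfind(marker)
    if start == -1:
        return None
    # Crude: takes the first '/>' after the marker as the end of the entry.
    end = tail.find(b"/>", start)
    if end == -1:
        return None
    return tail[start:end + 2]
```

From there, ordinary string slicing (or a small regular expression) on
the returned bytes pulls out the attribute values. If an attribute value
can itself contain '/>', this crude scan needs to be a bit smarter.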

In other words, screw SAX, just grab the data directly and forget about 
all those silly well-formed XML issues etc.  Go for the simplest thing 
that could possibly work, and if you don't need the complexity of SAX, 
don't use it.

-Peter



