[Tutor] ElementTree: finding a tag with specific attribute

Kent Johnson kent37 at tds.net
Sat Sep 17 04:21:16 CEST 2005


Kent Johnson wrote:
>>Bernard Lebel wrote:
>>>Btw in case you wonder, I don't use BeautifulSoup because somehow it
>>>takes 20-30 seconds to parse a 2000-line xml file, and I don't know
>>>why. ElementTree is proving to perform very well.
> 
> I took a bit of a look at this using the Python profiler. The most
> notable thing is the staggering number of times some functions are
> called. The first column (ncalls) is the total number of calls of a
> function. The second column (tottime) is the total time spent in the
> function, not counting the time spent in lower-level functions.
> 
> If you look at the list, for a while the functions are being called
> 777 times. This is probably the number of start tags in the document.
> But when you get to recursiveChildGenerator(), all of a sudden it is
> called 898655 times, over 1000 times for each call to _fetch()! That
> works out to 8 calls for every character in the file!
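
For anyone who wants to produce this kind of ncalls/tottime listing themselves, here is a minimal sketch. It assumes the standard library's cProfile and pstats modules; work() and leaf() are made-up stand-ins for the parsing code, not anything from BS.

```python
import cProfile
import io
import pstats


def leaf():
    # Hypothetical cheap function that gets called many times.
    return 1


def work(n):
    # Hypothetical driver: calls leaf() n times.
    return sum(leaf() for _ in range(n))


def profile_report(func, *args):
    """Run func under the profiler and return the stats table as text."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Sort by time spent in the function itself, excluding callees.
    stats.sort_stats("tottime").print_stats(10)
    return out.getvalue()


if __name__ == "__main__":
    print(profile_report(work, 1000))
```

Reading down the ncalls column of the output is how a function that is called far more often than its callers stands out.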

I looked at this again, and there is a bug in BeautifulSoup (BS) that causes this behaviour. It's an interesting bug: a side effect of the way BS uses introspection to access child tags.

The problem begins at line 790:
        isResetNesting = self.RESET_NESTING_TAGS.has_key(name)

This looks innocent, but self.RESET_NESTING_TAGS is not defined. The failed lookup falls through to BeautifulStoneSoup.__getattr__(), which calls Tag.__getattr__(), which in turn searches for a child tag called RESET_NESTING_TAGS. I think Bernard's file suffers so badly because some of its tags have quite a few child tags: each time a tag is created, the list of existing tags is searched again.
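
Here is a rough sketch of the mechanism, not the actual BS code. Node, Parser, SPECIAL_TAGS and add_child() are all made-up names, and the child search is a flat loop rather than BS's recursive one. It only shows how a missing attribute turns every tag creation into a scan of the existing children.

```python
class Node:
    """Hypothetical stand-in for a tag class; not the real BS Tag."""

    searches = 0     # how many times the fallback lookup ran
    comparisons = 0  # how many children were examined in total

    def __init__(self, name):
        self.name = name
        self.children = []

    def __getattr__(self, attr):
        # Python calls this only when normal attribute lookup fails.
        # Here it treats the attribute name as a child tag name.
        Node.searches += 1
        for child in self.children:
            Node.comparisons += 1
            if child.name == attr:
                return child
        raise AttributeError(attr)


class Parser(Node):
    # SPECIAL_TAGS is deliberately NOT defined, mimicking the bug.

    def add_child(self, name):
        # The missing attribute sends this lookup through __getattr__,
        # which scans every child added so far.
        special = name in getattr(self, "SPECIAL_TAGS", {})
        self.children.append(Node(name))
        return special


if __name__ == "__main__":
    parser = Parser("root")
    for i in range(100):
        parser.add_child("tag%d" % i)
    print(Node.searches, Node.comparisons)
```

In this sketch the number of comparisons grows with the square of the number of children, which is the same flavour of blow-up as the call counts in the profile.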

Anyway I don't have time for a longer explanation right now. The fix is really simple - just add the line
    RESET_NESTING_TAGS = {}
after line 586.
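
To see why a one-line class attribute is enough, here is a small sketch with made-up classes (TagLike, BrokenParser and FixedParser are hypothetical, not BS names). Once the attribute exists on the class, normal lookup finds it and __getattr__() is never reached.

```python
class TagLike:
    """Hypothetical base class that counts fallback attribute lookups."""

    fallback_calls = 0

    def __getattr__(self, attr):
        # Reached only when the attribute is not found the normal way.
        TagLike.fallback_calls += 1
        raise AttributeError(attr)


class BrokenParser(TagLike):
    # RESET_NESTING_TAGS is missing, so every access falls through.
    pass


class FixedParser(TagLike):
    # Defining the attribute on the class short-circuits the fallback.
    RESET_NESTING_TAGS = {}


def probe(parser, n):
    """Access the attribute n times and report how often the fallback ran."""
    TagLike.fallback_calls = 0
    for _ in range(n):
        getattr(parser, "RESET_NESTING_TAGS", {})
    return TagLike.fallback_calls


if __name__ == "__main__":
    print(probe(BrokenParser(), 50), probe(FixedParser(), 50))
```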

I'll send a bug report to the author of BS.

Kent


