[Tutor] splice a string object based on embedded html tag...

Fri Jan 30 05:58:12 EST 2004

On Fri, 30 Jan 2004, Stella Rockford wrote:

> I am parsing some html apart with sgmllib the end result is to feed an
> RSS with info scraped with pycurl and whatever...
>
> when I run sgmllib.py on the html file object I am returned an extremely
> long list of "pieces"
>
> this is good, however,
>
> there is a lot of data I have no use for and it slows the parser down

Hi Stella,

Before trying to optimize things further: have you already looked into
PyXML?

    http://pyxml.sourceforge.net/

According to the web site, the PyXML project includes a module called
'sgmlop' that accelerates sgmllib by about a factor of 5.  Installing
PyXML adds a module called 'xml.parsers.sgmllib', which should ideally be
a drop-in replacement for the one in the Standard Library.

So before going further, check whether that enhanced version of sgmllib
fixes your performance problems.  Once you have PyXML installed, it
should just be a matter of replacing the original import:

    import sgmllib

with the replacement from the PyXML project:

    from xml.parsers import sgmllib

Give it a whirl and tell us if it helps.  Also, how are you using sgmllib?
Are you keeping an intermediate list of chunks, or keeping state as you
scan across the file?  Computers are pretty darn fast today, and I can't
imagine parsing HTML being that slow, unless the file is tremendous.
*grin*

> I would like to SPLICE everything before and after this table off of the
> file object this would be the first operation on the object, but when I
> looked up string's methods I couldn't quite find what i am looking for
> to do this.

String slicing (Python's name for what you're calling "splicing") should
work, though I'm not sure how easily it will apply to your problem.  To
slice the table out, you'll first need the file's contents as a string,
and then the index positions where the target table begins and ends.

Strings support a 'find()' method that, given a substring to search for,
returns the index at which that substring first occurs:

###
>>> s = "hello world"
>>> s.find('world')
6
>>> s.find('not here')
-1
###

Note that when find() can't find the substring, it returns -1.
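
To tie find() to slicing, here is a minimal sketch of cutting a table out
of a page.  The helper name 'extract_table' and the sample 'page' string
are made up for illustration, and I'm assuming the table you want is the
first one in the file, is not nested inside another table, and is written
with lowercase tags (find() is case-sensitive):

```python
def extract_table(html, start_tag="<table", end_tag="</table>"):
    """Return the first start_tag ... end_tag span of html, or None."""
    start = html.find(start_tag)
    if start == -1:
        # No opening tag at all.
        return None
    # Look for the closing tag only after the opening one.
    end = html.find(end_tag, start)
    if end == -1:
        return None
    # Slice from the opening tag through the end of the closing tag.
    return html[start:end + len(end_tag)]


# Hypothetical sample input standing in for the scraped page.
page = ("<html><body><p>junk</p>"
        "<table><tr><td>data</td></tr></table>"
        "<p>more junk</p></body></html>")
table = extract_table(page)
```

The result can then be fed to the parser in place of the whole page.  If
the real page has several tables, a more specific 'start_tag' (say, one
that includes a distinguishing attribute) may be needed.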

So you may be able to localize the table with some string searching.
More sophisticated string searching may involve something like the
"regular expression" library:

    http://www.amk.ca/python/howto/regex/

and if you are fairly sure what the table looks like, perhaps you can use
regexes to yank it out.
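
As a rough sketch of the regex approach, using the standard 're' module:
the pattern and the names 'TABLE_RE' and 'find_first_table' below are my
own guesses at what might fit, not something tailored to your file.  It
assumes the table is not nested inside another one, since a non-greedy
match stops at the first closing tag it sees.

```python
import re

# Match from an opening <table ...> tag to the nearest </table>.
# IGNORECASE accepts <TABLE>; DOTALL lets '.' span newlines.
TABLE_RE = re.compile(r"<table\b.*?</table>", re.IGNORECASE | re.DOTALL)


def find_first_table(html):
    """Return the text of the first table in html, or None if there is none."""
    match = TABLE_RE.search(html)
    if match is None:
        return None
    return match.group(0)


# Hypothetical sample input with uppercase tags and embedded newlines.
page = '<p>junk</p><TABLE border="1">\n<tr><td>data</td></tr>\n</TABLE><p>more</p>'
found = find_first_table(page)
```

Compared with plain find(), this copes with uppercase tags and attributes
on the opening tag, at the cost of a pattern that is harder to read.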

Anyway, I hope this helps.  Good luck to you!