Is it possible to combine handle_data and regular expressions?

ProvoWallis gshepherd281281 at yahoo.com
Fri Jan 20 00:44:20 CET 2006


Hi,

I've experimented with regular expressions to solve my problems in the
past but I have seen so many comments about HTMLParser and sgmllib that
I thought I would try a different approach this time so I tried using
HTMLParser.

I want to search through my SGML file for various strings of text and
find out what section they're in. What I have here does this to a
certain extent but I was wondering if I could make handle_data and
regular expressions work together to make this work a little better.

For instance, when I search for "above" as I am here, I just get
something like this: '174.114[1]':'above' but this isn't very useful
because I want to know the context of "above" (i.e., the information on
either side of it) and maybe even use a regular expression to filter
the search a little more.
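
To illustrate the kind of context I mean, here is a minimal sketch that
uses only the re module on a plain string. The helper name
find_with_context, its width parameter, and the sample sentence are all
hypothetical, made up for this example; none of it is part of the
script further down.

```python
import re

def find_with_context(text, word, width=20):
    """Return (before, match, after) tuples for each whole-word hit.

    `width` is the maximum number of characters kept on each side.
    """
    # re.escape guards against regex metacharacters in the search word;
    # \b restricts hits to whole words.
    pattern = re.compile(r'\b' + re.escape(word) + r'\b')
    hits = []
    for m in pattern.finditer(text):
        before = text[max(0, m.start() - width):m.start()]
        after = text[m.end():m.end() + width]
        hits.append((before, m.group(0), after))
    return hits

# Hypothetical sample text, for illustration only.
sample = "See the rules above for Pt 3, and the table above."
for before, word, after in find_with_context(sample, "above", width=10):
    print(repr(before) + " [" + word + "] " + repr(after))
```

Returning the pieces separately, rather than one joined string, would
make it easy to apply a second regular expression to just the
surrounding text.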

Any ideas?

As always, I'd appreciate feedback on my efforts.

Thanks,

Greg

###

from HTMLParser import HTMLParser
import os, re
root = raw_input("Enter the path where the program should run: ")
fname = raw_input("Enter name of the file: ")
print


given,ext = os.path.splitext(fname)

inputFile = open(os.path.join(root,fname), 'r')
data = inputFile.read()
inputFile.close()

class PartFinder(HTMLParser):

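     def __init__(self):
         HTMLParser.__init__(self)
         # Keep state per instance so separate parsers do not share results.
         self._main = None
         self._full = None
         self._secDict = {}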

     def found(self):
         return self._secDict

     def handle_starttag(self, tag, attrs):
         if tag == "sec-main":
              self._main = dict(attrs).get('no')
              self._full = self._main

         if tag == "sec-sub1":
              self._subone = dict(attrs).get('no')
              self._full = self._main + '[' + self._subone + ']'

         if tag == "sec-sub2":
              self._subtwo = dict(attrs).get('no')
              self._full = (self._main + '[' + self._subone + ']'
                            + '[' + self._subtwo + ']')


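     def handle_data(self, data):
         if "Pt" in data:
              # Check the same key that gets stored (_full, not _main), so
              # only the first hit in each section is recorded.
              if self._full not in self._secDict:
                   self._secDict[self._full] = [data]
                   print self._secDict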



if __name__ == "__main__":
     parser = PartFinder()
     parser.feed(data)
     x = parser.found()

     output_part = given + '.parts'
     outputFile = open(os.path.join(root,output_part), 'w')
     outputFile.write(str(x))
     outputFile.close()
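
One direction this could take, as a rough sketch rather than a finished
solution: let handle_data only buffer the text for the current section,
then run a regular expression over each section's text once parsing is
done and keep some characters on either side of every match. The tag
names and the 'no' attribute come from the script above. The class name
ContextFinder, the pattern and width parameters, and the sample markup
are assumptions made up for illustration. The try/except import is only
there so the sketch also runs where the module is named html.parser.

```python
import re
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2

class ContextFinder(HTMLParser):
    """Collect text per section, then report regex hits with context."""

    LEVELS = ("sec-main", "sec-sub1", "sec-sub2")

    def __init__(self, pattern, width=30):
        HTMLParser.__init__(self)
        self._regex = re.compile(pattern)
        self._width = width     # characters of context kept on each side
        self._numbers = []      # current 'no' values, outermost first
        self._text = {}         # section label -> list of text chunks

    def _label(self):
        # Build labels such as '174.114[1][2]'; None before any section.
        if not self._numbers:
            return None
        return self._numbers[0] + "".join("[%s]" % n for n in self._numbers[1:])

    def handle_starttag(self, tag, attrs):
        if tag in self.LEVELS:
            depth = self.LEVELS.index(tag)
            # Drop deeper levels left over from the previous section.
            # Assumes sections are properly nested in the input.
            number = dict(attrs).get("no") or "?"
            self._numbers = self._numbers[:depth] + [number]

    def handle_data(self, data):
        # Text may arrive in several pieces, so only buffer it here and
        # search after parsing has finished.
        self._text.setdefault(self._label(), []).append(data)

    def found(self):
        results = {}
        for label, chunks in self._text.items():
            text = " ".join("".join(chunks).split())   # normalise whitespace
            for m in self._regex.finditer(text):
                lo = max(0, m.start() - self._width)
                hi = m.end() + self._width
                results.setdefault(label, []).append(text[lo:hi])
        return results

# Hypothetical sample markup, for illustration only.
sample = ('<sec-main no="174.114"><p>General rules.</p>'
          '<sec-sub1 no="1"><p>See the limits above for Pt 3.</p></sec-sub1>'
          '</sec-main>')
parser = ContextFinder(r"\babove\b", width=12)
parser.feed(sample)
parser.close()   # flush any text the parser is still holding
print(parser.found())
```

Searching the buffered text instead of each handle_data call means a
match keeps its context even when the surrounding sentence is split by
inline tags. Like the script above, this sketch ignores end tags, so
text that follows a closed subsection is still filed under that
subsection.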



