Documentation/Examples about the htmllib?
Doug Fort
dougfort at downright.com
Thu Mar 8 14:44:53 EST 2001
htmllib seems to want to prettyprint only.
To get actual tags, we use sgmllib. I've attached the module we use to
extract <form> tag stuff. Getting <area> tags should be similar.
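
For readers on Python 3, where sgmllib and htmllib no longer exist, the same
idea might be sketched with html.parser from the standard library. This is not
the attached module: the class name AreaTagParser, the helper area_queries, and
the sample markup are made up for illustration. It collects the attributes of
every <area> tag into a list and pulls the query part out of each href.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class AreaTagParser(HTMLParser):
    """Collect the attributes of every <area> tag as a list of dicts."""

    def __init__(self):
        super().__init__()
        self.areas = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "area":
            self.areas.append(dict(attrs))


def area_queries(html_text):
    """Return the query string of each <area href=...> found in html_text."""
    parser = AreaTagParser()
    parser.feed(html_text)
    parser.close()
    # Skip <area> tags that carry no href at all.
    return [urlsplit(a["href"]).query for a in parser.areas if a.get("href")]


# Hypothetical sample input modeled on the tag in the question.
sample = '<map name="m"><area shape="rect" href="/html/page.html?key"></map>'
print(area_queries(sample))  # prints ['key']
```

Keeping the whole attribute dict in parser.areas means the right tag can be
picked by any attribute (shape, alt, coords) before looking at its href.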
--
Doug Fort <dougfort at downright.com>
Senior Meat Manager
Downright Software LLC
http://www.dougfort.net
Hermann Himmelbauer wrote:
> Hi,
> I want to parse an HTML page with Python, so I thought using htmllib
> would be good for that task.
>
> In my html-page I have this tag:
> <area href="/html/page.html?key" .... >
>
> What I want to do is extract this link parameter "key".
>
> It would be perfect if htmllib would extract all the <area ...> tags
> into a list so that I could simply find the right tag and extract the key.
>
> I did manage to get data between a tag like <title> data </title> but could
> not get the data "in" the tag itself. Of course I could do this with a
> regular expression but I thought using this module would give me better
> results, what do you think?
>
> Does anyone have a clue?
>
> Best Regards,
> Hermann
>
> --
> ,_,
> (O,O) "There is more to life than increasing its speed."
> ( ) -- Gandhi
> -"-"--------------------------------------------------------------
-------------- next part --------------
#!/usr/bin/env python
"""
FormFieldParser
This object parses HTML text and builds a dictionary of
dictionaries of form fields
$Id: formfieldparser.py,v 1.1 2001/01/26 15:18:30 dougfort Exp $
"""
__author__="Downright Software LLC"
__version__="$Revision: 1.1 $"[11:-2]

# Python 2 code: sgmllib and cStringIO were removed in Python 3.
import sgmllib
import string
import cStringIO
import urllib
import re
import webnudge.util.misc
import webnudge.util.document

class FormFieldParserException:
    def __init__(self, message):
        self._message = message
    def __str__(self):
        return self._message

###########################################################
class FormFieldParser(sgmllib.SGMLParser):
###########################################################
    """
    FormFieldParser class. Parse a page from a website,
    creating a dictionary of dictionaries of form
    fields
    """
    #----------------------------------------------------------
    def __init__(self):
    #----------------------------------------------------------
        """
        Constructor
        """
        sgmllib.SGMLParser.__init__(self)
        self._formcount = 0
        self._formdict = {}
    #----------------------------------------------------------
    def parse(self, text):
    #----------------------------------------------------------
        """
        parse some text, without trashing javascript
        """
        self.feed(text)
        self.close()
        return self._formdict
    #----------------------------------------------------------
    def start_form(self,attributes):
    #----------------------------------------------------------
        """
        start a form
        """
        self._formdict[self._formcount] = {}
    #----------------------------------------------------------
    def end_form(self):
    #----------------------------------------------------------
        """
        end a form
        """
        self._formcount += 1
    #----------------------------------------------------------
    def _storeformfield(self,attributes,multivalue=0):
    #----------------------------------------------------------
        """
        Capture name and value attributes of a form field
        """
        tagname = None
        tagvalue = ""
        selected = 0
        # attributes is a list of (name, value) pairs from sgmllib
        for key, value in attributes:
            if key == "name":
                tagname = value
                continue
            if key == "value":
                tagvalue = value
                continue
            if key == "selected":
                selected = 1
                continue
        # multivalue fields are stored only when marked selected
        if multivalue and not selected:
            return
        if tagname:
            self._formdict[self._formcount][tagname] = tagvalue
    #----------------------------------------------------------
    def do_input(self,attributes):
    #----------------------------------------------------------
        """
        Capture <input> element
        """
        self._storeformfield(attributes)
    #----------------------------------------------------------
    def do_option(self,attributes):
    #----------------------------------------------------------
        """
        Capture <option> element
        """
        self._storeformfield(attributes, multivalue=1)
    #----------------------------------------------------------
    def do_select(self,attributes):
    #----------------------------------------------------------
        """
        Capture <select> element
        """
        self._storeformfield(attributes, multivalue=1)
    #----------------------------------------------------------
    def do_textarea(self,attributes):
    #----------------------------------------------------------
        """
        Capture <textarea> element
        """
        self._storeformfield(attributes)
if __name__ == "__main__":
#----------------------------------------------------------
"""
Code for commandline testing
"""
import sys
if len(sys.argv) != 2:
print "Usage: filteringparser.py <url>"
sys.exit(-1)
import webnudge.util.rawhtmlpage
page = webnudge.util.rawhtmlpage.RawHTMLPage()
page.load(sys.argv[1])
if not page:
print "*** Error *** %s" % (page._message)
sys.exit(-1)
result = FormFieldParser().parse(page._data)
sys.stdout.write(repr(result))
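
A rough Python 3 counterpart of the module above, assuming html.parser stands
in for sgmllib (which was removed in Python 3). The names FormFields and
sample_html are hypothetical, and only <input> is handled to keep the sketch
short. Fields found outside any <form> are skipped, which is a choice made
here and not something the original module does.

```python
from html.parser import HTMLParser


class FormFields(HTMLParser):
    """Build {form_index: {field_name: value}} from HTML text."""

    def __init__(self):
        super().__init__()
        self.forms = {}
        self._index = 0
        self._in_form = False

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            # Start a fresh dictionary for this form.
            self.forms[self._index] = {}
            self._in_form = True
        elif tag == "input" and self._in_form:
            fields = dict(attrs)
            name = fields.get("name")
            if name:
                # A missing or valueless "value" attribute is stored as "".
                self.forms[self._index][name] = fields.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._index += 1


# Hypothetical sample input with two forms.
sample_html = (
    '<form><input name="q" value="python"><input type="submit"></form>'
    '<form><input name="user"></form>'
)
parser = FormFields()
parser.feed(sample_html)
parser.close()
print(parser.forms)  # prints {0: {'q': 'python'}, 1: {'user': ''}}
```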