HTML Parser
Greg Jorgensen
gregj at pobox.com
Sat Dec 30 23:57:55 EST 2000
"Voitenko, Denis" <dvoitenko at qode.com> wrote:
> I am trying to write an HTML parser.
This has been done--look at the htmllib and sgmllib modules.
> I am starting off with a simple one
> like so:
>
> # html_parser.py
> import re
> import string
>
> newline=re.compile('\n')
> HTMLtags=re.compile('<.*>')
.* will match as many characters as possible, including (in your case)
< and >. You want this pattern, which will match as few characters as
possible surrounded by < and >:
HTMLtags = re.compile('<.*?>')
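To see the difference, here is a small sketch that runs both patterns over the same string. The sample string and the variable names are made up for illustration; they are not from your script.

```python
import re

# made-up sample text with two pairs of tags on one line
sample = '<b>bold</b> and <i>italic</i>'

greedy = re.compile('<.*>')   # .* grabs as much as it can
lazy = re.compile('<.*?>')    # .*? stops at the first > it finds

greedy_matches = greedy.findall(sample)
lazy_matches = lazy.findall(sample)

print(greedy_matches)  # ['<b>bold</b> and <i>italic</i>']
print(lazy_matches)    # ['<b>', '</b>', '<i>', '</i>']
```

The greedy pattern returns the whole line as one match because it runs from the first < to the last >.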
You can split using a literal character instead of a regular expression:
line = lines.split('\n')
The readlines() method of the file object will save you the trouble, but you
don't need to split the input into lines at all if you just want to find the
HTML tags.
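A rough sketch of both points, assuming the whole file has already been read into one string. The text and the names lines_list and tag_rx are invented for the example. I pass re.DOTALL so that . also matches a newline, which matters when a tag is broken across two lines.

```python
import re

# stand-in for the result of file.read(); note the <a ...> tag spans two lines
text = '<p>one</p>\n<a\nhref="x.html">two</a>\n'

# splitting on a literal newline needs no regular expression
lines_list = text.split('\n')

# search the whole text at once; re.DOTALL lets . match newlines too
tag_rx = re.compile('<.*?>', re.DOTALL)
tags = tag_rx.findall(text)

print(lines_list)
print(tags)
```

Working line by line would miss the <a ...> tag here, because neither line holds the complete tag.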
Here's my version:
# simple html tag processor
import sys
import re

# re.DOTALL lets . match newlines, so a tag that spans lines is still found
rx = re.compile('(<.*?>)', re.DOTALL)

# HTML text will come from a file.read()
html = '<html>\n<head>\n\t<title>Page Title</title>\n</head>\n<body>Hello, World!</body>\n</html>\n'

# split the text into tags and stuff between tags
# the re.split() creates empty list elements for adjacent matches--those can be ignored
# uppercase anything inside <..> and output the converted text
for s in rx.split(html):
    if s == '':  # re.split() artifact
        continue
    elif (s[0] == '<') and (s[-1] == '>'):  # <tag>
        sys.stdout.write(s.upper())
    else:  # everything else
        sys.stdout.write(s)
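If you want to reuse the loop, it could be wrapped in a function along these lines. This is only a sketch: the names TAG_RX and upcase_tags are mine, and it returns the converted text instead of writing it to stdout.

```python
import re

# the parentheses make re.split() keep the tags in the result list
TAG_RX = re.compile('(<.*?>)', re.DOTALL)

def upcase_tags(html):
    """Return html with everything inside <...> uppercased."""
    parts = []
    for s in TAG_RX.split(html):
        if s.startswith('<') and s.endswith('>'):  # <tag>
            parts.append(s.upper())
        else:  # text between tags, or an empty re.split() artifact
            parts.append(s)
    return ''.join(parts)

pieces = TAG_RX.split('<b><i>x</i></b>')
print(pieces)  # ['', '<b>', '', '<i>', 'x', '</i>', '', '</b>', '']
print(upcase_tags('<b><i>x</i></b>'))  # <B><I>x</I></B>
```

Note that this uppercases attribute values as well as tag names, so <a href="x.html"> comes out as <A HREF="X.HTML">.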
--
Greg Jorgensen
PDXperts
Portland, Oregon, USA
gregj at pobox.com