HTML Regular Expression
Michael Morrison
borlak at home.com
Wed Jun 14 08:15:10 EDT 2000
This is my first regular expression, and it's working okay, but I was
wondering if there was a better way to do this.
I read in an HTML document using urllib.urlopen, which returns the document
complete with tags. I don't want the tags, so I made this regular
expression:
reobj = re.compile(r"(<.*?>|&#\d.{1,3})")
text = reobj.sub('', text)
text = string.replace(text, '\012', '')
How does that look? The <.*?> is for the tags, the &#\d.{1,3} is for the
&#183; characters and the like. I was told all the &# codes are 3 digits,
and that is the only way I could get the regular expression to erase 'em.
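
For comparison, here is a minimal sketch of a tighter pattern. Numeric
character references are not always three digits (&#9; and &#8226; are both
valid), so the .{1,3} part can either leave a stray ";" behind or swallow
the text that follows a short code. The sketch below is written for a
current Python 3 rather than the Python of the original post; the names
strip_markup, TAG_RE and CHARREF_RE are made up for illustration, and it
still assumes simple markup with no ">" inside attribute values, which only
a real HTML parser handles properly.

```python
import re

# The pattern from the post, kept only to show where it goes wrong.
ORIGINAL_RE = re.compile(r"(<.*?>|&#\d.{1,3})")

# Tags: non-greedy, with DOTALL so a tag may span several lines.
TAG_RE = re.compile(r"<.*?>", re.DOTALL)
# Numeric character references: any number of decimal digits, or a hex
# form such as &#xB7;, always ending at the semicolon.
CHARREF_RE = re.compile(r"&#(?:\d+|[xX][0-9a-fA-F]+);")


def strip_markup(text):
    """Remove tags, numeric character references and newlines."""
    text = TAG_RE.sub("", text)
    text = CHARREF_RE.sub("", text)
    return text.replace("\n", "")


# A one-digit code: the original pattern also eats the "bc" after it.
short_code_original = ORIGINAL_RE.sub("", "a&#9;bcd")
short_code_fixed = strip_markup("a&#9;bcd")

# A four-digit code: the original pattern leaves the ";" behind.
long_code_original = ORIGINAL_RE.sub("", "x&#8226;y")
long_code_fixed = strip_markup("x&#8226;y")
```

Anchoring the match on the closing semicolon is the design choice here: the
pattern no longer needs to know how many digits a code has.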
One last question, a pretty newbie one :) I run a simple loop to keep my
program going...
while going:
    do_whatever
    time.sleep(2)
I do this in PythonWin, but it freezes up in Windows, and I can't use Ctrl-D
or Ctrl-C or anything to stop it without killing it. Is there a better
generic loop?
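
One possible shape for such a loop is sketched below. It assumes the freeze
comes from the loop blocking the thread that the PythonWin window itself
runs on, which is a guess and not something confirmed here. The loop runs in
a background thread and watches a threading.Event, so it can be stopped from
outside. The names run_loop, stop_event and the do_whatever callback are
hypothetical, and the code targets a current Python 3.

```python
import threading
import time


def run_loop(stop_event, do_whatever, interval=2.0):
    """Call do_whatever repeatedly until stop_event is set."""
    while not stop_event.is_set():
        do_whatever()
        # wait() returns as soon as the event is set, so a stop request
        # does not have to sit out the rest of the pause.
        stop_event.wait(interval)


# Demonstration with a short interval so it finishes quickly.
calls = []
stop_event = threading.Event()
worker = threading.Thread(
    target=run_loop,
    args=(stop_event, lambda: calls.append(1), 0.01),
    daemon=True,  # do not keep the interpreter alive for this thread
)
worker.start()

time.sleep(0.05)          # the caller stays free to do other things
stop_event.set()          # ask the loop to stop
worker.join(timeout=1.0)  # wait for it to finish
```

Calling stop_event.set() from the interactive prompt plays the role that
Ctrl-C would play in a console session.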
Thanks for any help and comments!