[Tutor] lstrip() question (Thanks all!)
Tim Johnson
tim at johnsons-web.com
Tue Feb 3 11:42:45 EST 2004
Hello All:
I wish I had the time to thank everyone on this thread and respond to
all your comments, but I'm laboring over a hot keyboard and trying to
beat a deadline.
I solved my problem and received a great deal of
informative input.
My thanks to all of you!
Regards
tim
* Danny Yoo <dyoo at hkn.eecs.berkeley.edu> [040202 18:00]:
> Just out of curiosity, why are you trying to do this? Would it be
> possible to use something like HTMLParser?
>
> http://www.python.org/doc/lib/module-HTMLParser.html
>
> I know it sounds like using the library might be overkill, but HTMLParser
> is meant to deal with the ugliness that is HTML. It can handle some
> strange situations like
>
>
> ###
> s = """<br
> ><Br/><bR class="f<o><o>!">this is a test"""
> ###
>
>
> where a regular expression for this might be more subtle than we might
> expect. (The example above is meant to be a nightmare case. *grin*)
>
>
> Using a real HTML parser normalizes this wackiness so that we don't see
> it. Here's a subclass of HTMLParser that shows how we might use it for
> the problem:
>
>
> ###
> from HTMLParser import HTMLParser
>
> class IgnoreLeadingBreaksParser(HTMLParser):
>     def __init__(self):
>         HTMLParser.__init__(self)
>         self.seen_nonbreak_tag = False
>         self.text = []
>
>     def get_text(self):
>         return ''.join(self.text)
>
>     def handle_starttag(self, tag, attrs):
>         if tag != 'br':
>             self.seen_nonbreak_tag = True
>         if self.seen_nonbreak_tag:
>             self.text.append(self.get_starttag_text())
>
>     def handle_endtag(self, tag):
>         if tag != 'br':
>             self.seen_nonbreak_tag = True
>         if self.seen_nonbreak_tag:
>             self.text.append('</%s>' % tag)
>
>     def handle_data(self, data):
>         self.seen_nonbreak_tag = True
>         self.text.append(data)
>
>
> def ignore_leading_breaks(text):
>     parser = IgnoreLeadingBreaksParser()
>     parser.feed(text)
>     return parser.get_text()
> ###
>
>
> Note: this is not quite production-quality yet. In particular, it doesn't
> handle comments or character references, so we may need to add more
> methods to the IgnoreLeadingBreaksParser so that it handles those cases
> too.
>
>
> Hope this helps!
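
For anyone following up on the note about comments and character references,
here is a rough sketch of what the extra methods might look like. It is not
Danny's code. It assumes Python 3, where the HTMLParser module lives at
html.parser, and the names LeadingBreakStripper and strip_leading_breaks are
made up for illustration. Treating a comment or a reference as "real content"
that ends the run of leading breaks is also an assumption.

```python
from html.parser import HTMLParser  # Python 3 location of the old HTMLParser module


class LeadingBreakStripper(HTMLParser):
    """Hypothetical variant of the quoted parser: drops leading <br> tags,
    passes everything else through, including comments and references."""

    def __init__(self):
        # convert_charrefs=False makes the parser call handle_entityref and
        # handle_charref instead of folding references into handle_data.
        super().__init__(convert_charrefs=False)
        self.seen_content = False
        self.parts = []

    def get_text(self):
        return ''.join(self.parts)

    def handle_starttag(self, tag, attrs):
        if tag != 'br':
            self.seen_content = True
        if self.seen_content:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: emit the original text once.
        # The default would also call handle_endtag and add a stray </br>.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag != 'br':
            self.seen_content = True
        if self.seen_content:
            self.parts.append('</%s>' % tag)

    def handle_data(self, data):
        self.seen_content = True
        self.parts.append(data)

    def handle_comment(self, data):
        # Assumption: a comment counts as content and is kept verbatim.
        self.seen_content = True
        self.parts.append('<!--%s-->' % data)

    def handle_entityref(self, name):
        # Named reference such as &amp; (re-emitted with a trailing ';').
        self.seen_content = True
        self.parts.append('&%s;' % name)

    def handle_charref(self, name):
        # Numeric reference such as &#169; or &#xA9;.
        self.seen_content = True
        self.parts.append('&#%s;' % name)


def strip_leading_breaks(text):
    parser = LeadingBreakStripper()
    parser.feed(text)
    parser.close()  # flush anything the parser is still buffering
    return parser.get_text()
```

Calling strip_leading_breaks('<br><br>hello') should give back 'hello'. Note
that, as in the quoted version, whitespace between two leading breaks is
treated as data and ends the skipping, so real input may need an extra check
in handle_data.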
--
Tim Johnson <tim at johnsons-web.com>
http://www.alaska-internet-solutions.com