[Tutor] lstrip() question (Thanks all!)

Tim Johnson tim at johnsons-web.com
Tue Feb 3 11:42:45 EST 2004


Hello All:

I wish I had the time to thank everyone on this topic and respond to
all your comments, but I'm laboring over a hot keyboard and trying to
beat a deadline.

I solved my problem and got a great deal of informative input.

My thanks to all of you!
Regards
tim

* Danny Yoo <dyoo at hkn.eecs.berkeley.edu> [040202 18:00]:
> Just out of curiosity, why are you trying to do this?  Would it be
> possible to use something like HTMLParser?
> 
>     http://www.python.org/doc/lib/module-HTMLParser.html
> 
> I know it sounds like using the library might be overkill, but HTMLParser
> is meant to deal with the ugliness that is HTML.  It can handle some
> strange situations like
> 
> 
> ###
> s = """<br
>      ><Br/><bR       class="f<o><o>!">this is a test"""
> ###
> 
> 
> where a regular expression for this might be more subtle than we might
> expect.  (The example above is meant to be a nightmare case.  *grin*)
> 
> 
> Using a real HTML parser normalizes this wackiness so that we don't see
> it.  Here's a subclass of HTMLParser that shows how we might use it for
> the problem:
> 
> 
> ###
> from HTMLParser import HTMLParser
> 
> class IgnoreLeadingBreaksParser(HTMLParser):
>     def __init__(self):
>         HTMLParser.__init__(self)
>         self.seen_nonbreak_tag = False
>         self.text = []
> 
>     def get_text(self):
>         return ''.join(self.text)
> 
>     def handle_starttag(self, tag, attrs):
>         if tag != 'br':
>             self.seen_nonbreak_tag = True
>         if self.seen_nonbreak_tag:
>             self.text.append(self.get_starttag_text())
> 
>     def handle_endtag(self, tag):
>         if tag != 'br':
>             self.seen_nonbreak_tag = True
>         if self.seen_nonbreak_tag:
>             self.text.append('</%s>' % tag)
> 
>     def handle_data(self, data):
>         self.seen_nonbreak_tag = True
>         self.text.append(data)
> 
> 
> def ignore_leading_breaks(text):
>     parser = IgnoreLeadingBreaksParser()
>     parser.feed(text)
>     return parser.get_text()
> ###
> 
> 
> Note: this is not quite production-quality yet.  In particular, it doesn't
> handle comments or character references, so we may need to add more
> methods to the IgnoreLeadingBreaksParser so that it handles those cases
> too.
> 
> 
> Hope this helps!
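
For anyone finding this in the archive: below is a minimal sketch of how the
parser quoted above might be extended to cover the comments and character
references that the note at the end of the quoted message mentions. It assumes
Python 3, where the module is html.parser rather than HTMLParser. The _keep
helper, the handle_startendtag override (so a later <br/> is not emitted as
both a start and an end tag), and the choice to treat a comment as content
that ends the run of leading breaks are assumptions made for illustration,
not part of the original post.

```python
from html.parser import HTMLParser


class IgnoreLeadingBreaksParser(HTMLParser):
    """Rebuild the input markup, dropping <br> tags that precede any other content."""

    def __init__(self):
        # convert_charrefs=False so that entity and character references are
        # passed to the handlers below instead of being decoded into data.
        super().__init__(convert_charrefs=False)
        self.seen_nonbreak = False
        self.text = []

    def get_text(self):
        return ''.join(self.text)

    def _keep(self, piece):
        # Hypothetical helper: anything kept ends the "leading breaks" phase.
        self.seen_nonbreak = True
        self.text.append(piece)

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <Br> and <bR> both compare as 'br'.
        if tag == 'br' and not self.seen_nonbreak:
            return
        self._keep(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: emit the original text once only.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag == 'br' and not self.seen_nonbreak:
            return
        self._keep('</%s>' % tag)

    def handle_data(self, data):
        self._keep(data)

    def handle_comment(self, data):
        # Assumption: a comment counts as content, so it ends the leading phase.
        self._keep('<!--%s-->' % data)

    def handle_entityref(self, name):
        self._keep('&%s;' % name)

    def handle_charref(self, name):
        self._keep('&#%s;' % name)


def ignore_leading_breaks(text):
    parser = IgnoreLeadingBreaksParser()
    parser.feed(text)
    parser.close()
    return parser.get_text()
```

Like the quoted version, this treats any text between tags, including bare
whitespace, as content, so a newline between two leading <br> tags would stop
the stripping at that point.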

-- 
Tim Johnson <tim at johnsons-web.com>
      http://www.alaska-internet-solutions.com


