Removing an attribute from html with Regex
Stefan Behnel
stefan_ml at behnel.de
Thu Dec 30 03:53:53 EST 2010
Selvam, 30.12.2010 08:30:
> I have some HTML string which I would like to feed to BeautifulSoup.
>
> But, One malformed attribute breaks BeautifulSoup.
>
> <p style='terp_header' wrong_tag=' text1 ' text2 ' and 'para' '
> class='terp_header'> My String</p>
Didn't try with BS (and you forgot to say what "breaks" means exactly in
your case), but it parses in a somewhat reasonable way with lxml:
Python 3.2b2 (py3k:87572, Dec 29 2010, 21:25:38)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import lxml.html as H
>>> doc = H.fromstring('''
... <p style='terp_header' wrong_tag=' text1 ' text2 ' and 'para' '
... class='terp_header'> My String</p>
... ''')
>>> H.tostring(doc)
b'<p style="terp_header" wrong_tag=" text1 " text2 and \
class="terp_header"> My String</p>'
>>> doc.attrib
{'text2': '', 'and': '', 'style': 'terp_header', \
'wrong_tag': ' text1 ', 'class': 'terp_header'}
> I would like it to replace all the occurrences of that attribute with an
> empty string.
>
> I am unable to figure out the exact regex, which can do this job.
>
> This is what, I have managed so far,
>
> m = re.compile("rml_except='([^']*)")
I assume "rml_except" is the real name of the attribute?
You may be able to do this with a look-ahead expression, e.g.:
import re
# raw string, so that \s and \w reach the regex engine unchanged
replace = re.compile(r'(wrong_tag\s*=\s*[^>=]*)(?=>|\s+\w+\s*=)').sub
html_data = replace('', html_data)
The trick is to match everything up to the next point that looks
reasonable again, i.e. the ">" that closes the tag or the start of
another attribute (whitespace, a name and "=").
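
To see the look-ahead at work, here is a minimal sketch that runs the same pattern against the example markup from the original question. The helper name strip_wrong_tag and the second test string are made up for illustration. The attribute name "wrong_tag" is the placeholder used in this thread, so substitute the real one.

```python
import re

# Same pattern as above; the look-ahead stops the match right before
# either ">" or the next "name=" pair, without consuming it.
WRONG_TAG = re.compile(r'(wrong_tag\s*=\s*[^>=]*)(?=>|\s+\w+\s*=)')


def strip_wrong_tag(html_data):
    """Remove the malformed attribute (hypothetical helper name)."""
    return WRONG_TAG.sub('', html_data)


# The example markup from the original question.
html_data = ("<p style='terp_header' wrong_tag=' text1 ' text2 ' and 'para' '\n"
             "class='terp_header'> My String</p>")

cleaned = strip_wrong_tag(html_data)
print(repr(cleaned))

# Made-up case: the broken attribute is the last one in the tag,
# so the look-ahead matches on ">" instead of on another attribute.
print(strip_wrong_tag("<p wrong_tag=' a ' b '>x</p>"))
```

Note that [^>=]* cannot cross a "=" or ">" character, so this only works as long as the broken value itself contains neither. For input like that, parsing with lxml as shown further up and dropping the stray attributes afterwards is probably the safer route.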
Stefan