Question about how to do something in BeautifulSoup?
Peter Otten
__peter__ at web.de
Fri Jan 22 10:22:54 EST 2016
inhahe wrote:
> I hope this is an appropriate mailing list for BeautifulSoup questions,
> it's been a long time since I've used python-list and I don't remember if
> third-party modules are on topic. I did try posting to the BeautifulSoup
> mailing list on Google groups, but I've waited a day or two and my message
> hasn't been approved yet.
>
> Say I have the following HTML (I hope this shows up as plain text here
> rather than formatting):
>
> <div style="font-size: 20pt;"><span style="color:
> #000000;"><em><strong>"Is today the day?"</strong></em></span></div>
>
> And I want to extract the "Is today the day?" part. There are other places
> in the document with <em> and <strong>, but this is the only place that
> uses color #000000, so I want to extract anything that's within a color
> #000000 style, even if it's nested multiple levels deep within that.
>
> - Sometimes the color is defined as RGB(0, 0, 0) and sometimes it's
> defined as #000000
> - Sometimes the <strong> is within the <em> and sometimes the <em> is
> within the <strong>.
> - There may be other discrepancies I haven't noticed yet
>
> How can I do this in BeautifulSoup (or is this better done in lxml.html)?
> Thanks
I don't see how to do this without a lot of glue code, but this may get you
started:
import re

import bs4


def recursive_attr(elem, path):
    # Follow a "/"-separated chain of tag names, e.g. "em/strong".
    for name in path.split("/"):
        if elem is None:
            break
        # Attribute access on a Tag is a shortcut for elem.find(name).
        elem = getattr(elem, name)
    return elem


def find(soup):
    for outer in soup.find_all(
            "span",
            style=re.compile(r"color:\s*(RGB\(0,\s*0,\s*0\)|#000000)")):
        for inner in [
                recursive_attr(outer, "strong/em"),
                recursive_attr(outer, "em/strong")]:
            # .string is None when the tag has more than one child.
            if inner is not None and inner.string is not None:
                yield inner.string


def normalize_ws(s):
    return " ".join(s.split())


html = ...
soup = bs4.BeautifulSoup(html, "html.parser")
for match in find(soup):
    print(normalize_ws(match))
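
If the nesting turns out to vary more than the two orders handled above, a
looser variant might be to take all the text inside each matching span with
get_text() instead of walking a fixed path. This is only a sketch: the names
BLACK and black_text are made up for illustration, the choice of "html.parser"
is an assumption, and the sample markup is the one from the question.

```python
import re

import bs4

# "color: #000000" or "color: rgb(0, 0, 0)", any case and spacing.
# The lookbehind skips properties such as "background-color".
BLACK = re.compile(
    r"(?<![-\w])color:\s*(rgb\(\s*0\s*,\s*0\s*,\s*0\s*\)|#000000)",
    re.IGNORECASE)


def black_text(soup):
    # Yield the whitespace-normalized text of every matching span,
    # however deeply the text is nested inside it.
    for span in soup.find_all("span", style=BLACK):
        yield " ".join(span.get_text().split())


html = ('<div style="font-size: 20pt;"><span style="color: #000000;">'
        '<em><strong>"Is today the day?"</strong></em></span></div>')
soup = bs4.BeautifulSoup(html, "html.parser")
for text in black_text(soup):
    print(text)
```

The trade-off is that this returns everything inside the span, including text
that is in neither <em> nor <strong>.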