[Tutor] beautiful soup raw text workarounds?
nathan Smith
nathan-tech at hotmail.com
Wed Aug 25 14:39:11 EDT 2021
Hi! I sure can :)
So, as stated previously:
    tags = soup.find_all()
will get you a list of the tags in some HTML text. However, raw text,
i.e. that which is not in a tag, is something else and is not returned.
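To make that concrete, here is a small sketch showing that find_all() hands back Tag objects only. The sample HTML string and the variable names are made up for illustration, and it assumes bs4 is installed and that the standard library's "html.parser" is an acceptable parser:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample document, only for illustration
sample_html = "<body>loose text<p>in a tag</p></body>"
soup = BeautifulSoup(sample_html, "html.parser")

tags = soup.find_all()               # no arguments: every tag in the document
tag_names = [t.name for t in tags]   # the tag names, in document order
only_tags = all(isinstance(t, Tag) for t in tags)

print(tag_names)
print(only_tags)
```

The "loose text" string does not show up as an item of its own, which is the gap the rest of this message works around.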
The method I used, and will explain below, is probably unnecessary, as
BeautifulSoup arranges itself in a tree-like structure (so to access the
body tag it's soup.html.body), but for my purposes what I did was:
1. Run from the top of the tree downward, collecting children on the way,
and compile them into a list:
    import bs4

    def extract_tags(element):
        t = [element]  # include the parent object
        # Comment and Stylesheet are NavigableString subclasses, so one
        # isinstance check covers every kind of string node
        if isinstance(element, bs4.element.NavigableString):
            return t  # these do not and cannot have children
        for child in element.children:
            t.extend(extract_tags(child))
        return t
The function above recursively gets all the elements from a parent so to
get all the elements (elements being tags and raw strings) you simply do:
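    from bs4 import BeautifulSoup
    # naming the parser explicitly avoids the "no parser specified" warning
    soup = BeautifulSoup(your_html_code, "html.parser")
    full_list = extract_tags(soup)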
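For anyone who wants to try the traversal end to end, here is a minimal self-contained sketch. The sample HTML and the names sample_html and kinds are invented for illustration, the function is a lightly tidied copy of the one above, and "html.parser" is just an assumed parser choice:

```python
import bs4
from bs4 import BeautifulSoup

# Hypothetical sample document, only for illustration
sample_html = (
    "<html><body>loose text<p>in a tag</p>"
    "<!-- a comment -->more loose text</body></html>"
)


def extract_tags(element):
    """Return element followed by all of its descendants, depth first."""
    found = [element]
    # String nodes (including Comment and other subclasses) are leaves
    if isinstance(element, bs4.element.NavigableString):
        return found
    for child in element.children:
        found.extend(extract_tags(child))
    return found


soup = BeautifulSoup(sample_html, "html.parser")
full_list = extract_tags(soup)

# Class name of each collected node, to see tags and strings side by side
kinds = [type(node).__name__ for node in full_list]
print(kinds)
```

Beautiful Soup's own soup.descendants generator walks the tree in the same order, so list(soup.descendants) should match full_list[1:] (everything except the soup object itself).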
Then if you wanted to list only raw strings you could do:
    for x in full_list:
        if type(x) is bs4.element.NavigableString:
            print(str(x.string))
You have to use str(x.string) because Beautiful Soup has its own
subclass of str, NavigableString (I think that's the correct
terminology), and in my experience today Python threw a fit when I tried
to combine it with a regular string. Calling str() gives you a plain
string that no longer carries a reference back to the parse tree.
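A short sketch of the filtering and conversion step, for checking the types involved. The sample HTML is made up, and soup.descendants (Beautiful Soup's built-in tree walk) is used here as a stand-in for full_list so the snippet runs on its own; it is not the method described above:

```python
import bs4
from bs4 import BeautifulSoup

# Hypothetical sample document, only for illustration
soup = BeautifulSoup(
    "<body>loose text<!-- note --><p>in a tag</p></body>", "html.parser"
)

raw_strings = []
for x in soup.descendants:  # stand-in for full_list
    # Exact type check on purpose: subclasses such as Comment are skipped
    if type(x) is bs4.element.NavigableString:
        raw_strings.append(str(x))  # convert to a plain str

node = soup.p.string                    # a NavigableString
is_str_subclass = isinstance(node, str)
plain = str(node)                       # plain str, detached from the tree
joined = plain + "!"

print(raw_strings)
print(type(node).__name__, is_str_subclass, type(plain) is str, joined)
```

Using isinstance() instead of the exact type check would also pick up comments, so which one to use depends on whether those count as raw text for your purposes.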
I hope this helps someone! :)
Nathan
On 25/08/2021 12:34, Alan Gauld via Tutor wrote:
> On 24/08/2021 21:15, nathan Smith wrote:
>> I actually fixed this myself.
> Good, but it would be useful to share how, for the
> benefit of future readers...
>
> Surely "raw text" is still inside a tag, even if
> its only the top level <body> tag?
>
>> tags=soup.find_all()
>> which returns tags only.
>>
>> Raw text are not tags.
> So how did you extract it?
>