Beautiful soup : why does "string" not give me the string?
Gabriel Rossetti
gabriel.rossetti at arimaz.com
Wed Apr 1 08:15:58 EDT 2009
Jeremiah Dodds wrote:
>
>
> On Wed, Apr 1, 2009 at 8:25 AM, Gabriel Rossetti
> <gabriel.rossetti at arimaz.com> wrote:
>
> Hello everyone,
>
> I am using beautiful soup to parse some HTML and I came across
> something strange.
> Here is an illustration:
>
> >>> soup = BeautifulSoup(u'<div class="text">hello ça boume<br /></div>')
> >>> soup
> <div class="text">hello ça boume<br /></div>
> >>> soup.find("div", "text")
> <div class="text">hello ça boume<br /></div>
> >>> soup.find("div", "text").string
> >>> soup.find("div", "text").next
> u'hello \xe7a boume'
>
> why does soup.find("div", "text").string not give me the string?
> Is it because there is a <br/>?
>
>
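> IIRC, yes it is: .string is only set when a tag has exactly one child
> and that child is a string. Here the <br /> is a second child, so
> there's not much you can do about it other than using .next.string or
> .contents[0], or stripping out the <br> tags. See
> http://www.crummy.com/software/BeautifulSoup/documentation.html ,
> particularly the "Removing Elements" and "string" sections.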
>
>
Ok, thanks. I also found that I can do this:
    soup.find(text=lambda t: isinstance(t, basestring))
or this:
    soup.find(text=True)
Either one seems faster than doing this:
    [br.extract() for br in soup.findAll("br")]
    soup.string
but I have not measured it, so I may be wrong.
Thanks again!
Gabriel
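
P.S. For the archive, here is a small toy model of the rule discussed
above. It is only a sketch and not BeautifulSoup's actual code: the Tag
class and its first_text() and without() helpers are hypothetical names
made up for illustration, and it is written for Python 3 rather than the
Python 2 used in the session above. It assumes that .string is defined
only when a tag has exactly one child and that child is text, which is
how the documentation describes it.

```python
# Toy model of the ".string" rule; NOT BeautifulSoup's real implementation.
# Tag, first_text() and without() are hypothetical names for illustration.

class Tag(object):
    def __init__(self, name, children=None):
        self.name = name
        self.contents = list(children or [])

    @property
    def string(self):
        # Defined only when there is exactly one child and it is text.
        if len(self.contents) == 1 and isinstance(self.contents[0], str):
            return self.contents[0]
        return None

    def first_text(self):
        # Rough analogue of find(text=True): first text node, depth first.
        for child in self.contents:
            if isinstance(child, str):
                return child
            found = child.first_text()
            if found is not None:
                return found
        return None

    def without(self, name):
        # Rough analogue of extracting every tag called `name`:
        # returns a copy of this tag with those tags left out.
        kept = []
        for child in self.contents:
            if isinstance(child, str):
                kept.append(child)
            elif child.name != name:
                kept.append(child.without(name))
        return Tag(self.name, kept)


div = Tag("div", [u"hello \xe7a boume", Tag("br")])

print(div.string)                # None: two children (the text and <br />)
print(div.contents[0])           # the text, taken from the first child
print(div.first_text())          # the text, found by searching
print(div.without("br").string)  # the text, once <br /> is gone
```

In this model the search stops at the first text node it meets, whereas
the removal approach has to walk every tag before .string can be read.
That is consistent with the impression that find(text=True) is the
cheaper option, but it says nothing about the real library's timings.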