[Tutor] Get lines of web page properly segmented
Dennis Lee Bieber
wlfraed at ix.netcom.com
Fri Oct 22 14:32:09 EDT 2021
On Fri, 22 Oct 2021 18:06:24 +0200, Julius Hamilton
<juliushamilton100 at gmail.com> declaimed the following:
>
>I thought I could get better quality text by using Beautiful Soup, but I
>just tried the .get_text() method and I was surprised to find that the
>sentences are still broken by newlines. Maybe there are newlines even in
>the HTML, or maybe there were HTML tags embedding links in the text, and
>Beautiful Soup adds newlines when it extracts text.
>
You'll have to examine the raw HTML to determine what structure is in
use... By definition, plain "new lines" in HTML source are only for the
convenience of whoever edits the file; a browser treats them as ordinary
whitespace and does not render them as line breaks.
If someone coded, say
<p>This is some text <br>
broken up by hard coded <br>
breaks</p>
I suspect anything that tries to extract "text" is going to have line
breaks at the <br> locations. Even worse would be...
<p>This is some text </p><p>broken up by hard coded</p>
<p>paragraph tags</p>
Something like
<p>This is some text
broken up by simple
new-lines</p>
will be rendered by a browser as one line, only wrapping if the browser
window is narrower than the line. (Though the example at
https://www.crummy.com/software/BeautifulSoup/bs4/doc/ seems to be putting
out a new line for each anchor tag -- which is not something I'd expect!)
{and this is assuming /simple/ HTML -- not something where every tag
references some CSS class or rule which may be coming from another file}
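A minimal sketch of the whitespace handling described above, assuming the
text has already been extracted and still carries the source new-lines.
The helper name collapse_whitespace is made up for illustration; it only
approximates what a browser does with ordinary (non-<pre>) text.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Replace each run of whitespace (spaces, tabs, new-lines) with one space.

    This roughly mirrors how a browser lays out ordinary flow text:
    new-lines in the source count as plain whitespace, not line breaks.
    """
    return re.sub(r"\s+", " ", text).strip()


raw = "This is some text\nbroken up by simple\nnew-lines"
print(collapse_whitespace(raw))
# prints: This is some text broken up by simple new-lines
```

Note this would also flatten any line breaks you wanted to keep, so it
is best applied per block (per paragraph, say) rather than to the whole
page at once.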
You'll have to provide example raw HTML for review (provide the HTML
AND what you expect to extract from it -- so a sample with more than one of
your "segments". Ideally your "segments" will have <p></p> or other block
delimiters which you can parse.)
Some late comments: I would NOT use .get_text() on the whole HTML. I'd
try to iterate on the contents of the HTML looking for specific tags (and
classes of tags, if such are used). Then extract the contents of just those
tags for processing.
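As a rough sketch of that approach, assuming the "segments" live in <p>
tags (the sample HTML and the function name paragraphs are invented for
illustration; your real page may need a different tag or a class
filter):

```python
from bs4 import BeautifulSoup

# Made-up sample: one navigation block we want to skip, two paragraphs we want.
html = """
<html><body>
<div class="nav"><a href="/">Home</a></div>
<p>This is some text
broken up by simple
new-lines</p>
<p>A second <a href="#x">linked</a>
paragraph</p>
</body></html>
"""


def paragraphs(html_text):
    """Return the text of each <p> tag, one string per paragraph."""
    soup = BeautifulSoup(html_text, "html.parser")
    result = []
    for p in soup.find_all("p"):
        # get_text() keeps the new-lines that were in the source,
        # so collapse every run of whitespace to a single space.
        text = " ".join(p.get_text().split())
        if text:
            result.append(text)
    return result


for segment in paragraphs(html):
    print(segment)
```

Each paragraph comes back as a single line, and the navigation link is
ignored because it is not inside a <p>. If the page marks its content
with classes, find_all("p", class_="...") narrows the search further.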
--
Wulfraed Dennis Lee Bieber AF6VN
wlfraed at ix.netcom.com http://wlfraed.microdiversity.freeddns.org/