[Tutor] Get lines of web page properly segmented

Fri Oct 22 13:51:02 EDT 2021

On 22/10/2021 17:06, Julius Hamilton wrote:

> I want to segment texts which sometimes have text in them that doesn’t end
> in a period - for example, the title of a section, or lines of code.

I'm not clear on what you mean by "segment". That's not a
technical term I recognise... I'm guessing you mean you
want to separate the text from its context and then store
the text in some kind of variable? Is that it?

> Usually when I retrieve text from a webpage, the sentences are broken up
> with newlines, like this:
> 
> And Jeffrey went
> to Paris that
> weekend to meet
> his family.

That's probably how the text appears in the original HTML.

> I cannot segment text on newlines if the sentences are broken by newlines,
> but I need to go preserve the separation between different lines of code,
> like:
> 
> print(x)
> x = 3
> quit()

If the code is not separated naturally (as would happen if it
was quoted inside a <PRE></PRE> section, or a <div> with code
CSS formatting, for example) then you will need some kind of
code parser. The most likely case would be JavaScript code
in the header section. There are JavaScript parsers available
but it does make things much more complex.
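
For the <PRE> case, here is a minimal sketch of keeping code lines
apart from ordinary text. It uses only the standard library's
html.parser rather than Beautiful Soup, and the names PreSplitter
and sample_html are made up for illustration. It assumes the code
really is inside <pre> tags, which may not be true of your pages.

```python
from html.parser import HTMLParser


class PreSplitter(HTMLParser):
    """Collect text inside <pre> tags separately from all other text."""

    def __init__(self):
        super().__init__()
        self.pre_depth = 0      # > 0 while inside a <pre> element
        self.code_blocks = []   # one string per outermost <pre> element
        self.prose_parts = []   # text found outside <pre>

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.pre_depth += 1
            if self.pre_depth == 1:
                self.code_blocks.append("")

    def handle_endtag(self, tag):
        if tag == "pre" and self.pre_depth > 0:
            self.pre_depth -= 1

    def handle_data(self, data):
        if self.pre_depth > 0:
            self.code_blocks[-1] += data   # keep newlines exactly as written
        else:
            self.prose_parts.append(data)


# Hypothetical sample page, modelled on the examples in this thread.
sample_html = (
    "<p>And Jeffrey went\nto Paris that\nweekend.</p>\n"
    "<pre>print(x)\nx = 3\nquit()</pre>"
)

splitter = PreSplitter()
splitter.feed(sample_html)
splitter.close()

# Code keeps its line breaks; prose has its line breaks collapsed to spaces.
code_lines = splitter.code_blocks[0].splitlines()
prose = " ".join("".join(splitter.prose_parts).split())
```

The point is only that the two kinds of text are handled by different
rules: newlines are meaningful inside <pre> and are not elsewhere.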

> I think I can either focus on getting the text from the source in a higher
> quality format, so that the sentences are already connected and not broken,
> or I have to find an efficient way to automatically join broken sentences
> but nothing else.

As with everything in programming you need to very specifically
define what all those things mean. What is a "higher quality format"?
How do you define a "sentence"? It's not something that HTML knows
about, so an HTML parser won't help.

You can get the raw text out of the tag, but it's up to you to
mess with sentences. The usual definition relies on punctuation
marks to terminate a sentence (. ! ?). You then split the text on the
punctuation marks.
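
A rough sketch of that punctuation-based definition, using the
standard re module. The sample text is made up, and the pattern is
just one possible choice: it splits at whitespace that follows a
full stop, exclamation mark or question mark.

```python
import re

# Hypothetical text that has already had its line breaks removed.
text_block = "And Jeffrey went to Paris. Did he meet his family? Yes! They were glad."

# Split at whitespace that follows . ! or ? so each mark stays
# attached to the sentence it ends.
sentences = re.split(r"(?<=[.!?])\s+", text_block)
```

This simple rule will also split after abbreviations such as "Dr."
so it only suits text where that does not matter, or where you add
your own exceptions.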

> I thought I could get better quality text by using Beautiful Soup, but I
> just tried the .get_text() method and I was surprised to find that the
> sentences are still broken by newlines. Maybe there are newlines even in
> the HTML, or maybe there were HTML tags embedding links in the text, and
> Beautiful Soup adds newlines when it extracts text.

Beautiful Soup just extracts the HTML content. If the HTML text has
newlines BS will give you those newlines. They are part of the source text.

> Or if this is inevitable, what is an effective way to join broken sentences
> automatically but nothing else? I think I’ll need AI for this.

You definitely should not need AI. This is the kind of stuff
programmers have been doing since the dawn of programming.
It's fundamental text manipulation; you just need to decide what
your definitions are and then split the text accordingly.

You should get most of the way just using the standard string
split() method. Possibly applying it more than once per text block.
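
As an illustration of applying split() more than once, the sketch
below first splits on blank lines and then on whitespace within each
piece. It assumes that blank lines separate paragraphs in the
extracted text, which you would need to check against your own pages.
The sample text is made up.

```python
# Hypothetical extracted text: two paragraphs, each broken across lines.
raw_text = (
    "And Jeffrey went\n"
    "to Paris that\n"
    "weekend to meet\n"
    "his family.\n"
    "\n"
    "He stayed for\n"
    "three days.\n"
)

# First split: a blank line (two newlines in a row) separates paragraphs.
paragraphs = raw_text.split("\n\n")

# Second split: split() with no argument breaks on any run of whitespace,
# including the newlines inside a paragraph; join() puts single spaces back.
joined = [" ".join(p.split()) for p in paragraphs]
```

Each entry in joined is then one unbroken paragraph that can be
split into sentences separately.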

You may also want to join the lines together into one long
string before extracting sentences, something like:

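import re

text = soup.get_text()  # soup is your BeautifulSoup object; get_text() returns one string
lines = [line.strip() for line in text.splitlines()]  # strip removes stray whitespace
text_block = ' '.join(line for line in lines if line)  # add spaces so words don't run together
# str.split() only takes one separator, so use re.split() to split after . ! or ?
sentences = re.split(r'(?<=[.!?])\s+', text_block)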

That may not work perfectly but it's a starter.

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos