Short, perfect program to read sentences of webpage

Wed Dec 8 15:58:52 EST 2021

On 2021-12-08 19:39, Julius Hamilton wrote:
> Hey,
> 
> This is something I have been working on for a very long time. It’s one of
> the reasons I got into programming at all. I’d really appreciate it if
> people could offer some advice on this.
> 
> This is a really simple program which extracts the text from a webpage and
> displays it one sentence at a time. It’s meant to help you study dense
> material, especially documentation, with much more focus and comprehension.
> I actually hope it can be of help to people who have difficulty reading. I
> know it’s been of use to me at least.
> 
> This is a minimally acceptable way to pull it off currently:
> 
> deepreader.py:
> 
> import sys
> import requests
> import html2text
> import nltk
> 
> url = sys.argv[1]
> 
> # Get the HTML, pull out the text, and sentence-segment it in one line of code
> 
> sentences = nltk.sent_tokenize(html2text.html2text(requests.get(url).text))
> 
> # Activate an elementary reader interface for the text
> 
> for index, sentence in enumerate(sentences):
> 
>    # Print the sentence
>    print("\n" + str(index) + "/" + str(len(sentences)) + ": " + sentence + "\n")
> 
You can shorten that with format strings:

     print("\n{}/{}: {}\n".format(index, len(sentences), sentence))

or even:

     print(f"\n{index}/{len(sentences)}: {sentence}\n")
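
One more thing about that line: enumerate counts from 0 unless told otherwise, so the first sentence is shown as 0/N and the last as N-1/N. Assuming you want a 1-based counter, passing start=1 should do it. A minimal sketch (progress_line is just a name I made up, and the two sentences are stand-in data):

```python
def progress_line(index, total, sentence):
    # Same layout as the f-string above, wrapped in a helper (hypothetical name)
    return f"\n{index}/{total}: {sentence}\n"

# Stand-in data; in deepreader.py this would be the nltk.sent_tokenize result
sentences = ["First sentence.", "Second sentence."]

# start=1 makes the counter run from 1 to len(sentences) instead of 0 to len - 1
for index, sentence in enumerate(sentences, start=1):
    print(progress_line(index, len(sentences), sentence))
```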

>    # Wait for user key-press
>    x = input("\n> ")
> 
> 
> EOF
> 
> 
> 
> That’s it.
> 
> A lot of refining is possible, and I’d really like to see how some more
> experienced people might handle it.
> 
> 1. The HTML extraction is not perfect. It doesn’t produce text as clean as
> I would like. Sometimes stray links or tags get left in there. And the
> sentences are sometimes broken by newlines in arbitrary places.
> 
> 2. The segmentation is not perfect either. I am currently looking into
> building a better segmenter with tools from spaCy.
> 
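
On point 1: html2text produces Markdown, so links come out as [label](target), and as far as I remember it also hard-wraps its output at a fixed width, which would explain sentences being split by newlines. I think it has settings for both (ignore_links and body_width), so check its docs first. Failing that, here is a rough stdlib-only cleanup pass you could run on the text before handing it to nltk.sent_tokenize. The name clean_text is mine, and the regex is an assumption that only covers simple inline links and images:

```python
import re

def clean_text(text):
    """Strip simple Markdown link syntax and rejoin hard-wrapped lines."""
    # Replace "[label](target)" or "![alt](src)" with just the label/alt text
    text = re.sub(r"!?\[([^\]]*)\]\([^)]*\)", r"\1", text)
    # Split on blank lines so that paragraph boundaries survive
    paragraphs = re.split(r"\n\s*\n", text)
    # Within each paragraph, collapse newlines and runs of spaces to one space
    joined = (" ".join(p.split()) for p in paragraphs)
    # Drop paragraphs that ended up empty
    return "\n\n".join(p for p in joined if p)

sample = "See the [docs](https://example.com/docs)\nfor details.\n\n\nNext   paragraph."
print(clean_text(sample))
```

It would slot in as nltk.sent_tokenize(clean_text(html2text.html2text(...))). Nested brackets, reference-style links and leftover raw tags are not handled.
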
> Brevity is greatly valued. I mean, anyone who can make the program more
> perfect, that’s hugely appreciated. But if someone can do it in very few
> lines of code, that’s also appreciated.
> 
> Thanks very much,
> Julius
>