Short, perfect program to read sentences of webpage
MRAB
python at mrabarnett.plus.com
Wed Dec 8 15:58:52 EST 2021
On 2021-12-08 19:39, Julius Hamilton wrote:
> Hey,
>
> This is something I have been working on for a very long time. It’s one of
> the reasons I got into programming at all. I’d really appreciate it if people
> could offer some advice on this.
>
> This is a really simple program which extracts the text from a webpage and
> displays it one sentence at a time. It’s meant to help you study dense
> material, especially documentation, with much more focus and comprehension.
> I actually hope it can be of help to people who have difficulty reading. I
> know it’s been of use to me at least.
>
> This is a minimally acceptable way to pull it off currently:
>
> deepreader.py:
>
> import sys
> import requests
> import html2text
> import nltk
>
> url = sys.argv[1]
>
> # Get the html, pull out the text, and sentence-segment it in one line of code
>
> sentences = nltk.sent_tokenize(html2text.html2text(requests.get(url).text))
>
> # Activate an elementary reader interface for the text
>
> for index, sentence in enumerate(sentences):
>
>     # Print the sentence
>     print("\n" + str(index) + "/" + str(len(sentences)) + ": " + sentence + "\n")
>
You can shorten that with format strings:
print("\n{}/{}: {}\n".format(index, len(sentences), sentence))
or even:
print(f"\n{index}/{len(sentences)}: {sentence}\n")
>     # Wait for user key-press
>     x = input("\n> ")
>
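One small thing about the loop as posted: enumerate() counts from 0, so the first sentence is labelled 0/N and the last one N-1/N. Below is a rough sketch of the same reader loop with a 1-based counter and a way to stop early. The function name read_sentences, the "q" command, and the ask/show parameters are my own inventions for illustration (the parameters just make the loop easy to try without a terminal); none of this is in the original program.

```python
def read_sentences(sentences, ask=input, show=print):
    """Show sentences one at a time; a reply of 'q' stops early.

    Returns the number of sentences that were shown.
    """
    total = len(sentences)
    shown = 0
    # start=1 makes the counter run 1..total instead of 0..total-1
    for index, sentence in enumerate(sentences, start=1):
        show(f"\n{index}/{total}: {sentence}\n")
        shown += 1
        try:
            reply = ask("\n> ")
        except EOFError:
            # End of input (e.g. Ctrl-D) also ends the session
            break
        if reply.strip().lower() == "q":
            break
    return shown
```

In the script, read_sentences(sentences) would take the place of the for loop; pressing Enter moves to the next sentence.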
>
> EOF
>
>
>
> That’s it.
>
> A lot of refining is possible, and I’d really like to see how some more
> experienced people might handle it.
>
> 1. The HTML extraction is not perfect. It doesn’t produce text as clean as
> I would like. Sometimes random links or tags get left in there. And the
> sentences are sometimes randomly broken by newlines.
>
> 2. Neither is the segmentation perfect. I am currently looking into
> developing an optimal segmenter with tools from spaCy.
>
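On point 1, the stray links and mid-sentence newlines can probably be reduced with a small clean-up pass before segmentation. This is only a sketch using the standard library: the function name clean_text and the sample input are made up, and it assumes the extracted text is Markdown-like, with blank lines between paragraphs and links written as [label](url). html2text may also let you avoid some of this at the source (I believe an HTML2Text instance has ignore_links and body_width settings), but check the documentation for your installed version.

```python
import re


def clean_text(text):
    """Strip Markdown link/image syntax and undo hard line wrapping."""
    # Replace [label](url) and ![alt](url) with just the label / alt text
    text = re.sub(r"!?\[([^\]]*)\]\([^)]*\)", r"\1", text)
    # Paragraphs are assumed to be separated by one or more blank lines
    paragraphs = re.split(r"\n\s*\n", text)
    # Within each paragraph, collapse newlines and runs of spaces into
    # single spaces so a sentence is no longer split across lines
    joined = [" ".join(p.split()) for p in paragraphs if p.strip()]
    return "\n\n".join(joined)
```

The result could then be passed to nltk.sent_tokenize in place of the raw html2text output. It will not catch everything (nested brackets or leftover HTML tags, for example), so treat it as a starting point.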
> Brevity is greatly valued. I mean, anyone who can make the program more
> perfect, that’s hugely appreciated. But if someone can do it in very few
> lines of code, that’s also appreciated.
>
> Thanks very much,
> Julius
>