Short, perfect program to read sentences of webpage
Cameron Simpson
cs at cskk.id.au
Wed Dec 8 16:12:17 EST 2021
Assorted remarks inline below:
On 08Dec2021 20:39, Julius Hamilton <juliushamilton100 at gmail.com> wrote:
>deepreader.py:
>
>import sys
>import requests
>import html2text
>import nltk
>
>url = sys.argv[1]
I might spell this:
cmd, url = sys.argv
which enforces exactly one argument. And since you don't care about the
command name, maybe:
_, url = sys.argv
because "_" is a conventional name for "a value we do not care about".
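A minimal sketch of that unpacking, with the ValueError turned into a usage message. The helper name get_url and the usage text are illustrative assumptions, not part of the original script:

```python
def get_url(argv):
    """Return the single expected argument from an argv-style list."""
    try:
        # Unpacking raises ValueError unless argv has exactly two items:
        # the command name (ignored here) and the URL.
        _, url = argv
    except ValueError:
        cmd = argv[0] if argv else "deepreader.py"
        raise SystemExit(f"Usage: {cmd} URL")
    return url


# Demonstrated with an explicit list; a real script would pass sys.argv.
print(get_url(["deepreader.py", "https://example.com"]))
```

Raising SystemExit with a string prints that string to standard error and exits with a nonzero status, which is usually what you want for a bad command line.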
>sentences = nltk.sent_tokenize(html2text.html2text(requests.get(url).text))
Neat!
># Activate an elementary reader interface for the text
>for index, sentence in enumerate(sentences):
I would be inclined to count from 1, so "enumerate(sentences, 1)".
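For illustration, here is the difference the start value makes, using a made-up list of sentences:

```python
sentences = ["First sentence.", "Second sentence.", "Third sentence."]
total = len(sentences)

# Default start of 0: the last sentence would be labelled 2/3.
zero_based = [f"{i}/{total}" for i, _ in enumerate(sentences)]

# Start of 1: labels run from 1/3 to 3/3, matching how people count.
one_based = [f"{i}/{total}" for i, _ in enumerate(sentences, 1)]

print(zero_based)
print(one_based)
```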
>    # Print the sentence
>    print("\n" + str(index) + "/" + str(len(sentences)) + ": " + sentence +
>    "\n")
Personally, since print() adds a trailing newline, I would drop the
final +"\n". If you want an additional blank line, I would put it in the
input() call below:
>    # Wait for user key-press
>    x = input("\n> ")
You're not using "x". Just discard input()'s return value:
input("\n> ")
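Put together, the loop might look like the sketch below. The function name page_through and its wait/show parameters are assumptions made for this example; they default to input() and print() and exist only so the loop can be run without a terminal:

```python
def page_through(sentences, wait=input, show=print):
    """Show each sentence in turn, pausing for the user between them."""
    total = len(sentences)
    for index, sentence in enumerate(sentences, 1):
        # print() already ends the line, so no trailing "\n" is appended.
        show(f"\n{index}/{total}: {sentence}")
        # The leading "\n" in the prompt supplies the blank line, and
        # the value returned by input() is deliberately discarded.
        wait("\n> ")
```

Calling page_through(sentences) with no other arguments gives the interactive behaviour of the original script.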
>A lot of refining is possible, and I’d really like to see how some more
>experienced people might handle it.
>
>1. The HTML extraction is not perfect. It doesn’t produce as clean text as
>I would like. Sometimes random links or tags get left in there.
Maybe try beautifulsoup instead of html2text? The module name is "bs4".
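A rough sketch of what that could look like, assuming bs4 is installed. The helper name html_to_text is made up here, and whether this produces cleaner text than html2text on any particular page is something you would need to check:

```python
def html_to_text(html):
    """Return the visible text of an HTML document as one string."""
    # Imported here so the rest of a script still loads without bs4.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    # Drop elements whose contents are never meant to be read as prose.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Join the remaining text nodes with spaces, trimming each one.
    return soup.get_text(separator=" ", strip=True)
```

Link text is kept but the URLs themselves are not, which may help with the stray links you mention.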
>And the
>sentences are sometimes randomly broken by newlines.
I would flatten the newlines. Either the simple:
sentence = sentence.strip().replace("\n", " ")
or maybe better:
sentence = " ".join(sentence.split())
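To show why the second form may be better, here is a small comparison on a made-up string:

```python
raw = "  This sentence was\nbroken   across\n\nlines.  "

# Only newlines are replaced, so runs of spaces (and the doubled
# newline) survive as multiple spaces.
simple = raw.strip().replace("\n", " ")

# split() with no arguments splits on any run of whitespace, so
# rejoining with single spaces normalises everything at once.
collapsed = " ".join(raw.split())

print(repr(simple))
print(repr(collapsed))
```

The split/join form also handles tabs and other whitespace, and makes the strip() unnecessary.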
Cheers,
Cameron Simpson <cs at cskk.id.au>