Short, perfect program to read sentences of webpage
Cameron Simpson
cs at cskk.id.au
Wed Dec 8 17:42:07 EST 2021
On 08Dec2021 21:41, Stefan Ram <ram at zedat.fu-berlin.de> wrote:
>Julius Hamilton <juliushamilton100 at gmail.com> writes:
>>This is a really simple program which extracts the text from webpages and
>>displays them one sentence at a time.
>
> Our teacher said NLTK will not come up until next year, so
> I tried to do with regexps. It still has bugs, for example
> it can not tell the dot at the end of an abbreviation from
> the dot at the end of a sentence!
This is almost a classic demo of why regexps are a poor tool as a first
choice. You can do much with them, but they are cryptic and bug prone.
I am not seeking to mock you, but trying to make apparent why regexps
are to be avoided a lot of the time. They have their place.
You've read the whole re module docs I hope:
https://docs.python.org/3/library/re.html#module-re
>import re
>import urllib.request
>uri = r'''http://example.com/article''' # replace this with your URI!
>request = urllib.request.Request( uri )
>resource = urllib.request.urlopen( request )
>cs = resource.headers.get_content_charset()
>content = resource.read().decode( cs, errors="ignore" )
>content = re.sub( r'''[\r\n\t\s]+''', r''' ''', content )
You're not multiline, so I would recommend a plain raw string:
content = re.sub( r'[\r\n\t\s]+', r' ', content )
No need for \r, \n or \t in the class; \s covers all of them. From the docs:
\s
For Unicode (str) patterns:
Matches Unicode whitespace characters (which includes
[ \t\n\r\f\v], and also many other characters, for example the
non-breaking spaces mandated by typography rules in many
languages). If the ASCII flag is used, only [ \t\n\r\f\v] is
matched.
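A minimal sketch of the simplified substitution, relying only on \s. The function name collapse_whitespace and the sample string are mine, for illustration:

```python
import re

def collapse_whitespace(text):
    # \s already matches \r, \n, \t and other Unicode whitespace
    # (including the non-breaking space), so nothing else is needed
    # in the pattern.
    return re.sub(r'\s+', ' ', text)

sample = "one\r\n\ttwo\u00a0 three"
print(collapse_whitespace(sample))
```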
>upper = r"[A-ZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝ]" # "[\\p{Lu}]"
>lower = r"[a-zµàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]" # "[\\p{Ll}]"
This is very fragile - you have an arbitrary set of additional uppercase
characters, almost certainly incomplete, and visually hard to inspect
for completeness.
Instead, consider the \b (word boundary) and \w (word character)
markers, which will let you break strings up, and then maybe test the
results with str.isupper().
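Something along these lines, as a rough sketch and not a drop-in replacement for your patterns. The function name capitalised_words and the sample sentence are made up for the example:

```python
import re

def capitalised_words(text):
    # \w+ picks out runs of word characters; \b anchors each run at
    # a word boundary.
    words = re.findall(r'\b\w+\b', text)
    # str.isupper() consults the Unicode database, so there is no
    # hand-maintained list of uppercase letters to get wrong.
    return [w for w in words if w[0].isupper()]

print(capitalised_words("Élan and Ångström met Zoë in århus."))
```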
>digit = r"[0-9]" #"[\\p{Nd}]"
There's a \d character class for this. For str patterns it matches any
Unicode decimal digit, so it covers the digits of other scripts too,
not just 0-9.
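A small sketch of how \d behaves on str patterns, with and without the ASCII flag. The helper digit_runs is hypothetical, and the sample uses Arabic-Indic digits written as escapes:

```python
import re

def digit_runs(text, ascii_only=False):
    # On a str pattern \d matches any Unicode decimal digit
    # (category Nd); with re.ASCII it matches only 0-9.
    flags = re.ASCII if ascii_only else 0
    return re.findall(r'\d+', text, flags=flags)

sample = "room 42, \u0664\u0662"   # the second number is in Arabic-Indic digits
print(digit_runs(sample))
print(digit_runs(sample, ascii_only=True))
```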
>firstwordstart = upper;
>firstwordnext = "(?:[a-zµàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ-])";
Again, an inline arbitrary list of characters. This is fragile.
>wordcharacter = "[A-ZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝa-zµàáâãäåæçèéêëìíîïð\
>ñòóôõöøùúûüýþÿ0-9-]"
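Again inline. Why not construct it from the classes you already have?
Plain concatenation would match three characters in a row, so strip the
brackets and merge them into a single class:
wordcharacter = "[" + upper[1:-1] + lower[1:-1] + digit[1:-1] + "-]"
but I recommend \w instead (it already includes the digits, and the
underscore), or for this, to keep the hyphen: [\w-]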
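For instance, a sketch assuming you want letters, digits and the hyphen, as in your original wordcharacter. Note that \w also admits the underscore. The names WORD and words are mine:

```python
import re

# \w covers Unicode letters and digits (plus underscore); the trailing
# hyphen inside the class is taken literally.
WORD = re.compile(r"[\w-]+")

def words(text):
    return WORD.findall(text)

print(words("A well-known café opened in 2021."))
```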
>addition = "(?:(?:[']" + wordcharacter + "+)*[']?)?"
As a matter of good practice with regexp strings, use raw quotes:
addition = r"(?:(?:[']" + wordcharacter + r"+)*[']?)?"
even when there are no backslashes.
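The habit pays off the moment a backslash does creep in. A sketch of the classic trap with \b, using a made-up pattern and a hypothetical helper named finds:

```python
import re

plain = "\bcat\b"    # non-raw: Python turns each \b into a backspace (\x08)
raw = r"\bcat\b"     # raw: the regexp engine sees \b, a word boundary

def finds(pattern, text):
    return re.search(pattern, text) is not None

print(finds(plain, "the cat sat"))   # looks for backspace, "cat", backspace
print(finds(raw, "the cat sat"))     # looks for the whole word "cat"
```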
Seriously, doing this with regexps is difficult. A useful exercise for
learning regexps, but in the general case not the first tool to reach
for.
Cheers,
Cameron Simpson <cs at cskk.id.au>