[Tutor] Parsing and collecting keywords from a webpage

Alan Gauld alan.gauld at yahoo.co.uk
Wed Jun 20 19:16:41 EDT 2018


On 20/06/18 20:32, Daniel Bosah wrote:
> # coding: latin-1
> from bs4 import BeautifulSoup
> from urllib.request import urlopen
> import re
> 
> # new point to add... make the rest of the function, then compare a
> list of monument notaries (such as blvd, road, street, etc.) to a
> list of words containing them. If contained, pass into a new set
> (ref notes in case)
> 
> 
> def regex(url):
> 
>   html = urlopen(url).read()
>   soup = BeautifulSoup(html, "lxml")  # why does lxml fix it?

Fix what?
You haven't given us any clue what you are talking about.
Did you have a problem? If so, what? And in what way did
lxml fix it?
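
For what it's worth, "lxml" there just names the parser that
BeautifulSoup should use. lxml is a third-party parser that is
generally faster than Python's built-in "html.parser", and the two
can build different trees from badly formed HTML, which may be why
switching parsers appeared to fix something. A quick way to see the
difference on a deliberately broken fragment:

from bs4 import BeautifulSoup

print(BeautifulSoup("<p>example", "html.parser"))
# -> <p>example</p>
print(BeautifulSoup("<p>example", "lxml"))
# -> <html><body><p>example</p></body></html>

But without knowing what the original problem was, that is only a
guess.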

> What this code does is go through a webpage, using BeautifulSoup and
> a regular expression, and compare the regexed list of words (in
> regex) to a list of keywords, then write the matches to a text file.
> The next function (regexparse) starts with an empty list (setss) and
> reads the text file produced by the previous function. What I want
> to do, in a for loop, is check whether words in monum and the text
> file (from the regex function) are shared; if so, those shared words
> get added to the empty list (setss) and then written to a file.
> (This code is going to be added to a web crawler and will add words
> and phrases to a text file as it crawls the internet.)
> 
> However, every time I run the current code, I get the entire text
> file (sets.txt) from the previous (regex) function, even though all
> I want are the words and phrases shared between the text file from
> regex and the monum list in regexparse. How can I fix this?

So did lxml fix it?
Since you are posting the question, I assume not.
Can you clarify what exactly you are asking?
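
That said, from your description it sounds like you only want the
words that appear in both sets.txt and the monum list. If so, the
usual tool is a set intersection. Here is a minimal sketch - the
names sets.txt, monum and regexparse come from your post; the example
monum values, the output file name and the whitespace splitting are
my guesses:

monum = ['blvd', 'road', 'street', 'avenue']   # example values only

def regexparse(keyword_file='sets.txt', out_file='shared.txt'):
    # Read every word that the regex function wrote out.
    with open(keyword_file) as f:
        words = set(f.read().split())
    # Keep only the words that appear in BOTH sources.
    shared = words & set(monum)
    with open(out_file, 'w') as f:
        for word in sorted(shared):
            f.write(word + '\n')
    return shared

If you are seeing the whole of sets.txt in your output, the likely
culprit is a write that happens for every word rather than only for
the matches, but we would need to see your actual regexparse code to
say for sure.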


-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos



