[Tutor] Parsing and collecting keywords from a webpage
Peter Otten
__peter__ at web.de
Thu Jun 21 05:03:25 EDT 2018
Daniel Bosah wrote:
> new_list = [x.encode('latin-1') for x in sorted(paul)]
I don't see why you would need bytes
> search = "(" + b"|".join(new_list).decode() + ")" + "" #re.complie needs
when your next step is to decode it. I'm not sure why it even works as the
default encoding is usually UTF-8.
> u'José Antonio (Pepillo) Salcedo'
Those parens combined with
>
> search = "(" + b"|".join(new_list).decode() + ")" + "" #re.complie needs
> string as first argument, so adds string to be first argument, and joins
> the strings together with john
>
> # print (type(search))
> pattern = re.compile(search)#compiles search to be a regex object
> reg = pattern.findall(str(soup))#calls findall on pattern, which findall
will cause findall() to return a list of 2-tuples:
>>> re.compile("(" + "|".join(["foo", "bar(baz)"]) + ")").findall("yadda foo yadda bar(baz)")
[('foo', '')]
Applying re.escape() can prevent that:
>>> re.compile("(" + "|".join(re.escape(s) for s in ["foo", "bar(baz)"]) + ")").findall("yadda foo yadda bar(baz)")
['foo', 'bar(baz)']
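Applied to names like the one quoted above, this could look roughly as follows. It is only a sketch: the names list and the sample text are made up for illustration, and the outer group is dropped because findall() returns the full matches anyway when the pattern contains no groups.

```python
import re

# Hypothetical sample names; the real list comes from the original script.
names = ["José Antonio (Pepillo) Salcedo", "Juan Pablo Duarte"]

# Escape each name so that "(" and ")" are matched literally
# instead of being interpreted as regex groups.
pattern = re.compile("|".join(re.escape(name) for name in names))

# Made-up text standing in for str(soup).
text = "... Juan Pablo Duarte ... José Antonio (Pepillo) Salcedo ..."

# With no groups in the pattern, findall() returns a flat list of strings.
found = pattern.findall(text)
print(found)
```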
> if i in reg and paul: # this loop checks to see if elements are in
> both the regexed parsed list and the list. If i is in both, it is added to
> list.
No, it doesn't; it is equivalent to
if (i in reg) and bool(paul):
    ...
or, since paul is a list,
if (i in reg) and len(paul) > 0:
    ...
which for a non-empty list is effectively
if i in reg:
    ...
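A small experiment makes the difference visible. The two lists are stand-ins I made up, not the real data; the second comprehension shows what the comment in the quoted code seems to intend, namely a membership test against both lists.

```python
reg = ["foo", "bar"]   # stand-in for the findall() result
paul = ["bar", "baz"]  # stand-in for the keyword list

# As written: paul is only checked for truthiness (non-empty),
# so every item of reg passes.
as_written = [i for i in reg if i in reg and paul]

# Testing membership in both lists needs two "in" expressions.
in_both = [i for i in reg if i in reg and i in paul]

print(as_written)
print(in_both)
```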
> sets.append(str(i))
> with open('sets.txt', 'w') as f:
> f.write(str(sets))
This writes a single line of the form
['first', 'second item', ...]
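If the file is meant to be read back item by item, one option is to write one item per line. This is a sketch with made-up data; the temporary directory is only there so that the example does not overwrite a real sets.txt.

```python
import os
import tempfile

sets = ["first", "second item"]  # hypothetical sample data

# Temporary location used for this demonstration only.
path = os.path.join(tempfile.mkdtemp(), "sets.txt")

# Write one item per line instead of the repr() of the whole list.
with open(path, "w", encoding="utf-8") as f:
    for item in sets:
        f.write(item + "\n")

# Reading the file back yields the items again, one per line.
with open(path, encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f]

print(lines)
```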
> f.close()
No need to close() the file explicitly -- with open() already implies that
and operates more reliably (the file will be closed even if an exception is
raised in the with-suite).
> def regexparse(regex):
> monum = [u'road', u'blvd',u'street', u'town', u'city',u'Bernardo Vega']
> setss = []
>
> f = open('sets.txt', 'rt')
> f = list(f)
From my explanation above it follows that the list f contains a single string
(and one that does not occur in monum), so setss should always be empty.
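The following sketch reproduces that situation and shows one possible way around it. The contents of sets are invented for the example, and the temporary file stands in for sets.txt.

```python
import os
import tempfile

monum = ["road", "blvd", "street", "town", "city", "Bernardo Vega"]
sets = ["Bernardo Vega", "Juan Pablo Duarte"]  # hypothetical matches

# Temporary location used for this demonstration only.
path = os.path.join(tempfile.mkdtemp(), "sets.txt")

# What the quoted code does: write str(list), then read the lines back.
with open(path, "w", encoding="utf-8") as f:
    f.write(str(sets))
with open(path, encoding="utf-8") as f:
    lines = list(f)
print(lines)  # a list containing one single string
shared = [line for line in lines if line in monum]

# A possible alternative: one item per line, stripped when reading.
with open(path, "w", encoding="utf-8") as f:
    f.write("\n".join(sets) + "\n")
with open(path, encoding="utf-8") as f:
    items = [line.strip() for line in f]
shared_fixed = [item for item in items if item in monum]
print(shared_fixed)
```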
> for i in f:
> if i in f and i in monum:
> setss.append(i)
> #with open ('regex.txt','w') as q:
> #q.write(str(setss))
> # q.close()
> print (setss)
>
>
> if __name__ == '__main__':
> regexparse(regex('https://en.wikipedia.org/wiki/List_of_people_from_the_Dominican_Republic'))
>
>
> What this code is doing is basically going through a webpage using
> BeautifulSoup and regex to compare a regexed list of words ( in regex ) to
> a list of keywords and then writing them to a textfile. The next function
> (regexparse) then goes and has a empty list (setss), then reads the
> textfile from the previous function. What I want to do, in a for loop, is
> check to see if words in monum and the textfile ( from the regex function
> ) are shared, and if so , those shared words get added to the empty
> list(setss) , then written to a file ( this code is going to be added to a
> web crawler, and is basically going to be adding words and phrases to a
> txtfile as it crawls through the internet. ).
>
> However, every time I run the current code, I get all the
> textfile(sets.txt) from the previous ( regex ) function, even though all I
> want are words and pharse shared between the textfile from regex and the
> monum list from regexparse. How can I fix this?
Don't write a complete script and then cross your fingers hoping that it
will work as expected -- that rarely happens even to people with more
experience; they just find their errors more quickly ;). Instead start with
the first step, add print calls generously, and only continue working on the
next step when you are sure that the first does exactly what you want.
Once your scripts get more complex, replace visual inspection via print()
with a more formal approach such as unit tests:
https://docs.python.org/dev/library/unittest.html
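To give an idea of what that might look like, here is a minimal sketch. find_keywords() is a hypothetical helper that I made up for the example; it is not part of the original script.

```python
import re
import unittest


def find_keywords(text, keywords):
    """Return every keyword occurrence found in text, in order of appearance."""
    if not keywords:
        # An empty alternation would match the empty string everywhere.
        return []
    pattern = re.compile("|".join(re.escape(k) for k in keywords))
    return pattern.findall(text)


class FindKeywordsTest(unittest.TestCase):
    def test_parentheses_are_matched_literally(self):
        self.assertEqual(
            find_keywords("yadda foo yadda bar(baz)", ["foo", "bar(baz)"]),
            ["foo", "bar(baz)"],
        )

    def test_no_match_gives_empty_list(self):
        self.assertEqual(find_keywords("yadda", ["foo"]), [])

    def test_no_keywords_gives_empty_list(self):
        self.assertEqual(find_keywords("yadda foo", []), [])
```

Saved in a file, the tests can be run with python -m unittest followed by the module name.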