[Tutor] Parsing and collecting keywords from a webpage

Thu Jun 21 05:03:25 EDT 2018

Daniel Bosah wrote:

> new_list = [x.encode('latin-1') for x in sorted(paul)]

I don't see why you would need bytes

>   search = "(" + b"|".join(new_list).decode() + ")" + "" #re.complie needs

when your next step is to decode it. I'm not sure why it even works as the 
default encoding is usually UTF-8.
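A small sketch of why that round trip is fragile, assuming Python 3 and a name that contains a non-ASCII character. The sample list is made up for illustration and is not taken from your script:

```python
# Hypothetical sample data; the real list comes from the original script.
names = ["José Antonio (Pepillo) Salcedo", "Juan Bosch"]

# str.join() works on text directly, so no bytes are needed.
joined = "|".join(sorted(names))

# Encoding as Latin-1 and then decoding with the default codec (UTF-8)
# fails as soon as a non-ASCII character such as "é" is involved.
try:
    b"|".join(n.encode("latin-1") for n in names).decode()
    round_trip_ok = True
except UnicodeDecodeError:
    round_trip_ok = False

print(joined)
print(round_trip_ok)
```

With pure ASCII input the round trip happens to succeed, which may be why the problem went unnoticed.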

> u'José Antonio (Pepillo) Salcedo'

Those parens combined with
> 
>   search = "(" + b"|".join(new_list).decode() + ")" + "" #re.complie needs
> string as first argument, so adds string to be first argument, and joins
> the strings together with john
> 
>  # print (type(search))
>   pattern = re.compile(search)#compiles search to be a regex object
>   reg = pattern.findall(str(soup))#calls findall on pattern, which findall

will cause findall() to return a list of 2-tuples:

>>> re.compile("(" + "|".join(["foo", "bar(baz)"]) + ")").findall("yadda foo yadda bar(baz)")
[('foo', '')]

Applying re.escape() can prevent that:

>>> re.compile("(" + "|".join(re.escape(s) for s in ["foo", "bar(baz)"]) + ")").findall("yadda foo yadda bar(baz)")
['foo', 'bar(baz)']
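Wrapped into a function this might look as follows. This is only a sketch: build_pattern() is a hypothetical helper, it assumes a non-empty keyword list, and sorting the keywords longest first is my addition so that a keyword is preferred over its own prefix. Without the outer group findall() returns the complete matches as plain strings.

```python
import re

def build_pattern(keywords):
    """Return a compiled regex that matches any of the keywords literally."""
    # Longest first, so "foobar" is tried before its prefix "foo".
    ordered = sorted(keywords, key=len, reverse=True)
    return re.compile("|".join(re.escape(k) for k in ordered))

pattern = build_pattern(["foo", "bar(baz)", "foobar"])
matches = pattern.findall("yadda foo yadda bar(baz) foobar")
print(matches)
```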

>      if i in reg and paul: # this loop checks to see if elements are in
> both the regexed parsed list and the list. If i is in both, it is added to
> list.

No, it doesn't; it is equivalent to

if (i in reg) and bool(paul):
    ...

or, since paul is a list

if (i in reg) and len(paul) > 0:
    ...

for non-empty lists effectively

if i in reg:
    ...
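If the intent was to keep only the items that occur in both reg and paul, each membership test has to be spelled out, or the two collections can be intersected as sets. A sketch with made-up sample data (the name candidates is hypothetical; reg and paul only mirror the names in your script):

```python
# Made-up sample data standing in for the lists from the original script.
candidates = ["Juan Bosch", "Pedro Mir", "Someone Else"]
reg = ["Pedro Mir", "Juan Bosch", "Someone Else"]
paul = ["Juan Bosch", "Pedro Mir"]

# "x in reg and paul" only checks that paul is non-empty ...
misleading = bool("Someone Else" in reg and paul)

# ... whereas two explicit membership tests check both lists.
both = [i for i in candidates if i in reg and i in paul]

# For larger inputs a set intersection does the same (order is lost,
# so the result is sorted here).
shared = sorted(set(reg) & set(paul))

print(misleading, both, shared)
```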
>             sets.append(str(i))
>             with open('sets.txt', 'w') as f:
>                 f.write(str(sets))

This writes a single line of the form

['first', 'second item', ...]

>                 f.close()

No need to close() the file explicitly -- with open() already implies that 
and operates more reliably (the file will be closed even if an exception is 
raised in the with-suite).
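To see this for yourself, here is a minimal sketch that raises an exception inside the with-suite on purpose. The temporary directory is only there to avoid touching your real sets.txt:

```python
import os
import tempfile

# Write into a throwaway directory instead of the working directory.
path = os.path.join(tempfile.mkdtemp(), "sets.txt")

try:
    with open(path, "w") as f:
        f.write("first\n")
        raise ValueError("simulated failure inside the with-suite")
except ValueError:
    pass

# The file object is closed although the suite was left via an exception.
print(f.closed)
```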

> def regexparse(regex):
>     monum = [u'road', u'blvd',u'street', u'town', u'city',u'Bernardo
>     Vega'] setss = []
> 
>     f = open('sets.txt', 'rt')
>     f = list(f)

From my explanation above it follows that the list f contains a single string 
(and one that does not occur in monum), so setss should always be empty.
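One possible way around this, sketched under the assumption that no keyword contains a newline, is to write one item per line and to strip the newline again when reading. The sample contents of sets are made up; monum is taken from your function:

```python
import os
import tempfile

# Hypothetical result of the first function; monum as in the quoted script.
sets = ["road", "Juan Bosch", "Bernardo Vega"]
monum = ["road", "blvd", "street", "town", "city", "Bernardo Vega"]

path = os.path.join(tempfile.mkdtemp(), "sets.txt")

# str(sets) would end up as one line; one item per line keeps them apart.
with open(path, "w", encoding="utf-8") as f:
    for item in sets:
        f.write(item + "\n")

# Strip the trailing newline, otherwise "road\n" != "road".
with open(path, encoding="utf-8") as f:
    found = [line.rstrip("\n") for line in f]

shared = [item for item in found if item in monum]
print(shared)
```

If the items may contain arbitrary text, a format like JSON (json.dump() and json.load() from the standard library) is probably the more robust choice.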

>     for i in f:
>        if i in f and i in monum:
>               setss.append(i)
>             #with open ('regex.txt','w') as q:
>                 #q.write(str(setss))
>                # q.close()
>     print (setss)
> 
> 
> if __name__ == '__main__':
>    regexparse(regex('
> 
https://en.wikipedia.org/wiki/List_of_people_from_the_Dominican_Republic'))
> 
> 
> What this code is doing is basically going through a webpage using
> BeautifulSoup and regex to compare a regexed list of words ( in regex ) to
> a list of keywords and then writing them to a textfile. The next function
> (regexparse) then goes and has a empty list (setss), then reads the
> textfile from the previous function.  What I want to do, in a for loop, is
> check to see if words in monum and the textfile ( from the regex function
> ) are shared, and if so , those shared words get added to the empty
> list(setss) , then written to a file ( this code is going to be added to a
> web crawler, and is basically going to be adding words and phrases to a
> txtfile as it crawls through the internet. ).
> 
> However, every time I run the current code, I get all the
> textfile(sets.txt) from the previous ( regex ) function, even though all I
> want are words and pharse shared between the textfile from regex and the
> monum list from regexparse. How can I fix this?

Don't write a complete script and then cross your fingers hoping that it 
will work as expected -- that rarely happens even to people with more 
experience; they just find their errors more quickly ;). Instead start with 
the first step, add print calls generously, and only continue working on the 
next step when you are sure that the first does exactly what you want.

Once your scripts get more complex, replace visual inspection via print()
with a more formal approach:

https://docs.python.org/dev/library/unittest.html
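
As a rough idea of what such a test could look like, here is a minimal sketch. find_keywords() is a hypothetical helper and not part of your script; it assumes a non-empty keyword list.

```python
import re
import unittest

def find_keywords(text, keywords):
    """Return the keywords found in text, in order of appearance."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords))
    return pattern.findall(text)

class FindKeywordsTest(unittest.TestCase):
    def test_parentheses_are_matched_literally(self):
        self.assertEqual(
            find_keywords("yadda foo yadda bar(baz)", ["foo", "bar(baz)"]),
            ["foo", "bar(baz)"],
        )

    def test_no_match_gives_empty_list(self):
        self.assertEqual(find_keywords("yadda", ["foo"]), [])
```

Saved as, say, test_keywords.py (the file name is arbitrary), the tests can be run with python -m unittest test_keywords.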