[Tutor] (regular expression)

Martin A. Brown martin at linux-ip.net
Sat Dec 10 22:36:57 EST 2016


Hello Isaac,

Your second posting has provided more information about what you 
are trying to accomplish and how (and it was also readable, whereas 
the first one looked like it got mangled by your mail user agent; 
it's best to post only plain-text messages to this sort of mailing 
list).

I suspect that we can help you a bit more, now.

If we knew even more about what you were looking to do, we might be 
able to help you further (with all of the usual remarks about how we 
won't do your homework for you, but all of us volunteers will gladly 
help you understand the tools, the systems, the world of Python and 
anything else we can suggest in the realm of computers, computer 
science and problem solving).

I will credit the person who assigned this task to you, as it is 
not dissimilar to the sort of problem one often faces with a new 
practical computing problem.  Often (and in your case) there are 
opaque structures and hidden assumptions in the question which need 
to be understood.  See further below....

These were your four lines of code:

>with urllib.request.urlopen("https://www.sdstate.edu/electrical-engineering-and-computer-science") as cs:
>    cs_page = cs.read()
>    soup = BeautifulSoup(cs_page, "html.parser")
>    print(len(soup.body.find_all(string = ["Engineering","engineering"])))

The fourth line is an impressive attempt at compressing all of the 
searching, finding, counting and reporting steps into a single line.  

Your task (I think) is more complicated than that single line can 
express, so it will need to be expanded into a few more lines of 
code.

You may have heard these aphorisms before:

  * brevity is the soul of wit
  * fewer lines of code are better
  * prefer a short elegant solution

But, when complexity intrudes into brevity, the human mind 
struggles.  As a practitioner, I will say that I spend more of my 
time reading and understanding code than writing it, so writing 
simple, self-contained and understandable units of code leads to 
intelligibility for humans and composability for systems.

Try this at a Python console [1].

  import this

>i used control + f on the link in the code and i get 11 for ctrl + 
>f and 3 for the code

Applause!  Look at the raw data!  Study the raw data!  That is an 
excellent way to start to try to understand the raw data.  You must 
always go back to the raw input data and then consider whether your 
tooling or the data model in your program matches what you are 
trying to extract/compute/transform.

The answer (for number of occurrences of the word 'engineering', 
case-insensitive) that I get is close to your answer when searching 
with control + f, but is a bit larger than 11.

Anyway, here are my thoughts.  I will start with some tips that are 
relevant to your 4-line pasted program:

  * BeautifulSoup is wonderfully convenient, but remember that it 
    is another high-level tool; it is often forgiving where other 
    tools are more rigorous.  However, it is excellent for learning 
    and (as I hope you will see below) it is a great tool for the 
    problem you are trying to solve

  * in your code, soup.body is a handle that points to the <body>
    tag of the HTML document you have fetched; so why can't you 
    simply find_all of the strings "Engineering" and "engineering" 
    in the text and count them?

      - find_all is a method that returns all of the tags in the
        structured document below (in this case) soup.body

      - your intent is not to count tags with the string
        'engineering' but rather, you are looking for that string 
        in the text (I think); see the short sketch after this list

  * it is almost always a mistake to try to process HTML with 
    regular expressions; however, it seems that you are trying to 
    find all matches of the (case-insensitive) word 'engineering' in 
    the text of this document, and that is something tailor-made for 
    regular expressions, so there's the Python regular expression 
    library, too:  'import re'

  * and on a minor note, since you are using urllib.request.urlopen()
    in a with statement (using contexts this way is wonderful), you
    could collect the data from the network socket, then drop out of 
    the 'with' block to allow the context to close, so if your block 
    worked as you wanted, you could adjust it as follows:

      uri = "https://www.sdstate.edu/electrical-engineering-and-computer-science"
      with urllib.request.urlopen(uri) as cs:
          cs_page = cs.read()
      soup = BeautifulSoup(cs_page, "html.parser")
      print(len(soup.body.find_all(string=["Engineering", "engineering"])))

  * On a much more minor point, I'll mention that urllib (and, in 
    Python 2, urllib2) ships with the standard Python releases, but 
    there are other libraries for handling fetching; I often 
    recommend the third-party requests [0] library, as it is very 
    Pythonic, reasonably high-level and frightfully flexible
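
To make that distinction concrete, here is a tiny, self-contained 
sketch; the HTML fragment is made up purely for illustration.  It 
shows that find_all(string=[...]) only matches text nodes whose 
entire text equals one of the listed values, while a 
case-insensitive regular expression over the body text counts every 
occurrence of the word, which is much closer to what control + f 
reports:

  from bs4 import BeautifulSoup
  import re

  # a made-up fragment, just to illustrate the two behaviours
  html = ("<body><p>Engineering</p>"
          "<p>We love engineering, especially software engineering.</p>"
          "</body>")
  soup = BeautifulSoup(html, "html.parser")

  # string=[...] only matches text nodes whose entire text equals
  # "Engineering" or "engineering", so this prints 1
  print(len(soup.body.find_all(string=["Engineering", "engineering"])))

  # a case-insensitive regex over the body text counts every
  # occurrence of the word, so this prints 3
  print(len(re.findall(re.compile("engineering", re.I), soup.body.text)))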

So, connecting the Zen of Python [1] to your problem, I would 
suggest making shorter, simpler lines and separating the logic.  See 
below:

Here are some code suggestions.

  * collect the relevant data:  once you have fetched the page (the 
    requests example in [0] leaves the response in r), get just the 
    part that you know you want to process as pure text, for 
    example:

      soup = BeautifulSoup(r.text, "html.parser")
      bodytext = soup.body.text

  * walk/process/compute the data:  search that text to find the 
    subset of data you wish to operate on, or which is itself the 
    answer:

      pattern = re.compile('engineering', re.I)
      matches = re.findall(pattern, bodytext)

  * report to the end user:  Finally, print it out

      print('Found "engineering" (case-insensitive) %d times.' % (len(matches),))
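
Putting those three steps together (and reusing the requests-based 
fetch from [0]), a complete version might look something like the 
sketch below; treat it as one possible arrangement rather than the 
only way to write it:

  import re
  import requests
  from bs4 import BeautifulSoup

  url = "https://www.sdstate.edu/electrical-engineering-and-computer-science"

  # collect: fetch the page and extract just the body text
  r = requests.get(url)
  if not r.ok:
      raise SystemExit("Failed to fetch %s" % (url,))
  soup = BeautifulSoup(r.text, "html.parser")
  bodytext = soup.body.text

  # process: find every (case-insensitive) occurrence of the word
  pattern = re.compile('engineering', re.I)
  matches = re.findall(pattern, bodytext)

  # report: tell the end user what was found
  print('Found "engineering" (case-insensitive) %d times.' % (len(matches),))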

Good luck and enjoy Python,

-Martin

 [0] http://docs.python-requests.org/en/master/

     import requests
     from bs4 import BeautifulSoup

     url = "https://www.sdstate.edu/electrical-engineering-and-computer-science"
     r = requests.get(url)
     if not r.ok:
         # -- die/return/handle-error here
         raise SystemExit("Failed to fetch %s" % (url,))
     soup = BeautifulSoup(r.text, "html.parser")

 [1] You do use the Python console to explore Python, your data and your code,
     don't you?

     $ python3
     Python 3.4.5 (default, Jul 03 2016, 13:55:08) [GCC] on linux
     Type "help", "copyright", "credits" or "license" for more information.
     >>> import this
     The Zen of Python, by Tim Peters

     Beautiful is better than ugly.
     Explicit is better than implicit.
     Simple is better than complex.
     Complex is better than complicated.
     Flat is better than nested.
     Sparse is better than dense.
     Readability counts.
     Special cases aren't special enough to break the rules.
     Although practicality beats purity.
     Errors should never pass silently.
     Unless explicitly silenced.
     In the face of ambiguity, refuse the temptation to guess.
     There should be one-- and preferably only one --obvious way to do it.
     Although that way may not be obvious at first unless you're Dutch.
     Now is better than never.
     Although never is often better than *right* now.
     If the implementation is hard to explain, it's a bad idea.
     If the implementation is easy to explain, it may be a good idea.
     Namespaces are one honking great idea -- let's do more of those!

-- 
Martin A. Brown
http://linux-ip.net/

