[Tutor] (regular expression)
Martin A. Brown
martin at linux-ip.net
Sat Dec 10 22:36:57 EST 2016
Hello Isaac,
This second posting of yours provides more information about what
you are trying to accomplish and how (and it was also readable,
where the first one looked like it got mangled by your mail user
agent; it's best to post only plain text messages to this sort of
mailing list).
I suspect that we can help you a bit more, now.
If we knew even more about what you were looking to do, we might be
able to help you further (with all of the usual remarks about how we
won't do your homework for you, but all of us volunteers will gladly
help you understand the tools, the systems, the world of Python and
anything else we can suggest in the realm of computers, computer
science and problem solving).
I will credit the person who assigned this task to you, as this is
not dissimilar from the sort of problem that one often has when
facing a new practical computing problem. Often (and in your case)
there is opaque structure and hidden assumptions in the question
which need to be understood. See further below....
These were your four lines of code:
>with urllib.request.urlopen("https://www.sdstate.edu/electrical-engineering-and-computer-science") as cs:
> cs_page = cs.read()
> soup = BeautifulSoup(cs_page, "html.parser")
> print(len(soup.body.find_all(string = ["Engineering","engineering"])))
The fourth line is an impressive attempt at compressing all of the
searching, finding, counting and reporting steps into a single line.
Your task (I think) is more complicated than that single line can
express. So, that will need to be expanded to a few more lines of
code.
You may have heard these aphorisms before:
* brevity is the soul of wit
* fewer lines of code are better
* prefer a short elegant solution
But, when complexity intrudes into brevity, the human mind
struggles. As a practitioner, I will say that I spend more of my
time reading and understanding code than writing it, so writing
simple, self-contained and understandable units of code leads to
intelligibility for humans and composability for systems.
Try this at a Python console [1].
import this
>i used control + f on the link in the code and i get 11 for ctrl +
>f and 3 for the code
Applause! Look at the raw data! Study the raw data! That is an
excellent way to start to try to understand the raw data. You must
always go back to the raw input data and then consider whether your
tooling or the data model in your program matches what you are
trying to extract/compute/transform.
The answer (for number of occurrences of the word 'engineering',
case-insensitive) that I get is close to your answer when searching
with control + f, but is a bit larger than 11.
Anyway, here are my thoughts. I will start with some tips that are
relevant to your 4-line pasted program:
* BeautifulSoup is wonderfully convenient, but remember that it
  is another high-level tool; it is often forgiving where other
  tools are more rigorous. Still, it is excellent for learning
  and (I hope you see below) a great tool for the problem you
  are trying to solve
* in your code, soup.body is a handle that points to the <body>
tag of the HTML document you have fetched; so why can't you
simply find_all of the strings "Engineering" and "engineering"
in the text and count them?
- find_all is a method that returns all of the tags in the
structured document below (in this case) soup.body
- your intent is not to count tags containing the string
  'engineering'; rather, you are looking for that string in
  the text (I think)
* it is almost always a mistake to try to process HTML with
regular expressions, however, it seems that you are trying to
find all matches of the (case-insensitive) word 'engineering' in
the text of this document; that is something tailor-made for
regular expressions, so there's the Python regular expression
library, too: 'import re'
* and on a minor note, since you are using urllib.request.urlopen()
  in a with statement (using contexts this way is wonderful), you
  could collect the data from the network socket, then drop out of
  the 'with' block to allow the context to close; if your block
  worked as you wanted, you could adjust it as follows:
with urllib.request.urlopen("https://www.sdstate.edu/electrical-engineering-and-computer-science") as cs:
    cs_page = cs.read()
soup = BeautifulSoup(cs_page, "html.parser")
print(len(soup.body.find_all(string=["Engineering", "engineering"])))
* On a much more minor point, I'll mention that urllib (and, in
  Python 2, urllib2) ships with the main Python releases, but
  there are other libraries for fetching over the network; I
  often recommend the third-party requests [0] library, as it is
  Pythonic, reasonably high-level and frightfully flexible
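To see why matching only exact spellings undercounts, here is a tiny
stdlib-only illustration (the sample sentence is made up for this
example):

```python
import re

# Made-up sample text (hypothetical, for illustration only).
text = "Engineering is fun. We love engineering. ENGINEERING WEEK starts Monday."

# Listing spellings by hand misses variants you did not anticipate:
exact = text.count("Engineering") + text.count("engineering")
print(exact)  # 2 -- misses the all-caps occurrence

# A case-insensitive pattern finds every occurrence in one pass:
pattern = re.compile('engineering', re.I)
matches = pattern.findall(text)
print(len(matches))  # 3
```

The same idea carries over to the real page: re.I spares you from
having to enumerate every capitalization by hand.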
So, connecting the Zen of Python [1] to your problem, I would
suggest making shorter, simpler lines and separating the logic. See
below:
Here are some code suggestions.
* collect the relevant data: Once you have fetched the text into
a variable, get just the part that you know you want to process
as pure text, for example:
soup = BeautifulSoup(r.text, "html.parser")
bodytext = soup.body.text
* walk/process/compute the data: search that text to find the
subset of data you wish to operate on or which are the answer:
pattern = re.compile('engineering', re.I)
matches = re.findall(pattern, bodytext)
* report to the end user: Finally, print it out
print('Found "engineering" (case-insensitive) %d times.' % (len(matches),))
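Put together, the three steps above can be sketched end-to-end. To
keep the sketch self-contained (no network access, no third-party
packages), it parses a small inline document with the stdlib
html.parser instead of BeautifulSoup and skips the fetch; the
collect/process/report split is the same:

```python
import re
from html.parser import HTMLParser

# Collect: in the real program this would come from requests or
# urllib; a small inline document stands in here.
html_doc = """
<html><body>
  <h1>Electrical Engineering</h1>
  <p>Our engineering programs cover software engineering too.</p>
</body></html>
"""

# A minimal text extractor; BeautifulSoup's soup.body.text plays
# the same role more conveniently.
class TextGrabber(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

grabber = TextGrabber()
grabber.feed(html_doc)
bodytext = " ".join(grabber.chunks)

# Process: find every case-insensitive occurrence of the word.
pattern = re.compile('engineering', re.I)
matches = pattern.findall(bodytext)

# Report.
print('Found "engineering" (case-insensitive) %d times.' % (len(matches),))
```

Each step lives on its own line, so each can be inspected (and
tested) separately at the console.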
Good luck and enjoy Python,
-Martin
[0] http://docs.python-requests.org/en/master/
import requests
url = "https://www.sdstate.edu/electrical-engineering-and-computer-science"
r = requests.get(url)
if not r.ok:
    raise SystemExit(r.status_code)  # -- die/return/handle-error here
soup = BeautifulSoup(r.text, "html.parser")
[1] You do use the Python console to explore Python, your data and your code,
don't you?
$ python3
Python 3.4.5 (default, Jul 03 2016, 13:55:08) [GCC] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import this
The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
--
Martin A. Brown
http://linux-ip.net/