[Baypiggies] More dramatic material?
Glen Jarvis
glen at glenjarvis.com
Wed Feb 24 03:23:30 CET 2010
I've been wanting to prepare materials on regular expressions for a very
long time. I'd habitually done what I needed without them, and because I
could get by, avoiding them became a crutch.
A Linux/Unix certification program at UC Berkeley has pushed me along this
way too (using ed, sed, awk, etc.). Regardless, I'm trying to come up with a
dramatic example of why to use regular expressions. I expected to see a
hundred-fold speedup in the regular expression numbers. However, I've
gotten only about a factor of two thus far.
I started with the Project Gutenberg download of War and Peace. I thought
that's gotta be a large amount of text.
Here I use a Pythonic, easy-to-read, simple example to count. We assume that
no occurrence of 'the' spans the end of one line and the start of the next.
(Note that this also counts 'the' inside longer words like 'there'; the regex
version below does the same, so the comparison stays fair.) Also, this is
slightly unfair in that I'm reading one line at a time instead of the whole
buffer at once (more on this later):
f = open("./war_and_peace.txt", 'r')
total = 0
for line in f.readlines():
    # Uppercase each line so the count is case-insensitive.
    total = total + line.upper().count("THE")
print "Total:", total
f.close()
Several runs on my laptop show this typical response:
real 0m0.203s
user 0m0.176s
sys 0m0.023s
I was disappointed it was so fast. Where's the drama in that? :)
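For what it's worth, the fairer whole-buffer version I just alluded to would
look something like this; it reads the file in one go, exactly as the regex
version below does:

f = open("./war_and_peace.txt", 'r')
contents = f.read()
f.close()
# One pass over the whole buffer instead of line by line.
print "Total:", contents.upper().count("THE")

It should give the same total, given the no-line-spanning assumption above.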
Regardless, I did the same run with the following regular expression. Now,
I didn't see a way to get re.split to do a case-insensitive search (see the
comment in the code below), so it may be that this version could be sped up
further with that approach:
import re

f = open("./war_and_peace.txt", 'r')
contents = f.read()
f.close()

# Note: re.split's third positional argument is maxsplit, not flags,
# so this would NOT do a case-insensitive split:
#m = re.split(r'the', contents, re.I)
m = re.split(r'[Tt][Hh][Ee]', contents)
# Splitting on the pattern yields one more piece than there are matches.
print len(m) - 1
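For the record, two ways that should give a case-insensitive split here (I
haven't timed them against the character-class version): an inline flag, or
a compiled pattern.

# Inline flag embedded in the pattern itself:
m = re.split(r'(?i)the', contents)
# Or compile with the IGNORECASE flag and split with the pattern object:
m = re.compile(r'the', re.I).split(contents)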
I get the same count, but in the following time (and I see this on
average):
real 0m0.154s
user 0m0.124s
sys 0m0.026s
Well, that's not *that* much of a speedup. Good old Python is already
pretty fast (and, in my mind, plain string methods are *much* easier to read
than regular expressions).
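One measurement caveat: time clocks the whole process, interpreter startup
included, and startup is a noticeable slice of a 0.15-0.2 second run.
Something like timeit would isolate just the counting step; a sketch:

import re
import timeit

contents = open("./war_and_peace.txt", 'r').read()

# Time only the counting work, excluding file I/O and startup.
print timeit.timeit(lambda: contents.upper().count("THE"), number=10)
print timeit.timeit(lambda: len(re.split(r'[Tt][Hh][Ee]', contents)) - 1,
                    number=10)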
I tried to do something more dramatic -- and avoided the upper-case issue
to keep the example more fair.
I downloaded the FASTA-format file of the Human Genome (Chromosome 1) from
this address:
ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_01/
Although it's not really a sensible biological search (mutations,
insertions, deletions, etc. keep this from being as 'clean' as regular
computer science), I just wanted to be dramatic. So I searched for 'TGGCCC'
with both versions of the code (again, ignoring any end-of-line
boundaries -- I'm just trying to show the speed difference).
Again, repetition gives very similar numbers. First, good old pure Python:
the same loop as before, with the upper() call removed.
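Concretely, something like this (the FASTA filename is just my guess at
what the download unpacks to):

f = open("./hs_chr1.fa", 'r')   # filename assumed
total = 0
for line in f.readlines():
    total = total + line.count("TGGCCC")
print "Total:", total
f.close()

Typical times: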
real 0m5.268s
user 0m4.474s
sys 0m0.715s
Using the equivalent regular expression (a plain literal, no special
matching needed):
m = re.split(r'TGGCCC', contents)
we get the same count, in this time:
real 0m5.118s
user 0m2.702s
sys 0m1.214s
Now we're looking at larger numbers, but again an improvement of about a
factor of two or less... Can you think of a better example than this?
Something with more 'wow'? Of course we could multiply the numbers by
putting everything inside a huge for loop, but I was hoping for something
more straightforward, where it's obvious why regular expressions are so
fast...
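One thought on where the 'wow' might actually live: for a fixed literal like
'TGGCCC', str.count already runs an optimized substring search in C, so a
regex has little room to win. Regular expressions should start paying off
when a single pass has to match many alternatives (or a real pattern) at
once. Here's a sketch of that comparison, untimed, with made-up motif
variants (the counts can differ slightly when different motifs overlap in
the text):

import re

contents = open("./hs_chr1.fa", 'r').read()   # filename assumed

motifs = ['TGGCCC', 'TGGACC', 'TGGTCC', 'TGGGCC']   # hypothetical variants

# Pure Python: one full pass over the data per motif.
total = sum(contents.count(m) for m in motifs)

# Regex: one compiled alternation, a single pass over the data.
pattern = re.compile('|'.join(motifs))
total_re = len(pattern.findall(contents))

print total, total_re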