[Baypiggies] More dramatic material?

Glen Jarvis glen at glenjarvis.com
Wed Feb 24 03:23:30 CET 2010

I've been wanting to prepare materials on regular expressions for a very
long time. I'd habitually done what I needed without them, and because I
could always get by, avoiding them became a crutch.

A Linux/Unix certification program at UC Berkeley has pushed me along this
way too (using ed, sed, awk, etc.). Regardless, I'm trying to come up with a
dramatic example of why to use regular expressions. I imagined I'd see a
hundred-fold speedup from the regular expression versions. However, I've
gotten only about twice the efficiency thus far.

I started with the Project Gutenberg download of War and Peace. I thought,
that's gotta be a large amount of text.

Here I use a pythonic, easy-to-read, simple example to count. We assume that
no occurrence of the word 'the' spans the end of one line and the start of
the next. Also, this is slightly unfair in that I'm reading one line at a
time instead of the whole buffer at once (more on this later):

f = open("./war_and_peace.txt", 'r')

total = 0
for line in f:
    # Substring count, so 'there' and 'other' are counted too
    total += line.upper().count("THE")
f.close()

print "Total:", total

Several runs on my laptop show this typical type of response:
real 0m0.203s
user 0m0.176s
sys 0m0.023s
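As an alternative to shelling out to time, the counting loop can also be timed in-process with the standard timeit module. A self-contained sketch (the repeated sentence is a stand-in corpus of my own; in practice it would be the contents of war_and_peace.txt):

```python
import timeit

# A stand-in corpus so the sketch is self-contained; in practice this
# would be the contents of war_and_peace.txt.
contents = "The cat saw the other theatre.\n" * 10000

def count_plain(text):
    # The same substring count as the loop above, on one buffer.
    return text.upper().count("THE")

# Time ten repetitions of the plain-Python count.
elapsed = timeit.timeit(lambda: count_plain(contents), number=10)
print(count_plain(contents))
```

This keeps the interpreter start-up cost out of the measurement, which matters when the work itself only takes a fraction of a second.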

I was disappointed it was so fast. Where was the drama in that? :)

Regardless, I did the same run with the following regular expression. Now,
I didn't see how to get split to do a good case-insensitive search (see the
comment in the code below), so it could be that this version could be sped
up further with the proper flag:

import re

f = open("./war_and_peace.txt", 'r')
contents = f.read()

# Note: re.split's third positional argument is maxsplit, not flags,
# so the line below would NOT do a case-insensitive split:
#m = re.split(r'the', contents, re.I)
# The character classes get case insensitivity the long way instead.
m = re.split(r'[Tt][Hh][Ee]', contents)

print len(m)-1
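For what it's worth, a compiled pattern does give a case-insensitive split directly, so the character-class trick isn't needed. A sketch on a made-up sample string of my own:

```python
import re

# Compiling with re.I makes the split case-insensitive.
pattern = re.compile(r'the', re.I)

sample = "The cat and THE dog chased the bird"
m = pattern.split(sample)

# One more piece than matches, same bookkeeping as above.
print(len(m) - 1)
```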

I get the same actual results, but in the following time:
real 0m0.154s
user 0m0.124s
sys 0m0.026s

Well, that's not *that* much of an improvement. Good old Python - already
pretty fast (and, in my mind, *much* easier to read than regular expressions).

I tried to do something more dramatic -- and avoided the upper-case issue to
keep the example fairer.

I downloaded the FASTA Format of the Human Genome (Chromosome 1) from this


Although it's not really a sensible biological search (mutations, insertions,
deletions, etc. keep this from being as 'clean' as regular computer
science), I just wanted to be dramatic. So, I searched for 'TGGCCC' with both
approaches (again, we'll ignore any end-of-line boundaries -- just trying to
show the speed performance).
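The pure-Python version is just the counting idea from before with the upper() dropped. Sketched here on a short made-up sequence rather than the real chromosome file:

```python
# A toy sequence standing in for the chromosome 1 FASTA contents.
contents = "TGGCCCATTGGCCCGGTGGCCC"

# Plain substring counting; no case folding needed since FASTA
# bases are already upper-case.
total = contents.count("TGGCCC")
print(total)
```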

Again, repetition gives very similar numbers. Using good old pure Python
(with the upper() removed):
real 0m5.268s
user 0m4.474s
sys 0m0.715s

Using the equivalent regular expression (no special matching characters):

m = re.split(r'TGGCCC', contents)

we get the same results, in the following time:

real 0m5.118s
user 0m2.702s
sys 0m1.214s

Now we're looking at a larger improvement. But, again, a factor of about two
or less… Can you think of a better example than this? Something more 'wow'?
Of course we can inflate the numbers by putting in a huge for loop… but I was
hoping for something more straightforward, where it's obvious why regular
expressions are so fast…
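One suggestion of my own, not from any benchmark above: regexes start to pull ahead when the pattern itself is non-trivial, e.g. a motif with a variable position, where one regex scan replaces one str.count() scan per variant. A self-contained sketch on a toy sequence:

```python
import re

# Toy sequence; in practice this would be the chromosome contents.
seq = "TGGCCCAATGGACCGGTGGTCC"

# One pass over the text for a motif with a variable fourth base...
one_pass = len(re.findall(r'TGG[ACGT]CC', seq))

# ...versus four separate str.count() scans, one per variant.
four_scans = sum(seq.count('TGG%sCC' % base) for base in 'ACGT')

print(one_pass)
print(four_scans)
```

The counts agree, but the regex touches the text once while the pure-Python version touches it four times, and the gap widens as the alternation grows.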
