[Baypiggies] More dramatic material?

Damon McCormick damonmc at gmail.com
Wed Feb 24 12:47:32 CET 2010


When searching for a simple string literal, there's no reason a
non-regex string method like Python's find(), count(), or index() needs to be
slower than a regex function.  If one of them is slower, that just indicates
room to optimize its implementation further.
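
For a quick sanity check of that claim, here's a rough timing sketch
(assuming the same war_and_peace.txt file from your example; the exact
numbers will of course vary by machine):

import re
import timeit

text = open("war_and_peace.txt").read()

# Plain string method: count literal occurrences of "the"
plain = timeit.timeit(lambda: text.count("the"), number=100)

# Equivalent regex: the same literal, no metacharacters involved
pattern = re.compile("the")
regex = timeit.timeit(lambda: len(pattern.findall(text)), number=100)

print "str.count: ", plain
print "re.findall:", regex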

A _War and Peace_ search where regular expressions shine more is this
one, which finds all words that contain four vowels in a row:

import re

text = open("WnP.txt").read()  # in production code, we'd close the file!
words = re.findall(r"\w*[aeiou]{4}\w*", text, re.I)  # \w runs containing four vowels in a row
unique_words = set(words)

Using pure Python without regular expressions, you'd be hard-pressed to
write something as concise or as fast.
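
A rough non-regex sketch for comparison (not character-for-character
equivalent to the version above, since it splits on whitespace and strips a
fixed set of punctuation rather than matching \w runs) might look like this:

VOWELS = set("aeiouAEIOU")

def has_four_vowel_run(word):
    # True if the word contains four consecutive vowels
    run = 0
    for ch in word:
        if ch in VOWELS:
            run += 1
            if run == 4:
                return True
        else:
            run = 0
    return False

text = open("WnP.txt").read()
unique_words = set()
for word in text.split():
    word = word.strip(".,;:!?\"'()")  # trim surrounding punctuation
    if has_four_vowel_run(word):
        unique_words.add(word)

It works, but that's a lot more code to express the same idea.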

If some of the extended regular expression syntax seems too cryptic, you
might prefer an alternative style that hierarchically builds up your
expression from sub-expressions, which are in turn built from basic regex
constructs:

optional_letters = "[A-Za-z]*"   # zero or more letters, both cases spelled out instead of re.I
vowel = "[AEIOUaeiou]"           # a single vowel, either case
pattern = optional_letters + 4*vowel + optional_letters   # 4*vowel repeats the class four times
unique_words = set(re.findall(pattern, text))
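
One nice side effect of building the pattern up from named pieces is that
you can print the composed expression to see exactly what gets handed to
re.findall():

print pattern
# [A-Za-z]*[AEIOUaeiou][AEIOUaeiou][AEIOUaeiou][AEIOUaeiou][A-Za-z]*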

In your chromosome 1 example, you should again be able to see regular
expressions shine when searching for a more general pattern.  For instance,
you might want to find long runs of a given base, using a pattern like
"GGGGG+".

import gzip
import re

text = gzip.open("chr1.fa.gz").read()
runs = re.findall("GGGGG+", text)   # every run of five or more consecutive G's
longest_run = max(runs, key=len)
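
You mentioned ignoring end-of-line boundaries; if you'd rather not miss runs
that are split across lines, one rough sketch (assuming the usual FASTA
convention that header lines start with ">") is to drop the headers and join
the sequence lines before searching:

import gzip
import re

lines = gzip.open("chr1.fa.gz").read().splitlines()
# Drop FASTA header lines and join the sequence, so a run split across
# lines still shows up as one match.
sequence = "".join(line for line in lines if not line.startswith(">"))

runs = re.findall("GGGGG+", sequence)
longest_run = max(runs, key=len)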


-Damon


On Tue, Feb 23, 2010 at 6:23 PM, Glen Jarvis <glen at glenjarvis.com> wrote:

> I've been wanting to prepare materials for regular expressions for a very
> long time. I'd habitually done what I needed without them. And, because I
> could get by, it was a crutch.
>
> A Linux/Unix certification program at UC Berkeley has pushed me along this
> way too (using ed, sed, awk, etc.)  Regardless, I'm trying to come up with a
> dramatic example of why to use regular expressions. I expected to see a
> hundred-fold speedup from the regular expression versions. However, I've
> gotten only about a two-fold improvement thus far.
>
> I started with the Project Gutenberg download of War and Peace. I thought,
> that's gotta be a large amount of text.
>
> Here I use a Pythonic, easy-to-read, simple example to count. We assume
> that no word 'the' spans from the end of one line onto the start of the
> next. Also, this is slightly unfair in that I'm reading one line at
> a time instead of the whole buffer at once (more on this later):
>
> f = open("./war_and_peace.txt", 'r')
>
> total = 0
> for line in f.readlines():
>     total = total + line.upper().count("THE")
>
> print "Total:", total
> f.close()
>
> Several runs on my laptop show this typical type of response:
> real 0m0.203s
> user 0m0.176s
> sys 0m0.023s
>
> I was disappointed it was so fast. Where was the drama in that :)
>
> Regardless, I did the same run with the following regular expression. Now,
> I didn't see that split() let me do a case-insensitive search (see the
> comment in the code below), so it may be that this could be dramatically
> sped up by taking that approach:
>
> import re
>
> f = open("./war_and_peace.txt", 'r')
> contents = f.read()
>
> # I didn't see I could do case insensitive searching with this flag
> #m = re.split(r'the', contents, re.I)
> m = re.split(r'[Tt][Hh][Ee]', contents)
>
> print len(m)-1
>
> I get the same results, but in the following time (and I see this on
> average):
> real 0m0.154s
> user 0m0.124s
> sys 0m0.026s
>
> Well, that's not *that* much of an improvement... Good old Python - already
> pretty fast (and in my mind *much* easier to read than regular expressions).
>
> I tried to do something more dramatic -- and avoid the upper case issue to
> keep the example more fair.
>
> I downloaded the FASTA Format of the Human Genome (Chromosome 1) from this
> address:
>
> ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_01/
>
>
> Although it's not really a sensible biological search (mutations, insertions,
> deletions, etc. keep this from being as 'clean' as regular computer
> science), I just wanted to be dramatic. So, I searched for 'TGGCCC' in both
> files (again, we'll ignore any end of line boundaries -- just trying to show
> the speed performance).
>
> Again, repetition gives very similar numbers. Using good old pure Python
> (removing the upper()):
> real 0m5.268s
> user 0m4.474s
> sys 0m0.715s
>
> Using the equivalent of a fast regular expression (no special matching
> needed):
>
> m = re.split(r'TGGCCC', contents)
>
> we get the same results and in time:
>
> real 0m5.118s
> user 0m2.702s
> sys 0m1.214s
>
> Now, we're looking at a large increase. But, again, a factor of about two
> or less… Can you think of a better example than this? Something more 'wow'…
> Of course we can multiply the numbers by putting in a huge for loop… but I
> was hoping for something more straightforward that makes it obvious why
> regular expressions are so fast…
>
> _______________________________________________
> Baypiggies mailing list
> Baypiggies at python.org
> To change your subscription options or unsubscribe:
> http://mail.python.org/mailman/listinfo/baypiggies
>

