When searching for a simple string literal, there's no reason a non-regex method like Python's find(), count(), or index() should be slower than a regex function. If one is slower, that just indicates room to optimize its implementation further.<div>
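That claim is easy to check directly. Here's a minimal timing sketch comparing str.count against re.findall for the same fixed literal; the corpus is a synthetic stand-in (an assumption, since no filename is given here), and the exact numbers will vary by Python version and build:

```python
import re
import time

# Synthetic stand-in corpus (assumption: any large text behaves similarly).
text = "the quick brown fox jumps over the lazy dog " * 100_000

t0 = time.perf_counter()
n_plain = text.count("the")             # plain literal search
t_plain = time.perf_counter() - t0

t0 = time.perf_counter()
n_regex = len(re.findall("the", text))  # regex search for the same literal
t_regex = time.perf_counter() - t0

assert n_plain == n_regex               # both count the same matches
print(f"str.count: {t_plain:.4f}s  re.findall: {t_regex:.4f}s")
```

Either may win on a given build; the point is only that nothing about the literal case forces the non-regex method to lose.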
<br></div><div>A _War and Peace_ search where regular expressions would shine more is this one, which finds all words that contain four vowels in a row:</div><div><div><div><br></div><div><font class="Apple-style-span" face="'courier new', monospace">import re</font></div><div><font class="Apple-style-span" face="'courier new', monospace">text = open("WnP.txt").read() # in production code, we'd close the file!</font></div>
<div><font class="Apple-style-span" face="'courier new', monospace">words = re.findall(r"\w*[aeiou]{4}\w*", text, re.I)</font></div><div><font class="Apple-style-span" face="'courier new', monospace">unique_words = set(words)</font></div>
<div><br></div><div>In pure Python, you'd be hard-pressed to write something as concise or as fast without regular expressions.</div><div><br></div><div>If some of the extended regular expression syntax seems too cryptic, you might prefer an alternative style that builds your expression up hierarchically from sub-expressions, which are in turn built from basic regex constructs:</div>
<div><br></div><div><font class="Apple-style-span" face="'courier new', monospace">optional_letters = "[A-Za-z]*"</font></div><div><font class="Apple-style-span" face="'courier new', monospace">vowel = "[AEIOUaeiou]"</font></div>
<div><font class="Apple-style-span" face="'courier new', monospace">pattern = optional_letters + 4*vowel + optional_letters</font></div><div><font class="Apple-style-span" face="'courier new', monospace">unique_words = set(re.findall(pattern, text))</font></div>
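For contrast, here's roughly what a pure-Python version of the four-vowel search might look like. This is only a sketch: it splits on whitespace and trims punctuation, so it merely approximates what \w* matches in the regex version:

```python
VOWELS = set("aeiouAEIOU")

def four_vowel_words(text):
    """Collect words containing four consecutive vowels, without regexes."""
    found = set()
    for word in text.split():
        word = word.strip(".,;:!?\"'()")  # crude punctuation trim
        run = 0
        for ch in word:
            if ch in VOWELS:
                run += 1
                if run >= 4:
                    found.add(word)
                    break
            else:
                run = 0
    return found

print(four_vowel_words("Your queueing is onomatopoeia, really."))
```

It works, but it's a dozen lines instead of one, and every edge case (punctuation, hyphenation) needs hand-rolled handling that the regex gets for free.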
<div><br></div><div>In your chromosome 1 example, you should again be able to see regular expressions shine when searching for a more general pattern. For instance, you might want to find long runs of a given base, using a pattern like "GGGGG+".</div>
<div><br></div><div><font class="Apple-style-span" face="'courier new', monospace">import gzip</font></div><div><font class="Apple-style-span" face="'courier new', monospace">text = gzip.open("chr1.fa.gz").read()</font></div>
<div><font class="Apple-style-span" face="'courier new', monospace">runs = re.findall("GGGGG+", text)</font></div><div><font class="Apple-style-span" face="'courier new', monospace">longest_run = max(runs, key=len)</font></div>
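One caveat about searching a FASTA file directly: it begins with a '>' header line, and the sequence is wrapped at a fixed width, so a run of G's can span a line break and be undercounted. A hedged sketch that normalizes the text first (the filename and the "rt" text mode are assumptions; G{5,} is just another spelling of GGGGG+):

```python
import gzip
import re

def longest_g_run(path):
    """Longest run of 5+ G's, skipping FASTA headers and joining wrapped lines."""
    with gzip.open(path, "rt") as f:                # "rt": decode bytes to text
        seq = "".join(line.strip() for line in f
                      if not line.startswith(">"))  # drop '>' header lines
    runs = re.findall("G{5,}", seq.upper())         # G{5,} is equivalent to GGGGG+
    return max(runs, key=len) if runs else ""       # avoid max() on an empty list
```

Joining the lines first matters: a run split across lines as ...GGG / GGGG... would otherwise be counted as two short runs, or missed entirely.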
<div><font class="Apple-style-span" face="arial, helvetica, sans-serif"><font class="Apple-style-span" face="'courier new', monospace"><br></font></font></div><div><font class="Apple-style-span" face="'courier new', monospace"><br>
</font></div><div>-Damon</div><div><br></div><div><div><div><br><div class="gmail_quote">On Tue, Feb 23, 2010 at 6:23 PM, Glen Jarvis <span dir="ltr"><<a href="mailto:glen@glenjarvis.com" target="_blank">glen@glenjarvis.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div>I've been wanting to prepare materials on regular expressions for a very long time. I'd habitually done what I needed without them, and because I could get by, that became a crutch. </div>
<div><br></div><div>A Linux/Unix certification program at UC Berkeley has pushed me along this way too (using ed, sed, awk, etc.). Regardless, I'm trying to come up with a dramatic example of why to use regular expressions. I expected to see a hundred-fold improvement in the regular expression numbers. However, I've seen only about a two-fold speedup thus far.</div>
<div><br></div><div>I started with the Project Gutenberg download of War and Peace. I thought, that's got to be a large amount of text.</div><div><br></div><div>Here I use a Pythonic, easy-to-read example to count matches. We assume that no occurrence of 'the' spans from the end of one line onto the next. Also, this comparison is slightly unfair in that I'm reading one line at a time instead of the whole buffer at once (more on this later):</div>
<div><br></div><div>f = open("./war_and_peace.txt", 'r')</div><div><br></div><div>total = 0</div><div>for line in f.readlines():</div><div> total = total + line.upper().count("THE")</div><div>
<br></div><div>print "Total:", total</div><div>f.close()</div><div><br></div><div>Several runs on my laptop show this typical response:</div><div>real<span style="white-space:pre">        </span>0m0.203s</div>
<div>user<span style="white-space:pre">        </span>0m0.176s</div><div>sys<span style="white-space:pre">        </span>0m0.023s</div><div><br></div><div>I was disappointed it was so fast. Where was the drama in that? :)</div>
<div><br></div><div>Regardless, I did the same run with the following regular expressions. Now, I didn't see how split would let me do a good case-insensitive search (see the comment in the code below), so it may be that this could be sped up further with that approach:</div>
<div><br></div><div>import re</div><div><br></div><div>f = open("./war_and_peace.txt", 'r')</div><div>contents = f.read()</div><div><br></div><div># I didn't see I could do case insensitive searching with this flag</div>
<div>#m = re.split(r'the', contents, re.I)</div><div>m = re.split(r'[Tt][Hh][Ee]', contents)</div><div><br></div><div>print len(m)-1</div><div><br></div><div>I get the same actual results, but in the following time (and see this on average).</div>
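One footnote on that commented-out line: re.split's third positional argument is maxsplit, not flags, so passing re.I there would silently limit the number of splits rather than ignore case. With the flags keyword (available since Python 2.7), the case-insensitive version can be written directly; contents here is a small stand-in string for illustration:

```python
import re

contents = "The cat and the dog chased THE bird."

# flags must be passed by keyword to re.split -- a bare third positional
# argument is interpreted as maxsplit, not as flags
parts = re.split(r'the', contents, flags=re.IGNORECASE)
print(len(parts) - 1)  # prints 3: the number of case-insensitive matches
```

Equivalently, len(re.findall(r'the', contents, re.I)) gives the same count, since findall's third positional argument really is flags.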
<div>real<span style="white-space:pre">        </span>0m0.154s</div><div>user<span style="white-space:pre">        </span>0m0.124s</div><div>sys<span style="white-space:pre">        </span>0m0.026s</div>
<div><br></div><div>Well, that's not *that* much of an improvement... Good old Python - already pretty fast (and, in my mind, *much* easier to read than regular expressions).</div><div><br></div><div>I tried to do something more dramatic -- and avoided the uppercase issue to keep the example fairer.</div>
<div><br></div><div>I downloaded the FASTA Format of the Human Genome (Chromosome 1) from this address:</div><div><br></div><div><a href="ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_01/" target="_blank">ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_01/</a></div>
<div><br></div><div><br></div><div>Although it's not really a sensible biological search (mutations, insertions, deletions, etc. keep this from being as 'clean' as typical computer science problems), I just wanted to be dramatic. So, I searched for 'TGGCCC' with both approaches (again, we'll ignore any end-of-line boundaries -- just trying to show the speed difference).</div>
<div><br></div><div>Again, repetition gives very similar numbers. Using good old pure Python (removing the 'upper()'):</div><div>real<span style="white-space:pre">        </span>0m5.268s</div><div>user<span style="white-space:pre">        </span>0m4.474s</div>
<div>sys<span style="white-space:pre">        </span>0m0.715s</div><div><br></div><div>Using the equivalent of a fast regular expression (no special matching needed):</div><div><br></div><div>m = re.split(r'TGGCCC', contents)</div>
<div><br></div><div>we get the same results, in this time:</div><div><br></div><div>real<span style="white-space:pre">        </span>0m5.118s</div><div>user<span style="white-space:pre">        </span>0m2.702s</div>
<div>sys<span style="white-space:pre">        </span>0m1.214s</div><div><br></div><div>Now, we're looking at a larger difference. But, again, a factor of about two or less… Can you think of a better example than this? Something more 'wow'… Of course we could inflate the numbers by putting in a huge for loop… but I was hoping for something more straightforward that makes it obvious why regular expressions are so fast…</div>
<br>_______________________________________________<br>
Baypiggies mailing list<br>
<a href="mailto:Baypiggies@python.org" target="_blank">Baypiggies@python.org</a><br>
To change your subscription options or unsubscribe:<br>
<a href="http://mail.python.org/mailman/listinfo/baypiggies" target="_blank">http://mail.python.org/mailman/listinfo/baypiggies</a><br></blockquote></div><br></div></div></div></div>
</div>