But so did the equivalent regular expression, yes?  I checked that the count gave the same result with both. <br><br>But you&#39;re right. My example is poor, since I said I was counting the word &#39;the&#39;. As I&#39;m sure was obvious to you, the DNA example wasn&#39;t a real-life example either :(   Who searches DNA for exact matches only instead of BLASTing? I dunno... maybe there&#39;s a reason for it. But, the best example for the medium...<br>

<br>Cheers,<br><br><br>Glen<br><br><div class="gmail_quote">On Wed, Feb 24, 2010 at 7:28 PM, Andrew Dalke <span dir="ltr">&lt;<a href="mailto:dalke@dalkescientific.com">dalke@dalkescientific.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<div class="im">On Feb 23, 2010, at 9:23 PM, Glen Jarvis wrote:<br>

&gt; total = 0<br>

&gt; for line in f.readlines():<br>

&gt;     total = total + line.upper().count(&quot;THE&quot;)<br>

<br>

</div>I don&#39;t see that anyone else mentioned this - your code here also matches &quot;there&quot; and &quot;other&quot; and quite a few other words. You can&#39;t easily use count to count the number of &quot;the&quot; words with this approach, since it&#39;s hard to include lines like<br>


<br>

   late Emperor, and was nicknamed &#39;the King of Prussia.&#39; He is very clever<br>

and<br>

   &quot;&#39;The Brook,&#39;&quot; suggested Nicholas.<br>

<br>

but exclude lines like<br>

<br>

   &quot;There, leave Bruin alone; here&#39;s a bet on.&quot;<br>

and<br>

   bathe.<br>

<br>

<br>

Regular expressions make this rather easy to do, since you can define a pattern like<br>

<br>

   &quot;(^|\W)the(\W|$)&quot;, re.I<br>

<br>

This matches &quot;the&quot; only when it is preceded by either the start of the line or a non-word character, and followed by either a non-word character or the end of the line.<br>
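A minimal sketch of counting with that pattern, run against the four sample lines quoted above. The \b variant and the helper name count_the are assumptions of this sketch, not part of the original message. The \b form is included because the (^|\W) form consumes the neighboring character, so two adjacent words such as "the the" are only counted once.<br>

```python
import re

# Pattern from the message above.
CONSUMING = re.compile(r"(^|\W)the(\W|$)", re.I)
# Assumed alternative: \b is a zero-width word boundary, so it does not
# consume the characters on either side of the word.
BOUNDARY = re.compile(r"\bthe\b", re.I)

def count_the(line, pattern=BOUNDARY):
    """Count non-overlapping matches of pattern in one line of text."""
    return sum(1 for _ in pattern.finditer(line))

samples = [
    "late Emperor, and was nicknamed 'the King of Prussia.' He is very clever",
    "\"'The Brook,'\" suggested Nicholas.",
    "\"There, leave Bruin alone; here's a bet on.\"",
    "bathe.",
]

for line in samples:
    print(count_the(line), line)
```

The first two lines each count once and the last two count zero, which str.count cannot do on its own.<br>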

<br>

<br>

<br>

Plus, performance-wise it&#39;s best to work with a block of text at a time, and not a line at a time. You can get some of the speed advantage by using the default iterator for a file instead of &#39;readlines&#39; -- the iterator reads a block of text at a time and splits it into lines as it goes, while readlines() builds a list of every line in the file before the loop even starts.<br>


<br>

That is, try timing it with:<br>

<br>

total = 0<br>

for line in f:<br>

<div class="im">    total = total + line.upper().count(&quot;THE&quot;)<br>

<br>

</div>Or, try using:<br>

<br>

total = 0<br>

while 1:<br>

    block = f.read(10000) + f.readline()<br>

    if not block:<br>

        break<br>

    total = total + block.upper().count(&quot;THE&quot;)<br>

<br>
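As a rough check that the block-at-a-time loop counts the same thing as the line-at-a-time loop, here is a sketch that runs both over an in-memory file. The function names, the sample text, and the tiny block size are made up for illustration. The f.readline() after f.read() extends each block to the end of the current line, so a match is never split across two blocks.<br>

```python
import io

def count_by_line(f, needle="THE"):
    """Iterate over the file object directly, one line at a time."""
    total = 0
    for line in f:
        total += line.upper().count(needle)
    return total

def count_by_block(f, needle="THE", block_size=10000):
    """Read a block, then finish the current line so no match is cut in half."""
    total = 0
    while True:
        block = f.read(block_size) + f.readline()
        if not block:
            break
        total += block.upper().count(needle)
    return total

# Hypothetical sample text; a deliberately small block size forces many blocks.
text = "the cat\nbathe there\n" * 50
by_line = count_by_line(io.StringIO(text))
by_block = count_by_block(io.StringIO(text), block_size=7)
print(by_line, by_block)
```

To compare speed, the same two functions can be pointed at a real file opened with open() and timed separately.<br>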

<br>

[Using a FASTA file]<br>

<div class="im">&gt; Again, repetition gives very similar numbers. Using good old pure python (removing the &#39;upper()&#39;:<br>

&gt; real  0m5.268s<br>

&gt; user  0m4.474s<br>

&gt; sys   0m0.715s<br>

&gt;<br>

&gt; Using the equivalent of a fast regular expression (no special matching needed):<br>

&gt;<br>

&gt; m = re.split(r&#39;TGGCCC&#39;, contents)<br>

&gt;<br>

&gt; we get the same results and in time:<br>

&gt;<br>

&gt; real  0m5.118s<br>

&gt; user  0m2.702s<br>

&gt; sys   0m1.214s<br>

&gt;<br>

&gt; Now, we&#39;re looking at a large increase. But, again, a factor of about two or less…<br>

<br>

</div>Is that repeatable and consistent? I&#39;m surprised at those numbers, since both implementations should essentially be the same. That is, I think the re module looks at the input string, realizes it&#39;s a constant string, and uses the same algorithm as string.find.<br>
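One way to check repeatability is to time each approach several times with timeit and take the best run. This is only a sketch: the contents string is a made-up stand-in for the FASTA data, and the function names are hypothetical. It shows that the three approaches agree on the count; it says nothing about which algorithm the re module uses internally.<br>

```python
import re
import timeit

contents = "ATGGCCCTTGGCCCA" * 1000  # made-up stand-in for the FASTA data
needle = "TGGCCC"

def count_with_str():
    return contents.count(needle)

def count_with_split():
    # re.split returns one more piece than the number of matches.
    return len(re.split(needle, contents)) - 1

def count_with_findall():
    return len(re.findall(needle, contents))

for func in (count_with_str, count_with_split, count_with_findall):
    # Best of 5 repeats, 20 calls each, to smooth out noise.
    seconds = min(timeit.repeat(func, number=20, repeat=5))
    print(func.__name__, func(), round(seconds, 4))
```

Taking the minimum over several repeats gives a steadier number than a single run of the shell&#39;s time command.<br>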


<div class="im"><br>

&gt; Can you think of a better example than this? Something more &#39;wow&#39;..<br>

<br>

</div>Staying in bioinformatics, try parsing fields from a BLAST output file. For something more general, look in the Python standard library for instances of re.compile. Here&#39;s a definition from decimal.py:<br>

<br>

_parser = re.compile(r&quot;&quot;&quot;        # A numeric string consists of:<br>

#    \s*<br>

    (?P&lt;sign&gt;[-+])?              # an optional sign, followed by either...<br>

    (<br>

        (?=\d|\.\d)              # ...a number (with at least one digit)<br>

        (?P&lt;int&gt;\d*)             # having a (possibly empty) integer part<br>

        (\.(?P&lt;frac&gt;\d*))?       # followed by an optional fractional part<br>

        (E(?P&lt;exp&gt;[-+]?\d+))?    # followed by an optional exponent, or...<br>

    |<br>

        Inf(inity)?              # ...an infinity, or...<br>

    |<br>

        (?P&lt;signal&gt;s)?           # ...an (optionally signaling)<br>

        NaN                      # NaN<br>

        (?P&lt;diag&gt;\d*)            # with (possibly empty) diagnostic info.<br>

    )<br>

#    \s*<br>

    \Z<br>

&quot;&quot;&quot;, re.VERBOSE | re.IGNORECASE | re.UNICODE).match<br>

<br>

<br>

Try doing that without a regular expression.<br>
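To get a feel for how such a pattern is used, here is a cut-down sketch covering only the sign, integer, fraction and exponent parts. It uses numbered groups instead of named ones and leaves out the Inf and NaN branches, so it is an illustration and not the code in decimal.py. The name parse_number is hypothetical.<br>

```python
import re

# Simplified numeric pattern: numbered groups, no Inf/NaN handling.
NUMBER = re.compile(r"""
    ([-+])?            # group 1: optional sign
    (?=\d|\.\d)        # must contain at least one digit
    (\d*)              # group 2: possibly empty integer part
    (\.(\d*))?         # group 4: optional fractional part
    (E([-+]?\d+))?     # group 6: optional exponent
    \Z
""", re.VERBOSE | re.IGNORECASE)

def parse_number(text):
    """Return (sign, int, frac, exp) as strings or None, or None if no match."""
    m = NUMBER.match(text)
    if m is None:
        return None
    return m.group(1, 2, 4, 6)

for text in ("-12.5e3", ".5", "42", ".", "abc"):
    print(repr(text), parse_number(text))
```

A hand-written parser for the same grammar would need separate code for each optional part, which is the point of the example above.<br>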

<div><div></div><div class="h5"><br>

<br>

                                Andrew<br>

                                <a href="mailto:dalke@dalkescientific.com">dalke@dalkescientific.com</a><br>

<br>

<br>

_______________________________________________<br>

Baypiggies mailing list<br>

<a href="mailto:Baypiggies@python.org">Baypiggies@python.org</a><br>


</div></div></blockquote></div><br>