[Baypiggies] More dramatic material?

Glen Jarvis glen at glenjarvis.com
Thu Feb 25 05:35:56 CET 2010


But so did the equivalent regular expression, yes?  I compared that the
count gave the same results in both.

But, you're right. My example is poor since I said I was counting the word
'the'. As I'm sure it was obvious to you, the DNA example wasn't a real life
example either :(   Who searches DNA for exact matches only instead of
BLASTing? I dunno.. maybe there's a reason for it. But, the best example for
the medium...

Cheers,


Glen

On Wed, Feb 24, 2010 at 7:28 PM, Andrew Dalke <dalke at dalkescientific.com>wrote:

> On Feb 23, 2010, at 9:23 PM, Glen Jarvis wrote:
> > total = 0
> > for line in f.readlines():
> >     total = total + line.upper().count("THE")
>
> I don't see that anyone else mentioned this - your code here also matches
> "there" and "other" and quite a few other words. You can't easily use count
> to count the number of "the" words with this approach, since it's hard to
> include lines like
>
>   late Emperor, and was nicknamed 'the King of Prussia.' He is very clever
> and
>   "'The Brook,'" suggested Nicholas.
>
> but exclude lines like
>
>   "There, leave Bruin alone; here's a bet on."
> and
>   bathe.
>
>
> Regular expressions make this rather easy to do, since you can define a
> pattern like
>
>   "(^|\W)the(\W|$)", re.I
>
> to match "the" which is either at the start of the line or has a non-word
> character, followed by "the", followed by a non-word character or the end of
> line.
>
>
>
> Plus, performance-wise it's best to work with a block of text at a time,
> and not a line at a time. You can get some of the speed advantage by using
> the default iterator for a file instead of 'readline' -- the former reads a
> block of text at a time and breaks up the newlines manually, while the
> latter reads a character at a time, which takes a lot of system calls.
>
> That is, try timing it with:
>
> total = 0
> for line in f:
>     total = total + line.upper().count("THE")
>
> Or, try using:
>
> total = 0
> while 1:
>    block = f.read(10000) + f.readline()
>    if not block:
>        break
>    total = total + block.upper().count("THE")
>
>
> [Using a FASTA file]
> > Again, repetition gives very similar numbers. Using good old pure python
> (removing the 'upper()':
> > real  0m5.268s
> > user  0m4.474s
> > sys   0m0.715s
> >
> > Using the equivalent of a fast regular expression (no special matching
> needed):
> >
> > m = re.split(r'TGGCCC', contents)
> >
> > we get the same results and in time:
> >
> > real  0m5.118s
> > user  0m2.702s
> > sys   0m1.214s
> >
> > Now, we're looking at a large increase. But, again, a factor of about two
> or less…
>
> Is that repeatable and consistent? I'm surprised at those numbers, since
> both implementations should essentially be the same. That is, I think the re
> module looks at the input string, realizes its a constant string, and uses
> the same algorithm as string.find.
>
> > Can you think of a better example than this? Something more 'wow'..
>
> Staying in bioinformatics, try parsing fields from a BLAST output file. For
> something more general, look in the Python standard library for instances of
> re.compile. Here's a definition from decimal.py:
>
> _parser = re.compile(r"""        # A numeric string consists of:
> #    \s*
>    (?P<sign>[-+])?              # an optional sign, followed by either...
>    (
>        (?=\d|\.\d)              # ...a number (with at least one digit)
>        (?P<int>\d*)             # having a (possibly empty) integer part
>        (\.(?P<frac>\d*))?       # followed by an optional fractional part
>        (E(?P<exp>[-+]?\d+))?    # followed by an optional exponent, or...
>    |
>        Inf(inity)?              # ...an infinity, or...
>    |
>        (?P<signal>s)?           # ...an (optionally signaling)
>        NaN                      # NaN
>        (?P<diag>\d*)            # with (possibly empty) diagnostic info.
>    )
> #    \s*
>    \Z
> """, re.VERBOSE | re.IGNORECASE | re.UNICODE).match
>
>
> Try doing that without a regular expression.
>
>
>                                Andrew
>                                dalke at dalkescientific.com
>
>
> _______________________________________________
> Baypiggies mailing list
> Baypiggies at python.org
> To change your subscription options or unsubscribe:
> http://mail.python.org/mailman/listinfo/baypiggies
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/baypiggies/attachments/20100224/10202844/attachment-0001.html>


More information about the Baypiggies mailing list