[Baypiggies] More dramatic material?
glen at glenjarvis.com
Thu Feb 25 05:35:56 CET 2010
But so did the equivalent regular expression, yes? I compared that the
count gave the same results in both.
But, you're right. My example is poor since I said I was counting the word
'the'. As I'm sure it was obvious to you, the DNA example wasn't a real life
example either :( Who searches DNA for exact matches only instead of
BLASTing? I dunno.. maybe there's a reason for it. But, the best example for
On Wed, Feb 24, 2010 at 7:28 PM, Andrew Dalke <dalke at dalkescientific.com>wrote:
> On Feb 23, 2010, at 9:23 PM, Glen Jarvis wrote:
> > total = 0
> > for line in f.readlines():
> > total = total + line.upper().count("THE")
> I don't see that anyone else mentioned this - your code here also matches
> "there" and "other" and quite a few other words. You can't easily use count
> to count the number of "the" words with this approach, since it's hard to
> include lines like
> late Emperor, and was nicknamed 'the King of Prussia.' He is very clever
> "'The Brook,'" suggested Nicholas.
> but exclude lines like
> "There, leave Bruin alone; here's a bet on."
> Regular expressions make this rather easy to do, since you can define a
> pattern like
> "(^|\W)the(\W|$)", re.I
> to match "the" which is either at the start of the line or has a non-word
> character, followed by "the", followed by a non-word character or the end of
> Plus, performance-wise it's best to work with a block of text at a time,
> and not a line at a time. You can get some of the speed advantage by using
> the default iterator for a file instead of 'readline' -- the former reads a
> block of text at a time and breaks up the newlines manually, while the
> latter reads a character at a time, which takes a lot of system calls.
> That is, try timing it with:
> total = 0
> for line in f:
> total = total + line.upper().count("THE")
> Or, try using:
> total = 0
> while 1:
> block = f.read(10000) + f.readline()
> if not block:
> total = total + block.upper().count("THE")
> [Using a FASTA file]
> > Again, repetition gives very similar numbers. Using good old pure python
> (removing the 'upper()':
> > real 0m5.268s
> > user 0m4.474s
> > sys 0m0.715s
> > Using the equivalent of a fast regular expression (no special matching
> > m = re.split(r'TGGCCC', contents)
> > we get the same results and in time:
> > real 0m5.118s
> > user 0m2.702s
> > sys 0m1.214s
> > Now, we're looking at a large increase. But, again, a factor of about two
> or less…
> Is that repeatable and consistent? I'm surprised at those numbers, since
> both implementations should essentially be the same. That is, I think the re
> module looks at the input string, realizes its a constant string, and uses
> the same algorithm as string.find.
> > Can you think of a better example than this? Something more 'wow'..
> Staying in bioinformatics, try parsing fields from a BLAST output file. For
> something more general, look in the Python standard library for instances of
> re.compile. Here's a definition from decimal.py:
> _parser = re.compile(r""" # A numeric string consists of:
> # \s*
> (?P<sign>[-+])? # an optional sign, followed by either...
> (?=\d|\.\d) # ...a number (with at least one digit)
> (?P<int>\d*) # having a (possibly empty) integer part
> (\.(?P<frac>\d*))? # followed by an optional fractional part
> (E(?P<exp>[-+]?\d+))? # followed by an optional exponent, or...
> Inf(inity)? # ...an infinity, or...
> (?P<signal>s)? # ...an (optionally signaling)
> NaN # NaN
> (?P<diag>\d*) # with (possibly empty) diagnostic info.
> # \s*
> """, re.VERBOSE | re.IGNORECASE | re.UNICODE).match
> Try doing that without a regular expression.
> dalke at dalkescientific.com
> Baypiggies mailing list
> Baypiggies at python.org
> To change your subscription options or unsubscribe:
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the Baypiggies