[Baypiggies] More dramatic material?

Andrew Dalke dalke at dalkescientific.com
Thu Feb 25 04:28:03 CET 2010


On Feb 23, 2010, at 9:23 PM, Glen Jarvis wrote:
> total = 0
> for line in f.readlines():
>     total = total + line.upper().count("THE")

I don't see that anyone else has mentioned this: your code here also matches "there", "other", and quite a few other words. You can't easily use count() to count occurrences of the word "the" with this approach, since it's hard to include lines like

   late Emperor, and was nicknamed 'the King of Prussia.' He is very clever
and
   "'The Brook,'" suggested Nicholas.

but exclude lines like

   "There, leave Bruin alone; here's a bet on."
and
   bathe.


Regular expressions make this rather easy to do, since you can define a pattern like

   "(^|\W)the(\W|$)", re.I

to match a "the" which is preceded by either the start of the line or a non-word character, and followed by either a non-word character or the end of the line.
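To make that concrete, here's a small sketch (my own, not from the original post) applying that pattern to the four example lines quoted earlier:

```python
import re

# The pattern from above: "the" bounded by start/end of line
# or by non-word characters, case-insensitively.
word_the = re.compile(r"(^|\W)the(\W|$)", re.I)

lines = [
    "late Emperor, and was nicknamed 'the King of Prussia.' He is very clever",
    "\"'The Brook,'\" suggested Nicholas.",
    "\"There, leave Bruin alone; here's a bet on.\"",
    "bathe.",
]
for line in lines:
    print("match" if word_the.search(line) else "no match", "-", line)
```

The first two lines match; "There" and "bathe." do not, since the "the" inside them is flanked by word characters.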



Plus, performance-wise it's best to work with a block of text at a time, rather than a line at a time. You can get some of the speed advantage by using the default iterator for a file instead of 'readline' -- the former reads a block of text at a time and splits it into lines itself, while the latter reads a character at a time, which takes a lot of system calls.

That is, try timing it with:

total = 0
for line in f:
    total = total + line.upper().count("THE")

Or, try using:

total = 0
while True:
    # read a block, plus the rest of the current line, so a match
    # is never split across a block boundary
    block = f.read(10000) + f.readline()
    if not block:
        break
    total = total + block.upper().count("THE")
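The trailing readline() is what keeps a "THE" from being split across two blocks. Here's a minimal demonstration (my own helper and test data, not from the original post), using a deliberately tiny block size so a bare read() would cut a word in half:

```python
import io

def count_in_blocks(f, needle="THE", blocksize=10000):
    """Count needle occurrences, reading the file a block at a time.

    Each read() is topped up with readline() so the needle can never
    straddle a block boundary (the needle contains no newline).
    """
    total = 0
    while True:
        block = f.read(blocksize) + f.readline()
        if not block:
            break
        total += block.upper().count(needle)
    return total

# With blocksize=4, a bare read(4) would split "the" in "xx the yy";
# the readline() top-up keeps each block aligned on a line boundary.
f = io.StringIO("xx the yy\nthe zz\n")
print(count_in_blocks(f, blocksize=4))  # -> 2
```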


[Using a FASTA file]
> Again, repetition gives very similar numbers. Using good old pure python (removing the 'upper()'):
> real	0m5.268s
> user	0m4.474s
> sys	0m0.715s
> 
> Using the equivalent of a fast regular expression (no special matching needed):
> 
> m = re.split(r'TGGCCC', contents)
> 
> we get the same results and in time:
> 
> real	0m5.118s
> user	0m2.702s
> sys	0m1.214s
> 
> Now, we're looking at a large increase. But, again, a factor of about two or less…

Is that repeatable and consistent? I'm surprised at those numbers, since both implementations should be essentially the same. That is, I think the re module looks at the input string, realizes it's a constant string, and uses the same algorithm as string.find.
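One way to check repeatability is with the timeit module on synthetic data. This is my own sketch (a made-up stand-in string, not the poster's FASTA file), comparing str.count against the poster's re.split approach:

```python
import re
import timeit

# Synthetic stand-in for the FASTA contents: "TGGCCC" never occurs
# inside or across the "ACGT" filler, so both methods should find
# exactly 100 occurrences.
contents = ("ACGT" * 2500 + "TGGCCC") * 100

def plain_count():
    return contents.count("TGGCCC")

def regex_count():
    # the poster's approach: split on the pattern, count the pieces
    return len(re.split(r"TGGCCC", contents)) - 1

assert plain_count() == regex_count()
for fn in (plain_count, regex_count):
    print(fn.__name__, timeit.timeit(fn, number=50))
```

Run it a few times; if the two timings differ by much more than the run-to-run noise, something other than the search algorithm is the cause.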

> Can you think of a better example than this? Something more 'wow'..

Staying in bioinformatics, try parsing fields from a BLAST output file. For something more general, look in the Python standard library for instances of re.compile. Here's a definition from decimal.py:

_parser = re.compile(r"""        # A numeric string consists of:
#    \s*
    (?P<sign>[-+])?              # an optional sign, followed by either...
    (
        (?=\d|\.\d)              # ...a number (with at least one digit)
        (?P<int>\d*)             # having a (possibly empty) integer part
        (\.(?P<frac>\d*))?       # followed by an optional fractional part
        (E(?P<exp>[-+]?\d+))?    # followed by an optional exponent, or...
    |
        Inf(inity)?              # ...an infinity, or...
    |
        (?P<signal>s)?           # ...an (optionally signaling)
        NaN                      # NaN
        (?P<diag>\d*)            # with (possibly empty) diagnostic info.
    )
#    \s*
    \Z
""", re.VERBOSE | re.IGNORECASE | re.UNICODE).match


Try doing that without a regular expression.


				Andrew
				dalke at dalkescientific.com



