[Tutor] file.read..... Abort Problem

Kent Johnson kent37 at tds.net
Thu Oct 20 22:21:25 CEST 2005


Danny Yoo wrote:
> 
> On Thu, 20 Oct 2005, Tomas Markus wrote:
>>what is the most effective way to check a file for not allowed
>>characters or how to check it for allowed only characters (which might
>>be i.e. ASCII only).
> 
> 
> If the file is small enough to fit into memory, you might use regular
> expressions as a sledgehammer.  See:
> 
>     http://www.amk.ca/python/howto/regex/
> 
> for a small tutorial on regular expressions.  But unless performance is a
> real concern, doing a character-by-character scan shouldn't be too
> horrendous.

Hi Danny,

I was going to ask why you think a regex is a sledgehammer for this one, but then I decided to try the two alternatives. It turns out that scanning for the individual characters is actually faster than using a regex to look for them all at once!

Here is a program that scans a string for test chars, either using a single regex search or by individually searching for the test chars. The test data set doesn't include any of the test chars so it is a worst case (neither scan terminates early):

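# FindAny.py
import re, string

# Test data: letters only, so it contains none of the test chars (worst case).
# ascii_letters is the same 52 letters as string.letters in the default locale,
# but it does not depend on the locale.
data = string.ascii_letters * 2500

# Characters to look for: digits and whitespace
testchars = string.digits + string.whitespace
# re.escape keeps the character class valid if testchars ever gains ']', '^', '-' or '\'
testRe = re.compile('[' + re.escape(testchars) + ']')

def findRe():
    # One regex search for any of the test chars
    return testRe.search(data) is not None

def findScan():
    # One substring search per test char
    for c in testchars:
        if c in data:
            return True
    return False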


and here are the results of timing calls to findRe() and findScan():

F:\Tutor>python -m timeit -s "from FindAny import findRe, findScan" "findRe()"
100 loops, best of 3: 2.29 msec per loop

F:\Tutor>python -m timeit -s "from FindAny import findRe, findScan" "findScan()"
100 loops, best of 3: 2.04 msec per loop

Surprised the heck out of me!
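
The same comparison can also be run from inside Python with the timeit module instead of the command line. What follows is only a sketch, not the script that produced the numbers above: the names find_re, find_scan and best_time are made up for illustration, string.ascii_letters stands in for string.letters so it runs on current Pythons, and the timings it prints will differ from machine to machine.

```python
import re
import string
import timeit

# Same setup as FindAny.py; ascii_letters is an assumption made so that
# the sketch also runs on Python 3.
data = string.ascii_letters * 2500
testchars = string.digits + string.whitespace
test_re = re.compile('[' + re.escape(testchars) + ']')

def find_re():
    # Single regex search for any of the test chars
    return test_re.search(data) is not None

def find_scan():
    # One substring search per test char, stopping at the first hit
    return any(c in data for c in testchars)

def best_time(func, number=100, repeat=3):
    # Hypothetical helper: best-of-`repeat` time in seconds per call,
    # in the spirit of "100 loops, best of 3" on the command line
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number

if __name__ == '__main__':
    for func in (find_re, find_scan):
        print('%s: %.3f msec per call' % (func.__name__, best_time(func) * 1000))
```

Taking the minimum of the repeats rather than the average is the usual choice here, since the slower runs mostly reflect other activity on the machine.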

When in doubt, measure! When you think you know, measure anyway; you are probably wrong!
Kent


