[Tutor] file.read..... Abort Problem

Thu Oct 20 22:21:25 CEST 2005

Danny Yoo wrote:
> 
> On Thu, 20 Oct 2005, Tomas Markus wrote:
>>what is the most effective way to check a file for not allowed
>>characters or how to check it for allowed only characters (which might
>>be i.e. ASCII only).
> 
> 
> If the file is small enough to fit into memory, you might use regular
> expressions as a sledgehammer.  See:
> 
>     http://www.amk.ca/python/howto/regex/
> 
> for a small tutorial on regular expressions.  But unless performance is a
> real concern, doing a character-by-character scan shouldn't be too
> horrendous.

Hi Danny,

I was going to ask why you think regex is a sledgehammer for this one. Then I decided to try the two alternatives, and found out it is actually faster to scan for the individual characters one at a time than to use a regex and look for them all at once!

Here is a program that scans a string for test chars, either with a single regex search or by searching for each test char individually. The test data set doesn't include any of the test chars, so it is a worst case (neither scan terminates early):

# FindAny.py
import re, string

data = string.letters * 2500

testchars = string.digits + string.whitespace
testRe = re.compile('[' + testchars + ']')

def findRe():
    return testRe.search(data) is not None

def findScan():
    for c in testchars:
        if c in data:
            return True
    return False

and here are the results of timing calls to findRe() and findScan():

F:\Tutor>python -m timeit -s "from FindAny import findRe, findScan" "findRe()"
100 loops, best of 3: 2.29 msec per loop

F:\Tutor>python -m timeit -s "from FindAny import findRe, findScan" "findScan()"
100 loops, best of 3: 2.04 msec per loop

Surprised the heck out of me!
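
For anyone repeating the measurement on Python 3, where string.letters no longer exists, a minimal sketch of the same comparison might look like the following. The names find_re and find_scan, the optional text argument, and the use of timeit.repeat in place of the command line are illustrative choices, not part of the program above, and the numbers will differ by machine and Python version:

```python
# find_any3.py -- a Python 3 sketch of the comparison above (hypothetical file name)
import re
import string
import timeit

# Same worst case as the original: letters only, so neither search can stop early.
data = string.ascii_letters * 2500

testchars = string.digits + string.whitespace
test_re = re.compile('[' + testchars + ']')

def find_re(text=data):
    # One regex search that looks for all the test chars at once.
    return test_re.search(text) is not None

def find_scan(text=data):
    # One substring search per test char; stops at the first char found.
    return any(c in text for c in testchars)

if __name__ == '__main__':
    for func in (find_re, find_scan):
        # Best of 3 runs of 100 calls each, like "python -m timeit" reports.
        best = min(timeit.repeat(func, number=100, repeat=3))
        print('%s: %.3f msec per loop' % (func.__name__, best / 100 * 1000))
```

Both functions should agree on any input; only the time they take is in question.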

When in doubt, measure! When you think you know, measure anyway; you are probably wrong!
Kent