Faster 'if char in string' test
Peter Otten
__peter__ at web.de
Fri Jun 25 04:29:43 EDT 2004
Kamilche wrote:
> Heheh, I'm a Python newbie. I can't even understand the code you have
> written, but I'll try.
If I don't understand Python code, I tend to blame the code. I'll try to
either explain or rewrite the "dark" pieces below.
> You know, as I was writing stuff like fn(**args.__dict__) yesterday
> (which for me is pretty advanced!), it occurred to me that Python is
> not 'simpler than C', like I had originally thought. :-D It's better
> in many ways, but it has its dark alleyways where a novice can get
> stabbed in the back, too.
Python and C are both simple. Python has friendlier error messages and less
bloat, i. e. manual garbage collection in programs that actually work.
Recent additions to Python and the newstyle/classic class schism, while all
reasonable in themselves, make it harder to "know it all". If decorators
take off, this will become even worse.
Enough pessimism - Python is still much more compact than C++, and if anyone
has the power to say no, then it's Guido van Rossum. If a way to integrate
type inferencing smoothly is actually found, why should anyone want to
program in C at all?
Peter
Now the revised code:
import itertools, re, string
_validchars = string.ascii_letters + string.digits + \
"!@#$%^&*()`~-_=+[{]}\\|;:\'\",<.>/?\t "
_allchars = string.maketrans("", "")
_invalidchars = _allchars.translate(_allchars, _validchars)
# repeat(v) gives an "endless" sequence of v, so
# zip("abc", repeat(None)) == [("a", None), ("b", None), ("c", None)]
# In Python 2.4 I would use a set but in 2.3.x dicts are faster.
# _invalidset = set(_invalidcars)
_invaliddict = dict(zip(_invalidchars, itertools.repeat(None)))
# contains only valid characters
valid = "This is a string to test for invalid characters."
# contains both valid and invalid characters
invalid = valid + _invalidchars + valid
# A LONG string containing MANY invalid characters to make
# the translate() solution look bad
massaged = (_invalidchars + valid) * 100
# supposing _invalidchars == "abc" we get the regex re.compile("[abc]").
# rInvalid.search(s) would then return a non-None result whenever s
# contains an a, b, or c. The nice thing for our purpose is that it stops
# at the first match, i. e. if the first char is invalid, we don't care
# about the rest
rInvalid = re.compile("[%s]" % re.escape(_invalidchars))
def validate_list(s, invalid=_invalidchars):
# since the set of invalid characters is used many times,
# it is faster to have it as a local variable.
# I don't know why, but achieving this via a default parameter
# is even faster than a
# invalid = _invalidchars
# assignment at the beginning of the function
for c in s:
if c in invalid:
return False
return True
def validate_dict(s, invalid=_invaliddict):
# exactly the same as validate_list, but the
# c in invalid
# is faster because invalid is a dictionary
for c in s:
if c in invalid:
return False
return True
def validate_translate(s):
return len(s.translate(_allchars, _invalidchars)) == len(s)
def validate_regex(s):
return not rInvalid.search(s)
def validate_noop(s, valid=valid):
""" Not a valid implementation. Only to estimate the function
call overhead.
"""
return s is valid
def measure(func, data):
""" Run the speed test for a func/data pair and print the result.
func: name of the function to be tested
data: name of the variable containing the sample data
"""
print " ", data,
setup = "from findcharspeed import %s, %s" % (func, data)
run = "%s(%s)" % (func, data)
# example of the above:
# from findcharspeed import validate_list, valid
# validate_list(valid) # this statement will be timed
# timeit.main() expects parameters from the commandline, so we have
# to add the appropriate switches when invoking it programmatically.
timeit.main(["-s" + setup, run])
if __name__ == "__main__":
import timeit
# functions I want to time
tests = [validate_list, validate_dict, validate_translate,
validate_regex, validate_noop]
# make sure the functions return the correct result
for f in tests:
assert f(valid)
assert not f(invalid)
# calculate the timings
for f in tests:
print f.__name__
# test each candidate function with 3 different sample strings
for data in ["valid", "invalid", "massaged"]:
measure(f.__name__, data)
More information about the Python-list
mailing list