Fastest way to detect a non-ASCII character in a list of strings.
Stefan Behnel
stefan_ml at behnel.de
Mon Oct 18 02:41:55 EDT 2010
Dun Peal, 17.10.2010 21:59:
> `all_ascii(L)` is a function that accepts a list of strings L, and
> returns True if all of those strings contain only ASCII chars, False
> otherwise.
>
> What's the fastest way to implement `all_ascii(L)`?
>
> My ideas so far are:
>
> 1. Match against a regexp with a character range: `[ -~]`
> 2. Use s.decode('ascii')
> 3. `return all(31< ord(c)< 127 for s in L for c in s)`
>
> Any other ideas? Which one do you think will be fastest?
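For reference, the three ideas in the quoted question can be sketched in plain Python as below. This is only a baseline under assumptions: it targets Python 3 text strings (so idea 2 becomes s.encode('ascii') rather than s.decode('ascii')), and the function names are made up for illustration. Note that the three are not equivalent: ideas 1 and 3 accept only the printable range 32..126, while idea 2 accepts all of ASCII (0..127), including control characters.

```python
import re

# Idea 1: regexp with the character range [ -~] (printable ASCII, 32..126).
_PRINTABLE = re.compile(r'[ -~]*')

def all_ascii_regex(strings):
    # fullmatch() succeeds only if the whole string lies in the range.
    return all(_PRINTABLE.fullmatch(s) is not None for s in strings)

# Idea 2: let the codec do the work. Accepts every code point 0..127.
def all_ascii_encode(strings):
    try:
        for s in strings:
            s.encode('ascii')
    except UnicodeEncodeError:
        return False
    return True

# Idea 3: the generator expression from the question, unchanged.
def all_ascii_ord(strings):
    return all(31 < ord(c) < 127 for s in strings for c in s)
```

Which of these is fastest depends on the interpreter version and on the data, so it is worth timing them on a representative list rather than guessing.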
You can't beat Cython for this kind of task. If it's really a list of
(unicode) strings, you can do this:
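    def only_allowed_characters(list strings):
        cdef unicode s
        # Typing s as unicode lets Cython turn the inner loop into a C loop.
        for s in strings:
            for c in s:
                # Allowed range is 32..126, as in "31 < ord(c) < 127" above.
                if c < 32 or c > 126:
                    return False
        return True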
Or, a bit shorter, using Cython 0.13:
    def only_allowed_characters(list strings):
        cdef unicode s
        # True only if no character falls outside 32..126.
        return not any((c < 32 or c > 126)
                       for s in strings for c in s)
Both are untested. Basically the same should work for byte strings. You can
also support both string types efficiently with an isinstance() type test
inside of the outer loop.
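The isinstance() dispatch could look roughly like the following. It is a plain Python 3 sketch, not the Cython version, and the function name is hypothetical. It assumes the allowed range is 32..126, as in the quoted question, and relies on the fact that iterating over bytes in Python 3 yields integers while iterating over str yields one-character strings.

```python
def only_allowed_characters_mixed(strings):
    """Return True if every str/bytes item contains only values 32..126."""
    for s in strings:
        if isinstance(s, bytes):
            # Byte strings: each element is already an int in Python 3.
            for b in s:
                if b < 32 or b > 126:
                    return False
        elif isinstance(s, str):
            # Text strings: compare the code point of each character.
            for ch in s:
                code = ord(ch)
                if code < 32 or code > 126:
                    return False
        else:
            raise TypeError("expected str or bytes, got %r" % type(s))
    return True
```

The type test runs once per string rather than once per character, so its cost is small next to the inner loops.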
Also see here:
http://behnel.de/cgi-bin/weblog_basic/index.php?p=49
http://docs.cython.org/src/tutorial/strings.html
Stefan