Fastest way to detect a non-ASCII character in a list of strings.

Mon Oct 18 02:41:55 EDT 2010

Dun Peal, 17.10.2010 21:59:
> `all_ascii(L)` is a function that accepts a list of strings L, and
> returns True if all of those strings contain only ASCII chars, False
> otherwise.
>
> What's the fastest way to implement `all_ascii(L)`?
>
> My ideas so far are:
>
> 1. Match against a regexp with a character range: `[ -~]`
> 2. Use s.decode('ascii')
> 3. `return all(31 < ord(c) < 127 for s in L for c in s)`
>
> Any other ideas?  Which one do you think will be fastest?
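
For comparison, here is a rough plain-Python 3 sketch of the three ideas from the question. The function names are made up for illustration, and `s.encode('ascii')` stands in for the Python 2 `s.decode('ascii')` on the assumption that the strings are `str` objects. Note that the three are not equivalent: the regexp and the `ord()` test accept only the printable range from space to `~`, while the codec accepts everything from 0 to 127, control characters included.

```python
import re

# Hypothetical helper names; the original post only names all_ascii(L).
_PRINTABLE = re.compile(r'[ -~]*')

def all_ascii_regex(strings):
    # Idea 1: every string must consist solely of characters in ' '..'~'.
    return all(_PRINTABLE.fullmatch(s) is not None for s in strings)

def all_ascii_encode(strings):
    # Idea 2: let the ASCII codec do the check (accepts code points 0..127).
    try:
        for s in strings:
            s.encode('ascii')
    except UnicodeEncodeError:
        return False
    return True

def all_ascii_ord(strings):
    # Idea 3: the generator expression from the question, unchanged.
    return all(31 < ord(c) < 127 for s in strings for c in s)
```

Which of these is fastest depends on the data and the interpreter, so it is worth timing them on a realistic list before deciding.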

You can't beat Cython for this kind of task. If it's really a list of 
(unicode) strings, you can do this:

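     def only_allowed_characters(list strings):
         cdef unicode s
         for s in strings:
             # with s typed as unicode, Cython infers an integer
             # character type for c, so the comparisons below are cheap
             for c in s:
                 # allow only the printable range ' ' (32) to '~' (126)
                 if c < 32 or c > 126:
                     return False
         return True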

Or, a bit shorter, using Cython 0.13:

     def only_allowed_characters(list strings):
         cdef unicode s
         # True only if no character lies outside ' ' (32) to '~' (126)
         return not any((c < 32 or c > 126)
                        for s in strings for c in s)

Both are untested. Basically the same should work for byte strings. You can 
also support both string types efficiently with an isinstance() type test 
inside of the outer loop.
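
A minimal sketch of that isinstance() dispatch, written as plain Python 3 rather than Cython so that it runs anywhere. It shows only the structure, not the speed, and the name `only_allowed_mixed` is hypothetical. It assumes Python 3 semantics, where iterating over `bytes` yields integers and iterating over `str` yields one-character strings.

```python
def only_allowed_mixed(strings):
    """Return True if every item holds only characters in ' '..'~'.

    Accepts a mix of str and bytes objects (hypothetical helper).
    """
    for s in strings:
        if isinstance(s, bytes):
            # Iterating over bytes gives ints, so compare directly.
            for b in s:
                if b < 32 or b > 126:
                    return False
        elif isinstance(s, str):
            # Iterating over str gives characters, so go through ord().
            for ch in s:
                code = ord(ch)
                if code < 32 or code > 126:
                    return False
        else:
            raise TypeError("expected str or bytes, got %s" % type(s).__name__)
    return True
```

In a Cython version, each branch would presumably assign `s` to a typed local (`unicode` or `bytes`) before the inner loop so that the loop itself can be compiled to C.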

Also see here:

http://behnel.de/cgi-bin/weblog_basic/index.php?p=49
http://docs.cython.org/src/tutorial/strings.html

Stefan