Fastest way to detect a non-ASCII character in a list of strings.
Steven D'Aprano
steve-REMOVE-THIS at cybersource.com.au
Sun Oct 17 22:47:09 EDT 2010
On Mon, 18 Oct 2010 01:04:09 +0100, Rhodri James wrote:
> On Sun, 17 Oct 2010 20:59:22 +0100, Dun Peal <dunpealer at gmail.com>
> wrote:
>
>> `all_ascii(L)` is a function that accepts a list of strings L, and
>> returns True if all of those strings contain only ASCII chars, False
>> otherwise.
>>
>> What's the fastest way to implement `all_ascii(L)`?
>>
>> My ideas so far are:
>>
>> 1. Match against a regexp with a character range: `[ -~]`
>> 2. Use s.decode('ascii')
>> 3. `return all(31< ord(c) < 127 for s in L for c in s)`
>
> Don't call it "all_ascii" when you don't mean that; all_printable would
> be more accurate,
Neither is accurate. all_ascii would be:
    all(ord(c) <= 127 for string in L for c in string)
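As a rough sketch, that one-liner can be wrapped in a function like this. The parameter name `strings` is my own choice, and the code assumes Python 3 text strings; the point to notice is that the loop over the list has to come before the loop over the characters.

```python
def all_ascii(strings):
    """Return True if every character of every string has a code point <= 127."""
    # Generator-expression clauses read left to right like nested for-loops:
    # the outer loop (over the list) comes first, the inner loop (over the
    # characters of each string) second.
    return all(ord(c) <= 127 for s in strings for c in s)


print(all_ascii(["hello", "world"]))      # True
print(all_ascii(["hello", "caf\u00e9"]))  # False: U+00E9 is above 127
```

Note that this accepts control characters such as "\x00" and "\x7f", which are ASCII but not printable.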
all_printable would be considerably harder. As far as I can tell, there's
no simple way to tell if a character is printable. You can look at the
Unicode category, given by unicodedata.category(c), and then decide
whether or not it is printable.
(Note though that printable characters will not necessarily print, since
the latter relies on there being a glyph available to print. Not all fonts
include glyphs for all printable characters.)
It might be easier to just ignore control characters, and assume
everything else is printable:
    import unicodedata
    all(unicodedata.category(c) != 'Cc' for string in L for c in string)
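A minimal sketch of that check as a function, assuming Python 3 strings. The name `no_control_chars` is hypothetical, and treating everything outside category 'Cc' as printable is the simplification described above, not a real definition of printability.

```python
import unicodedata


def no_control_chars(strings):
    """Return True if no character in any string is a control character."""
    # 'Cc' is the Unicode general category for control characters, which
    # covers U+0000-U+001F and U+007F-U+009F (so tab and newline count too).
    return all(unicodedata.category(c) != 'Cc' for s in strings for c in s)


print(no_control_chars(["abc", "caf\u00e9"]))  # True
print(no_control_chars(["line one\nline two"]))  # False: "\n" is category Cc
```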
If you limit yourself to bytes instead of strings, it's easier:
    import string
    all(c in string.printable for s in L for c in s)
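The expression above works on Python 2 byte strings. On Python 3, iterating over a bytes object yields integers, so a sketch along the following lines might be used instead. The names `PRINTABLE_BYTES` and `all_printable_bytes` are made up for illustration. Keep in mind that string.printable also includes whitespace such as tab and newline.

```python
import string

# string.printable is pure ASCII, so it encodes cleanly; the frozenset holds
# the integer byte values, which is what iterating over bytes produces.
PRINTABLE_BYTES = frozenset(string.printable.encode('ascii'))


def all_printable_bytes(chunks):
    """Return True if every byte in every bytes object is in string.printable."""
    return all(b in PRINTABLE_BYTES for chunk in chunks for b in chunk)


print(all_printable_bytes([b"hello\n", b"world"]))  # True
print(all_printable_bytes([b"caf\xc3\xa9"]))        # False: bytes above 127
```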
As for what is faster, that's what timeit and the profiler are for:
timeit to find out which is faster, and the profiler to find out whether
it's worth spending the time to find out which is faster.
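For what it's worth, a timing comparison might be set up roughly like this. Everything here is an assumption for illustration: the sample data is made up, the function names are mine, and the candidates are Python 3 adaptations of the ideas in the original question (s.encode('ascii') stands in for the Python 2 s.decode('ascii'), and the regexp matches any character outside the ASCII range rather than the printable range). The numbers will depend heavily on your data and your Python version.

```python
import re
import timeit

# Made-up sample data: many ASCII strings and one non-ASCII string at the end.
DATA = ["hello world"] * 1000 + ["caf\u00e9"]

NON_ASCII = re.compile(r'[^\x00-\x7f]')


def by_ord(strings):
    return all(ord(c) <= 127 for s in strings for c in s)


def by_regex(strings):
    return not any(NON_ASCII.search(s) for s in strings)


def by_encode(strings):
    try:
        for s in strings:
            s.encode('ascii')
    except UnicodeEncodeError:
        return False
    return True


for func in (by_ord, by_regex, by_encode):
    # timeit.timeit accepts a callable and returns the total seconds taken
    # for `number` calls.
    elapsed = timeit.timeit(lambda: func(DATA), number=20)
    print(func.__name__, elapsed)
```

Running the same candidates under cProfile inside the real program would then tell you whether this check is a large enough share of the total run time to matter.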
--
Steven