Efficient, built-in way to determine if string has non-ASCII chars outside ASCII 32-127, CRLF, Tab?
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Mon Oct 31 19:02:15 EDT 2011
On Mon, 31 Oct 2011 17:47:06 -0400, Dave Angel wrote:
> On 10/31/2011 03:54 PM, python at bdurham.com wrote:
>> Wondering if there's a fast/efficient built-in way to determine if a
>> string has any chars outside the range ASCII 32-127, other than CR,
>> LF, or Tab?
>>
>> I know I can look at the chars of a string individually and compare
>> them against a set of legal chars using standard Python code (and this
>> works fine), but I will be working with some very large files in the
>> hundreds of GB to several TB size range, so I thought I'd check to
>> see if there was a built-in in C that might handle this type of check
>> more efficiently.
>>
>> Does this sound like a use case for cython or pypy?
>>
>> Thanks,
>> Malcolm
>>
> How about doing a .replace() method call, with all those characters
> turning into '', and then see if there's anything left?
No offense Dave, but do you really think that making a copy of as much as
a terabyte of data is *more* efficient than merely scanning the data and
stopping on the first non-ASCII character you see?
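For the record, the replace approach amounts to something like this
sketch (is_ascii_text_replace is my name for it, and LEGAL is the string
of legal characters defined below):

def is_ascii_text_replace(text):
    # Strip out every legal character; anything left over is illegal.
    for c in LEGAL:
        text = text.replace(c, '')
    return not text  # empty result means every char was legal

That makes up to 100 successively smaller copies of the data and can
never stop early.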
There is no way of telling whether a string includes non-ASCII characters
without actually inspecting each character. So in the event that the
string *is* fully ASCII text, you have to check every character; there
can be no shortcuts. However, there is a shortcut if the string isn't
fully ASCII text: once you've found a single illegal character, you can
stop. So the absolute least amount of work you can do is:
# Define legal characters:
LEGAL = ''.join(chr(n) for n in range(32, 128)) + '\n\r\t\f'
# everybody forgets about formfeed... \f
# and are you sure you want to include chr(127) (the DEL control
# character) as a text char?
def is_ascii_text(text):
    for c in text:
        if c not in LEGAL:
            return False
    return True
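A quick sanity check (doctest-style):

>>> is_ascii_text('Hello, world!\n')
True
>>> is_ascii_text('Caf\xe9')  # chr(233) is outside the legal range
False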
Algorithmically, that's as efficient as possible: there's no faster way
of performing the test, although one implementation may be faster or
slower than another. (PyPy is likely to be faster than CPython, for
example.)
But can we get better in Python? Yes, I think so. First off, CPython is
optimized for local variable lookups over globals, and since you are
looking up the global LEGAL potentially 1000000000000 times, even a 1%
saving per lookup will help a lot. So the first step is to make a local
reference, by adding this line just above the for loop:
legal = LEGAL
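With that change, the whole function looks like this (same logic as
before, just with the local alias):

def is_ascii_text(text):
    legal = LEGAL  # bind once; local lookups are cheaper than globals
    for c in text:
        if c not in legal:
            return False
    return True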
But we can do even better still. Each time we test "c not in legal",
we do a linear search of the 100-character string. On average, that
means comparing against about 50 characters before finding a match for
a legal character, and all 100 for an illegal one. We can do better by
using a set or frozenset, which gives us approximately constant-time
lookups:
legal = frozenset(LEGAL)
Finally, we can try to do as much work as possible in fast C code and as
little as necessary in relatively slow Python:
def is_ascii_text(text):
    legal = frozenset(LEGAL)
    return all(c in legal for c in text)
Since all() is guaranteed to keep short-circuit semantics, that will be
as fast as possible in Python, and quite possibly just as fast as any C
extension you might write.
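If you want to verify that on your own machine, a rough timeit sketch
along these lines (the sample text here is made up) will show the gap
between the linear string search and the frozenset version:

import timeit

LEGAL = ''.join(chr(n) for n in range(32, 128)) + '\n\r\t\f'
LEGAL_SET = frozenset(LEGAL)
text = 'The quick brown fox jumps over the lazy dog. ' * 20000

def scan_string():
    # linear search of the 100-char LEGAL string for each character
    return all(c in LEGAL for c in text)

def scan_set():
    # approximately constant-time membership test for each character
    return all(c in LEGAL_SET for c in text)

print(timeit.timeit(scan_string, number=10))
print(timeit.timeit(scan_set, number=10))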
If that's still too slow, use smaller files or get a faster computer :)
--
Steven