Efficient, built-in way to determine if string has non-ASCII chars outside ASCII 32-127, CRLF, Tab?

Terry Reedy tjreedy at udel.edu
Tue Nov 1 19:27:44 EDT 2011


On 11/1/2011 2:56 AM, Steven D'Aprano wrote:
> On Mon, 31 Oct 2011 20:44:45 -0400, Terry Reedy wrote:
>
> [...]
>>> def is_ascii_text(text):
>>>       for c in text:
>>>           if c not in LEGAL:
>>>               return False
>>>       return True
>>
>> If text is 3.x bytes, this does not work ;-). OP did not specify bytes
>> or unicode or Python version.
>
> The OP specified *string*.

A. People sometimes use 'string' loosely, to include text stored in bytes.

B. We are solving slightly different problems. The OP specified 
terabytes of ascii text on disk that he wants to check for 
contamination. For that purpose, using 3.x, it is sensible to read the 
data in binary mode into bytes objects rather than decoding into 
unicode. (The exception would be if the text uses some 7-bit encoding 
like UTF-7. But that is relatively unlikely for disk storage.)

It is irrelevant to that specified purpose whether one calls the 
internal implementation type a 'string' or not. While my 3.2 bytes 
version was only slightly faster given data in memory, adding decoding 
time for your string version, plus any other overhead of text-mode 
reading, would make the bytes version look even better.

I am pretty sure the best disk reading speed would come from reading 
blocks of 4k*N, for some N, in binary mode. If the Python code were 
compiled (with Cython, for instance), the process might be input-bound, 
depending on the system.
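A sketch of what such chunked binary reading might look like; the block 
size, the set of legal byte values, and the function names are my own 
illustrative choices, not from the original post:

```python
# Legal bytes: printable ASCII 32-127 plus tab, LF, FF, CR.
# (Illustrative; the exact legal set is the OP's to choose.)
LEGAL_BYTES = frozenset(range(32, 128)) | {9, 10, 12, 13}

def file_is_ascii_text(path, n=16):
    """Read the file in binary blocks of 4 KiB * n and check each byte.

    Iterating a 3.x bytes object yields ints 0-255, so no decoding
    or ord() calls are needed.
    """
    block_size = 4096 * n
    with open(path, 'rb') as f:
        while True:
            block = f.read(block_size)
            if not block:
                return True
            if any(b not in LEGAL_BYTES for b in block):
                return False
```

Reading in fixed binary blocks avoids both decoding overhead and 
loading terabytes into memory at once; any() short-circuits on the 
first contaminated byte.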

>> 'c in legal' has to get hash(c) and look
>> that up in the hash table, possibly skipping around a bit. If text
>> is a byte string rather than unicode, a simple lookup 'mask[c]', where
>> mask is a 0-1 byte array, should be faster (see my other post).
>
> Oooh, clever! I like! It's not necessary to assume bytes, nor is it
> necessary to create a bitmask the size of the entire Unicode range.

You are right; I had been thinking it would be.
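For 3.x bytes, the mask-lookup idea might be sketched as follows 
(function and table names are mine, and the legal set is illustrative):

```python
# A 256-entry table: 1 for legal bytes, 0 otherwise. Iterating a
# bytes object yields ints 0-255, which index the table directly
# with no ord() call and no hashing.
MASK = bytearray(256)
for i in list(range(32, 128)) + [9, 10, 12, 13]:  # printable + tab/LF/FF/CR
    MASK[i] = 1

def is_ascii_bytes(data):
    return all(MASK[b] for b in data)
```

Because every possible byte value has a slot, no bounds check is 
needed, unlike the 128-entry string mask below.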

> Here's a version for strings (bytes or unicode in Python 2, unicode in
> Python 3):
>
> LEGAL = ''.join(chr(n) for n in range(32, 128)) + '\n\r\t\f'
> MASK = ''.join('\01' if chr(n) in LEGAL else '\0' for n in range(128))
>
> # Untested
> def is_ascii_text(text):
>      for c in text:
>          n = ord(c)
>          if n >= len(MASK) or MASK[n] == '\0': return False
>      return True

> Optimizing it is left as an exercise :)

The test for n >= len(MASK) can be accomplished with a 
try:...except IndexError around the loop construct. Then the explicit 
loop can be replaced with any(), as before.
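The restructuring suggested above might look like this sketch, built on 
Steven's MASK string:

```python
LEGAL = ''.join(chr(n) for n in range(32, 128)) + '\n\r\t\f'
MASK = ''.join('\01' if chr(n) in LEGAL else '\0' for n in range(128))

def is_ascii_text(text):
    # any() short-circuits at the first illegal character; an
    # IndexError from MASK[ord(c)] means a code point >= 128,
    # which is equally illegal.
    try:
        return not any(MASK[ord(c)] == '\0' for c in text)
    except IndexError:
        return False
```

Moving the bounds check into an exception handler costs nothing on the 
common (all-ASCII) path and only pays when contamination is found anyway.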

> I *suspect*, even with any optimizations, that this will be slower than
> the version using a set.

If you suspect that because true 'strings' need MASK[ord(c)] instead of 
MASK[c], you are right. Redoing the test runs with unicode strings 
instead of bytes, the set lookup time is about the same (9.2 seconds), 
whereas the 100 000 000 ord() calls add over 4 seconds to the MASK 
lookups, raising them to 13.1 seconds.
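The comparison can be reproduced in miniature with timeit; absolute 
numbers will of course differ from the 9.2 s / 13.1 s runs above, which 
used far larger inputs:

```python
import timeit

LEGAL_SET = frozenset(chr(n) for n in range(32, 128)) | set('\n\r\t\f')
MASK = ''.join('\01' if 32 <= n < 128 or chr(n) in '\n\r\t\f' else '\0'
               for n in range(128))

text = "The quick brown fox jumps over the lazy dog.\r\n" * 1000

# Set membership: hash each character, probe the table.
set_time = timeit.timeit(
    lambda: all(c in LEGAL_SET for c in text), number=100)
# Mask lookup: an extra ord() call per character, then a str index.
mask_time = timeit.timeit(
    lambda: all(MASK[ord(c)] != '\0' for c in text), number=100)
print(f"set: {set_time:.3f}s  mask: {mask_time:.3f}s")
```

On unicode input the ord() call is pure overhead, which is why the mask 
loses the advantage it has on bytes.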

-- 
Terry Jan Reedy



