Efficient, built-in way to determine if string has non-ASCII chars outside ASCII 32-127, CRLF, Tab?
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Tue Nov 1 02:56:51 EDT 2011
On Mon, 31 Oct 2011 20:44:45 -0400, Terry Reedy wrote:
[...]
>> def is_ascii_text(text):
>>     for c in text:
>>         if c not in LEGAL:
>>             return False
>>     return True
>
> If text is 3.x bytes, this does not work ;-). The OP did not specify
> bytes or unicode, or the Python version.
The OP specified *string*. Whether that's a byte-string in Python 2, or a
unicode string in Python 3, it will still work, so long as both `text`
and `LEGAL` are strings.
[steve at sylar ~]$ python2 -c "print 'a' in 'abc'" # both byte strings
True
[steve at sylar ~]$ python2 -c "print u'a' in u'abc'" # both unicode strings
True
[steve at sylar ~]$ python3 -c "print('a' in 'abc')" # both unicode strings
True
Mixing bytes and characters may or may not do what you expect.
[steve at sylar ~]$ python2 -c "print 'a' in u'abc'" # one of each
True
Whether that's a good thing or a bad thing is open for debate :)
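For comparison, Python 3 refuses to guess and raises TypeError instead
(the exact wording of the message may vary between versions):

>>> b'a' in 'abc'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'in <string>' requires string as left operand, not bytes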
>> Algorithmically, that's as efficient as possible:
>
> This is a bit strange since you go on to explain that it is inefficient
> -- O(n*k) where n = text length and k = legal length -- whereas below is
> O(n).
But since k is a constant, O(nk) is O(n).
When using Big Oh notation, it is acceptable to hand-wave away a lot of
detail. For example, it's common to assume that n+7, n//7 and n**7 all
take the same constant amount of time, which is patently untrue: division
and exponentiation both require more work than addition. In this case,
relative to the length of the input text, both a linear search of the
fixed-size LEGAL string and a hash lookup count as O(1) per character.
That doesn't mean they *literally* take constant time.
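If you want to convince yourself of that, here's a quick and
unscientific sketch (Python 3 syntax): with LEGAL fixed, doubling the
input should roughly double the running time, regardless of the cost
of each individual membership test:

from time import time

LEGAL = ''.join(chr(n) for n in range(32, 128)) + '\n\r\t\f'

def is_ascii_text(text):
    for c in text:
        if c not in LEGAL:
            return False
    return True

# With a fixed LEGAL, the running time should grow roughly linearly
# with the size of the input.
for size in (10**6, 2 * 10**6, 4 * 10**6):
    text = 'a' * size
    t = time()
    is_ascii_text(text)
    print(size, time() - t)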
[...]
>> Since all() is guaranteed to keep short-cut semantics, that will be as
>> fast as possible in Python,
>
> A dangerous statement to make.
I like living dangerously :)
> 'c in legal' has to get hash(c) and look
> that up in the hash table, possibly skipping around a bit if there
> are collisions. If text is a byte string rather than unicode, a
> simple lookup 'mask[c]', where mask is a 0-1 byte array, should be
> faster (see my other post).
Oooh, clever! I like! It's not necessary to assume bytes, nor is it
necessary to create a bitmask the size of the entire Unicode range.
Here's a version for strings (bytes or unicode in Python 2, unicode in
Python 3):
LEGAL = ''.join(chr(n) for n in range(32, 128)) + '\n\r\t\f'
MASK = ''.join('\01' if chr(n) in LEGAL else '\0' for n in range(128))

# Untested
def is_ascii_text(text):
    for c in text:
        n = ord(c)
        if n >= len(MASK) or MASK[n] == '\0':
            return False
    return True
Optimizing it is left as an exercise :)
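One cheap optimization of the same idea -- my own sketch, not anything
from Terry's post -- is to skip the Python-level loop entirely and let
the re module do the scan in C, with a single character class covering
the same legal set:

import re

# Matches any character outside ASCII 32-127 plus newline, carriage
# return, tab and formfeed; a match means the text is not legal.
ILLEGAL = re.compile('[^\x20-\x7f\n\r\t\f]')

def is_ascii_text_re(text):
    return ILLEGAL.search(text) is None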
I *suspect*, even with any optimizations, that this will be slower than
the version using a set.
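(By "the version using a set" I mean essentially the earlier loop with
LEGAL built as a frozenset -- reconstructed here, not copied from the
thread, and the names are mine:)

LEGAL_SET = frozenset(chr(n) for n in range(32, 128)) | frozenset('\n\r\t\f')

def is_ascii_text_set(text):
    return all(c in LEGAL_SET for c in text)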
> On my
> new Pentium Win 7 machine, it is -- by about 5%. For 100,000,000 legal
> bytes, a minimum of 8.69 versus 9.17 seconds.
On my old Linux box, I get the opposite result: set lookup is a tiny bit
faster, coincidentally also by almost 5%.
>>> from time import time
>>> legal_set = frozenset(range(32, 128))
>>> text = b'a' * 100000000
>>> t = time(); all(c in legal_set for c in text); time()-t
True
27.174341917037964
>>>
>>> legal_ray = 128 * b'\1'
>>> t = time(); all(legal_ray[c] for c in text); time()-t
True
28.39691996574402
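For anyone who wants to repeat this, bare time() calls on a busy box
are noisy; timeit is the usual tool. A minimal sketch, assuming
Python 3 (iterating over bytes yields ints), with a smaller input so
it finishes quickly:

from timeit import timeit

setup_set = "legal_set = frozenset(range(32, 128)); text = b'a' * 10**7"
setup_ray = "legal_ray = 128 * b'\\x01'; text = b'a' * 10**7"

# Each statement is run once against the setup namespace.
print(timeit("all(c in legal_set for c in text)", setup=setup_set, number=1))
print(timeit("all(legal_ray[c] for c in text)", setup=setup_ray, number=1))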
--
Steven