Efficient, built-in way to determine if string has non-ASCII chars outside ASCII 32-127, CRLF, Tab?
Peter Otten
__peter__ at web.de
Tue Nov 1 05:21:14 EDT 2011
Steven D'Aprano wrote:
> On Mon, 31 Oct 2011 22:12:26 -0400, Dave Angel wrote:
>
>> I would claim that a well-written (in C) translate function, without
>> using the delete option, should be much quicker than any python loop,
>> even if it does copy the data.
>
> I think you are selling short the speed of the Python interpreter. Even
> for short strings, it's faster to iterate over a string in Python 3 than
> to copy it with translate:
>
> >>> from timeit import Timer
> >>> t1 = Timer('for c in text: pass', 'text = "abcd"')
> >>> t2 = Timer('text.translate(mapping)',
> ...            'text = "abcd"; mapping = "".maketrans("", "")')
> >>> min(t1.repeat())
> 0.450606107711792
> >>> min(t2.repeat())
> 0.9279451370239258
Lies, damn lies, and benchmarks ;)
Copying is fast:
>>> Timer("text + 'x'", "text='abcde '*10**6").timeit(100)
1.819761037826538
>>> Timer("for c in text: pass", "text='abcde '*10**6").timeit(100)
18.89239192008972
The problem with str.translate() (unicode.translate() in 2.x) is that it
needs a dictionary lookup for every character. However, if, like the OP, you
are going to read data from a file to check whether it's (a subset of)
ASCII, there's no point in converting to a string. For bytes, where a
lookup table indexed by the byte value can be used, the numbers look
quite different:
>>> t1 = Timer("for c in text: pass", "text = b'abcd '*10**6")
>>> t1.timeit(100)
15.818882942199707
>>> t2 = Timer("text.translate(mapping)",
...            "text = b'abcd '*10**6; mapping = b''.maketrans(b'', b'')")
>>> t2.timeit(100)
2.821769952774048
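
Applied to the OP's question, the bytes approach might look like the sketch below. It is only an illustration, not something timed in this thread: the names ALLOWED and has_disallowed_bytes are made up for the example, and the allowed set (ASCII 32-127 plus CR, LF and tab) is taken from the subject line. It uses the delete argument of bytes.translate() with a table of None, so no mapping is built at all.

```python
# Hypothetical names, chosen for this example only.
# Allowed set as given in the subject line: ASCII 32-127 plus CR, LF and tab.
ALLOWED = bytes(range(32, 128)) + b"\r\n\t"


def has_disallowed_bytes(data: bytes) -> bool:
    """Return True if data contains any byte outside ALLOWED."""
    # With table=None, translate() performs no mapping and only deletes the
    # bytes listed in the second argument. Anything left over is disallowed.
    return bool(data.translate(None, ALLOWED))


if __name__ == "__main__":
    print(has_disallowed_bytes(b"plain text\r\n\twith a tab"))
    print(has_disallowed_bytes("caf\u00e9".encode("utf-8")))
```

Like translate() without the delete option, this still copies the data it keeps, so for very large files it may be worth reading and checking in chunks.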