[New-bugs-announce] [issue1970] Speedup unicode whitespace and linebreak detection

Wed Jan 30 01:44:35 CET 2008

New submission from Antoine Pitrou:

Currently, the PyUnicode type uses a function call and several lookups
per character to detect whitespace and linebreaks. This considerably
slows down the split(), rsplit() and splitlines() methods. Since the
overwhelming majority of whitespace and linebreak characters are ASCII,
it makes sense to have a fast lookup table for the common case. Patch
attached (it also includes another tiny change that helps the compiler
optimize split/rsplit here).

(This may also help other methods like strip() a bit, but in that case
the impact of whitespace detection is probably negligible.)

Some numbers:

# With patch
$ ./python -m timeit -s "s=open('LICENSE', 'r').read()" "s.splitlines()"
10000 loops, best of 3: 127 usec per loop
$ ./python -m timeit -s "s=open('LICENSE', 'r').read()" "s.split()"
1000 loops, best of 3: 457 usec per loop

# Without patch
$ ./python-orig -m timeit -s "s=open('LICENSE', 'r').read()" "s.splitlines()"
10000 loops, best of 3: 175 usec per loop
$ ./python-orig -m timeit -s "s=open('LICENSE', 'r').read()" "s.split()"
1000 loops, best of 3: 571 usec per loop
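
For reference, the relative gain can be worked out from the timings
above. This is plain arithmetic on the four numbers quoted, not a new
measurement; the dictionary layout and the helper name are just for
this example.

```python
# Timings in microseconds per loop, copied from the runs above.
timings = {
    "splitlines": {"patched": 127, "original": 175},
    "split": {"patched": 457, "original": 571},
}


def summarize(patched, original):
    """Return (speedup factor, percent time saved) for one benchmark."""
    speedup = original / patched
    saved_pct = (1 - patched / original) * 100
    return speedup, saved_pct


for name, t in timings.items():
    speedup, saved_pct = summarize(t["patched"], t["original"])
    print(f"{name}: {speedup:.2f}x faster, {saved_pct:.1f}% less time")
```

That comes out to roughly 27% less time for splitlines() and roughly
20% less for split() on this input.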

----------
components: Interpreter Core
files: unispace.patch
messages: 61837
nosy: pitrou
severity: normal
status: open
title: Speedup unicode whitespace and linebreak detection
versions: Python 3.0
Added file: http://bugs.python.org/file9321/unispace.patch

__________________________________
Tracker <report at bugs.python.org>
<http://bugs.python.org/issue1970>
__________________________________