[Tutor] clean text

Kent Johnson kent37 at tds.net
Tue May 19 13:18:23 CEST 2009


On Tue, May 19, 2009 at 5:36 AM, spir <denis.spir at free.fr> wrote:
> Hello,
>
> This is a follow-up to the post on performance issues.
> Using a profiler, I realized that inside error message creation, most of the time was spent in a utility func used to clean up source text output.
> The issue is that when the source text holds control chars such as \n, the error message is hardly readable. My solution is to replace such chars with their repr():
>
> def _cleanRepr(text):
>        ''' text with control chars replaced by repr() equivalent '''
>        result = ""
>        for char in text:
>                n = ord(char)
>                if (n < 32) or (n > 126 and n < 160):
>                        char = repr(char)[1:-1]
>                result += char
>        return result
>
> For some reason, this func is extremely slow. While the rest of error message creation looks very complicated, this seemingly innocent function consumes > 90% of the time. The issue is that I cannot use repr(text), because repr will replace all non-ASCII characters. I need to replace only control characters.
> How else could I do that?

I would get rid of the calls to ord() and repr() to start. There is no
need for ord() at all, just compare characters directly:
if char < '\x20' or '\x7e' < char < '\xa0':
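
As a minimal sketch of that first step, here is the original loop with
the ord() call removed. The name clean_repr_direct is hypothetical; the
rest follows the function from the question unchanged.

```python
def clean_repr_direct(text):
    """Replace control chars with their repr(), comparing characters directly."""
    result = ""
    for char in text:
        # Same ranges as the ord() test: below 0x20, or 0x7f through 0x9f.
        if char < '\x20' or '\x7e' < char < '\xa0':
            char = repr(char)[1:-1]  # strip the quotes that repr() adds
        result += char
    return result

print(clean_repr_direct('Harder\ntest'))  # prints: Harder\ntest
```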

To eliminate repr() you could create a dict mapping chars to their
repr and look up in that.
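
One way the dict lookup might look inside a plain Python loop, before
any regex is involved. The names control_reprs and clean_repr_lookup
are hypothetical, and building the result with "".join() is an extra
choice made for this sketch, not something suggested above.

```python
# Build the lookup table once, at import time.
control_reprs = {}
for i in list(range(0, 32)) + list(range(127, 160)):
    c = chr(i)
    control_reprs[c] = repr(c)[1:-1]

def clean_repr_lookup(text):
    """Replace control chars using a precomputed dict instead of repr()."""
    # dict.get() falls back to the char itself when it is not a control char.
    return "".join(control_reprs.get(char, char) for char in text)
```

The table has 65 entries (32 for the low range, 33 for 0x7f through
0x9f), so the cost of calling repr() is paid once per control character
rather than once per occurrence.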

You should also look for a way to get the loop into C code. One way to
do that would be to use a regex to search and replace. The replacement
pattern in a call to re.sub() can be a function call; the function can
do the dict lookup. Here is something to try:

import re

# Create a dictionary of replacement strings
repl = {}
for i in list(range(0, 32)) + list(range(127, 160)):
    c = chr(i)
    repl[c] = repr(c)[1:-1]

def sub(m):
    ''' Helper function for re.sub(). m will be a Match object. '''
    return repl[m.group()]

# Regex to match control chars
controlsRe = re.compile(r'[\x00-\x1f\x7f-\x9f]')

def replaceControls(s):
    ''' Replace all control chars in s with their repr() '''
    return controlsRe.sub(sub, s)


for s in [
    'Simple test',
    'Harder\ntest',
    'Final\x00\x1f\x20\x7e\x7f\x9f\xa0test'
]:
    print(replaceControls(s))
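
To check whether the regex version actually helps on your data, a rough
timing comparison along these lines may be useful. This is only a
sketch: the sample string is made up, the function names are
hypothetical, and the numbers will depend on the Python version and on
how many control chars the text contains.

```python
import re
import timeit

def clean_repr_loop(text):
    """The character-by-character version from the question."""
    result = ""
    for char in text:
        n = ord(char)
        if (n < 32) or (n > 126 and n < 160):
            char = repr(char)[1:-1]
        result += char
    return result

# Lookup table of control chars and their repr(), as in the code above.
repl = {}
for i in list(range(0, 32)) + list(range(127, 160)):
    c = chr(i)
    repl[c] = repr(c)[1:-1]

controls_re = re.compile(r'[\x00-\x1f\x7f-\x9f]')

def replace_controls(s):
    """The regex version: only matched chars reach Python-level code."""
    return controls_re.sub(lambda m: repl[m.group()], s)

# Made-up sample: mostly plain text with an occasional control char.
sample = 'some source text\twith a few\ncontrol chars ' * 200

# Both versions should agree before their timings are worth comparing.
assert clean_repr_loop(sample) == replace_controls(sample)

for func in (clean_repr_loop, replace_controls):
    seconds = timeit.timeit(lambda: func(sample), number=20)
    print('%-20s %.4f s' % (func.__name__, seconds))
```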


Kent

