[Tutor] clean text

spir denis.spir at free.fr
Tue May 19 20:22:17 CEST 2009


On Tue, 19 May 2009 10:49:15 -0700,
Emile van Sebille <emile at fenx.com> wrote:

> On 5/19/2009 10:19 AM spir said...
> > On Tue, 19 May 2009 11:36:17 +0200,
> > spir <denis.spir at free.fr> wrote:
> > 
> > [...]
> > 
> > Thank you Albert, Kent, Sanders, Lie, Malcolm.
> > 
> > This time regex wins! I thought it wouldn't because of the additional
> > function call (too bad we cannot pass a mapping to re.sub). Actually the
> > difference is very small ;-) The relevant change is indeed using a dict.
> > Replacing string concatenation with ''.join() is slower (tested with
> > strings 10 and 100 times bigger too). Strange... A membership test in a
> > set is only very slightly faster than in dict keys.
> 
> Hmm... this seems faster assuming it does the same thing...
> 
> xlate = dict( (chr(c),chr(c)) for c in range(256))
> xlate.update(control_char_map)
> 
> def cleanRepr5(text):
>      return "".join([ xlate[c] for c in text ])
> 
> 
> Emile

Thank you, Emile.
I thought of this solution (a dict covering all chars), but I cannot use it because I will later extend the app to cope with Unicode (~ 100_000 chars). So I really need to filter which chars have to be converted.
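A minimal sketch of that filtering idea, assuming a small hypothetical control_char_map (the real map from earlier in the thread is not reproduced here). Only the chars that need converting are listed, and everything else passes through unchanged, so the dict does not grow with the size of the character set:

```python
# Hypothetical mapping for illustration: only chars that must be converted.
control_char_map = {"\n": "\\n", "\t": "\\t", "\r": "\\r"}

def clean_repr(text):
    # dict.get(c, c) returns the replacement when c is in the map,
    # and c itself otherwise, so no entry is needed for ordinary chars.
    return "".join(control_char_map.get(c, c) for c in text)

print(clean_repr("a\tb\n\u00e9"))  # the tab and newline come out as \t and \n
```

The name clean_repr is made up for this example; whether it beats the regex version would have to be timed.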
What would help, I guess, is a builtin function that returns the conventional char/string repr without the surrounding quotes ('...').
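Lacking such a builtin, one possible workaround is to slice the quotes off the result of repr(). This is only a sketch; bare_repr is a hypothetical helper name, not an existing function:

```python
def bare_repr(s):
    # repr() of a str starts and ends with exactly one quote character,
    # so slicing off the first and last char leaves the escaped content.
    return repr(s)[1:-1]

print(bare_repr("a\tb"))  # a\tb
```

Note that repr() may also escape a quote character inside the string, depending on which quotes it contains, so the result is not always what one would write by hand.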

Denis

PS
By the way, you no longer need to build a list comprehension when the outer function simply walks through the sequence:
   "".join( xlate[c] for c in text )
is a shortcut for
   "".join( (xlate[c] for c in text) )
[A generator expression that is the sole argument of a call needs no additional parentheses; they are required only when there are other arguments. See PEP 289 http://www.python.org/dev/peps/pep-0289/]
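A small sketch of the three spellings, plus the case where the extra parentheses are required. The xlate dict here is a tiny stand-in for illustration, not the 256-entry one from Emile's message:

```python
xlate = {"a": "A", "b": "B"}  # stand-in mapping for illustration
text = "abba"

with_list = "".join([xlate[ch] for ch in text])   # list comprehension
bare_gen = "".join(xlate[ch] for ch in text)      # generator, no extra parens
paren_gen = "".join((xlate[ch] for ch in text))   # same, explicit parens

# With a second argument the generator expression must be parenthesized.
total = sum((n for n in range(4)), 10)

print(with_list, bare_gen, paren_gen, total)  # ABBA ABBA ABBA 16
```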
------
la vita e estrany

