function to remove and punctuation
Thomas 'PointedEars' Lahn
PointedEars at web.de
Sun Apr 10 10:23:12 EDT 2016
Peter Otten wrote:
> geshdus at gmail.com wrote:
>> how to write a function taking a string parameter, which returns it after
>> you delete the spaces, punctuation marks, accented characters in python ?
>
> Looks like you want to remove more characters than you want to keep. In
> this case I'd decide what characters too keep first, e. g. (assuming
> Python 3)
However, with *that* approach (which is different from the OP’s request),
regular expression matching might turn out to be more efficient:
-----------------------------------------------------------
import re
print("".join(re.findall(r'[a-z]+', "...", re.IGNORECASE)))
-----------------------------------------------------------
With the OP’s original request, they may still be the better approach.
For example:
----------------------------------------------------------------------
import re
print("".join(re.sub(r'[\s,;.?!ÀÁÈÉÌÍÒÓÙÚÝ]+', "", "...",
flags=re.IGNORECASE)))
----------------------------------------------------------------------
or
----------------------------------------------------------------------
import re
print("".join(re.findall(r'[^\s,;.?!ÀÁÈÉÌÍÒÓÙÚÝ]+', "", "...",
flags=re.IGNORECASE)))
----------------------------------------------------------------------
>>>> import string
>>>> keep = string.ascii_letters + string.digits
>>>> keep
> 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
>
> Now you can iterate over the characters and check if you want to preserve
> it for each of them:
The good thing about this part of the approach you suggested is that you can
build regular expressions from strings, too:
keep = '[' + 'a-z' + r'\d' + ']'
>>>> def clean(s, keep):
> ... return "".join(c for c in s if c in keep)
> ...
Why would one prefer this over "".filter(lambda: c in keep, s)?
>>>> clean("<alpha> äöü ::42", keep)
> 'alpha42'
>>>> clean("<alpha> äöü ::42", string.ascii_letters)
> 'alpha'
>
> If you are dealing with a lot of text you can make this a bit more
> efficient with the str.translate() method. Create a mapping that maps all
> characters that you want to keep to themselves
>
>>>> m = str.maketrans(keep, keep)
>>>> m[ord("a")]
> 97
>>>> m[ord(">")]
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> KeyError: 62
>
> and all characters that you want to discard to None
Why would creating a *larger* list for *more* operations be *more*
efficient?
--
PointedEars
Twitter: @PointedEars2
Please do not cc me. / Bitte keine Kopien per E-Mail.
More information about the Python-list
mailing list