stripping unwanted chars from string
John Machin
sjmachin at lexicon.net
Thu May 4 01:01:06 EDT 2006
On 4/05/2006 1:36 PM, Edward Elliott wrote:
> I'm looking for the "best" way to strip a large set of chars from a filename
> string (my definition of best usually means succinct and readable). I
> only want to allow alphanumeric chars, dashes, and periods. This is what I
> would write in **** (bless me father, for I have sinned...):
[expletives deleted] and it was wrong anyway (according to your
requirements); using \w would keep '_' which is *NOT* alphanumeric.
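To make the '_' point concrete, here is a minimal sketch, assuming Python 3 and the standard re module. The sample filename is made up for illustration. It compares a \w-based class with an explicit one:

```python
import re

name = "my_file-v2.txt"  # hypothetical sample filename

# \w matches letters, digits AND the underscore, so '_' survives.
with_w = re.sub(r"[^\w.-]", "", name)

# An explicit class keeps only ASCII letters, digits, dash and period.
explicit = re.sub(r"[^A-Za-z0-9.-]", "", name)

print(with_w)    # my_file-v2.txt
print(explicit)  # myfile-v2.txt
```

Note that in Python 3, \w applied to a str also matches non-ASCII word characters unless re.ASCII is given, so the explicit class is the closer fit to the stated requirements.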
> I could just use re.sub like the second example, but that's a bit overkill.
> I'm trying to figure out if there's a good way to do the same thing with
> string methods. string.translate seems to do what I want, the problem is
> specifying the set of chars to remove. Obviously hardcoding them all is a
> non-starter.
>
> Working with chars seems to be a bit of a pain. There's no equivalent of
> the range function, one has to do something like this:
>
>>>> [chr(x) for x in range(ord('a'), ord('z')+1)]
> ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o',
> 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
>>> alphabet = 'qwertyuiopasdfghjklzxcvbnm' # Look, Ma, no thought required!! Monkey see, monkey type.
>>> keepchars = set(alphabet + alphabet.upper() + '1234567890-.')
>>> fixer = lambda x: ''.join(c for c in x if c in keepchars)
>>> fixer('qwe!@#456.--Howzat?')
'qwe456.--Howzat'
>>>
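If typing out the keyboard rows feels error-prone, the same set can be built from the constants in the standard string module. This is only a sketch, assuming Python 3; the names KEEP_CHARS and strip_unwanted are made up for illustration:

```python
import string

# Characters allowed in the result: ASCII letters, digits, dash, period.
KEEP_CHARS = frozenset(string.ascii_letters + string.digits + "-.")

def strip_unwanted(name):
    """Return name with every character outside KEEP_CHARS deleted."""
    return "".join(c for c in name if c in KEEP_CHARS)

print(strip_unwanted("qwe!@#456.--Howzat?"))  # qwe456.--Howzat
```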
>
> Do that twice for letters, once for numbers, add in a few others, and I get
> the chars I want to keep. Then I'd invert the set and call translate.
> It's a mess and not worth the trouble. Unless there's some way to expand a
> compact representation of a char list and obtain its complement, it looks
> like I'll have to use a regex.
>
> Ideally, there would be a mythical charset module that works like this:
>
>>>> keep = charset.expand (r'\w.-') # or r'a-zA-Z0-9_.-'
Where'd that '_' come from?
>>>> toss = charset.invert (keep)
>
> Sadly I can find no such beast. Anyone have any insight? As of now,
> regexes look like the best solution.
I'll leave it to somebody else to dredge up the standard riposte to your
last sentence :-)
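For what it's worth, the "mythical charset module" in the quoted text can be approximated in a few lines. The sketch below assumes Python 3; expand and invert are hypothetical helpers named after the quoted wish, not an existing library. It also assumes the characters to toss all lie in code points 0-255, so anything above that range passes through untouched:

```python
def expand(spec):
    """Expand a compact spec such as 'a-zA-Z0-9.-' into a set of characters.

    Hypothetical helper: 'x-y' denotes an inclusive range; any other
    character (including a trailing '-') stands for itself.
    """
    chars = set()
    i = 0
    while i < len(spec):
        if i + 2 < len(spec) and spec[i + 1] == "-":
            lo, hi = ord(spec[i]), ord(spec[i + 2])
            chars.update(chr(n) for n in range(lo, hi + 1))
            i += 3
        else:
            chars.add(spec[i])
            i += 1
    return chars

def invert(chars, universe=range(256)):
    """Return every character of the universe that is not in chars."""
    return {chr(n) for n in universe} - set(chars)

keep = expand("a-zA-Z0-9.-")
toss = invert(keep)

# Three-argument str.maketrans: characters in the third string are deleted.
table = str.maketrans("", "", "".join(toss))
print("qwe!@#456.--Howzat?".translate(table))  # qwe456.--Howzat
```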
One point on your requirements: replacing unwanted characters instead of
deleting them may be better. Theoretically possible problems with
deleting are: (1) duplicates ('foo' and 'foo_' both become 'foo'); (2) a
name like '_' becomes '', which is not a valid filename. And a legibility
problem: if you hate '_' and ' ' so much, why not change them to '-'?
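A minimal sketch of the replace-instead-of-delete variant, assuming Python 3 and that a regex is acceptable after all. The function name and the default filler are made up for illustration:

```python
import re

def replace_unwanted(name, filler="-"):
    """Replace each character outside [A-Za-z0-9.-] with filler."""
    return re.sub(r"[^A-Za-z0-9.-]", filler, name)

print(replace_unwanted("foo"))      # foo
print(replace_unwanted("foo_"))     # foo-
print(replace_unwanted("_"))        # -
print(replace_unwanted("my file"))  # my-file
```

With replacement, 'foo' and 'foo_' stay distinct and '_' no longer collapses to an empty name. Collisions are still possible (for example 'foo_' and 'foo ' both map to 'foo-'), just less likely.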
Oh and just in case the fix was accidentally applied to a path:
import os
keepchars.update(os.sep)
if os.altsep: keepchars.update(os.altsep)
HTH,
John