stripping unwanted chars from string
Edward Elliott
nobody at 127.0.0.1
Thu May 4 01:54:19 EDT 2006
John Machin wrote:
> [expletives deleted] and it was wrong anyway (according to your
> requirements);
> using \w would keep '_' which is *NOT* alphanumeric.
Actually the Perl is correct; the explanation was the faulty part. When in
doubt, trust the code. Plus I explicitly allowed _ further down, so the
mistake should have been fairly obvious.
> >>> alphabet = 'qwertyuiopasdfghjklzxcvbnm' # Look, Ma, no thought
> required!! Monkey see, monkey type.
I won't dignify that with a response. The code, that is; I couldn't give a
toss about the comments. If you enjoy using such verbose, error-prone
representations in your code, God help anyone maintaining it, including
you six months later. Quick, find the difference between these sets at a
glance:
'qwertyuiopasdfghjklzxcvbnm'
'abcdefghijklmnopqrstuvwxyz'
'abcdefghijklmnopprstuvwxyz'
'abcdefghijk1mnopqrstuvwxyz'
'qwertyuopasdfghjklzxcvbnm' # no fair peeking
And I won't even bring up locales.
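For anyone who would rather not squint, here is a minimal sketch of letting the standard library do the comparison. It assumes "correct" means exactly the 26 ASCII lowercase letters; the helper name alphabet_errors is made up for illustration.

```python
import string

# The five literals from above, in the same order.
candidates = [
    'qwertyuiopasdfghjklzxcvbnm',
    'abcdefghijklmnopqrstuvwxyz',
    'abcdefghijklmnopprstuvwxyz',
    'abcdefghijk1mnopqrstuvwxyz',
    'qwertyuopasdfghjklzxcvbnm',
]

# Assumption: the intended alphabet is plain ASCII a-z.
expected = set(string.ascii_lowercase)

def alphabet_errors(literal):
    """Return (missing, extra) characters relative to a-z, each sorted."""
    chars = set(literal)
    return sorted(expected - chars), sorted(chars - expected)

for literal in candidates:
    print(literal, alphabet_errors(literal))
```

Using string.ascii_lowercase directly in place of a typed literal avoids the problem altogether.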
> >>> keepchars = set(alphabet + alphabet.upper() + '1234567890-.')
> >>> fixer = lambda x: ''.join(c for c in x if c in keepchars)
Those darn monkeys, always think they're so clever! ;)
if "you can" == "you should": do(it)
else: do(not)
>> Sadly I can find no such beast. Anyone have any insight? As of now,
>> regexes look like the best solution.
>
> I'll leave it to somebody else to dredge up the standard riposte to your
> last sentence :-)
If the monstrosity above is the best you've got, regexes are clearly the
better solution. Readable trumps inscrutable any day.
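As a rough sketch of the regex route in Python (not the original Perl, which isn't reproduced here): the pattern below assumes the keep-set is ASCII letters, digits, '_', '-' and '.', and deletes everything else. The name strip_unwanted is hypothetical, and re.ASCII is a Python 3 flag that keeps \w from matching non-ASCII letters.

```python
import re

# Assumption: keep ASCII letters, digits, '_', '-' and '.'; delete the rest.
# re.ASCII restricts \w to [a-zA-Z0-9_] instead of all Unicode word chars.
_UNWANTED = re.compile(r'[^\w.-]', re.ASCII)

def strip_unwanted(name):
    """Return name with every character outside the keep-set removed."""
    return _UNWANTED.sub('', name)

print(strip_unwanted('my file (v2)!.txt'))
```

Dropping re.ASCII would make the result follow Unicode word characters instead, which may or may not be what you want for filenames.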
> One point on your requirements: replacing unwanted characters instead of
> deleting them may be better -- theoretically possible problems with
> deleting are: (1) duplicates (foo and foo_ become the same) (2) '_'
> becomes '' which is not a valid filename.
Which is why I perform checks for emptiness and uniqueness after the strip.
I decided long ago that stripping is preferable to replacement here.
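A minimal sketch of what such post-strip checks might look like; the function name check_names and the choice to raise ValueError are assumptions for illustration, not the actual code.

```python
def check_names(stripped_names):
    """Raise ValueError if any already-stripped name is empty or repeated.

    Returns the input list unchanged when every name passes.
    """
    seen = set()
    for name in stripped_names:
        if not name:
            raise ValueError("name is empty after stripping")
        if name in seen:
            raise ValueError("duplicate name after stripping: %r" % name)
        seen.add(name)
    return stripped_names

print(check_names(['foo', 'bar.txt']))
```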
> And a legibility problem: if
> you hate '_' and ' ' so much, why not change them to '-'?
_ is allowed. And I do prefer -, but not for legibility. It doesn't
require me to hit Shift.
> Oh and just in case the fix was accidentally applied to a path:
>
> keepchars.update(os.sep)
> if os.altsep: keepchars.update(os.altsep)
Nope, like I said, this is strictly a filename. Stripping out path
components is the first thing I do. But thanks for pointing out these
common pitfalls for members of our studio audience. Tell him what he's
won, Johnny! ;)
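For the studio audience, one way to drop path components before any character stripping is os.path.basename. This is only a sketch under that assumption; the wrapper name filename_only is made up.

```python
import os.path

def filename_only(path):
    """Return the final component of path, discarding any directories."""
    return os.path.basename(path)

print(filename_only('some/dir/report.txt'))
```

Note that a path ending in a separator yields an empty string, so the result still needs an emptiness check.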