file_name_fixer.py

Wed Jan 25 04:33:23 EST 2006

Steven D'Aprano wrote:

> > or you can use a more well-suited function:
> >
> >     # replace runs of _ and . with a single character
> >     newname = re.sub("_+", "_", newname)
> >     newname = re.sub("\.+", ".", newname)
>
> You know, I really must sit down and learn how to use
> reg exes one of these days. But somehow, every time I
> try, I get the feeling that the work required to learn
> to use them effectively is infinitely greater than the
> work required to re-invent the wheel every time.

here's all you need to understand the code above:

    . ^ $ * + ? ( ) [ ] { } | \ are reserved characters
    all other characters match themselves
    reserved characters must be escaped to match themselves;
        to match a dot, use \\. in the string literal (which the RE
        engine sees as \.)
    + means match one or more of the preceding item
        so _+ matches one or more underscores, and \.+ matches
        one or more dots
    re.sub(pattern, replacement, text) replaces all matches for
        the given pattern in text with the given replacement string

so re.sub("_+", "_", newname) replaces runs of underscores with
a single underscore.
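
a minimal sketch of those two calls in action. the sample filename is made up for illustration, and the dot pattern is written with a doubled backslash so the RE engine receives \.+ :

```python
import re

# hypothetical filename, invented for this example
newname = "my__file...name.txt"

# collapse each run of underscores into a single underscore
newname = re.sub("_+", "_", newname)

# collapse each run of dots into a single dot;
# "\\.+" in the literal reaches the RE engine as \.+
newname = re.sub("\\.+", ".", newname)

print(newname)  # my_file.name.txt
```

note that the two patterns are independent, so a mixed run such as "_." would be left alone by both calls.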

> > or, slightly more obscure:
> >
> >     newname = re.sub("([_.])\\1+", "\\1", newname)
>
> _Slightly_?

this introduces three new concepts:

    [ ] defines a set of characters
        so [_.] will match either _ or .
    ( ) defines a group of matched characters.
    \\1 (which the RE engine sees as \1) refers to the first group
        this can be used both in the pattern and in the replacement
        string

so re.sub("([_.])\\1+", "\\1", newname) replaces runs consisting
of either a . or an _ followed by one or more copies of itself, with
a single instance of itself.
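
a small sketch of the backreference version, again with an invented filename. it assumes nothing beyond the standard re module:

```python
import re

# hypothetical filename, invented for this example
newname = "a__b..c_.d"

# ([_.]) captures one underscore or dot; \1+ then requires one or
# more copies of that same character, so only uniform runs match
newname = re.sub("([_.])\\1+", "\\1", newname)

print(newname)  # a_b.c_.d
```

the "_." in the middle survives because the dot is not a copy of the captured underscore, which is how this pattern differs from a plain [_.]+ .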

(using r-strings lets you remove some of the extra backslashes, btw)
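
for instance, a quick sketch showing that the raw-string spelling is the same pattern with fewer backslashes (the filename is hypothetical):

```python
import re

# a raw string keeps backslashes as-is, so these pairs are equal
assert r"\1" == "\\1"
assert r"([_.])\1+" == "([_.])\\1+"

name = "report__final..txt"  # hypothetical filename

plain = re.sub("([_.])\\1+", "\\1", name)  # doubled backslashes
raw = re.sub(r"([_.])\1+", r"\1", name)    # raw strings

print(plain, raw)  # report_final.txt report_final.txt
```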

</F>