[Tutor] German Umlaut [string formatting with dictionaries]

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Thu, 4 Apr 2002 14:07:38 -0800 (PST)


> That's better! Here's my latest regex:
>
> reg = re.compile(r"\b[%s]\w+-?(?:,\s\b[%s]\w+)*\sund\s\b[%s]\w+"
>
> %(string.uppercase,string.uppercase,string.uppercase),re.UNICODE)
>
>
> It works (matches also "Berlin, London,Wien  und Paris"),
> but I don't like it because of
> %(string.uppercase,string.uppercase,string.uppercase).

The above uses string formatting with a tuple, and it is a bit wordy
because tuple elements are interpolated by position: the first '%s' gets
replaced by the first element in the tuple, the second '%s' by the
second element, and so on.
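
Here is a minimal sketch of that positional behaviour.  The city names and
the short 'letters' string are only illustrative stand-ins, not part of the
original regex:

```python
# Positional formatting: each %s consumes the next tuple element, in order.
template = "%s, %s und %s"
cities = ("Berlin", "Wien", "Paris")    # illustrative values
sentence = template % cities
print(sentence)

# Using one value twice means listing it twice in the tuple.
letters = "ABC"                         # stand-in for string.uppercase
pattern = "[%s]x[%s]" % (letters, letters)
print(pattern)
```

The second half shows where the wordiness comes from: every extra '%s'
needs its own entry in the tuple, even when the value is the same.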


If we use a dictionary approach, we can make things look a little nicer:

###
import re, string

reg = re.compile(r"""\b[%(CAPITALS)s]\w+           ## A capitalized word
                     -?                            ## followed by an
                                                   ## optional hyphen.
                     (?:,\s\b[%(CAPITALS)s]\w+     ## Don't make a group,
                                                   ## but match as many
                                                   ## capitalized words
                                                   ## separated by
                                                   ## commas,
                     )*

                     \s und \s                     ## followed by "und"

                     \b[%(CAPITALS)s]\w+"""        ## and the last
                                                   ## capitalized word
                     % { 'CAPITALS' : string.uppercase },
                 re.VERBOSE | re.UNICODE)
###

That is, every time Python sees '%(CAPITALS)s', it'll look up the
'CAPITALS' key in our dictionary and substitute its value.  This approach
isn't positional, since each placeholder names the key it wants, so we
can avoid repeating 'string.uppercase' so many times.
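
To see that the two styles build the same pattern, here is a small sketch
using a shortened regex.  It assumes a current Python, where
'string.uppercase' no longer exists, so 'string.ascii_uppercase' stands in
for it (which means it covers A-Z only, not umlauts):

```python
import re
import string

# Assumption: string.ascii_uppercase stands in for the post's string.uppercase,
# which exists only in Python 2.
CAPITALS = string.ascii_uppercase

# Positional version: the same value is listed once per '%s'.
by_position = r"\b[%s]\w+\sund\s\b[%s]\w+" % (CAPITALS, CAPITALS)

# Dictionary version: the value is listed once and looked up by key.
by_name = r"\b[%(CAPITALS)s]\w+\sund\s\b[%(CAPITALS)s]\w+" % {'CAPITALS': CAPITALS}

print(by_position == by_name)                            # True
print(re.search(by_name, "Berlin und Paris").group())    # Berlin und Paris
```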


Also, I've broken the regular expression across several lines and added
comments to explain what the line noise means.  *grin* We can do this as
long as we tell re.compile() that we're being "verbose" by passing
re.VERBOSE.  In a verbose regular expression, whitespace and '#' comments
are there for humans to read: the regex engine ignores them when it
compiles the pattern, which is why the spaces we really want to match are
still written as '\s'.
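
As a rough illustration of what verbose mode does and does not ignore, here
is a sketch with a cut-down pattern.  The '[A-Z]' class and the sample
strings are my own simplifications, not the regex from this thread:

```python
import re

# The same pattern written compactly and in verbose form.
compact = re.compile(r"\b[A-Z]\w+\sund\s\b[A-Z]\w+")

verbose = re.compile(r"""
    \b[A-Z]\w+    # a capitalized word
    \s und \s     # the plain spaces here are ignored, only \s matches whitespace
    \b[A-Z]\w+    # another capitalized word
    """, re.VERBOSE)

text = "Berlin und Paris"
print(compact.search(text).group())
print(verbose.search(text).group())

# An unescaped space is dropped in verbose mode, so this pattern is
# really "Berlinund" and no longer matches "Berlin und".
dropped = re.compile(r"Berlin und", re.VERBOSE)
print(dropped.search("Berlin und"))
print(dropped.search("Berlinund").group())
```

The last two lines are the usual gotcha: when converting a compact regex to
verbose form, any literal space has to become '\s' (or be escaped), or it
silently disappears from the pattern.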


Hope this helps!