[Tutor] Re: lists in re?

Wed Sep 10 00:51:59 EDT 2003

Karl's solution is really nice, shorter and more elegant than the brute 
force approach that I have devised. On the other hand, this time mine 
doesn't use list comprehensions, nor lambdas. And it also handles 
[http://bla.com] as well as [http://bla.com|Bla] with a single regex, so 
I've decided to post it anyway :).

This is a possible approach:

1. Make a dictionary mapping the uncompiled regexes to format strings 
(strings containing "%(n)s" placeholders) into which the matches are 
substituted

2. Build a dictionary of the compiled regexes, mapping them to the same 
format-strings

3. Loop through the keys of that dictionary and use the sub() method of 
each compiled regex to substitute. This method can optionally take a 
function as parameter instead of a constant string; we'll use this option.

4. In the replace function, build a dictionary of the groups found by 
the regex (make sure you build your regex in a smart way!)

5. Feed that dictionary to the format string (you can look up the format 
string in the dictionary of compiled regexes, see step 2)

6. Return the result and continue.
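
As a small illustration of steps 3 to 5, here is a minimal sketch in 
present-day Python. The two patterns and the names "rules" and 
"apply_rules" are made up for this example and are not the code attached 
below; the sketch only shows sub() calling a function and a "%(1)s" 
placeholder being filled from a dictionary of groups.

```python
import re

# Hypothetical rules: compiled regex -> format string (steps 1 and 2).
rules = {
    re.compile(r"\*\*(.*?)\*\*"): "<b>%(1)s</b>",
    re.compile(r"__(.*?)__"): "<u>%(1)s</u>",
}


def replace(matchobj):
    # Step 4: map each group number (as a string) to the text it matched.
    groups = {str(i): g or "" for i, g in enumerate(matchobj.groups(), 1)}
    # Step 5: matchobj.re is the compiled regex that produced this match,
    # so it can be used to look up the matching format string.
    return rules[matchobj.re] % groups


def apply_rules(text):
    # Step 3: sub() calls replace() once for every match it finds.
    for regex in rules:
        text = regex.sub(replace, text)
    return text


print(apply_rules("some **bold** and __underlined__ text"))
# prints: some <b>bold</b> and <u>underlined</u> text
```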

I've attached the code which does this at the bottom of this mail. It takes:

"""
Something can be __underlined__, ''italic'' or **bold**.
You can also insert **[http://offsite.org|offsite links]** or
[http://links.com] __without__ title.
"""

and turns it into:

"""
Something can be <u>underlined</u>, <i>italic</i> or <b>bold</b>.
You can also insert <b><a href="http://offsite.org">offsite links</a></b> or
<a href="http://links.com">http://links.com</a> <u>without</u> title.
"""

However, you should pay attention to a number of things:

- a pass by a regex must not affect a previous pass by another regex. 
This is particularly a problem with the detection of loose URLs: do not 
change into <a>-links URLs which are inside the href attribute for 
example (the URL parser should not find a hit in 
'href="http://bla.com"'). A couple of weeks ago, I started a thread on 
exactly this same topic. I have solved it in a safe but complex way 
(it detects all links you throw at it), but an easier solution was also 
posted, which has a minor (not fatal) flaw.

- you need a way to distinguish between images and links found between 
square brackets. My code doesn't do images at all; it converts 
everything to links. I think you can build the distinction into the 
regex, so that the [link] regex is not triggered by links ending in 
gif/png/jpg/jpeg. Obviously, the IMG regex [image.gif] shouldn't match 
any URL if it doesn't end in gif/png/jpg/jpeg.
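
Both points can be sketched roughly as follows. This is only one possible 
way to do it, written in present-day Python: all three patterns and the 
names "IMG_RE", "LINK_RE", "LOOSE_URL_RE" and "convert" are assumptions 
made for this example, and the lookbehind trick for loose URLs is a simple 
heuristic, not the solution from the earlier thread.

```python
import re

# Hypothetical patterns, not taken from the original post.
IMAGE_EXT = r"\.(?:gif|png|jpe?g)"

# [picture.gif] -> image: the bracket content must end in an image extension.
IMG_RE = re.compile(r"\[([^\]|\s]+" + IMAGE_EXT + r")\]", re.IGNORECASE)

# [url] or [url|title] -> link. The negative lookahead rejects anything
# that IMG_RE would accept, so the two regexes never compete.
LINK_RE = re.compile(
    r"\[(?![^\]|\s]+" + IMAGE_EXT + r"\])([^\]|]+)(?:\|([^\]]*))?\]",
    re.IGNORECASE,
)

# Loose URL that is not directly preceded by a double quote or '>', so
# URLs already sitting in href="..." or used as link text are left alone.
LOOSE_URL_RE = re.compile(r'(?<![">])https?://[^\s<>"\]|]+')


def link_repl(match):
    url, title = match.group(1), match.group(2)
    # Fall back to the URL itself when no |title was given.
    return '<a href="%s">%s</a>' % (url, title or url)


def loose_repl(match):
    url = match.group(0)
    return '<a href="%s">%s</a>' % (url, url)


def convert(text):
    text = IMG_RE.sub(r'<img src="\1">', text)
    text = LINK_RE.sub(link_repl, text)
    # Runs last: earlier passes have already produced href="..." attributes.
    return LOOSE_URL_RE.sub(loose_repl, text)


print(convert("[logo.png] [http://bla.com|Bla] see http://x.org now"))
```

The lookbehind has an obvious weak spot: a loose URL that happens to be 
written directly after a double quote or a '>' in the source text is 
skipped as well.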

And now to the code. Note that only the "regs" dictionary needs to be 
modified in order to add more parsing abilities. Without the comments, 
the thing is about 14 lines.

===
# define the sample text
text = """
Something can be __underlined__, ''italic'' or **bold**.
You can also insert **[http://offsite.org|offsite links]** or
[http://links.com] __without__ title.
"""

# define the regexes as plain-text, mapping them to a
# format-string representation corresponding with
# the number of the match group in the result

# raw strings keep the backslashes intact for the regex engine
regs = {r"__(.*)__": "<u>%(1)s</u>", # underlined
         r"''(.*)''": "<i>%(1)s</i>", # italic
         r"\*\*(.*)\*\*": "<b>%(1)s</b>", # bold
         r"\[(.+?)(?:\|+(.*?))*?\]": """<a href="%(1)s">%(2)s</a>"""
                       # link parser, matches [link|title] or [link]
}

# compile the regexes, converting regs to a dictionary of compiled
# regexes
import re
compiled = {}
for oneregex in regs.keys():
     compiled[re.compile(oneregex)] = regs[oneregex]

# write a function for formatting
def replace(matchobj):
     gr = {} # contains the groups extracted by the regex, mapping a
             # str(groupnumber) to that group's contents (string)
     # make a dictionary of all available groups, used for formatting
     # with the formatstring
     # assuming there are never more than 4 groups; you can increase
     # this number if necessary,
     # it doesn't matter as far as the code is concerned
     for i in range(5): # means at max 4 groups not counting group 0
         try: # in case there are fewer groups
             val = matchobj.group(i) # IndexError if there is no group i
             if not val: # None (group took no part in the match) or ''
                 raise ValueError # force exception; this occurs for
                                  # [link] without |title
         except (IndexError, ValueError):
             if str(i-1) in gr:
                 # fall back to the contents of the previous group
                 val = gr[str(i-1)] # this comes in handy for links which
                                    # have no caption specified!
             else:
                 val = "" # keep it empty
         gr[str(i)] = val
     # look up the format string for this item in the compiled dictionary
     formatstring = compiled[matchobj.re]
     # insert the extracted values into the format string
     return formatstring % gr

for regex in compiled.keys():
     text = regex.sub(replace, text)

print(text)
===
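
One thing to keep in mind when adding rules: the "(.*)" groups in the 
regs dictionary are greedy. That is harmless for the sample text, where 
each kind of markup appears at most once per line, but two marked-up 
spans on the same line get merged into one. A minimal sketch of the 
difference, using a made-up input line:

```python
import re

line = "__one__ and __two__"

# Greedy: (.*) grabs as much as possible, so the match runs from the
# first "__" to the last "__" on the line.
greedy = re.sub(r"__(.*)__", r"<u>\1</u>", line)

# Non-greedy: (.*?) stops at the first closing "__".
lazy = re.sub(r"__(.*?)__", r"<u>\1</u>", line)

print(greedy)  # prints: <u>one__ and __two</u>
print(lazy)    # prints: <u>one</u> and <u>two</u>
```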

How about that? :)

Andrei
