[Tutor] String matching?

Tue Dec 7 13:58:37 CET 2004

Regular expressions are a bit tricky to understand but well worth the trouble - they are a powerful 
tool. The Regex HOW-TO is one place to start:
http://www.amk.ca/python/howto/regex/

Of course, Jamie Zawinski famously said, "Some people, when confronted with a problem, think 'I 
know, I'll use regular expressions.'  Now they have two problems."

You can do a lot of cleanup with a few simple string substitutions:

test = ''' <app=
let
  code=3D"fphover.class" height=3D"24" width=3D"138"><param name=3D"color"<applet
         code
<ap=
plet '''

# Join lines that were split with a trailing '='
test2 = test.replace('=\n', '')
# Turn the encoded '=3D"' back into '="'
test2 = test2.replace('=3D"', '="')
print(test2)

prints =>

  <applet
  code="fphover.class" height="24" width="138"><param name="color"<applet
         code
<applet

This is probably a good first step even if you want to use regular expressions to parse out the rest 
of the data from the applet tag.
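
The trailing '=' at line ends and the '=3D' sequences look like quoted-printable encoding (the MIME 
transfer encoding used in email), where a trailing '=' is a soft line break and '=3D' is an encoded 
'='. If that assumption holds for your file, the standard library's quopri module may undo all of it 
in one step. A minimal sketch in Python 3 syntax, reusing the sample text from above:

```python
import quopri

test = ''' <app=
let
  code=3D"fphover.class" height=3D"24" width=3D"138"><param name=3D"color"<applet
         code
<ap=
plet '''

# Assumption: the damage is quoted-printable encoding.
# decodestring() works on bytes, so encode first and decode the result.
decoded = quopri.decodestring(test.encode('ascii')).decode('ascii')
print(decoded)
```

This is only a sketch: if the file is not really quoted-printable, the two replace() calls above are 
the safer route, and either way the result is worth checking by eye.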

OK, here is a brute-force regex that will find the text 'applet' even when '=\n' has been inserted 
between any pair of its characters:

# join() iterates over the characters of 'applet' directly
appRe = r'(=\n)?'.join('applet')
print(appRe)

=> a(=\n)?p(=\n)?p(=\n)?l(=\n)?e(=\n)?t

The (=\n)? between each pair of letters means, optionally match =\n here.
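
The same trick can be wrapped in a small helper so it works for other words too, for example to 
match '<applet' followed by whitespace and 'code'. This is just a sketch in Python 3 syntax; the 
names flexible and applet_code and the sample string are made up for illustration, and it uses a 
non-capturing group (?:...) instead of a plain group:

```python
import re

def flexible(word, junk=r'(?:=\n)?'):
    """Build a regex for `word` that allows optional `junk` between letters."""
    # re.escape() keeps any special characters in the word literal.
    return junk.join(re.escape(ch) for ch in word)

# '<applet', then some whitespace, then 'code', each word possibly broken by '=\n'
applet_code = re.compile('<' + flexible('applet') + r'\s+' + flexible('code'))

# Hypothetical sample covering the three broken forms
sample = '<app=\nlet\n  code <applet\n         code <ap=\nplet code'
matches = [m.group(0) for m in applet_code.finditer(sample)]
print(len(matches))
```

It would not catch a '=\n' that falls inside the whitespace itself, so it is not a complete 
answer.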

You can use re.finditer to show all the matches:

import re

for match in re.finditer(appRe, test):
    print('')  # blank line between matches
    print(match.group(0))

=>
app=
let

applet

ap=
plet
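
If you only want to repair the broken tag names and leave the rest of the file alone, the same 
pattern can be handed to re.sub. A small sketch in Python 3 syntax (app_re and fixed are just 
names chosen for this example):

```python
import re

test = ''' <app=
let
  code=3D"fphover.class" height=3D"24" width=3D"138"><param name=3D"color"<applet
         code
<ap=
plet '''

app_re = r'(=\n)?'.join('applet')

# Replace every variant ('app=\nlet', 'ap=\nplet', ...) with the plain word.
fixed = re.sub(app_re, 'applet', test)
print(fixed)
```

Note that this leaves the '=3D' sequences untouched; they still need the replace() step.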

A couple of other options:
elementtidy reads HTML, cleans it up and creates a tree model of the source. You can easily modify 
the tree model and write it out again. This has the bonus of giving you well-formed XHTML at the end 
of the process. It is based on HTML Tidy and Fredrik Lundh's elementtree package which is very easy 
to use.
http://www.effbot.org/zone/element-tidylib.htm

Beautiful Soup is an HTML parser that is designed to read bad HTML and give access to the tags. I'm 
not sure if it gives you any help for rewriting, though.
http://www.crummy.com/software/BeautifulSoup/
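
To give a feel for the parser approach without installing anything, here is a rough sketch that 
uses the standard library's html.parser module (Python 3) as a stand-in for the two packages above. 
It assumes the text has already been cleaned up, and the param names 'text' and 'url', their 
values, and the class name AppletCollector are made up for the example; your applets may use 
different ones:

```python
from html.parser import HTMLParser

class AppletCollector(HTMLParser):
    """Collect each applet's code attribute and its param name/value pairs."""

    def __init__(self):
        super().__init__()
        self.applets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'applet':
            self.applets.append({'code': attrs.get('code'), 'params': {}})
        elif tag == 'param' and self.applets:
            # Attach the param to the most recently opened applet.
            self.applets[-1]['params'][attrs.get('name')] = attrs.get('value')

# Hypothetical cleaned-up input
html = ('<applet code="fphover.class" height="24" width="138">'
        '<param name="text" value="Home">'
        '<param name="url" value="index.htm"></applet>')

collector = AppletCollector()
collector.feed(html)

# Build a plain link from the (assumed) 'url' and 'text' params.
links = ['<a href="%s">%s</a>' % (a['params']['url'], a['params']['text'])
         for a in collector.applets]
print(links)
```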

HTH
Kent

Liam Clarke wrote:
> Hi all, 
> 
> I have a large amount of HTML that a previous person has liberally
> sprinkled a huge amount of applets through, instead of html links,
> which kills my browser to open.
> 
> So, want to go through and replace all applets with nice simple links,
> and want to use Python to find the applet, extract a name and an URL,
> and create the link.
> 
> My problem is, somewhere in my copying and pasting into the text file
> that the HTML currently resides in, it got all messed up it would
> seem, and there's a bunch of strange '=' all through it. (Someone said
> that the code had been generated in Frontpage. Is that a good thing or
> bad thing?)
> 
> So, I want to search for <applet code=, but it may be in the file as 
> 
> <app=
> let
>  code
> 
> or <applet
>         code
> 
> or <ap=
> plet 
> 
> etc. etc. (Full example of yuck here
> http://www.rafb.net/paste/results/WcKPCy64.html)
> 
> So, I want to write a search that will match <applet code and
> <app=\nlet code (etc. etc.) without having to strip the file of '='
> and '\n'.
> 
> I was thinking the re module is for this sort of stuff? Truth is, I
> wouldn't know where to begin with it, it seems somewhat powerful.
> 
> Or, there's a much easier way, which I'm missing totally. If there is,
> I'd be very grateful for pointers.
> 
> Thanks for any help you can offer.
> 
> Liam Clarke
>