[Tutor] String matching?
Kent Johnson
kent37 at tds.net
Tue Dec 7 13:58:37 CET 2004
Regular expressions are a bit tricky to understand but well worth the trouble - they are a powerful
tool. The Regex HOW-TO is one place to start:
http://www.amk.ca/python/howto/regex/
Of course, Jamie Zawinski famously said, "Some people, when confronted with a problem, think 'I
know, I'll use regular expressions.' Now they have two problems."
You can do a lot of cleanup with a few simple string substitutions:
test = ''' <app=
let
code=3D"fphover.class" height=3D"24" width=3D"138"><param name=3D"color"<applet
code
<ap=
plet '''
test2 = test.replace('=\n', '')
test2 = test2.replace('=3D"', '="')
print test2
prints =>
<applet
code="fphover.class" height="24" width="138"><param name="color"<applet
code
<applet
This is probably a good first step even if you want to use regular expressions to parse out the rest
of the data from the applet tag.
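As an aside, the '=\n' and '=3D' sequences look like quoted-printable encoding (a soft line break and an escaped '=' sign), which is what you tend to get when pasting the raw source of an email. That is only a guess about where the file came from. If it holds, the standard library's quopri module can undo both in one step. A minimal sketch, written for Python 3 (so print is a function) and using a made-up sample rather than the real file:

```python
import quopri

# Hypothetical sample, damaged the same way as the test string above.
raw = b'<app=\nlet\ncode=3D"fphover.class" height=3D"24">'

# decodestring() works on bytes: it drops '=\n' soft line breaks and
# turns '=3D' back into '='.
clean = quopri.decodestring(raw).decode('ascii')
print(clean)
```

If the file turns out not to be quoted-printable after all, the two replace() calls above remain the safer choice, since they only touch the two patterns you have actually seen.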
OK, here is a brute-force regex that will find the text 'applet' with '=\n' perhaps between any pair
of characters:
appRe = r'(=\n)?'.join(list('applet'))
print appRe
=> a(=\n)?p(=\n)?p(=\n)?l(=\n)?e(=\n)?t
The (=\n)? between each pair of letters means, optionally match =\n here.
You can use re.finditer to show all the matches:
import re
for match in re.finditer(appRe, test):
    print
    print match.group(0)
=>
app=
let
applet
ap=
plet
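The same idea can be wrapped in a small helper so it works for any word, not just 'applet'. This is only a sketch, written for Python 3: the name tolerant_pattern and the sample string are made up for illustration. It uses a non-capturing group (?:...) since the captured text is never needed, and re.escape in case the word contains characters that are special in a regex.

```python
import re

def tolerant_pattern(word):
    """Build a regex for `word` that allows '=' + newline between letters."""
    # Hypothetical helper: join the escaped letters with an optional soft break.
    return r'(?:=\n)?'.join(re.escape(ch) for ch in word)

# Made-up sample containing the three variants from the original question.
sample = '<app=\nlet\ncode <applet\ncode <ap=\nplet'

applet_re = re.compile(tolerant_pattern('applet'))
matches = [m.group(0) for m in applet_re.finditer(sample)]

# Removing the soft breaks from each hit recovers the plain word.
cleaned = [hit.replace('=\n', '') for hit in matches]

print(matches)
print(cleaned)
```

Each entry in cleaned should come out as plain 'applet', which is a quick way to check that the pattern is matching what you expect.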
A couple other options:
elementtidy reads HTML, cleans it up and creates a tree model of the source. You can easily modify
the tree model and write it out again. This has the bonus of giving you well-formed XHTML at the end
of the process. It is based on HTML Tidy and Fredrik Lundh's elementtree package which is very easy
to use.
http://www.effbot.org/zone/element-tidylib.htm
Beautiful Soup is an HTML parser that is designed to read bad HTML and give access to the tags. I'm
not sure if it gives you any help for rewriting, though.
http://www.crummy.com/software/BeautifulSoup/
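To give a feel for what "access to the tags" means, here is a rough sketch using the standard library's html.parser module as a stand-in. It is not elementtidy or Beautiful Soup, and it is much less forgiving of bad HTML than either, so it assumes the '=' cleanup above has already been done. It is written for Python 3; the class name AppletCollector and the param value in the sample are invented for the example.

```python
from html.parser import HTMLParser

class AppletCollector(HTMLParser):
    """Collect the attributes and <param> values of each <applet> tag."""

    def __init__(self):
        super().__init__()
        self.applets = []

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag and attribute names before calling this.
        if tag == 'applet':
            self.applets.append({'attrs': dict(attrs), 'params': {}})
        elif tag == 'param' and self.applets:
            param = dict(attrs)
            self.applets[-1]['params'][param.get('name')] = param.get('value')

# Hypothetical, already cleaned-up input; the colour value is made up.
html = ('<applet code="fphover.class" height="24" width="138">'
        '<param name="color" value="#000080">'
        '</applet>')

collector = AppletCollector()
collector.feed(html)
print(collector.applets)
```

From there, building the replacement link is a matter of picking out whichever params hold the name and the URL in the real file.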
HTH
Kent
Liam Clarke wrote:
> Hi all,
>
> I have a large amount of HTML that a previous person has liberally
> sprinkled a huge amount of applets through, instead of html links,
> which kills my browser to open.
>
> So, want to go through and replace all applets with nice simple links,
> and want to use Python to find the applet, extract a name and an URL,
> and create the link.
>
> My problem is, somewhere in my copying and pasting into the text file
> that the HTML currently resides in, it got all messed up it would
> seem, and there's a bunch of strange '=' all through it. (Someone said
> that the code had been generated in Frontpage. Is that a good thing or
> bad thing?)
>
> So, I want to search for <applet code=, but it may be in the file as
>
> <app=
> let
> code
>
> or <applet
> code
>
> or <ap=
> plet
>
> etc. etc. (Full example of yuck here
> http://www.rafb.net/paste/results/WcKPCy64.html)
>
> So, I want to be able to write a search that will match <applet code and
> <app=\nlet code (etc. etc.) without having to strip the file of '='
> and '\n'.
>
> I was thinking the re module is for this sort of stuff? Truth is, I
> wouldn't know where to begin with it, it seems somewhat powerful.
>
> Or, there's a much easier way, which I'm missing totally. If there is,
> I'd be very grateful for pointers.
>
> Thanks for any help you can offer.
>
> Liam Clarke
>