replace link in html

Pete Goodeve pete at jwgibbs.cchem.Berkeley.EDU
Thu May 29 18:37:39 EDT 2003


In article <7fe97cc4.0305290151.5bcbb454 at posting.google.com>,
Xah Lee <xah at xahlee.org> wrote:
>how can i write a script that goes thru a directory, find each .html
>file, replace links of the form
>
><a href="....suffix">some text</a>
>
>by some other text, where the suffix is one of .mov, .nb and others.
>Note the link may not be in one line, so i need some kind of html
>parser instead just regex search replace. (that's where i'd be stuck)
>
Actually, provided that you can read each html file as a long single
string, you *can* use a regular expression to find and replace, even
if the link spans several lines.  Just specify 're.DOTALL' in the
flags parameter of your 're.compile' call, and period will match newlines
as well as other characters, so you aren't bounded by end-of-line.
(Check the section on the 're' module in your Library Reference.)

Not sure exactly what your goal is, so I won't try to think up any
code for it, but I have a simpler job that I can describe, which might
help.

I want to simply remove all html tags from a text (in which a single
tag might itself occupy more than one line), so I just do (where 'str'
starts as the original text):

	pat = re.compile('<.*?>', re.DOTALL)
	str = string.join(pat.split(str), "")

(My actual code is a bit more complicated than shown, because I'm also
removing entities and doing other things at the same time.)
Note that the '.*?' spans the shortest string between brackets, otherwise
it would span and discard from the beginning of the first tag to the end
of the last!

To do what you need to, you'd probably have to extract and recombine
subexpressions and so on, but it should be doable.

HTH
					-- Pete --





More information about the Python-list mailing list