[Tutor] Search and Replace

Kent Johnson kent37 at tds.net
Fri Jun 24 19:14:44 CEST 2005


Reed L. O'Brien wrote:
> I am trying to write a script to search and replace a sizable chunk of
> text in about 460 html pages.
> It is an old form that uses a search engine no longer available.
> 
> I am wondering if anyone has any advice on the best way to go about that.
> There is more than one layout placement for the form, but I would be
> happy to correct the last few by hand as more than 90% are the same.
> 
> So my ideas have been,
> use regex to find <form>.*</form> and replace it with <form>newform</form>.
> Unfortunately there is more than just the search form.  So this would just
> clobber all of them.  So I could use <form>.*knownName of
> SearchButton.*</form> --> <form>newform</form>

If you are sure 'knownName of SearchButton' only occurs in the form you want to replace, this seems like a good option. Just be sure to use non-greedy matching:
<form>.*?knownName of SearchButton.*?</form>

Without the ? you will match from the start of the first form in the page to the end of the last form, as long as the search form is one of them. Since the form probably spans several lines, compile the pattern with re.DOTALL so that . also matches newlines.
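
Here is a rough sketch of the difference, not a tested solution for the real pages. The sample page, the marker name 'searchButton' (standing in for 'knownName of SearchButton'), and the <form\b prefix that allows attributes on the tag are all assumptions made for illustration. One caveat worth checking: a regex match always starts at the leftmost possible position, so even the non-greedy pattern can begin at an earlier form if the search form is not the first one on the page. The third pattern below is one possible way around that; it refuses to cross a </form> before reaching the marker.

```python
import re

# Hypothetical page with three forms; 'searchButton' stands in for the
# 'knownName of SearchButton' marker discussed above.
PAGE = """<html><body>
<form action="/login"><input name="user"></form>
<form action="/search">
  <input type="submit" name="searchButton">
</form>
<form action="/mail"><input name="email"></form>
</body></html>"""

NEW_FORM = "<form>newform</form>"

# re.DOTALL lets '.' match newlines, since a form usually spans several lines.
greedy = re.compile(r"<form\b.*searchButton.*</form>", re.DOTALL)
lazy = re.compile(r"<form\b.*?searchButton.*?</form>", re.DOTALL)

# Tempered variant: the text between '<form' and the marker may not contain
# '</form>', so the match cannot start inside an earlier form.
tempered = re.compile(
    r"<form\b(?:(?!</form>).)*?searchButton.*?</form>", re.DOTALL
)

greedy_result = greedy.sub(NEW_FORM, PAGE)      # swallows all three forms
lazy_result = lazy.sub(NEW_FORM, PAGE)          # swallows login + search forms
tempered_result = tempered.sub(NEW_FORM, PAGE)  # replaces only the search form
```

If the search form always comes first on the page, the plain non-greedy pattern is enough; the tempered one only matters when other forms precede it.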

> 
> Or should I read each file in as a big string and break on the form
> tags, test the strings as necessary and operate on them if the conditions
> are met.   Unfortunately I think there are wide variances in white
> space and line breaks.  Even the order of the tags is inconsistent.  So
> I think I am stuck with the first option...
> 
> Unless there is  some module or package I haven't found that works on
> html in just the way that I want.  I found htmlXtract but it is for
> Python 1.5 and not immediately intuitive.

You might be able to find a module that will read the HTML into a structured form, work on that, and write it out again. Whether this is easy or practical depends a lot on how well-formed your HTML is, and how important it is to keep exactly the same form when you write it back out. You could take a look at ElementTidy for example.
http://effbot.org/zone/element-tidylib.htm

But I think the regex solution sounds good.
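
To apply the regex across all the pages, something along these lines might do; treat it as a sketch under assumptions rather than a finished script. The function name replace_search_form, the 'searchButton' marker, the *.html glob, and the .bak backup suffix are all made up for the example. It works on bytes so the files' encoding and line endings are left alone, and it returns the files where nothing matched, which would be the ones to fix by hand.

```python
import re
from pathlib import Path

# Bytes pattern; 'searchButton' is a placeholder for the real button name.
# IGNORECASE covers pages that write <FORM> in upper case.
FORM_RE = re.compile(
    rb"<form\b(?:(?!</form>).)*?searchButton.*?</form>",
    re.DOTALL | re.IGNORECASE,
)
NEW_FORM = b"<form>newform</form>"


def replace_search_form(root):
    """Replace the search form in every *.html file under root.

    Returns (changed, unchanged): two lists of paths. Files in
    'unchanged' did not match and may need correcting by hand.
    """
    changed, unchanged = [], []
    for path in sorted(Path(root).rglob("*.html")):
        data = path.read_bytes()
        new_data, count = FORM_RE.subn(NEW_FORM, data)
        if count:
            # Keep the original next to the rewritten file, e.g. page.html.bak
            path.with_name(path.name + ".bak").write_bytes(data)
            path.write_bytes(new_data)
            changed.append(path)
        else:
            unchanged.append(path)
    return changed, unchanged
```

It would be worth running this on a copy of the site first and checking a few of the changed pages before trusting it with all 460.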

Kent


