replacing words in HTML file
Cameron Simpson
cs at zip.com.au
Thu Apr 29 18:47:27 EDT 2010
On 29Apr2010 05:03, james_027 <cai.haibin at gmail.com> wrote:
| On Apr 29, 5:31 am, Cameron Simpson <c... at zip.com.au> wrote:
| > On 28Apr2010 22:03, Daniel Fetchinson <fetchin... at googlemail.com> wrote:
| > | > Any idea how I can replace words in an HTML file? Meaning only the
| > | > content will get replaced while the HTML tags, javascript, & css
| > | > remain untouched.
[...]
| > The only way to get this right is to parse the file, then walk the doc
| > tree editing only the text parts.
| >
| > The BeautifulSoup module (3rd party, but a single .py file and trivial to
| > fetch and use, though it has some dependencies) does a good job of this,
| > coping even with typical not quite right HTML. It gives you a parse
| > tree you can easily walk, and you can modify it in place and write it
| > straight back out.
|
| Thanks for all your input. Cameron Simpson gets the idea of what I am
| trying to do. I've been looking at Beautiful Soup, but so far I don't know
| how to perform search and replace within it.
Well the BeautifulSoup web page helped me:
http://www.crummy.com/software/BeautifulSoup/documentation.html
Here's a function from a script I wrote to bulk edit a web site. I was
replacing OBJECT and EMBED nodes with modern versions:
  def recurse(node):
    global didmod
    for O in node.contents:
      if isinstance(O,Tag):
        for attr in 'src', 'href':
          if attr in O:
            rurl=O[attr]
            rurlpath=pathwrt(rurl,SRCPATH)
            if not os.path.exists(rurlpath):
              print >>sys.stderr, "%s: MISSING: %s" % (SRCPATH, rurlpath,)
        O2=None
        if O.name == "object":
          O2, SUBOBJ = fixmsobj(O)
        elif O.name == "embed":
          O2, SUBOBJ = fixembed(O)
        if O2 is not None:
          O.replaceWith(O2)
          SUBOBJ.replaceWith(O)
          ##print >>sys.stderr, "%s: update: new OBJECT: %s" % (SRCPATH, str(O2), )
          didmod=True
          continue
        recurse(O)
but you have only to change it a little to modify things that aren't Tag
objects. The calling end looks like this:
  with open(SRCPATH) as srcfp:
    srctext = srcfp.read()
  SOUP = BeautifulSoup(srctext)
  didmod = False # icky global set by recurse()
  recurse(SOUP)
  if didmod:
    srctext = str(SOUP)
If didmod becomes True we recompute srctext and resave the file (or save it
to a copy).
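For the original question (replace words in the text only, leaving tags, javascript and css alone), here is a rough sketch of the same "edit only the text parts" idea. It is not the script quoted above: it swaps BeautifulSoup for Python 3's standard html.parser so that it runs without third-party modules, and the names TextReplacer and replace_words are made up for illustration. It re-emits everything as written except ordinary text, with the small caveat that end tags come out lowercased.

```python
import re
from html.parser import HTMLParser


class TextReplacer(HTMLParser):
    """Re-emit HTML, substituting words only in ordinary text data."""

    SKIP = ("script", "style")  # text inside these elements is left alone

    def __init__(self, pattern, replacement):
        # convert_charrefs=False so entities like &amp; arrive separately
        # and can be written back exactly as they were.
        super().__init__(convert_charrefs=False)
        self.pattern = pattern
        self.replacement = replacement
        self.out = []
        self.skip_depth = 0
        self.didmod = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        # Raw source text of the tag, so attributes are untouched.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        self.out.append("</%s>" % tag)  # note: tag name is lowercased

    def handle_data(self, data):
        if not self.skip_depth:
            newdata = self.pattern.sub(lambda m: self.replacement, data)
            if newdata != data:
                self.didmod = True
            data = newdata
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)


def replace_words(html, old, new):
    """Return (new_html, didmod); only whole-word matches are replaced."""
    parser = TextReplacer(re.compile(r"\b%s\b" % re.escape(old)), new)
    parser.feed(html)
    parser.close()
    return "".join(parser.out), parser.didmod


html = '<p class="cat">The cat sat.</p><script>var cat = 1;</script>'
newhtml, didmod = replace_words(html, "cat", "dog")
print(newhtml)
```

As with the didmod flag above, the second return value says whether anything changed, so the file only needs rewriting when it is True. With BeautifulSoup the equivalent change would be to handle the non-Tag (string) children in recurse() instead of skipping them.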
Cheers,
--
Cameron Simpson <cs at zip.com.au> DoD#743
http://www.cskk.ezoshosting.com/cs/
Democracy is the theory that the people know what they want, and deserve to
get it good and hard. - H.L. Mencken