Html: replacing tags
Lee Harr
missive at frontiernet.net
Fri Jun 13 17:55:32 EDT 2003
>> I'm working on an RSS aggregator and I'd like to replace all
>> img-tags in a piece of html with links to the image, thereby
>> using the alt-text of the img as link text (if present). The
>> rest of the html, including tags, should stay as-is. I'm capable
>> of doing this in what feels like the dumb way (parsing it with
>> regexes for example, or plain old string splitting and rejoining),
>> but I have this impression the HTMLParser or htmllib module should
>> be able to help me with this task.
>>
>> However, I can't figure out how (if?) I can make a parser do this.
>
> Yes, HTMLParser only parses, but you can subclass it and override its
> behaviour. What I do is subclass HTMLParser and override all the
> handler methods so that each appends its arguments, nearly as is, to a
> list kept on the parser object. Then, when the parsing has finished, you
> can retrieve this list and join it to get a string with the original HTML.
>
> Of course, inside handle_starttag and handle_endtag you can test the tag
> being parsed and insert it as is or substitute something else for it.
>
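A minimal sketch of the subclassing approach described above, applied to
the img-to-link case from the original question. It is written against
Python 3, where the module is named html.parser (an assumption; at the
time of this thread it was HTMLParser). The names ImgToLink and
replace_imgs are made up for illustration, and the output is not
guaranteed to be byte-identical to the input for unusual markup.

```python
from html import escape
from html.parser import HTMLParser


class ImgToLink(HTMLParser):
    """Re-emit HTML nearly as is, except that <img> tags become links."""

    def __init__(self):
        # convert_charrefs=False so entity references reach the handlers
        # below and can be written back out instead of being decoded.
        super().__init__(convert_charrefs=False)
        self.parts = []  # pieces of output, joined at the end

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attrs = dict(attrs)
            src = attrs.get("src") or ""
            text = attrs.get("alt") or src  # no alt text: show the URL
            self.parts.append('<a href="%s">%s</a>' % (escape(src), escape(text)))
        else:
            # Raw source text of the tag that was just parsed.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # The default would also call handle_endtag, which would emit a
        # spurious closing tag after something like <br/>.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag != "img":
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)

    def handle_comment(self, data):
        self.parts.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.parts.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.parts.append("<?%s>" % data)


def replace_imgs(html):
    parser = ImgToLink()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)
```

Calling replace_imgs() on a fragment returns the fragment with each img
tag replaced by an a tag whose link text is the alt text, or the image
URL when there is no alt text. Everything else goes through the
pass-through handlers.
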
I needed to do something very similar recently. I was making a mirror
of a website to burn onto a CD-ROM, so all links needed to be made
relative instead of absolute.

Replacing tags seems like it may be a very common thing to do.
If someone writes a general solution, it might be nice to have this
functionality in the standard library.

My solution was to get a list of the tags and then, for each line in
the file, just call line.replace(old_tag, new_tag).
The problem is that it tends to match things that should not be replaced.
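
One way this over-matching can show up, sketched under assumptions (the
post does not say which cases were hit): plain substring replacement
knows nothing about HTML structure, so it also rewrites tag text that
sits inside a comment. The URL, page content, and the name
naive_replace are hypothetical.

```python
def naive_replace(html, old_tag, new_tag):
    # Line-by-line substring replacement, with no notion of HTML structure.
    return "\n".join(line.replace(old_tag, new_tag) for line in html.split("\n"))


old_tag = '<a href="http://example.com/docs/a.html">'
new_tag = '<a href="docs/a.html">'

page = (
    '<!-- original link: <a href="http://example.com/docs/a.html"> -->\n'
    '<p><a href="http://example.com/docs/a.html">Docs</a></p>'
)

result = naive_replace(page, old_tag, new_tag)
print(result)
```

Both occurrences are rewritten, including the one in the comment that
was meant to record the original address. A parser-based rewrite only
acts on real start tags, so it would leave the comment alone.
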