Right tool and method to strip off html files (python, sed, awk?)
sebzzz at gmail.com
Fri Jul 13 20:57:38 CEST 2007
I'm in the process of refactoring a lot of HTML documents and I'm
using HTML Tidy to do part of this work (clean up, convert to XHTML,
and remove font and center tags).
Now, Tidy will only do part of the work I need to do. I also have to
remove all the presentational tags and attributes from the pages (in
other words, strip the pages down), including the tables that are
used for layout (how do I tell them apart from real data tables?).
I thought about doing that with Python (which I'm in the process of
learning), but maybe another tool (like sed?) would be better suited
for this job.
I kind of know generally what I need to do:
1- Find all html files in the folders (sub-folders ...)
2- Do some file I/O and feed sed or Python or whatever else with the file.
3- Recursively apply some regular expressions to the file to do the
things I want. (delete when it encounters certain tags, certain
4- Write the changed file, and go through all the files like that.
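The four steps above can be sketched in Python roughly as follows. This is a minimal sketch, not a definitive implementation: the tag and attribute lists and the names strip_presentation and process_tree are assumptions made for illustration, the files are assumed to be UTF-8, and regular expressions are fragile on HTML (for example, a ">" inside an attribute value will confuse them). Each file is read as a whole rather than line by line, because a tag can span several lines.

```python
import re
from pathlib import Path

# Hypothetical lists: adjust them to the tags and attributes you want gone.
PRESENTATIONAL_TAGS = ("font", "center", "b", "i", "u")
PRESENTATIONAL_ATTRS = ("align", "bgcolor", "border")

# Matches an opening or closing tag whose name is in the list above.
# The \b keeps <b> from matching <br> or <body>, and <i> from matching <img>.
TAG_RE = re.compile(
    r"</?(?:%s)\b[^>]*>" % "|".join(PRESENTATIONAL_TAGS), re.IGNORECASE
)
# Matches name=value for the listed attributes (quoted or unquoted value).
ATTR_RE = re.compile(
    r"""\s+(?:%s)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)"""
    % "|".join(PRESENTATIONAL_ATTRS),
    re.IGNORECASE,
)
# Any opening tag; attributes are only removed inside these matches.
OPEN_TAG_RE = re.compile(r"<[a-zA-Z][^>]*>")


def strip_presentation(markup):
    """Step 3: drop the listed tags (keeping their content) and attributes."""
    markup = TAG_RE.sub("", markup)
    return OPEN_TAG_RE.sub(lambda m: ATTR_RE.sub("", m.group(0)), markup)


def process_tree(root):
    """Steps 1, 2 and 4: find every .html file under root and rewrite it."""
    changed = []
    for path in sorted(Path(root).rglob("*.html")):
        original = path.read_text(encoding="utf-8")  # whole file at once
        cleaned = strip_presentation(original)
        if cleaned != original:
            path.write_text(cleaned, encoding="utf-8")
            changed.append(path)
    return changed
```

Calling process_tree(".") would rewrite the files in place, so it is safer to try it on a copy of the folder first.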
But I don't know how to do it for real, the syntax and everything. I
also want to pick the tool that's the easiest for this job. I heard
about BeautifulSoup and lxml for Python, but I don't know if those
modules would help.
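What modules like these offer over regular expressions is a real parser: the document is split into tags, attributes and text, so rules can be written per tag instead of per pattern. The sketch below illustrates that idea with the standard library's html.parser module as a stand-in. It is not the BeautifulSoup or lxml API, and the class name PresentationStripper, the function strip_with_parser and the two sets are assumptions made for illustration. The output is rebuilt from the parsed pieces, so tag names come out in lowercase and attribute quoting is normalized.

```python
from html import escape
from html.parser import HTMLParser

DROP_TAGS = {"font", "center"}               # hypothetical list
DROP_ATTRS = {"align", "bgcolor", "border"}  # hypothetical list


class PresentationStripper(HTMLParser):
    """Re-emit a document, skipping unwanted tags and attributes."""

    def __init__(self):
        # convert_charrefs=False routes &amp; and friends to the handlers
        # below, so they can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _emit_tag(self, tag, attrs, closing=""):
        kept = []
        for name, value in attrs:
            if name in DROP_ATTRS:
                continue
            if value is None:  # attribute without a value, e.g. "checked"
                kept.append(" " + name)
            else:
                kept.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s%s%s>" % (tag, "".join(kept), closing))

    def handle_starttag(self, tag, attrs):
        if tag not in DROP_TAGS:
            self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        if tag not in DROP_TAGS:
            self._emit_tag(tag, attrs, " /")

    def handle_endtag(self, tag):
        if tag not in DROP_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def strip_with_parser(markup):
    parser = PresentationStripper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

The content of a dropped tag is kept; only the tag itself disappears. Deciding which tables are layout tables is a judgment call that this sketch does not attempt.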
Now, I know I'm not at the best place to ask whether Python is the right
choice (anyway, even my little finger tells me it is), but if I can do
the same thing more simply with another tool, it would be good to know.
Another argument for the other tools is that I know how to use the
Unix find program to find the files and feed them to grep or sed, but
I still don't know the syntax in Python (fetch the files, change
them, then write them), and I don't know whether I should read the
files and treat them as a whole or just line by line. Of course I
could mix shell commands with some Python: pipe the find command's
output to my program's standard input, and send my program's standard
output back to the original file. But how do
I control STDIN and STDOUT with Python?
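In Python, standard input and standard output are ordinary file objects, sys.stdin and sys.stdout. A minimal sketch of a filter built on them follows; the function name filter_stream and the font/center pattern are assumptions made for illustration. The function takes the two streams as parameters so that it can also be tried on in-memory files.

```python
import re
import sys

# Hypothetical pattern: opening and closing font and center tags.
FONT_CENTER_RE = re.compile(r"</?(?:font|center)\b[^>]*>", re.IGNORECASE)


def filter_stream(infile, outfile):
    """Copy infile to outfile with font and center tags removed."""
    # Read everything at once: a tag may be split across several lines,
    # which a line-by-line loop would miss.
    text = infile.read()
    outfile.write(FONT_CENTER_RE.sub("", text))
```

To use it as a shell filter, end the script with the call filter_stream(sys.stdin, sys.stdout). Note that redirecting the output straight back onto the input file (program < page.html > page.html) empties the file before it is read, because the shell truncates it first; write to a temporary file and rename it afterwards.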
Sorry if that's a lot of questions in one, and I will probably get a
lot of RTFM (which I'm doing btw), but I feel a little lost in all
that right now.
Any help would be really appreciated.