Right tool and method to strip off html files (python, sed, awk?)

Eric_Dexter at msn.com Eric_Dexter at msn.com
Fri Jul 13 20:07:45 EDT 2007


On Jul 13, 1:57 pm, seb... at gmail.com wrote:
> Hi,
>
> I'm in the process of refactoring a lot of HTML documents and I'm
> using HTML Tidy to do part of this
> work (clean up, convert to XHTML, and remove font and center tags).
>
> Now, Tidy will only do part of the work I need to
> do. I have to remove all the presentational tags and attributes from
> the pages (in other words, strip the pages down), including the tables that
> are used for laying out content (how to differentiate?).
>
> I thought about doing that with Python (which I'm in the process of
> learning), but maybe another tool (like sed?) would be better suited
> for this job.
>
> I kind of know generally what I need to do:
>
> 1- Find all html files in the folders (sub-folders ...)
> 2- Do some file I/O and feed sed or Python or whatever else with the file.
> 3- Apply some regular expressions recursively to the file to do the
> things I want (delete certain tags and certain attributes when it
> encounters them).
> 4- Write the changed file, and go through all the files like that.
>
> But I don't know how to do it for real, the syntax and everything. I
> also want to pick the tool that's the easiest for this job. I heard
> about BeautifulSoup and lxml for Python, but I don't know if those
> modules would help.
>
> Now, I know I'm not at the best place to ask if Python is the right
> choice (anyway, even my little finger tells me it is), but if I can do
> the same thing more simply with another tool it would be good to know.
>
> Another argument for the other tools is that I know how to use the
> Unix find program to find the files and feed them to grep or sed, but
> I still don't know what the syntax is with Python (fetch files, change
> them, then write them) and I don't know if I should read the files and
> treat them as a whole or just line by line. Of course I could mix
> commands with some Python: pipe the find command to my program's
> standard input, and my program's standard output to the original file.
> But how do I control STDIN and STDOUT with Python?
>
> Sorry if that's a lot of questions in one, and I will probably get a
> lot of RTFM (which I'm doing btw), but I feel a little lost in all
> that right now.
>
> Any help would be really appreciated.
> Thanks
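
On the STDIN/STDOUT question: in Python they are the file objects
sys.stdin and sys.stdout. Below is a minimal sketch of a filter,
assuming Python 3. The tag pattern and the name filter_stream are only
illustrative, and regular expressions are fragile on real-world HTML,
so treat this as a starting point rather than a finished tool.

```python
import re
import sys

# Illustrative pattern only: matches opening and closing <font> and <center> tags.
PRESENTATIONAL_TAG = re.compile(r"</?(?:font|center)\b[^>]*>", re.IGNORECASE)

def filter_stream(infile=None, outfile=None):
    """Copy infile to outfile, dropping the matched tags but keeping their contents."""
    infile = sys.stdin if infile is None else infile
    outfile = sys.stdout if outfile is None else outfile
    # Read the whole document at once rather than line by line,
    # because a single tag can be split across several lines.
    outfile.write(PRESENTATIONAL_TAG.sub("", infile.read()))
```

With a call to filter_stream() at the end of the file, a script like
this (say strip.py, a made-up name) could be run as
"python strip.py < page.html > page.new.html". Redirecting the output
onto the same file you are reading from truncates it before it is read,
so write to a new file and rename it afterwards.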

You might find a text editor is the way to go. You can use AutoIt,
either through Python or by itself, to control the text editor you
use. I just downloaded PSPad and it looks like it will do that. It
may be a pain to script, though.
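
If you would rather stay in Python, the four steps in the quoted message
(find the files, read each one, drop tags and attributes, write it back)
can be sketched with the standard library alone. This assumes Python 3
and UTF-8 files. The DROP_TAGS and DROP_ATTRS lists and the names
Stripper, strip_presentation and process_tree are made up for
illustration. It does not try to tell layout tables from data tables,
which needs a human decision, and it does not preserve CDATA sections.

```python
import os
from html import escape
from html.parser import HTMLParser

# Assumed lists; extend them to match what your pages actually contain.
DROP_TAGS = {"font", "center"}
DROP_ATTRS = {"align", "bgcolor", "border", "color", "face", "style", "valign"}

class Stripper(HTMLParser):
    """Re-emit the document, skipping unwanted tags and attributes."""

    def __init__(self):
        # convert_charrefs=False so entities such as &amp; pass through unchanged.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _start(self, tag, attrs, closing):
        if tag in DROP_TAGS:
            return  # drop the tag itself, keep whatever is inside it
        parts = [tag]
        for name, value in attrs:
            if name in DROP_ATTRS:
                continue
            if value is None:
                parts.append(name)  # attribute without a value
            else:
                parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<" + " ".join(parts) + closing)

    def handle_starttag(self, tag, attrs):
        self._start(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._start(tag, attrs, " />")

    def handle_endtag(self, tag):
        if tag not in DROP_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)

def strip_presentation(html_text):
    """Return html_text without the tags and attributes listed above."""
    parser = Stripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)

def process_tree(root):
    """Rewrite every .html/.htm file under root in place."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith((".html", ".htm")):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8") as f:
                    text = f.read()
                with open(path, "w", encoding="utf-8") as f:
                    f.write(strip_presentation(text))
```

Calling process_tree(".") would rewrite every HTML file below the
current directory, so try it on a copy of the site first. Each file is
read as a whole rather than line by line, because tags often span lines.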

http://sourceforge.net/projects/dex-tracker/



