[Tutor] Extracting data from HTML files

bob bgailer at alum.rpi.edu
Wed Dec 28 23:23:22 CET 2005

At 01:26 PM 12/28/2005, motorolaguy at gmx.net wrote:
>I'm trying to write a Python script to extract certain data from HTML
>files. Say, for example, the HTML file has the following format:
>Taking into account that each HTML file has a load of code in between each
>[...], what I want to do is extract the information for each field. In this
>case I want the script to read Category1, filename.exe and

Check out BeautifulSoup http://www.crummy.com/software/BeautifulSoup/
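
A rough sketch of how that might look with the current bs4 package (pip install beautifulsoup4). The sample markup, the class names "category" and "file", and the helper extract_fields are all made up for illustration, since the real file layout is not shown here; adjust the find() calls to whatever actually surrounds each field.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real generated files will look different.
SAMPLE_HTML = """
<html><body>
  <div class="junk">navigation, ads, other code</div>
  <span class="category">Category1</span>
  <p>more unrelated markup</p>
  <a class="file" href="/downloads/filename.exe">filename.exe</a>
</body></html>
"""


def extract_fields(html):
    """Return the fields of interest as a dict (None for any field not found)."""
    soup = BeautifulSoup(html, "html.parser")
    category = soup.find("span", class_="category")
    link = soup.find("a", class_="file")
    return {
        "category": category.get_text(strip=True) if category else None,
        "filename": link.get_text(strip=True) if link else None,
    }


print(extract_fields(SAMPLE_HTML))
```

Looking elements up by tag and attribute, rather than by line number, should keep working even if the generating script shifts things by a line or two.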

>And later on insert this into a MySQL database, or read the
>info and generate a CSV file to make DB insertion easier.
>Since all the files are generated by a script, each field I want to read
>is, from what I've seen, on the same line number, so this could make things
>easier. But not all fields are of the same length.
>I've read Chapter 8 of Dive Into Python, so I'm basing my work on that.
>I also thought regexes might be useful for this, but I suck at using regexes,
>so that's another problem.
>Do any of you have an idea of where I could get a good start on this, and
>whether there are any modules (like sgmllib.py) that might come in handy?
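
If you would rather stay in the standard library: sgmllib was removed in Python 3, and html.parser is the event-driven module that covers the same ground. Below is a minimal sketch, not a finished tool, that pulls two fields out of a page and writes them as CSV ready for a database import. The class names "category" and "file", the sample markup, and the names FieldParser and rows_to_csv are assumptions for illustration only.

```python
import csv
import io
from html.parser import HTMLParser


class FieldParser(HTMLParser):
    """Collect the text of elements whose class attribute is in `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None  # class name of the field being read, if any
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in self.wanted:
            self.current = cls
            self.fields[cls] = ""

    def handle_data(self, data):
        if self.current is not None:
            self.fields[self.current] += data.strip()

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current field,
        # so nested markup inside a field is not handled.
        self.current = None


def rows_to_csv(rows, columns):
    """Render a list of dicts as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical page content; the real files will differ.
SAMPLE = (
    '<div>unrelated code</div>'
    '<span class="category">Category1</span>'
    '<a class="file" href="#">filename.exe</a>'
)

parser = FieldParser(["category", "file"])
parser.feed(SAMPLE)
parser.close()
print(rows_to_csv([parser.fields], ["category", "file"]))
```

Running one parser per file and collecting each parser.fields dict into a list would give one CSV row per HTML file.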
