[Tutor] Extracting data from HTML files

bob bgailer at alum.rpi.edu
Wed Dec 28 23:23:22 CET 2005


At 01:26 PM 12/28/2005, motorolaguy at gmx.net wrote:
>[snip]
>I'm trying to write a Python script to extract certain data from HTML
>files. Say, for example, the HTML file has the following format:
><strong>Category:</strong>Category1<br><br>
>[...]
><strong>Name:</strong>Filename.exe<br><br>
>[...]
><strong>Description:</strong>Description1.<br><br>
>
>Taking into account that each HTML file has a load of code in between each
>[...], what I want to do is extract the information for each field. In this
>case I want the script to read Category1, Filename.exe and
>Description1.

Check out BeautifulSoup http://www.crummy.com/software/BeautifulSoup/
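
BeautifulSoup is a third-party package. If you would rather start with
something that ships with Python, here is a rough sketch of the same idea
using the standard library's html.parser module (the Python 3 relative of
sgmllib). The names FieldParser and extract_fields are made up for this
example, and it assumes every field looks like <strong>Label:</strong>value
as in your sample, so treat it as a starting point, not a finished tool.

```python
from html.parser import HTMLParser


class FieldParser(HTMLParser):
    """Collect the text that follows each <strong>Label:</strong> tag.

    Hypothetical helper; assumes the field layout shown in the question.
    """

    def __init__(self):
        super().__init__()
        self.fields = {}         # label -> value, e.g. "Category" -> "Category1"
        self._in_strong = False  # True while we are inside <strong>...</strong>
        self._label = None       # label whose value we are still waiting for

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self._in_strong = True
        elif tag == "br":
            # A <br> ends the current field, so stop waiting for a value.
            self._label = None

    def handle_endtag(self, tag):
        if tag == "strong":
            self._in_strong = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_strong:
            # Text inside <strong> is the label; drop the trailing colon.
            self._label = text.rstrip(":")
        elif self._label is not None:
            # First text after the label is the field value.
            self.fields[self._label] = text
            self._label = None


def extract_fields(html):
    """Return a dict of the labelled fields found in an HTML string."""
    parser = FieldParser()
    parser.feed(html)
    parser.close()
    return parser.fields
```

Calling extract_fields(open("page.html").read()) on a file laid out like
your sample should give a dict keyed by "Category", "Name" and
"Description" ("page.html" is just a placeholder file name). Everything
outside the labelled fields is ignored, so the code in between the [...]
parts does not matter.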

>And later on insert this into a MySQL database, or read the
>info and generate a CSV file to make db insertion easier.
>Since all the files are generated by a script, each field I want to read
>is, from what I've seen, on the same line number, so this could make things
>easier. But not all fields are of the same length.
>I've read Chapter 8 of Dive Into Python so I'm basing my work on that.
>I also thought regexes might be useful for this but I suck at using regexes
>so that's another problem.
>Do any of you have an idea of where I could get a good start on this and if
>there are any modules (like sgmllib.py) that might come in handy for this?
>Thanks!
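
Since the files are script-generated and the fields always have the same
shape, a single regex may be enough, and the standard csv module can write
the result. The sketch below is only an illustration: parse_fields, to_csv
and the column names are assumptions based on your sample, and the pattern
expects a value that contains no further tags.

```python
import csv
import io
import re

# Matches <strong>Label:</strong>value and captures the label and the value.
# Assumes the value runs up to the next tag (the <br> in the sample).
FIELD_RE = re.compile(
    r"<strong>\s*([^<:]+):\s*</strong>\s*([^<]*)", re.IGNORECASE
)


def parse_fields(html):
    """Return {label: value} for every labelled field in one HTML string."""
    return {label.strip(): value.strip()
            for label, value in FIELD_RE.findall(html)}


def to_csv(records, columns=("Category", "Name", "Description")):
    """Render a list of field dicts as CSV text, one row per HTML file."""
    buffer = io.StringIO()
    # extrasaction="ignore" skips labels that are not in `columns`;
    # labels missing from a record are written as empty cells.
    writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return buffer.getvalue()
```

Run parse_fields over each file's contents, collect the dicts in a list,
and pass the list to to_csv. The csv module takes care of quoting, so a
description that contains a comma will not break the row. Most databases
can then bulk-load the resulting file.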