[Tutor] Extracting data from HTML files

Kent Johnson kent37 at tds.net
Thu Dec 29 04:16:47 CET 2005


motorolaguy at gmx.net wrote:
> I'm trying to make a Python script for extracting certain data from HTML
> files. These files are from a template so they all have the same formatting. I
> just want to extract the data from certain fields. It would also be nice to
> insert it into a MySQL database, but I'll leave that for later since I'm
> stuck on just reading the files.
> Say for example the HTML file has the following format:
> 
> <strong>Category:</strong>Category1<br><br>
> [...]
> <strong>Name:</strong>Filename.exe<br><br>
> [...]
> <strong>Description:</strong>Description1.<br><br>


Since your data all has the same form, I think a regex will find it
easily. Something like:

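import re
# The non-greedy (.*?) captures everything up to the first <br><br>
catRe = re.compile(r'<strong>Category:</strong>(.*?)<br><br>')
# 'page.html' is a placeholder name; read your own HTML file here
data = open('page.html').read()
m = catRe.search(data)
if m:  # search() returns None when the pattern is not found
    category = m.group(1)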
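
The same idea extends to the other fields in your sample. Here is a rough
sketch, not a finished solution: it assumes every field follows the
<strong>Label:</strong>value<br><br> layout shown above. The names FIELDS,
PATTERNS and extract_fields are made up for this example, and the use of
re.DOTALL assumes a value might span more than one line.

```python
import re

# Field labels taken from the sample HTML in the question.
FIELDS = ("Category", "Name", "Description")

# One compiled pattern per label, built from the same template as catRe.
# re.DOTALL lets (.*?) cross line breaks (an assumption about the files).
PATTERNS = {
    field: re.compile(
        r"<strong>%s:</strong>(.*?)<br><br>" % re.escape(field), re.DOTALL
    )
    for field in FIELDS
}


def extract_fields(html):
    """Return a dict mapping each label to its text, or None if not found."""
    result = {}
    for field, pattern in PATTERNS.items():
        m = pattern.search(html)
        result[field] = m.group(1).strip() if m else None
    return result


# Inline sample standing in for the contents of one HTML file.
sample = (
    "<strong>Category:</strong>Category1<br><br>\n"
    "<strong>Name:</strong>Filename.exe<br><br>\n"
    "<strong>Description:</strong>Description1.<br><br>\n"
)
fields = extract_fields(sample)
print(fields["Name"])
```

Returning None for a missing field, rather than raising an error, should
make it easier to spot files that do not match the template when you loop
over many of them.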

> I also thought regexes might be useful for this but I suck at using regexes
> so that's another problem.

Regexes take some effort to learn, but it is worth it. They are a very
useful tool in many contexts, not just Python. Have you read the regex
HOW-TO?
http://www.amk.ca/python/howto/regex/

Kent


