python fast HTML data extraction library
Paul McGuire
ptmcg at austin.rr.com
Wed Jul 22 21:53:33 EDT 2009
On Jul 22, 5:43 pm, Filip <pink... at gmail.com> wrote:
>
> My library, rather than parsing the whole input into a tree, processes
> it like a char stream with regular expressions.
>
Filip -
In general, parsing HTML with re's is fraught with easily-overlooked
deviations from the norm. But since you have stepped up to the task,
here are some comments on your re's:
# You should use raw string literals throughout, as in:
# blah_re = re.compile(r'sljdflsflds')
# (note the leading r before the string literal). Raw string literals
# really help keep your re expressions clean, so that you don't ever
# have to double up any '\' characters.
# Attributes might be enclosed in single quotes, or not enclosed in
# any quotes at all.
attr_re = re.compile('([\da-z]+?)\s*=\s*\"(.*?)\"', re.DOTALL |
                     re.UNICODE | re.IGNORECASE)
# Needs re.IGNORECASE, and can have tag attributes, such as
# <BR CLEAR="ALL">
line_break_re = re.compile('<br\/?>', re.UNICODE)
# What about HTML entities defined using hex syntax, such as &#xxxx; ?
amp_re = re.compile('\&(?![a-z]+?\;)', re.UNICODE | re.IGNORECASE)
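Applying those comments, the three expressions might be revised along these lines. This is only a sketch, assuming Python 3 (where str patterns are Unicode-aware by default, so re.UNICODE is dropped). parse_attrs is a hypothetical helper added for illustration, not part of your library, and real-world pages will still find ways to defeat these patterns.

```python
import re

# Raw string literals throughout: no doubled backslashes, and no need
# to escape characters such as " / & ; that are not special in an re.

# An attribute value may be double-quoted, single-quoted, or unquoted.
# Each style gets its own group; exactly one of groups 2-4 will match.
attr_re = re.compile(
    r"""([\da-z]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))""",
    re.IGNORECASE,
)

# Matches <br>, <BR/>, <br /> and <BR CLEAR="ALL">, but not <broken>.
line_break_re = re.compile(r'<br\b[^>]*>', re.IGNORECASE)

# A bare ampersand is one that does NOT start a named (&amp;),
# decimal (&#38;) or hexadecimal (&#x26;) entity reference.
amp_re = re.compile(r'&(?!(?:[a-z][a-z\d]*|#\d+|#x[\da-f]+);)',
                    re.IGNORECASE)


def parse_attrs(tag_text):
    """Return (name, value) pairs regardless of the quoting style used."""
    pairs = []
    for m in attr_re.finditer(tag_text):
        # Pick whichever of the three value groups actually matched.
        value = next(g for g in m.group(2, 3, 4) if g is not None)
        pairs.append((m.group(1), value))
    return pairs
```

For example, parse_attrs('<td class="x" align=\'left\' width=10>') picks up all three attributes, where the original attr_re would only have seen class.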
How would you extract data from a table? For instance, how would you
extract the data entries from the table at this URL:
http://tf.nist.gov/tf-cgi/servers.cgi ? This would be a good example
snippet for your module documentation.
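For comparison, here is roughly what a table-extraction snippet could look like using the standard library's html.parser rather than regular expressions. It is a sketch only: TableRows, table_rows, and the sample markup are made up for illustration and are not taken from the NIST page or from your module.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TR> and <tr> look alike.
        if tag == 'tr':
            self._close_row()   # tolerate a missing </tr>
            self._row = []
        elif tag in ('td', 'th'):
            self._close_cell()  # tolerate a missing </td>
            if self._row is None:
                self._row = []  # tolerate a cell with no enclosing <tr>
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ('td', 'th'):
            self._close_cell()
        elif tag in ('tr', 'table'):
            self._close_row()

    def handle_data(self, data):
        # Only text inside a cell is kept; nested tags like <b> are ignored.
        if self._cell is not None:
            self._cell.append(data)

    def _close_cell(self):
        if self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(' '.join(''.join(self._cell).split()))
            self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None


def table_rows(html):
    """Return every table row in html as a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical markup with mixed case, attributes and unclosed tags.
sample = '''<TABLE>
<tr><th>Name</th><th>Location</th></tr>
<TR><td>server-a.example <td align=center>Boulder, <b>Colorado</b>
<tr><td>server-b.example</td><td>Fort Collins</td></tr>
</TABLE>'''

for row in table_rows(sample):
    print(row)
```

Note that this flattens every table on the page into one list of rows. A real snippet would also need to decide how to treat nested tables and which table on the page is the one of interest.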
Try extracting all of the <a href=...>sldjlsfjd</a> links from
yahoo.com, and see how much of what you expect actually gets matched.
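One way to check the results is to compare them against a tolerant parser. The sketch below uses the standard library's html.parser to collect links. LinkCollector, extract_links, and the sample markup are hypothetical names and data chosen for illustration, and nothing here is fetched from yahoo.com.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, text) pairs
        self._href = None  # href of the link being read, or None
        self._text = []    # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # attrs is a list of (name, value) pairs with lowercased
            # names, whatever case or quoting the page used.
            href = dict(attrs).get('href')
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            text = ' '.join(''.join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def extract_links(html):
    """Return a list of (href, text) pairs found in html."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical markup: uppercase tags, single quotes, an unquoted href
# that is not the first attribute, nested tags, and an anchor with no href.
sample = ("<A HREF='/mail'>Mail</A> "
          "<a class=nav href=news.html>World <b>News</b></a> "
          '<a name="top">no href</a>')

for href, text in extract_links(sample):
    print(href, text)
```

Any link that shows up here but not in the regular-expression results points at a markup variation the expressions do not yet cover.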
Good luck!
-- Paul