python fast HTML data extraction library

Filip pinkeen at gmail.com
Sun Jul 26 19:44:54 EDT 2009


On Jul 23, 3:53 am, Paul McGuire <pt... at austin.rr.com> wrote:
> # You should use raw string literals throughout, as in:
> # blah_re = re.compile(r'sljdflsflds')
> # (note the leading r before the string literal).  raw string literals
> # really help keep your re expressions clean, so that you don't ever
> # have to double up any '\' characters.

Thanks, I didn't know about that, updated my code.
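
For anyone else reading along, here is a minimal sketch of the difference
Paul describes. The pattern itself is made up for illustration and is not
taken from the library.

```python
import re

# Without a raw string, every backslash meant for the regex engine
# has to be doubled so that Python's string parser leaves it alone.
escaped = re.compile('\\d+\\s*=\\s*\\w+')

# With a raw string, the literal reads exactly as the regex engine sees it.
raw = re.compile(r'\d+\s*=\s*\w+')

# Both literals produce the same pattern text.
same_pattern = escaped.pattern == raw.pattern

match = raw.match('42 = answer')
```

Both objects compile the same expression; the raw form is simply easier
to read and to edit.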

> # Attributes might be enclosed in single quotes, or not enclosed in
> # any quotes at all.
> attr_re = re.compile('([\da-z]+?)\s*=\s*\"(.*?)\"', re.DOTALL |
> re.UNICODE | re.IGNORECASE)

Of course, you mean that an attribute's *value* can be enclosed in
single or double quotes?
To be honest, I haven't seen the single-quote variant in HTML lately,
but I checked and it is in the specs, and it can even be quite useful
(you learn something every day).
Thank you for pointing that out; I updated the code accordingly
(and just realized that the condition-check REs need an update too :/).

As for unquoted attribute values, I am not so sure I need this. It
would significantly obfuscate my REs, and the practice is rather
deprecated, considered unsafe, and I've seen it only on very old
websites.
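
To make the trade-off concrete, here is a rough sketch of an attribute
pattern that accepts double-quoted, single-quoted and unquoted values.
This is not the library's actual RE; the names ATTR_RE and parse_attrs
are hypothetical, and the unquoted branch is the part that adds most of
the clutter.

```python
import re

# Hypothetical pattern: name = "value" | 'value' | bare-value
ATTR_RE = re.compile(
    r"""([\da-z]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))""",
    re.DOTALL | re.IGNORECASE,
)

def parse_attrs(tag_text):
    """Return a dict of lower-cased attribute names to their values."""
    attrs = {}
    for m in ATTR_RE.finditer(tag_text):
        name = m.group(1).lower()
        # Exactly one of the three value groups takes part in a match;
        # compare against None so that empty values ("") are kept.
        value = next(g for g in m.groups()[1:] if g is not None)
        attrs[name] = value
    return attrs

sample = parse_attrs('<a href="x.html" class=\'big one\' CLEAR=ALL>')
```

Dropping the last alternative gives a quoted-only variant, which is
closer to what I described above.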

> How would you extract data from a table?  For instance, how would you
> extract the data entries from the table at this URL:
> http://tf.nist.gov/tf-cgi/servers.cgi?  This would be a good example
> snippet for your module documentation.

This really seems like a nice example. I'll definitely explain it in my
docs (examples are certainly needed there ;)).

> Try extracting all of the <a href=...>sldjlsfjd</a> links from
> yahoo.com, and see how much of what you expect actually gets matched.

The library has been used in my humble production environment,
processing upwards of a few hundred thousand pages and producing about
10000 SQL records, so it does work quite well for a simple task like
extracting all links. However, I can't really say that the task
introduced enough diversity (there were only 9 different page
templates) to call the library 'tested'...

On Jul 26, 5:51 pm, John Machin <sjmac... at lexicon.net> wrote:
> On Jul 23, 11:53 am, Paul McGuire <pt... at austin.rr.com> wrote:
>
> > On Jul 22, 5:43 pm, Filip <pink... at gmail.com> wrote:
>
> > # Needs re.IGNORECASE, and can have tag attributes, such as
> > # <BR CLEAR="ALL">
> > line_break_re = re.compile('<br\/?>', re.UNICODE)
>
> Just in case somebody actually uses valid XHTML :-) it might be a good
> idea to allow for <br />
>
> > # what about HTML entities defined using hex syntax, such as &#xxxx;
> > amp_re = re.compile('\&(?![a-z]+?\;)', re.UNICODE | re.IGNORECASE)
>
> What about the decimal syntax ones? E.g. not only &nbsp; and &#xa0;
> but also &#160;
>
> Also, entity names can contain digits e.g. &sup1; &frac34;

Thanks for pointing this out, I fixed that, although it has very
little impact on how the library performs its main task (I'd like to
see some comments on that ;)).
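
For reference, a sketch of how the two patterns discussed above could
look once the comments are taken into account. This is only an
illustration under my own assumptions, not the exact code in the
library; LINE_BREAK_RE, BARE_AMP_RE and escape_bare_ampersands are
hypothetical names.

```python
import re

# Matches <br>, <BR/>, <br /> and <BR CLEAR="ALL">, but not <brown>.
LINE_BREAK_RE = re.compile(r'<br\b[^>]*>', re.IGNORECASE)

# Matches an ampersand that does NOT start a character reference:
# named (&sup1;), decimal (&#160;) or hexadecimal (&#xa0;).
BARE_AMP_RE = re.compile(
    r'&(?!(?:[a-z][a-z\d]*|#\d+|#x[\da-f]+);)',
    re.IGNORECASE,
)

def escape_bare_ampersands(text):
    """Replace stray ampersands with &amp; and leave references alone."""
    return BARE_AMP_RE.sub('&amp;', text)

breaks = LINE_BREAK_RE.findall('a<br>b<BR/>c<br />d<BR CLEAR="ALL">e<brown>')
escaped = escape_bare_ampersands('a & b &nbsp; &#160; &#xa0; &sup1; AT&T')
```

The negative lookahead keeps the substitution from touching anything
that already looks like a reference, including names that contain
digits.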


