python fast HTML data extraction library
John Machin
sjmachin at lexicon.net
Sun Jul 26 11:51:39 EDT 2009
On Jul 23, 11:53 am, Paul McGuire <pt... at austin.rr.com> wrote:
> On Jul 22, 5:43 pm, Filip <pink... at gmail.com> wrote:
>
> # Needs re.IGNORECASE, and can have tag attributes, such as <BR CLEAR="ALL">
> line_break_re = re.compile('<br\/?>', re.UNICODE)
Just in case somebody actually uses valid XHTML :-) it might be a good
idea to allow for <br />
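
A rough sketch of a pattern that covers the cases raised so far: any
letter case, optional attributes, and an optional space before the
slash. The name line_break_re comes from the quoted code; the pattern
itself is only an assumption about what would be wanted, and it will
still trip over an attribute value that contains a literal '>'.

```python
import re

# Assumed pattern: matches <br>, <BR>, <br/>, <br /> and <BR CLEAR="ALL">.
# \b stops it from matching longer tag names that merely start with "br".
line_break_re = re.compile(r'<br\b[^>]*>', re.IGNORECASE)

samples = ['<br>', '<BR>', '<br/>', '<br />', '<BR CLEAR="ALL">']
for s in samples:
    assert line_break_re.fullmatch(s) is not None

# A tag that only starts with "br" is left alone.
assert line_break_re.fullmatch('<break>') is None
```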
> # what about HTML entities defined using hex syntax, such as &#xxxx;
> amp_re = re.compile('\&(?![a-z]+?\;)', re.UNICODE | re.IGNORECASE)
What about the decimal syntax ones? E.g. not only &nbsp; and &#xa0;
but also &#160;
Also, entity names can contain digits e.g. &sup1; &frac34;
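
Putting those points together, something along these lines might do as
a starting point. It treats an '&' as "bare" unless it begins a named
reference (letters, then letters or digits), a decimical-free hex
reference, or a decimal reference, each ending in ';'. The name amp_re
is from the quoted code; the pattern and the helper
escape_bare_ampersands are hypothetical, and the pattern does not check
that a name is actually a defined entity.

```python
import re

# Assumed pattern: an '&' NOT followed by one of
#   name;      e.g. &amp;  &sup1;  &frac34;
#   #digits;   e.g. &#160;
#   #xhex;     e.g. &#xA0;
amp_re = re.compile(
    r'&(?!(?:[a-z][a-z0-9]*|#[0-9]+|#x[0-9a-f]+);)',
    re.IGNORECASE,
)

def escape_bare_ampersands(text):
    """Replace each bare '&' with '&amp;' (hypothetical helper)."""
    return amp_re.sub('&amp;', text)

# Existing references survive; a lone '&' is escaped.
assert escape_bare_ampersands('a & b') == 'a &amp; b'
assert escape_bare_ampersands('&sup1; &#160; &#xA0;') == '&sup1; &#160; &#xA0;'
```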