HTML File Parsing

Felipe De Bene ttboy86 at gmail.com
Thu Oct 30 12:19:07 EDT 2008


On Oct 28, 6:18 pm, Stefan Behnel <stefan... at behnel.de> wrote:
> Felipe De Bene wrote:
> > I'm having problems parsing an HTML file with the following syntax:
>
> > <TABLE cellspacing=0 cellpadding=0 ALIGN=CENTER BORDER=1 width='100%'>
> >     <TH BGCOLOR='#c0c0c0' Width='3%'>User ID</TH>
> >     <TH Width='10%' BGCOLOR='#c0c0c0'>Name</TH><TH width='7%'
> > BGCOLOR='#c0c0c0'>Date</TH>
> > and so on....
>
> > Whenever I feed the parser such a file, I get the error:
>
> > HTMLParser.HTMLParseError: bad end tag: "</TH BGCOLOR='#c0c0c0'>", at
> > line 515, column 45
>
> Your HTML page is not HTML, i.e. it is broken. Python's HTMLParser is not made
> for parsing broken HTML. However, you can use the parser of lxml.html to fix up
> your HTML for you.
>
> http://codespeak.net/lxml/
>
> Stefan

Actually, I fetch the HTML from an application that I thought was
supposed to behave like this, and as I told you, the program is ready
to be shipped, so rewriting an entire class that has public methods
would be a real pain. I really had to find a way to work this out
using Python's own parser instead of external libraries. But thanks
anyway for the pointer. I might start working on a similar project
next, and this library may be a good and less painful path. Thanks :D
Felipe.
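
For reference, here is a minimal sketch of one stdlib-only workaround
for the error quoted above. It assumes the only breakage is end tags
that carry stray attributes, as in the </TH BGCOLOR='#c0c0c0'> from the
traceback, and strips those attributes with a regular expression before
the markup reaches the parser. The names strip_end_tag_attributes and
HeaderCollector are hypothetical and not taken from the original
program. The sketch uses Python 3's html.parser rather than the
Python 2 HTMLParser module discussed in the thread.

```python
import re
from html.parser import HTMLParser

# Matches an end tag that carries stray attributes,
# e.g. </TH BGCOLOR='#c0c0c0'>. A plain </TH> does not match.
_BAD_END_TAG = re.compile(r"</\s*([A-Za-z][A-Za-z0-9]*)\s+[^>]*>")


def strip_end_tag_attributes(markup):
    """Rewrite </TAG attr=...> as </TAG>; leave everything else untouched."""
    return _BAD_END_TAG.sub(r"</\1>", markup)


class HeaderCollector(HTMLParser):
    """Hypothetical parser that collects the text of every <TH> cell."""

    def __init__(self):
        super().__init__()
        self.in_th = False
        self.headers = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag == "th":
            self.in_th = True
            self.headers.append("")

    def handle_endtag(self, tag):
        if tag == "th":
            self.in_th = False

    def handle_data(self, data):
        if self.in_th:
            self.headers[-1] += data


# Sample input modelled on the snippet and error message in the thread.
broken = (
    "<TH BGCOLOR='#c0c0c0' Width='3%'>User ID</TH BGCOLOR='#c0c0c0'>"
    "<TH Width='10%'>Name</TH>"
)
cleaned = strip_end_tag_attributes(broken)

collector = HeaderCollector()
collector.feed(cleaned)
collector.close()
print(collector.headers)
```

The regular expression is deliberately naive: it would also rewrite
matching text inside comments or scripts, and it assumes no attribute
value contains a ">" character. Because the cleanup happens before
feed() is called, the public methods of an existing parser class would
not need to change.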



