handle <BR> tags
newsgroups at jhrothjr.com
Fri Aug 1 00:06:44 CEST 2003
"Behrang Dadsetan" <ben at dadsetan.com> wrote in message
news:bgc3ad$o57$1 at online.de...
> Luca Calderano wrote:
> > Hi guys...
> > I've written a subclass of SGMLParser
> > to handle the contents of a web page,
> > but I'm not able to handle the <BR> tag.
> > Can someone help me?
> > S.G.A S.p.A.
> > Nucleo Sistemi Informativi
> > Luca Calderano
> I do not know SGMLParser, but HTML is neither SGML nor any subset of it.
> It is an ill-defined language that one rarely finds "pure" (written the
> way the spec says it MUST be).
> I believe SGML does not like non-closing tags. BR is one of the many
> non-closing tags in HTML (also look at IMG or HR).
> Depending on what you are doing, you should maybe use XHTML as input
> if you can (XHTML is well-formed XML, and XML is a subset of SGML), or
> you should probably look for a completely different parser "technology".
> Maybe HTMLParser will help you a little more.
> Do not forget that random HTML downloaded from the Internet is often
> broken. You might rather want to use tidylib (which corrects broken HTML
> into XHTML) and an XHTML/SGML parser or a DOM.
> Hope it helps even though the effort I took to check my statements was
> small :)
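
For what it's worth, here is a minimal sketch of the HTMLParser route
suggested above. It assumes Python 3's html.parser module rather than
SGMLParser (sgmllib documented a do_br(self, attrs) hook for unclosed tags,
but that module was removed in Python 3). The names LineBreakParser and
html_to_text are made up for illustration, and turning each <BR> into a
newline is only one guess at what "handling" the tag should mean.

```python
from html.parser import HTMLParser


class LineBreakParser(HTMLParser):
    """Collect the text of a page, turning each <br> into a newline."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <BR> and <br> both match.
        # The default handle_startendtag forwards <br/> here as well.
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        # Text between tags; may arrive in several chunks.
        self.parts.append(data)

    def text(self):
        return "".join(self.parts)


def html_to_text(markup):
    """Hypothetical helper: feed markup and return the collected text."""
    parser = LineBreakParser()
    parser.feed(markup)
    parser.close()  # flush any text still buffered
    return parser.text()
```

Calling html_to_text("one<BR>two<br/>three") should give the three words on
separate lines. Because <br> has no closing tag, only handle_starttag needs
overriding; nothing waits for a matching </br>.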
You're basically correct, though. You can't parse HTML with either
an SGML or an XML parser. You also can't parse it reliably if it has
More information about the Python-list