handle <BR> tags

John Roth newsgroups at jhrothjr.com
Fri Aug 1 00:06:44 CEST 2003


"Behrang Dadsetan" <ben at dadsetan.com> wrote in message
news:bgc3ad$o57$1 at online.de...
> Luca Calderano wrote:
> > Hi guys...
> >
> > I've done a subclass of SGMLParser
> > to handle the contents of a web page,
> > but I'm not able to handle the <BR> tag
> >
> > can someone help me???
> >
> >       S.G.A S.p.A.
> > Nucleo Sistemi Informativi
> >      Luca Calderano
> >
> >
> I do not know SGMLParser.. but HTML is not SGML nor any subset. It is
> some ill language which one even rarely finds "pure" (written in the way
> the spec says it MUST be)
>
> I believe SGML does not like non-closing tags. BR is one of the many
> non-closing tags in HTML (also look at IMG or HR)
>
> Depending on what you are doing you should maybe use XHTML as an input
> if you can (XML well-formed HTML, XML being a subset of SGML) or you
> should probably look for a completely different parser "technology".
> Maybe HTMLParser will help you a little more.
>
> Do not forget, random HTML downloaded from the Internet is often broken.
> You might rather want to use tidylib (corrects broken HTML code into
> XHTML) and an XHTML/SGML parser or a DOM.
>
> Hope it helps even though the effort I took to check my statements was
> small :)

You're basically correct, though. You can't parse HTML with either
an SGML or an XML parser. You also can't parse it reliably if it has
Javascript embedded that generates HTML.
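
For the narrower question of reacting to <BR>, here is a minimal sketch
along the lines of the HTMLParser suggestion above. It assumes Python 3,
where the class lives in html.parser; the names TextWithBreaks and
extract_text are made up for illustration and are not from the original
poster's code. It only turns each <BR> into a newline while collecting
text, and makes no attempt to cope with script-generated markup.

```python
from html.parser import HTMLParser


class TextWithBreaks(HTMLParser):
    """Collect page text, turning each <BR> into a newline (illustrative)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <BR> and <br> both match.
        # <br/> also ends up here via the default handle_startendtag.
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        # Plain text between tags.
        self.parts.append(data)

    def text(self):
        return "".join(self.parts)


def extract_text(html):
    """Hypothetical helper: feed a whole document and return its text."""
    parser = TextWithBreaks()
    parser.feed(html)
    parser.close()  # flush any text still buffered at end of input
    return parser.text()
```

Since <BR> has no closing tag, all the work happens in handle_starttag;
there is no matching end-tag handler to write.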

John Roth
>
> Regards,
> Ben.
>





