htmllib: CR in CDATA

Mark Nottingham mnot at pobox.com
Tue Jun 22 12:29:23 CEST 1999


OK, I'm starting to have a really nice conversation with myself now ;-)

htmllib DOESN'T change the newline to a single space - it leaves it in.
For reference, here is what the HTML 4.0 spec says about CDATA:

CDATA is a sequence of characters from the document character set and may
include character entities. User agents should interpret attribute values as
follows:

    * Replace character entities with characters,
    * Ignore line feeds,
    * Replace each carriage return or tab with a single space.

User agents may ignore leading and trailing white space in CDATA
attribute values (e.g., "   myval   " may be interpreted as "myval").
Authors should not declare attribute values with leading or trailing white
space.
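
To make the rules quoted above concrete, here is a rough sketch of them as a
standalone function. This is not htmllib's code: the name normalize_cdata and
its strip flag are made up for illustration, it assumes a current Python 3
(html.unescape), and it assumes the rules are applied in the order the spec
lists them.

```python
import html

def normalize_cdata(value, strip=False):
    """Apply the three quoted CDATA rules, in the order they are listed."""
    value = html.unescape(value)        # replace character entities
    value = value.replace("\n", "")     # ignore line feeds
    value = value.replace("\r", " ")    # carriage return -> single space
    value = value.replace("\t", " ")    # tab -> single space
    if strip:
        # Optional: the spec only says user agents *may* do this.
        value = value.strip()
    return value

print(repr(normalize_cdata("image.\njpg")))
print(repr(normalize_cdata("a\rb\tc")))
print(repr(normalize_cdata("   myval   ", strip=True)))
```

Under those assumptions, a value that wraps across a line with a bare line
feed comes back joined up, with no space where the break was.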

If I'm wrong this time, someone please save us both the trouble and shoot
me.

Example:

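#!/opt/local/bin/python
import formatter, htmllib

# A document whose HREF attribute value wraps across a line.
body = '''\
<HTML>
<A HREF="image.
jpg">
</HTML>
'''
parser = htmllib.HTMLParser(formatter.NullFormatter())
parser.feed(body)
parser.close()
# The newline is still inside the HREF value in the list printed here.
print parser.anchorlist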
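
For anyone trying this on Python 3, where htmllib no longer exists, the
following is a minimal sketch of the same test using html.parser, with the
line-feed and CR/tab rules applied by hand to each HREF. The class name
AnchorCollector is hypothetical, and this is only one way the cleanup could
be done, not a patch to any library.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect HREF values from <A> tags, cleaning each one as CDATA."""

    def __init__(self):
        super().__init__()
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None:
                # Ignore line feeds; turn CR and tab into a single space.
                value = value.replace("\n", "")
                value = value.replace("\r", " ").replace("\t", " ")
                self.anchorlist.append(value)

body = '<HTML>\n<A HREF="image.\njpg">\n</HTML>\n'
parser = AnchorCollector()
parser.feed(body)
parser.close()
print(parser.anchorlist)
```

Doing the cleanup in handle_starttag keeps it in one place, but it only
covers HREF here; per the spec it would apply to every CDATA attribute.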



----- Original Message -----
From: Mark Nottingham <mnot at pobox.com>
To: Python <python-list at cwi.nl>
Sent: Tuesday, June 22, 1999 7:31
Subject: Re: htmllib: CR in CDATA


> Whoops, never mind, I misread the spec -- carriage returns are turned into
> spaces (which is what htmllib does) - *line feeds* should be ignored...
>
> --
> "Get me the phone book."
>   "Which one?"
> "Doesn't matter."
>
>
> ----- Original Message -----
> From: Mark Nottingham <mnot at pobox.com>
> To: Python <python-list at cwi.nl>
> Sent: Tuesday, June 22, 1999 12:55
> Subject: htmllib: CR in CDATA
>
>
> > It appears that htmllib doesn't ignore returns in CDATA fields, as HTML
> > 4.0 says it should:
> > http://www.w3.org/TR/REC-html40/types.html#type-cdata
> > http://www.w3.org/TR/REC-html40/sgml/dtd.html
> >
> > As a result, htmllib improperly parses any CDATA element that wraps
> > across a line; this affects elements like
> >
> > <A href="foo.
> > gif">
> >
> > I'm happy to work up a patch, but I thought I'd ask around first. It may
> > be a bit involved to fix it properly; every CDATA should be handled this
> > way, which practically means almost every tag attribute.
> >
> > Regards,
> >
> >
> > Mark Nottingham, Melbourne Australia
> > mnot at pobox.com  http://www.mnot.net/
> >
> >
> >
>
>




