[XML-SIG] well-formed xml

Mike Brown mike@skew.org
Thu, 26 Sep 2002 23:33:12 -0600 (MDT)


Mark McEahern wrote:
> I'm obviously missing something because this seemingly innocent chunk of
> xhtml:
> 
>   from xml.dom import minidom
> 
>   s = "<a href='http://google.com/search?hl=en&q=foobar'>search</a>"

FAQ. It's not well-formed XML. In an XML document, a bare "&" always denotes
the beginning of an entity or character reference, unless it is in a CDATA
section, comment, or processing instruction. Write it as "&amp;" instead.
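
To make that concrete, here is a minimal sketch using the same minidom module
from the quoted snippet. The helper name try_parse and the variable names are
just illustrative, not anything from the original code. It parses the string
as posted, then again with the ampersand escaped:

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# The string as posted: the bare "&" makes it not well-formed.
bad = "<a href='http://google.com/search?hl=en&q=foobar'>search</a>"
# The same string with the ampersand escaped.
good = "<a href='http://google.com/search?hl=en&amp;q=foobar'>search</a>"

def try_parse(text):
    """Return the href attribute value, or None if text is not well-formed."""
    try:
        doc = minidom.parseString(text)
    except ExpatError:
        # minidom's default parser (expat) rejects input that is not well-formed.
        return None
    return doc.documentElement.getAttribute("href")

bad_href = try_parse(bad)
good_href = try_parse(good)
print(bad_href)
print(good_href)
```

Note that the application sees a plain "&" in the attribute value of the
second string; the "&amp;" exists only in the serialized document.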

HTML has a similar rule, but you're allowed to get away with bare ampersands
for two reasons. First, HTML has a fixed set of entities, so a reference to an
unknown one is probably not a reference at all, and "&amp;" can be assumed.
Second, HTML browsers are not required to report such things as errors
(leniency makes for easier authoring and more usable documents), whereas XML
parsers are required to do so (stricter rules force more predictable
documents, which are easier to process).
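
As a rough illustration of that leniency, the sketch below feeds both forms of
the link to Python's standard html.parser module. This is only a stand-in for
a browser, and the HrefCollector class is a hypothetical helper written for
this example. The point is that a lenient HTML parser accepts the bare
ampersand without complaint and hands the application an href either way:

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect the href attribute value of every start tag seen."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with references already resolved.
        for name, value in attrs:
            if name == "href":
                self.hrefs.append(value)

collector = HrefCollector()
# Bare ampersand: tolerated, because "&q" is not a known HTML entity.
collector.feed("<a href='http://google.com/search?hl=en&q=foobar'>search</a>")
# Escaped ampersand: the more correct form.
collector.feed("<a href='http://google.com/search?hl=en&amp;q=foobar'>search</a>")
collector.close()

for href in collector.hrefs:
    print(href)
```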

Please be aware that things like
 - 'raw' characters vs numeric character references vs entity references, 
 - whether or not character data is in CDATA sections, 
 - the character-to-byte encoding of the document,
 - attribute order,
 - the type of quotes around attribute values,
 - whitespace between attributes in an element's start tag,
 - extraneous whitespace in attribute values (for non-CDATA attribute types), and
 - whether an empty element is written like <foo/> or <foo></foo>,
are all considered lexical fluff: things that have no bearing on the semantic,
logical information carried in the document. It is the parser's
job to see past all that stuff and just tell the application what the 
important bits are: the hierarchy of elements, attributes, character data, and 
processing instructions. HTML processors do pretty much the same thing. Thus 
it is more correct to use "&amp;" in HTML where an ampersand is *meant*, even 
though you can often get away with a bare one.
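
A small sketch of the "lexical fluff" point, again with minidom. The sample
documents and the summarize helper are made up for illustration. Each variant
differs in quoting, attribute order, whitespace in the start tag, the kind of
reference used, or CDATA usage, yet the parser reports the same element name,
attributes, and character data for all of them:

```python
from xml.dom import minidom

variants = [
    # entity references, single quotes
    "<a href='x?a=1&amp;b=2' id='k'>R&amp;D</a>",
    # numeric character references, double quotes, reordered attributes, extra spaces
    '<a   id="k"   href="x?a=1&#38;b=2">R&#x26;D</a>',
    # character data inside a CDATA section
    "<a href='x?a=1&amp;b=2' id='k'><![CDATA[R&D]]></a>",
]

def summarize(text):
    """Reduce a document to (tag name, sorted attributes, character data)."""
    root = minidom.parseString(text).documentElement
    attrs = sorted(root.attributes.items())
    text_types = (minidom.Node.TEXT_NODE, minidom.Node.CDATA_SECTION_NODE)
    chars = "".join(n.data for n in root.childNodes if n.nodeType in text_types)
    return root.tagName, attrs, chars

summaries = [summarize(v) for v in variants]
all_same = all(s == summaries[0] for s in summaries)
# The two spellings of an empty element are equivalent as well.
empty_same = summarize("<foo/>") == summarize("<foo></foo>")
print(summaries[0], all_same, empty_same)
```

The DOM does expose CDATA sections as a distinct node type, which is why the
helper has to look at both kinds of node; the character data itself is the
same.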

   - Mike
____________________________________________________________________________
  mike j. brown                   |  xml/xslt: http://skew.org/xml/
  denver/boulder, colorado, usa   |  resume: http://skew.org/~mike/resume/