[Expat-discuss] Referring to XML Schema and Expat

Karl Waclawek karl@waclawek.net
Tue, 24 Sep 2002 23:12:03 -0400


> On 24 Sep, Karl Waclawek wrote:
> >
> >> But besides the deficiencies of my code, I should mention that I
> >> think it's not possible to write a 100 percent compliant DTD
> >> validator on top of the current expat code. I see two major
> >> problems: it is not possible at the moment to check the
> >> validity constraint Proper Declaration/PE Nesting, and it is
> >> not possible to validate XML documents that have standalone="yes"
> >> and an external subset (the problem is that there is no reliable way
> >> to know whether expat has done additional normalization on attributes
> >> of types other than CDATA; see XML rec 3.3.3). If somebody is
> >> interested, I may elaborate on these notes a bit more.
> >
> > Please do - on both points!
>
> I should have known it... ;-) OK, since it's you who asks.

Before commenting below: Excellent read, thanks!

> First, the validity constraints Proper Declaration/PE Nesting and
> Proper Group/PE Nesting. I missed the second in my note, but both
> are similar problems.
>
> Take a look at this example document
>
> <!DOCTYPE e1 SYSTEM "e1.dtd">
> <e1/>
>
> with this external subset e1.dtd
>
> <!ENTITY % pe1 "EMPTY>">
> <!ELEMENT e1 %pe1;
>
> Expat is perfectly happy with this. That's OK, since expat is a
> well-formedness parser and doesn't have to care about validity
> constraints. But this example doesn't fulfill the validity constraint
> Proper Declaration/PE Nesting.

If Expat reported entity boundaries for internal entity references,
would you then be able to detect this error?
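
To reproduce the behaviour described above, here is a minimal sketch
using Python's xml.parsers.expat binding instead of the C API. The helper
name expat_accepts and the in-memory stand-ins for e1.dtd are assumptions
made for illustration; only the pyexpat calls themselves are real.

```python
import xml.parsers.expat as expat

DOC = b'<!DOCTYPE e1 SYSTEM "e1.dtd">\n<e1/>\n'

# External subset from the example: the declaration starts in the DTD
# text, but its closing '>' comes from the replacement text of pe1.
BAD_DTD = b'<!ENTITY % pe1 "EMPTY>">\n<!ELEMENT e1 %pe1;\n'

# Variant in which pe1 is declared but never referenced.
GOOD_DTD = b'<!ENTITY % pe1 "EMPTY>">\n<!ELEMENT e1 EMPTY>\n'


def expat_accepts(doc, dtd):
    """Return True if expat parses doc, with dtd as its external subset,
    without reporting an error (hypothetical helper)."""
    parser = expat.ParserCreate()
    # Ask expat to read the external subset and expand parameter entities.
    parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)

    def external_entity_ref(context, base, system_id, public_id):
        # Feed the in-memory DTD instead of resolving system_id on disk.
        sub = parser.ExternalEntityParserCreate(context)
        sub.Parse(dtd, True)
        return 1

    parser.ExternalEntityRefHandler = external_entity_ref
    try:
        parser.Parse(doc, True)
    except expat.ExpatError:
        return False
    return True


if __name__ == "__main__":
    print("PE nesting violated:", expat_accepts(DOC, BAD_DTD))
    print("properly nested:    ", expat_accepts(DOC, GOOD_DTD))
```

If the description above holds for the Expat build in use, both calls
report acceptance, so a layer on top sees no difference between the two
external subsets.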

> The same is true for this example, which violates the validity
> constraint Proper Group/PE Nesting:
>
> <!DOCTYPE e1 SYSTEM "e2.dtd">
> <e1><e2/></e1>
>
> with e2.dtd
>
> <!ENTITY % p1 "(e2">
> <!ENTITY % p2 "|e3)">
> <!ELEMENT e1 %p1;%p2;>
> <!ELEMENT e2 EMPTY>
> <!ELEMENT e3 EMPTY>
>
> Expat accepts this, which is of course OK. But the document is not
> valid. (The OASIS XML test suite includes a few tests for both
> constraints.)
>
> It does not help much to analyse the replacement text of a parameter
> entity inside the XML_EntityDeclHandler in order to find such
> 'ill-formed' parameter entities (besides, this would be far from
> simple to do). This is because these parameter entities are not really
> 'ill-formed': they follow the production rules for a parameter entity
> and are therefore 'legal' inside the DTD. It is only a validity
> error if they are used as in the examples above. For example, if
> e1.dtd in the first example is
>
> <!ENTITY % pe1 "EMPTY>">
> <!ELEMENT e1 EMPTY>
>
> then the document is valid. So even if I analyzed the
> replacement text in some clever way in my validation layer, this would
> not help, because there is currently no way to get notified when
> a parameter entity is used (and in which markup context).
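
A small sketch of what the declaration handlers report for the e2.dtd
example may make this concrete. It uses Python's xml.parsers.expat
binding; collect_dtd_events and the in-memory DTD are hypothetical names
for illustration. The entity declarations are reported, and the element
declaration arrives as one finished content model, but nothing marks
where %p1; or %p2; was referenced.

```python
import xml.parsers.expat as expat

DOC = b'<!DOCTYPE e1 SYSTEM "e2.dtd">\n<e1><e2/></e1>\n'
E2_DTD = (b'<!ENTITY % p1 "(e2">\n'
          b'<!ENTITY % p2 "|e3)">\n'
          b'<!ELEMENT e1 %p1;%p2;>\n'
          b'<!ELEMENT e2 EMPTY>\n'
          b'<!ELEMENT e3 EMPTY>\n')


def collect_dtd_events(doc, dtd):
    """Parse doc with dtd as its external subset and record what the
    declaration handlers report (hypothetical helper)."""
    events = []
    parser = expat.ParserCreate()
    parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)

    def entity_decl(name, is_pe, value, base, system_id, public_id, notation):
        events.append(("entity", name, bool(is_pe), value))

    def element_decl(name, model):
        # model is a nested tuple: (type, quantifier, name, children)
        events.append(("element", name, model))

    def external_entity_ref(context, base, system_id, public_id):
        sub = parser.ExternalEntityParserCreate(context)
        # Set the handlers on the sub-parser explicitly as well.
        sub.EntityDeclHandler = entity_decl
        sub.ElementDeclHandler = element_decl
        sub.Parse(dtd, True)
        return 1

    parser.EntityDeclHandler = entity_decl
    parser.ElementDeclHandler = element_decl
    parser.ExternalEntityRefHandler = external_entity_ref
    parser.Parse(doc, True)
    return events


if __name__ == "__main__":
    for event in collect_dtd_events(DOC, E2_DTD):
        print(event)
```

The model reported for e1 is a plain choice group over e2 and e3, the
same as for a DTD that spells out (e2|e3) directly.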

OK, so this is where the InternalEntityRefHandler comes in,
as a solution for both cases.
As already discussed, this is on our roadmap, but only once
the new API is in place which allows reporting of internal
entity boundaries (PE or GE). Do you agree that we have
a solution in our sights? At least mid-term this might be
implemented, all depending on time available.
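
To illustrate why boundary reporting would be enough, here is a sketch of
the check a validation layer could run. Everything in it is hypothetical:
the event names (pe_start, pe_end, decl_start, decl_end, group_start,
group_end) are invented for this example and do not correspond to any
existing Expat callback. The idea is only that a construct which ends
while a different construct is innermost must have started in another
entity.

```python
def find_nesting_errors(events):
    """Check a stream of (kind, name) events for improper nesting.

    The event vocabulary is invented for this sketch; Expat does not
    report such events today.
    """
    stack = []    # currently open constructs, innermost last
    errors = []
    for kind, name in events:
        construct, _, edge = kind.partition("_")
        if edge == "start":
            stack.append((construct, name))
        elif stack and stack[-1][0] == construct:
            stack.pop()
        else:
            # Something ends while a different construct is innermost:
            # its start and its end lie in different entities.
            errors.append((kind, name, stack[-1] if stack else None))
            # Drop the nearest matching opener so checking can continue.
            for i in range(len(stack) - 1, -1, -1):
                if stack[i][0] == construct:
                    del stack[i]
                    break
    return errors


# <!ELEMENT e1 %pe1;   with pe1 = "EMPTY>"
DECL_CASE = [("decl_start", "e1"), ("pe_start", "pe1"),
             ("decl_end", "e1"), ("pe_end", "pe1")]

# <!ELEMENT e1 %p1;%p2;>   with p1 = "(e2" and p2 = "|e3)"
GROUP_CASE = [("decl_start", "e1"), ("pe_start", "p1"),
              ("group_start", None), ("pe_end", "p1"),
              ("pe_start", "p2"), ("group_end", None),
              ("pe_end", "p2"), ("decl_end", "e1")]

if __name__ == "__main__":
    print(find_nesting_errors(DECL_CASE))
    print(find_nesting_errors(GROUP_CASE))
```

A single stack covers both constraints because declarations, groups and
entity replacement texts all have to nest inside one another.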

> Second, the problem with standalone documents. If a document has a
> Standalone Document Declaration, this does not necessarily mean that
> it doesn't have any external entities. It means, in the words of the
> recommendation:
>
>   "In a standalone document declaration, the value "yes" indicates
>    that there are no external markup declarations which affect the
>    information passed from the XML processor to the application."
>
> External markup declarations could affect the information passed from
> the XML processor to the application through, for example, attribute
> defaults (if the defaulted attribute is omitted in the document) or
> entity declarations. Most of this can be handled in some way. But I
> think it's not possible to detect one particular problem with
> attribute value normalization in a reliable way.
>
> Please take a look at this example (it follows the example in the XML
> recommendation, section 3.3.3)
>
> <?xml version="1.0" standalone="yes"?>
> <!DOCTYPE e1 SYSTEM "e3.dtd" [
> <!ENTITY d "&#xD;">
> <!ENTITY a "&#xA;">
> <!ENTITY da "&#xD;&#xA;">
> ]>
> <e1 a="&d;&d;A&a;&a;B&da;"/>
>
> with e3.dtd
>
> <!ELEMENT e1 EMPTY>
> <!ATTLIST e1
>           a NMTOKENS #IMPLIED>
>
> Since a validating parser must always read all external entities,
> expat knows that the attribute a is of type NMTOKENS. If expat knows
> the type of an attribute, it does the additional attribute value
> normalization described in 3.3.3. Therefore, the element start handler
> will see "A B" as the value of the attribute a. The problem is that the
> information about the attribute type in the external entity has
> affected the information passed to the application. If a had the type
> CDATA, the attribute value would have been "  A  B  " (each #xD or #xA
> becomes a space, and none are removed). Therefore, the
> standalone="yes" claim of this document is false, and the document is
> not valid.
>
> If the document reads like this:
>
> <?xml version="1.0" standalone="yes"?>
> <!DOCTYPE e1 SYSTEM "e3.dtd" [
> <!ENTITY d "&#xD;">
> <!ENTITY a "&#xA;">
> <!ENTITY da "&#xD;&#xA;">
> ]>
> <e1 a="A B"/>
>
> (with the same e3.dtd) the document would be valid.
>
> And there is no way to know that expat has done additional
> normalization according to the attribute type. Therefore, the two
> cases are indistinguishable at the expat handler level.
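
The two documents can be compared with a short sketch using Python's
xml.parsers.expat binding. The helper reported_value, the in-memory copy
of e3.dtd and the CDATA variant of it are assumptions for illustration.
The sketch prints what the start element handler receives for attribute
a in each case.

```python
import xml.parsers.expat as expat

PROLOG = (b'<?xml version="1.0" standalone="yes"?>\n'
          b'<!DOCTYPE e1 SYSTEM "e3.dtd" [\n'
          b'<!ENTITY d "&#xD;">\n'
          b'<!ENTITY a "&#xA;">\n'
          b'<!ENTITY da "&#xD;&#xA;">\n'
          b']>\n')
DOC_ENTITIES = PROLOG + b'<e1 a="&d;&d;A&a;&a;B&da;"/>\n'
DOC_PLAIN = PROLOG + b'<e1 a="A B"/>\n'

E3_DTD = (b'<!ELEMENT e1 EMPTY>\n'
          b'<!ATTLIST e1\n'
          b'          a NMTOKENS #IMPLIED>\n')
# Same DTD with the attribute declared as CDATA, for comparison only.
E3_DTD_CDATA = E3_DTD.replace(b'NMTOKENS', b'CDATA')


def reported_value(doc, dtd):
    """Return the value of attribute a as the start element handler
    sees it, with dtd as the external subset (hypothetical helper)."""
    seen = []
    parser = expat.ParserCreate()
    # Read the external subset even though the document says standalone.
    parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)

    def external_entity_ref(context, base, system_id, public_id):
        sub = parser.ExternalEntityParserCreate(context)
        sub.Parse(dtd, True)
        return 1

    def start_element(name, attrs):
        seen.append(attrs["a"])

    parser.ExternalEntityRefHandler = external_entity_ref
    parser.StartElementHandler = start_element
    parser.Parse(doc, True)
    return seen[0]


if __name__ == "__main__":
    print(repr(reported_value(DOC_ENTITIES, E3_DTD)))
    print(repr(reported_value(DOC_PLAIN, E3_DTD)))
    print(repr(reported_value(DOC_ENTITIES, E3_DTD_CDATA)))
```

With the NMTOKENS declaration both documents deliver the same string to
the handler. Only the run against the CDATA variant shows that the first
document's value depended on the external declaration.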

Looking at the spec:

<spec excerpt="Validity constraint: Standalone Document Declaration">
  The standalone document declaration must have the value "no" if any external
  markup declarations contain declarations of:
  a) attributes with default values, if elements to which these attributes
     apply appear in the document without specifications of values for these
     attributes, or
  b) entities (other than amp, lt, gt, apos, quot), if references to those
     entities appear in the document, or
  c) attributes with values subject to normalization, where the attribute
     appears in the document with a value which will change as a result of
     normalization, or
  d) element types with element content, if white space occurs directly
     within any instance of those types.
</spec>

It seems none of these applies here. c) looks the closest, but there
is no attribute value declared in the external subset.
Is there another constraint that applies?
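
For reference while weighing c), the extra normalization step that
section 3.3.3 prescribes for attribute types other than CDATA is purely
mechanical and can be sketched in a few lines of Python. The function
names are made up for this example. The sketch assumes its input is the
value as it would be reported for a CDATA attribute, and it says nothing
about whether c) covers a type declaration that carries no default.

```python
import re


def normalize_tokenized(cdata_normalized):
    """Apply the further processing that XML 1.0 section 3.3.3 requires
    for attribute types other than CDATA: discard leading and trailing
    spaces and replace runs of spaces by a single space."""
    return re.sub(" +", " ", cdata_normalized.strip(" "))


def changed_by_type_normalization(cdata_normalized):
    """True if a non-CDATA declaration alters the reported value."""
    return normalize_tokenized(cdata_normalized) != cdata_normalized


if __name__ == "__main__":
    for value in ("  A  B  ", "A B"):
        print(repr(value), changed_by_type_normalization(value))
```

A validator could only run such a comparison if it had access to the
value before the type-dependent step was applied.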

> Well, has anybody really followed these explanations? (My English
> is too clumsy, sorry.) On the other hand, these two problems
> are the only 'unsolvable' ones (without changing the expat sources, of
> course) that I'm aware of which prevent one from writing full DTD
> validation on top of expat.

If we could eliminate the second problem - depending on your
reply - then we might have at least a solution already in
the planning stages.

Karl