[Expat-discuss] Refering to XML Schema and Expat

rolf@pointsman.de rolf@pointsman.de
Wed, 25 Sep 2002 03:29:42 +0200 (MEST)


On 24 Sep, Karl Waclawek wrote:
> 
>> But beside the deficiencies of my code, I should mention, that I
>> think, it's not possible, to write a 100 percent compliant DTD
>> validator on top of the current expat code. There are two major
>> problems, I see: It is not possible at the moment, to check for the
>> Validity constraint: Proper Declaration/PE Nesting, and second it is
>> not possible, to validate XML documents, that have standalone="yes"
>> and an external subset (the problem is, that there is no reliable way
>> to know, if expat has done additional normalization on attribute
>> types other than CDATA (see XML rec 3.3.3). If somebody is interested,
>> I may elaborate this notes a bit more.
> 
> Please do - on both points!

I should have known... ;-) OK, since it's you who asks.

First, the validity constraints Proper Declaration/PE Nesting and
Proper Group/PE Nesting. I missed the second one in my note, but the
two are similar problems.

Take a look at this example document:

<!DOCTYPE e1 SYSTEM "e1.dtd">
<e1/>

with this external subset, e1.dtd:

<!ENTITY % pe1 "EMPTY>">
<!ELEMENT e1 %pe1;

Expat is perfectly happy with this. That's OK, since expat is a
well-formedness parser and doesn't have to care about validity
constraints. But this example doesn't fulfill the validity constraint
Proper Declaration/PE Nesting.

The same is true for this example, which doesn't fulfill the validity
constraint Proper Group/PE Nesting:

<!DOCTYPE e1 SYSTEM "e2.dtd">
<e1><e2/></e1>

with e2.dtd:

<!ENTITY % p1 "(e2">
<!ENTITY % p2 "|e3)">
<!ELEMENT e1 %p1;%p2;>
<!ELEMENT e2 EMPTY>
<!ELEMENT e3 EMPTY>

Expat accepts this, which is of course OK. But the document is not
valid. (The OASIS XML test suite includes a few tests for both
constraints.)
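
Both examples can be tried out with a few lines of code. The following
is only a sketch, assuming Python's standard xml.parsers.expat binding
instead of the C API. The parse() helper and its toy resolver, which
ignores the system identifier and always serves the DTD it was given,
are made up for illustration. The result may also depend on the expat
version the binding is linked against.

```python
import xml.parsers.expat as expat


def parse(doc, dtd):
    """Parse doc, answering every external entity request with dtd."""
    events = []
    parser = expat.ParserCreate()
    # Ask expat to read the external subset and expand parameter entities.
    parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)

    def external_ref(context, base, system_id, public_id):
        # Toy resolver (assumption): ignores system_id, serves the given DTD.
        sub = parser.ExternalEntityParserCreate(context)
        sub.Parse(dtd, True)
        return 1

    parser.ExternalEntityRefHandler = external_ref
    parser.ElementDeclHandler = (
        lambda name, model: events.append(("elementdecl", name)))
    parser.StartElementHandler = (
        lambda name, attrs: events.append(("start", name)))
    parser.Parse(doc, True)  # raises expat.ExpatError if expat rejects it
    return events


DECL_DOC = b'<!DOCTYPE e1 SYSTEM "e1.dtd">\n<e1/>\n'
DECL_DTD = b'<!ENTITY % pe1 "EMPTY>">\n<!ELEMENT e1 %pe1;\n'

GROUP_DOC = b'<!DOCTYPE e1 SYSTEM "e2.dtd">\n<e1><e2/></e1>\n'
GROUP_DTD = (b'<!ENTITY % p1 "(e2">\n'
             b'<!ENTITY % p2 "|e3)">\n'
             b'<!ELEMENT e1 %p1;%p2;>\n'
             b'<!ELEMENT e2 EMPTY>\n'
             b'<!ELEMENT e3 EMPTY>\n')

decl_events = parse(DECL_DOC, DECL_DTD)
group_events = parse(GROUP_DOC, GROUP_DTD)
print(decl_events)
print(group_events)
```

If no ExpatError is raised, expat has accepted both documents, and
the declaration handlers give no hint that a validity constraint was
violated.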

It does not help much to analyze the replacement text of a parameter
entity inside the XML_EntityDeclHandler in order to find such
'ill-formed' parameter entities (besides, this would be far from
simple to do). The reason is that these parameter entities are not
really 'ill-formed': they follow the production rules for a parameter
entity and are therefore 'legal' inside the DTD. It is only a
validity error if they are used as in the examples above. For example,
if e1.dtd in the first example is

<!ENTITY % pe1 "EMPTY>">
<!ELEMENT e1 EMPTY>

then the document is valid. So even if I analyzed the replacement
text in some clever way in my validation layer, it would not help,
because there is currently no way to be notified when a parameter
entity is used (and in which markup context).
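
To illustrate the point, here is a sketch that compares what the
declaration handlers report for the two variants of e1.dtd. It assumes
Python's standard xml.parsers.expat binding. The helper
declaration_events() and its toy resolver are hypothetical names, and
only the handlers registered here are compared.

```python
import xml.parsers.expat as expat

DOC = b'<!DOCTYPE e1 SYSTEM "e1.dtd">\n<e1/>\n'
# Variant 1: pe1 is used and closes the declaration (not valid).
DTD_PE_USED = b'<!ENTITY % pe1 "EMPTY>">\n<!ELEMENT e1 %pe1;\n'
# Variant 2: pe1 is declared but never used (valid).
DTD_PE_UNUSED = b'<!ENTITY % pe1 "EMPTY>">\n<!ELEMENT e1 EMPTY>\n'


def declaration_events(doc, dtd):
    """Collect entity and element declarations as the handlers see them."""
    events = []
    parser = expat.ParserCreate()
    parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)

    def external_ref(context, base, system_id, public_id):
        # Toy resolver (assumption): always serves the DTD passed in.
        sub = parser.ExternalEntityParserCreate(context)
        sub.Parse(dtd, True)
        return 1

    def entity_decl(name, is_parameter, value, base,
                    system_id, public_id, notation):
        events.append(("entitydecl", name, bool(is_parameter), value))

    def element_decl(name, model):
        events.append(("elementdecl", name, model))

    parser.ExternalEntityRefHandler = external_ref
    parser.EntityDeclHandler = entity_decl
    parser.ElementDeclHandler = element_decl
    parser.Parse(doc, True)
    return events


used = declaration_events(DOC, DTD_PE_USED)
unused = declaration_events(DOC, DTD_PE_UNUSED)
print(used == unused)
```

Both runs report the declaration of pe1 and the declaration of e1, but
nothing that says whether (or where) pe1 was referenced.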


Second, the problem with standalone documents. If a document has a
Standalone Document Declaration, this does not necessarily mean that
it doesn't have any external entities. In the words of the
recommendation, it means:

  "In a standalone document declaration, the value "yes" indicates
   that there are no external markup declarations which affect the
   information passed from the XML processor to the application."

External markup declarations could affect the information passed from
the XML processor to the application through, for example, attribute
defaults (if the defaulted attribute is omitted in the document) or
entity declarations. Most of this can be handled in some way. But I
think it's not possible to reliably detect one particular problem with
attribute value normalization.
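
As an illustration of a case that can be handled, here is one possible
way to spot attribute defaults coming from the external subset. This is
a sketch under assumptions, not the method of any existing validation
layer: it uses Python's standard xml.parsers.expat binding and its
specified_attributes switch, and the file name default.dtd, the DTD
content and the attributes() helper are invented for the example.

```python
import xml.parsers.expat as expat

DOC = (b'<?xml version="1.0" standalone="yes"?>\n'
       b'<!DOCTYPE e1 SYSTEM "default.dtd">\n'
       b'<e1/>\n')
# Hypothetical external subset that declares a default for attribute a.
DTD = b'<!ELEMENT e1 EMPTY>\n<!ATTLIST e1 a CDATA "fallback">\n'


def attributes(doc, dtd, specified_only):
    """Return the attributes reported for the first start tag."""
    seen = []
    parser = expat.ParserCreate()
    parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)
    # When true, only attributes written in the document are reported.
    parser.specified_attributes = specified_only

    def external_ref(context, base, system_id, public_id):
        # Toy resolver (assumption): always serves the DTD passed in.
        sub = parser.ExternalEntityParserCreate(context)
        sub.Parse(dtd, True)
        return 1

    parser.ExternalEntityRefHandler = external_ref
    parser.StartElementHandler = (
        lambda name, attrs: seen.append(dict(attrs)))
    parser.Parse(doc, True)
    return seen[0]


with_defaults = attributes(DOC, DTD, False)
specified = attributes(DOC, DTD, True)
print(with_defaults, specified)
```

An attribute that shows up only in the first result was supplied by a
default, so a validation layer could flag the standalone="yes" claim
in that case.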

Please take a look at this example (it follows the example in the XML
recommendation, section 3.3.3):

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE e1 SYSTEM "e3.dtd" [
<!ENTITY d "&#xD;">
<!ENTITY a "&#xA;">
<!ENTITY da "&#xD;&#xA;">
]>
<e1 a="&d;&d;A&a;&a;B&da;"/>

with e3.dtd:

<!ELEMENT e1 EMPTY>
<!ATTLIST e1
          a NMTOKENS #IMPLIED>

Since a validating parser must always read all external entities,
expat has read e3.dtd and knows that the attribute a is of type
NMTOKENS. If expat knows the type of an attribute, it does the
additional attribute value normalization described in 3.3.3.
Therefore, the element start handler will see "A B" as the value of
the attribute a. The problem is that the information about the
attribute type in the external entity has affected the information
passed to the application. If a had the type CDATA, the attribute
value would have kept its leading, trailing and repeated spaces
(according to 3.3.3, "  A  B  "). Therefore, the standalone="yes"
claim of this document is false, and the document is not valid.

If the document reads like this:

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE e1 SYSTEM "e3.dtd" [
<!ENTITY d "&#xD;">
<!ENTITY a "&#xA;">
<!ENTITY da "&#xD;&#xA;">
]>
<e1 a="A B"/>

(with the same e3.dtd) the document would be valid.

And there is no way to know that expat has done additional
normalization according to the attribute type. Therefore, the two
cases are indistinguishable at the expat handler level.
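
The effect can be reproduced with a short sketch, again assuming
Python's standard xml.parsers.expat binding. The helpers document(),
dtd() and attribute_a() are hypothetical, and the CDATA variant of
e3.dtd is added only for comparison.

```python
import xml.parsers.expat as expat

INTERNAL_SUBSET = (b'<!ENTITY d "&#xD;">\n'
                   b'<!ENTITY a "&#xA;">\n'
                   b'<!ENTITY da "&#xD;&#xA;">\n')


def document(attribute_value):
    """Build the standalone document with the given value for a."""
    return (b'<?xml version="1.0" standalone="yes"?>\n'
            b'<!DOCTYPE e1 SYSTEM "e3.dtd" [\n' + INTERNAL_SUBSET + b']>\n'
            b'<e1 a="' + attribute_value + b'"/>\n')


def dtd(attribute_type):
    """Build e3.dtd, declaring attribute a with the given type."""
    return (b'<!ELEMENT e1 EMPTY>\n'
            b'<!ATTLIST e1\n'
            b'          a ' + attribute_type + b' #IMPLIED>\n')


def attribute_a(doc, external_subset):
    """Return the value of a as the start element handler sees it."""
    seen = []
    parser = expat.ParserCreate()
    parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)

    def external_ref(context, base, system_id, public_id):
        # Toy resolver (assumption): always serves the given subset.
        sub = parser.ExternalEntityParserCreate(context)
        sub.Parse(external_subset, True)
        return 1

    parser.ExternalEntityRefHandler = external_ref
    parser.StartElementHandler = (
        lambda name, attrs: seen.append(attrs["a"]))
    parser.Parse(doc, True)
    return seen[0]


WITH_REFS = document(b'&d;&d;A&a;&a;B&da;')
PLAIN = document(b'A B')

nmtokens_refs = attribute_a(WITH_REFS, dtd(b'NMTOKENS'))
nmtokens_plain = attribute_a(PLAIN, dtd(b'NMTOKENS'))
cdata_refs = attribute_a(WITH_REFS, dtd(b'CDATA'))
print(repr(nmtokens_refs), repr(nmtokens_plain), repr(cdata_refs))
```

With the NMTOKENS declaration both documents hand the same string to
the handler. Only the CDATA run keeps the extra spaces, and the handler
gets no signal saying which kind of normalization took place.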


Well, has anybody really followed these explanations? (Well, my
English is too clumsy, sorry.) On the other hand, these two problems
are the only 'unsolvable' ones (without changing the expat sources, of
course) that I'm aware of which prevent one from writing full DTD
validation on top of expat.

rolf