[Expat-discuss] (no subject)
rolf@pointsman.de
Mon, 14 May 2001 03:33:50 +0200 (MEST)
Yes, late response... Ahem,
On 3 Mar, Fred L. Drake, Jr. wrote:
>
> Sam TH writes:
> > Expat is a non-validating parser. I do not know of any plans to
> > change that status. However, the latest version is 1.95.1, and that
> > has a number of improvements. You might want to check it out.
>
> Actually, I think you can perform pretty much complete validation on
> top of the information Expat provides now, but haven't dug in with
> that in mind. Perhaps "xmlval" would be a nice example program, if
> anyone has time to write it?
With some experience in this direction under my belt, I have to say
you are almost right. Yes, with 1.95.1 it's possible to come close
(from a practical viewpoint I would say very close) to "complete
validation", and it's indeed not very hard, but a lot of fun.
On the other hand, I think it's simply impossible to reach really
"complete" validation without the help of the expat team (for
additional features / bug fixing) or local changes to the expat
sources.
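To give an idea of what "validation on top of expat" can look like, here
is a minimal sketch using Python's xml.parsers.expat binding. It is only
an illustration, not a validator: it checks just two things (every
element is declared, and every #REQUIRED attribute is present), and the
function name check_declared and the sample document are made up for
this example.

```python
import xml.parsers.expat as expat


def check_declared(xml_text):
    """Return a list of complaints for xml_text (a very partial check)."""
    declared = set()   # element names seen in <!ELEMENT ...> declarations
    required = {}      # element name -> set of #REQUIRED attribute names
    problems = []

    def element_decl(name, model):
        # The content model is ignored here; a real validator would use it.
        declared.add(name)

    def attlist_decl(elname, attname, type_, default, is_required):
        if is_required:
            required.setdefault(elname, set()).add(attname)

    def start_element(name, attrs):
        if name not in declared:
            problems.append(f"element {name!r} is not declared")
        for att in sorted(required.get(name, ())):
            if att not in attrs:
                problems.append(
                    f"element {name!r} lacks required attribute {att!r}")

    parser = expat.ParserCreate()
    parser.ElementDeclHandler = element_decl
    parser.AttlistDeclHandler = attlist_decl
    parser.StartElementHandler = start_element
    parser.Parse(xml_text, True)
    return problems


# Hypothetical sample document: well-formed, but not valid.
SAMPLE = """<!DOCTYPE a [
<!ELEMENT a (b*)>
<!ELEMENT b EMPTY>
<!ATTLIST b id CDATA #REQUIRED>
]>
<a><b id="1"/><b/><c/></a>"""

if __name__ == "__main__":
    for line in check_declared(SAMPLE):
        print(line)
```

Content models, attribute types, IDs and so on are all left out; the
point is only that the declaration handlers deliver enough information
to build such checks at the handler level.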
The major issues that I'm aware of at the moment are
o The "Proper Declaration/PE Nesting" validity constraint
(recommendation 2.8) and the "Proper Group/PE Nesting" validity
constraint (rec. 3.2.1) aren't detected by the parser (yes, yes,
that's OK for a well-formedness parser, because they are, as their
names say, validity constraints) and can't be detected at the handler
level, because parameter entity expansion is silently done by the
parser before the data reaches the handler level.
o There is a weird problem with the standalone="yes"
declaration. 'standalone="yes"' doesn't mean that you don't have to
read external entities (if you're validating); it means that the
information within the document entity isn't affected or changed by
anything within the external entities. But suppose a document marked
'standalone="yes"' has an external DTD that (only as an example; there
are several cases, all dealing with attribute normalisation) declares
a fixed default for some attribute, and that attribute isn't carried
within the document entity with that fixed value. expat will then
normalize this attribute value before it reaches the handler
level. That's fine, but there is no way to know whether some
normalization by the parser was actually needed - and this is what
you have to know if you want to validate such documents.
o There are a few bugs in expat that come to attention if you try
to do validation. Don't get me wrong - expat is a great piece of
software, and the code of James Clark and the expat team is most
probably more reliable than the crude stuff I'm
writing. Nevertheless, I lately found some time to check both expat
1.95.1 and James Clark's expat 1.2 with the OASIS XML Conformance
Subcommittee XML 1.0 Test Suite, Second Edition
(http://www.oasis-open.org/committees/xml-conformance/suite-v1se/xmlconf-20010315.tar.gz).
Please note: for lack of time, my testing isn't
"hard" in any sense. Ask for more info if you really want to know
what I have done (not too much) and what not.
I found nearly no serious conformance problems while using expat as a
simple non-validating parser. By "simple non-validating parser" I
mean: the parser doesn't parse any external entity (no external DTD,
no external parameter entities, no external general entities). The
only exception is
ibm/not-wf/P69/ibm69n05.xml
Both expat 1.2 and expat 1.95.1 have the ability to parse (recursively)
external DTDs and external entities (both parameter entities and
general entities). Using this feature, both parsers failed the
following tests (unix style paths relative to the root of the test
suite):
Document does not conform to the XML rec but silently passed:
sun/not-wf/uri01.xml
oasis/p16fail3.xml
ibm/not-wf/misc/432gewf.xml
Error reported, although the document conforms to the XML rec:
sun/valid/ext01.xml
xmltest/valid/not-sa/003.xml
xmltest/valid/not-sa/004.xml
xmltest/valid/not-sa/031.xml
In every case, it's not hard to see what's going wrong (the four
correct XML documents that were rejected obviously fail for the same
reason). Since I've already written too many words (and since writing
in English is hard for me, because I'm obviously not too fluent in it),
let me just stop here; ask for more info if you really need it.
Additionally, there seems to be one problem or another with the
autodetection of the encoding of external entities (but I'm not really
sure about this/haven't checked deeper).
(There are always bugs, in everything. Maybe it comforts you to know
that - I'm tempted to write "of course" - even this OASIS test suite
has a few bugs.)
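Coming back to the attribute normalization point in the
standalone="yes" item above, here is a small sketch (again with
Python's xml.parsers.expat binding) of what the handler level gets to
see. For brevity the declaration sits in the internal subset rather
than in an external DTD, which is an assumption of this example; the
document and the function name attributes_seen are made up.

```python
import xml.parsers.expat as expat

# The attribute x is declared as NMTOKENS, so the parser normalizes its
# value (leading/trailing spaces dropped, runs of spaces collapsed).
DOC = """<!DOCTYPE a [
<!ELEMENT a EMPTY>
<!ATTLIST a x NMTOKENS #IMPLIED>
]>
<a x="  foo   bar "/>"""


def attributes_seen(xml_text):
    """Return {element name: attributes} as delivered to the start handler."""
    seen = {}

    def start_element(name, attrs):
        seen[name] = dict(attrs)

    parser = expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.Parse(xml_text, True)
    return seen


if __name__ == "__main__":
    print(attributes_seen(DOC))
```

The start handler receives only the normalized value; nothing in its
arguments says whether the value in the document looked different,
which is exactly the information a validator would need for the
standalone check.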
While I'm writing anyway / misc:
o As far as I can see, expat handles only UTF-8 characters up to a
maximum of 3 bytes long. That's fine, but if I'm not mistaken, the new
Unicode 3.1 standard (and ISO 10646-2) has started to use characters
beyond the Basic Multilingual Plane (BMP), that is, beyond the first
2^16 character positions. Are there plans to support documents
containing such characters with expat?
o Declarations of attributes with type ENUMERATION and NOTATION are
"normalized" in some way - in a very straightforward way, I have
to confess - before they reach the handler level. As far as I know,
there is no reason within the XML recommendation for doing
this. Don't get me wrong, I'm far from criticizing this - it makes
life really a little bit easier if you try to write a validator on
top of expat - but this isn't documented, and my question is: can I
rely on this?
o If you ever want to do some validation on top of expat, you need
Name/NMTOKEN production tests. expat already has the infrastructure to
do this (and a very efficient one, as far as I can see). This is -
again, as far as I can see - not a question of simply making an
interface that's private public, but a little more work. Nevertheless,
this would eventually be a good thing to have.
o Why the heck is expat 1.2 xmlwf around 20% faster than
expat 1.95.1 xmlwf, even with XML documents several MBytes long
(where the additional DTD parsing and reporting work done by expat
1.95.1 (more work relative to 1.2) shouldn't make any measurable
difference)?
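For the Name/NMTOKEN item above: until something like that is
available from expat itself, a validator has to bring its own tests.
A rough sketch in Python follows. Note the assumptions: it uses the
NameStartChar/NameChar ranges from the Fifth Edition of the XML 1.0
recommendation (simpler than the Letter/Digit/CombiningChar tables of
the Second Edition), it has nothing to do with expat's internal
tables, and the function names are made up.

```python
import re

# Character ranges from XML 1.0 (Fifth Edition), productions [4] and [4a].
_NAME_START = (
    r":A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D"
    r"\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF"
    r"\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
_NAME_CHAR = _NAME_START + r"\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"

_NAME_RE = re.compile(f"[{_NAME_START}][{_NAME_CHAR}]*")
_NMTOKEN_RE = re.compile(f"[{_NAME_CHAR}]+")


def is_name(s):
    """True if s matches the Name production."""
    return _NAME_RE.fullmatch(s) is not None


def is_nmtoken(s):
    """True if s matches the Nmtoken production."""
    return _NMTOKEN_RE.fullmatch(s) is not None


def is_nmtokens(s):
    """True if s is a list of Nmtokens separated by single spaces
    (that is, an already normalized NMTOKENS attribute value)."""
    return all(is_nmtoken(tok) for tok in s.split(" "))


if __name__ == "__main__":
    for candidate in ("foo", "1foo", "a:b-c.d", "-x"):
        print(candidate, is_name(candidate), is_nmtoken(candidate))
```

An ID, IDREF, ENTITY or NOTATION value would be checked with is_name,
an NMTOKEN value with is_nmtoken, and the plural types by splitting on
spaces first.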
rolf