[Expat-discuss] (no subject)

rolf@pointsman.de rolf@pointsman.de
Mon, 14 May 2001 03:33:50 +0200 (MEST)


Yes, late response... Ahem, 

On  3 Mar, Fred L. Drake, Jr. wrote:
> 
> Sam TH writes:
>  > Expat is a non-validating parser.  I do not know of any plans to
>  > change that status.  However, the latest version is 1.95.1, and that
>  > has a number of improvements.  You might want to check it out.  
> 
>   Actually, I think you can perform pretty much complete validation on
> top of the information Expat provides now, but haven't dug in with
> that in mind.  Perhaps "xmlval" would be a nice example program, if
> anyone has time to write it?

With some experience in this direction under my belt, I have to say
you are almost right. Yes, with 1.95.1 it's possible to come close
(from a practical point of view, I would say very close) to
"complete validation", and it's indeed not very hard, but a lot of fun.

On the other hand, I think it's simply impossible to reach really
"complete" validation without the help of the expat team (for
additional features / bug fixes) or local changes to the expat
sources.

The major issues that I'm aware of at the moment are:

o Violations of the "Proper Declaration/PE Nesting" validity
  constraint (recommendation 2.8) and the "Proper Group/PE Nesting"
  validity constraint (rec. 3.2.1) aren't detected by the parser (yes,
  yes, that's OK for a well-formedness parser, because they are, as
  their names say, validity constraints) and can't be detected at the
  handler level, because parameter entity expansion is silently done
  by the parser before the data reaches the handler level.

o There is a weird problem with the standalone="yes"
  declaration. 'standalone="yes"' doesn't mean that you don't have to
  read external entities (if you're validating); it means that the
  information within the document entity isn't affected or changed by
  anything within the external entities. But if a document marked
  'standalone="yes"' has an external DTD that (only for example; there
  are several cases, all dealing with attribute normalization)
  declares a fixed default for some attribute, and that attribute
  isn't carried within the document entity with that fixed value,
  expat will normalize this attribute value before it reaches the
  handler level. That's fine, but there is no way to know whether some
  normalization by the parser was actually needed - and this is what
  you have to know if you want to validate such documents.

o There are a few bugs in expat that come to attention if you try
  to do validation. Don't get me wrong - expat is a great piece of
  software, and the code of James Clark and the expat team is most
  probably more reliable than the crude stuff I'm
  writing. Nevertheless, lately I found some time to check both expat
  1.95.1 and James Clark's expat 1.2 with the OASIS XML Conformance
  Subcommittee XML 1.0 Test Suite, Second Edition
  (http://www.oasis-open.org/committees/xml-conformance/suite-v1se/xmlconf-20010315.tar.gz).
  Please note: due to lack of time, my testing isn't "hard" in any
  sense. Request more info if you really want to know what I have
  done (not too much) and what not.

  I found almost no serious conformance problems while using expat as
  a simple non-validating parser. By "simple non-validating parser" I
  mean: the parser doesn't parse any external entity (no external DTD,
  no external parameter entities, no external general entities). The
  only exception is

        ibm/not-wf/P69/ibm69n05.xml
  
  Both expat 1.2 and expat 1.95.1 have the ability to parse (recursive)
  external DTDs and external entities (both parameter entities and
  general entities). Using this feature, both parsers failed the
  following tests (Unix-style paths relative to the root of the test
  suite):
  
  Document doesn't conform to the XML rec, but was silently accepted:
        sun/not-wf/uri01.xml
        oasis/p16fail3.xml
        ibm/not-wf/misc/432gewf.xml
  
  Error reported, although the document conforms to the XML rec:
        sun/valid/ext01.xml
        xmltest/valid/not-sa/003.xml
        xmltest/valid/not-sa/004.xml
        xmltest/valid/not-sa/031.xml

  In every case, it's not hard to see what's going wrong (the four
  correct XML documents that were rejected obviously fail for the same
  reason). Since I'm already writing too many words (and since writing
  in English is hard for me, because I'm obviously not too fluent in
  it), let me just stop here; request more info if you really need it.

  Additionally, there seems to be some problem or other with the
  autodetection of the encoding of external entities (but I'm not really
  sure about this / haven't checked deeper).

  (There are always bugs, in everything. Maybe it comforts you to know
  that - I'm tempted to write "of course" - even this OASIS test suite
  has a few bugs.)
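
To make the standalone item above more concrete, here is a minimal
sketch of what a handler actually receives. It assumes Python's
standard xml.parsers.expat binding (not something used in the tests
described in this message) and, to stay self-contained, puts the
declarations into an internal DTD subset instead of an external DTD.
The document and the names doc, kind and ids are made up for
illustration. The handler gets the defaulted and the normalized
values only; the attribute mapping itself does not show what was
literally written in the start tag.

```python
import xml.parsers.expat

# Hypothetical document: "kind" has a #FIXED default and is absent from
# the start tag; "ids" is a tokenized attribute written with extra spaces.
DOC = """<?xml version="1.0"?>
<!DOCTYPE doc [
<!ATTLIST doc kind CDATA #FIXED "report"
              ids NMTOKENS #IMPLIED>
]>
<doc ids="  a   b  "/>"""


def attributes_seen(document):
    """Return (element name, attributes) pairs as the handler gets them."""
    seen = []
    parser = xml.parsers.expat.ParserCreate()

    def start_element(name, attrs):
        # attrs already contains defaulted and normalized values.
        seen.append((name, dict(attrs)))

    parser.StartElementHandler = start_element
    parser.Parse(document, True)
    return seen


if __name__ == "__main__":
    print(attributes_seen(DOC))
```

A validator that has to judge 'standalone="yes"' would need to compare
these values with the literal text of the start tag, which is exactly
the information that is missing at this level.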

Since I'm writing anyway / misc:

o As far as I can see, expat handles only UTF-8 chars up to at most
  3 bytes long. That's fine, but if I'm right, the new Unicode 3.1
  standard (and ISO 10646-2) has started to use characters beyond the
  Basic Multilingual Plane (BMP), that is, beyond the first 2^16
  character positions. Are there plans to support documents including
  such characters with expat?

o Declarations of attributes with type ENUMERATION and NOTATION are
  somewhat "normalized" - in a very straightforward way, I have to
  confess - before they reach the handler level. As far as I know,
  there is no reason within the XML recommendation for doing
  this. Don't get me wrong, I'm far from criticizing this - it really
  makes life a little bit easier if you try to write a validator on
  top of expat - but this isn't documented, and my question is: can I
  rely on this?

o If you ever want to do some validation on top of expat, you need
  Name/NMTOKEN production tests. expat already has the infrastructure
  to do this (and a very efficient one, as far as I can see). This is
  - again, as far as I can see - not a question of simply making a
  private interface public, but a little more work. Nevertheless, this
  would eventually be a good thing to have.

o Why the heck is expat 1.2 xmlwf around 20% faster than
  expat 1.95.1 xmlwf, even with XML documents several MBytes long
  (where the additional DTD parsing and reporting work done by expat
  1.95.1 (more work relative to 1.2) shouldn't make any measurable
  difference)?
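
The Name/NMTOKEN tests mentioned above can also be sketched outside of
expat. The following is only a sketch in Python, assuming the
NameStartChar/NameChar ranges of XML 1.0 Fifth Edition, which differ
from the Letter/Digit/CombiningChar/Extender classes used by earlier
editions of the recommendation. All function names are made up for
illustration and have nothing to do with expat's internal tables.

```python
# NameStartChar ranges (assumption: XML 1.0 Fifth Edition, production [4]).
_NAME_START_RANGES = [
    (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]
# NameChar adds "-", ".", digits, #xB7 and two further ranges ([4a]).
_NAME_EXTRA_RANGES = [
    (0x2D, 0x2E), (0x30, 0x39), (0xB7, 0xB7),
    (0x300, 0x36F), (0x203F, 0x2040),
]


def _in_ranges(ch, ranges):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def is_name_start_char(ch):
    return _in_ranges(ch, _NAME_START_RANGES)


def is_name_char(ch):
    return is_name_start_char(ch) or _in_ranges(ch, _NAME_EXTRA_RANGES)


def is_nmtoken(text):
    """Nmtoken ::= (NameChar)+"""
    return len(text) > 0 and all(is_name_char(ch) for ch in text)


def is_name(text):
    """Name ::= NameStartChar (NameChar)*"""
    return is_nmtoken(text) and is_name_start_char(text[0])


if __name__ == "__main__":
    for candidate in ("doc", "_x-1.y", "1abc", "-x", "a b", ""):
        print(repr(candidate), is_name(candidate), is_nmtoken(candidate))
```

A validator would apply is_name to values of type ID, IDREF and
ENTITY, and is_nmtoken to values of type NMTOKEN, after splitting
the plural types (IDREFS, ENTITIES, NMTOKENS) on single spaces.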

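On the UTF-8 item above: every code point of the BMP fits into at most
3 bytes of UTF-8, while every code point beyond it needs 4. A small
sketch of that arithmetic in Python follows. It says nothing about what
expat itself accepts, and the function name is made up for
illustration.

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 needs to encode one Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3  # the rest of the BMP
    return 4      # everything beyond the BMP


if __name__ == "__main__":
    for cp in (0x41, 0xE9, 0x20AC, 0xFFFD, 0x10000, 0x10FFFF):
        print("U+%04X" % cp, utf8_length(cp), "byte(s)")
```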

rolf