[XML-SIG] ElementType.content_model interpretation of '*'

Lars Marius Garshol larsga@ifi.uio.no
04 May 1999 12:32:11 +0200


* Jeffrey Chang
|
| I am using xmlproc.dtdparser.DTDParser and xmlproc.xmldtd.CompleteDTD to
| parse and store the contents of a DTD file (xmlproc v0.60).  

I have been working on the same (in order to provide DTD caching), and
it would be interesting to see what you've done with this. I used
pickle and achieved acceptable (although not really impressive) speedups,
but ran into trouble with the internal subset handling. I'm still thinking
about what to do with this.

Some improvements to the current DTD handling have emerged, though,
and it is possible that there will be more.

| <!ELEMENT test (a,b*)>
| 
| [...]
| >>> d.elems['test'].content_model
| {                            # I've reformatted this for readability
| 'start': 1L, 
|      1L: [(6L, 'a')],
|      4L: [(4L, 'b')], 
|      6L: [(4L, 'b')], 
| 'final': 4L 
| }
| 
| According to this content model, 'test' must contain 1 'a' and at
| least 1 'b' before reaching the final state.

So it may seem, but it's not actually the case. This content model is
created by converting a non-deterministic automaton (NDA) into a
deterministic one, and each state number is a bit mask of the NDA states
it stands for. State 1 is state 1 in the original NDA and 4 is state 4,
while state 6 is the combination of state 2 and state 4 (6 = 2 | 4) in
the original NDA.
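
To illustrate how a combined state like 6 can arise, here is a minimal
sketch of such a conversion. It is not xmlproc's actual code: the NDA
below (the tables `nda` and `epsilon`) is only my assumption about what
the automaton for (a,b*) might look like, and `closure` and
`determinize` are hypothetical helper names.

```python
# ASSUMED NDA for (a,b*). States are powers of two, so a set of NDA
# states fits in a single integer bit mask.
nda = {
    1: [(2, "a")],      # reading 'a' in state 1 leads to state 2
    4: [(4, "b")],      # reading 'b' in state 4 stays in state 4
}
epsilon = {2: [4]}      # assumed: state 2 reaches state 4 without input


def closure(mask):
    """Add every state reachable by epsilon moves to the bit mask."""
    changed = True
    while changed:
        changed = False
        for src, targets in epsilon.items():
            if mask & src:
                for target in targets:
                    if not mask & target:
                        mask |= target
                        changed = True
    return mask


def determinize(start):
    """Subset construction; each deterministic state is a bit mask."""
    dfa = {}
    todo = [closure(start)]
    while todo:
        mask = todo.pop()
        if mask in dfa:
            continue
        moves = {}
        for src, arcs in nda.items():
            if mask & src:
                for target, name in arcs:
                    moves[name] = moves.get(name, 0) | target
        dfa[mask] = [(closure(t), name) for name, t in sorted(moves.items())]
        todo.extend(t for t, _ in dfa[mask])
    return dfa


print(determinize(1))
```

Under these assumptions the result has the same transitions as the
content model quoted above: state 1 goes to 6 on 'a', and both 6 and 4
go to 4 on 'b'.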

If you look at the contents of the final_state method you'll see that
it does

  return self.content_model["final"] & state

which means that after seeing one 'a' and no 'b's (state 6) it will do

  return 4L & 6L

and so return 4, which evaluates as true.
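
The check can be tried out on the quoted content model with a small
stand-alone sketch. The dictionary is copied from the output above (with
plain integers instead of longs); `next_state`, `is_final` and `accepts`
are hypothetical helpers written for this example and are not xmlproc's
own methods, although `is_final` uses the same bitwise test as
final_state.

```python
# Content model for <!ELEMENT test (a,b*)> as quoted above.
content_model = {
    "start": 1,
    1: [(6, "a")],
    4: [(4, "b")],
    6: [(4, "b")],
    "final": 4,
}


def next_state(model, state, elem):
    """Return the state reached on elem, or 0 if there is no transition."""
    for target, name in model.get(state, []):
        if name == elem:
            return target
    return 0  # assumption for this sketch: 0 means "no such transition"


def is_final(model, state):
    """Same test as final_state: does the state include the final bit?"""
    return bool(model["final"] & state)


def accepts(model, children):
    """Walk the automaton over a list of child element names."""
    state = model["start"]
    for child in children:
        state = next_state(model, state, child)
        if state == 0:
            return False
    return is_final(model, state)


print(accepts(content_model, ["a"]))            # one 'a', no 'b': state 6
print(accepts(content_model, ["a", "b", "b"]))  # ends in state 4
print(accepts(content_model, []))               # still in state 1
```

The first two calls end in states 6 and 4, which both share a bit with
the final state 4, so they are accepted; the empty list stays in state
1, and 4 & 1 is 0, so it is rejected.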

| I [...] would have expected a content model more like:
| 'start': 1L, 
|      1L: [(4L, 'a')],
|      4L: [(4L, 'b')],
| 'final': 4L 

I agree that this would have been closer to optimal. I should probably
have a closer look at my automaton-generating code (which is sub-optimal
in some other respects as well) and see if I can improve it. I don't
expect this to happen any time soon, though, since the only things it
affects are memory consumption and DTD loading time (from the
not-yet-released cache).
 
I hope this resolved your question. I'm sorry for the late reply, but
I've been away for two weeks due to the PAJAVA and XML Europe conferences.

--Lars M.