[lxml-dev] better schematron support
Hi,

schematron support in lxml is currently a second-class citizen due to libxml2 restrictions; see e.g. http://mail.gnome.org/archives/xml/2007-August/msg00016.html, where Stefan commented on error reporting deficiencies, or http://mail.gnome.org/archives/xml/2009-September/msg00022.html, where Daniel of libxml2 fame comments on a feature request to support schematron embedded in XML Schema, basically stating that the implementation is incomplete.

However, there is a pure-XSLT implementation of the now-ISO-standardized schematron by its inventor (and editor of the standard) Rick Jelliffe, the so-called skeleton implementation: http://www.schematron.com/ (Daniel also mentions this in his comment).

Basically, the "skeleton" toolchain creates an xslt that is used for validation. Indeed, schematron itself is just a well-defined way of using xslt for validation, which is something I had been looking into without realising that schematron does exactly this. The "skeleton" implementation is available in both xslt 1 and xslt 2 flavours.

The toolchain steps are (taken from www.schematron.com, modifications by me):

0) [Extract from XML Schema/RelaxNG schema]
1) Process inclusions
2) Process abstract patterns
3) Compile the schema
4) Validate

which translates to

    xsltproc XSD2Schtrn.xsl XMLSchema.xsd > theSchema.sch
  or
    xsltproc RNG2Schtrn.xsl RelaxNGSchema.rng > theSchema.sch
    xsltproc iso_dsdl_include.xsl theSchema.sch > theSchema1.sch
    xsltproc iso_abstract_expand.xsl theSchema1.sch > theSchema2.sch
    xsltproc iso_svrl_for_xsltn.xsl theSchema2.sch > theSchema.xsl
    xsltproc theSchema.xsl myDocument.xml > myResult.xml

Enter libxslt aka lxml's xslt capabilities: It looks pretty easy to integrate this xslt-based toolchain into lxml, effectively enabling full ISO schematron support.
I suggest complementing the current lxml schematron support using this approach:

- add the necessary stylesheets (extraction and skeleton implementation) to the lxml codebase
- add a convenient API to support xslt-based schematron validation to lxml that
  - hides toolchain steps, at least in default mode
  - fits in with the current validators' API
  - provides support for the parameters used for the separate toolchain steps

And finally: Maybe somebody has already done this with lxml + schematron. Care to step forward?

Any comments?

Holger
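As a rough illustration of the toolchain steps listed in the message above, the following sketch builds the xsltproc invocations as argument lists. The function name, the `embedded_in` parameter, and the use of xsltproc's `-o` option in place of shell redirection are assumptions made for this example; the XSLT 1 compile stylesheet name is likewise an assumption, since the message writes "iso_svrl_for_xsltn.xsl". It only constructs the commands and does not run them.

```python
from typing import List, Optional

# Stylesheet names as quoted in the message. Picking the XSLT 1 variant of the
# compile step is an assumption (the message writes "iso_svrl_for_xsltn.xsl").
EXTRACTORS = {"xsd": "XSD2Schtrn.xsl", "rng": "RNG2Schtrn.xsl"}
COMPILE_STYLESHEET = "iso_svrl_for_xslt1.xsl"


def build_toolchain_commands(schema: str, document: str,
                             embedded_in: Optional[str] = None) -> List[List[str]]:
    """Return the xsltproc calls for steps 0-4, in execution order.

    Hypothetical helper: `embedded_in` selects the optional extraction
    step 0 ('xsd' or 'rng'); None means `schema` is already a .sch file.
    """
    commands = []
    if embedded_in is not None:
        if embedded_in not in EXTRACTORS:
            raise ValueError("embedded_in must be 'xsd', 'rng' or None")
        # Step 0: extract the embedded schematron rules.
        commands.append(["xsltproc", "-o", "theSchema.sch",
                         EXTRACTORS[embedded_in], schema])
        schema = "theSchema.sch"
    # Step 1: process inclusions.
    commands.append(["xsltproc", "-o", "theSchema1.sch",
                     "iso_dsdl_include.xsl", schema])
    # Step 2: process abstract patterns.
    commands.append(["xsltproc", "-o", "theSchema2.sch",
                     "iso_abstract_expand.xsl", "theSchema1.sch"])
    # Step 3: compile the schema into a validating stylesheet.
    commands.append(["xsltproc", "-o", "theSchema.xsl",
                     COMPILE_STYLESHEET, "theSchema2.sch"])
    # Step 4: validate the document; the result is the SVRL report.
    commands.append(["xsltproc", "-o", "myResult.xml",
                     "theSchema.xsl", document])
    return commands
```

Each list could be handed to `subprocess.run`; an in-process variant would replace every command with one stylesheet application, which is what the proposal is about.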
Hi Holger, thanks for bringing this up. I'm all in favour of doing this. jholg@gmx.de, 25.11.2009 11:38:
I suggest complementing the current lxml schematron support using this approach: - add the necessary stylesheets (extraction and skeleton implementation) to the lxml codebase
It looks like the license is ok for inclusion. The basic restrictions are that the author wants to keep his name in the sources and that modifications must be recognisable as such. So shipping the files verbatim should not do any harm to lxml's users.
This sounds like it can easily be done in plain Python code, so I'd appreciate having a separate "lxml.schematron" module that implements this. It should mimic the existing validator APIs as much as possible. Additional parameters in the validator constructor are fine. I imagine that the result document could be represented by a special class that helps in interpreting errors, although I haven't actually looked into this any deeper. Holger, could you open a bug report for this? Stefan
Done: https://bugs.launchpad.net/lxml/+bug/488222 Given that I'm currently establishing schematron validations embedded in an XML Schema we're using here I will need this tool chain anyhow, sooner or later. So unless you need a nice little after work project yourself (or have already started hacking away ;) I'd volunteer to come up with an implementation for this (on a branch). Holger
Hi,
Oh, no further need for glory and fame on my side - just go ahead. :)
Finally got around to working on this, just checked in for review:

Committed revision 70160.
URL: http://codespeak.net/svn/lxml/branch/iso-schematron

This comes complete with doc updates, unittests and whatnot.

Some notes:

* implemented as package lxml.schematron to cleanly bundle all the xsl/rng resources
* the API allows for handing stylesheet parameters to the separate schematron-to-xsl compilation steps. Currently, these must be provided with stylesheet parameter properties in mind, i.e. text parameters must be given like {'phase': "'mandatory'"}. As all the stylesheet parameters seem to be text parameters, maybe we could make this a little more convenient and auto-strparam() everything in the arg dicts. But I think it's not really worth the effort and better to stick to normal stylesheet parameter handling
* I had to modify xmlerror.pxi to stay as close to the original validators' workings as possible:

$ svn diff --old=http://codespeak.net/svn/lxml/trunk/src/lxml/xmlerror.pxi@HEAD --new=src/lxml/xmlerror.pxi
Index: src/lxml/xmlerror.pxi
===================================================================
--- src/lxml/xmlerror.pxi (.../http://codespeak.net/svn/lxml/trunk/src/lxml/xmlerror.pxi) (revision 70161)
+++ src/lxml/xmlerror.pxi (.../src/lxml/xmlerror.pxi) (working copy)
@@ -71,6 +71,8 @@
         else:
             self.filename = _decodeFilename(error.file)

+    #FIXME: This seems not to have been used anywhere, so far. Is my addition
+    #FIXME: of _utf8()-ing message & filename correct?
     cdef _setGeneric(self, int domain, int type, int level, int line,
                      message, filename):
         self.domain = domain
@@ -78,8 +80,8 @@
         self.level = level
         self.line = line
         self.column = 0
-        self.message = message
-        self.filename = filename
+        self.message = _utf8(message)
+        self.filename = _utf8(filename)

     def __repr__(self):
         return u"%s:%d:%d:%s:%s:%s: %s" % (
@@ -102,6 +104,12 @@
         def __get__(self):
             return ErrorLevels._getName(self.level, u"unknown")

+#FIXME: Can _LogEntry be settable itself so we don't need this?
+cdef class _SettableLogEntry(_LogEntry):
+    cpdef setGeneric(self, int domain, int type, int level, int line,
+                     message, filename):
+        self._setGeneric(domain, type, level, line, message, filename)
+
 cdef class _BaseErrorLog:
     cdef _LogEntry _first_error
     cdef readonly object last_error
@@ -172,7 +180,7 @@
             message = u"%s, line %d" % (message, line)
         return exctype(message, code, line, column)

-    cdef _buildExceptionMessage(self, default_message):
+    cpdef _buildExceptionMessage(self, default_message):
         if self._first_error is None:
             return default_message
         if self._first_error.message is not None and self._first_error.message:

I'm not too sure about these changes so here are some questions:

* Ok to cpdef _buildExceptionMessage() instead of cdef?
* Instead of adding _SettableLogEntry, would it also be ok to just cpdef _LogEntry.setGeneric?

Also, I added a Validator class to isoschematron that mimics etree._Validator. This wouldn't be necessary if etree._Validator made _error_log accessible from python. Can we just do that?

Have fun
Holger
Hi Holger, jholg@gmx.de, 17.12.2009 01:03:
Cool, thanks!
* implemented as package lxml.schematron to cleanly bundle all the xsl/rng resources
That's ok. I was about to propose calling it "lxml.isoschematron" when I noticed that that was what you meant anyway. :)
I prefer simplifying the interface here and making them all string parameters. I just skimmed through your code, it looks like you want users to pass dicts instead of regular keyword arguments. Why is that?
That won't work. Neither the message nor the filename need to be compatible with the UTF-8 encoding.
I wonder why this is necessary anyway. Can't we just reuse the error log of the underlying XSLT object? I don't expect that we need to generate any log messages ourselves, do we? Stefan
Hi Stefan,
Because I want/need to separate the attributes for the steps include, abstract, compile. If we don't use dicts, how to distinguish these? I don't like using naming conventions here. What if any future version of the skeleton implementation resorts to using some numeric xslt args? We'd at least need to only auto-strparam() string arguments. But what if a parameter taking an xpath expression should show up someday? We can't really discriminate this from a normal string parameter; it's also not possible to hand in an etree.XPath object (just tried it). The only option I see here is to provide another switch to turn off auto-strparam()ing, if need be, defaulting to False.
I wondered about that. Maybe a misunderstanding of mine of what _utf8 is supposed to do. This just ensures that a string is 7-bit-ascii or unicode and returns it utf8-encoded, right? Now, with _setGeneric becoming public in one or the other way, don't you need to store message and filename in a well-known encoding? Or is it the caller's responsibility to know the encoding? What happens to unicode parameters?
The XSLT object returns a perfectly valid XSLT result tree. What makes it a schematron validation error is what's then selected from the result using an XPath expression (which is exposed in the package in case someone has different selection needs or chooses to write their own "meta-stylesheet" instead of iso_svrl_for_xslt1.xsl, which might produce different output):

    # svrl result accessors
    svrl_validation_errors = _etree.XPath(
        '//svrl:failed-assert', namespaces={'sch': SCHEMATRON_NS, 'svrl': SVRL_NS})

What we retrieve as a result from this xpath are the validation errors, which then need to be put into the Schematron._error_log; this is why I need to manually access _error_log in the subclass, and also why it needs to be setGeneric()-able.

Btw: The validation result doc is accessible as property validation_report if store_report is true (default). I didn't wrap this in a special class as
- I can't really imagine what one might want to actually extract from the svrl report
- it's just so easy to get to what you want using XPath on this result tree

One thing I forgot to mention: I haven't tested with Python 3, as I currently don't have an installation.

Holger
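To make the selection step concrete, here is a minimal sketch of pulling `svrl:failed-assert` elements out of a report. It uses the standard library's `xml.etree.ElementTree` as a stand-in so that it runs without lxml; the sample report, the `failed_asserts` and `describe` helpers, and the message format are assumptions made up for this illustration, not part of the committed code.

```python
import xml.etree.ElementTree as ET

# Namespace of the Schematron Validation Report Language (SVRL).
SVRL_NS = "http://purl.oclc.org/dsdl/svrl"
NAMESPACES = {"svrl": SVRL_NS}

# Hand-written sample report, standing in for a real XSLT result tree.
SAMPLE_REPORT = """\
<svrl:schematron-output xmlns:svrl="http://purl.oclc.org/dsdl/svrl">
  <svrl:active-pattern/>
  <svrl:fired-rule context="Total"/>
  <svrl:failed-assert test="sum(//Percent) = 100" location="/Total">
    <svrl:text>Sum is not 100%.</svrl:text>
  </svrl:failed-assert>
</svrl:schematron-output>
"""


def failed_asserts(report_root):
    """Select every failed assertion anywhere below the report root."""
    return report_root.findall(".//svrl:failed-assert", NAMESPACES)


def describe(failure):
    """Render one failed assertion as 'location: message' (made-up format)."""
    text = failure.findtext("svrl:text", default="", namespaces=NAMESPACES)
    return "%s: %s" % (failure.get("location"), text.strip())


root = ET.fromstring(SAMPLE_REPORT)
for failure in failed_asserts(root):
    print(describe(failure))
```

Everything that is not a `failed-assert` (fired rules, active patterns) is simply ignored by the selection, which is why the report itself is still a perfectly valid result tree even when the document fails validation.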
jholg@gmx.de, 17.12.2009 10:57:
Why not make the most common (and non-overlapping) parameters keyword arguments and use the dicts only as fallbacks? We could rename them to "additional_..._parameters" or something.
That's a good idea, though. Passing an XPath object in would simply let the parameter mangling code extract the underlying unparsed XPath expression. It's some unnecessary work if you don't actually want to use the pre-parsed expression, but it's definitely explicit.
The only option I see here is to provide another switch to turn off auto-strparam()ing, if need be, defaulting to False.
Ugly.
Right, it's a user input validation and normalisation function. Sort of the opposite of funicode().
Now, with _setGeneric becoming public in one or the other way
You say that like it was decided. It's a totally internal thing that shouldn't get exposed to Python space.
Why don't you just create a fake error log? There's nothing that requires that the error_log property is of type _ErrorLog or that it receives its error messages in the normal C level way. If a fake-log isn't easy to do, I'm fine with making that simpler, but I'm against making C level APIs public just for a case like this.
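A "fake" error log in the sense suggested here could be an ordinary Python class that merely behaves like a log from the caller's point of view. The sketch below is hypothetical throughout: `LogEntry`, `FakeErrorLog`, and the methods `receive` and `clear` are names chosen for illustration and do not mirror lxml's internal classes.

```python
from collections import namedtuple

# Hypothetical plain-Python log entry; the real entries carry more fields.
LogEntry = namedtuple("LogEntry", ["line", "message", "filename"])


class FakeErrorLog:
    """Collects entries in a list; no C-level entry points are involved."""

    def __init__(self):
        self._entries = []

    def receive(self, entry):
        """Append one entry, e.g. built from a failed-assert element."""
        self._entries.append(entry)

    def clear(self):
        """Drop all entries before a new validation run."""
        del self._entries[:]

    @property
    def last_error(self):
        return self._entries[-1] if self._entries else None

    def __iter__(self):
        return iter(self._entries)

    def __len__(self):
        return len(self._entries)

    def __repr__(self):
        return "\n".join("%s:%d: %s" % (e.filename, e.line, e.message)
                         for e in self._entries)
```

The trade-off raised in the replies still applies: a class like this duplicates behaviour that the existing log classes already have, so it mainly pays off if the internal classes stay closed to subclasses.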
I'm actually for extracting the error log lazily when the error_log property is first read, so storing the unmodified result document sounds like a good idea to me.
One thing I forgot to mention: I haven't tested with Python 3, as I currently don't have an installation.
That'll come. Stefan
Hi,
These are the parameters of the involved xsl stylesheets:

iso_dsdl_include.xsl:
    <xsl:param name="include-schematron">true</xsl:param>
    <xsl:param name="include-crdl">true</xsl:param>
    <xsl:param name="include-xinclude">true</xsl:param>
    <xsl:param name="include-dtll">true</xsl:param>
    <xsl:param name="include-relaxng">true</xsl:param>
    <xsl:param name="include-xlink">true</xsl:param>

iso_abstract_expand.xsl:
    <xslt:param name="schema-id"></xslt:param>

iso_svrl_for_xslt1.xsl:
    <xsl:param name="diagnose" >true</xsl:param>
    <xsl:param name="phase" > [...] </xsl:param>
    <xsl:param name="allow-foreign" >false</xsl:param>
    <xsl:param name="generate-paths" >true</xsl:param>
    <xsl:param name="generate-fired-rule" >true</xsl:param>
    <xsl:param name="optimize"/>
    <xsl:param name="output-encoding" ></xsl:param>

One thing is that there are so many, another thing is to decide which ones are the "most common". As it stands, isoschematron already has quite a lot of parameters.
Do you mean for etree.XSLT to allow for XPath object arguments (that's what I meant), or for isoschematron to extract the path using XPath.path? Supposing you meant the latter, the rules would then be:

- if an arg is a string, auto-strparam() it
- if an arg is an XPath object, just use its .path property
- else use unicode(<arg>) (in a py3 compatible way)

This would add a little more convenience to the parameter passing; I'm still not convinced that this shouldn't/couldn't be rather just addressed in the documentation. The advantage of not implementing the magic is that I can use the very same arg dictionary with isoschematron.Schematron() as with any other XSLT transform.
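The three rules above can be sketched in plain Python. Everything here is illustrative: `XPathExpr` is a hypothetical stand-in for a compiled XPath object exposing `.path`, and `quote_string_param` only imitates what strparam()-style quoting has to achieve, namely turning a raw string into an XPath string literal.

```python
class XPathExpr:
    """Hypothetical stand-in for a compiled XPath object with a .path."""

    def __init__(self, path):
        self.path = path


def quote_string_param(value):
    """Wrap a raw string in quotes so XSLT reads it as a string literal.

    XPath 1.0 string literals have no escape mechanism, so a value that
    contains both quote characters cannot be expressed this way.
    """
    if "'" not in value:
        return "'%s'" % value
    if '"' not in value:
        return '"%s"' % value
    raise ValueError("cannot quote a string containing both ' and \"")


def mangle_params(params):
    """Apply the three rules to a dict of stylesheet parameters."""
    mangled = {}
    for name, value in params.items():
        if isinstance(value, XPathExpr):
            # Rule 2: pass the unparsed expression through unquoted.
            mangled[name] = value.path
        elif isinstance(value, str):
            # Rule 1: plain strings become quoted string literals.
            mangled[name] = quote_string_param(value)
        else:
            # Rule 3: anything else is converted to its text form.
            mangled[name] = str(value)
    return mangled
```

With this, `{'phase': 'mandatory'}` becomes `{'phase': "'mandatory'"}`, which is the form the message says currently has to be written by hand. A dict is used rather than keyword arguments because names such as `generate-paths` are not valid Python identifiers.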
:) I know it's not - that's why I'm asking these questions. But it seems easier to me to reuse the existing stuff than replicating the very same functionality. Why not make this stuff a little friendlier for subclassing? Also, for the _setGeneric case I actually added the class _SettableLogEntry(_LogEntry) to make this minimally intrusive for the existing infrastructure; I just seemed to get it wrong regarding the _utf8() stuff.
Of course we can do that, but then we need to basically reimplement _BaseErrorLog, _ListErrorLog and _ErrorLog, maybe combined into one class, minus the C level entry points. Is there much gain in this? Or, in other words: What's lost by exposing _buildExceptionMessage() to the python side?
I need to retrieve the errors in the __call__ method anyway to see if the validation result is true, so why not store it right away? Holger
jholg@gmx.de, 17.12.2009 12:21:
All 'true' sounds like a good default, that seems to make all of the above "uncommon" enough to drop them into a dict parameter.
iso_abstract_expand.xsl: <xslt:param name="schema-id"></xslt:param>
Is there any visible need to override that? If not, I'd just drop it completely.
I just want to keep users from having to pass a dict in /most/ cases. The ones above do not seem to be that unuseful.
I'm fine with the first. It's just the same as supporting QName for tags, CDATA for text and strparam() for XSLT parameters. Passing an XPath object is really hard to misinterpret (at least a lot harder than a plain string value).
Well, there isn't that much functionality in there, really. And at the point where you need to do the conversion, you'd probably already know in what kind of errors the user is interested (see the filter_*() methods), so having a custom class here isn't all that wrong. I agree that it's worth trying to make the existing classes a little friendlier to subclasses, though, that might help already.
Because I expect that many users won't be interested in the exact errors and can live with a boolean predicate result. Extracting the information if errors were found at all is a simple and fast XPath search with a boolean result, right? Stefan
This is for: """ It also * extracts a particular schema using an ID, where there are multiple schemas, such as when they are embedded in the same NVDL script """ No need to expose as extra kwarg for this.
These seem mostly to be for trimming the svrl output somewhat, for providing additional information to the bare-bones validation failure messages. I'm pretty much a Schematron beginner so I'm not too sure which would be worth exposing. One exception, though: 'phase' is something I plan to make extensive use of in my use case, as this allows grouping validation patterns and gives you a mechanism to selectively validate. So this would be my only candidate for an extra keyword arg.
Ok, I'll look at the XSLT implementation. I take it you don't see any value in keeping the xsl parameter handling compatible to what you normally have to hand to etree.XSLT as stylesheet parameters?
For the error log stuff the only thing I needed to access from python was cpdef _buildExceptionMessage(self, default_message): in the isoschematron._Validator class. This need would go away if I could subclass etree._Validator and access _Validator._error_log from Python, so that I can call self._error_log.receive(logEntry)
Currently, the XPath searches all failed-assert elements, which are the actual error messages put into the error log:

    svrl_validation_errors = _etree.XPath(
        '//svrl:failed-assert', namespaces={'sch': SCHEMATRON_NS, 'svrl': SVRL_NS})

This could be easily enough changed to return a boolean, with above xpath being used only for accessing the error_log property. Of course, if subclassing etree._Validator, lazy extraction would then mean to override error_log access. As a side effect, lazy error_log extraction would mean to always need to store the result report (this makes store_report arg obsolete).

Then again, all the other validators return a simple boolean true, store any validation error message in ._error_log during __call__() and return a copy of this on .error_log() access.

Holger
jholg@gmx.de, 17.12.2009 15:05:
Fine with me. Note that we can usually add additional keyword arguments at the end if we notice that people use them a lot. If provided, their value would then override the keywords passed as dict.
I'm not sure what exactly you mean here. I'm fine with *extending* the current functionality with something that is useful but doesn't currently work. I'm also all for making the schematron interface more specific (and more usable) than the generic XSLT interface. That's the whole point of integrating the stylesheets, after all.
Actually, looking through the code, I think "_receiveGeneric()" was originally used but then replaced by a locally constructed xmlError and a call to _receive(), so it's actually a dead method by now. We could provide _Validator with an _append_log_message() method that basically calls it. Would that solve the issue?
If False, the report would simply be evaluated and deleted immediately after the run (should be easy to do by accessing the error_log once if it's evaluated lazily :) I think it should be False by default, BTW. If users want the report, it's easy to enable it.
The difference is that the errors are collected in the log during the run. Here, they are extracted from the result *after* running the validation. Ok, let's make that a potential optimisation, not a requirement. I'm fine with having the error log extracted immediately after validation and throwing away the result document if the users asked to do so by passing the option. I would guess that the XSLT based validation is already heavy enough anyway. Stefan
Ok.
What I mean is that adding magic to the handling of stylesheet parameters will not let one reuse the very same parameters (dict or keyword) when performing the steps manually. This can of course be done by just using the existing stylesheets, and the isoschematron package also exposes the steps as globals:

    # the iso-schematron skeleton implementation steps aka xsl transformations
    extract_from_xsd = _etree.XSLT(_etree.parse(
        os.path.join(_resources_dir, 'xsl', 'XSD2Schtrn.xsl')))
    extract_from_rng = _etree.XSLT(_etree.parse(
        os.path.join(_resources_dir, 'xsl', 'RNG2Schtrn.xsl')))
    iso_dsdl_include = _etree.XSLT(_etree.parse(
        os.path.join(_resources_dir, 'xsl', 'iso-schematron-xslt1',
                     'iso_dsdl_include.xsl')))
    iso_abstract_expand = _etree.XSLT(_etree.parse(
        os.path.join(_resources_dir, 'xsl', 'iso-schematron-xslt1',
                     'iso_abstract_expand.xsl')))
    iso_svrl_for_xslt1 = _etree.XSLT(_etree.parse(
        os.path.join(_resources_dir, 'xsl', 'iso-schematron-xslt1',
                     'iso_svrl_for_xslt1.xsl')))

    # if you want to use another "meta-stylesheet" for compilation to xslt, plug it
    # here
    iso_compile2xslt = iso_svrl_for_xslt1
Not sure I follow. __call__ runs the xslt on the input data and produces the svrl report aka xsl result document. Now, if I want lazy error log extraction I need to store this result report.
I think it should be False by default, BTW. If users want the report, it's easy to enable it.
Fine with me. But then, I need to put the errors into the error log before throwing away the report: No lazy error log extraction.
The difference is that the errors are collected in the log during the run. Here, they are extracted from the result *after* running the validation.
I see. So you'd say the overhead of putting all the errors into the error log one by one in __call__ is expensive to a degree and we should avoid that.
For lazy error log extraction we need to store the validation report. So maybe we could just compromise: If the user opts for storing the result report, error_log will use lazy extraction; if not, the error log needs to be set inside __call__. Classical trade-off of memory vs. speed :) Holger
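The compromise described above might look roughly like the following. This is a sketch under stated assumptions, not the implementation on the branch: `ReportChecker` is a hypothetical class that receives an already-produced SVRL report (parsed with the standard library's `xml.etree.ElementTree` so that it runs without lxml), and its error log is just a list of message strings.

```python
import xml.etree.ElementTree as ET

NS = {"svrl": "http://purl.oclc.org/dsdl/svrl"}


class ReportChecker:
    """Hypothetical validator tail: turns an SVRL report into a result."""

    def __init__(self, store_report=False):
        self.store_report = store_report
        self.validation_report = None
        self._errors = None

    def __call__(self, report_root):
        # Cheap boolean check: is there at least one failed assertion?
        valid = report_root.find(".//svrl:failed-assert", NS) is None
        if self.store_report:
            # Keep the report and defer extraction (costs memory).
            self.validation_report = report_root
            self._errors = None
        else:
            # Extract now so the report can be thrown away.
            self.validation_report = None
            self._errors = self._extract(report_root)
        return valid

    @staticmethod
    def _extract(report_root):
        return [fa.findtext("svrl:text", default="", namespaces=NS).strip()
                for fa in report_root.findall(".//svrl:failed-assert", NS)]

    @property
    def error_log(self):
        # Lazy path: extract on first access when a report was stored.
        if self._errors is None and self.validation_report is not None:
            self._errors = self._extract(self.validation_report)
        return list(self._errors or [])
```

The sketch sidesteps the complication noted in the follow-up message: real log entries also need the file name of the validated tree, which the lazy path would have to remember until error_log is first read.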
After looking at the code I feel this solution's implementation would be quite a bit clumsier compared to now, e.g. because you need to save the file uri/name of the validated tree in the lazy-extraction case for later reuse when error_log is first accessed. So I don't think it's worth the effort now. Holger
Hi,
I took a look: A method to clear the error log from the subclass would also be needed. Holger
Hi Stefan,
Are you really sure you want that for XSLT.__call__? Because that means looping through the arg dict on every invocation of the stylesheet, doesn't it? What about just providing a helper function that takes keyword args and does the stylesheet parameter mangling? This could then be used in isoschematron.Schematron() and wherever else someone needs it. Holger
Hi
Well, how do you think lxml passes the parameters to libxslt? Look at the _run_transform() method.
Yeah, not my brightest hour there...

Anyway: Committed revision 70244.

I've now updated the isoschematron implementation to
- inherit from etree._Validator
- use simple arguments and automagically convert them to stylesheet parameters
- add the 'phase' keyword arg

Also, etree.XSLT now accepts XPath objects as stylesheet parameters - test & doc updated to reflect this.

These are the changes I made on the core lxml files since branching:

$ svn diff -N --old=http://codespeak.net/svn/lxml/trunk/src/lxml/@69913 --new=src/lxml
Index: src/lxml/xslt.pxi
===================================================================
--- src/lxml/xslt.pxi (.../http://codespeak.net/svn/lxml/trunk/src/lxml) (revision 69913)
+++ src/lxml/xslt.pxi (.../src/lxml) (working copy)
@@ -609,7 +609,10 @@
                     xslt.xsltQuoteOneUserParam(
                         transform_ctxt, _cstr(k), _cstr(v))
                 else:
-                    v = _utf8(value)
+                    if isinstance(value, XPath):
+                        v = _utf8((<XPath>value).path)
+                    else:
+                        v = _utf8(value)
                     params[i] = _cstr(k)
                     i += 1
                     params[i] = _cstr(v)
Index: src/lxml/xmlerror.pxi
===================================================================
--- src/lxml/xmlerror.pxi (.../http://codespeak.net/svn/lxml/trunk/src/lxml)(revision 69913)
+++ src/lxml/xmlerror.pxi (.../src/lxml) (working copy)
@@ -102,6 +102,12 @@
         def __get__(self):
             return ErrorLevels._getName(self.level, u"unknown")

+#FIXME: Can _LogEntry be settable itself so we don't need this?
+cdef class _SettableLogEntry(_LogEntry):
+    cpdef setGeneric(self, int domain, int type, int level, int line,
+                     message, filename):
+        self._setGeneric(domain, type, level, line, message, filename)
+
 cdef class _BaseErrorLog:
     cdef _LogEntry _first_error
     cdef readonly object last_error
Index: src/lxml/lxml.etree.pyx
===================================================================
--- src/lxml/lxml.etree.pyx (.../http://codespeak.net/svn/lxml/trunk/src/lxml)(revision 69913)
+++ src/lxml/lxml.etree.pyx (.../src/lxml) (working copy)
@@ -2783,6 +2783,14 @@
             raise AssertionError, self._error_log._buildExceptionMessage(
                 u"Document does not comply with schema")

+    cpdef _append_log_message(self, int domain, int type, int level, int line,
+                              message, filename):
+        self._error_log._receiveGeneric(domain, type, level, line, message,
+                                        filename)
+
+    cpdef _clear_error_log(self):
+        self._error_log.clear()
+
     property error_log:
         u"The log of validation errors and warnings."
         def __get__(self):

Holger
data:image/s3,"s3://crabby-images/4cf20/4cf20edf9c3655e7f5c4e7d874c5fdf3b39d715f" alt=""
Hi Holger, thanks for bringing this up. I'm all in favour of doing this. jholg@gmx.de, 25.11.2009 11:38:
I suggest complementing the current lxml schematron support using this approach: - add the necessary stylesheets (extraction and skeleton implementation) to the lxml codebase
It looks like the license is ok for inclusion. The basic restrictions are that the author wants to keep his name in the sources and that modifications must be recognisable as such. So shipping the files verbatimly should not do any harm to lxml's users.
This sounds like it can easily be done in plain Python code, so I'd appreciate having a separate "lxml.schematron" module that implements this. It should mimic the existing validator APIs as much as possible. Additional parameters in the validator constructor are fine. I imagine that the result document could be represented by a special class that helps in interpreting errors, although I haven't actually looked into this any deeper. Holger, could you open a bug report for this? Stefan
data:image/s3,"s3://crabby-images/776d2/776d27937dcc62255199c99b76119d7f75ea96e4" alt=""
Done: https://bugs.launchpad.net/lxml/+bug/488222 Given that I'm currently establishing schematron validations embedded in an XML Schema we're using here I will need this tool chain anyhow, sooner or later. So unless you need a nice little after work project yourself (or have already started hacking away ;) I'd volunteer to come up with an implementation for this (on a branch). Holger -- GRATIS für alle GMX-Mitglieder: Die maxdome Movie-FLAT! Jetzt freischalten unter http://portal.gmx.net/de/go/maxdome01
data:image/s3,"s3://crabby-images/776d2/776d27937dcc62255199c99b76119d7f75ea96e4" alt=""
Hi,
Oh, no further need for glory and fame on my side - just go ahead. :)
Finally got around working on this, just checked in for review: Committed revision 70160. URL: http://codespeak.net/svn/lxml/branch/iso-schematron This comes complete with doc updates, unittests and whatnot. Some notes: * implemented as package lxml.schematron to cleanly bundle all the xsl/rng resources * the API allows for handing stylesheet parameters to the separate schematron-to-xsl compilation steps. Currently, these must be provided with stylesheet parameter properties in mind, i.e. text parameters must be given like {'phase': "'mandatory'"}. As all the stylesheet parameters seem to be text parameters, maybe we could make this a little more convenient and auto-strparam() everything in the arg dicts. But I think it's not really worth the effort and better to stick to normal stylesheet parameter handling * I had to modify xmlerror.pxi to stay as close to the original validators' workings as possible: $ svn diff --old=http://codespeak.net/svn/lxml/trunk/src/lxml/xmlerror.pxi@HEAD --new=src/lxml/xmlerror.pxi Index: src/lxml/xmlerror.pxi =================================================================== --- src/lxml/xmlerror.pxi (.../http://codespeak.net/svn/lxml/trunk/src/lxml/xmlerror.pxi) (revision 70161) +++ src/lxml/xmlerror.pxi (.../src/lxml/xmlerror.pxi) (working copy) @@ -71,6 +71,8 @@ else: self.filename = _decodeFilename(error.file) + #FIXME: This seems not to have been used anywhere, so far. Is my addition + #FIXME: of _utf8()-ing message & filename correct? 
cdef _setGeneric(self, int domain, int type, int level, int line, message, filename): self.domain = domain @@ -78,8 +80,8 @@ self.level = level self.line = line self.column = 0 - self.message = message - self.filename = filename + self.message = _utf8(message) + self.filename = _utf8(filename) def __repr__(self): return u"%s:%d:%d:%s:%s:%s: %s" % ( @@ -102,6 +104,12 @@ def __get__(self): return ErrorLevels._getName(self.level, u"unknown") +#FIXME: Can _LogEntry be settable itself so we don't need this? +cdef class _SettableLogEntry(_LogEntry): + cpdef setGeneric(self, int domain, int type, int level, int line, + message, filename): + self._setGeneric(domain, type, level, line, message, filename) + cdef class _BaseErrorLog: cdef _LogEntry _first_error cdef readonly object last_error @@ -172,7 +180,7 @@ message = u"%s, line %d" % (message, line) return exctype(message, code, line, column) - cdef _buildExceptionMessage(self, default_message): + cpdef _buildExceptionMessage(self, default_message): if self._first_error is None: return default_message if self._first_error.message is not None and self._first_error.message: I'm not too sure about these changes so here are some questions: * Ok to cpdef _buildExceptionMessage() instead of cdef? * Instead of adding _SettableLogEntry, would it also be ok to just cpdef _LogEntry.setGeneric? Also, I added a Validator class to isoschematron that mimics etree._Validator. This wouldn't be necessary if etree._Validator made _error_log accessible from python. Can we just do that? Have fun Holger -- Jetzt kostenlos herunterladen: Internet Explorer 8 und Mozilla Firefox 3.5 - sicherer, schneller und einfacher! http://portal.gmx.net/de/go/chbrowser
Hi Holger, jholg@gmx.de, 17.12.2009 01:03:
Cool, thanks!
* implemented as package lxml.schematron to cleanly bundle all the xsl/rng resources
That's ok. I was about to propose calling it "lxml.isoschematron" when I noticed that that was what you meant anyway. :)
I prefer simplifying the interface here and making them all string parameters. I just skimmed through your code, it looks like you want users to pass dicts instead of regular keyword arguments. Why is that?
That won't work. Neither the message nor the filename need to be compatible with the UTF-8 encoding.
I wonder why this is necessary anyway. Can't we just reuse the error log of the underlying XSLT object? I don't expect that we need to generate any log messages ourselves, do we? Stefan
Hi Stefan,
Because I want/need to separate the attributes for the steps include, abstract, compile. If we don't use dicts, how to distinguish these? I don't like using naming conventions here. What if any future version of the skeleton implementation resorts to using some numeric xslt args? We'd at least need to only auto-strparam() string arguments. But what if a parameter taking an xpath expression should show up someday? We can't really discriminate this from a normal string parameter; it's also not possible to hand in an etree.XPath object (just tried it). The only option I see here is to provide another switch to turn off auto-strparam()ing, if need be, defaulting to False.
I wondered about that. Maybe a misunderstanding of mine of what _utf8 is supposed to do. This just ensures that a string is 7-bit-ascii or unicode and returns it utf8-encoded, right? Now, with _setGeneric becoming public in one or the other way, don't you need to store message and filename in a well-known encoding? Or is it the caller's responsibility to know the encoding? What happens to unicode parameters?
The XSLT object returns a perfectly valid XSLT result tree. What makes it a schematron validation error is what's then selected from the result using an XPath expression (which is exposed in the package if someone has different selection needs or chooses to write their own "meta-stylesheet" instead of iso_svrl_for_xslt1.xsl, which might produce different output):

# svrl result accessors
svrl_validation_errors = _etree.XPath(
    '//svrl:failed-assert', namespaces={'sch': SCHEMATRON_NS, 'svrl': SVRL_NS})

What we retrieve as a result from this xpath are the validation errors, which then need to be put into the Schematron._error_log; this is why I need to manually access _error_log in the subclass, and also why it needs to be setGeneric()-able.

Btw: The validation result doc is accessible as property validation_report if store_report is true (default). I didn't wrap this in a special class as
- I can't really imagine what one might want to actually extract from the svrl report
- it's just so easy to get to what you want using XPath on this result tree

One thing I forgot to mention: I haven't tested with Python 3, as I currently don't have an installation.

Holger
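To make the selection step concrete, here is a minimal sketch that pulls svrl:failed-assert elements out of a report. It uses the standard library's ElementTree instead of lxml so that it runs without the branch installed, and the report document is a hand-written sample, not actual output of the skeleton stylesheets.

```python
import xml.etree.ElementTree as ET

SVRL_NS = "http://purl.oclc.org/dsdl/svrl"

# Hand-written sample report, for illustration only.
report = ET.fromstring("""\
<svrl:schematron-output xmlns:svrl="http://purl.oclc.org/dsdl/svrl">
  <svrl:fired-rule context="item"/>
  <svrl:failed-assert test="@id" location="/items/item[2]">
    <svrl:text>An item must have an id.</svrl:text>
  </svrl:failed-assert>
</svrl:schematron-output>
""")

ns = {"svrl": SVRL_NS}

# Every failed assertion in the report counts as one validation error.
failures = report.findall(".//svrl:failed-assert", ns)
is_valid = not failures

for failure in failures:
    text = failure.findtext("svrl:text", default="", namespaces=ns)
    print(failure.get("location"), text.strip())
```

With lxml the same selection would go through the compiled XPath object shown above; the namespace mapping is the part that carries over.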
jholg@gmx.de, 17.12.2009 10:57:
Why not make the most common (and non-overlapping) parameters keyword arguments and use the dicts only as fallbacks? We could rename them to "additional_..._parameters" or something.
That's a good idea, though. Passing an XPath object in would simply let the parameter mangling code extract the underlying unparsed XPath expression. It's some unnecessary work if you don't actually want to use the pre-parsed expression, but it's definitely explicit.
The only option I see here is to provide another switch to turn off auto-strparam()ing, if need be, defaulting to False.
Ugly.
Right, it's a user input validation and normalisation function. Sort of the opposite of funicode().
Now, with _setGeneric becoming public in one or the other way
You say that like it was decided. It's a totally internal thing that shouldn't get exposed to Python space.
Why don't you just create a fake error log? There's nothing that requires that the error_log property is of type _ErrorLog or that it receives its error messages in the normal C level way. If a fake-log isn't easy to do, I'm fine with making that simpler, but I'm against making C level APIs public just for a case like this.
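One possible shape for such a fake log, as a rough sketch: a plain Python object that is filled from Python code and exposes a small log-like interface. FakeErrorLog, FakeLogEntry and their fields are hypothetical names chosen for this example; they are not lxml classes and do not reproduce the full _ErrorLog interface.

```python
from collections import namedtuple

# Hypothetical entry type; the field names are chosen for this sketch only.
FakeLogEntry = namedtuple("FakeLogEntry", "filename line message")


class FakeErrorLog:
    """Pure-Python log that is filled from Python code, not from C callbacks."""

    def __init__(self):
        self._entries = []

    def receive(self, message, filename=None, line=0):
        """Append one entry to the log."""
        self._entries.append(FakeLogEntry(filename, line, message))

    def clear(self):
        """Drop all entries, e.g. before a new validation run."""
        del self._entries[:]

    @property
    def last_error(self):
        return self._entries[-1] if self._entries else None

    def __iter__(self):
        return iter(self._entries)

    def __len__(self):
        return len(self._entries)


log = FakeErrorLog()
log.receive("An item must have an id.", filename="items.xml", line=7)
print(len(log), log.last_error.message)
```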
I'm actually for extracting the error log lazily when the error_log property is first read, so storing the unmodified result document sounds like a good idea to me.
One thing I forgot to mention: I haven't tested with Python 3, as I currently don't have an installation.
That'll come. Stefan
Hi,
These are the parameters of the involved xsl stylesheets:

iso_dsdl_include.xsl:
<xsl:param name="include-schematron">true</xsl:param>
<xsl:param name="include-crdl">true</xsl:param>
<xsl:param name="include-xinclude">true</xsl:param>
<xsl:param name="include-dtll">true</xsl:param>
<xsl:param name="include-relaxng">true</xsl:param>
<xsl:param name="include-xlink">true</xsl:param>

iso_abstract_expand.xsl:
<xslt:param name="schema-id"></xslt:param>

iso_svrl_for_xslt1.xsl:
<xsl:param name="diagnose" >true</xsl:param>
<xsl:param name="phase" > [...] </xsl:param>
<xsl:param name="allow-foreign" >false</xsl:param>
<xsl:param name="generate-paths" >true</xsl:param>
<xsl:param name="generate-fired-rule" >true</xsl:param>
<xsl:param name="optimize"/>
<xsl:param name="output-encoding" ></xsl:param>

One thing is that there are so many, another thing is to decide which ones are the "most common". As it stands, isoschematron already has quite a lot of parameters.
Do you mean for etree.XSLT to allow for XPath object arguments (that's what I meant), or for isoschematron to extract the path using XPath.path? Supposing you meant the latter, the rules would then be:
- if an arg is a string, auto-strparam() it
- if an arg is an XPath object, just use its .path property
- else use unicode(<arg>) (in a py3 compatible way)

This would add a little more convenience to the parameter passing; I'm still not convinced that this shouldn't/couldn't be rather just addressed in the documentation. The advantage of not implementing the magic is that I can use the very same arg dictionary with isoschematron.Schematron() as with any other XSLT transform.
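The three rules can be sketched in a few lines of plain Python. XPathExpr is a stand-in for a pre-parsed XPath object that exposes its source expression as .path, and mangle_param is a hypothetical name; neither is lxml's actual implementation.

```python
class XPathExpr:
    """Stand-in for a pre-parsed XPath object exposing its source as .path."""

    def __init__(self, path):
        self.path = path


def mangle_param(value):
    """Turn a Python value into an XSLT parameter, i.e. an XPath expression."""
    if isinstance(value, XPathExpr):
        # Rule 2: an explicit XPath object is passed through unquoted.
        return value.path
    if isinstance(value, str):
        # Rule 1: plain strings become quoted XPath string literals.
        quote = '"' if "'" in value else "'"
        if quote in value:
            raise ValueError("value contains both single and double quotes")
        return quote + value + quote
    # Rule 3: anything else is converted to its string form, e.g. numbers.
    return str(value)


print(mangle_param("mandatory"))
print(mangle_param(XPathExpr("//item/@id")))
print(mangle_param(3))
```

The point of the explicit wrapper type is that a string and an expression can no longer be confused: only values the caller has marked as XPath are evaluated as such.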
:) I know it's not - that's why I'm asking these questions. But it seems easier to me to reuse the existing stuff than replicating the very same functionality. Why not make this stuff a little friendlier for subclassing? Also, for the _setGeneric case I actually added the class _SettableLogEntry(_LogEntry) to make this minimally intrusive for the existing infrastructure; I just seemed to get it wrong regarding the _utf8() stuff.
Of course we can do that, but then we need to basically reimplement _BaseErrorLog, _ListErrorLog and _ErrorLog, maybe combined into one class, minus the C level entry points. Is there much gain in this? Or, in other words: What's lost by exposing _buildExceptionMessage() to the python side?
I need to retrieve the errors in the __call__ method anyway to see if the validation result is true, so why not store it right away?

Holger
jholg@gmx.de, 17.12.2009 12:21:
All 'true' sounds like a good default, that seems to make all of the above "uncommon" enough to drop them into a dict parameter.
iso_abstract_expand.xsl: <xslt:param name="schema-id"></xslt:param>
Is there any visible need to override that? If not, I'd just drop it completely.
I just want to keep users from having to pass a dict in /most/ cases. The ones above do not seem to be that unuseful.
I'm fine with the first. It's just the same as supporting QName for tags, CDATA for text and strparam() for XSLT parameters. Passing an XPath object is really hard to misinterpret (at least a lot harder than a plain string value).
Well, there isn't that much functionality in there, really. And at the point where you need to do the conversion, you'd probably already know in what kind of errors the user is interested (see the filter_*() methods), so having a custom class here isn't all that wrong. I agree that it's worth trying to make the existing classes a little friendlier to subclasses, though, that might help already.
Because I expect that many users won't be interested in the exact errors and can live with a boolean predicate result. Extracting the information if errors were found at all is a simple and fast XPath search with a boolean result, right? Stefan
This is for:

"""
It also
* extracts a particular schema using an ID, where there are multiple schemas, such as when they are embedded in the same NVDL script
"""

No need to expose this as an extra kwarg.
These seem mostly to be for trimming the svrl output somewhat, or for providing additional information to the bare-bones validation failure messages. I'm pretty much a Schematron beginner so I'm not too sure which would be worth exposing. One exception, though: 'phase' is something I plan to make extensive use of in my use case, as this allows grouping validation patterns and gives you a mechanism to selectively validate. So this would be my only candidate for an extra keyword arg.
Ok, I'll look at the XSLT implementation. I take it you don't see any value in keeping the xsl parameter handling compatible to what you normally have to hand to etree.XSLT as stylesheet parameters?
For the error log stuff the only thing I needed to access from python was

cpdef _buildExceptionMessage(self, default_message):

in the isoschematron._Validator class. This need would go away if I could subclass etree._Validator and access _Validator._error_log from Python, so that I can call self._error_log.receive(logEntry)
Currently, the XPath searches all failed-assert elements, which are the actual error messages put into the error log:

svrl_validation_errors = _etree.XPath(
    '//svrl:failed-assert', namespaces={'sch': SCHEMATRON_NS, 'svrl': SVRL_NS})

This could be easily enough changed to return a boolean, with above xpath being used only for accessing the error_log property. Of course, if subclassing etree._Validator, lazy extraction would then mean to override error_log access. As a side effect, lazy error_log extraction would mean to always need to store the result report (this makes store_report arg obsolete). Then again, all the other validators return a simple boolean true, store any validation error message in ._error_log during __call__() and return a copy of this on .error_log() access.

Holger
jholg@gmx.de, 17.12.2009 15:05:
Fine with me. Note that we can usually add additional keyword arguments at the end if we notice that people use them a lot. If provided, their value would then override the keywords passed as dict.
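The override rule described here can be sketched as a small merge step. build_compile_params and its argument names are hypothetical and only illustrate the precedence: a dedicated keyword argument, when given, wins over the same key in the fallback dict.

```python
def build_compile_params(compile_params=None, phase=None):
    """Merge the fallback dict with dedicated keyword arguments.

    Both the function name and the 'compile_params' argument are
    assumptions made for this sketch.
    """
    # Copy so the caller's dict is never modified.
    params = dict(compile_params or {})
    if phase is not None:
        # An explicit keyword overrides the value passed in the dict.
        params["phase"] = phase
    return params


print(build_compile_params({"phase": "draft", "diagnose": "false"},
                           phase="mandatory"))
```

Leaving the keyword at None keeps whatever the dict says, so existing callers that only pass a dict would see no change in behaviour.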
I'm not sure what exactly you mean here. I'm fine with *extending* the current functionality with something that is useful but doesn't currently work. I'm also all for making the schematron interface more specific (and more usable) than the generic XSLT interface. That's the whole point of integrating the stylesheets, after all.
Actually, looking through the code, I think "_receiveGeneric()" was originally used but then replaced by a locally constructed xmlError and a call to _receive(), so it's actually a dead method by now. We could provide _Validator with an _append_log_message() method that basically calls it. Would that solve the issue?
If False, the report would simply be evaluated and deleted immediately after the run (should be easy to do by accessing the error_log once if it's evaluated lazily :) I think it should be False by default, BTW. If users want the report, it's easy to enable it.
The difference is that the errors are collected in the log during the run. Here, they are extracted from the result *after* running the validation. Ok, let's make that a potential optimisation, not a requirement. I'm fine with having the error log extracted immediately after validation and throwing away the result document if the users asked to do so by passing the option. I would guess that the XSLT based validation is already heavy enough anyway. Stefan
Ok.
What I mean is adding magic to the handling of stylesheet parameters will not let one reuse the very same parameters (dict or keyword) when performing steps manually. This can of course be done by just using the existing stylesheets, and the isoschematron package also exposes the steps as globals:

# the iso-schematron skeleton implementation steps aka xsl transformations
extract_from_xsd = _etree.XSLT(_etree.parse(
    os.path.join(_resources_dir, 'xsl', 'XSD2Schtrn.xsl')))
extract_from_rng = _etree.XSLT(_etree.parse(
    os.path.join(_resources_dir, 'xsl', 'RNG2Schtrn.xsl')))
iso_dsdl_include = _etree.XSLT(_etree.parse(
    os.path.join(_resources_dir, 'xsl', 'iso-schematron-xslt1',
                 'iso_dsdl_include.xsl')))
iso_abstract_expand = _etree.XSLT(_etree.parse(
    os.path.join(_resources_dir, 'xsl', 'iso-schematron-xslt1',
                 'iso_abstract_expand.xsl')))
iso_svrl_for_xslt1 = _etree.XSLT(_etree.parse(
    os.path.join(_resources_dir, 'xsl', 'iso-schematron-xslt1',
                 'iso_svrl_for_xslt1.xsl')))
# if you want to use another "meta-stylesheet" for compilation to xslt, plug it
# here
iso_compile2xslt = iso_svrl_for_xslt1
Not sure I follow. __call__ runs the xslt on the input data and produces the svrl report aka xsl result document. Now, if I want lazy error log extraction I need to store this result report.
I think it should be False by default, BTW. If users want the report, it's easy to enable it.
Fine with me. But then, I need to put the errors into the error log before throwing away the report: No lazy error log extraction.
The difference is that the errors are collected in the log during the run. Here, they are extracted from the result *after* running the validation.
I see. So you'd say the overhead of putting all the errors into the error log one by one in __call__ is expensive to a degree and we should avoid that.
For lazy error log extraction we need to store the validation report. So maybe we could just compromise: If the user opts for storing the result report, error_log will use lazy extraction; if not, the error log needs to be set inside __call__. Classical trade-off of memory vs speed :)

Holger
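A rough sketch of this compromise, with everything simplified: SketchValidator is a hypothetical class, the run callable stands in for the XSLT transformation, and the "report" is just a list of failure messages instead of an SVRL tree.

```python
class SketchValidator:
    """Hypothetical validator; 'run' stands in for the XSLT transformation."""

    def __init__(self, run, store_report=False):
        self._run = run
        self._store_report = store_report
        self.validation_report = None
        self._error_log = []

    def __call__(self, document):
        report = self._run(document)  # here: a list of failure messages
        if self._store_report:
            # Keep the report and defer building the log until it is read.
            self.validation_report = report
            self._error_log = None
        else:
            # Build the log right away so the report can be thrown away.
            self.validation_report = None
            self._error_log = self._extract(report)
        return not report

    @staticmethod
    def _extract(report):
        return ["error: " + message for message in report]

    @property
    def error_log(self):
        if self._error_log is None:
            # Lazy path: first access after a run with store_report=True.
            self._error_log = self._extract(self.validation_report)
        return list(self._error_log)


lazy = SketchValidator(lambda doc: ["missing id"], store_report=True)
print(lazy("<items/>"), lazy.error_log)
```

The sketch leaves out the bookkeeping a real implementation would need for the lazy path, such as remembering the file name of the validated tree.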
After looking at the code I feel this solution's implementation would be quite a bit clumsier compared to now, e.g. because you need to save the file uri/name of the validated tree in the lazy-extraction case for later reuse when error_log is first accessed. So I don't think it's worth the effort now.

Holger
Hi,
I took a look: A method to clear the error log from the subclass would also be needed.

Holger
Hi Stefan,
Are you really sure you want that for XSLT.__call__? Because that means looping through the arg dict on every invocation of the stylesheet, doesn't it? What about just providing a helper function that takes keyword args and does the stylesheet parameter mangling? This could then be used in isoschematron.Schematron() and wherever else someone needs it.

Holger
Hi
Well, how do you think lxml passes the parameters to libxslt? Look at the _run_transform() method.
Yeah, not my brightest hour there...

Anyway:

Committed revision 70244.

I've now updated the isoschematron implementation to
- inherit from etree._Validator
- use simple arguments and automagically convert them to stylesheet parameters
- add the 'phase' keyword arg

Also, etree.XSLT now accepts XPath objects as stylesheet parameters - test & doc updated to reflect this.

These are the changes I made on the core lxml files since branching:

$ svn diff -N --old=http://codespeak.net/svn/lxml/trunk/src/lxml/@69913 --new=src/lxml
Index: src/lxml/xslt.pxi
===================================================================
--- src/lxml/xslt.pxi (.../http://codespeak.net/svn/lxml/trunk/src/lxml) (revision 69913)
+++ src/lxml/xslt.pxi (.../src/lxml) (working copy)
@@ -609,7 +609,10 @@
                     xslt.xsltQuoteOneUserParam(
                         transform_ctxt, _cstr(k), _cstr(v))
                 else:
-                    v = _utf8(value)
+                    if isinstance(value, XPath):
+                        v = _utf8((<XPath>value).path)
+                    else:
+                        v = _utf8(value)
                     params[i] = _cstr(k)
                     i += 1
                     params[i] = _cstr(v)
Index: src/lxml/xmlerror.pxi
===================================================================
--- src/lxml/xmlerror.pxi (.../http://codespeak.net/svn/lxml/trunk/src/lxml) (revision 69913)
+++ src/lxml/xmlerror.pxi (.../src/lxml) (working copy)
@@ -102,6 +102,12 @@
         def __get__(self):
             return ErrorLevels._getName(self.level, u"unknown")
 
+#FIXME: Can _LogEntry be settable itself so we don't need this?
+cdef class _SettableLogEntry(_LogEntry):
+    cpdef setGeneric(self, int domain, int type, int level, int line,
+                     message, filename):
+        self._setGeneric(domain, type, level, line, message, filename)
+
 cdef class _BaseErrorLog:
     cdef _LogEntry _first_error
     cdef readonly object last_error
Index: src/lxml/lxml.etree.pyx
===================================================================
--- src/lxml/lxml.etree.pyx (.../http://codespeak.net/svn/lxml/trunk/src/lxml) (revision 69913)
+++ src/lxml/lxml.etree.pyx (.../src/lxml) (working copy)
@@ -2783,6 +2783,14 @@
             raise AssertionError, self._error_log._buildExceptionMessage(
                 u"Document does not comply with schema")
 
+    cpdef _append_log_message(self, int domain, int type, int level, int line,
+                              message, filename):
+        self._error_log._receiveGeneric(domain, type, level, line, message,
+                                        filename)
+
+    cpdef _clear_error_log(self):
+        self._error_log.clear()
+
     property error_log:
         u"The log of validation errors and warnings."
         def __get__(self):

Holger
participants (2)
-
jholg@gmx.de
-
Stefan Behnel