[XML-SIG] PyXML XPath woes
Mike Brown
mike at skew.org
Sun Feb 8 01:18:04 EST 2004
Matt Patterson wrote:
> As an aside, I originally did the whole thing using PyXML but the XPath
> were too complex for it (example below) and it would return no results!
> I now pre-process the file by running with the complex XPaths through
> libxslt to add the boundary attributes using the complex XPath, and
> then searching for the attributes with PyXML. This is the XPath to find
> the boundary nodes without help:
>
> //H1[ancestor::boxtexttable = false()][ancestor::casestudy =
> false()][ancestor::casetexttable = false()][ancestor::checklist =
> false()]|//H2[ancestor::boxtexttable = false()][ancestor::casestudy =
> false()][ancestor::casetexttable = false()][ancestor::checklist =
> false()][preceding-sibling::*[1][name() !=
> 'H1']]|//H2[ancestor::boxtexttable = false()][ancestor::casestudy =
> false()][ancestor::casetexttable = false()][ancestor::checklist =
> false()][count(preceding-sibling::*) = 0]|//H3[ancestor::boxtexttable =
> false()][ancestor::casestudy = false()][ancestor::casetexttable =
> false()][ancestor::checklist = false()][preceding-sibling::*[1][name()
> != 'H2'][name() != 'H1']]|//H3[ancestor::boxtexttable =
> false()][ancestor::casestudy = false()][ancestor::casetexttable =
> false()][ancestor::checklist = false()][count(preceding-sibling::*) =
> 0]
Some XPath hints for you here...
1. These predicates don't have to be chained. For example, instead of
[ancestor::boxtexttable = false()][ancestor::casestudy = false()]
[ancestor::casetexttable = false()][ancestor::checklist = false()]
[preceding-sibling::*[1][name() != 'H1']]
you could just say
[not(ancestor::boxtexttable or ancestor::casestudy or ancestor::casetexttable
or ancestor::checklist) and name(preceding-sibling::*[1]) != 'H1']
2. count(preceding-sibling::*) = 0
is more succinctly written as not(preceding-sibling::*)
However, I think this predicate may have been interfering with your results.
Take your first H2, for example... it does have a preceding sibling:
standfirst, but you did in fact want it to be recorded as a boundary,
right?
I am guessing that your boundary elements are those H1, H2 and H3 elements
that are not descendants of boxtexttable, casestudy, casetexttable, or
checklist elements, and that are not immediately preceded by a higher-level
heading element (H1 being higher than H2 being higher than H3). I think this
last clause is not really necessary; you could more easily just rule out a
given element as being a boundary if its immediately preceding sibling's name
starts with 'H' (since you don't have any <HR>s).
This expression is much simpler and I think will do what you want:
(//H1|//H2|//H3)[not(ancestor::boxtexttable or
ancestor::casestudy or
ancestor::casetexttable or
ancestor::checklist) and
not(starts-with(local-name(preceding-sibling::*[1]),'H'))]
And it could be easily made to be more specific if you do in fact have
other element names starting with 'H'.
However, any XPath expression that uses "//" for a full descent into the tree
and "ancestor::" (multiple times!) to traverse all the way back up is putting
a serious stress test on an XPath processor's optimization abilities.
If efficiency is critical, I'd look into other mechanisms involving a single
pass through the tree. For example, this XSLT stylesheet, which does a
recursive copy-through ("identity transform", see XSLT spec under Copying) is
far more efficient for what you want to do, which is generate a new document
that has a boundary="true" attribute added to the appropriate elements:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="no"/>

  <!-- when not in a special mode, do an identity transform -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- when not in a special mode, for these elements,
       do an identity transform, but also add a boundary attribute
       if the immediately preceding sibling isn't H1 or H2 -->
  <xsl:template match="H1|H2|H3">
    <xsl:copy>
      <xsl:if test="not(preceding-sibling::*[1][local-name()='H1' or local-name()='H2'])">
        <xsl:attribute name="boundary">true</xsl:attribute>
      </xsl:if>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- when not in a special mode, for these elements,
       do an identity transform, but set a special mode -->
  <xsl:template match="boxtexttable|casestudy|casetexttable|checklist">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" mode="ignore-Hs"/>
    </xsl:copy>
  </xsl:template>

  <!-- when in the special mode, do an identity transform,
       and stay in the special mode -->
  <xsl:template match="@*|node()" mode="ignore-Hs">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" mode="ignore-Hs"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
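If you would rather stay in Python, the same single-pass idea can be
sketched with the standard library's xml.etree.ElementTree instead of XSLT.
This is only a rough sketch under my own assumptions: the function name
mark_boundaries, the sample document, and its id attributes are made up for
illustration, and the root element itself is never examined.

```python
import xml.etree.ElementTree as ET

HEADINGS = {"H1", "H2", "H3"}
# Containers whose descendant headings are never boundaries.
EXCLUDED = {"boxtexttable", "casestudy", "casetexttable", "checklist"}


def mark_boundaries(elem):
    """Visit each child once; set boundary="true" on qualifying headings."""
    previous = None  # tag of the immediately preceding sibling element
    for child in elem:
        if child.tag in EXCLUDED:
            # Leave the whole subtree untouched, like the "ignore-Hs" mode.
            previous = child.tag
            continue
        # A heading is a boundary unless it directly follows an H1 or H2.
        if child.tag in HEADINGS and previous not in ("H1", "H2"):
            child.set("boundary", "true")
        mark_boundaries(child)
        previous = child.tag


# Hypothetical sample document, invented for this sketch.
SAMPLE = """<doc>
  <standfirst/>
  <H2 id="a"/>
  <H3 id="b"/>
  <para/>
  <H3 id="c"/>
  <casestudy><H1 id="d"/></casestudy>
  <H1 id="e"/>
  <H2 id="f"/>
</doc>"""

root = ET.fromstring(SAMPLE)
mark_boundaries(root)
marked = [e.get("id") for e in root.iter() if e.get("boundary") == "true"]
print(marked)  # ['a', 'c', 'e']
```

Each element is visited once and no ancestor lookups are needed, because
the recursion simply never descends into the excluded containers.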
-Mike