[XML-SIG] PyXML XPath woes

Mike Brown mike at skew.org
Sun Feb 8 01:18:04 EST 2004


Matt Patterson wrote:
> As an aside, I originally did the whole thing using PyXML but the XPath 
> were too complex for it (example below) and it would return no results! 
> I now pre-process the file by running with the complex XPaths through 
> libxslt to add the boundary attributes using the complex XPath, and 
> then searching for the attributes with PyXML. This is the XPath to find 
> the boundary nodes without help:
> 
> //H1[ancestor::boxtexttable = false()][ancestor::casestudy = 
> false()][ancestor::casetexttable = false()][ancestor::checklist = 
> false()]|//H2[ancestor::boxtexttable = false()][ancestor::casestudy = 
> false()][ancestor::casetexttable = false()][ancestor::checklist = 
> false()][preceding-sibling::*[1][name() != 
> 'H1']]|//H2[ancestor::boxtexttable = false()][ancestor::casestudy = 
> false()][ancestor::casetexttable = false()][ancestor::checklist = 
> false()][count(preceding-sibling::*) = 0]|//H3[ancestor::boxtexttable = 
> false()][ancestor::casestudy = false()][ancestor::casetexttable = 
> false()][ancestor::checklist = false()][preceding-sibling::*[1][name() 
> != 'H2'][name() != 'H1']]|//H3[ancestor::boxtexttable = 
> false()][ancestor::casestudy = false()][ancestor::casetexttable = 
> false()][ancestor::checklist = false()][count(preceding-sibling::*) = 
> 0]

Some XPath hints for you here...

1. These predicates don't have to be chained. For example, instead of

[ancestor::boxtexttable = false()][ancestor::casestudy = false()]
[ancestor::casetexttable = false()][ancestor::checklist = false()]
[preceding-sibling::*[1][name() != 'H1']]

you could just say

[not(ancestor::boxtexttable or ancestor::casestudy or ancestor::casetexttable
or ancestor::checklist) and name(preceding-sibling::*[1]) != 'H1']

2. count(preceding-sibling::*) = 0

is more succinctly written as not(preceding-sibling::*)

However, I think this predicate may have been interfering with your results.
Take your first H2, for example: it does have a preceding sibling
(standfirst), but you did in fact want it to be recorded as a boundary,
right?

I am guessing that your boundary elements are those H1, H2 and H3 elements
that are not descendants of boxtexttable, casestudy, casetexttable, or
checklist elements, and that are not immediately preceded by a higher-level
heading element (H1 being higher than H2, which is higher than H3). I think
this last clause need not be so specific; you could more easily just rule out
a given element as being a boundary if its immediate preceding sibling's name
starts with 'H' (since you don't have any <HR>s).

This expression is much simpler and I think will do what you want:

(//H1|//H2|//H3)[not(ancestor::boxtexttable or
                     ancestor::casestudy or
                     ancestor::casetexttable or
                     ancestor::checklist) and
                 not(starts-with(local-name(preceding-sibling::*[1]),'H'))]

And it could be easily made to be more specific if you do in fact have
other element names starting with 'H'.

However, any XPath expression that uses "//" for a full descent into the tree
and "ancestor::" (multiple times!) to traverse all the way back up puts
serious stress on an XPath processor's optimization abilities.

If efficiency is critical, I'd look into other mechanisms involving a single
pass through the tree. For example, this XSLT stylesheet, which does a
recursive copy-through ("identity transform", see XSLT spec under Copying) is
far more efficient for what you want to do, which is generate a new document
that has a boundary="true" attribute added to the appropriate elements:


<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="no"/>

  <!-- when not in a special mode, do an identity transform -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- when not in a special mode, for these elements,
    do an identity transform, but also add a boundary attribute
    if the immediately preceding sibling isn't H1 or H2 -->
  <xsl:template match="H1|H2|H3">
    <xsl:copy>
      <xsl:if test="not(preceding-sibling::*[1][local-name()='H1' or local-name()='H2'])">
        <xsl:attribute name="boundary">true</xsl:attribute>
      </xsl:if>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- when not in a special mode, for these elements,
   do an identity transform, but set a special mode -->
  <xsl:template match="boxtexttable|casestudy|casetexttable|checklist">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" mode="ignore-Hs"/>
    </xsl:copy>
  </xsl:template>

  <!-- when in the special mode, do an identity transform,
   and stay in the special mode -->
  <xsl:template match="@*|node()" mode="ignore-Hs">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" mode="ignore-Hs"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
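
For comparison, the same single-pass idea can be sketched in Python with no
XPath at all. This is only a rough sketch under some assumptions: it uses the
standard library's xml.etree.ElementTree rather than PyXML, and the names
mark_boundaries, HEADINGS, EXCLUDED, HIGHER and the sample document are made
up for illustration. The rule mirrors the stylesheet's test (a heading is a
boundary unless its immediately preceding sibling is H1 or H2), and elements
are assumed to be in no namespace.

```python
import xml.etree.ElementTree as ET

# Element names taken from the thread; the constant names are hypothetical.
HEADINGS = {"H1", "H2", "H3"}
HIGHER = {"H1", "H2"}  # preceding siblings that suppress a boundary
EXCLUDED = {"boxtexttable", "casestudy", "casetexttable", "checklist"}


def mark_boundaries(elem, inside_excluded=False):
    """Visit each descendant of elem once, setting boundary="true" on
    qualifying headings. The element passed in is not itself tested."""
    # Once inside an excluded container, stay "inside" all the way down,
    # like the stylesheet's ignore-Hs mode.
    inside = inside_excluded or elem.tag in EXCLUDED
    previous = None  # immediately preceding sibling element, if any
    for child in elem:
        if not isinstance(child.tag, str):
            continue  # skip comments and processing instructions
        if (child.tag in HEADINGS and not inside
                and (previous is None or previous.tag not in HIGHER)):
            child.set("boundary", "true")
        mark_boundaries(child, inside)
        previous = child


SAMPLE = """<doc>
  <standfirst/>
  <H2>intro</H2>
  <p/>
  <H1>one</H1>
  <H2>two</H2>
  <casestudy><H3>three</H3></casestudy>
  <H3>four</H3>
</doc>"""

root = ET.fromstring(SAMPLE)
mark_boundaries(root)
marked = [e.text for e in root.iter() if e.get("boundary") == "true"]
print(marked)  # ['intro', 'one', 'four']
```

Each element is visited exactly once and the "am I inside an excluded
container?" question is answered by a flag passed down the recursion, so
nothing ever has to walk back up the tree the way ancestor:: does.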


-Mike


