aligning ElementTrees to text

Steven Bethard steven.bethard at gmail.com
Wed Jan 17 18:43:56 EST 2007


I'm trying to align an XML file with the original text file from which 
it was created. Unfortunately, the XML version of the file has added and 
removed some of the whitespace. For example::

     >>> plain_text = '''
     ... Pacific First Financial Corp. said shareholders approved its
     ... acquisition.
     ... '''
     >>> xml_text = '''   <s>Pacific First Financial Corp.
     ... <EVENT eid="e1" class="REPORTING" > said </EVENT> shareholders
     ... <EVENT eid="e2" class="OCCURRENCE" >approved</EVENT> its
     ... <EVENT eid="e8" class="OCCURRENCE" >    acquis ition </EVENT>.
     ... </s>
     ... '''

I want to determine which offsets in the *original* text each element 
from the XML text is supposed to cover. So I want something like::

     >>> from xml.etree import ElementTree as etree
     >>> xml_tree = etree.fromstring(xml_text)
     >>> align(xml_tree, plain_text)
     [(<Element 'EVENT' at 01411B00>, 31, 35),
     (<Element 'EVENT' at 01411EA8>, 49, 57),
     (<Element 'EVENT' at 01411E18>, 62, 73),
     (<Element 's' at 01411FC8>, 1, 74)]

where ``align`` has returned a list of all elements in the XML text 
along with their start and end indices in the original text::

     >>> plain_text[31:35]
     'said'
     >>> plain_text[49:57]
     'approved'
     >>> plain_text[62:73]
     'acquisition'

Note that I want to ignore whitespace as much as possible, so the 
elements are aligned only to the non-whitespace text they include.
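
The whitespace-insensitive matching described above only works if the two
versions agree once all whitespace is removed. A minimal sketch of that
sanity check follows; the names ``non_whitespace`` and ``same_characters``
are hypothetical helpers, not part of ElementTree, and ``itertext`` assumes
a reasonably recent ``xml.etree``:

```python
from xml.etree import ElementTree as etree


def non_whitespace(s):
    """Return s with every whitespace character removed."""
    return ''.join(s.split())


def same_characters(tree, text):
    """True if the tree's text content and the plain text agree
    once all whitespace is ignored."""
    return non_whitespace(''.join(tree.itertext())) == non_whitespace(text)


# Tiny illustration: whitespace differs, the remaining characters do not.
example_ok = same_characters(etree.fromstring('<s> a <b>b c</b> </s>'), 'a bc')
```

If this check fails, no alignment that ignores only whitespace can succeed,
so it is a cheap thing to test before walking the tree.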


Below is my current implementation of the ``align`` function. It seems 
pretty messy to me -- can anyone offer me some advice on how to clean it 
up or write it differently? ::

     def align(tree, text):

         def align_helper(elem, elem_start):
             # skip whitespace in the text before the element
             while text[elem_start:elem_start + 1].isspace():
                 elem_start += 1

             # advance the element end past any element text
             elem_end = elem_start
             if elem.text is not None:
                 for char in elem.text:
                     if not char.isspace():
                         while text[elem_end:elem_end + 1].isspace():
                             elem_end += 1
                         assert text[elem_end] == char
                         elem_end += 1

             # advance the element end past any child elements
             for child_elem in elem:
                 elem_end = align_helper(child_elem, elem_end)

             # advance the start for the next element past the tail text
             next_start = elem_end
             if elem.tail is not None:
                 for char in elem.tail:
                     if not char.isspace():
                         while text[next_start:next_start + 1].isspace():
                             next_start += 1
                         assert text[next_start] == char
                         next_start += 1

             # add the element and its start and end to the result list
             result.append((elem, elem_start, elem_end))

             # return the start of the next element
             return next_start

         result = []
         align_helper(tree, 0)
         return result
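
The loops over ``elem.text`` and ``elem.tail`` above are identical apart
from the variable they advance, so one possible cleanup is to factor them
into a single helper. The following is only a sketch under that assumption:
``advance`` and ``align_refactored`` are hypothetical names, and the
``assert`` is swapped for a ``ValueError`` so that running off the end of
the text is reported the same way as a mismatch:

```python
def advance(text, pos, chunk):
    """Return the offset in text just past the non-whitespace
    characters of chunk, starting the search at pos."""
    for char in chunk or '':          # chunk may be None (no text/tail)
        if char.isspace():
            continue                  # whitespace in the XML is ignored
        # skip whitespace that is present in the plain text
        while text[pos:pos + 1].isspace():
            pos += 1
        if text[pos:pos + 1] != char:
            raise ValueError('mismatch at offset %d: expected %r'
                             % (pos, char))
        pos += 1
    return pos


def align_refactored(tree, text):
    """Return (element, start, end) triples in the same order as align."""
    result = []

    def helper(elem, start):
        # skip whitespace in the text before the element
        while text[start:start + 1].isspace():
            start += 1
        # the element covers its own text plus all of its children
        end = advance(text, start, elem.text)
        for child in elem:
            end = helper(child, end)
        result.append((elem, start, end))
        # the next element starts after this element's tail text
        return advance(text, end, elem.tail)

    helper(tree, 0)
    return result
```

Called as ``align_refactored(xml_tree, plain_text)``, this should give the
same offsets as ``align`` on the example above; the behaviour is otherwise
intended to be unchanged.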


Thanks,

STeVe


