a question about a loop within a loop
I would be grateful for advice on the following. I iterate over a sequence of sibling elements with the typical code

    for element in tree.iter(tei + 'w', tei + 'c'):
        do this or that

Within that sequence there are shorter sequences (between two and seven elements) that begin with an element <w part="I"/> and end with an element <w part="F"/>. In between there may or may not be one or more elements of the type <w part="M"/>.

Since most of the cases involve sequences of two or three elements, I've dealt with them using code like "if the next-but-one element has a part='F' attribute". That works for the simple cases, but it would be much better if I could break out of the current iteration, isolate the sequence that goes from <w part="I"/> to <w part="F"/>, iterate over it, and integrate the result (a single <w> element) back into the tree.

But I don't know how to write code that would

1. start at a known point and make that the point of departure for a sequence that can be iterated over
2. gather the elements that follow it until I come to the unknown future point that is defined by part="F"

And I don't know whether that would be an lxml or a more general Python procedure.

Martin Mueller
Professor emeritus of English and Classics
Northwestern University
Hi Martin,

On 21.08.2014 at 22:36, Martin Mueller <martinmueller@northwestern.edu> wrote:

> 1. start at a known point and make that the point of departure for a sequence that can be iterated over
> 2. gather the elements that follow it until I come to the unknown future point that is defined by part="F"
>
> And I don't know whether that would be an lxml or a more general Python procedure.

When you're working in a tree, the ElementTree API applies. lxml has some extensions that make some operations easier, but switching between lxml and the standard library is fortunately very easy.

From your description it's not entirely clear what you want to do with the elements you encounter: extract information or change the tree in place.

What we do in openpyxl when parsing reasonably complex structures is to break the job down into different functions for the different parts. One of the nice things about the ElementTree interface is that it's the same at any level in the tree, so if you find an element within a tree that you want to deal with, you can simply pass it on to another function. To do this it's best to search explicitly for elements (using find(), findall() or iter() with the relevant identifiers) to prevent your functions getting in the way of each other.

Not sure how close this is to your use case, but it might help to see how we read (but don't modify) some XML:

https://bitbucket.org/openpyxl/openpyxl/src/03cb2a7f046d02ec3a19cbeba4375b6d...

If that's totally off the mark, can you be more explicit and provide some examples of what you've got and what you need it to be?

Charlie
-- 
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Kronenstr. 27a
Düsseldorf
D- 40217
Tel: +49-211-600-3657
Mobile: +49-178-782-6226
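A minimal sketch of the "one function per level" idea described above, using the standard library's xml.etree.ElementTree (the same calls should also work with lxml.etree). The element names `text` and `l` and the helper names `describe_word` and `describe_line` are made up for illustration; they are not taken from openpyxl or from Martin's data.

```python
import xml.etree.ElementTree as ET


def describe_word(w):
    """Handle a single <w> element: return its 'part' attribute and its text."""
    return w.get("part"), (w.text or "")


def describe_line(line):
    """Handle one <l> element by searching explicitly for its <w> descendants."""
    return [describe_word(w) for w in line.iter("w")]


# Hypothetical sample input, only for illustration.
doc = ET.fromstring(
    '<text><l><w part="I">some</w><w part="F">thing</w><w>else</w></l></text>'
)

# The top level only looks for <l> elements and hands each one on.
lines = [describe_line(line) for line in doc.findall("l")]
print(lines)
```

Each function only asks for the elements it cares about, so the functions do not need to know where in the tree they were called from.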
Martin Mueller wrote on 21.08.2014 at 22:36:

This is not (really) an lxml related question, so python-list (a.k.a. comp.lang.python) would be a better place to ask it. However, it's fun to answer, so here goes.

There is a pattern that you could use here. Normally, you would iterate by passing an iterable to a for-loop, which then creates an iterator from it (by calling "iter(iterable)") and runs over it. The "iterator" is the thing that remembers where it was and, when you call next() on it, returns the next item. That's essentially how for-loops work: by repeatedly calling next() on an iterator.

But you could also create the iterator yourself and use it in multiple iteration steps and/or for-loops. It keeps its iteration state, so each for-loop continues where the previous one stopped instead of restarting from the beginning.

Here is an untested example, with the lucky twist that element.iterfind() already returns an iterator (not just an iterable):

    def print_sequences(parent):
        ws = parent.iterfind('w')  # an iterator over 'w' children
        for w in ws:
            seq = [w]
            if w.get('part') == 'I':
                # collect more elements until we find an 'F' part
                for w in ws:
                    seq.append(w)
                    if w.get('part') == 'F':
                        break
            # now print the 'part' attributes of all elements
            # in our sequence to show what elements we got
            print([w.get('part') for w in seq])

This will put each 'w' element in its own list, except when it finds an 'I' part, in which case it will collect more elements until it finds an 'F'. The list "seq" then always contains either a single (non-I) element or a sequence from I to F.

I didn't understand your comment about the 'M' part, but the above example should enable you to deal with a mix of single elements and sequences.

Stefan
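The shared-iterator pattern above could be extended to the merging step Martin asked about. What follows is only a sketch under several assumptions: it uses the standard library's xml.etree.ElementTree rather than lxml, it ignores the TEI namespace (real data would need the `tei + 'w'` form of the tag), it assumes the parts of a word are 'w' children of the same parent, and the function name `merge_parts` and the sample input are invented for illustration.

```python
import xml.etree.ElementTree as ET


def merge_parts(parent):
    """Replace each run <w part="I"> ... <w part="F"> with a single <w>."""
    # findall() returns a list, i.e. a snapshot, so removing children from
    # 'parent' while we walk over it is safe. iter() turns the list into an
    # iterator that both loops below share.
    ws = iter(parent.findall("w"))
    for w in ws:
        if w.get("part") != "I":
            continue
        seq = [w]
        for nxt in ws:  # continues where the outer loop stopped
            seq.append(nxt)
            if nxt.get("part") == "F":
                break
        else:
            # Ran out of elements without finding an 'F': leave them alone.
            continue
        # Fold the whole sequence into its first element.
        w.text = "".join(part.text or "" for part in seq)
        w.tail = seq[-1].tail
        del w.attrib["part"]
        for part in seq[1:]:
            parent.remove(part)


# Hypothetical sample input, only for illustration.
root = ET.fromstring(
    '<s><w>a</w><w part="I">un</w><w part="M">der</w>'
    '<w part="F">stand</w><w>b</w><w part="I">to</w><w part="F">day</w></s>'
)
merge_parts(root)
print(ET.tostring(root, encoding="unicode"))
# prints <s><w>a</w><w>understand</w><w>b</w><w>today</w></s>
```

Elements with part="M" need no special handling here: the inner loop collects everything up to and including the next 'F'.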
participants (3)
- Charlie Clark
- Martin Mueller
- Stefan Behnel