Hi Gilles, I guess you're intending on using 'sort -u' on your data? An alternative would be to de-dup the data as XML instead of as text. Here is something to play with... For the input file: <data> <entries> <wpt lat="46.98520" lon="6.8831"> <name>London</name> </wpt> <wpt lat="46.98520" lon="2.8831"> <name>Paris</name> </wpt> <wpt lat="46.98520" lon="-4.8831"> <name>Manhattan</name> </wpt> <wpt lat="46.98520" lon="6.8831"> <name>London 2</name> </wpt> <wpt lat="46.98520" lon="-4.8831"> <name>New York</name> </wpt> </entries> </data> We can process it with the following code, using python' set() object to remove duplicates: #!/usr/bin/env python3 from lxml import etree # Create a custom class that knows which attributes of wbt # we care about to consider them unique or not. # # Note that both eq() and hash() need to be supported. I was # originally expecting that just hash() would have been sufficient # for set() to cull duplicates. class WPT(etree.ElementBase): def __eq__(self, b): return self.attrib['lat'] == b.attrib['lat'] and self.attrib['lon'] == b.attrib['lon'] def __hash__(self): return hash( (self.attrib['lat'], self.attrib['lon']) ) # Create a parser that returns WPT objects in place of _Elements # but only for elements with a name of 'wpt' def get_wpt_parser(): lookup = etree.ElementNamespaceClassLookup() parser = etree.XMLParser() parser.set_element_class_lookup(lookup) namespace = lookup.get_namespace('') namespace['wpt'] = WPT return parser # Load the XML data and find the parent of the data we're interested in wbt_parser = get_wpt_parser() root = etree.parse('input.xml', wbt_parser) entries = root.find('entries') # Some sanity checking: Print out the Python type of the entries # element (should be a traditional _Element) and each of the children, # which should be of type WPT. print(f"type(entries) = {type(entries)}") print(f"type(entries.children = {','.join(str(type(c)) for c in entries.getchildren())}") # Read the child elements of the parent into a set; which will cause # duplicated entries to be removed; with set() leveraging the __eq__ and # __hash__ functions of the WBT class above children = set(entries.iterchildren()) # Replace the original children with the unique children entries[:] = children # Write out the resultant XML with open('output.xml', 'wb') as output_file: output_file.write(etree.tostring(root)) This results in the following output: <data> <entries> <wpt lat="46.98520" lon="6.8831"> <name>London</name> </wpt> <wpt lat="46.98520" lon="-4.8831"> <name>Manhattan</name> </wpt> <wpt lat="46.98520" lon="2.8831"> <name>Paris</name>a </wpt> </entries> </data> Which may well be what you're after... If the contents of the <name> elements should also be part of the "is equal" then the WBT class can be updated to include this data too in the __eq__ and __hash__ functions. Cheers, aid
On 8 Aug 2022, at 20:32, Gilles <codecomplete@free.fr> wrote:
Hello,
Before I resort to a regex, I figured I should ask here.
To find and remove possible duplicates, I need to turn each block into a single line:
FROM
<wpt lat="46.98520" lon="6.8831"> <name>blah</name> </wpt>
TO
<wpt lat="46.98520" lon="6.8831"><name>blah</name></wpt>
Do you know of a way to do this in lxml?
Thank you.
_______________________________________________ lxml - The Python XML Toolkit mailing list -- lxml@python.org To unsubscribe send an email to lxml-leave@python.org https://mail.python.org/mailman3/lists/lxml.python.org/ Member address: aid@logic.org.uk