Turn three-line block into single?
Hello,

Before I resort to a regex, I figured I should ask here.

To find and remove possible duplicates, I need to turn each block into a single line:

FROM

<wpt lat="46.98520" lon="6.8831">
    <name>blah</name>
</wpt>

TO

<wpt lat="46.98520" lon="6.8831"><name>blah</name></wpt>

Do you know of a way to do this in lxml?

Thank you.
Add the options method='c14n2', strip_text=True when you serialize the output. (pretty_print should also be left at its default, False.)
print(etree.tostring(etree.fromstring(ss), method='c14n2', strip_text=True))
b'<wpt lat="46.98520" lon="6.8831"><name>blah</name></wpt>'
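For completeness, a self-contained sketch of the suggestion above (it assumes lxml 4.4 or later, where the 'c14n2' serialisation method and its strip_text option were added; the sample element is the one from the question):

```python
from lxml import etree

# A three-line wpt block, as in the original question.
src = '''<wpt lat="46.98520" lon="6.8831">
    <name>blah</name>
</wpt>'''

# C14N 2.0 serialisation with strip_text=True drops whitespace-only
# text nodes, collapsing the block onto a single line.
flat = etree.tostring(etree.fromstring(src), method='c14n2', strip_text=True)
print(flat.decode())
# <wpt lat="46.98520" lon="6.8831"><name>blah</name></wpt>
```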
On Aug 8, 2022, at 3:32 PM, Gilles <codecomplete@free.fr> wrote:
Hello,
Before I resort to a regex, I figured I should ask here.
To find and remove possible duplicates, I need to turn each block into a single line:
FROM
<wpt lat="46.98520" lon="6.8831"> <name>blah</name> </wpt>
TO
<wpt lat="46.98520" lon="6.8831"><name>blah</name></wpt>
Do you know of a way to do this in lxml?
Thank you.
_______________________________________________
lxml - The Python XML Toolkit mailing list -- lxml@python.org
To unsubscribe send an email to lxml-leave@python.org
https://mail.python.org/mailman3/lists/lxml.python.org/
Member address: sdm7g@virginia.edu
On 08/08/2022 22:08, Majewski, Steven Dennis (sdm7g) wrote:
Add the options method='c14n2', strip_text=True when you serialize the output. (pretty_print should also be left at its default, False.)

print(etree.tostring(etree.fromstring(ss), method='c14n2', strip_text=True))
b'<wpt lat="46.98520" lon="6.8831"><name>blah</name></wpt>'
Thank you.
Hi Gilles,

I guess you're intending on using 'sort -u' on your data? An alternative would be to de-dup the data as XML instead of as text.

Here is something to play with...

For the input file:

<data>
  <entries>
    <wpt lat="46.98520" lon="6.8831">
      <name>London</name>
    </wpt>
    <wpt lat="46.98520" lon="2.8831">
      <name>Paris</name>
    </wpt>
    <wpt lat="46.98520" lon="-4.8831">
      <name>Manhattan</name>
    </wpt>
    <wpt lat="46.98520" lon="6.8831">
      <name>London 2</name>
    </wpt>
    <wpt lat="46.98520" lon="-4.8831">
      <name>New York</name>
    </wpt>
  </entries>
</data>

We can process it with the following code, using Python's set() object to remove duplicates:

#!/usr/bin/env python3

from lxml import etree

# Create a custom class that knows which attributes of wpt
# we care about to consider them unique or not.
#
# Note that both __eq__() and __hash__() need to be supported. I was
# originally expecting that just __hash__() would have been sufficient
# for set() to cull duplicates.
class WPT(etree.ElementBase):
    def __eq__(self, b):
        return self.attrib['lat'] == b.attrib['lat'] and self.attrib['lon'] == b.attrib['lon']

    def __hash__(self):
        return hash((self.attrib['lat'], self.attrib['lon']))

# Create a parser that returns WPT objects in place of _Elements,
# but only for elements with a name of 'wpt'.
def get_wpt_parser():
    lookup = etree.ElementNamespaceClassLookup()
    parser = etree.XMLParser()
    parser.set_element_class_lookup(lookup)
    namespace = lookup.get_namespace('')
    namespace['wpt'] = WPT
    return parser

# Load the XML data and find the parent of the data we're interested in.
wpt_parser = get_wpt_parser()
root = etree.parse('input.xml', wpt_parser)
entries = root.find('entries')

# Some sanity checking: print out the Python type of the entries
# element (should be a traditional _Element) and each of the children,
# which should be of type WPT.
print(f"type(entries) = {type(entries)}")
print(f"type(entries.children) = {','.join(str(type(c)) for c in entries.getchildren())}")

# Read the child elements of the parent into a set, which will cause
# duplicated entries to be removed, with set() leveraging the __eq__ and
# __hash__ functions of the WPT class above.
children = set(entries.iterchildren())

# Replace the original children with the unique children.
entries[:] = children

# Write out the resultant XML.
with open('output.xml', 'wb') as output_file:
    output_file.write(etree.tostring(root))

This results in the following output:

<data>
  <entries>
    <wpt lat="46.98520" lon="6.8831">
      <name>London</name>
    </wpt>
    <wpt lat="46.98520" lon="-4.8831">
      <name>Manhattan</name>
    </wpt>
    <wpt lat="46.98520" lon="2.8831">
      <name>Paris</name>
    </wpt>
  </entries>
</data>

Which may well be what you're after... If the contents of the <name> elements should also be part of the "is equal" test then the WPT class can be updated to include this data too in the __eq__ and __hash__ functions.

Cheers,

aid
On 8 Aug 2022, at 20:32, Gilles <codecomplete@free.fr> wrote:
Hello,
Before I resort to a regex, I figured I should ask here.
To find and remove possible duplicates, I need to turn each block into a single line:
FROM
<wpt lat="46.98520" lon="6.8831"> <name>blah</name> </wpt>
TO
<wpt lat="46.98520" lon="6.8831"><name>blah</name></wpt>
Do you know of a way to do this in lxml?
Thank you.
Thanks mucho.

The script fails on this particular line:

"""
File "remove.dups.py", line 54, in <module>
    print(f"type(entries.children = {','.join(str(type(c)) for c in entries.getchildren())}")
AttributeError: 'NoneType' object has no attribute 'getchildren'
"""

print(f"type(entries.children = {','.join(str(type(c)) for c in entries.getchildren())}")

On 09/08/2022 00:55, Adrian Bool wrote:
Hi Gilles,
I guess you're intending on using 'sort -u' on your data? An alternative would be to de-dup the data as XML instead of as text.
Here is something to play with...
For the input file:
<data>
  <entries>
    <wpt lat="46.98520" lon="6.8831">
      <name>London</name>
    </wpt>
    <wpt lat="46.98520" lon="2.8831">
      <name>Paris</name>
    </wpt>
    <wpt lat="46.98520" lon="-4.8831">
      <name>Manhattan</name>
    </wpt>
    <wpt lat="46.98520" lon="6.8831">
      <name>London 2</name>
    </wpt>
    <wpt lat="46.98520" lon="-4.8831">
      <name>New York</name>
    </wpt>
  </entries>
</data>
We can process it with the following code, using Python's set() object to remove duplicates:
#!/usr/bin/env python3

from lxml import etree

# Create a custom class that knows which attributes of wpt
# we care about to consider them unique or not.
#
# Note that both __eq__() and __hash__() need to be supported. I was
# originally expecting that just __hash__() would have been sufficient
# for set() to cull duplicates.
class WPT(etree.ElementBase):
    def __eq__(self, b):
        return self.attrib['lat'] == b.attrib['lat'] and self.attrib['lon'] == b.attrib['lon']

    def __hash__(self):
        return hash((self.attrib['lat'], self.attrib['lon']))

# Create a parser that returns WPT objects in place of _Elements,
# but only for elements with a name of 'wpt'.
def get_wpt_parser():
    lookup = etree.ElementNamespaceClassLookup()
    parser = etree.XMLParser()
    parser.set_element_class_lookup(lookup)
    namespace = lookup.get_namespace('')
    namespace['wpt'] = WPT
    return parser

# Load the XML data and find the parent of the data we're interested in.
wpt_parser = get_wpt_parser()
root = etree.parse('input.xml', wpt_parser)
entries = root.find('entries')

# Some sanity checking: print out the Python type of the entries
# element (should be a traditional _Element) and each of the children,
# which should be of type WPT.
print(f"type(entries) = {type(entries)}")
print(f"type(entries.children) = {','.join(str(type(c)) for c in entries.getchildren())}")

# Read the child elements of the parent into a set, which will cause
# duplicated entries to be removed, with set() leveraging the __eq__ and
# __hash__ functions of the WPT class above.
children = set(entries.iterchildren())

# Replace the original children with the unique children.
entries[:] = children

# Write out the resultant XML.
with open('output.xml', 'wb') as output_file:
    output_file.write(etree.tostring(root))
This results in the following output:
<data>
  <entries>
    <wpt lat="46.98520" lon="6.8831">
      <name>London</name>
    </wpt>
    <wpt lat="46.98520" lon="-4.8831">
      <name>Manhattan</name>
    </wpt>
    <wpt lat="46.98520" lon="2.8831">
      <name>Paris</name>
    </wpt>
  </entries>
</data>
Which may well be what you're after... If the contents of the <name> elements should also be part of the "is equal" test then the WPT class can be updated to include this data too in the __eq__ and __hash__ functions.
Cheers,
aid
On 8 Aug 2022, at 20:32, Gilles <codecomplete@free.fr> wrote:
Hello,
Before I resort to a regex, I figured I should ask here.
To find and remove possible duplicates, I need to turn each block into a single line:
FROM
<wpt lat="46.98520" lon="6.8831"> <name>blah</name> </wpt>
TO
<wpt lat="46.98520" lon="6.8831"><name>blah</name></wpt>
Do you know of a way to do this in lxml?
Thank you.
On 9 Aug 2022, at 8:40, Gilles wrote:
Thanks mucho.
The script fails on this particular line:
"""
File "remove.dups.py", line 54, in <module>
print(f"type(entries.children = {','.join(str(type(c)) for c in entries.getchildren())}")
AttributeError: 'NoneType' object has no attribute 'getchildren'
"""
print(f"type(entries.children = {','.join(str(type(c)) for c in entries.getchildren())}")
Then add the condition `not None` in the comprehension.

Though, to be honest I suspect writing to a SQLite database and exporting unique values back to XML is probably going to be easier.

Charlie

--
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Sengelsweg 34
Düsseldorf
D-40489
Tel: +49-203-3925-0390
Mobile: +49-178-782-6226
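One common reason for find('entries') returning None, beyond an empty tree, is a default XML namespace in the real input file (GPX files declare one), which an unqualified find() will not match. A hedged sketch; the namespace URI here is made up for illustration, not taken from the thread's input file:

```python
from lxml import etree

# Hypothetical document with a default namespace.
doc = etree.fromstring(
    '<data xmlns="http://example.com/ns"><entries/></data>')

# An unqualified lookup silently misses namespaced elements...
assert doc.find('entries') is None

# ...so qualify the tag with the namespace in Clark notation.
entries = doc.find('{http://example.com/ns}entries')
assert entries is not None
print(entries.tag)
# {http://example.com/ns}entries
```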
On 09/08/2022 10:51, Charlie Clark wrote:
Though, to be honest I suspect writing to a Sqlite database and exporting unique values back to XML is probably going to be easier.
Nice idea too. I could just ignore the error when trying to insert a duplicate:

https://www.sqlitetutorial.net/sqlite-unique-constraint/
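The UNIQUE-constraint approach from that page can also be written without any exception handling: SQLite's INSERT OR IGNORE skips rows that would violate the constraint. A minimal sketch with an in-memory database (table and sample rows are illustrative):

```python
import sqlite3

db = sqlite3.connect(':memory:')
cur = db.cursor()
cur.execute('CREATE TABLE wp (name TEXT UNIQUE, latitude TEXT, longitude TEXT)')

rows = [('blah', '46.98520', '6.8831'),
        ('blah', '46.98520', '6.8831'),    # duplicate name, silently skipped
        ('other', '47.00000', '7.00000')]
cur.executemany(
    'INSERT OR IGNORE INTO wp (name, latitude, longitude) VALUES (?, ?, ?)',
    rows)

cur.execute('SELECT COUNT(*) FROM wp')
print(cur.fetchone()[0])
# 2
```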
On 9 Aug 2022, at 11:09, Gilles wrote:
Nice idea too. I could just ignore the error when trying to insert a duplicate
Sure, though that's a kind of try/except and if you have a lot of data I suspect the aggregate function will be faster for this kind of one-off.

Charlie
On 09/08/2022 11:40, Charlie Clark wrote:
On 9 Aug 2022, at 11:09, Gilles wrote:
Nice idea too. I could just ignore the error when trying to insert a duplicate
https://www.sqlitetutorial.net/sqlite-unique-constraint/

Sure, though that's a kind of try/except and if you have a lot of data I suspect the aggregate function will be faster for this kind of one-off.
I don't know what that is. Is it Python or SQLite?
On 09/08/2022 11:40, Charlie Clark wrote:
On 9 Aug 2022, at 11:09, Gilles wrote:
Nice idea too. I could just ignore the error when trying to insert a duplicate
https://www.sqlitetutorial.net/sqlite-unique-constraint/

Sure, though that's a kind of try/except and if you have a lot of data I suspect the aggregate function will be faster for this kind of one-off.
Here's some working code. I reckon using SQL's UNIQUE and ignoring the error triggered when adding a duplicate is a bit kludgy, but it works:

=========
import sqlite3

db = sqlite3.connect('wp.sqlite')
cursor = db.cursor()
cursor.execute('CREATE TABLE IF NOT EXISTS wp(id INTEGER PRIMARY KEY,name TEXT UNIQUE,latitude TEXT,longitude TEXT)')
db.commit()

cursor.execute('BEGIN')
wps = tree.findall("wpt")
for wp in wps:
    name = wp.find('name').text
    lat = wp.attrib['lat']
    lon = wp.attrib['lon']
    print(name, lat, lon)

    # Ignore error when inserting dup
    try:
        cursor.execute('INSERT INTO wp (name,latitude,longitude) VALUES(?,?,?)', (name, lat, lon))
    except sqlite3.IntegrityError as err:
        if err.args != ('UNIQUE constraint failed: wp.name',):
            raise

cursor.execute('END')
db.commit()
db.close()
=========
On 9 Aug 2022, at 15:16, Gilles wrote:
Here's some working code. I reckon using SQL's UNIQUE and ignoring the error triggered when adding a duplicate is a bit kludgy, but it works
For the task I don't see the need for any kind of keys, they'll just slow things down. Also, it will be faster using cursor.executemany() with a list of rows. Not sure if you can combine this with a generator expression but it would be great if you could, otherwise just materialise it when you pass it in.

```python
import sqlite3

db = sqlite3.connect('wp.sqlite')
cursor = db.cursor()
cursor.execute('CREATE TABLE IF NOT EXISTS wp(name TEXT, latitude TEXT, longitude TEXT)')
db.commit()

def get_rows(tree):
    for row in tree.iter("wpt"):
        yield [row.find("name").text] + row.attrib.values()

rows = get_rows(tree)  # initialise generator
cursor.execute('BEGIN')
cursor.executemany('INSERT INTO wp (name,latitude,longitude) VALUES(?,?,?)', rows)
cursor.execute("COMMIT")

cursor.execute("SELECT name, latitude, longitude from wp group by latitude, longitude")
cursor.fetchall()
```

Charlie
Thank you.

On 09/08/2022 15:56, Charlie Clark wrote:
On 9 Aug 2022, at 15:16, Gilles wrote:
Here's some working code. I reckon using SQL's UNIQUE and ignoring the error triggered when adding a duplicate is a bit kludgy, but it works
For the task I don't see the need for any kind of keys, they'll just slow things down.
Also, it will be faster using cursor.executemany() with a list of rows. Not sure if you can combine this with a generator expression but it would be great if you could, otherwise just materialise it when you pass it in.
```python
import sqlite3

db = sqlite3.connect('wp.sqlite')
cursor = db.cursor()
cursor.execute('CREATE TABLE IF NOT EXISTS wp(name TEXT, latitude TEXT, longitude TEXT)')
db.commit()

def get_rows(tree):
    for row in tree.iter("wpt"):
        yield [row.find("name").text] + row.attrib.values()

rows = get_rows(tree)  # initialise generator
cursor.execute('BEGIN')
cursor.executemany('INSERT INTO wp (name,latitude,longitude) VALUES(?,?,?)', rows)
cursor.execute("COMMIT")

cursor.execute("SELECT name, latitude, longitude from wp group by latitude, longitude")
cursor.fetchall()
```
You can also do this maybe more simply in XQuery. In that case, you may want to remove any whitespace differences on ingest (or else, use normalize-space() in comparisons). [In BaseX, there is an option to strip whitespace on parse.]

Exactly how depends on how you are defining EQUALS (i.e. comparing @lon & @lat values alone, or including text in the comparison) and what you want to do in a conflict (use first, or use last, or maybe drop both and report).

let $doc :=
<data>
  <entries>
    <wpt lat="46.98520" lon="6.8831">
      <name>London</name>
    </wpt>
    <wpt lat="46.98520" lon="2.8831">
      <name>Paris</name>
    </wpt>
    <wpt lat="46.98520" lon="-4.8831">
      <name>Manhattan</name>
    </wpt>
    <wpt lat="46.98520" lon="6.8831">
      <name>London 2</name>
    </wpt>
    <wpt lat="46.98520" lon="-4.8831">
      <name>New York</name>
    </wpt>
  </entries>
</data>

for $x in $doc/entries/wpt
where not( $x/@lon = $x/following-sibling::wpt/@lon
           and $x/@lat = $x/following-sibling::wpt/@lat )
return $x

The above compares @lon & @lat and only includes the last in a conflict:

<wpt lat="46.98520" lon="2.8831"><name>Paris</name></wpt>
<wpt lat="46.98520" lon="6.8831"><name>London</name></wpt>
<wpt lat="46.98520" lon="-4.8831"><name>New York</name></wpt>

Change the "where" condition to compare the complete wpt element and keep the first occurrence by changing the comparison to preceding-sibling::

where not( $x = $x/preceding-sibling::wpt )

— Steve M.
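For readers staying in Python, roughly the same keep-the-first-occurrence filter can be written as an lxml XPath expression over preceding siblings. A sketch on a trimmed-down version of the sample data; note that, like the XQuery above, the predicate technically allows the @lat match and the @lon match to come from two different siblings:

```python
from lxml import etree

doc = etree.fromstring('''<data><entries>
  <wpt lat="46.98520" lon="6.8831"><name>London</name></wpt>
  <wpt lat="46.98520" lon="2.8831"><name>Paris</name></wpt>
  <wpt lat="46.98520" lon="6.8831"><name>London 2</name></wpt>
</entries></data>''')

# Keep a wpt only if no earlier sibling already used its coordinates,
# i.e. keep the first occurrence of each (lat, lon) pair.
unique = doc.xpath(
    'entries/wpt[not(@lat = preceding-sibling::wpt/@lat '
    'and @lon = preceding-sibling::wpt/@lon)]')
print([w.findtext('name') for w in unique])
# ['London', 'Paris']
```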
Thanks for the tip.

On 09/08/2022 17:49, Majewski, Steven Dennis (sdm7g) wrote:
You can also do this maybe more simply in XQuery. In that case, you may want to remove any whitespace differences on ingest ( or else, use normalize-space() in comparisons ) [ In BaseX, there is an option to strip whitespace on parse. ]
Exactly how, depends on how you are defining EQUALs, ( i.e. comparing @lon & @lat values alone, or including text in the comparison ) and what you want to do in a conflict ( use first or use last or maybe drop both and report )
let $doc :=
<data>
  <entries>
    <wpt lat="46.98520" lon="6.8831">
      <name>London</name>
    </wpt>
    <wpt lat="46.98520" lon="2.8831">
      <name>Paris</name>
    </wpt>
    <wpt lat="46.98520" lon="-4.8831">
      <name>Manhattan</name>
    </wpt>
    <wpt lat="46.98520" lon="6.8831">
      <name>London 2</name>
    </wpt>
    <wpt lat="46.98520" lon="-4.8831">
      <name>New York</name>
    </wpt>
  </entries>
</data>

for $x in $doc/entries/wpt
where not( $x/@lon = $x/following-sibling::wpt/@lon
           and $x/@lat = $x/following-sibling::wpt/@lat )
return $x

The above compares @lon & @lat and only includes the last in a conflict:

<wpt lat="46.98520" lon="2.8831"><name>Paris</name></wpt>
<wpt lat="46.98520" lon="6.8831"><name>London</name></wpt>
<wpt lat="46.98520" lon="-4.8831"><name>New York</name></wpt>
Change the “where” condition to compare the complete wpt element and keep the first occurrence by changing the comparison to preceding-sibling::
where not( $x = $x/preceding-sibling::wpt )
On 09/08/2022 10:51, Charlie Clark wrote:
Though, to be honest I suspect writing to a Sqlite database and exporting unique values back to XML is probably going to be easier.
I found another way, without relying on SQLite:

===============
parser = et.XMLParser(remove_blank_text=True)
tree = et.parse(item, parser)
root = tree.getroot()

no_dups = {}

for row in tree.iter("wpt"):
    name, lat, lon = [row.find("name").text] + row.attrib.values()
    if name not in no_dups:
        no_dups[name] = lat, lon

print(no_dups)
===============
An important line was missing :-/

no_dups = {}

for row in tree.iter("wpt"):
    name, lat, lon = [row.find("name").text] + row.attrib.values()
    if name not in no_dups:
        no_dups[name] = lat, lon
    else:
        # dup = remove
        row.getparent().remove(row)

On 09/08/2022 20:59, Gilles wrote:
On 09/08/2022 10:51, Charlie Clark wrote:
Though, to be honest I suspect writing to a Sqlite database and exporting unique values back to XML is probably going to be easier.
I found another way, without relying on SQLite:
===============
parser = et.XMLParser(remove_blank_text=True)
tree = et.parse(item, parser)
root = tree.getroot()

no_dups = {}

for row in tree.iter("wpt"):
    name, lat, lon = [row.find("name").text] + row.attrib.values()
    if name not in no_dups:
        no_dups[name] = lat, lon

print(no_dups)
===============
On 10 Aug 2022, at 9:53, Gilles wrote:
An important line was missing :-/
no_dups = {}

for row in tree.iter("wpt"):
    name, lat, lon = [row.find("name").text] + row.attrib.values()
    if name not in no_dups:
        no_dups[name] = lat, lon
    else:
        # dup = remove
        row.getparent().remove(row)
Yes, this should work. However, I don't know if adjusting the tree while looping over it won't cause the same kind of problems as with other sequences in Python. How many elements are there in your tree? Memory use in XML can get very expensive so combining iterparse with xmlfile would be an alternative.

Also, if you're only interested in duplicate names, use a set rather than a dictionary.

Charlie
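If mutating the tree while iterating over it does turn out to be a problem, a two-pass variant sidesteps it: collect the duplicates first, then remove them afterwards. A sketch along the lines of the code above (the sample tree is made up):

```python
from lxml import etree

tree = etree.fromstring('''<gpx>
  <wpt lat="1" lon="2"><name>a</name></wpt>
  <wpt lat="1" lon="2"><name>b</name></wpt>
  <wpt lat="3" lon="4"><name>c</name></wpt>
</gpx>''')

seen = set()
dups = []
for wpt in tree.iter('wpt'):
    key = (wpt.get('lat'), wpt.get('lon'))
    if key in seen:
        dups.append(wpt)          # remember it; don't touch the tree yet
    else:
        seen.add(key)

for wpt in dups:                  # second pass: now mutation is safe
    wpt.getparent().remove(wpt)

print([w.findtext('name') for w in tree.iter('wpt')])
# ['a', 'c']
```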
On 10/08/2022 13:30, Charlie Clark wrote:
Yes, this should work. However, I don't know if adjusting the tree while looping over it won't cause the same kind of problems as with other sequences in Python. How many elements are there in your tree? Memory use in XML can get very expensive so combining iterparse with xmlfile would be an alternative. Also, if you're only interested in duplicate names, use a set rather than a dictionary.
Just a few tens, so performance isn't an issue for me. Indeed, I changed the code since I don't actually care about the three infos, just whether that block has already been seen or not:

#remove dups
no_dups = []

for row in tree.iter("wpt"):
    lat, lon = row.attrib.values()
    if lat not in no_dups:
        no_dups.append(lat)
    else:
        row.getparent().remove(row)
Gilles schrieb am 10.08.22 um 15:20:
for row in tree.iter("wpt"):
    lat, lon = row.attrib.values()
Note that this assignment depends on the order of the two attributes in the XML document, i.e. in data that you may not control yourself. It will break if the provider of your input documents ever decides to change the order. I'd probably just use

    lat, lon = row.get('lat'), row.get('lon')

Also:
#remove dups
no_dups = []

for row in tree.iter("wpt"):
    lat, lon = row.attrib.values()
    if lat not in no_dups:
        no_dups.append(lat)
    else:
        row.getparent().remove(row)
You're using a list here instead of a set. It might be that a list is faster for very small amounts of data, but I'd expect a set to win quite quickly. Regardless of my guessing, you shouldn't be using a list here unless benchmarking tells you that it's faster. And if you do, you'd better add a comment for the reasoning. It's just too surprising to see this implemented with a list, so readers will end up wasting their time thinking more into it than there is.

Stefan
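A quick way to check that guess on one's own machine with timeit (the sizes and numbers here are arbitrary, and absolute timings vary; only the relative order matters):

```python
import timeit

# Membership test: a list scans elements one by one, a set hashes once.
items = [str(i) for i in range(1000)]
as_list = list(items)
as_set = set(items)

# '999' sits at the end of the list, so this is the list's worst case.
t_list = timeit.timeit(lambda: '999' in as_list, number=10_000)
t_set = timeit.timeit(lambda: '999' in as_set, number=10_000)
print(f'list: {t_list:.4f}s  set: {t_set:.4f}s')
```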
On 10/08/2022 16:34, Stefan Behnel wrote:
Gilles schrieb am 10.08.22 um 15:20:
for row in tree.iter("wpt"):
    lat, lon = row.attrib.values()
Note that this assignment depends on the order of the two attributes in the XML document, i.e. in data that you may not control yourself. It will break if the provider of your input documents ever decides to change the order.
I'd probably just use
lat, lon = row.get('lat'), row.get('lon')
Also:
#remove dups
no_dups = []

for row in tree.iter("wpt"):
    lat, lon = row.attrib.values()
    if lat not in no_dups:
        no_dups.append(lat)
    else:
        row.getparent().remove(row)
You're using a list here instead of a set. It might be that a list is faster for very small amounts of data, but I'd expect a set to win quite quickly. Regardless of my guessing, you shouldn't be using a list here unless benchmarking tells you that it's faster. And if you do, you'd better add a comment for the reasoning. It's just too surprising to see this implemented with a list, so readers will end up wasting their time thinking more into it than there is.
Thanks for the tips.

#remove dups
no_dups = set()

for row in tree.iter("wpt"):
    key = (row.get('lat'), row.get('lon'))
    if key not in no_dups:
        no_dups.add(key)
    else:
        row.getparent().remove(row)
participants (6)
- Adrian Bool
- Charlie Clark
- Gilles
- Majewski, Steven Dennis (sdm7g)
- Stefan Behnel
- theodoreevans2410@gmail.com