can I tell lxml to ignore xmlids?
Duplicate xml:ids have a way of creeping into my 60,000 documents. The ids keep a document from parsing, which is helpful in drawing attention to errors, but it makes it harder to correct them. I work with documents in the TEI namespace, and I have a very kludgy workaround: I comment out the reference to the schema and change 'xml:id' to 'xmlom'. Then I can loop through the document and fix errors with a script.

There must be a more elegant way to do this. Is there a way of telling lxml: "never mind the duplicate IDs, just carry on"? Then I could toggle between a script that cares or doesn't care about duplicate IDs.

With thanks in advance for any help,

Martin Mueller
Professor emeritus of English and Classics
Northwestern University
Hi Martin,

Unique IDs are written into the constraints of the XML specification itself (section 3.3.1, Attribute types). However, you can tell the XML parser not to care about IDs. I'm not sure this is a useful option: with renaming, processing, and renaming back, you at least know exactly what is going on. If you have a document like this:
from lxml import etree
s = '<a><b xml:id="id1"/><c xml:id="id1"/></a>'
containing the same ID twice, this fails:
etree.XML(s)

Traceback (most recent call last):
  ...
lxml.etree.XMLSyntaxError: ID id1 already defined, line 1, column 36
But I can define a parser with the option collect_ids set to False, like this:
myparser = etree.XMLParser(collect_ids=False)
use it to parse my document s:
tree = etree.XML(s, parser=myparser)
and everything seems fine:
etree.dump(tree)

<a>
  <b xml:id="id1"/>
  <c xml:id="id1"/>
</a>
As I said, this is a path not often taken; proceed with caution.

Jens
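Putting the steps above together, a minimal self-contained sketch (the `strict` flag and the function name are my own, not part of lxml) might look like this:

```python
from lxml import etree

def parse_xml(data, strict=True):
    """Parse XML; with strict=False, duplicate xml:id values are tolerated."""
    if strict:
        return etree.XML(data)  # raises XMLSyntaxError on duplicate IDs
    lenient = etree.XMLParser(collect_ids=False)  # skip the ID hash table
    return etree.XML(data, parser=lenient)

s = '<a><b xml:id="id1"/><c xml:id="id1"/></a>'
tree = parse_xml(s, strict=False)
```

This gives the toggle Martin asked for: the same script can run in a mode that rejects duplicate IDs or one that carries on past them.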
Alas, setting the parser to collect_ids=False does not solve the problem. It generates the same error message:

File "/users/martinmueller/dropbox/earlyprint/ecco-evans/evans-2020-03-02/N01868.xml", line 3663
lxml.etree.XMLSyntaxError: ID N01868-0011-3105 already defined, line 3663, column 74
Does your parser work with Jens' example? If so, I'd suggest you post a small sample from one of your files.

Charlie

--
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Sengelsweg 34
D-40489 Düsseldorf
Tel: +49-203-3925-0390
Mobile: +49-178-782-6226
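Once a file does load with the lenient parser, Martin's "fix errors with a script" step needs to know which IDs are duplicated. A sketch along these lines (the function name is mine, and it assumes collect_ids=False actually gets past the duplicates for the file in question) could report them:

```python
from lxml import etree

# Expanded (Clark) name of the xml:id attribute.
XML_ID = '{http://www.w3.org/XML/1998/namespace}id'

def find_duplicate_ids(source):
    """Return the xml:id values that occur more than once in the document."""
    parser = etree.XMLParser(collect_ids=False)  # don't stop at duplicate IDs
    tree = etree.parse(source, parser)
    seen, dupes = set(), set()
    for el in tree.iter():
        xid = el.get(XML_ID)
        if xid is not None:
            if xid in seen:
                dupes.add(xid)
            seen.add(xid)
    return sorted(dupes)
```

`etree.parse` accepts a filename or a file-like object, so the same function works on files on disk or on in-memory test data.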