Line Number of a Start Tag
Hi,

assume I have the following Python 3 code:

------------------------------
import io
from lxml import etree

source = """<?xml version="1.0"?>
<article version="5.0" xml:lang="en"
         xmlns="http://docbook.org/ns/docbook"
         xmlns:xlink="http://www.w3.org/1999/xlink">
  <title>...</title>
  <para>...</para>
</article>
"""

tree = etree.parse(io.StringIO(source))
root = tree.getroot()
print(root.sourceline)
------------------------------

When I run the above code, I get "4" as a result. This is a bit unexpected.

It seems that root.sourceline returns the line number where the start tag _ends_. However, I need the line number where <article> _starts_ (in this example, "2").

How can I get the "starting" line number of a start tag?

Thanks!

--
Gruß/Regards,
Thomas Schraitle
Thomas Schraitle wrote on 21.05.2015 at 10:28:
> assume I have the following Python 3 code:
>
> ------------------------------
> import io
> from lxml import etree
>
> source = """<?xml version="1.0"?>
> <article version="5.0" xml:lang="en"
>          xmlns="http://docbook.org/ns/docbook"
>          xmlns:xlink="http://www.w3.org/1999/xlink">
>   <title>...</title>
>   <para>...</para>
> </article>
> """
>
> tree = etree.parse(io.StringIO(source))
> root = tree.getroot()
> print(root.sourceline)
> ------------------------------
>
> When I run the above code, I get "4" as a result. This is a bit unexpected.
>
> It seems, root.sourceline returns the line number where the start tag _ends_. However, I need to get the line number where <article> _starts_ (here in this example "2").
It seems that this behaviour applies only to the root element, though:
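"""
In [15]: source = '''<?xml version="1.0"?>
   ....: <article version="5.0" xml:lang="en"
   ....: xmlns="http://docbook.org/ns/docbook"
   ....: xmlns:xlink="http://www.w3.org/1999/xlink">
   ....: <title>...
   ....: </title>
   ....: <para>
   ....: ...</para>
   ....: </article>
   ....: '''

In [16]: root = etree.fromstring(source)

In [17]: print(root.sourceline)
4

In [18]: print(root[0].sourceline)
5

In [19]: print(root[1].sourceline)
7
"""

Does this pose a problem in practice?

Stefan
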
Hi Stefan,
thanks for your answer. :-)
On Thu, 21 May 2015 10:48:18 +0200, Stefan Behnel wrote:
[...]
>> When I run the above code, I get "4" as a result. This is a bit unexpected.
>>
>> It seems, root.sourceline returns the line number where the start tag _ends_. However, I need to get the line number where <article> _starts_ (here in this example "2").
>
> It seems that this behaviour applies only to the root element, though:
>
> """
> In [15]: source = '''<?xml version="1.0"?>
>    ....: <article version="5.0" xml:lang="en"
>    ....: xmlns="http://docbook.org/ns/docbook"
>    ....: xmlns:xlink="http://www.w3.org/1999/xlink">
>    ....: <title>...
>    ....: </title>
>    ....: <para>
>    ....: ...</para>
>    ....: </article>
>    ....: '''
>
> In [16]: root = etree.fromstring(source)
>
> In [17]: print(root.sourceline)
> 4
>
> In [18]: print(root[0].sourceline)
> 5
>
> In [19]: print(root[1].sourceline)
> 7
> """
>
> Does this pose a problem in practice?

Well, yes, it does. :)

For example, I need to remove the whole prolog (XML declaration, DOCTYPE, and optional comments) of an XML file. I know this sounds strange, but for the time being let's assume I have a valid reason. ;)

To remove the prolog, my idea was to get the line number of the root's start-tag. With that information, I can strip the complete prolog. Unfortunately, sourceline gives me the line number where the start-tag _ends_, so cutting there leaves a syntactically incorrect start-tag. So my idea doesn't work.

Maybe there is a better method to remove the prolog of an XML file, but this is the only one I found.

Any idea?

--
Gruß/Regards,
Thomas Schraitle
Thomas Schraitle wrote on 21.05.2015 at 11:13:
> For example, I need to remove the whole prolog (XML declaration, DOCTYPE, and optional comments) of an XML file. I know, this sounds strange, but for the time being let's assume I have a valid reason. ;)
>
> To remove the prolog, my idea was to get the line number of the root's start-tag. With that information, I can strip the complete prolog. Unfortunately, it gives me the line number where it _ends_ which makes the start-tag syntactically incorrect. So my idea doesn't work.
>
> Maybe there is a better method to remove the prolog of an XML file, but I only found this one.
>
> Any idea?

Why not serialise only the root element?

Stefan
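
For readers following along, the suggestion above might look roughly like the sketch below. It assumes that "serialise only the root element" means passing the root element (rather than the whole tree) to etree.tostring(); the sample document and the variable names are made up for illustration.

```python
from lxml import etree

# Hypothetical sample document with a prolog (declaration, DOCTYPE, comment).
data = b"""<?xml version="1.0"?>
<!DOCTYPE article>
<!-- prolog comment -->
<article version="5.0" xml:lang="en"
         xmlns="http://docbook.org/ns/docbook">
  <title>...</title>
</article>
"""

root = etree.fromstring(data)

# Serialising the element (not the ElementTree) emits only
# <article>...</article>; the XML declaration, the DOCTYPE and the
# leading comment are left out.
body = etree.tostring(root, encoding="unicode")
print(body)
```

Note that the result is re-serialised by lxml, so the original layout inside the start tag (line breaks between attributes, for instance) is not necessarily preserved.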
Hi,
On Thu, 21 May 2015 11:18:38 +0200, Stefan Behnel wrote:
[...]
>> Maybe there is a better method to remove the prolog of an XML file, but I only found this one.
>>
>> Any idea?
>
> Why not serialise only the root element?

Well, I'm not sure I understood you correctly, but I don't think this will work. The bigger problem is entities.

I need to (temporarily) remove the DOCTYPE because I want to retain the complete prolog, fix something in my XML, and write back the prolog plus the fixed XML. Unfortunately, this cumbersome method is needed as I don't want to touch the DOCTYPE header. For example, if I have something like this:

<!DOCTYPE article [
  <!ENTITY % entities SYSTEM "entity-decl.ent">
  %entities;
]>

the complete content of my "entity-decl.ent" file is read in and will appear in the internal subset of the DTD after I write it to a file. As I want to keep the DOCTYPE as it is, I have to remove the DOCTYPE declaration somehow. For that reason, I need to know the exact(!) line number of the start-tag to remove the whole prolog.

Unfortunately, I haven't found a way to tell lxml "leave the DOCTYPE as it is, never touch it".

Does that make sense to you?

--
Gruß/Regards,
Thomas Schraitle
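
One possible way to get at the prolog without relying on sourceline pointing at the first line of the start tag is sketched below. It is only a sketch: split_prolog is a hypothetical helper, not an lxml function. It assumes the raw bytes of the file are at hand, and it uses sourceline merely as an upper bound, searching backwards from the end of that line for the opening "<article". It would pick the wrong position if a nested element with the same name started on that same line.

```python
import re
from lxml import etree


def split_prolog(data: bytes):
    """Return (prolog, body), split where the root start tag begins.

    Hypothetical helper for illustration, not part of lxml.
    """
    root = etree.fromstring(data)
    local = etree.QName(root).localname
    name = f"{root.prefix}:{local}" if root.prefix else local

    # Byte offset just past the line reported by sourceline. Whether that
    # line is where the start tag begins or ends, "<name" cannot start later.
    lines = data.split(b"\n")
    limit = sum(len(line) + 1 for line in lines[:root.sourceline])

    # Take the last "<name" followed by whitespace, "/" or ">" before limit,
    # so that mentions of the tag inside prolog comments are skipped.
    pattern = re.compile(b"<" + re.escape(name.encode()) + rb"(?=[\s/>])")
    matches = list(pattern.finditer(data, 0, limit))
    if not matches:
        raise ValueError("root start tag not found")
    start = matches[-1].start()
    return data[:start], data[start:]


# Made-up sample input; the comment deliberately mentions <article>.
data = b"""<?xml version="1.0"?>
<!DOCTYPE article>
<!-- the <article> element follows -->
<article version="5.0" xml:lang="en"
         xmlns="http://docbook.org/ns/docbook">
  <title>...</title>
</article>
"""

prolog, body = split_prolog(data)
print(body.decode())
```

The prolog bytes are returned untouched, so they could be written back verbatim in front of the fixed XML.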