xml.etree and namespaces -- why?
axy at declassed.art
Wed Oct 19 11:23:56 EDT 2022
I mean, it's worth looking at the BeautifulSoup source to see how they do
that. With BS I work with attributes exactly as you want, and I explicitly
tell BS to use the lxml parser.
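If you would rather stay with the standard library: the prefix declarations
are not lost entirely. The parser reports them as 'start-ns' events if you
ask for them via ET.iterparse(). Below is a minimal sketch, not a definitive
recipe. The helper names read_with_nsmap() and expand() and the cut-down
SAMPLE document are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def read_with_nsmap(xml_text):
    """Parse xml_text; return (root, {prefix: uri}) built from the xmlns declarations.

    Hypothetical helper. The default namespace is stored under the key ''.
    """
    nsmap = {}
    root = None
    events = ET.iterparse(io.StringIO(xml_text), events=('start', 'start-ns'))
    for event, obj in events:
        if event == 'start-ns':
            prefix, uri = obj          # one event per xmlns declaration
            nsmap[prefix] = uri
        elif root is None:
            root = obj                 # the first 'start' event is the root element
    return root, nsmap                 # the tree is complete once the loop has ended


def expand(name, nsmap):
    """Turn 'prefix:local' into '{uri}local'; unprefixed names are returned as-is."""
    prefix, sep, local = name.partition(':')
    if not sep:
        return name
    return '{%s}%s' % (nsmap[prefix], local)


# Cut-down stand-in for the Inkscape file (assumption, not the full document).
SAMPLE = '''<svg xmlns="http://www.w3.org/2000/svg"
     xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape">
  <g inkscape:label="Ebene 1" id="layer1"/>
</svg>'''

root, nsmap = read_with_nsmap(SAMPLE)
layer = root.find('svg:g', {'svg': nsmap['']})
print(layer.get(expand('inkscape:label', nsmap)))
```

With that, e.get(expand('inkscape:label', nsmap)) is about as close to
e.get('inkscape:label') as plain etree gets.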
On 19/10/2022 14:25, Robert Latest via Python-list wrote:
> Hi all,
> For the impatient: Below the longish text is a fully self-contained Python
> example that illustrates my problem.
> I'm struggling to understand xml.etree's handling of namespaces. I'm trying to
> parse an Inkscape document which uses several namespaces. From etree's
> documentation:
> If the XML input has namespaces, tags and attributes with prefixes in the
> form prefix:sometag get expanded to {uri}sometag where the prefix is
> replaced by the full URI.
> Which means that given an Element e, I cannot directly access its attributes
> using e.get() because in order to do that I need to know the URI of the
> namespace. So rather than doing this (see example below):
> label = e.get('inkscape:label')
> I need to do this:
> label = e.get('{' + uri_inkscape_namespace + '}label')
> ...which is the method mentioned in etree's docs:
> One way to search and explore this XML example is to manually add the URI
> to every tag or attribute in the xpath of a find() or findall().
> [...]
> A better way to search the namespaced XML example is to create a
> dictionary with your own prefixes and use those in the search functions.
> Good idea! Better yet, that dictionary or rather, its reverse, already exists,
> because etree has used it to unnecessarily mangle the namespaces in the first
> place. The documentation doesn't mention where it can be found, but we can
> just use the 'xmlns:' attributes of the <svg> root element to rebuild it. Or
> so I thought, until I found out that etree deletes exactly these attributes
> before handing the <svg> element to the user.
> I'm really stumped here. Apart from the fact that I think XML is bloated shit
> anyway and has no place outside HTML, I just don't get the purpose of etree's
> way of working:
> 1) Evaluate 'xmlns:' attributes of the <svg> element
> 2) Use that info to replace the existing prefixes by {uri}
> 3) Realizing that using {uri} prefixes is cumbersome, suggest to
> the user to build their own prefix -> uri dictionary
> to undo the effort of doing 1) and 2)
> 4) ...but withholding exactly the information that existed in the original
> document by deleting the 'xmlns:' attributes from the <svg> tag
> Why didn't they leave the whole damn thing alone? Keep <svg> intact and keep
> the attribute 'prefix:key' literally as they are. For anyone wanting to use
> the {uri} prefixes (why would they) they could have thrown in a helper
> function for the prefix->URI translation.
> I'm assuming that etree's designers knew what they were doing in order to make
> my life easier when dealing with XML. Maybe I'm missing the forest for the
> trees. Can anybody enlighten me? Thanks!
> #### self-contained example
> import xml.etree.ElementTree as ET
> def test_svg(xml):
>     root = ET.fromstring(xml)
>     for e in root.iter():
>         print(e.tag) # tags are shown prefixed with {URI}
>         if e.tag.endswith('svg'):
>             # Since namespaces are defined inside the <svg> tag, let's use the info
>             # from the 'xmlns:' attributes to undo etree's URI prefixing
>             print('Element <svg>:')
>             for k, v in e.items():
>                 print(' %s: %s' % (k, v))
>             # ...but alas: the 'xmlns:' attributes have been deleted by the parser
> xml = '''<?xml version="1.0" encoding="UTF-8" standalone="no"?>
> <!-- Created with Inkscape (http://www.inkscape.org/) -->
> <svg
> width="210mm"
> height="297mm"
> viewBox="0 0 210 297"
> version="1.1"
> id="svg285"
> inkscape:version="1.2.1 (9c6d41e410, 2022-07-14)"
> sodipodi:docname="test.svg"
> xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
> xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd"
> xmlns="http://www.w3.org/2000/svg"
> xmlns:svg="http://www.w3.org/2000/svg">
> <sodipodi:namedview
> id="namedview287"
> pagecolor="#ffffff"
> bordercolor="#000000"
> borderopacity="0.25"
> inkscape:showpageshadow="2"
> inkscape:pageopacity="0.0"
> inkscape:pagecheckerboard="0"
> inkscape:deskcolor="#d1d1d1"
> inkscape:document-units="mm"
> showgrid="false"
> inkscape:zoom="0.2102413"
> inkscape:cx="394.78447"
> inkscape:cy="561.25984"
> inkscape:window-width="1827"
> inkscape:window-height="1177"
> inkscape:window-x="85"
> inkscape:window-y="-8"
> inkscape:window-maximized="1"
> inkscape:current-layer="layer1" />
> <defs
> id="defs282" />
> <g
> inkscape:label="Ebene 1"
> inkscape:groupmode="layer"
> id="layer1">
> <rect
> style="fill:#aaccff;stroke-width:0.264583"
> id="rect289"
> width="61.665253"
> height="54.114403"
> x="33.978813"
> y="94.38559" />
> </g>
> </svg>
> '''
> if __name__ == '__main__':
>     test_svg(xml)
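
As for the prefix dictionary the docs suggest (quoted above), it looks
roughly like the sketch below. Note that the namespaces argument is taken by
find() and findall(), not by Element.get(), so attribute names still have to
be spelled in the {uri}name form. The NS dict, the prefixes chosen in it and
the shortened DOC string are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hand-written prefix -> URI map; the prefixes are our own choice and do not
# have to match the ones used in the document.
NS = {
    'svg': 'http://www.w3.org/2000/svg',
    'inkscape': 'http://www.inkscape.org/namespaces/inkscape',
}

# Shortened stand-in for the Inkscape document.
DOC = '''<svg xmlns="http://www.w3.org/2000/svg"
     xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape">
  <g inkscape:label="Ebene 1" inkscape:groupmode="layer" id="layer1">
    <rect id="rect289" width="61.665253" height="54.114403"/>
  </g>
</svg>'''

root = ET.fromstring(DOC)

found = []
for g in root.findall('svg:g', NS):
    # get() knows nothing about NS, so build the expanded name by hand
    label = g.get('{%s}label' % NS['inkscape'])
    rect_ids = [r.get('id') for r in g.findall('svg:rect', NS)]
    found.append((label, rect_ids))

print(found)
```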
