[lxml-dev] Problem with ":" char in tag names
I've been using lxml and think it is great, but ...

I recently installed lxml-1.3.3. Now I find that the following gives me an error:

    In [3]: from lxml import etree
    In [4]: etree.Element('abc:def')
    ------------------------------------------------------------
    Traceback (most recent call last):
      File "<ipython console>", line 1, in <module>
      File "etree.pyx", line 1801, in etree.Element
      File "apihelpers.pxi", line 101, in etree._makeElement
      File "apihelpers.pxi", line 723, in etree._getNsTag
    ValueError: Invalid tag name

It's because of the ":" in the tag name.

That's critical for me, because I use lxml in my rst2odt project to produce OpenOffice ODF .odt files. See: http://www.rexx.com/~dkuhlman/odtwriter.html

An ODF/.odt file is a zipped archive of XML files. Those XML files contain many tags that contain colons.

Here are the relevant portions of the XML spec, I believe:

    http://www.w3.org/TR/2006/REC-xml11-20060816/#sec-starttags
    http://www.w3.org/TR/2006/REC-xml11-20060816/#NT-Name

Aren't I correct that a colon should be allowed in a tag name?

In apihelpers.pxi, it looks like the following lines were added in lxml version 1.3.3 and which I believe are raising the exception:

    elif cstd.strchr(c_tag, c':') is not NULL:
        raise ValueError, "Invalid tag name"

Is there a reason for that?

Hoping for enlightenment.

Dave

--
Dave Kuhlman
http://www.rexx.com/~dkuhlman
Hi David, Dave Kuhlman wrote:
I've been using lxml and think it is great
:)
, but ...
;) I just knew there was more to come...
I recently installed lxml-1.3.3. Now I find that the following gives me an error:
    In [3]: from lxml import etree
    In [4]: etree.Element('abc:def')
    ------------------------------------------------------------
    Traceback (most recent call last):
      File "<ipython console>", line 1, in <module>
      File "etree.pyx", line 1801, in etree.Element
      File "apihelpers.pxi", line 101, in etree._makeElement
      File "apihelpers.pxi", line 723, in etree._getNsTag
    ValueError: Invalid tag name
It's because of the ":" in the tag name.
That's critical for me, because I use lxml in my rst2odt project to produce OpenOffice ODF .odt files. See: http://www.rexx.com/~dkuhlman/odtwriter.html
An ODF/.odt file is a zipped archive of XML files. Those XML files contain many tags that contain colons.
Here are the relevant portions of the XML spec, I believe:
http://www.w3.org/TR/2006/REC-xml11-20060816/#sec-starttags http://www.w3.org/TR/2006/REC-xml11-20060816/#NT-Name
Aren't I correct that a colon should be allowed in a tag name?
In apihelpers.pxi, it looks like the following lines were added in lxml version 1.3.3 and which I believe are raising the exception:
    elif cstd.strchr(c_tag, c':') is not NULL:
        raise ValueError, "Invalid tag name"
Is there a reason for that?
lxml (read: libxml2) supports XML 1.0 (don't think there were any relevant changes in 1.1, which you cite above) and is generally namespace aware. This means that ":" is considered a separator between a namespace prefix and the tag name, and is therefore not allowed as part of a plain (namespace-less) tag name.

You mentioned ODF, which is heavily based on namespaces, and AFAIA, it doesn't use prefixes for anything but namespace references. So you should be fine with the general namespace support in lxml.etree.

http://codespeak.net/lxml/dev/tutorial.html#namespaces

Does that 'enlighten' you? :)

Stefan
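For readers following along, here is a minimal sketch of what the namespace-aware approach might look like for ODF-style tags. The two URIs are the ODF "office" and "text" namespaces; the constant and variable names are illustrative assumptions, not taken from rst2odt.

```python
from lxml import etree

# ODF namespace URIs; the constant names are illustrative only.
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
NSMAP = {"office": OFFICE_NS, "text": TEXT_NS}

# A colon in a plain tag name is rejected ...
try:
    etree.Element("text:p")
    colon_rejected = False
except ValueError:
    colon_rejected = True

# ... so name the element by its namespace URI and declare the prefixes once.
body = etree.Element("{%s}body" % OFFICE_NS, nsmap=NSMAP)
para = etree.SubElement(body, "{%s}p" % TEXT_NS)
para.text = "Hello"

# The serializer writes the declared prefixes, e.g. <text:p>Hello</text:p>.
serialized = etree.tostring(body).decode("ascii")
print(serialized)
```

The prefixes only appear at serialization time; in the tree, elements are identified by namespace URI plus local name.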
Hey,
lxml (read: libxml2) supports XML 1.0 (don't think there were any relevant changes in 1.1, which you cite above) and is generally namespace aware. This means that ":" is considered a separator between a namespace prefix and the tag name, and is therefore not allowed as part of a plain (namespace-less) tag name.
What used to happen if you put a colon in a tag name? What would people expect to happen?

I wonder whether it'd be possible to support namespace prefixes the proper way this way. I.e. if I write:

    Element('foo:bar', nsmap={'foo': 'blah'})

that could be equivalent to:

    Element('{blah}bar', nsmap={'foo': 'blah'})

The nice thing is that you could avoid having to write '{%s}foo' % my_namespace a lot.

Of course this has consequences for other areas, such as 'tag', so I'm not sure whether this is a good idea, but throwing it in.

It's definitely another extension on ElementTree, which can't really do this kind of stuff well due to the lack of parent pointers.

Regards,

Martijn
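The equivalence proposed here can be approximated in user code without changing Element() itself. The following is only a sketch: `clark_name` is a hypothetical helper, not an lxml function, and it simply rewrites a prefixed name into the "{uri}local" form before the name is passed on.

```python
def clark_name(qname, nsmap):
    """Translate 'prefix:local' into '{uri}local' using nsmap.

    Hypothetical helper, not part of lxml. Names that are already in
    '{uri}local' form, or that contain no colon, are returned unchanged.
    An unknown prefix raises KeyError.
    """
    if qname.startswith("{"):
        return qname
    prefix, sep, local = qname.partition(":")
    if not sep:
        return qname
    return "{%s}%s" % (nsmap[prefix], local)


nsmap = {"foo": "blah"}
tag = clark_name("foo:bar", nsmap)
print(tag)  # {blah}bar
```

The result can then be passed to Element() together with the same nsmap, which keeps the translation explicit and outside the library.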
Martijn Faassen wrote:
lxml (read: libxml2) supports XML 1.0 (don't think there were any relevant changes in 1.1, which you cite above) and is generally namespace aware. This means that ":" is considered a separator between a namespace prefix and the tag name, and is therefore not allowed as part of a plain (namespace-less) tag name.
What used to happen if you put a colon in a tag name? What would people expect to happen?
Well, lxml.etree previously accepted those as part of a tag name. This means that you could do this:

    >>> root = etree.Element("some:root")
    >>> print etree.tostring(root)
    <some:root/>

which allowed you to use namespace prefixes without declaring namespaces, i.e. it really helps you in writing out broken XML.

It also allowed you to do this, which I think people did:

    >>> root = etree.XML('<p:root xmlns:p="http://whatever/"/>')
    >>> root.append( etree.Element("p:other") )
    >>> print etree.tostring(root)
    <p:root xmlns:p="http://whatever/"><p:other/></p:root>

Looks correct, right? However, it nicely breaks all namespace aware XML stuff that works on the in-memory tree:

    >>> print root, root[0]
    <Element {http://whatever/}root at b7624e3c> <Element p:other at b792b93c>
    >>> print root.xpath("//p:other")
    Traceback (most recent call last):
    ...
    etree.XPathEvalError: Undefined namespace prefix
    >>> print root.xpath("//p:other", {"p":"http://whatever/"})
    []

So raising an exception here *really* prevents a lot of pitfalls and helps people fix their programs.
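lxml releases that include the check will not build such a tree any more, but the standard library's xml.etree.ElementTree still accepts a colon in a tag name, so the same pitfall can be sketched there. Note that this swaps in ElementTree and its findall() for lxml and xpath(); it illustrates the mismatch, it is not a reproduction of the session above.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<p:root xmlns:p="http://whatever/"/>')
ns = {"p": "http://whatever/"}

# A tag that merely looks prefixed: it is a plain name with no namespace.
fake = ET.Element("p:other")
root.append(fake)
found_before = root.findall("p:other", ns)

# A properly namespaced child is what a namespace-aware lookup matches.
proper = ET.SubElement(root, "{http://whatever/}other")
found_after = root.findall("p:other", ns)

print(root.tag, fake.tag, proper.tag)
print(found_before, found_after)
```

The lookup ignores `fake` because its tag is the literal string "p:other", while the search is for the local name "other" in the namespace "http://whatever/".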
I wonder whether it'd be possible to support namespace prefixes the proper way this way. I.e if I write:
Element('foo:bar', nsmap={'foo': 'blah'})
that could be equivalent to:
Element('{blah}bar', nsmap={'foo': 'blah'})
No. There should be one way to do this. We already use prefixes in XPath, which causes a lot of annoyance for new users.

BTW, this is an extremely rare use pattern. Normally, you would either work on an XML document that already comes with its pre-defined prefixes, or you would define an nsmap once (as you show above) and then stick to using SubElement(..., "{ns}tag") without redefining the prefixes.

Note that lxml nicely reassigns prefixes now when inserting an element into an existing tree, so there really is no need to assign prefixes more than once (if at all).
The nice thing is that you could avoid having to write '{%s}foo' % my_namespace a lot.
Feel free to assign it to a global constant or to use the E factory as in lxml.html.builder.
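Both suggestions can be sketched briefly. This assumes a current lxml, where lxml.builder.ElementMaker is the general factory of the kind that lxml.html.builder's E is built on; the namespace URI, prefix and constant names are hypothetical.

```python
from lxml import etree
from lxml.builder import ElementMaker

# Hypothetical namespace URI and prefix, for illustration only.
MY_NS = "http://example.com/blah"
NSMAP = {"foo": MY_NS}

# Option 1: build the "{namespace}tag" string once and reuse the constant.
BAR = "{%s}bar" % MY_NS
root = etree.Element("{%s}root" % MY_NS, nsmap=NSMAP)
child = etree.SubElement(root, BAR)

# Option 2: an E factory bound to the namespace, so tags are attribute names.
E = ElementMaker(namespace=MY_NS, nsmap=NSMAP)
tree = E.root(E.bar("some text"))

print(etree.tostring(root).decode("ascii"))
print(etree.tostring(tree).decode("ascii"))
```

Either way the namespace URI is written once, and the prefix is a serialization detail carried by the nsmap.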
Of course this has consequences for other areas, such as 'tag', so I'm not sure whether this is a good idea, but throwing it in.
Right, it would let ".tag" return something other than what you passed into the Element() function.
It's definitely another extension on ElementTree, which can't really do this kind of stuff well due to the lack of parent pointers.
Right, so it would unnecessarily add an additional namespace definition pattern that is not supported by ET and at the same time allow the pitfalls that the users who reported the problem currently run into. Meaning: it would let people write programs that would stop working the day they wanted to switch to ET or the day they started using XPath. Great.

No, this change is definitely a bug fix. I'm sorry for people who were not aware of this bug in the past and accidentally misused it, but this has to change.

Stefan
Dave Kuhlman wrote:
I've been using lxml and think it is great, but ...
I recently installed lxml-1.3.3. Now I find that the following gives me an error:
    In [3]: from lxml import etree
    In [4]: etree.Element('abc:def')
    ------------------------------------------------------------
    Traceback (most recent call last):
      File "<ipython console>", line 1, in <module>
      File "etree.pyx", line 1801, in etree.Element
      File "apihelpers.pxi", line 101, in etree._makeElement
      File "apihelpers.pxi", line 723, in etree._getNsTag
    ValueError: Invalid tag name
It's because of the ":" in the tag name.
As another data point: by coincidence yesterday I saw a discussion of some other project who also ran into this problem. http://groups.google.com/group/html5lib-discuss/browse_thread/thread/9997a24... No idea about the context there. Regards, Martijn
Martijn Faassen wrote:
As another data point: by coincidence yesterday I saw a discussion of some other project who also ran into this problem.
http://groups.google.com/group/html5lib-discuss/browse_thread/thread/9997a24...
No idea about the context there.
Hmmm, I really wonder how many people used this 'feature' to work around having to implement proper namespace support... Stefan
Stefan Behnel wrote:
Hmmm, I really wonder how many people used this 'feature' to work around having to implement proper namespace support...
One of the places where this recently came up is that Facebook is using markup with fb:*: http://wiki.developers.facebook.com/index.php/FBML -- Ian Bicking : ianb@colorstudy.com : http://blog.ianbicking.org : Write code, do good : http://topp.openplans.org/careers
Ian Bicking wrote:
One of the places where this recently came up is that Facebook is using markup with fb:*: http://wiki.developers.facebook.com/index.php/FBML
No, they are not. They are using a well-defined namespace: http://wiki.developers.facebook.com/index.php/FBML_DTD If you use unnamespaced "fb:*" tag names here, you will also break validation against their XSD. Stefan
Martijn Faassen wrote:
As another data point: by coincidence yesterday I saw a discussion of some other project who also ran into this problem.
http://groups.google.com/group/html5lib-discuss/browse_thread/thread/9997a24...
Hmmm, I don't know. Maybe we should revert the behaviour for 1.3.4 and just keep it for 2.0, which actually tests tag names against the spec instead of just looking for ':'. Projects that use those tag names are now aware that this is not supposed to be allowed (as the link above suggests), so changing the behaviour in 2.0 gives them the time to fix their software. We could maybe raise a Warning if we encounter problematic usage. At least, I would make it clear in the release notes that this is *only* for temporary convenience. Opinions? Stefan
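The "raise a Warning" idea could look roughly like the following. This is a sketch under stated assumptions: `check_tag_name` is a hypothetical stand-in written in plain Python, not the check in apihelpers.pxi, and the choice of DeprecationWarning is an assumption as well.

```python
import warnings


def check_tag_name(tag):
    """Warn (rather than raise) when the local name contains a colon.

    Hypothetical stand-in for the check discussed in this thread; it is
    not lxml's implementation.
    """
    # Skip a leading "{namespace-uri}" part, which may itself contain colons.
    local = tag.rpartition("}")[2] if tag.startswith("{") else tag
    if ":" in local:
        warnings.warn(
            "Invalid tag name %r: ':' is reserved for namespace prefixes" % tag,
            DeprecationWarning,
            stacklevel=2,
        )
    return tag


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_tag_name("abc:def")          # warns
    check_tag_name("{http://x/}def")   # accepted silently

print([str(w.message) for w in caught])
```

A warning keeps existing programs running for a transition period while still pointing at the call site that needs fixing.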
participants (4)
- Dave Kuhlman
- Ian Bicking
- Martijn Faassen
- Stefan Behnel