[lxml-dev] invalid tag names get serialized
Hi, I noticed that lxml (both objectify and etree) happily accepts broken tag names (numbers, containing whitespace, ...) throughout the API and also serializes such document; only when trying to re-parse it this fails:
root = etree.Element("root") etree.SubElement(root, " __foo bar ") '' print etree.tostring(root) <root>< __foo bar /></root> print etree.fromstring(etree.tostring(root)) Traceback (most recent call last): File "<stdin>", line 1, in ? File "etree.pyx", line 1970, in etree.fromstring File "parser.pxi", line 980, in etree._parseMemoryDocument File "parser.pxi", line 876, in etree._parseDoc File "parser.pxi", line 533, in etree._BaseParser._parseDoc File "parser.pxi", line 660, in etree._handleParseResult File "parser.pxi", line 608, in etree._raiseParseError etree.XMLSyntaxError: StartTag: invalid element name, line 1, column 8
I gather this is basically libxml2 behaviour. It is not nice, though, since you can produce serialized data without knowing your evil doings, and only detect it when you try to parse it back in (in vain). Would it be a problem to have the tag name checked before it is set for an element? Holger -- Psssst! Schon vom neuen GMX MultiMessenger gehört? Der kanns mit allen: http://www.gmx.net/de/go/multimessenger
jholg@gmx.de wrote:
I noticed that lxml (both objectify and etree) happily accepts broken tag names (numbers, containing whitespace, ...) throughout the API and also serializes such document; only when trying to re-parse it this fails:
root = etree.Element("root") etree.SubElement(root, " __foo bar ") '' print etree.tostring(root) <root>< __foo bar /></root> print etree.fromstring(etree.tostring(root)) Traceback (most recent call last): File "<stdin>", line 1, in ? File "etree.pyx", line 1970, in etree.fromstring File "parser.pxi", line 980, in etree._parseMemoryDocument File "parser.pxi", line 876, in etree._parseDoc File "parser.pxi", line 533, in etree._BaseParser._parseDoc File "parser.pxi", line 660, in etree._handleParseResult File "parser.pxi", line 608, in etree._raiseParseError etree.XMLSyntaxError: StartTag: invalid element name, line 1, column 8
I gather this is basically libxml2 behaviour. It is not nice, though, since you can produce serialized data without knowing your evil doings, and only detect it when you try to parse it back in (in vain). Would it be a problem to have the tag name checked before it is set for an element?
Not entirely "libxml2 behaviour", since it actually provides functions to check names. You just have to use them. Although 'just' is slightly too simplistic here. The straight forward patch actually breaks lots of test cases, e.g. getiterator('*'). I'll have to look into this, but this is definitely 2.0 stuff. Maybe it would be enough to check names only in the factory functions, 'el.set()' and 'el.attrib.__setitem__()'. Lookup and search methods/functions don't have to care. Stefan
Hi Stefan,
detect it when you try to parse it back in (in vain). Would it be a problem to have the tag name checked before it is set for an element?
Not entirely "libxml2 behaviour", since it actually provides functions to check names. You just have to use them. Although 'just' is slightly too simplistic here. The straight forward patch actually breaks lots of test cases, e.g. getiterator('*').
I'll have to look into this, but this is definitely 2.0 stuff. Maybe it would be enough to check names only in the factory functions, 'el.set()' and 'el.attrib.__setitem__()'. Lookup and search methods/functions don't have to care.
For my purposes, it would be sufficient if a tree did not serialize successfully; what I want to avoid is that I store/pickle documents that then turn out to not have been well-formed XML in the first place. So maybe that's easier to achieve than to check names straight away, although I fear not... Holger -- GMX FreeMail: 1 GB Postfach, 5 E-Mail-Adressen, 10 Free SMS. Alle Infos und kostenlose Anmeldung: http://www.gmx.net/de/go/freemail
jholg@gmx.de wrote:
detect it when you try to parse it back in (in vain). Would it be a problem to have the tag name checked before it is set for an element? Not entirely "libxml2 behaviour", since it actually provides functions to check names. You just have to use them. Although 'just' is slightly too simplistic here. The straight forward patch actually breaks lots of test cases, e.g. getiterator('*').
I'll have to look into this, but this is definitely 2.0 stuff. Maybe it would be enough to check names only in the factory functions, 'el.set()' and 'el.attrib.__setitem__()'. Lookup and search methods/functions don't have to care.
For my purposes, it would be sufficient if a tree did not serialize successfully
:) that's actually the heaviest thing to implement, as we currently only pass a tree to libxml2 and let it do the rest. Also, it's too late and too hard to debug. No, this patch works much better, but the now failing tests seem to imply that Klingon tag names are not allowed in well-formed XML documents. I'll have to check if it's the XML spec that's xenophobe here or only libxml2... Stefan
Stefan Behnel wrote:
jholg@gmx.de wrote:
detect it when you try to parse it back in (in vain). Would it be a problem to have the tag name checked before it is set for an element? Not entirely "libxml2 behaviour", since it actually provides functions to check names. You just have to use them. Although 'just' is slightly too simplistic here. The straight forward patch actually breaks lots of test cases, e.g. getiterator('*').
I'll have to look into this, but this is definitely 2.0 stuff. Maybe it would be enough to check names only in the factory functions, 'el.set()' and 'el.attrib.__setitem__()'. Lookup and search methods/functions don't have to care. For my purposes, it would be sufficient if a tree did not serialize successfully
:) that's actually the heaviest thing to implement, as we currently only pass a tree to libxml2 and let it do the rest.
Also, it's too late and too hard to debug. No, this patch works much better, but the now failing tests seem to imply that Klingon tag names are not allowed in well-formed XML documents. I'll have to check if it's the XML spec that's xenophobe here or only libxml2...
Actually it's the spec, libxml2 is right here. So the current trunk no longer accepts invalid tag names at the API level when *creating* elements or attributes. It still accepts them when searching tags or looking up attributes. Stefan
Hi, I've just seen you've already been looking into this, so my comment below concerning test cases is just for reference, but: The name check should go directly into _createElement, otherwise etree.SubElement will not pick it up. I'm also pro renaming TagNameIsValid to NCNameIsValid, as it is used on attributes also.
Also, it's too late and too hard to debug. No, this patch works much better, but the now failing tests seem to imply that Klingon tag names are not allowed in well-formed XML documents. I'll have to check if it's the XML spec that's xenophobe here or only libxml2...
I do think that the character \u1234 is not allowed for XML NCNames: BaseChar production snippet: [...] #x11EB | #x11F0 | #x11F9 | [#x1E00-#x1E9B] | [#x1EA0-#x1EF9] [...] Thanks, Holger -- Psssst! Schon vom neuen GMX MultiMessenger gehört? Der kanns mit allen: http://www.gmx.net/de/go/multimessenger
jholg@gmx.de wrote:
The name check should go directly into _createElement,
No, _createElement() is only a tiny wrapper around the element node creation in libxml2. No Python exceptions allowed there.
otherwise etree.SubElement will not pick it up.
Then SubElement will get its own check. I factored out the exception raising so that it's only a one-liner to prevent invalid tags from passing through the API.
I'm also pro renaming TagNameIsValid to NCNameIsValid, as it is used on attributes also.
I actually renamed it to "_xmlNameIsValid()". It's not a public function yet, but I might reconsider that.
Also, it's too late and too hard to debug. No, this patch works much better, but the now failing tests seem to imply that Klingon tag names are not allowed in well-formed XML documents. I'll have to check if it's the XML spec that's xenophobe here or only libxml2...
I do think that the character \u1234 is not allowed for XML NCNames: BaseChar production snippet:
[...] #x11EB | #x11F0 | #x11F9 | [#x1E00-#x1E9B] | [#x1EA0-#x1EF9] [...]
Right, I noticed that also. I also fixed the test cases now and added a bunch of new ones. Stefan
Hi,
The name check should go directly into _createElement,
No, _createElement() is only a tiny wrapper around the element node creation in libxml2. No Python exceptions allowed there.
Just out of curiosity: Is this by policy, or would it really cause problems? Because I tried just that and didn't see any problems. But of course I didn't test any tricky stuff or threading or you know what.
otherwise etree.SubElement will not pick it up.
Then SubElement will get its own check. I factored out the exception raising so that it's only a one-liner to prevent invalid tags from passing through the API.
Great, thanks. Holger -- GMX FreeMail: 1 GB Postfach, 5 E-Mail-Adressen, 10 Free SMS. Alle Infos und kostenlose Anmeldung: http://www.gmx.net/de/go/freemail
jholg@gmx.de wrote:
The name check should go directly into _createElement, No, _createElement() is only a tiny wrapper around the element node creation in libxml2. No Python exceptions allowed there.
Just out of curiosity: Is this by policy, or would it really cause problems? Because I tried just that and didn't see any problems. But of course I didn't test any tricky stuff or threading or you know what.
They are meant to be as simple as they are: just create a plain xmlNode. They basically give that a more explicit name and make sure we always call the same thing, so if we ever really need to change something here, we have one place to do so. But they are internal functions, not API level functions. Error checking must be done at the API level, before entering into the internals. For example, _makeElement is the main function for creating an Element proxy at the API level. It's also a public C function that can be safely used in a external modules (like objectify). It does all the error checking and figuring out what you meant and it can happily throw an exception if you provided rubbish, as it will only be used from API functions. Another good example is _getNsTag, which is the main API-level helper for splitting up something that came from the user into a UTF-8 encoded namespace *string* (or None) and a tag name *string*. It throws an exception if that fails and guarantees to return objects of the expected types. That really helps internally, because all internal code can just rely on that. There's also public-api.pxi that wraps some of the half-public C functions and adds some additional error checking in some cases to make them publicly usable. Maybe a good way to detect an API-level C function is if a) they already throw an exception, b) they return things like _Element or other API-level objects or c) they are often used at the beginning of API functions. Admittedly, it's not always 100% clear from the code (_createElement is a bad example as it's directly used in SubElement), but those are good rules of thumb. Does that make the difference clear? Stefan
Hi Stefan,
Maybe a good way to detect an API-level C function is if a) they already throw an exception, b) they return things like _Element or other API-level objects or c) they are often used at the beginning of API functions. Admittedly, it's not always 100% clear from the code (_createElement is a bad example as it's directly used in SubElement), but those are good rules of thumb.
Does that make the difference clear?
Yes, that's very helpful, and good to know. Thanks, Holger -- Ist Ihr Browser Vista-kompatibel? Jetzt die neuesten Browser-Versionen downloaden: http://www.gmx.net/de/go/browser
participants (2)
-
jholg@gmx.de -
Stefan Behnel