[lxml-dev] Help with an error message
Hi, everyone:

I'm having trouble with the following case. One of my automatic import scripts takes data from one source and submits it to another as an XML feed. Recently, it started failing because one of the entries contains a null. The test case is:

    from lxml.etree import Element
    sourcestr = 'Contains a null: \x00'
    unistr = unicode(sourcestr, 'utf-8')
    elt = Element('foo').text = unistr

Running it causes the following error:

    Traceback (most recent call last):
      File "foo.py", line 6, in <module>
        elt = Element('foo').text = unistr
      File "etree.pyx", line 741, in etree._Element.text.__set__
      File "apihelpers.pxi", line 344, in etree._setNodeText
      File "apihelpers.pxi", line 648, in etree._utf8
    AssertionError: All strings must be XML compatible, either Unicode or ASCII

Can someone suggest the best way to deal with this?

Kind regards,
--
Konstantin Ryabitsev
Montréal, Québec
Hi,

Konstantin Ryabitsev wrote:
> I'm having trouble with the following case. One of my automatic import scripts takes data from one source and submits it to another as an XML feed. Recently, it started failing because one of the entries contains a null. The test case is:
>
>     from lxml.etree import Element
>     sourcestr = 'Contains a null: \x00'
>     unistr = unicode(sourcestr, 'utf-8')
>     elt = Element('foo').text = unistr
>
> Running it causes the following error:
>
>     Traceback (most recent call last):
>       File "foo.py", line 6, in <module>
>         elt = Element('foo').text = unistr
>       File "etree.pyx", line 741, in etree._Element.text.__set__
>       File "apihelpers.pxi", line 344, in etree._setNodeText
>       File "apihelpers.pxi", line 648, in etree._utf8
>     AssertionError: All strings must be XML compatible, either Unicode or ASCII
>
> Can someone suggest the best way to deal with this?

My first question is: why do you need a '\x00' here? If you want to pass binary data in XML, the best way is to use a safe encoding such as uuencode or similar. That should be part of your XML language spec/schema/...

Stefan
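As a rough illustration of the "safe encoding" idea, here is a minimal sketch that base64-encodes a binary payload before placing it in element text and decodes it again on the receiving side. The helper names `wrap_binary` and `unwrap_binary` and the `encoding="base64"` attribute are hypothetical, not part of lxml or of any schema mentioned in this thread. The standard library's `xml.etree.ElementTree` is used as a stand-in so the snippet runs without lxml installed; `lxml.etree` provides the same `Element`, `tostring` and `fromstring` calls.

```python
import base64
import xml.etree.ElementTree as ET  # stand-in; lxml.etree has the same basic calls


def wrap_binary(tag, payload):
    """Return an element whose text is the base64 form of `payload` (bytes).

    The "encoding" attribute is a made-up convention for this sketch;
    a real feed would define this in its own spec or schema.
    """
    elt = ET.Element(tag, {"encoding": "base64"})
    elt.text = base64.b64encode(payload).decode("ascii")
    return elt


def unwrap_binary(elt):
    """Recover the original bytes on the receiving side."""
    return base64.b64decode(elt.text)


raw = b"Contains a null: \x00"
xml = ET.tostring(wrap_binary("foo", raw))       # serialized bytes, no NULL inside
restored = unwrap_binary(ET.fromstring(xml))     # round trip back to the payload
```

The receiver has to know that the element is base64-encoded, which is why the convention belongs in the feed's specification rather than in ad hoc code.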
The null character makes the XML non-well-formed anyway. The legal character ranges for XML (as per the spec, section 2.2):

    Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Definitely no \x00! So I would base64 encode any offending data, as suggested by Stefan.

Rob

On Thu, 2008-01-03 at 17:30 +0100, Stefan Behnel wrote:
> Konstantin Ryabitsev wrote:
>> I'm having trouble with the following case. One of my automatic import scripts takes data from one source and submits it to another as an XML feed. Recently, it started failing because one of the entries contains a null.
>
> My first question is: why do you need a '\x00' here? If you want to pass binary data in XML, the best way is to use a safe encoding such as uuencode or whatever. That should be part of your XML language spec/schema/...
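The Char production quoted from section 2.2 of the spec can be turned into a check fairly directly. The following is a minimal sketch, assuming Python 3 strings; the names `is_xml_compatible` and `strip_illegal_xml_chars` are hypothetical helpers, not lxml functions. Whether silently dropping characters is acceptable depends on the feed, so treat the stripping variant as one option rather than a recommendation.

```python
import re

# Complement of the XML 1.0 Char production (spec section 2.2):
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def is_xml_compatible(text):
    """True if every character in `text` is allowed by the Char production."""
    return _ILLEGAL_XML_CHARS.search(text) is None


def strip_illegal_xml_chars(text, replacement=""):
    """Replace every character outside the Char production with `replacement`."""
    return _ILLEGAL_XML_CHARS.sub(replacement, text)


sample = "Contains a null: \x00"
cleaned = strip_illegal_xml_chars(sample)
```

Running the check on the input side of an import script makes it possible to log or reject the offending entry before it reaches the XML layer.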
Hi,

Rob Sanderson wrote:
> The null character makes the XML non-well-formed anyway.
> The legal character ranges for XML (as per the spec, section 2.2):
>
>     Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
>
> Definitely no \x00!

That's true. While you could get away on the XML /generator/ side with adding an Entity (and lxml 2.0 will let you do that), this will just let you write out broken XML that the recipient will not be able to parse:

    from lxml import etree as et
    el = et.Element("test")
    el.text = "mind the "
    el.append(et.Entity("#0"))
    xml = et.tostring(el)
    # xml is now '<test>mind the &#0;</test>'

    et.fromstring(xml)
    Traceback (most recent call last):
    lxml.etree.XMLSyntaxError: xmlParseCharRef: invalid xmlChar value 0, line 1, column 20

Maybe we should fix the Entity() factory here to prevent such misuse...

Stefan
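One way to catch this kind of output before it leaves the generator is to parse the serialized document back and see whether it is accepted. A minimal sketch follows, using the standard library parser as a stand-in (it raises `ParseError` where lxml raises `XMLSyntaxError`); `is_well_formed` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET  # stand-in parser; lxml raises XMLSyntaxError instead


def is_well_formed(xml_text):
    """Return True if `xml_text` parses, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


broken = "<test>mind the &#0;</test>"   # character reference to NULL
fine = "<test>mind the gap</test>"
results = (is_well_formed(broken), is_well_formed(fine))
```

A round-trip check like this costs an extra parse per document, so it is probably best suited to tests or to feeds where a rejected submission is expensive.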
Stefan Behnel <stefan_ml <at> behnel.de> writes:
> Konstantin Ryabitsev wrote:
>>     Traceback (most recent call last):
>>       File "foo.py", line 6, in <module>
>>         elt = Element('foo').text = unistr
>>       File "etree.pyx", line 741, in etree._Element.text.__set__
>>       File "apihelpers.pxi", line 344, in etree._setNodeText
>>       File "apihelpers.pxi", line 648, in etree._utf8
>>     AssertionError: All strings must be XML compatible, either Unicode or ASCII
>>
>> Can someone suggest the best way to deal with this?
>
> My first question is: why do you need a '\x00' here? If you want to pass binary data in XML, the best way is to use a safe encoding such as uuencode or whatever. That should be part of your XML language spec/schema/...

I just ran into this myself. In my case, the NULL was not desired; rather, I wanted a raw '\x00' to appear in the string (i.e., the literal backslash sequence, *not* the NULL character).

It would be nice if lxml were more explicit about the problem:

    raise ValueError("NULL characters are not allowed in XML strings")

That is: how am I supposed to work out from the given error string that a NULL character was causing that AssertionError? (It wasn't until I found this message that I understood what I was doing wrong.)
James William Pye wrote:
> It would be nice if lxml would be more explicit about the problem:
>
>     raise ValueError("NULL characters are not allowed in XML strings")
>
> That is: How am I supposed to derive that a NULL character was causing that AssertionError from the given string? (It wasn't until I found this message that I understood what I was doing wrong)

OK, what about:

    "All strings must be XML compatible: Unicode or ASCII, no NULL bytes"

?

Stefan
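Until the library message changes, calling code can produce the more explicit error itself. The sketch below is an assumption about how one might do that, not lxml's behaviour: `set_text_checked` is a hypothetical wrapper that raises the `ValueError` proposed above before the text reaches the element. The standard library `Element` is used as a stand-in so the snippet runs without lxml.

```python
import xml.etree.ElementTree as ET  # stand-in; the wrapper works the same with lxml elements


def set_text_checked(elt, text):
    """Assign `text` to `elt.text`, rejecting NULL characters with a clear error."""
    pos = text.find("\x00")
    if pos != -1:
        # Report where the NULL sits so the offending entry is easy to locate.
        raise ValueError(
            "NULL characters are not allowed in XML strings (found at index %d)" % pos
        )
    elt.text = text


elt = ET.Element("foo")
try:
    set_text_checked(elt, "Contains a null: \x00")
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Note that writing '\\x00' (or a raw string) in the source yields the literal backslash sequence, which passes the check, whereas '\x00' yields the NULL character itself.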
participants (4)
- James William Pye
- Konstantin Ryabitsev
- Rob Sanderson
- Stefan Behnel