[Twisted-Python] raw xml to element, char encoding/decoding error
Hello,

I wrote some code to transform a raw XML string into a domish.Element, and I keep on getting char encoding/decoding errors :

class __RawXmlToElement(object):
    def __call__(self, s):
        self.result = None
        def onStart(el):
            self.result = el
        def onEnd():
            pass
        def onElement(el):
            self.result.addChild(el)
        parser = domish.elementStream()
        parser.DocumentStartEvent = onStart
        parser.ElementEvent = onElement
        parser.DocumentEndEvent = onEnd
        tmp = domish.Element(("", "s"))
        tmp.addRawXml(s)
        parser.parse(tmp.toXml())
        return self.result.firstChildElement()

rawXmlToElement = __RawXmlToElement()

Here's a test raw XML string :

>>> u"<t>reçu</t>"
u'<t>re\xe7u</t>'
>>> u"<t>reçu</t>".encode("utf-8")
'<t>re\xc3\xa7u</t>'
>>> "<t>reçu</t>"
'<t>re\xc3\xa7u</t>'

As you can see my system encodes strings in UTF-8, I tried the following but I keep on getting errors :

>>> rawXmlToElement("<t>reçu</t>")
raw xml adder error : 'ascii' codec can't decode byte 0xc3 in position 5: ordinal not in range(128)
>>> rawXmlToElement(u"<t>reçu</t>")
parser error : 'ascii' codec can't encode character u'\xe7' in position 8: ordinal not in range(128)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 26, in __call__
AttributeError: 'NoneType' object has no attribute 'firstChildElement'
>>> rawXmlToElement(unicode("<t>reçu</t>", "utf-8"))
parser error : 'ascii' codec can't encode character u'\xe7' in position 8: ordinal not in range(128)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 26, in __call__
AttributeError: 'NoneType' object has no attribute 'firstChildElement'

If I try it with ASCII encodable chars it works correctly :

>>> rawXmlToElement("<t>toto</t>").toXml()
u'<t>toto</t>'
>>> rawXmlToElement(u"<t>toto</t>").toXml()
u'<t>toto</t>'
>>> rawXmlToElement(unicode("<t>toto</t>", " utf-8")).toXml()
u'<t>toto</t>'

Does anyone have an idea on what I'm doing wrong here?

Thank you!
Gabriel Rossetti wrote:
[..]
I think this is a Python environment problem and not a Twisted problem. If I run the attached example in Eclipse, it works; if I run it from a terminal, it doesn't. This is now off topic, but if anyone has an idea I'd be grateful. I'm also going to post this on the Python mailing list.

Thank you,
Gabriel
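Since the same script behaves differently in Eclipse and in a terminal, one way to narrow it down is to print the encodings each interpreter reports and compare the two outputs. The sketch below is only a diagnostic: the helper name `encoding_report` is made up for illustration, and nothing in the thread confirms which of these settings (if any) differs between the two environments.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings this interpreter reports, for side-by-side comparison."""
    return {
        # Codec used for implicit str/unicode conversions on Python 2.
        "default": sys.getdefaultencoding(),
        # Codec used for file names.
        "filesystem": sys.getfilesystemencoding(),
        # Codec attached to standard output, if the stream exposes one.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Codec suggested by the user's locale settings.
        "locale": locale.getpreferredencoding(),
    }


report = encoding_report()
for name in sorted(report):
    print("%-10s %s" % (name, report[name]))
```

Running this once from the IDE and once from the terminal shows whether the two interpreters start from the same assumptions.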
On 2009-02-18 12:14, Gabriel Rossetti wrote:
Hello,
I wrote some code to transform a raw XML string into a domish.Element, and I keep on getting char encoding/decoding errors :
[..] parser.parse(tmp.toXml()) [..]
Parser input is expected to be a string, not unicode. Try this instead:

    parser.parse(tmp.toXml().encode('utf-8'))

ralphm
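The encode-before-parse advice can be tried without Twisted. The sketch below uses the standard library's expat binding as a stand-in for domish.elementStream (that substitution is an assumption made for illustration, as is the helper name `parse_text`); it is not the domish API, only the same idea of handing the parser UTF-8 bytes rather than text.

```python
# -*- coding: utf-8 -*-
from xml.parsers import expat


def parse_text(xml_text):
    """Encode text to UTF-8 bytes, feed them to expat, return the character data."""
    chunks = []
    parser = expat.ParserCreate("utf-8")
    # expat may deliver character data in several pieces, so collect them all.
    parser.CharacterDataHandler = chunks.append
    # The explicit encode() is the step suggested above.
    parser.Parse(xml_text.encode("utf-8"), True)
    return u"".join(chunks)


result = parse_text(u"<t>re\xe7u</t>")
```

The handler receives decoded text again, so the non-ASCII character survives the round trip through bytes.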
Ralph Meijer wrote:
[..]
Hello Ralphm,

Yes, I had already fixed that in the code attached to my last message, but it only works in Eclipse.

Gabriel
On 2009-02-18 14:57, Gabriel Rossetti wrote:
[..]
Ah, but in /that/ code, you typed:

    res = rawXmlToElement("<t>reçu</t>")

While you should have:

    res = rawXmlToElement(u"<t>reçu</t>")

ralphm
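The difference between the two literals can be shown in isolation. The sketch below starts from the byte sequence that the plain "..." literal produced in the interactive session earlier in the thread, and reproduces the 'ascii' codec failure at position 5; the variable names are illustrative only.

```python
# -*- coding: utf-8 -*-

# The bytes a UTF-8 terminal yields for the plain "<t>reçu</t>" literal on Python 2.
raw = b"<t>re\xc3\xa7u</t>"

# Implicit conversions on Python 2 use the ASCII codec; doing it explicitly
# shows the same failure as the "raw xml adder error" message.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Decoding with the codec the bytes were actually written in gives the
# same value as the u"<t>reçu</t>" literal.
text = raw.decode("utf-8")
```

Passing the unicode literal avoids the implicit ASCII decode, because the text never exists as undeclared bytes in the first place.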
Ralph Meijer wrote:
[..]
Ahh, yes, I see my error, thanks :-) I'm glad everything is unicode in python 3....

Gabriel
On 18 Feb, 02:26 pm, gabriel.rossetti@arimaz.com wrote:
Ahh, yes, I see my error, thanks :-) I'm glad everything is unicode in python 3....
Erm, input to the parser will still be bytes in python 3. The failure mode will hopefully be more obvious, but it's not that "everything" is unicode :). See the FAQ: http://is.gd/k2XF
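A small sketch of that point, assuming a Python 3 interpreter (the variable names are illustrative): text and bytes remain separate types, the conversion to bytes is still an explicit encode, and mixing the two fails immediately instead of attempting an ASCII conversion.

```python
# Python 3 only: str is text, bytes is what goes over the wire or into a parser.
text = "<t>re\u00e7u</t>"
data = text.encode("utf-8")  # still an explicit step

# Mixing the two types raises TypeError rather than guessing a codec.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

This is the "more obvious" failure mode: the error appears at the point where the types are mixed, regardless of whether the content happens to be ASCII.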
glyph@divmod.com wrote:
[..]
Ok, I see now, thanks. Those links are nice too.
participants (3)
- Gabriel Rossetti
- glyph@divmod.com
- Ralph Meijer