[lxml-dev] fromstring and encoding (a bug?)

Hi!

I'm having trouble parsing XML from a string with ISO-8859-1 encoded contents. Consider, for example, this test script:

    testxml = file("/tmp/test.xml", "w")
    testxml.write(
    """<?xml version="1.0" encoding="ISO-8859-1" ?>
    <root>\xfa</root>""")
    testxml.close()

    from lxml import etree

    print "%r" % etree.parse("/tmp/test.xml").getroot().text
    # => u'\xfa'

    print "%r" % etree.fromstring(file("/tmp/test.xml").read().decode("iso-8859-1")).text
    # => u'\xc3\xba'

    print "%r" % etree.fromstring(file("/tmp/test.xml").read()).text
    # => UnicodeDecodeError

Here \xfa is the Latin-1 u-acute character. In the last three statements:

1) It's read from a document file with the correct encoding in its header. The output is: u'\xfa'. Right.

2) It's read from a Unicode string decoded from that file via the iso-8859-1 codec. The output is: u'\xc3\xba'. Wrong.

3) It's read from a raw string with that file's contents. A UnicodeDecodeError is raised. Right? Wrong?

Note that in the second case the input Unicode string seems to be encoded to UTF-8 first and then decoded again, but using the ISO-8859-1 codec instead. So in the end not only its encoding but, worse, its contents are modified. (One quick fix would be to encode that output to Latin-1 and then decode it again to Unicode via UTF-8.)

Is that a bug? I can't figure out another way of using fromstring than passing it a raw string with a correct header (which fails with an exception, as in 3) or passing it a Unicode string correctly decoded from the original (which gives the wrong output, as in 2).

Best regards,
Carlos
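The double conversion suspected in case 2 can be reproduced without any XML parser. The following is a minimal sketch in Python 3 (the thread itself uses Python 2), using only the built-in codecs; the variable names are illustrative and not taken from lxml.

```python
# Python 3 sketch of the conversion described in case 2.
original = "\xfa"  # u acute, as a Unicode string

# Step 1: the Unicode input is encoded to UTF-8 (two bytes for this character).
utf8_bytes = original.encode("utf-8")

# Step 2: those bytes are interpreted as ISO-8859-1, one character per byte.
garbled = utf8_bytes.decode("iso-8859-1")

# The quick fix from the message: undo step 2, then decode as UTF-8.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(repr(utf8_bytes))
print(repr(garbled))
print(repr(repaired))
```

The garbled value consists of the two characters '\xc3' and '\xba', which matches the output reported in case 2. The round trip only repairs text that went through exactly this pair of conversions.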

Carlos Pita wrote:
I think 3) is correct given the fact that 2) returns a two-byte sequence. I can reproduce that without going through a file. Just try this:

    >>> from lxml import etree
    >>> test = '<?xml version="1.0" encoding="ISO-8859-1" ?>\n<root>\xfa</root>'
    >>> test.decode('iso-8859-1')
    u'<?xml version="1.0" encoding="ISO-8859-1" ?>\n<root>\xfa</root>'
    >>> etree.fromstring(test.decode('iso-8859-1')).text
    u'\xc3\xba'

It comes from the fact that fromstring encodes the input as UTF-8 when it is passed a Unicode string. Then libxml2 reads the UTF-8 bytes, and the header tells it to interpret them as ISO-8859-1. That is exactly the problem.

The easiest fix should be to replace the definition of XML() in etree.pyx with this:

    def XML(text):
        cdef xmlDoc* c_doc
        if isinstance(text, unicode):
            text = text.encode('UTF-8')
            # The declared encoding no longer applies to the
            # re-encoded bytes, so drop the XML declaration.
            if text[:5] == '<?xml':
                i = text.find('?>')
                if i != -1:
                    # Skip past the two characters of '?>'.
                    i = i + 2
                    # Also skip a newline directly after it.
                    if text[i:i+1] == '\n':
                        i = i + 1
                    text = text[i:]
        c_doc = theParser.parseDoc(text)
        return _documentFactory(c_doc).getroot()

That fixes it.

Stefan
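The declaration-stripping step of the patch can be tried on its own, outside of Cython. Below is a minimal sketch in plain Python 3; `strip_xml_declaration` is a hypothetical helper name chosen for illustration and is not part of lxml's API. Unlike the patch, it also requires whitespace after `<?xml`, so that other processing instructions such as `<?xml-stylesheet ...?>` are left alone.

```python
def strip_xml_declaration(text):
    """Return text without a leading '<?xml ... ?>' declaration.

    Hypothetical helper for illustration; not part of lxml's API.
    """
    # Only treat '<?xml' followed by whitespace as an XML declaration.
    if text[:5] == "<?xml" and text[5:6].isspace():
        end = text.find("?>")
        if end != -1:
            # Skip the two characters of '?>' and any whitespace after them.
            text = text[end + 2:].lstrip()
    return text


doc = '<?xml version="1.0" encoding="ISO-8859-1" ?>\n<root>\xfa</root>'
body = strip_xml_declaration(doc)
print(repr(body))
```

Once the declaration is gone, the remaining text can be encoded as UTF-8 and handed to a parser without a header that contradicts the actual byte encoding.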

Stefan Behnel wrote:
Before I forget: this is not actually a bug, but rather a misuse. Don't pass in Unicode and say it's ISO-8859-1! It would be a bug if you passed in XML data declared as UTF-16 in a Python unicode string.

Anyway, the above code prevents such misuse. I'll send in a new version of my patch that contains that little helper.

Stefan
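The advice above amounts to keeping the encoding declaration and the actual data in agreement: a document that declares ISO-8859-1 should reach the parser as ISO-8859-1 bytes. The sketch below illustrates this in Python 3 with the standard library's `xml.etree.ElementTree`, used here only as a stand-in. It is an assumption that this mirrors the intended usage; the lxml version discussed in this thread behaved differently for raw strings (see case 3).

```python
import xml.etree.ElementTree as ET

# Bytes exactly as they would be stored in an ISO-8859-1 encoded file.
raw = b'<?xml version="1.0" encoding="ISO-8859-1" ?>\n<root>\xfa</root>'

# The parser reads the declaration and decodes the bytes itself,
# so the declared encoding and the data agree.
text_from_bytes = ET.fromstring(raw).text

print(repr(text_from_bytes))
```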

participants (2)
- Carlos Pita
- Stefan Behnel