[lxml-dev] fromstring and encoding (a bug?)
Hi!

I'm having trouble parsing XML from a string with ISO-8859-1 encoded contents. Consider, for example, this test script:

    testxml = file("/tmp/test.xml", "w")
    testxml.write(
    """<?xml version="1.0" encoding="ISO-8859-1" ?>
    <root>\xfa</root>""")
    testxml.close()

    from lxml import etree
    print "%r" % etree.parse("/tmp/test.xml").getroot().text
    # => u'\xfa'
    print "%r" % etree.fromstring(file("/tmp/test.xml").read().decode("iso-8859-1")).text
    # => u'\xc3\xba'
    print "%r" % etree.fromstring(file("/tmp/test.xml").read()).text
    # => UnicodeDecodeError

Here \xfa is the Latin-1 "u acute" character. In the last three statements:

1) The text is read from a document file with the correct encoding in its header. The output is u'\xfa'. Right.

2) The text is read from a unicode string decoded from that file via the iso-8859-1 codec. The output is u'\xc3\xba'. Wrong.

3) The text is read from a raw string with that file's contents. A UnicodeDecodeError is raised. Right? Wrong?

Note that in the second case the input unicode string seems to be encoded to UTF-8 first and then decoded again, but using the iso-8859-1 codec instead. So in the end not only its encoding but, worse, its contents are modified. (One quick fix would be to encode that output to Latin-1 and then decode it again to unicode via UTF-8.)

Is that a bug? I can't think of any way to use fromstring other than passing it a raw string with a correct header (which fails with an exception, as in 3) or passing it a unicode string correctly decoded from the original (which fails, as in 2).

Best regards,
Carlos
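
The round trip suspected in case 2 can be reproduced with the standard codecs alone, without lxml. The following is only a sketch of that hypothesis, not a description of what lxml does internally. It is written in Python 3 syntax (the script above is Python 2), and the variable names are illustrative.

```python
# Python 3 sketch of the suspected round trip; it does not use lxml.
original = "\xfa"  # u acute, the text of <root> in the test document

# Suspected mistake: the text is serialized as UTF-8 (two bytes, C3 BA),
# and those bytes are then read back as if they were ISO-8859-1.
garbled = original.encode("utf-8").decode("iso-8859-1")

# Quick fix mentioned in the post: undo the two steps in reverse order.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(ascii(garbled))   # '\xc3\xba'
print(ascii(repaired))  # '\xfa'
```

The garbled value matches the output reported in case 2, which is consistent with the explanation given above, though it does not prove where the extra decoding step happens.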
participants (1)
-
Carlos Pita