Mailman 3 [lxml-dev] fromstring and encoding (a bug?) - lxml - The Python XML Toolkit

Oct. 22, 2005

Hi!

I'm having trouble parsing XML from a string with ISO-8859-1 encoded
contents. Consider, for example, this test script:

  testxml = file("/tmp/test.xml", "w")
  testxml.write(
  """<?xml version="1.0" encoding="ISO-8859-1" ?>
  <root>\xfa</root>""")
  testxml.close()
  from lxml import etree
  print "%r" % etree.parse("/tmp/test.xml").getroot().text  # => u'\xfa'
  print "%r" % etree.fromstring(file("/tmp/test.xml").read().decode("iso-8859-1")).text  # => u'\xc3\xba'
  print "%r" % etree.fromstring(file("/tmp/test.xml").read()).text  # => UnicodeDecodeError

Here \xfa is the Latin-1 character "u with acute accent". In the last
three lines:

1) It's read from a document file with the correct encoding in its header.
    The output is: u'\xfa'. Right.

2) It's read from a unicode string decoded from that file via the
    iso-8859-1 codec.
    The output is: u'\xc3\xba'. Wrong.

3) It's read from a raw string with that file's contents.
    A UnicodeDecodeError is raised. Right? Wrong?

Note that in the second case the input unicode string seems to be first
encoded to utf-8 and then decoded again, but using the iso-8859-1 codec
instead. So in the end not only its encoding but, worse, its contents
are modified. (One quick fix would be to encode that output to latin-1
and then decode it again to unicode via utf-8.)
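
The suspected round trip can be reproduced without lxml at all. What
follows is only a minimal sketch in plain Python. The variable names are
made up for illustration, and it assumes (without confirming) that this
encode/decode mismatch is what happens inside fromstring:

```python
# Pure-Python sketch; lxml is not involved. All names are illustrative.

original = u"\xfa"  # u with acute accent, as a unicode string

# Suspected mishap: encode to UTF-8, then decode those bytes as ISO-8859-1.
mangled = original.encode("utf-8").decode("iso-8859-1")

# Quick fix mentioned above: apply the two steps in reverse.
repaired = mangled.encode("iso-8859-1").decode("utf-8")

print(repr(mangled))   # two characters, U+00C3 followed by U+00BA
print(repr(repaired))  # the single character U+00FA again
```

The two-character result matches the u'\xc3\xba' seen in case 2, which
is why the mismatch looks like the likely cause.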

Is that a bug? I can't figure out any other way of using fromstring than
passing it a raw string with a correct header (which fails with an
exception, as in 3) or passing it a unicode string correctly decoded from
the original (which fails as in 2).

Best regards,
Carlos


Carlos Pita
