[lxml-dev] automatic attribute unicode decode?

Hi, I'm quite puzzled by the following excerpt:
In a bare document with no encoding declaration, lxml decoded on its own a string containing non-ASCII characters (what heuristic did it use?). Now I have three attributes of two different types. I wonder why the integer was not decoded. ;-) I actually found this in a real-world document with an encoding declaration and namespaces (an ODF XML part). Is this a bug to report, and how can I work around it? Thanks, Hervé

Hervé: I keep hearing that lxml defaults to UTF-8, so that is probably the heuristic used. Good luck, John W. Lovell

Hervé Cauwelier wrote:
No heuristic. It follows the XML specification in that the absence of an XML declaration defines the encoding as UTF-8. I assume your console was set to UTF-8 when you typed the above?
Definitely not a bug. What would be the behaviour you expected instead? Stefan
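
For readers of the archive, a minimal sketch of the behaviour discussed above. The original excerpt is not preserved, so the element and attribute names (`root`, `name`, `count`) are made up for illustration. As far as I know, under Python 2 lxml returned a plain `str` for ASCII-only values and `unicode` for anything else, which would explain seeing two different types; under Python 3 every attribute value is a `str`. The fallback import is only there so the sketch runs without lxml installed; the stdlib ElementTree offers the same `fromstring`/`get` calls.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib ElementTree is an acceptable stand-in for this sketch.
    import xml.etree.ElementTree as etree

# No XML declaration: the parser treats the bytes as UTF-8,
# as the XML specification requires.
utf8_doc = '<root name="Hervé" count="3"/>'.encode("utf-8")
root = etree.fromstring(utf8_doc)
name = root.get("name")    # non-ASCII value, decoded from UTF-8
count = root.get("count")  # ASCII-only value, still text, never an int

# With an explicit declaration, the declared encoding is used instead.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><root name="Hervé"/>'
).encode("iso-8859-1")
latin1_name = etree.fromstring(latin1_doc).get("name")

print(type(name).__name__, type(count).__name__)
```

Attribute values are always text, so converting `count` to a number is left to the caller, e.g. `int(root.get("count"))`.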

participants (3)
- Hervé Cauwelier
- John Lovell
- Stefan Behnel