[lxml-dev] Encoding Issues
Hello,

I notice that I can pass an encoding parameter to an ElementTree's write method. Two things:

1) When lxml supports XML processing instructions, will these be updated accordingly? I notice that the DocBook XSL will update meta http-equiv HTML elements according to the value I pass in here, which is good, except for the following...

2) Why does this let me make up encodings such as "Noahs-Cool-Encoding" without raising an exception? If I were to take a guess I would imagine this falls back on UTF-8, but this is a bug IMO.

In case you are wondering, I am in the process of writing an XML-based HTTP publishing framework and lxml will be sitting at the very core of how I handle document conversion/manipulation/transformation.

The problem with encoding lies within my content negotiation module, which will (in a pythonic manner IMO) try to transform the document with each encoding specified in the Accept-Charset header of the client request. If the transform raises an exception we move on to the next one. My previous way of working would raise an exception if I tried an encoding it didn't recognise.

While I am aware of how to look up encoding names using the Python standard library, I am not sure if this correlates 100% with lxml, and additionally I don't feel this extra step should be necessary.

Thanks so much.

Noah

--
"Creativity can be a social contribution, but only in so far as society is free to use the results." - R. Stallman
Hi Noah,

Noah Slater wrote:
I notice that I can pass an encoding parameter to an elementtree's write method.
:) http://effbot.org/zone/pythondoc-elementtree-ElementTree.htm#elementtree.Ele...
Two things:
1) When lxml supports xml processing instructions, will these be updated accordingly?
lxml doesn't currently support processing instructions. If you meant the XML declaration, then: yes, starting with 1.0.beta.
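To illustrate the XML declaration following the encoding argument, here is a minimal sketch. It uses the standard library's `xml.etree.ElementTree` as a stand-in for the ElementTree API that lxml mirrors; that substitution is an assumption made for illustration, and lxml's exact output (quoting, capitalisation) may differ.

```python
import xml.etree.ElementTree as ET

# A tiny document containing one non-ASCII character.
root = ET.Element("doc")
root.text = "caf\u00e9"

# Serialise with a non-default encoding. The serialiser writes an XML
# declaration that names the encoding passed in here.
data = ET.tostring(root, encoding="iso-8859-1")

# The declaration is the first line of the serialised bytes.
declaration = data.split(b"\n", 1)[0]
print(declaration)
```

With the default UTF-8 output the standard library omits the declaration, so the effect is only visible for other encodings.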
2) Why does this let me make up encodings such as "Noahs-Cool-Encoding" without raising an exception? If I were to take a guess I would imagine this falls back on UTF-8, but this is a bug IMO.
True, guess we should ask libxml2 to parse the encoding and raise an exception if it is not known. Since it's already parsed a couple of times, that's not too much of a problem.
In case you are wondering, I am in the process of writing an XML-based HTTP publishing framework and lxml will be sitting at the very core of how I handle document conversion/manipulation/transformation.
Interesting. Feel free to post a URL to the list in case it becomes available online.
The problem with encoding lies within my content negotiation module which will (in a pythonic manner IMO) try to transform the document with each encoding specified in the Accept-Charset header of the client request. If the transform raises an exception we move on to the next one.
Sure, sounds sensible, although a commonly accepted encoding such as UTF-8 should be fine in most cases. As an optimisation, you can first check whether it is in the client's acceptance list and, only if it is not accepted, fall back to trying the listed encodings one after the other.
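The negotiation strategy discussed here could be sketched roughly as follows. The function names `parse_accept_charset` and `negotiate` are hypothetical, the header parsing is deliberately simplified, and the standard library's `xml.etree.ElementTree` stands in for lxml (an assumption), since it raises `LookupError` for an encoding it does not know.

```python
import xml.etree.ElementTree as ET


def parse_accept_charset(header):
    """Return charset names from an Accept-Charset header, highest q first."""
    entries = []
    for position, item in enumerate(header.split(",")):
        name, _, params = item.strip().partition(";")
        name = name.strip().lower()
        if not name:
            continue
        quality = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        # A quality of zero means "not acceptable", so drop the entry.
        if quality > 0:
            entries.append((-quality, position, name))
    # Sort by descending quality, keeping header order for ties.
    return [name for _, _, name in sorted(entries)]


def negotiate(root, header, preferred="utf-8"):
    """Serialise `root` in the first acceptable charset that works."""
    accepted = parse_accept_charset(header)
    # Fast path: use the preferred charset straight away if the client takes it.
    if preferred in accepted or "*" in accepted:
        return preferred, ET.tostring(root, encoding=preferred)
    # Otherwise try each charset in turn, skipping unknown names.
    for name in accepted:
        try:
            return name, ET.tostring(root, encoding=name)
        except LookupError:
            continue
    raise LookupError("no acceptable charset could be used")
```

For example, `negotiate(ET.Element("doc"), "Noahs-Cool-Encoding, iso-8859-1;q=0.5")` skips the made-up name and serialises as ISO-8859-1. A header with no usable entries raises `LookupError` in this sketch; a real framework would need to decide how to treat a missing or empty header.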
While I am aware of how to look up encoding names using the python standard library - I am not sure if this correlates 100% with lxml and additionally I don't feel this extra step should be necessary.
Well, it is necessary because we can only rely on encodings known in libxml2 (which uses iconv, so that's most of the encodings you will ever come across). And except for the UCS4 bug, libxml2 is pretty good at guessing what encoding was meant, so as long as no one finds a discrepancy between what Python understands and what libxml2 handles, I don't see a reason for changing anything here. I'll make sure we raise an exception for unknown encodings, though.

Stefan
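The extra lookup step mentioned above can be done with the standard library's codec registry. This is only a sketch: `is_known_encoding` is a hypothetical helper, and it checks what Python recognises, which, as noted in the reply, is not guaranteed to match the set of encodings libxml2 handles.

```python
import codecs


def is_known_encoding(name):
    """Return True if Python's codec registry recognises `name`."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


# Filter a list of candidate charsets down to the ones Python knows about.
candidates = ["utf-8", "Noahs-Cool-Encoding", "iso-8859-1"]
known = [name for name in candidates if is_known_encoding(name)]
print(known)
```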
Hi Stefan,
http://effbot.org/zone/pythondoc-elementtree-ElementTree.htm#elementtree.Ele...
Yeah, but help() and dir() are so much more fun, don't you think? ;)
Interesting. Feel free to post a URL to the list in case it becomes available online.
Without a doubt, my software is being developed under a GNU GPL licence and I intend to distribute it far and wide. :) I just wanted to say again, thanks for this great software! Regards, Noah -- "Creativity can be a social contribution, but only in so far as society is free to use the results." - R. Stallman
Participants (2): Noah Slater, Stefan Behnel