Fredrik Lundh wrote:
Steve Howe wrote:
Whatever the calling method gets named, it's a great feature, thanks.
so what's your use case?
(I hope you're aware that the XML file format is defined in terms of encoded data, not as sequences of Unicode code points, and that XML encoding involves more than just character sets; there's no such thing as an "XML document in a Unicode string")
For fun, let's look at the XML spec and see whether we can get some answers there. The spec says:

  The mechanism for encoding character code points into bit patterns MAY vary from entity to entity. All XML processors MUST accept the UTF-8 and UTF-16 encodings of Unicode 3.1

It also says:

  In the absence of external character encoding information (such as MIME headers), parsed entities which are stored in an encoding other than UTF-8 or UTF-16 MUST begin with a text declaration (see 4.3.1 The Text Declaration) containing an encoding declaration: ...

  [...]

  In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration.

Confusingly, the first part talks about "stored in an encoding other than..." while later on it talks about "information provided by an external transport protocol". Still, my interpretation would be that in the case of Python unicode strings, we *do* have a form of "external character encoding information". So, in the presence of such external information, the encoding declaration is *not* necessary in the document (and in fact I'd say it shouldn't be there in the case of XML in unicode strings).

Whether it's useful in practical applications to be able to store XML in Python unicode strings is an interesting debate. In the case of in-memory XML processors, it might simplify matters if you can just treat any text everywhere as unicode. At the very least, it'd simplify combining XML text with non-XML text. (You'd prefer to use the ElementTree API for such manipulation though. :)

On the other hand, in the lxml implementation it'll be slower than dealing with XML as UTF-8, as that's what libxml2 will be able to parse most quickly. So we could argue that encouraging the above usage pattern is going to lead to less than optimal performance. I don't consider that a big problem though, as the fast path is still available.

I'm fine with a tounicode() output function (I'd be more worried about unicode(), but I'm glad that idea got withdrawn already). I also don't see harm in accepting unicode input into the XML() function. I see that it fails in case an encoding is expressed in the XML itself, so that's good.

So, +1 to the current set of changes.

Regards,

Martijn
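
P.S. The spec's rule about byte-encoded entities can be sketched roughly as follows. This is only an illustration, and it assumes Python 3 and the standard library's xml.etree.ElementTree (not lxml); the helper parse_text() and the sample documents are made up for the example. It shows that UTF-8 bytes need no declaration, that Latin-1 bytes are fine when they declare their encoding, and that undeclared Latin-1 bytes are rejected.

```python
import xml.etree.ElementTree as ET

TEXT = "caf\u00e9"  # one non-ASCII character, so the byte encoding matters


def parse_text(data):
    """Hypothetical helper: return the root element's text,
    or None if the parser rejects the bytes."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError:
        return None


# UTF-8 bytes need no encoding declaration.
utf8_plain = "<a>{}</a>".format(TEXT).encode("utf-8")

# Latin-1 bytes that name their encoding in the XML declaration.
latin1_declared = (
    '<?xml version="1.0" encoding="iso-8859-1"?><a>{}</a>'.format(TEXT)
).encode("iso-8859-1")

# Latin-1 bytes with no declaration: the parser assumes UTF-8,
# and the lone 0xE9 byte is not valid UTF-8.
latin1_plain = "<a>{}</a>".format(TEXT).encode("iso-8859-1")

print(parse_text(utf8_plain) == TEXT)
print(parse_text(latin1_declared) == TEXT)
print(parse_text(latin1_plain) is None)
```

A Python unicode string sidesteps all of this because it has no byte encoding at all, which is why I read it as carrying its own "external character encoding information".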