Re: [lxml-dev] Re: Python unicode string support in lxml

10 May 2006

Fredrik Lundh wrote:
...
Steve Howe wrote:
...
Whatever the calling method gets named, it's a great feature, thanks.
so what's your use case?
...
(I hope you're aware that the XML file format is defined in terms of
encoded data, not as sequences of Unicode code points, and that XML
encoding involves more than just character sets; there's no such thing
as an "XML document in a Unicode string")
For fun let's look at the XML spec and see whether we can get some 
answers there.

The spec says:

   The mechanism for encoding character code points into bit patterns MAY
   vary from entity to entity. All XML processors MUST accept the UTF-8
   and UTF-16 encodings of Unicode 3.1

It also says:

   In the absence of external character encoding information (such as
   MIME headers), parsed entities which are stored in an encoding other
   than UTF-8 or UTF-16 MUST begin with a text declaration (see 4.3.1 The
   Text Declaration) containing an encoding declaration: ...

   [...]

   In the absence of information provided by an external transport
   protocol (e.g. HTTP or MIME), it is a fatal error for an entity
   including an encoding declaration to be presented to the XML processor
   in an encoding other than that named in the declaration, or for an
   entity which begins with neither a Byte Order Mark nor an encoding
   declaration to use an encoding other than UTF-8. Note that since ASCII
   is a subset of UTF-8, ordinary ASCII entities do not strictly need an
   encoding declaration.

Confusingly, the first passage talks about entities 'stored in an
encoding other than ...', while the second talks about 'information
provided by an external transport protocol'. Still, my interpretation
is that with Python unicode strings we *do* have a form of 'external
character encoding information'. Given that external information, the
encoding declaration is *not* necessary in the document (and in fact
I'd say it shouldn't be there for XML held in a unicode string).
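
To make the rules quoted above concrete, here is a minimal sketch. It
assumes modern Python 3 (where `str` plays the role of the unicode
string) and uses the standard library's xml.etree.ElementTree as a
stand-in parser rather than lxml; the sample document is made up.

```python
import xml.etree.ElementTree as ET

# XML held as text: the code points are already known, no declaration needed.
text = "<greeting>Gr\u00fc\u00dfe</greeting>"

# Encoded as UTF-8, the entity needs no encoding declaration.
utf8_root = ET.fromstring(text.encode("utf-8"))

# Encoded as ISO-8859-1 without a declaration, the bytes are not valid
# UTF-8, so the parser rejects them.
try:
    ET.fromstring(text.encode("iso-8859-1"))
    latin1_ok = True
except ET.ParseError:
    latin1_ok = False

# With a matching declaration, the same ISO-8859-1 bytes parse fine.
declared = (b'<?xml version="1.0" encoding="iso-8859-1"?>'
            + text.encode("iso-8859-1"))
declared_root = ET.fromstring(declared)
```

The point is only that the declaration describes the byte encoding;
once the text is already decoded, there is nothing left for it to
describe.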

Whether it's useful in practical applications to be able to store XML
in Python unicode strings is an interesting debate. For in-memory XML
processors it might simplify matters if you can just treat any text
everywhere as unicode. At the very least, it would simplify combining
XML text with non-XML text. (You'd prefer to use the ElementTree API
for that kind of manipulation, though. :)

On the other hand, in the lxml implementation it'll be slower than
dealing with XML as UTF-8 directly, as that's what libxml2 is able to
parse most quickly. So we could argue that encouraging the above usage
pattern is going to lead to less than optimal performance. I don't
consider that a big problem, though, as the fast path is still available.

I'm fine with a tounicode() output function (I'd be more worried about
unicode(), but I'm glad that idea has already been dropped). I also
don't see harm in accepting unicode input into the XML() function. I
see that it fails when the XML itself declares an encoding, so that's
good. So, +1 to the current set of changes.
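
As a rough illustration of the behaviour described above, here is a
sketch using only the Python 3 standard library. The helper names
parse_unicode_xml and to_unicode are hypothetical, and this is not
lxml's implementation; it only mimics the idea of accepting text input,
refusing text that carries its own encoding declaration, and
serialising back to text.

```python
import re
import xml.etree.ElementTree as ET

# Matches a leading XML declaration that names an encoding.
_ENCODING_DECL = re.compile(r"^\s*<\?xml[^>]*\bencoding\s*=")


def parse_unicode_xml(text):
    """Hypothetical helper: parse XML held in a str.

    Input that declares its own encoding is rejected, because the
    declaration would compete with what the str already tells us.
    """
    if not isinstance(text, str):
        raise TypeError("expected str input")
    if _ENCODING_DECL.match(text):
        raise ValueError("str input must not carry an encoding declaration")
    # Hand the parser UTF-8 bytes, which need no declaration.
    return ET.fromstring(text.encode("utf-8"))


def to_unicode(element):
    """Hypothetical helper: serialise an element to a str, no declaration."""
    return ET.tostring(element, encoding="unicode")
```

Usage would look like root = parse_unicode_xml("<a>text</a>") followed
by to_unicode(root); passing a string that starts with an XML
declaration naming an encoding raises ValueError in this sketch.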

Regards,

Martijn

Martijn Faassen