Stefan Behnel wrote:
Fredrik Lundh wrote:
there's no such thing as an "XML document in a Unicode string"
Well, there's things like "XML documents in files", "XML documents in HTTP" and "XML documents in SMTP", so why not "XML documents in Unicode strings"?
files, HTTP, and SMTP all deal with bytes (or if you prefer, octets). a Python Unicode string doesn't contain bytes; it contains a sequence of Unicode code points, which are indexes into an abstract character space. a Python Unicode string doesn't have an encoding. XML serialization is all about converting between the XML infoset (which contains sequences of abstract code points) and the XML file format (which contains bytes). an XML file is a bunch of bytes, not a bunch of code points. storing a bunch of bytes as a bunch of code points is simply not a very good idea, and is a great way to make people who don't understand Unicode to write XML applications that will break when exposed to non- ASCII text. </F>