Martin v. Löwis wrote:
I don't know. Is an XML document ill-formed if it doesn't contain an XML declaration, is not in UTF-8 or UTF-16, but there's external encoding info?
If there is external encoding info, matching the actual encoding, it would be well-formed. Of course, preserving that information would be up to the application.
OK. When the application passes an encoding to the decoder this is supposed to be the external encoding info, so for the encoder it makes sense to assume that the encoding passed to the encoder is the external encoding info and will be transmitted along with the encoded bytes.
This looks good. Now we would have to extend the code to detect and replace the encoding in the XML declaration too.
I'm still opposed to making this a codec. Right - for a pure Python solution, the processing of the XML declaration would still need to be implemented.
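For illustration only, the XML declaration handling discussed here could be sketched in pure Python roughly as follows. The name fix_encoding() is taken from this thread, but the signature, the regular expression, and the choice to operate on already-decoded text are assumptions of this sketch, not the code under discussion. It also assumes a declaration that starts with a version attribute, so text declarations of external entities that omit the version are left unchanged.

```python
import re

# Matches '<?xml version="..."' followed by an optional encoding attribute.
# Assumption: the declaration sits at the very start of the text.
_DECL = re.compile(
    r'(<\?xml\s+version\s*=\s*(?:"[^"]*"|\'[^\']*\'))'
    r'(\s+encoding\s*=\s*(?:"[^"]*"|\'[^\']*\'))?'
)

def fix_encoding(text, encoding):
    """Return text with the encoding in its XML declaration set to `encoding`.

    Text without an XML declaration is returned unchanged.
    """
    m = _DECL.match(text)
    if m is None:
        return text  # no XML declaration: nothing to replace
    # Keep the version part, write a fresh encoding attribute, and
    # append whatever followed (standalone attribute, '?>', the document).
    return f'{m.group(1)} encoding="{encoding}"{text[m.end():]}'
```

Used on a declaration without an encoding attribute, this adds one; used on a declaration that has one, it replaces the value.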
I think there could be a much simpler routine to have the same effect. - if it's less than 4 bytes, answer "need more data". Can there be an XML document that is less than 4 bytes? I guess not.
No, the smallest document has exactly 4 characters (e.g. "<f/>"). However, external entities may be smaller, such as "x".
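A minimal sketch of the "need more data" routine described above might look like this. The name detect_encoding() comes from this thread; the `final` flag (to allow short external entities such as "x"), the use of None for "need more data", and the returned codec names are assumptions of the sketch. It only looks at byte order marks and the layout of the first "<"; it does not read the encoding from the XML declaration, and it falls back to UTF-8 otherwise.

```python
import codecs

def detect_encoding(data, final=False):
    """Guess the encoding from the first bytes of an XML document.

    Returns None ("need more data") if fewer than 4 bytes are available
    and the caller has not signalled the end of input via `final`.
    """
    if len(data) < 4 and not final:
        return None
    # Byte order marks. The UTF-32 marks must be checked first because
    # the UTF-32-LE mark begins with the UTF-16-LE mark.
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # No byte order mark: infer the code unit width from the first "<".
    if data.startswith(b"<\x00\x00\x00"):
        return "utf-32-le"
    if data.startswith(b"\x00\x00\x00<"):
        return "utf-32-be"
    if data.startswith(b"<\x00"):
        return "utf-16-le"
    if data.startswith(b"\x00<"):
        return "utf-16-be"
    # Assumption: anything else is treated as UTF-8 until a declaration
    # says otherwise (parsing that declaration is not done here).
    return "utf-8"
```

A caller would feed it the bytes received so far and retry with more input whenever it returns None.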
But anyway: would a Python implementation of these two functions (detect_encoding()/fix_encoding()) be accepted?
I could agree to a Python implementation of this algorithm as long as it's not packaged as a codec.
I still can't understand your objection to a codec. What's the difference between UTF-16 decoding and XML decoding? In fact PEP 263 IMHO does specify how to decode Python source, so in theory it could be a codec (in practice this probably wouldn't work because of bootstrapping problems).

Servus,
   Walter