[Python-Dev] XML codec?

Sat Nov 10 16:55:41 CET 2007

"Martin v. Löwis" wrote:

>>> So what if the unicode string doesn't start with an XML declaration?
>>> Will it add one?
>>
>> No.
>
> Ok. So the XML document would be ill-formed then unless the encoding is
> UTF-8, right?

I don't know. Is an XML document ill-formed if it doesn't contain an XML declaration and is not in UTF-8 or UTF-16, but there's
external encoding info? If it is, then yes, the document would be ill-formed.

>> The point of this code is not just to return whether the string starts
>> with "<?xml" or not. There are actually three cases:
>
> Still, it's overly complex for that matter:
>
>>   * The string does start with "<?xml"
>
>    if s.startswith("<?xml"):
>      return Yes
>
>>   * The string starts with a prefix of "<?xml", i.e. we can only
>>     decide if it starts with "<?xml" if we have more input.
>
>    if "<?xml".startswith(s):
>      return Maybe
>
>>   * The string definitely doesn't start with "<?xml".
>
>    return No

This looks good. Now we would have to extend the code to detect and replace the encoding in the XML declaration too.
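
For illustration, the three cases could be written as runnable Python roughly like this. It is only a sketch: the names Match and starts_with_xml_decl are made up here and are not taken from the patch.

```python
from enum import Enum


class Match(Enum):
    """Three-way answer (hypothetical name, for illustration only)."""
    YES = "yes"
    NO = "no"
    MAYBE = "maybe"


XML_DECL_START = "<?xml"


def starts_with_xml_decl(s: str) -> Match:
    """Decide whether s starts with "<?xml", given possibly partial input."""
    if s.startswith(XML_DECL_START):
        # Enough input has arrived and it matches.
        return Match.YES
    if XML_DECL_START.startswith(s):
        # s is a proper prefix of "<?xml" (this includes the empty string),
        # so more input is needed before we can decide.
        return Match.MAYBE
    # s already differs from "<?xml" somewhere in its first characters.
    return Match.NO
```

An incremental decoder would keep buffering while the answer is MAYBE and only commit to an encoding once it gets YES or NO.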

>>> What bit fiddling are you referring to specifically that you think
>>> is better done in C than in Python?
>>
>> The code that checks the byte signature, i.e. the first part of
>> detect_xml_encoding_str().
>
> I can't see any *bit* fiddling there, except for the bit mask of
> candidates. For the candidate list, I cannot quite understand why
> you need a bit mask at all, since the candidates are rarely
> overlapping.

I tried many variants and that seemed to be the most straightforward one.

> I think there could be a much simpler routine to have the same
> effect.
> - if it's less than 4 bytes, answer "need more data".

Can there be an XML document that is shorter than 4 bytes? I guess not.

> - otherwise, implement annex F "literally". Make a dictionary
>   of all prefixes that are exactly 4 bytes, i.e.
>
>   prefixes4 = {"\x00\x00\xFE\xFF":"utf-32be", ...
>                   ..., "\0\x3c\0\x3f":"utf-16be"}
>
>   try: return prefixes4[s[:4]]
>   except KeyError: pass
>   if s.startswith(codecs.BOM_UTF16_BE):return "utf-16be"
>   ...
>   if s.startswith("<?xml"):
>      return get_encoding_from_declaration(s)
>   return "utf-8"

get_encoding_from_declaration() would have to do the same yes/no/maybe decision.
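
To make that concrete, here is a rough Python sketch of the dictionary-based routine that returns None for "need more data", both for short input and for an unfinished declaration. It is an assumption-laden illustration, not the code from the patch: the signature of detect_encoding(), the name ENCODING_RE and the regular expression are made up here, and only some of the Annex F signatures are covered (EBCDIC and the unusual UCS-4 byte orders are left out).

```python
import codecs
import re

# 4-byte signatures from XML 1.0 Appendix F (subset, see note above).
PREFIXES4 = {
    codecs.BOM_UTF32_BE: "utf-32be",   # 00 00 FE FF
    codecs.BOM_UTF32_LE: "utf-32le",   # FF FE 00 00
    b"\x00\x00\x00\x3c": "utf-32be",   # "<" without BOM
    b"\x3c\x00\x00\x00": "utf-32le",
    b"\x00\x3c\x00\x3f": "utf-16be",   # "<?" without BOM
    b"\x3c\x00\x3f\x00": "utf-16le",
}

# Shorter BOMs, checked only after the 4-byte table because the
# UTF-32LE BOM begins with the UTF-16LE BOM.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
]

# Hypothetical helper pattern: pulls the encoding name out of a declaration.
ENCODING_RE = re.compile(rb'''encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']''')


def detect_encoding(data: bytes):
    """Return an encoding name, or None if more data is needed."""
    if len(data) < 4:
        return None
    name = PREFIXES4.get(data[:4])
    if name is not None:
        return name
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    head = data[:5]
    if head == b"<?xml":
        end = data.find(b"?>")
        if end == -1:
            return None  # declaration not complete yet: the "maybe" case
        m = ENCODING_RE.search(data[:end])
        return m.group(1).decode("ascii").lower() if m else "utf-8"
    if b"<?xml".startswith(head):
        return None  # exactly "<?xm" so far: still "maybe"
    return "utf-8"
```

The 4-byte table is consulted first so that FF FE 00 00 is reported as UTF-32LE rather than being mistaken for a UTF-16LE BOM.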

But anyway: would a Python implementation of these two functions (detect_encoding()/fix_encoding()) be accepted?

Servus,
   Walter