[Python-Dev] XML codec?

Fri Nov 9 19:55:51 CET 2007

>> So what if the unicode string doesn't start with an XML declaration?
>> Will it add one?
> 
> No.

Ok. So the XML document would be ill-formed then unless the encoding is
UTF-8, right?

> The point of this code is not just to return whether the string starts
> with "<?xml" or not. There are actually three cases:

Still, the code is overly complex for that:

>   * The string does start with "<?xml"

   if s.startswith("<?xml"):
     return Yes

>   * The string starts with a prefix of "<?xml", i.e. we can only
>     decide if it starts with "<?xml" if we have more input.

   if "<?xml".startswith(s):
     return Maybe

>   * The string definitely doesn't start with "<?xml".

   return No
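
Put together, the three cases fit in one short function. This is only a
sketch: the name classify_xml_prefix and the string results "yes",
"maybe" and "no" are made up for illustration and stand in for the
Yes/Maybe/No above.

```python
def classify_xml_prefix(s, marker="<?xml"):
    """Classify s against marker (names and return values are illustrative)."""
    # Case 1: enough input has arrived and it starts with the marker.
    if s.startswith(marker):
        return "yes"
    # Case 2: s is a proper prefix of the marker (including the empty
    # string), so the answer depends on input that has not arrived yet.
    if marker.startswith(s):
        return "maybe"
    # Case 3: s already differs from the marker somewhere.
    return "no"
```

Note that the order of the two tests matters: a string equal to "<?xml"
satisfies both, and should be reported as a definite match.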

>> What bit fiddling are you referring to specifically that you think
>> is better done in C than in Python?
> 
> The code that checks the byte signature, i.e. the first part of
> detect_xml_encoding_str().

I can't see any *bit* fiddling there, except for the bit mask of
candidates. For the candidate list, I cannot quite understand why
you need a bit mask at all, since the candidates rarely overlap.

I think a much simpler routine could have the same effect:
- if there are fewer than 4 bytes, answer "need more data".
- otherwise, implement annex F "literally". Make a dictionary
  of all prefixes that are exactly 4 bytes, i.e.

  prefixes4 = {"\x00\x00\xFE\xFF": "utf-32be", ...
               ..., "\0\x3c\0\x3f": "utf-16be"}

  try: return prefixes4[s[:4]]
  except KeyError: pass
  if s.startswith(codecs.BOM_UTF16_BE): return "utf-16be"
  ...
  if s.startswith("<?xml"):
     return get_encoding_from_declaration(s)
  return "utf-8"
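
For illustration, the routine above might be filled in roughly as
follows. This is a sketch under several assumptions, not a finished
implementation: it works on Python 3 bytes, the table holds only some of
the 4-byte signatures from annex F of the XML spec (the unusual UCS-4
byte orders and EBCDIC are left out), returning None stands for "need
more data", and the regex body of get_encoding_from_declaration() is a
guess at what such a helper could do.

```python
import codecs
import re

# Partial table of exact 4-byte signatures; an assumption based on annex F.
PREFIXES4 = {
    b"\x00\x00\xfe\xff": "utf-32be",  # UTF-32 BOM, big-endian
    b"\xff\xfe\x00\x00": "utf-32le",  # UTF-32 BOM, little-endian
    b"\x00\x00\x00\x3c": "utf-32be",  # "<" without BOM
    b"\x3c\x00\x00\x00": "utf-32le",
    b"\x00\x3c\x00\x3f": "utf-16be",  # "<?" without BOM
    b"\x3c\x00\x3f\x00": "utf-16le",
}

# Hypothetical helper: pull the encoding name out of an ASCII-compatible
# XML declaration, falling back to UTF-8 when none is given.
_DECL_ENCODING = re.compile(
    rb"<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']"
)


def get_encoding_from_declaration(s):
    m = _DECL_ENCODING.match(s)
    if m:
        return m.group(1).decode("ascii").lower()
    return "utf-8"


def detect_xml_encoding(s):
    # Fewer than 4 bytes: cannot decide yet.
    if len(s) < 4:
        return None
    # Exact 4-byte signatures first, so that the UTF-32 BOMs are not
    # mistaken for the shorter UTF-16 BOMs they begin with.
    try:
        return PREFIXES4[s[:4]]
    except KeyError:
        pass
    if s.startswith(codecs.BOM_UTF16_BE):
        return "utf-16be"
    if s.startswith(codecs.BOM_UTF16_LE):
        return "utf-16le"
    if s.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if s.startswith(b"<?xml"):
        return get_encoding_from_declaration(s)
    return "utf-8"
```

The sketch does not check that the declaration is complete before
searching it; a streaming decoder would have to ask for more data until
it has seen the closing "?>".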

Regards,
Martin