[Python-Dev] XML codec?

Walter Dörwald walter at livinglogic.de
Fri Nov 9 13:49:41 CET 2007


Martin v. Löwis wrote:

>> Because you can force the encoder to use a specified encoding. If you do
>> this and the unicode string starts with an XML declaration
> 
> So what if the unicode string doesn't start with an XML declaration?
> Will it add one?

No.

> If so, what version number will it use?

If we added this, we could give the encoder constructor an extra
argument, version, defaulting to '1.0'.

>>>> OK, so should I put the C code into a _xml module?
>>> I don't see the need for C code at all.
>> Doing the bit fiddling for
>> Modules/_codecsmodule.c::detect_xml_encoding_str() in C felt like the
>> right thing to do.
> 
> Hmm. I don't think a sequence like
> 
> +    if (strlen>0)
> +    {
> +        if (*str++ != '<')
> +            return 1;
> +        if (strlen>1)
> +        {
> +            if (*str++ != '?')
> +                return 1;
> +            if (strlen>2)
> +            {
> +                if (*str++ != 'x')
> +                    return 1;
> +                if (strlen>3)
> +                {
> +                    if (*str++ != 'm')
> +                        return 1;
> +                    if (strlen>4)
> +                    {
> +                        if (*str++ != 'l')
> +                            return 1;
> +                        if (strlen>5)
> +                        {
> +                            if (*str != ' ' && *str != '\t' && *str != '\r' && *str != '\n')
> +                                return 1;
> 
> is well-maintainable C. I feel it is much better writing
> 
>   if not s.startswith("<?xml"):
>      return 1

The point of this code is not just to return whether the string starts
with "<?xml" or not. There are actually three cases:
  * The string does start with "<?xml"
  * The string starts with a prefix of "<?xml", i.e. we can only
    decide if it starts with "<?xml" if we have more input.
  * The string definitely doesn't start with "<?xml".
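
For illustration, the three cases can be sketched in Python roughly as
follows. This is only a sketch, not the code from the patch: the function
name starts_with_xml_decl and the use of None for "need more input" are
made up here, and the check for a whitespace character after "<?xml" is
modelled on the quoted C fragment.

```python
from typing import Optional

XML_DECL_START = "<?xml"
XML_WHITESPACE = " \t\r\n"


def starts_with_xml_decl(s: str) -> Optional[bool]:
    """Hypothetical helper: three-state check for an XML declaration.

    Returns True if s starts with "<?xml" followed by whitespace,
    False if it definitely does not, and None if s is only a prefix
    of such a start, so more input is needed to decide.
    """
    n = min(len(s), len(XML_DECL_START))
    # Any mismatch in the characters we do have settles the question.
    if s[:n] != XML_DECL_START[:n]:
        return False
    # Everything matched so far, but the character after "<?xml" is
    # still missing, so we cannot decide yet.
    if len(s) <= len(XML_DECL_START):
        return None
    # "<?xml" must be followed by whitespace to be a declaration
    # (otherwise it could be e.g. "<?xml-stylesheet").
    return s[len(XML_DECL_START)] in XML_WHITESPACE
```

An incremental decoder would keep buffering input while the result is
None and only commit to an answer once it gets True or False, which is
what a plain startswith() call cannot express.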

> What bit fiddling are you referring to specifically that you think
> is better done in C than in Python?

The code that checks the byte signature, i.e. the first part of
detect_xml_encoding_str().
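
To give an idea of what such a signature check looks like when written in
Python, here is a simplified sketch. It is not the code from
detect_xml_encoding_str(); it loosely follows the autodetection table in
the XML 1.0 specification, and the function name, the returned codec
names and the "utf-8" fallback are assumptions made for this example.

```python
import codecs
from typing import Optional


def guess_encoding_from_signature(data: bytes) -> Optional[str]:
    """Hypothetical helper: guess an encoding family from the first bytes.

    Returns None if fewer than four bytes are available (undecided),
    otherwise a Python codec name. "utf-8" is only a provisional answer
    for ASCII-compatible input; the real encoding would still have to be
    read from the XML declaration, if there is one.
    """
    if len(data) < 4:
        return None  # need more input before committing
    # Byte order marks: test UTF-32 first, because the UTF-32-LE BOM
    # begins with the UTF-16-LE BOM.
    if data.startswith((codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE)):
        return "utf-32"
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # No BOM: look at how "<" (and "?") are laid out in the first bytes.
    head = data[:4]
    if head == b"\x00\x00\x00<":
        return "utf-32-be"
    if head == b"<\x00\x00\x00":
        return "utf-32-le"
    if head == b"\x00<\x00?":
        return "utf-16-be"
    if head == b"<\x00?\x00":
        return "utf-16-le"
    return "utf-8"
```

Note that this version simply waits for four bytes; the C code instead
decides as early as the available bytes allow, which is where most of
the nested length checks come from.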

Servus,
   Walter





