Any reason why cStringIO in 2.5 behaves differently from 2.4?

Sat Jul 28 21:14:26 EDT 2007

Michael L Torrie <torriem at chem.byu.edu> wrote:
> Stefan Scholl wrote:
>> Don't let the subject line fool you. I'm OK with cStringIO. The
>> thread is now about xml.sax's parseString().
> 
> Giving you the benefit of the doubt here, despite the fact that Stefan
> Behnel has stated this over and over again and you just haven't listened.

Speaking of over and over again ...

> xml.sax's use of parseString() is exactly correct.  xml.sax should
> *never* parse python unicode strings as by definition XML must be
> encoded as a *byte stream*, which is what a python string is.

I don't care about the definition of XML at this point in the
program. http://docs.python.org/lib/module-xml.sax.html calls
parseString() a convenience function.

This is Python. Python has a class named unicode. Its literals
look like strings. The base class is basestring.

xml.sax belongs to Python. Batteries included. parseString() is
in Python.

It's not parseString() that tells me something is wrong with the
parameter. It's cStringIO, which is used on platforms where it is
available. On other platforms no exceptions are thrown, because
StringIO is used there, and it behaves the same in Python 2.4 and
Python 2.5 with regard to unicode strings.

Other libraries like LXML (not included) parse unicode strings.

And these are two additional lines in my code now:

    if isinstance(string, unicode):
        string = string.encode("utf-8")
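
As a minimal sketch, those two lines could be folded into a small
helper that sits in front of xml.sax.parseString(). The names
parse_text and TagCollector are hypothetical and exist only for
illustration. The try/except around unicode is an addition so the
sketch also runs on Python versions that have no separate unicode
class. Encoding to UTF-8 assumes the document does not declare a
different encoding in its XML declaration.

```python
import xml.sax

try:
    text_type = unicode  # Python 2: the separate unicode class
except NameError:
    text_type = str      # later Pythons: str is the text type


class TagCollector(xml.sax.ContentHandler):
    """Hypothetical handler that records element names in order."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.tags = []

    def startElement(self, name, attrs):
        self.tags.append(name)


def parse_text(string, handler):
    """Hypothetical wrapper: encode text input, then parse bytes."""
    if isinstance(string, text_type):
        # Assumption: no conflicting encoding="..." declaration.
        string = string.encode("utf-8")
    xml.sax.parseString(string, handler)


handler = TagCollector()
parse_text(u"<doc><item>\u00e4</item></doc>", handler)
```

After the call, handler.tags holds the element names in document
order, and the non-ASCII character no longer reaches the string
buffer as a unicode object.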

> A python /unicode/ string could be held internally in any number of
> ways, 2, 3, 4, or even 8 bytes per character if the implementation
> demanded it (a bit contrived, I admit).  Since the xml parser is only
> ever intended to parse *XML*, why should it ever know what to do with
> python unicode strings, which could be stored any number of ways, making
> byte-parsing impossible?

xml.sax is not an external parser. The program doesn't have to
communicate with the outside world at this point of execution.
The Python program calls a Python function in a Python module and
passes a Python unicode string as a parameter.

XML parsers only have to support a few encodings. But nobody
objects when they support more than that.

A Python convenience function isn't broken when it allows Python
unicode strings.

The behavior of cStringIO (the original topic of this thread) is
correct and documented. parseString() uses the old idiom where
cStringIO is imported as StringIO when available, despite the
fact that the two behave differently.

In my personal opinion: If parseString() shouldn't support
unicode strings, then it should check for them and throw a
meaningful exception.
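
A rough sketch of what such a check might look like, written as a
standalone wrapper rather than as a change to xml.sax itself. The
name parse_bytes_only and the wording of the message are
hypothetical, and the try/except around unicode is only there so
the sketch also runs on Python versions without a unicode class.

```python
import xml.sax

try:
    text_type = unicode  # Python 2: the separate unicode class
except NameError:
    text_type = str      # later Pythons: str is the text type


def parse_bytes_only(string, handler):
    """Hypothetical strict variant: reject text, accept bytes."""
    if isinstance(string, text_type):
        # Fail early with a message that names the real problem,
        # instead of an error from deep inside a string buffer.
        raise TypeError(
            "parse_bytes_only() expects an encoded byte string, "
            "got %s; encode it first, e.g. with .encode('utf-8')"
            % type(string).__name__
        )
    xml.sax.parseString(string, handler)
```

With this, the caller sees the same error on every platform,
independent of which string buffer implementation is in use.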

At the moment the code just looks as if someone has overlooked
the fact that unicode strings (with non-ASCII characters in them)
cause a problem. Missing test?

> So your code is faulty in its assumptions, not xml.sax.

As I said in the conclusion, a few messages ago: Undocumented,
implementation-dependent behavior.

Or maybe just a bug, considering the following on
http://docs.python.org/lib/module-xml.sax.html

        A typical SAX application uses three kinds of objects:
        readers, handlers and input sources. ``Reader'' in this
        context is another term for parser, i.e. some piece of
        code that reads the bytes or characters from the input
        source, and produces a sequence of events.

Bytes _or_ characters.

-- 
Web (en): http://www.no-spoon.de/ -*- Web (de): http://www.frell.de/