[ python-Bugs-1470540 ] XMLGenerator creates a mess with UTF-16

SourceForge.net noreply at sourceforge.net
Fri Apr 14 22:07:46 CEST 2006


Bugs item #1470540, was opened at 2006-04-15 00:07
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1470540&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: XML
Group: Python 2.5
Status: Open
Resolution: None
Priority: 5
Submitted By: Nikolai Grigoriev (ngrig)
Assigned to: Nobody/Anonymous (nobody)
Summary: XMLGenerator creates a mess with UTF-16

Initial Comment:
When the output encoding in xml.sax.saxutils.XMLGenerator
is set to UTF-16, the result is a terrible mess. Namely:

- it does not encode the XML declaration at the very
top of the file (leaving it in single-byte Latin);

- it leaves the closing '>' of each start tag unencoded
(that is, always outputs a single byte);

- it inserts a spurious byte order mark for each tag,
each attribute, each text node, and each processing
instruction.

A test illustrating the issue is attached. The issue is
applicable to both stable (2.4.3) and current (2.5)
versions of Python.

---------------------------------------------
Looking in xml/sax/saxutils.py, I see the problem in
XMLGenerator._write():
   - one-byte strings aren't recoded at all (sic!);
   - two-byte strings are converted using
unicode.encode(); this results in a BOM for each call of
_write() on Unicode strings.
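
The per-call BOM can be reproduced outside XMLGenerator. The following is
only a sketch, written for Python 3 rather than the 2.4/2.5 interpreters
this report is about; the list of pieces is a made-up stand-in for
successive _write() calls, not the actual saxutils code.

```python
import codecs

# Hypothetical stand-in for three successive _write() calls.
pieces = ["<doc>", "text", "</doc>"]

# Encoding each piece on its own: every encode() call with the generic
# "utf-16" codec prepends a byte order mark.
per_call = b"".join(p.encode("utf-16") for p in pieces)

# codecs.BOM_UTF16 is the BOM in the platform's native byte order,
# which is also what the "utf-16" codec emits when encoding.
bom_count = per_call.count(codecs.BOM_UTF16)
print(bom_count)
```

Each piece carries its own BOM, so the concatenated output contains one
BOM per call instead of a single one at the start of the document.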

The issue is easy to fix by using a StreamWriter instead
of a plain stream as the output sink. I am going to
submit a patch shortly.
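
A rough sketch of the StreamWriter idea, again in Python 3 and not taken
from the forthcoming patch: the in-memory sink and the list of pieces are
assumptions made for illustration only.

```python
import codecs
import io

# Hypothetical output sink; a real caller would pass a file object.
sink = io.BytesIO()

# codecs.getwriter() returns the StreamWriter class for the encoding;
# wrapping the sink gives a writer that keeps state between writes.
writer = codecs.getwriter("utf-16")(sink)

for piece in ["<doc>", "text", "</doc>"]:
    writer.write(piece)

data = sink.getvalue()
print(data.count(codecs.BOM_UTF16))
```

Because the writer remembers that it has already started the stream, the
BOM is emitted only with the first write, and the result decodes back to
the original text as a single UTF-16 document.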

Regards,
Nikolai Grigoriev 

----------------------------------------------------------------------
