[Python-Dev] Minidom and Unicode

M.-A. Lemburg mal@lemburg.com
Mon, 03 Jul 2000 18:37:12 +0200

Paul Prescod wrote:
> "M.-A. Lemburg" wrote:
> >
> > ...
> >
> > IMHO, all auto-conversions should use the default encoding. The
> > main point here is not to confuse the user with even more magic
> > happening under the hood.
> I don't see anything confusing about having unicode-escape be the
> appropriate escape used for repr. Maybe we need to differentiate between
> lossless and lossy encodings. If the default encoding is lossless then
> repr could use it. Otherwise it could use unicode-escape.

Simply because auto-conversion should use one single encoding
throughout the code.
> Anyhow, why would it be wrong for Fredrik to hard-code an encoding in
> repr but right for me to hard-code one in minidom? 

Because hardcoding the encoding into the core Python API touches
all programs. Hardcoded encodings should be userland options
wherever possible.

Besides, we're talking about __repr__ which is mainly a
debug tool and doesn't affect program flow or interfacing
in any way. The format used is a userland decision and the
encoding used for it is too.
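
To make the lossless versus lossy distinction quoted above concrete,
here is a minimal sketch. It is written for present-day Python 3, not
the interpreter under discussion in this thread, and the helper name
is_lossless and the sample string are hypothetical, chosen only for
illustration. It checks whether a string survives an encode/decode
round trip in a given encoding.

```python
def is_lossless(text, encoding):
    """Return True if text survives an encode/decode round trip.

    Hypothetical helper for illustration; not part of any library.
    """
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        # The encoding cannot represent at least one character.
        return False


# Sample string (an assumption): contains U+00E9 and U+20AC.
sample = "caf\u00e9 \u20ac"

print(is_lossless(sample, "unicode-escape"))  # True
print(is_lossless(sample, "latin-1"))         # False: U+20AC is not in Latin-1
print(is_lossless(sample, "ascii"))           # False
```

Under this check, unicode-escape round-trips any string, while whether
Latin-1 or ASCII does depends on the data.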

> Users should not need
> to comb through the hundreds of modules in the library figuring out what
> kind of Unicode handling they should expect. It should be as centralized
> as possible.

> > If the programmer knows that he'll have to deal with Unicode
> > then he should make sure that the proper encoding is used
> > and document it that way, e.g. use unicode-escape for Minidom's
> > __repr__ methods.
> One of the major goals of our current Unicode auto-conversion
> "compromise" is that modules like xmllib and minidom should work with
> Unicode out of the box without any special enhancements. According to
> Guido, that's the primary reason we have Unicode auto-conversions at
> all.
> http://www.python.org/pipermail/i18n-sig/2000-May/000173.html
> I'm going to fight very hard to make basic Unicode support in Python
> modules "just work" without a bunch of internationalization knowledge
> from the programmer.

Great :-)

The next big project ought to be getting the standard lib
to work with Unicode input. A good way to test-drive this is
to run Python with the -U option.

> __repr__ is pretty basic.
> > > the reason for this patch was to avoid forcing everyone to deal with
> > > this in their own code, by providing some kind of fallback behaviour.
> >
> > That's what your patch does; I don't see a reason to change it :-)
> If you're still proposing that I should deal with it in a particular
> module's domain-specific code then the patch isn't done yet!

You don't have to: a user who uses Latin-1 tag names will see
the output of __repr__ as Latin-1... pretty straightforward
if you ask me. If you want to make sure that __repr__ output
is printable everywhere you should use an explicit lossless
encoding for your application.

Again, this is a userland decision which you'll have to make.
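
As a rough illustration of what "use an explicit lossless encoding"
could look like in application code, here is a sketch in present-day
Python 3. The Element class is a hypothetical stand-in, not minidom's
actual implementation, and unicode-escape is just one possible choice
of encoding.

```python
class Element:
    """Hypothetical stand-in for a DOM element (not minidom's class)."""

    def __init__(self, tag_name):
        self.tag_name = tag_name

    def __repr__(self):
        # Encode explicitly with unicode-escape so the result is plain
        # ASCII and does not depend on any default encoding.
        escaped = self.tag_name.encode("unicode-escape").decode("ascii")
        return "<Element %s>" % escaped


elem = Element("stra\u00dfe")  # a tag name containing U+00DF
print(repr(elem))  # <Element stra\xdfe>
```

The escaped form is less readable for a Latin-1 user than the raw
characters, which is the trade-off the application has to weigh.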

Marc-Andre Lemburg
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/