[Python-Dev] Adding .decode() method to Unicode

Tue, 12 Jun 2001 12:13:21 +0200

"Martin v. Loewis" wrote:
> 
> > str.encode()
> > str.decode()
> > uni.encode()
> > #uni.decode() # still missing
> 
> It's not missing. str.decode and uni.encode go through a single codec;
> that's easy. str.encode is somewhat more confusing, because it really
> is unicode(str).encode. Now, you are not proposing that uni.decode is
> str(uni).decode, are you?

No. uni.decode() will (just like the other methods) interface
directly to the codec's decoder -- there is no magic conversion
involved. It is meant to be used by Unicode-to-Unicode codecs.
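
To illustrate what "directly interface to the codec's decoder" looks
like, here is a minimal sketch. It assumes a current Python 3
interpreter (which postdates this thread) and uses the module-level
codecs.decode() function with the stdlib rot_13 text-to-text codec as
a stand-in; it does not assume the proposed uni.decode() method exists.

```python
import codecs

# Assumption: Python 3, where "rot_13" is a text-to-text codec in the stdlib.
# codecs.decode() looks up the codec and calls its decoder on the object
# as-is; no implicit conversion to bytes or back happens along the way.
text = "Uryyb, Jbeyq"
decoded = codecs.decode(text, "rot_13")

print(type(text).__name__, "->", type(decoded).__name__)
print(decoded)
```

Input and output are both text objects, which is the shape of
transformation a Unicode-to-Unicode codec would provide.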

> If not that, what else would it mean? And if it means something else,
> it is clearly not symmetric to str.encode, so it is not "missing".

It is missing in the sense that strings support this method and
Unicode objects currently don't.

> > One very useful application for this method is XML unescaping
> > which turns numeric XML entities into Unicode chars.
> 
> Ok. Please show me how that would work. More precisely, please write a
> PEP describing the rationale for this feature, including use case
> examples and precise semantics of the proposed addition.

There's no need for a PEP. This addition is much too simple
to require a PEP on its own.

As for use cases: I have already given a whole bunch of them
(Unicode compression, normalization, escaping in various ways).

Codecs are in no way constrained to only interface between
strings and Unicode. There are many other possible uses for
them. Just look at the latest checkins: they add a bunch of
string-string codecs which solve common real-life problems
and do not interface to Unicode at all.
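
As a rough sketch of what such string-string codecs look like in use,
the following assumes a current Python 3 interpreter, where the
closest equivalents are the bytes-to-bytes codecs hex_codec and
zlib_codec reached through codecs.encode() and codecs.decode(). The
payload is made up for illustration.

```python
import codecs

# Assumption: Python 3 stdlib bytes-to-bytes codecs; sample data is made up.
payload = b"spam and eggs"

# Hex escaping: bytes in, bytes out, no Unicode involved.
hexed = codecs.encode(payload, "hex_codec")
restored = codecs.decode(hexed, "hex_codec")

# Compression through the same interface.
packed = codecs.encode(payload * 20, "zlib_codec")
unpacked = codecs.decode(packed, "zlib_codec")

print(hexed)
print(len(payload * 20), "->", len(packed), "bytes")
```

Both transformations go through the same lookup-by-name mechanism,
which is the extensibility being argued for here.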

> > The key argument for these interfaces is that they provide
> > an extensible transformation mechanism for string and binary
> > data.
> 
> That is too general for me to understand; I need to see detailed
> examples that solve real-world problems.
> 
> Regards,
> Martin
> 
> P.S. I don't think that unescaping XML character entities into
> Unicode characters is a useful application in itself. This is normally
> done by the XML parser, which not only has to deal with character
> entities, but also with general entities and a lot of other markup.
> Very few people write XML parsers, and they are using the string
> methods and the sre module successfully (if the parser is written in
> Python - a C parser would do the unescaping before even passing the
> text to Python).

True, but not all XML text out there is meant for XML parsers to
read ;-). Preprocessing XML text in Python, for example, is a rather
common thing to do, and this is what the direct codec access methods
are meant for.
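
A minimal sketch of the XML unescaping use case as a text-to-text
codec follows. It assumes a current Python 3 interpreter; the codec
name "xmlcharref_demo" and the helper functions are hypothetical,
invented for this illustration, and are not part of the standard
library or of the proposal itself. Only numeric character references
are handled, not general entities.

```python
import codecs
import re

# Hypothetical demo codec: names below are illustrative, not stdlib.
_CHARREF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")

def _unescape(text, errors="strict"):
    """Decoder: turn numeric character references into characters."""
    def repl(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return chr(code)
    return _CHARREF.sub(repl, text), len(text)

def _escape(text, errors="strict"):
    """Encoder: replace non-ASCII characters with decimal references."""
    out = "".join(c if ord(c) < 128 else "&#%d;" % ord(c) for c in text)
    return out, len(text)

def _search(name):
    # Codec search function: answer only for our hypothetical codec name.
    if name == "xmlcharref_demo":
        return codecs.CodecInfo(encode=_escape, decode=_unescape, name=name)
    return None

codecs.register(_search)

raw = "caf&#233; &#x20AC;5"
clean = codecs.decode(raw, "xmlcharref_demo")
escaped = codecs.encode(clean, "xmlcharref_demo")
print(clean)
print(escaped)
```

Once registered, the transformation is available by name to any code
that goes through the codec registry, without touching an XML parser.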

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/