[Python-Dev] Adding .decode() method to Unicode

M.-A. Lemburg mal@lemburg.com
Tue, 12 Jun 2001 13:42:31 +0200


"Martin v. Loewis" wrote:
> 
> > > > str.encode()
> > > > str.decode()
> > > > uni.encode()
> > > > #uni.decode() # still missing
> > >
> > > It's not missing. str.decode and uni.encode go through a single codec;
> > > that's easy. str.encode is somewhat more confusing, because it really
> > > is unicode(str).encode. Now, you are not proposing that uni.decode is
> > > str(uni).decode, are you?
> >
> > No. uni.decode() will (just like the other methods) directly
> > interface to the codecs decoder -- there is no magic conversion
> > involved. It is meant to be used by Unicode-Unicode codecs
> 
> When invoking "Hallo".encode("utf-8"), two conversions are executed:
> first the default decoding into Unicode, then the UTF-8 encoding. Of
> course, that is not the intended use (but then, is the intended use
> documented anywhere?): instead, people should write
> "Hallo".encode("base64"). This is an example I can understand,
> although I'm not sure why it is inherently better to write this
> instead of writing base64.encodestring("Hallo").

Please note that the conversion from string to Unicode is done
by the codec, not the .encode() interface.
 
> > > If not that, what else would it mean? And if it means something else,
> > > it is clearly not symmetric to str.encode, so it is not "missing".
> >
> > It is in the sense that strings support this method and Unicode
> > currently doesn't.
> 
> The rationale for string.encode is weak: it argues that string->string
> conversions are frequent enough to justify this API, even though these
> conversions have nothing to do with coded character sets.

You still don't get it: codecs can be used for much more than
just character set conversion !
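
A minimal sketch of that point, written for a current Python 3 rather
than the interpreter discussed in this thread. It assumes the standard
"base64", "rot_13" and "zlib" codecs, which today are reached through
codecs.encode() and codecs.decode() instead of the string methods. None
of the three converts between character sets:

```python
import codecs

# bytes -> bytes: a transport encoding, not a character set
b64 = codecs.encode(b"Hallo", "base64")

# str -> str: a text-to-text transformation
rot = codecs.encode("Hallo", "rot_13")

# bytes -> bytes: compression, round-tripped through the same codec name
packed = codecs.encode(b"Hallo Hallo Hallo", "zlib")
unpacked = codecs.decode(packed, "zlib")

print(b64)       # b'SGFsbG8=\n'
print(rot)       # Unyyb
print(unpacked)  # b'Hallo Hallo Hallo'
```

In each case the codec alone decides which input and output types it
accepts; the calling interface only looks the codec up by name and
hands the data over.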
 
> So far, I can see *no* rationale for unicode.decode.
> 
> > There's no need for a PEP. This addition is much too simple
> > to require a PEP on its own.
> 
> PEP 1 says:
> 
> # We intend PEPs to be the primary mechanisms for proposing new
> # features, for collecting community input on an issue, and for
> # documenting the design decisions that have gone into Python.  The
> # PEP author is responsible for building consensus within the
> # community and documenting dissenting opinions.
> 
> So we have a proposal for a new feature, and we have dissenting
> opinions. Who are you to decide that this addition is too simple to
> require a PEP on its own?

So you want a PEP for each and every small addition to the
core ?! (I am not talking about features which might break code !)
 
> > As for use cases: I have already given a whole bunch of them
> > (Unicode compression, normalization, escaping in various ways).
> 
> I was asking for specific examples: Names of specific codecs that you
> want to implement, and application code fragments using these specific
> codecs. I don't know how to use Unicode compression if I had this
> proposed feature, for example. I know what XML escaping is, and I
> cannot see how this feature would help.

I think I have given enough examples in this thread already. See
below for some more.
 
> > True, but not all XML text out there is meant for XML parsers to
> > read ;-). Preprocessing of e.g. XML text in Python is a rather common
> > thing to do and this is what the direct codec access methods are
> > meant for.
> 
> Can you give an example of an application which processes XML without
> a parser, but with converting character entities (preferably
> open-source, so I can study its code)? I wonder whether they get CDATA
> sections right... MAL, I really mean that: Please don't make claims
> that something is common or useful without giving an *exact* example.

Yes, I am using this feature in real code and no, I can't show it to
you because it's closed source. XML is only one example where this
would be useful; HTML is another text format which would benefit
from it, and URL encoding is yet another application. You basically
find these applications in all situations where some form of
escaping is needed.
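
As a rough illustration of such an escaping codec (a sketch only, not
the closed-source code mentioned above), the following registers a
text-to-text codec under the hypothetical name "xmlescape". The codec
name and the helper functions are assumptions made for this example;
it is written for a current Python 3 and leans on
xml.sax.saxutils.escape() and unescape() for the actual work:

```python
import codecs
from xml.sax.saxutils import escape, unescape


def _xml_encode(text, errors="strict"):
    # Codec functions return (output, number of input items consumed).
    return escape(text), len(text)


def _xml_decode(text, errors="strict"):
    return unescape(text), len(text)


def _search(name):
    # "xmlescape" is a made-up codec name used only for this sketch.
    if name == "xmlescape":
        return codecs.CodecInfo(
            encode=_xml_encode, decode=_xml_decode, name="xmlescape"
        )
    return None


codecs.register(_search)

escaped = codecs.encode("a < b & c", "xmlescape")
restored = codecs.decode(escaped, "xmlescape")

print(escaped)   # a &lt; b &amp; c
print(restored)  # a < b & c
```

Both directions take text and return text, which is the kind of
Unicode-Unicode codec a direct uni.decode() would give access to. As
Martin notes, a transformation this simple knows nothing about CDATA
sections.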

What I am trying to do here is simplify codec access and usage
for the casual user. .encode() and .decode() are very intuitive
ways to deal with data transformation, IMHO.
 
> Regards,
> Martin
> 
> P.S. This insistence on adding Unicode and string methods makes it
> appear as if the author of the codecs module now thinks that the API
> of it sucks.

No comment.

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/