rrr at ronadam.com
Sat Feb 18 13:17:42 CET 2006
Josiah Carlson wrote:
> Ron Adam <rrr at ronadam.com> wrote:
>> Josiah Carlson wrote:
>>> Bengt Richter had a good idea with bytes.recode() for strictly bytes
>>> transformations (and the equivalent for text), though it is ambiguous as
>>> to the direction; are we encoding or decoding with bytes.recode()? In
>>> my opinion, this is why .encode() and .decode() makes sense to keep on
>>> both bytes and text, the direction is unambiguous, and if one has even a
>>> remote idea of what the heck the codec is, they know their result.
>>> - Josiah
>> I like the bytes.recode() idea a lot. +1
>> It seems to me it's a far more useful idea than encoding and decoding by
>> overloading and could do both and more. It has a lot of potential to be
>> an intermediate step for encoding as well as being used for many other
>> translations to byte data.
> Indeed it does.
>> I think I would prefer that encode and decode be just functions with
>> well defined names and arguments instead of being methods or arguments
>> to string and Unicode types.
> Attaching it to string and unicode objects is a useful convenience.
> Just like x.replace(y, z) is a convenience for string.replace(x, y, z) .
> Tossing the encode/decode somewhere else, like encodings, or even string,
> I see as a backwards step.
>> I'm not sure on exactly how this would work. Maybe it would need two
>> sets of encodings, ie.. decoders, and encoders. An exception would be
>> given if it wasn't found for the direction one was going in.
>> Roughly... something or other like:
>> import encodings
>>
>> # proposed functions for the encodings module
>> def tostr(obj, encoding):
>>     if encoding not in encoders:
>>         raise LookupError('encoding not found in encoders')
>>     # check if obj works with encoding to string
>>     # ...
>>     b = bytes(obj).recode(encoding)
>>     return str(b)
>>
>> def tounicode(obj, decoding):
>>     if decoding not in decoders:
>>         raise LookupError('decoding not found in decoders')
>>     # check if obj works with decoding to unicode
>>     # ...
>>     b = bytes(obj).recode(decoding)
>>     return unicode(b)
>> Anyway... food for thought.
> Again, the problem is ambiguity; what does bytes.recode(something) mean?
> Are we encoding _to_ something, or are we decoding _from_ something?
This was just an example of one way that might work, but here are my
thoughts on why I think it might be good.
In this case, the ambiguity is reduced as far as the encoding and
decoding operations are concerned.
somestring = encodings.tostr( someunicodestr, 'latin-1')
To me, it's pretty clear what is happening.
It will encode an object, named someunicodestr, to a string with
the 'latin-1' encoder.
It would also result in clear errors if the specified encoding is
unavailable, or, if it is available, if it's not compatible with the
given *someunicodestr* object type.
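A minimal sketch of those failure modes, written for a Python 3 interpreter (an assumption: this thread predates it, and there the "string" result corresponds to bytes). The name tostr follows the proposal, but the body is only a stand-in built on the existing codecs.lookup() and str.encode() calls, not on the suggested bytes.recode().

```python
import codecs

def tostr(obj, encoding):
    """Hypothetical stand-in for the proposed encodings.tostr()."""
    # An unknown codec name fails first, with LookupError.
    codecs.lookup(encoding)
    # An object of the wrong type fails with an explicit TypeError.
    if not isinstance(obj, str):
        raise TypeError("tostr() expects text, got %s" % type(obj).__name__)
    # Text the codec cannot represent raises UnicodeEncodeError.
    return obj.encode(encoding)

somestring = tostr("caf\u00e9", "latin-1")
```

Each kind of mistake surfaces as its own exception type, so the caller can tell a misspelled codec from incompatible input.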
Further hints could be gained from the function's help text, which
could result in something like:
    encodings.tostr( <string|unicode>, <encoder> ) -> string
    Encode a unicode string using an encoder codec to a
    non-unicode string or transform a non-unicode string
    to another non-unicode string using an encoder codec.
And if that's not enough, then help(encodings) could give more clues.
These steps would be what I would do. And then the next thing would be
to find the python docs entry on encodings.
Placing them in encodings seems like a fairly good place to look for
these functions if you are working with encodings. So I find that just
as convenient as having them be string methods.
There is no intermediate default encoding involved above, (the bytes
object is used instead), so you wouldn't get some of the messages the
present system results in when ascii is the default.
(Yes, I know it won't when P3K is here also)
> Are we going to need to embed the direction in the encoding/decoding
> name (to_base64, from_base64, etc.)? That doesn't seem any better than
> binascii.b2a_base64 .
No, that's why I suggested two separate lists (or dictionaries might be
better). They can contain the same names, but the list a name is in
determines the context and points to the needed codec. And that step is
abstracted out by putting it inside the encodings.tostr() and
encodings.tounicode() functions.
So either function would look up 'base64' in the correct codec list and
get the encoding or decoding codec it needs.
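The two-registry idea can be sketched roughly as follows, assuming Python 3 and using the standard base64 and binascii functions as stand-in codecs. The names encoders, decoders, _recode, encode_bytes and decode_bytes are hypothetical and exist only for this illustration.

```python
import base64
import binascii

# Hypothetical registries: the same name appears in both, and the
# registry that gets consulted fixes the direction.
encoders = {"base64": base64.b64encode, "hex": binascii.hexlify}
decoders = {"base64": base64.b64decode, "hex": binascii.unhexlify}

def _recode(data, table, name, table_name):
    """Look up one bytes -> bytes transformation by name and apply it."""
    if name not in table:
        raise LookupError("%s not found in %s" % (name, table_name))
    return table[name](data)

def encode_bytes(data, encoding):
    return _recode(data, encoders, encoding, "encoders")

def decode_bytes(data, decoding):
    return _recode(data, decoders, decoding, "decoders")

packed = encode_bytes(b"hello", "base64")
unpacked = decode_bytes(packed, "base64")
```

The caller never spells out a direction in the codec name; choosing encode_bytes or decode_bytes selects the table, and the shared helper does the same lookup either way.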
> What about .reencode and .redecode? It seems as
> though the 're' added as a prefix to .encode and .decode makes it
> clearer that you get the same type back as you put in, and it is also
> unambiguous to direction.
But then wouldn't we end up with a multitude of ways to do things?
s.encode(codec) == s.redecode(codec)
s.decode(codec) == s.reencode(codec)
unicode(s, codec) == s.decode(codec)
str(u, codec) == u.encode(codec)
str(s, codec) == s.encode(codec)
unicode(s, codec) == s.reencode(codec)
str(u, codec) == s.redecode(codec)
str(s, codec) == s.redecode(codec)
Umm .. did I miss any? Which ones would you remove?
Which ones of those will succeed with which codecs?
The method bytes.recode() always does a byte transformation, which can
be almost anything. It's the context bytes.recode() is used in that
determines what's happening. In the above cases, it's using an encoding
transformation, so what it's doing is precisely what you would expect
from that context.
There isn't a bytes.decode(), since that's just another transformation.
So only the one method is needed, which makes it easier to learn.
> The question remains: is str.decode() returning a string or unicode
> depending on the argument passed, when the argument quite literally
> names the codec involved, difficult to understand? I don't believe so;
> am I the only one?
> - Josiah
Using help(str.decode) and help(str.encode) gives:
S.decode([encoding[,errors]]) -> object
S.encode([encoding[,errors]]) -> object
These look an awful lot alike. The descriptions are nearly identical as
well. The Python docs just reproduce the doc strings (or come close to
it), with only a few additional words.
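For comparison, and assuming a Python 3 interpreter rather than the 2.x one whose help text is quoted above, the two methods there no longer mirror each other: each type keeps only one direction. A small sketch:

```python
text = "caf\u00e9"

# str offers only .encode(), and it returns bytes.
data = text.encode("utf-8")

# bytes offers only .decode(), and it returns str.
round_trip = data.decode("utf-8")

# The opposite-direction methods are absent on each type.
str_has_decode = hasattr(str, "decode")
bytes_has_encode = hasattr(bytes, "encode")
```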
Learning how the current system works comes awfully close to reverse
engineering. Maybe I'm overstating it a bit, but I suspect many end up
doing exactly that in order to learn how Python does it.
Or they go with the first solution that seems to work and hope for the
best. I believe that's what Martin said earlier in this thread.
It's much too late (or early now) to think further on this. So until
(please ignore typos) ;-)