[Python-Dev] bytes.from_hex()
Ron Adam
rrr at ronadam.com
Fri Feb 24 23:46:00 CET 2006
* The following reply is a rather longer explanation than I intended of
why codings like 'rot' (and how they differ) aren't the same thing as
pure unicode codecs and probably should be treated differently.
If you already understand that, then I suggest skipping this. But if
you like detailed logical analysis, it might be of some interest even if
it's reviewing the obvious to those who already know.
(And hopefully I didn't make any really obvious errors myself.)
Stephen J. Turnbull wrote:
>>>>>> "Ron" == Ron Adam <rrr at ronadam.com> writes:
>
> Ron> We could call it transform or translate if needed.
>
> You're still losing the directionality, which is my primary objection
> to "recode". The absence of directionality is precisely why "recode"
> is used in that sense for i18n work.
I think you're not understanding what I suggested. It might help if we
could agree on some points and then go from there.
So, let's consider a "codec" and a "coding" as being two different things,
where a codec is a subset of the unicode character set expressed in
a native format, and a coding is *not* a subset of the unicode
character set, but an _operation_ performed on text. So you would have
the following properties.
codec -> text is always in *one_codec* at any time.
coding -> operation performed on text.
Let's add a special default coding called 'none' to represent a do
nothing coding. (figuratively, for explanation purposes)
'none' -> return the input as is, or the uncoded text
Given the above relationships we have the following possible
transformations.
1. codec to like codec: 'ascii' to 'ascii'
2. codec to unlike codec: 'ascii' to 'latin1'
And we have coding relationships of:
a. coding to like coding # Unchanged, do nothing
b. coding to unlike coding
Then we can express all the possible combinations as...
[1.a, 1.b, 2.a, 2.b]
1.a -> coding in codec to like coding in like codec:
'none' in 'ascii' to 'none' in 'ascii'
1.b -> coding in codec to diff coding in like codec:
'none' in 'ascii' to 'base64' in 'ascii'
2.a -> coding in codec to same coding in diff codec:
'none' in 'ascii' to 'none' in 'latin1'
2.b -> coding in codec to diff coding in diff codec:
'none' in 'latin1' to 'base64' in 'ascii'
This last one is a problem, as some codecs combine coding with character
set encoding and return text in a different encoding than they received.
The line is also blurred between types and encodings. Is unicode an
encoding? Will bytes also be an encoding?
Using the above combinations:
(1.a) is just creating a new copy of an object.
s = str(s)
(1.b) is recoding an object, it returns a copy of the object in the same
encoding.
s = s.encode('hex-codec') # ascii str -> ascii str coded in hex
s = s.decode('hex-codec') # ascii str coded in hex -> ascii str
* these are really two different operations. And encoding repeatedly
results in nested codings. Codecs (as a pure subset of unicode) don't
have that property.
* the hex-codec also fits the 2.b pattern below if the source string is
of a different type than ascii. (or the default string?)
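To make the nesting point concrete, here is a minimal sketch. It assumes
a modern Python 3 interpreter rather than the 2.x str methods shown
above, so the hex transform is reached through codecs.encode() and
codecs.decode() on bytes.

```python
import codecs

original = b"abc"

# Each pass through the hex coding wraps the previous result again.
once = codecs.encode(original, "hex_codec")
twice = codecs.encode(once, "hex_codec")
print(once)   # b'616263'
print(twice)  # b'363136323633'

# One decode only peels off one layer; two are needed to get back.
assert codecs.decode(twice, "hex_codec") == once
assert codecs.decode(codecs.decode(twice, "hex_codec"), "hex_codec") == original
```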
(2.a) creates a copy encoded in a new codec.
s = s.encode('latin1')
* I believe string constructors should have an encoding argument for use
with unicode strings.
s = str(u, 'latin1') # This would match the bytes constructor.
(2.b) are combinations of the above.
s = u.encode('base64')
# unicode to ascii string as base64 coded characters
u = unicode(s.decode('base64'))
# ascii string coded in base64 to unicode characters
or
>>> u = unicode(s, 'base64')
Traceback (most recent call last):
File "<stdin>", line 1, in ?
TypeError: decoder did not return an unicode object (type=str)
Ooops... ;)
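As an aside on how the same mix-up looks today, here is a small sketch.
It assumes Python 3.4 or later (an assumption about the interpreter, not
something from this thread), where the bytes and str methods accept only
text encodings, and base64 has to go through codecs.encode() and
codecs.decode() instead.

```python
import codecs

# bytes -> bytes transform; this codec appends a trailing newline
coded = codecs.encode(b"A string.", "base64")

# bytes.decode() only accepts text encodings here, so asking it
# for base64 is rejected with LookupError.
try:
    coded.decode("base64")
    rejected = False
except LookupError:
    rejected = True

# The coding step and the character-set step are separate calls.
plain = codecs.decode(coded, "base64")   # undo the coding
text = plain.decode("ascii")             # then pick the codec
print(rejected, plain, text)
```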
So is coding the same as a codec? I think they have different
properties and should be treated differently except when the
practicality over purity rule is needed. And in those cases maybe the
names could clearly state the result.
u.decode('base64ascii') # name indicates coding to codec
> A string. -> QSBzdHJpbmcu -> UVNCemRISnBibWN1
Looks like the underlying sequence is:
native string -> unicode -> unicode coded base64 -> coded ascii str
And decode operation would be...
coded ascii str -> unicode coded base64 -> unicode -> ascii str
Except it may combine some of these steps to speed it up.
Since it's a hybrid codec that includes a coding operation, we have to
treat it as a codec.
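The quoted chain can be checked with a short sketch. It uses the
standard base64 module under Python 3, which is my choice for
illustration; the codec machinery discussed here may combine these
steps internally.

```python
import base64

text = "A string."

# native string -> bytes in a character set -> base64 coded bytes
step1 = base64.b64encode(text.encode("ascii"))
# coding the already coded text nests a second layer
step2 = base64.b64encode(step1)
print(step1.decode("ascii"))  # QSBzdHJpbmcu
print(step2.decode("ascii"))  # UVNCemRISnBibWN1

# Decoding walks the same chain backwards, one layer per call.
restored = base64.b64decode(base64.b64decode(step2)).decode("ascii")
assert restored == text
```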
> Ron> * Given that the string type gains a __codec__ attribute
> Ron> to handle automatic decoding when needed. (is there a reason
> Ron> not to?)
>
> Ron> str(object[,codec][,error]) -> string coded with codec
>
> Ron> unicode(object[,error]) -> unicode
>
> Ron> bytes(object) -> bytes
>
> str == unicode in Py3k, so this is a non-starter. What do you want to
> say?
>
> Ron> * a recode() method is used for transformations that
> Ron> *do_not* change the current codec.
>
> I'm not sure what you mean by the "current codec". If it's attached
> to an "encoded object", it should be the codec needed to decode the
> object. And it should be allowed to be a "codec stack".
I wasn't thinking in terms of stacks, but in that case the current codec
would be the top of the stack. For the record, I think stackable codecs
are a very bad idea.
Back to recode vs encode/decode, the example used above might be useful.
s = s.encode('hex-codec') # ascii str -> ascii str coded in hex
s = s.decode('hex-codec') # ascii str coded in hex -> ascii str
In my opinion these are actually two very different (although related)
operations that would be better expressed with different names.
Currently it's a hybrid codec that converts its input to an ascii string
(or default encoding?), but when decoding you end up with an ascii
encoding even if you started with something else. So the decode isn't a
true inverse to encode in some cases.
As a coding operation it would be.
u = u.recode('to_hex')
u = u.recode('from_hex')
Where this would work with both unicode and strings without changing the
codec.
It also keeps the 'if I do it again, it will *recode* the coded text'
relationship. So I think the name is appropriate. IMHO
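A rough sketch of what such a recode() could look like, written as a
plain function for Python 3. Everything here is hypothetical: the
function itself, the 'to_hex' and 'from_hex' names taken from the
example above, and the charset parameter, which is an assumption needed
to turn text into bytes before the hex step.

```python
import binascii

def recode(obj, coding, charset="utf-8"):
    """Apply a hex coding and return the same type that was passed in.

    Hypothetical helper, not an existing API. `charset` is an
    assumption used only to move str through a bytes representation.
    """
    is_text = isinstance(obj, str)
    raw = obj.encode(charset) if is_text else bytes(obj)
    if coding == "to_hex":
        out = binascii.hexlify(raw)
    elif coding == "from_hex":
        out = binascii.unhexlify(raw)
    else:
        raise ValueError("unknown coding: %r" % (coding,))
    # Hand back text for text and bytes for bytes: the type is unchanged.
    return out.decode(charset) if is_text else out

u = recode("abc", "to_hex")      # str in, str out
b = recode(b"abc", "to_hex")     # bytes in, bytes out
print(type(u).__name__, u)
print(type(b).__name__, b)

# Doing it again recodes the coded text, and each layer undoes in turn.
nested = recode(u, "to_hex")
assert recode(recode(nested, "from_hex"), "from_hex") == "abc"
```

The point of the design is only that the input and output types match,
so the operation never has to pick a codec on the caller's behalf.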
Pure codecs such as latin-1 can be invoked over and over and you can
always get back what you put in, in a single step.
>>> s = 'abc'
>>> for n in range(100):
... s = s.encode('latin-1')
...
>>> print s, type(s)
abc <type 'str'>
Supposedly a lot of these issues will go away in Python 3000. And we can
probably live with the current state of things. But even after Python
3000 it seems to me we will still need access to codecs as we may run
across encoded text input from various sources.
Cheers,
Ron