[Python-Dev] str.translate vs unicode.translate (was: Re: str object going in Py3K)
Bengt Richter
bokr at oz.net
Fri Feb 17 03:25:25 CET 2006
If str becomes unicode for Py 3000, and we then have bytes as our encoding-agnostic
byte data type, then I think bytes should have the str translate method, with a tweak
that I would hope could also be made to str now.
BTW, str.translate will presumably become unicode.translate, so
perhaps unicode.translate should grow a compatible deletechars parameter.
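For comparison, unicode.translate can already delete characters today, but only by
mapping their ordinals to None, not via a deletechars string (a small sketch, spelled
so it also runs where str is unicode; the sample string is illustrative):

```python
# unicode.translate today: deletion is spelled as a mapping to None,
# not as a separate deletechars argument as in str.translate.
u_str = u'caf\xe9 bar'
assert u_str.translate({0xe9: None}) == u'caf bar'
```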
But that's not the tweak. The tweak is to eliminate the currently unavoidable pre-conversion to unicode
in str(something).translate(u'...', delchars) (and, preemptively, in bytes(something).translate(u'...', delchars)).
E.g. suppose you now want to write:
s_str.translate(table, delch).encode('utf-8')
Note that s_str has no encoding information, and translate is conceptually just a 1:1 substitution
minus the characters in delch. But if we want to do a one-chr:one-unichr substitution by specifying a
256-long table of unicode characters, we cannot. It would be simple to allow it, and that's the
tweak I would like. It would allow easy custom decodes.
At the moment, if you want to write the above, you have to introduce a phony latin-1 decoding
and write it as (not typo-proof)
s_str.translate(table, delch).decode('latin-1').encode('utf-8') # use str.translate
or
s_str.decode('latin-1').translate(mapping).encode('utf-8') # use unicode.translate also for delch
to avoid exceptions if you have non-ascii in your s_str (even if delch would have removed them!!)
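Concretely, the second workaround looks like this (a sketch using b'' bytes-literal
notation for the byte string; the mapping keys are ordinals, and the sample values
are illustrative):

```python
# Phony latin-1 decode: every byte 0-255 maps to the same-ordinal
# unicode character, so no UnicodeDecodeError is possible; then
# unicode.translate does the substitution, and we encode at the end.
s_str = b'caf\xe9'                 # byte string with a non-ascii byte
mapping = {0xe9: ord('e')}         # 1:1 substitution by ordinal
result = s_str.decode('latin-1').translate(mapping).encode('utf-8')
assert result == b'cafe'
```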
It seems s_str.translate(table, delchars) wants to convert the s_str to unicode
if table is unicode, and then use unicode.translate (which bombs on delchars!)
instead of just effectively defining str.translate as
    def translate(self, table, deletechars=None):
        return ''.join((table or isinstance(table, unicode) and uidentity or sidentity)[ord(x)]
                       for x in self
                       if not deletechars or x not in deletechars)
# For convenience in just pruning with deletechars, s_str.translate('', deletechars)
# deletes without translating, and s_str.translate(u'', deletechars) does the same
# and then maps to same-ord unicode characters, given
#     sidentity = ''.join(chr(i) for i in xrange(256))
# and
#     uidentity = u''.join(unichr(i) for i in xrange(256)).
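To make the intent concrete, here is a runnable pure-Python model of the proposed
semantics (hypothetical, nothing like the real C implementation; written in
unicode-str spelling, so read chr as unichr on a 2.x interpreter):

```python
def translate(s, table, deletechars=None):
    # Proposed tweak: index a 256-entry table directly by byte ordinal;
    # deletechars is honored before any lookup, so non-ascii bytes
    # never trigger an implicit ascii decode.
    return ''.join(table[b] for b in bytearray(s)
                   if not deletechars or chr(b) not in deletechars)

# identity table mapping each ordinal 0-255 to the same-ordinal character
uidentity = ''.join(chr(i) for i in range(256))

assert translate(b'a\xf6b', uidentity, deletechars='\xf6') == 'ab'
```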
IMO, if you want unicode.translate, then it doesn't hurt to write unicode(s_str).translate and use that.
Let str.translate just use the str ords, so simple custom decodes can be written without
the annoyance of e.g.,
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 3: ordinal not in range(128)
Can we change this for bytes? And why couldn't we change this for str.translate now?
Or what am I missing? I certainly would like to miss the above message for str.translate :-(
BTW, this would also allow taking advantage of features of both translates if desired, e.g. by
s_str.translate(unichartable256, strdelchrs).translate(uniord_to_ustr_or_uniord_mapping).
(e.g., the latter permits single to multiple-character substitution)
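The second step's single-to-multiple substitution is already possible with today's
unicode.translate, since mapping values may be strings of any length (illustrative
values; the latin-1 decode stands in for the proposed first step):

```python
# unicode.translate allows mapping one ordinal to a multi-character
# replacement string, which str.translate's 1:1 table cannot express.
u_str = b'a\xe9z'.decode('latin-1')     # stand-in for the first step
assert u_str.translate({0xe9: u' acute-e '}) == u'a acute-e z'
```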
I think at least a tweaked translate method for bytes would be good for py3k,
and I hope we can do it for str.translate now.
It is just too handy a high-speed conversion goodie to forgo, IMO.
Regards,
Bengt Richter