[Python-Dev] str.translate vs unicode.translate (was: Re: str object going in Py3K)
Bengt Richter
bokr at oz.net
Fri Feb 17 03:25:25 CET 2006
If str becomes unicode for Py 3000, and we then have bytes as our encoding-agnostic
byte data type, then I think bytes should have the str translate method, with a tweak
that I would hope could also be made to str now.
BTW, str.translate will presumably become unicode.translate, so
perhaps unicode.translate should grow a compatible deletechars parameter.
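For comparison, unicode.translate can already delete characters today, but only by
mapping their ordinals to None, not via a deletechars string (a small sketch, spelled
so it also runs where str is unicode; the sample string is illustrative):

```python
# unicode.translate today: deletion is spelled as a mapping to None,
# not as a separate deletechars argument as in str.translate.
u_str = u'caf\xe9 bar'
assert u_str.translate({0xe9: None}) == u'caf bar'
```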
But that's not the tweak. The tweak is to eliminate the currently unavoidable pre-conversion to unicode
in str(something).translate(u'...', delchars) (and, preemptively, in bytes(something).translate(u'...', delchars)).
E.g. suppose you now want to write:
s_str.translate(table, delch).encode('utf-8')
Note that s_str has no encoding information, and translate is conceptually just a 1:1 substitution
minus the characters in delch. But if we want to do a one-chr:one-unichr substitution by specifying a
256-long table of unicode characters, we cannot. It would be simple to allow it, and that's the
tweak I would like. It would allow easy custom decodes.
At the moment, if you want to write the above, you have to introduce a phony latin-1 decoding
and write it as (not typo-proof)
s_str.translate(table, delch).decode('latin-1').encode('utf-8') # use str.translate
or
s_str.decode('latin-1').translate(mapping).encode('utf-8') # use unicode.translate also for delch
to avoid exceptions if you have non-ascii in your s_str (even if delch would have removed them!!)
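Concretely, the second workaround looks like this (a sketch using b'' bytes-literal
notation for the byte string; the mapping keys are ordinals, and the sample values
are illustrative):

```python
# Phony latin-1 decode: every byte 0-255 maps to the same-ordinal
# unicode character, so no UnicodeDecodeError is possible; then
# unicode.translate does the substitution, and we encode at the end.
s_str = b'caf\xe9'                 # byte string with a non-ascii byte
mapping = {0xe9: ord('e')}         # 1:1 substitution by ordinal
result = s_str.decode('latin-1').translate(mapping).encode('utf-8')
assert result == b'cafe'
```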
It seems s_str.translate(table, delchars) wants to convert the s_str to unicode
if table is unicode, and then use unicode.translate (which bombs on delchars!)
instead of just effectively defining str.translate as
    def translate(self, table, deletechars=None):
        return ''.join((table or isinstance(table, unicode) and uidentity or sidentity)[ord(x)]
                       for x in self
                       if not deletechars or x not in deletechars)
# For convenience in just pruning with deletechars, s_str.translate('', deletechars)
# deletes without translating, and s_str.translate(u'', deletechars) does the same
# and then maps to same-ord unicode characters, given
#     sidentity = ''.join(chr(i) for i in xrange(256))
# and
#     uidentity = u''.join(unichr(i) for i in xrange(256)).
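To make the intent concrete, here is a runnable pure-Python model of the proposed
semantics (hypothetical, nothing like the real C implementation; written in
unicode-str spelling, so read chr as unichr on a 2.x interpreter):

```python
def translate(s, table, deletechars=None):
    # Proposed tweak: index a 256-entry table directly by byte ordinal;
    # deletechars is honored before any lookup, so non-ascii bytes
    # never trigger an implicit ascii decode.
    return ''.join(table[b] for b in bytearray(s)
                   if not deletechars or chr(b) not in deletechars)

# identity table mapping each ordinal 0-255 to the same-ordinal character
uidentity = ''.join(chr(i) for i in range(256))

assert translate(b'a\xf6b', uidentity, deletechars='\xf6') == 'ab'
```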
IMO, if you want unicode.translate, then it doesn't hurt to write unicode(s_str).translate and use that.
Let str.translate just use the str ords, so simple custom decodes can be written without
the annoyance of e.g.,
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 3: ordinal not in range(128)
Can we change this for bytes? And why couldn't we change this for str.translate now?
Or what am I missing? I certainly would like to miss the above message for str.translate :-(
BTW, this would also allow taking advantage of features of both translates if desired, e.g. by
s_str.translate(unichartable256, strdelchrs).translate(uniord_to_ustr_or_uniord_mapping).
(e.g., the latter permits single to multiple-character substitution)
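The second step's single-to-multiple substitution is already possible with today's
unicode.translate, since mapping values may be strings of any length (illustrative
values; the latin-1 decode stands in for the proposed first step):

```python
# unicode.translate allows mapping one ordinal to a multi-character
# replacement string, which str.translate's 1:1 table cannot express.
u_str = b'a\xe9z'.decode('latin-1')     # stand-in for the first step
assert u_str.translate({0xe9: u' acute-e '}) == u'a acute-e z'
```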
I think at least a tweaked translate method for bytes would be good for py3k,
and I hope we can do it for str.translate now.
It is just too handy a high-speed conversion goodie to forgo, IMO.
Regards,
Bengt Richter