[Python-porting] bytes != str ... a few notes

Mon Dec 15 19:01:50 CET 2008

On Mon, December 15, 2008 3:03 am, John Machin wrote:
> === Comparing bytes objects with str objects ===
> A tentative solution when maintaining one codebase which runs as is on
> 2.x and from which 3.x code is generated:
>
> if python_version >= (3, 0):
>     def STR2BYTES(x, encoding='latin1'):
>         return x.encode(encoding)
> else:
>     def STR2BYTES(x): return x

Perhaps your STR2BYTES function should first test whether "x" is already a
byte string, so that converting it twice cannot fail. As it stands, if "x"
is passed through STR2BYTES again later by some other chunk of code that is
oblivious to its history, the call fails on 3.x with an AttributeError:

>>> STR2BYTES(something_already_a_byte_string)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in STR2BYTES
AttributeError: 'bytes' object has no attribute 'encode'
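
A minimal sketch of such a guard, keeping your function name. It assumes
that python_version is equivalent to sys.version_info, and the extra
"encoding" parameter on the 2.x branch is my addition so that both branches
accept the same arguments:

```python
import sys

if sys.version_info >= (3, 0):
    def STR2BYTES(x, encoding='latin1'):
        # Already a byte string: hand it back untouched instead of
        # calling .encode(), which bytes objects do not have on 3.x.
        if isinstance(x, (bytes, bytearray)):
            return x
        return x.encode(encoding)
else:
    def STR2BYTES(x, encoding='latin1'):
        # On 2.x a plain str is already a byte string.
        return x
```

With the guard in place, STR2BYTES(STR2BYTES(x)) returns the same bytes as
STR2BYTES(x), so code further down the road can call it without knowing
whether "x" was converted before.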

An approach similar to yours is what the authors of Durus, a ZODB-like
Python object database, have taken. They add an isinstance(s, byte_string)
test to avoid any attempt at re-encoding a byte string (which on 3.x would
raise an AttributeError, since a byte string there never has an "encode"
method).

Sadly the same is not true in 2.x: there a byte string (str) does have an
"encode" method, so the mistake never shows up as an AttributeError.

Browse the relevant module:

http://www.mems-exchange.org/software/durus/Durus-3.8.tar.gz/Durus-3.8/utils.py

Or peek at this snippet from within::

    if sys.version < "3":
        from __builtin__ import xrange
        from __builtin__ import str as byte_string
        def iteritems(x):
            return x.iteritems()
        def next(x):
            return x.next()
        from cStringIO import StringIO as BytesIO
        from cPickle import dumps, loads, Unpickler, Pickler
    else:
        xrange = range
        from builtins import next, bytearray, bytes
        byte_string = (bytearray, bytes)
        def iteritems(x):
            return x.items()
        from io import BytesIO
        from pickle import dumps, loads, Unpickler, Pickler

    def as_bytes(s):
        """Return a byte_string produced from the string s."""
        if isinstance(s, byte_string):
            return s
        else:
            return s.encode('latin1')

    empty_byte_string = as_bytes("")
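
One thing to keep in mind with the 'latin1' default: it can only encode
code points up to U+00FF. The sketch below is a Python 3 only stand-in for
the Durus helper (it hard-codes the 3.x byte_string tuple), and
can_encode_latin1 is a hypothetical name of mine, not part of Durus:

```python
def as_bytes(s):
    """Python 3 only stand-in for the Durus helper quoted above."""
    if isinstance(s, (bytes, bytearray)):
        return s
    return s.encode('latin1')


def can_encode_latin1(s):
    """Report whether as_bytes(s) succeeds for the text string s."""
    try:
        as_bytes(s)
    except UnicodeEncodeError:
        return False
    return True


# Code points 0..255 map one-to-one onto byte values 0..255.
all_latin1 = "".join(chr(i) for i in range(256))
round_trip_ok = as_bytes(all_latin1) == bytes(range(256))

# Anything above U+00FF, such as the euro sign, cannot be encoded.
euro_ok = can_encode_latin1("\u20ac")
```

So the helper is safe for '\x'-style binary literals written as text, but
not a general replacement for choosing a real encoding for user-visible
strings.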

I wish it were as easy as searching for '\x'-y looking literals to find
areas that will work in 2 but fail in 3. That's a start, but there are
little surprises to be found elsewhere. Consider the following, which of
course won't fail on Python < 3 but will on >= 3::

    Python 3:
    >>> import hashlib
    >>> x = ":".join(('1','plus','two'))
    >>> x
    '1:plus:two'
    >>> hashlib.md5(x)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: object supporting the buffer API required

    # an easy fix that works on both 2.x and 3.x:
    >>> hashlib.md5(as_bytes(x))
    <md5 HASH object @ 0x8933400>
    >>> hashlib.md5(as_bytes(x)).hexdigest()
    '4e3a3a8075a6982177c24af5179ec82c'

Failing code and failing unit tests ought to pick up most of these sorts
of issues.
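
As a rough illustration of the kind of unit test that would catch the
hashlib case, here is a sketch for Python 3. The names digest and
DigestTest are hypothetical, and as_bytes is again a simplified 3.x-only
stand-in for the Durus helper:

```python
import hashlib
import unittest


def as_bytes(s):
    """Python 3 only stand-in for the Durus helper."""
    if isinstance(s, (bytes, bytearray)):
        return s
    return s.encode('latin1')


def digest(value):
    """Hex MD5 digest of a text or byte string."""
    return hashlib.md5(as_bytes(value)).hexdigest()


class DigestTest(unittest.TestCase):
    def test_raw_text_is_rejected(self):
        # On 3.x hashlib refuses text strings outright.
        with self.assertRaises(TypeError):
            hashlib.md5(":".join(('1', 'plus', 'two')))

    def test_text_and_bytes_agree(self):
        # Going through as_bytes gives the same digest either way.
        self.assertEqual(digest("1:plus:two"), digest(b"1:plus:two"))
```

Saved in a test module, this can be run with "python -m unittest". Without
the as_bytes call inside digest, the second test fails with the same
TypeError as in the session above.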