2to3, str, and basestring
Terry Reedy
tjreedy at udel.edu
Sat Sep 7 15:31:07 EDT 2019
2to3 converts syntactically valid 2.x code to syntactically valid 3.x
code. It cannot, however, guarantee semantic correctness. A particular
problem is that str is semantically ambiguous in 2.x, as it is used both
for text encoded as bytes and binary data.
To resolve the ambiguity for conversions to 3.x, 2.6 introduced 'bytes'
as a synonym for 'str'. The intention is that one use 'bytes' to create
or refer to 2.x bytes that should remain bytes in 3.x and use 'str' to
create or refer to 2.x text bytes that should become or will be unicode
in 3.x. 3.x and hence 2to3 *assume* that one is using 'bytes' and 'str'
this way, so that 'unicode' becomes an unneeded synonym for 'str' and
2to3 changes 'unicode' to 'str'. If one does not use 'str' and 'bytes'
as intended, 2to3 may produce semantically different code.
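To make the distinction concrete, here is a minimal 3.x sketch (the variable names and sample values are mine, chosen only for illustration). It shows the two kinds of data that 2.x 'str' conflated: encoded text, which can be decoded to 'str', and binary data, which should stay 'bytes'.

```python
# Python 3 only: 'bytes' and 'str' are distinct types here.
encoded_text = "caf\u00e9".encode("utf-8")      # text encoded as bytes
binary_data = bytes([0x89, 0x50, 0x4E, 0x47])   # start of a PNG signature, not text

# Both are 'bytes'; neither is 'str'.
both_are_bytes = isinstance(encoded_text, bytes) and isinstance(binary_data, bytes)
neither_is_str = not isinstance(encoded_text, str) and not isinstance(binary_data, str)

# Only the encoded text is meant to be decoded.
decoded = encoded_text.decode("utf-8")
```

In 2.6 and 2.7 both values would have type 'str', which is why the choice between spelling it 'str' or 'bytes' has to carry the intent.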
2.3 introduced the abstract superclass 'basestring', which can be viewed as
Union(unicode, str). "isinstance(value, basestring)" is defined as
"isinstance(value, (unicode, str))". I believe the intended meaning was
'text, whether unicode or encoded bytes'. Certainly, any code following
    if isinstance(value, basestring):
would likely only make sense if that were true.
In any case, after 2.6, one should only use 'basestring' when the 'str'
part has its restricted meaning of 'unicode in 3.x'. "(unicode, bytes)"
is semantically different from "basestring" and "(unicode, str)" when
used in isinstance. 2to3 converts them to "(str, bytes)", 'str', and
'(str, str)' respectively (the last is the same as 'str' when used in
isinstance). If one uses 'basestring' when one means '(unicode, bytes)',
2to3 may produce semantically different code.
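The difference between the three converted forms can be checked directly in 3.x. This is a small sketch with sample values of my own choosing, not output from 2to3 itself.

```python
# Python 3: compare the three forms that 2to3 emits.
text = "abc"
data = b"abc"

# '(str, str)' behaves exactly like 'str' in isinstance.
same_for_text = isinstance(text, (str, str)) == isinstance(text, str)
same_for_data = isinstance(data, (str, str)) == isinstance(data, str)

# '(str, bytes)' accepts bytes; 'str' alone does not.
tuple_accepts_bytes = isinstance(data, (str, bytes))
str_accepts_bytes = isinstance(data, str)
```

So a 2.x check written with 'basestring' stops matching bytes once converted, while one written with '(unicode, bytes)' keeps matching them.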
Example based on https://bugs.python.org/issue38003:
    if isinstance(value, basestring):
        if not isinstance(value, unicode):
            value = value.decode(encoding)
        process_text(value)
    else:
        process_nontext(value)
2to3 produces
    if isinstance(value, str):
        if not isinstance(value, str):
            value = value.decode(encoding)
        process_text(value)
    else:
        process_nontext(value)
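The behavior of this converted code can be observed by running it under 3.x with one str and one bytes value. In the sketch below, the function name 'classify' and the returned labels are my own stand-ins for process_text and process_nontext, and 'utf-8' is an assumed encoding.

```python
def classify(value, encoding="utf-8"):
    """Same control flow as the converted code, returning a label
    instead of calling process_text / process_nontext."""
    if isinstance(value, str):
        if not isinstance(value, str):  # can never be true: dead branch
            value = value.decode(encoding)
        return ("text", value)
    else:
        return ("nontext", value)

print(classify("abc"))   # ('text', 'abc')
print(classify(b"abc"))  # ('nontext', b'abc')
```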
If, in 3.x, value is always unicode, then the inner conditional is dead
and can be removed. But if, in 3.x, value might be byte-encoded text,
it will not be decoded and the code is wrong. Fixes:
1. Instead of decoding value after the check, do it before the check. I
think this is best for new code.
    if isinstance(value, bytes):
        value = value.decode(encoding)
    ...
    if isinstance(value, unicode):
        process_text(value)
    else:
        process_nontext(value)
2. Replace 'basestring' with '(unicode, bytes)'. This is easier with
existing code.
    if isinstance(value, (unicode, bytes)):
        if not isinstance(value, unicode):
            value = value.decode(encoding)
        process_text(value)
    else:
        process_nontext(value)
I believe, but have not tested, that in both cases 2to3's replacement of
'unicode' with 'str' results in correct 3.x code.
3. Edit Lib/lib2to3/fixes/fix_basestring.py to replace 'basestring' with
'(str, bytes)' instead of 'str'. This should be straightforward if one
understands the parse tree format that lib2to3 fixers work on.
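As a check on fixes 1 and 2, here is what they look like in 3.x with 'unicode' replaced by 'str'. This is a hand-converted sketch, not actual 2to3 output; the function names and returned labels are hypothetical stand-ins for the process_* calls, and 'utf-8' is an assumed encoding. Like the original example, it assumes that any bytes value is encoded text.

```python
def fix1(value, encoding="utf-8"):
    # Fix 1: decode before the type check.
    if isinstance(value, bytes):
        value = value.decode(encoding)
    if isinstance(value, str):
        return ("text", value)
    else:
        return ("nontext", value)

def fix2(value, encoding="utf-8"):
    # Fix 2: '(unicode, bytes)' in 2.x becomes '(str, bytes)' in 3.x.
    if isinstance(value, (str, bytes)):
        if not isinstance(value, str):
            value = value.decode(encoding)
        return ("text", value)
    else:
        return ("nontext", value)

for sample in ("abc", b"abc", 42):
    print(fix1(sample), fix2(sample))
```

Both versions decode the bytes value and treat it as text, which is what the 2.x code did.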
Note that 2to3 is not meant for 2&3 code using exception tricks and
six/future imports. Turning 2&3 code into idiomatic 3-only code is a
separate subject.
--
Terry Jan Reedy