Totally confused by the str/bytes/unicode differences introduced in Python 3.x

Sun Jan 18 06:56:39 EST 2009

On Jan 18, 2:02 pm, Terry Reedy <tjre... at udel.edu> wrote:
> John Machin wrote:
> > On Jan 18, 9:10 am, Terry Reedy <tjre... at udel.edu> wrote:
> >> Martin v. Löwis wrote:
> >>>>> Does he intend to maintain two separate codebases, one 2.x and the
> >>>>> other 3.x?
> >>>> I think I have no other choice.
> >>>> Why? Is it theoretically possible to maintain a single code base for
> >>>> both 2.x and 3.x?
> >>> That is certainly possible! One might have to make tradeoffs wrt.
> >>> readability sometimes, but I found that this approach works quite
> >>> well for Django. I think Mark Hammond is also working on maintaining
> >>> a single code base for both 2.x and 3.x, for PythonWin.
> >> Where 'single codebase' means that the code runs as is in 2.x and as
> >> autoconverted by 2to3 (or possibly a custom converter) in 3.x.
>
> >> One barrier to doing this is when the 2.x code has a mix of string
> >> literals with some being character strings that should not have 'b'
> >> prepended and some being true byte strings that should have 'b'
> >> prepended.  (Many programs do not have such a mix.)
>
> >> One approach to dealing with string constants I have not yet seen
> >> discussed here is to put them all in separate file(s) to be imported.
> >> Group the text and bytes separately.  Then marking the bytes with a 'b',
> >> either by hand or by program, would be easy.
>
> > (1) How would this work for somebody who wanted/needed to support 2.5
> > and earlier?
>
> See reposts in python wiki, one by Martin.

Most relevant of these is Martin's article on porting Django, using a
single codebase. The """goal is to support all versions that Django
supports, plus 3.0""" -- indicating that it supports at least 2.5,
which won't eat b"blah" syntax. He is using 2to3, and handles bytes
constants by """django.utils.py3.b, which is a function that converts
its argument to an ASCII-encoded byte string. In 2.x, it is another
alias for str; in 3.x, it leaves byte strings alone, and encodes
regular (unicode) strings as ASCII. This function is used in all
places where string literals are meant as bytes, plus all cases where
str() was used to invoke the default conversion of 2.x."""
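
A minimal sketch of a helper that behaves as that description says. This
is a reconstruction from the quoted text only, not the actual
django.utils.py3 code; the name ZIPFILE_SIG is borrowed from the plans
further down.

```python
import sys

if sys.version_info[0] >= 3:
    def b(s):
        # 3.x: leave byte strings alone, encode text strings as ASCII.
        if isinstance(s, bytes):
            return s
        return s.encode("ascii")
else:
    # 2.x: str is already the byte-string type, so b is just an alias.
    b = str

# The call parses on every version, unlike a b"..." literal,
# which 2.5 and earlier reject as a syntax error.
ZIPFILE_SIG = b("PK\x03\x04")
```

Because b(...) is an ordinary function call, the file still compiles on
2.5, which is the point of using it instead of the b"..." prefix.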

Very similar to what I expected. However it doesn't answer my question
about how your "move byte strings to a separate file, prepend 'b', and
import the separate file" strategy would help ... and given that 2.5
and earlier will barf on b"arf", I don't expect it to.

> > (2) Assuming supporting only 2.6 and 3.x:
>
> > Suppose you have this line:
> > if binary_data[:4] == "PK\x03\x04": # signature of ZIP file
>
> > Plan A:
> > Change original to:
> > if binary_data[:4] == ZIPFILE_SIG: # "PK\x03\x04"
> > Add this to the bytes section of the separate file:
> > ZIPFILE_SIG = "PK\x03\x04"
> > [somewhat later]
> > Change the above to:
> > ZIPFILE_SIG = b"PK\x03\x04"
> > [once per original file]
> > Add near the top:
> > from separatefile import *
>
> > Plan B:
> > Change original to:
> > if binary_data[:4] == ZIPFILE_SIG: # "PK\x03\x04"
> > Add this to the separate file:
> > ZIPFILE_SIG = b"PK\x03\x04"
> > [once per original file]
> > Add near the top:
> > from separatefile import *
>
> > Plan C:
> > Change original to:
> > if binary_data[:4] == b"PK\3\4": # signature of ZIP file
>
> > Unless I'm gravely mistaken, you seem to be suggesting Plan A or some
> > variety thereof -- what advantages do you see in this over Plan C?
>
> For 2.6 only (which is much easier than 2.x), do C.  Plan A is for 2.x
> where C does not work.

Excuse me? I'm with the OP now, I'm totally confused. Plan C is *not*
what you were proposing; you were proposing something like Plan A
which definitely involved a separate file.

Why won't Plan C work on 2.x (x <= 5)? Because the 2.X will b"arf".
But you say Plan A is for 2.x -- but Plan A involves importing the
separate file which contains and causes b"arf" also!

To my way of thinking, one obvious DISadvantage of a strategy that
actually moves the strings to another file (requiring invention of a
name for each string that doesn't have one already, so that it can be
imported) is the amount of effort and exposure to error required to get
the same functional result as a strategy that keeps the string in the
same file ... and this disadvantage applies irrespective of what one
does to the string: b"arf", Martin's b("arf"), somebody else's
_b("arf") [IIRC] or my you-aint-gonna-miss-noticing-this-in-the-code
BYTES_LITERAL("arf").
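
For illustration, a sketch of the keep-it-in-the-same-file approach
applied to the ZIP signature test. The body of BYTES_LITERAL and the
is_zip function are hypothetical; in particular, the choice of latin-1
is an assumption made here so that escapes above \x7f also work.

```python
import sys

if sys.version_info[0] >= 3:
    def BYTES_LITERAL(s):
        # Assumption: latin-1 maps code points 0-255 to the same byte
        # values, so "\xff" becomes the single byte 0xff.
        return s.encode("latin-1")
else:
    def BYTES_LITERAL(s):
        # 2.x: a plain "..." literal is already a byte string.
        return s

def is_zip(binary_data):
    # The literal stays beside the code that uses it:
    # no name to invent, no separate file to import.
    return binary_data[:4] == BYTES_LITERAL("PK\x03\x04")
```

The same source parses on 2.5, 2.6 and 3.x, and the original string is
still visible at the point of use.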

Cheers,
John