[Tutor] string replacement in Python 2 and 3

Steven D'Aprano steve at pearwood.info
Wed Nov 27 00:36:24 CET 2013


On Tue, Nov 26, 2013 at 11:42:29AM -0800, Albert-Jan Roskam wrote:
> Hi,
> 
> String replacement works quite differently with bytes objects in 
> Python 3 than with string objects in Python 2. What is the best way to 
> make example #1 below run in Python 2 and 3? 

If you are working with text strings, always use the text string type. 
In Python 3, that is called "str". In Python 2, that is called 
"unicode". To make it easier, I do this at the top of the module:

try:
    unicode
except NameError:
    # Python 3.
    pass
else:
    # Python 2.
    str = unicode

then always use str. Or, if you prefer:

try:
    unicode
except NameError:
    # Python 3.
    unicode = str


and always use unicode.

As an alternative, if you need to support Python 2.7 and 3.3 or later 
only, you can use u'' string literals:

s = u"Hello World!"

Sadly, Python 3.1 and 3.2 (don't use 3.0, it's broken) don't support the 
u string prefix. If you have to support them:

import sys

if sys.version_info[0] < 3:
    def u(astr):
        return unicode(astr)
else:
    def u(astr):
        return astr


and then call:

s = u("Hello World!")

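*but* be aware that this only works with ASCII string literals: in 
Python 2, unicode(astr) decodes using the default ASCII codec, so any 
non-ASCII byte raises UnicodeDecodeError. We can make u() be smarter and 
handle more cases: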

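import sys

if sys.version_info[0] < 3:
    # Python 2: str is the byte string type, unicode is the text type.
    # (This assumes str has not been rebound to unicode as shown earlier.)
    def u(obj, encoding='utf-8', errors='strict'):
        if isinstance(obj, str):
            return obj.decode(encoding, errors)
        elif isinstance(obj, unicode):
            return obj
        else:
            return unicode(obj)
else:
    # Python 3: bytes is the byte string type, str is the text type.
    def u(obj, encoding='utf-8', errors='strict'):
        if isinstance(obj, str):
            return obj
        elif isinstance(obj, bytes):
            return obj.decode(encoding, errors)
        else:
            return str(obj)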

then use the u() function on any string, text or bytes, or any other 
object, as needed. 
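
For comparison, here is a rough sketch of the same idea written as a 
single definition instead of two branches. It assumes Python 2.6 or 
later, where bytes exists as an alias for the byte string type. The 
names text_type and to_text are made up for this example; they are not 
from any library.

```python
try:
    text_type = unicode   # Python 2: the text type is called unicode
except NameError:
    text_type = str       # Python 3: the text type is called str


def to_text(obj, encoding='utf-8', errors='strict'):
    """Return obj as a text string, decoding byte strings if needed."""
    if isinstance(obj, text_type):
        return obj                            # already text, leave it alone
    if isinstance(obj, bytes):
        return obj.decode(encoding, errors)   # byte string: decode it
    return text_type(obj)                     # anything else: convert it


from_bytes = to_text(b'caf\xc3\xa9')   # UTF-8 bytes become text
from_text = to_text(u'abc')            # text passes through unchanged
from_int = to_text(42)                 # other objects are converted
```

Because bytes is the same type as str in Python 2, the isinstance 
checks mean the same thing on both versions.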

But the important thing here is:

* convert bytes to text as early as possible;

* then do all your work using text;

* and only convert back to bytes if you really need to, 
  and as late as possible.
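
Applied to the original question about string replacement, those three 
steps might look something like the sketch below. The function name 
replace_text and the choice of UTF-8 are assumptions for illustration; 
use whatever encoding your data is actually in.

```python
def replace_text(raw, old, new, encoding='utf-8'):
    """Replace old with new inside a byte string, working on text."""
    text = raw.decode(encoding)      # bytes -> text, once, on the way in
    text = text.replace(old, new)    # all the real work happens on text
    return text.encode(encoding)     # text -> bytes, once, on the way out


raw = u'caf\u00e9 au lait'.encode('utf-8')
result = replace_text(raw, u'caf\u00e9', u'tea')
```

The same code should run unchanged on Python 2.7 and 3.3 or later, 
since the search and replacement values are text on both.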


If you find yourself converting backwards and forwards between bytes and 
text multiple times for each piece of data, you're doing it wrong. Look 
at file input in Python 3: when you open a file for reading in text 
mode, reading from it gives you text strings, even though the underlying 
file on disk is bytes. Python decodes those bytes once, as early as it 
can (when reading), and for the rest of your program you treat the data 
as text. When you write it back out to a file, it is encoded to bytes 
only when doing the write(). That's the strategy you should aim to copy.
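
A small sketch of that file strategy which should behave the same on 
Python 2.6+ and Python 3 is to use io.open with an explicit encoding 
(in Python 3 the built-in open is the same function). The temporary 
file and its contents here are just made up for the example.

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'example.txt')

# Encoding to bytes happens inside write(), as late as possible.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'na\u00efve caf\u00e9')

# Decoding happens inside read(), as early as possible.
with io.open(path, 'r', encoding='utf-8') as f:
    text = f.read()

# Everything in between works on text only.
text = text.replace(u'caf\u00e9', u'coffee')

with io.open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Reading in binary mode shows the bytes that ended up on disk.
with io.open(path, 'rb') as f:
    raw = f.read()
```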

Ideally, no string should be encoded or decoded more than once each in 
its entire lifespan.



-- 
Steven

