[Tutor] string replacement in Python 2 and 3
Albert-Jan Roskam
fomcl at yahoo.com
Wed Nov 27 11:24:45 CET 2013
--------------------------------------------
On Wed, 11/27/13, Steven D'Aprano <steve at pearwood.info> wrote:
Subject: Re: [Tutor] string replacement in Python 2 and 3
To: tutor at python.org
Date: Wednesday, November 27, 2013, 12:36 AM
On Tue, Nov 26, 2013 at 11:42:29AM -0800, Albert-Jan Roskam wrote:
> Hi,
>
> String replacement works quite differently with bytes objects in
> Python 3 than with string objects in Python 2. What is the best way to
> make example #1 below run in Python 2 and 3?
If you are working with text strings, always use the text string type.
In Python 3, that is called "str". In Python 2, that is called
"unicode". To make it easier, I do this at the top of the module:
    try:
        unicode
    except NameError:
        # Python 3.
        pass
    else:
        # Python 2.
        str = unicode
then always use str. Or, if you prefer:
    try:
        unicode
    except NameError:
        # Python 3.
        unicode = str
and always use unicode.
As an alternative, if you need to support Python 2.7 and 3.3 only, you
can use u'' string literals:
    s = u"Hello World!"
Sadly, Python 3.1 and 3.2 (don't use 3.0, it's broken) don't support the
u string prefix. If you have to support them:
    import sys

    if sys.version < '3':
        def u(astr):
            return unicode(astr)
    else:
        def u(astr):
            return astr
and then call:
    s = u("Hello World!")
*but* be aware that this only works with ASCII string literals. We can
make u() be smarter and handle more cases:
    if sys.version < '3':
        def u(obj, encoding='utf-8', errors='strict'):
            if isinstance(obj, str):
                return obj.decode(encoding, errors)
            elif isinstance(obj, unicode):
                return obj
            else:
                return unicode(obj)
    else:
        def u(obj, encoding='utf-8', errors='strict'):
            if isinstance(obj, str):
                return obj
            elif isinstance(obj, bytes):
                return obj.decode(encoding, errors)
            else:
                return str(obj)
then use the u() function on any string, text or bytes, or any other
object, as needed.
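A short usage sketch of that idea follows. It is a condensed, single-definition variant rather than the exact code from the message, and it assumes Python 2.6+ or 3.x, where the name bytes exists. The name text_type is introduced here purely for illustration.

```python
import sys

# text_type is a hypothetical alias for "the text string type" on
# whichever Python version is running.
if sys.version_info[0] < 3:
    text_type = unicode  # Python 2 only; never evaluated on Python 3
else:
    text_type = str


def u(obj, encoding='utf-8', errors='strict'):
    """Return obj as text, decoding byte strings with the given encoding."""
    if isinstance(obj, text_type):
        return obj
    if isinstance(obj, bytes):  # bytes is an alias of str on Python 2.6+
        return obj.decode(encoding, errors)
    return text_type(obj)


greeting = u(b"caf\xc3\xa9")  # UTF-8 bytes are decoded to text
already = u(greeting)         # text passes through unchanged
number = u(42)                # other objects are converted to text
```

Because non-ASCII byte strings are decoded explicitly, this avoids the ASCII-only limitation of the simpler u() shown earlier.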
But the important thing here is:
* convert bytes to text as early as possible;
* then do all your work using text;
* and only convert back to bytes if you really need to,
and as late as possible.
If you find yourself converting backwards and forwards between bytes and
text multiple times for each piece of data, you're doing it wrong. Look
at file input in Python 3: when you open a file for reading in text
mode, it returns a text string, even though the underlying file on disk
is bytes. It decodes those bytes once, as early as it can (when
reading), and then for the rest of your program you treat it as text.
Then when you write it back out to a file, it encodes it to bytes only
when doing the write(). That's the strategy you should aim to copy.
Ideally, no string should be encoded or decoded more than once each in
its entire lifespan.
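The strategy above can be sketched roughly as follows. This is only an illustration, assuming Python 2.7 or 3.3+ (it uses u'' literals and io.open, which behaves like Python 3's open on both versions). The sample data and the temporary file are made up for the example.

```python
import io
import os
import tempfile

# Hypothetical sample data: raw UTF-8 bytes, as they would sit on disk.
raw = u"na\u00efve caf\u00e9".encode("utf-8")

path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with io.open(path, "wb") as f:
    f.write(raw)

# Decode once, at the input boundary: text mode hands back text.
with io.open(path, "r", encoding="utf-8") as f:
    text = f.read()

# All the real work happens on text, never on bytes.
text = text.replace(u"na\u00efve", u"simple").upper()

# Encode once, at the output boundary, as part of the write.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Read the file back in binary mode to see the bytes that were written.
with io.open(path, "rb") as f:
    result = f.read()
```

Each piece of data is decoded exactly once on the way in and encoded exactly once on the way out.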
===> Hi Steven,
Thanks for your advice (esp. the bullets are placemat-worthy ;-). I will have a critical look at my code to see where I can improve it. I am reading binary data, so I start with str (Python 2) or bytes (Python 3). Before I adapted my code for use with both Python versions, I simply returned string (in the Python 2 sense of the word) data, so in Python 3 it seemed consistent to return bytes.
In one case, I add padding to values to get rid of null bytes. That's a small operation that's done very, very often. The first example below is MUCH faster than ljust, but it does not work in Python 3. Maybe Donald Knuth should slap me because I am optimizing prematurely.
>>> value = b"blah"
>>> b"%-50s" % value  # fast, but not python 3 proof
'blah                                              '
>>> value.ljust(50)  # okay, this will have to do for python 3, then
'blah                                              '
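For anyone who wants to measure this rather than guess, here is a rough timing sketch. The helper names (pad_ljust, pad_concat, pad_percent) are made up for illustration and are not part of my actual code. One caveat: %-formatting on bytes raises TypeError on Python 3.0 through 3.4 and was only added back in Python 3.5 (PEP 461), so the sketch guards that case.

```python
import timeit

WIDTH = 50
value = b"blah"


def pad_ljust(v):
    # Works on byte strings in both Python 2 and Python 3.
    return v.ljust(WIDTH)


def pad_concat(v):
    # Append only the missing spaces; values already long enough are
    # returned unchanged because b" " * negative is empty.
    return v + b" " * (WIDTH - len(v))


def pad_percent(v):
    # Works on Python 2 and on 3.5+; raises TypeError on 3.0 to 3.4.
    return b"%-50s" % v


candidates = [pad_ljust, pad_concat]
try:
    pad_percent(value)
    candidates.append(pad_percent)
except TypeError:
    pass  # bytes formatting is not available on this interpreter

for func in candidates:
    seconds = timeit.timeit(lambda: func(value), number=100000)
    print("%-12s %.4f s" % (func.__name__, seconds))
```

The timings depend on the interpreter and machine, so no numbers are claimed here.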
regards,
Albert-Jan