[Tutor] %s %r with custom type
Steven D'Aprano
steve at pearwood.info
Sat Mar 13 03:50:55 CET 2010
On Fri, 12 Mar 2010 10:29:17 pm spir wrote:
> Hello again,
>
> A different issue. On the custom Unicode type discussed in another
> thread, I have overloaded __str__ and __repr__ to get encoded byte
> strings (here with debug prints & special formats to distinguish from
> builtin forms):
[...]
> Note that Unicode.__str__ is called neither by "print us", nor by
> %s. What happens? Why does the issue only occur when using both
> format %s & %r?
The print statement understands how to directly print strings
(byte-strings and unicode-strings) and doesn't call your __str__
method.
http://docs.python.org/reference/simple_stmts.html#the-print-statement
You can demonstrate that with a much simpler example:
>>> class K(unicode):
...     def __str__(self): return "xyz"
...     def __repr__(self): return "XYZ"
...
>>> k = K("some text")
>>> str(k)
'xyz'
>>> repr(k)
'XYZ'
>>> print k
some text
print only calls __str__ if the object isn't already a string.
As for string interpolation, I have reported this as a bug:
http://bugs.python.org/issue8128
I have some additional comments on your class below:
> class Unicode(unicode):
>     ENCODING = "utf8"
>     def __new__(self, string='', encoding=None):
This is broken according to the Liskov substitution principle.
http://en.wikipedia.org/wiki/Liskov_substitution_principle
The short summary: subclasses should only ever *add* functionality;
they should never take it away.
The unicode type has a function signature that accepts an encoding and
an errors argument, but you've missed errors. That means that code that
works with built-in unicode objects will break if your class is used
instead. If that's intentional, you need to clearly document that your
class is *not* entirely compatible with the built-in unicode, and
preferably explain why you have done so.
If it's accidental, you should fix it. A good start is the __new__
method I posted earlier.
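To make the point about the signature concrete, here is a minimal
sketch of a constructor that keeps both encoding and errors. It is not
the __new__ from the earlier thread. It assumes Python 3, where str
plays the role of Python 2's unicode and bytes the role of Python 2's
str, and the class name Text is made up for the example:

```python
# Assumption: Python 3, so `str` stands in for Python 2's `unicode`
# and `bytes` stands in for Python 2's `str`.
class Text(str):
    """Hypothetical text subclass that keeps the parent's signature."""
    ENCODING = "utf-8"

    def __new__(cls, obj="", encoding=None, errors="strict"):
        if isinstance(obj, (bytes, bytearray)):
            # Decode byte input, falling back to the class-wide default
            # encoding, and pass the caller's error handler through.
            obj = obj.decode(encoding or cls.ENCODING, errors)
        return super().__new__(cls, obj)


# Callers that pass all three arguments keep working.
t = Text(b"caf\xc3\xa9", "utf-8", "strict")
lenient = Text(b"caf\xe9", "ascii", "replace")
```

Because errors is accepted and forwarded, code written against the
built-in constructor does not break when handed the subclass.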
>         if isinstance(string,str):
>             encoding = Unicode.ENCODING if encoding is None else encoding
>             string = string.decode(encoding)
>         return unicode.__new__(Unicode, string)
>     def __repr__(self):
>         print '+',
>         return '"%s"' %(self.__str__())
This may be a problem. Why are you making your unicode class pretend to
be a byte-string?
Ideally, the output of repr(obj) should follow this rule:
eval(repr(obj)) == obj
For instance, for built-in unicode strings:
>>> u"éâÄ" == eval(repr(u"éâÄ"))
True
but for your subclass, us != eval(repr(us)). So again, code that works
perfectly with built-in unicode objects will fail with your subclass.
Ideally, repr of your class should return a string like:
"Unicode('...')"
but if that's too verbose, it is acceptable to just inherit the __repr__
of unicode and return something like "u'...'". Anything else should be
considered non-standard behaviour and is STRONGLY discouraged.
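As a rough illustration of the round-trip rule, a __repr__ along these
lines would satisfy it. Again this is only a sketch under the
assumption of Python 3 (str instead of unicode), and ReprText is a
hypothetical name:

```python
# Assumption: Python 3, so `str` stands in for Python 2's `unicode`.
class ReprText(str):
    """Hypothetical subclass whose repr can be fed back to eval()."""

    def __repr__(self):
        # Reuse the parent's repr for the payload so that quoting and
        # escape sequences are handled correctly.
        return "%s(%s)" % (type(self).__name__, str.__repr__(self))


us = ReprText("éâÄ")
round_tripped = eval(repr(us))
```

Delegating to the parent's __repr__ for the quoted part avoids
hand-rolled quoting, which is easy to get wrong for strings that
contain quotes or backslashes.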
>     def __str__(self):
>         print '*',
>         return '`'+ self.encode(Unicode.ENCODING) + '`'
What's the purpose of the print statements in the __str__ and __repr__
methods?
Again, unless you have a good reason to do otherwise, you are best off
just inheriting __str__ from unicode. Anything else is strongly
discouraged.
> An issue happens in particular cases, when using both %s and %r:
>
> s = "éâÄ"
This may be a problem. "éâÄ" is not a valid str, because it contains
non-ASCII characters. The result that you get may depend on your
external environment. For instance, if I run it in my terminal, with
encoding set to UTF-8, I get this:
>>> s = "éâÄ"
>>> print s
éâÄ
>>> len(s)
6
>>> list(s)
['\xc3', '\xa9', '\xc3', '\xa2', '\xc3', '\x84']
but if I set it to ISO 8859-1, I get this:
>>> list("éâÄ")
['\xe9', '\xe2', '\xc4']
As far as I know, the behaviour of stuffing unicode characters into
byte-strings is not well-defined in Python, and will depend on external
factors like the terminal you are running in, if any. It may or may not
work as you expect. It is better to do this:
u = u"éâÄ"
s = u.encode('utf-8')
which will always work consistently so long as you declare a source
encoding at the top of your module:
# -*- coding: UTF-8 -*-
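A small self-contained sketch of that approach follows. It assumes the
file itself is saved as UTF-8, and it is written so that it should run
unchanged on Python 2 or Python 3.3 and later:

```python
# -*- coding: utf-8 -*-
# Assumption: this file is saved as UTF-8, matching the declaration above.

u = u"éâÄ"                            # three characters of text
utf8_bytes = u.encode("utf-8")        # two bytes per character here
latin1_bytes = u.encode("iso-8859-1") # one byte per character

# The byte values no longer depend on the terminal settings:
# utf8_bytes holds c3 a9 c3 a2 c3 84 and latin1_bytes holds e9 e2 c4,
# the same values as in the two list(s) outputs above.
```

The encoding is chosen explicitly in the code, so the same bytes come
out whatever terminal the script happens to run in.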
--
Steven D'Aprano