[Tutor] %s %r with custom type

Sat Mar 13 03:50:55 CET 2010

On Fri, 12 Mar 2010 10:29:17 pm spir wrote:
> Hello again,
>
> A different issue. On the custom Unicode type discussed in another
> thread, I have overloaded __str__ and __repr__ to get encoded byte
> strings (here with debug prints & special formats to distinguish from
> builtin forms):
[...]
> Note that Unicode.__str__ is called neither by "print us", nor by
> %s. What happens? Why does the issue only occur when using both
> format %s & %r?

The print statement understands how to directly print strings 
(byte-strings and unicode-strings) and doesn't call your __str__ 
method.

http://docs.python.org/reference/simple_stmts.html#the-print-statement

You can demonstrate that with a much simpler example:

>>> class K(unicode):
...     def __str__(self): return "xyz"
...     def __repr__(self): return "XYZ"
...
>>> k = K("some text")
>>> str(k)
'xyz'
>>> repr(k)
'XYZ'
>>> print k
some text

print only calls __str__ if the object isn't already a string.

As for string interpolation, I have reported this as a bug:

http://bugs.python.org/issue8128

I have some additional comments on your class below:

> class Unicode(unicode):
>     ENCODING = "utf8"
>     def __new__(self, string='', encoding=None):

This is broken according to the Liskov substitution principle.

http://en.wikipedia.org/wiki/Liskov_substitution_principle

The short summary: subclasses should only ever *add* functionality; they
should never take it away.

The unicode type has a function signature that accepts an encoding and 
an errors argument, but you've missed errors. That means that code that 
works with built-in unicode objects will break if your class is used 
instead. If that's intentional, you need to clearly document that your 
class is *not* entirely compatible with the built-in unicode, and 
preferably explain why you have done so.

If it's accidental, you should fix it. A good start is the __new__ 
method I posted earlier.
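
For illustration only, here is a minimal sketch of a constructor that keeps
both of unicode's optional arguments. It is not the __new__ from the other
thread. The names PY2, text_type and bytes_type are helpers I have made up so
that the same code runs on Python 2 and Python 3, and the default
errors="strict" is assumed to mirror the built-in:

```python
import sys

# Hypothetical helpers (not part of the original class): pick the text and
# byte-string types for whichever Python version is running.
PY2 = sys.version_info[0] == 2
text_type = unicode if PY2 else str
bytes_type = str if PY2 else bytes


class Unicode(text_type):
    ENCODING = "utf8"

    def __new__(cls, string=u"", encoding=None, errors="strict"):
        # Only byte-strings need decoding; text is passed through unchanged.
        if isinstance(string, bytes_type):
            if encoding is None:
                encoding = cls.ENCODING
            string = string.decode(encoding, errors)
        # Use cls rather than a hard-coded class so subclasses work too.
        return text_type.__new__(cls, string)


us = Unicode(b"\xc3\xa9\xc3\xa2\xc3\x84")             # UTF-8 bytes for éâÄ
lenient = Unicode(b"\xff", "utf8", errors="replace")  # bad byte is replaced
```

One difference remains: the built-in raises TypeError if you pass an encoding
together with text that is already unicode, while this sketch silently
ignores the extra arguments in that case.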

>         if isinstance(string,str):
>             encoding = Unicode.ENCODING if encoding is None else encoding
>             string = string.decode(encoding)
>         return unicode.__new__(Unicode, string)
>     def __repr__(self):
>         print '+',
>         return '"%s"' %(self.__str__())

This may be a problem. Why are you making your unicode class pretend to 
be a byte-string? 

Ideally, the output of repr(obj) should follow this rule:

eval(repr(obj)) == obj

For instance, for built-in unicode strings:

>>> u"éâÄ" == eval(repr(u"éâÄ"))
True

but for your subclass, us != eval(repr(us)). So again, code that works 
perfectly with built-in unicode objects will fail with your subclass.

Ideally, repr of your class should return a string like:

"Unicode('...')"

but if that's too verbose, it is acceptable to just inherit the __repr__ 
of unicode and return something like "u'...'". Anything else should be 
considered non-standard behaviour and is STRONGLY discouraged.
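
As a rough sketch of the first option, assuming a bare subclass with no
custom constructor (text_type is a helper name I have invented so the
example runs on both Python 2 and Python 3):

```python
import sys

# Hypothetical helper: the text type for the running Python version.
text_type = unicode if sys.version_info[0] == 2 else str


class Unicode(text_type):
    def __repr__(self):
        # Reuse the base class repr for quoting and escaping, then wrap it
        # in the class name so the result reads as a constructor call.
        return "%s(%s)" % (type(self).__name__, text_type.__repr__(self))


us = Unicode(u"\xe9\xe2\xc4")   # the text éâÄ, spelled with escapes
round_trip = eval(repr(us))     # rebuilds an equal Unicode instance
```

Because the repr names the class, eval gives back an instance of the
subclass rather than a plain unicode string.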

>     def __str__(self):
>         print '*',
>         return '`'+ self.encode(Unicode.ENCODING) + '`'

What's the purpose of the print statements in the __str__ and __repr__ 
methods?

Again, unless you have a good reason to do otherwise, you are best off
just inheriting __str__ from unicode. Anything else is strongly
discouraged.

> An issue happens in particular cases, when using both %s and %r:
>
> s = "éâÄ"

This may be a problem. "éâÄ" is not a valid str, because it contains 
non-ASCII characters. The result that you get may depend on your 
external environment. For instance, if I run it in my terminal, with 
encoding set to UTF-8, I get this:

>>> s = "éâÄ"
>>> print s
éâÄ
>>> len(s)
6
>>> list(s)
['\xc3', '\xa9', '\xc3', '\xa2', '\xc3', '\x84']

but if I set it to ISO 8859-1, I get this:

>>> list("éâÄ")
['\xe9', '\xe2', '\xc4']

As far as I know, the behaviour of stuffing unicode characters into 
byte-strings is not well-defined in Python, and will depend on external 
factors like the terminal you are running in, if any. It may or may not 
work as you expect. It is better to do this:

u = u"éâÄ"
s = u.encode('utf-8')

which will always work consistently so long as you declare a source 
encoding at the top of your module:

# -*- coding: UTF-8 -*-
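
To make the difference between the two encodings concrete, this small sketch
encodes the same three characters both ways. I have written the text with
escapes so the result does not depend on the source encoding, and the
variable names are just for illustration:

```python
u = u"\xe9\xe2\xc4"                       # the text éâÄ: three characters

utf8_bytes = u.encode("utf-8")            # two bytes per character here
latin1_bytes = u.encode("iso-8859-1")     # one byte per character

# bytearray gives the byte values as integers on Python 2 and 3 alike.
utf8_values = list(bytearray(utf8_bytes))
latin1_values = list(bytearray(latin1_bytes))
```

Decoding either byte-string with the matching codec gives back the same
three characters.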

-- 
Steven D'Aprano