[Tutor] unicode: % & __str__ & str()

spir denis.spir at free.fr
Fri Oct 30 15:23:17 CET 2009


[back to the list after a rather long break]

Hello,

I ran into a unicode issue ;-) (one more)
Here is an illustration:

===============================
# -*- coding: utf-8 -*-
# (the declaration above is needed for the non-ASCII literals below)
class U(unicode):
	def __str__(self):
		return self

# if you can't properly see the string below,
# 128<ordinals<255
c0 = "¶ÿµ"
c1 = U("¶ÿµ","utf8")
c2 = unicode("¶ÿµ","utf8")

for c in (c0,c1,c2):
	try:
		print "%s" %c,
	except UnicodeEncodeError:
		print "***",
	try:
		print c.__str__(),
	except UnicodeEncodeError:
		print "***",
	try:
		print str(c)
	except UnicodeEncodeError:
		print "***"

==>

¶ÿµ ¶ÿµ ¶ÿµ
¶ÿµ ¶ÿµ ***
¶ÿµ *** ***
================================

The last line shows that a regular unicode object cannot be passed to str() (more or less ok), nor can its __str__() be called (not ok at all).
Maybe I am overlooking some obvious point (again). If not, then there are in fact 2 issues:

-1- The old ambiguity of str(), which means both "create an instance of type str from the given data" and "build a textual representation of the given object, through __str__", has always been a semantic flaw to me; it becomes a concrete problem once we have text that is not str.
Well, I'm very surprised by this. Actually, how come this point doesn't seem to be well known; how is it even possible to use unicode without running into this problem? I guess this breaks years or even decades of habits for coders used to writing str() when they mean __str__().

-2- How is it possible that __str__ does not work on a unicode object? It seems that the method is simply not implemented on the unicode type, and neither is __repr__, so that it falls back to str().
Strangely enough, % interpolation works, which means that for both types of text a shortcut is used, namely returning the text itself as is. I would have bet my last cents that % simply delegates to __str__, or even that they were in fact the same function, but obviously I was wrong!
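
A minimal sketch of the failure in the last output line, under the assumption (not stated in this post) that in Python 2 str() on a unicode object amounts to encoding it with the default codec, which is ASCII unless changed. The helper name to_bytes is hypothetical; the code uses escapes and explicit encodings so that it behaves the same on Python 2 and 3:

```python
# Same three characters as in the listing above: ¶ÿµ (ordinals 182, 255, 181)
text = u"\xb6\xff\xb5"

def to_bytes(value, encoding="utf-8"):
    """Hypothetical helper: encode text explicitly instead of relying on str()."""
    return value.encode(encoding)

# Encoding with ASCII fails, since every ordinal here is above 127.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding with an explicit codec that covers the characters succeeds.
utf8_bytes = to_bytes(text)
```

If that assumption holds, calling .encode() with an explicit codec sidesteps the problem wherever a byte string is really what is wanted.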

Looking for workarounds, I first tried to override (or rather create) __str__ as in the U type above. But this solution is far from ideal because we still cannot use str() (I mean, my fingers can type it while my head is who-knows-where). Also, it is in fact really unusable, for the following reason:
===================================
print c1.__class__
print c1[1].__class__
c3 = c1 ; print (c1+c3).__class__
==>
<class '__main__.U'>
<type 'unicode'>
<type 'unicode'>
====================================
Any operation returns a plain unicode instead of the original type. So the said type would have to override every possible text operation, which is a lot indeed, just to convert the results back. Not to mention performance issues.
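
To make the cost concrete, here is a rough sketch of what overriding a single operation looks like. It is only an illustration: text_type is a hypothetical alias (unicode on Python 2, str on Python 3) so the snippet runs on both, and only __add__ is wrapped, so every other operation still hands back the base type:

```python
text_type = type(u"")  # unicode on Python 2, str on Python 3 (assumed alias)

class U(text_type):
    def __str__(self):
        return self

    def __add__(self, other):
        # Re-wrap the result so that concatenation keeps the subclass.
        return U(text_type.__add__(self, other))

c1 = U(u"\xb6\xff\xb5")
joined = c1 + c1   # goes through the overridden __add__, stays a U
sliced = c1[1]     # indexing is not overridden, falls back to the base type
upper = c1.upper() # same for ordinary methods
```

The same wrapping would have to be repeated for slicing, formatting, and every string method whose result should keep the subclass.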

So the only solution, it seems to me, is to use % everywhere and hunt down every str, __str__, __repr__ and the like in all code.
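
As a sketch of what "use % everywhere" could look like in practice, the interpolation can be wrapped in one small function so there is a single place to change later. The name text_of is hypothetical, and using a unicode format string (u"%s") is my assumption about the safest variant; it has only been reasoned through for unicode text and plain numbers:

```python
def text_of(obj):
    """Hypothetical helper: textual form of obj via % interpolation, not str()."""
    # The one-element tuple keeps tuples passed as obj from being unpacked.
    return u"%s" % (obj,)

shown = text_of(u"\xb6\xff\xb5")  # unicode text passes through unchanged
number = text_of(42)              # other objects are converted to text
```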

I hope I'm wrong on this. Please, give me a better solution ;-)



------
la vita e estrany



