[Python-bugs-list] [ python-Bugs-676346 ] String formatting operation Unicode problem.
SourceForge.net
noreply@sourceforge.net
Tue, 28 Jan 2003 14:23:11 -0800
Bugs item #676346, was opened at 2003-01-28 21:59
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=676346&group_id=5470
Category: Unicode
Group: Python 2.2.2
Status: Open
Resolution: None
>Priority: 3
Submitted By: David M. Grimes (dmgrime)
Assigned to: M.-A. Lemburg (lemburg)
Summary: String formatting operation Unicode problem.
Initial Comment:
When performing a string formatting operation using %s
with a plain string format and a unicode argument, the
argument lookup is performed more than once. When
formatting operations are nested (see example), the
repeated lookups compound at every level.
It seems Python-2.2.2:Objects/stringobject.c:3394 is
where PyObject_GetItem is used (for dictionary-like
formatting args). Later, at :3509, there is a "goto
unicode" when a string argument is actually unicode.
At this point, everything resets and we do it all over
again in PyUnicode_Format.
There is an underlying assumption that the cost of the
call to PyObject_GetItem is very low (since we're going
to do them all again for unicode). We've got a
Python-based templating system which uses a very simple
Mix-In class to facilitate flexible page generation.
At the core is a simple __getitem__ implementation
which maps calls to getattr():
class mixin:
    def __getitem__(self, name):
        print '%r::__getitem__(%s)' % (self, name)
        hook = getattr(self, name)
        if callable(hook):
            return hook()
        else:
            return hook
Obviously, the print is diagnostic. So, this basic
mechanism allows one to write hierarchical templates
filling in content found in "%(xxxx)s" escapes with
functions returning strings. It has worked extremely
well for us.
BUT, we recently did some XML-based work which
uncovered this strange unicode behaviour. Given the
following classes:
class w1u(mixin):
    v1 = u'v1'

class w2u(mixin):
    def v2(self):
        return '%(v1)s' % w1u()

class w3u(mixin):
    def v3(self):
        return '%(v2)s' % w2u()

class w1(mixin):
    v1 = 'v1'

class w2(mixin):
    def v2(self):
        return '%(v1)s' % w1()

class w3(mixin):
    def v3(self):
        return '%(v2)s' % w2()
And test case:
print 'All string:'
print '%(v3)s' % w3()
print
print 'Unicode injected at w1u:'
print '%(v3)s' % w3u()
print
As we can see, the only difference between the w{1,2,3}
and w{1,2,3}u classes is that w1u defines v1 as unicode
where w1 uses a "normal" string.
The string-based version makes 3 calls, as expected:
All string:
<__main__.w3 instance at 0x8150524>::__getitem__(v3)
<__main__.w2 instance at 0x814effc>::__getitem__(v2)
<__main__.w1 instance at 0x814f024>::__getitem__(v1)
v1
But the unicode version causes a tree-like recursion
(14 calls instead of 3):
Unicode injected at w1u:
<__main__.w3u instance at 0x8150524>::__getitem__(v3)
<__main__.w2u instance at 0x814effc>::__getitem__(v2)
<__main__.w1u instance at 0x814f024>::__getitem__(v1)
<__main__.w1u instance at 0x814f024>::__getitem__(v1)
<__main__.w2u instance at 0x814effc>::__getitem__(v2)
<__main__.w1u instance at 0x814f024>::__getitem__(v1)
<__main__.w1u instance at 0x814f024>::__getitem__(v1)
<__main__.w3u instance at 0x8150524>::__getitem__(v3)
<__main__.w2u instance at 0x814effc>::__getitem__(v2)
<__main__.w1u instance at 0x814f024>::__getitem__(v1)
<__main__.w1u instance at 0x814f024>::__getitem__(v1)
<__main__.w2u instance at 0x814effc>::__getitem__(v2)
<__main__.w1u instance at 0x814f024>::__getitem__(v1)
<__main__.w1u instance at 0x814f024>::__getitem__(v1)
v1
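The doubling in the trace above can be reproduced with a small model. This is only a sketch under stated assumptions: Text, fmt and count_lookups are hypothetical names, not CPython code, and fmt merely imitates the "look up, notice unicode, start over" flow described for stringobject.c. It is written for a current Python 3, where the original behaviour no longer occurs, so a str subclass stands in for the unicode type.

```python
class Text(str):
    """Hypothetical stand-in for a Python 2 unicode value."""


def fmt(key, mapping):
    """Model of '%(key)s' % mapping with the str -> unicode restart."""
    value = mapping[key]            # lookup done by the string formatter
    if isinstance(value, Text):     # models the "goto unicode" branch
        value = mapping[key]        # lookup repeated by the unicode formatter
        return Text(value)
    return str(value)


def count_lookups(depth, leaf):
    """Count __getitem__ calls for `depth` nested templates ending in `leaf`."""
    counter = [0]

    class Level:
        def __init__(self, below):
            self.below = below

        def __getitem__(self, key):
            counter[0] += 1
            if self.below is None:
                return leaf                 # like w1.v1 / w1u.v1
            return fmt(key, self.below)     # like v2() and v3()

    chain = None
    for _ in range(depth):
        chain = Level(chain)
    fmt('v', chain)
    return counter[0]


print(count_lookups(3, 'v1'))        # 3
print(count_lookups(3, Text('v1')))  # 14
```

In this model a chain of n templates costs n lookups with a plain string at the bottom and 2**(n+1) - 2 lookups with a unicode value at the bottom, which matches the 3 and 14 calls shown in the two traces.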
I'm sure this isn't a "common" use of the string
formatting mechanism, but it seems that evaluating the
arguments multiple times could be a bad thing. It
certainly is for us 8^)
We're running this on a RedHat 7.3/8.0 setup, not that
it appears to matter (from looking in stringobject.c).
It also appears to still be a problem in 2.3a1.
Any comments? Help? Questions?
----------------------------------------------------------------------
>Comment By: M.-A. Lemburg (lemburg)
Date: 2003-01-28 23:23
Message:
Logged In: YES
user_id=38388
I don't see how you can avoid fetching the Unicode
argument a second time without restructuring the
formatting code altogether.
If you know that your arguments can be Unicode, you
should start with a Unicode formatting string to begin
with. That's faster and doesn't involve a fallback
solution.
If you still want to see this fixed, I'd suggest
submitting a patch.
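A minimal sketch of the suggestion to start with a Unicode formatting string, applied to the classes from the report. The class names are capitalised and the diagnostic print is replaced by a counter (Mixin.calls, a hypothetical addition) so that the number of lookups can be checked. The u'' prefix is accepted by Python 2 and by Python 3.3 and later; the count below was worked out for Python 3, and the expectation for Python 2, going by the comment above, is that no fallback and therefore no repeated lookup takes place.

```python
class Mixin(object):
    calls = 0  # hypothetical counter replacing the diagnostic print

    def __getitem__(self, name):
        Mixin.calls += 1
        hook = getattr(self, name)
        if callable(hook):
            return hook()
        return hook


class W1u(Mixin):
    v1 = u'v1'


class W2u(Mixin):
    def v2(self):
        # Unicode format string, so formatting starts in the unicode path
        return u'%(v1)s' % W1u()


class W3u(Mixin):
    def v3(self):
        return u'%(v2)s' % W2u()


result = u'%(v3)s' % W3u()
print(result)       # v1
print(Mixin.calls)  # 3 on Python 3: one lookup per level
```

The trade-off is that every template in the hierarchy has to use a u'' format string; a single plain '%(...)s' template above a unicode value would bring the fallback back under Python 2.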
----------------------------------------------------------------------