[Python-bugs-list] [ python-Bugs-676346 ] String formatting operation Unicode problem.

SourceForge.net noreply@sourceforge.net
Tue, 28 Jan 2003 14:23:11 -0800


Bugs item #676346, was opened at 2003-01-28 21:59
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=676346&group_id=5470

Category: Unicode
Group: Python 2.2.2
Status: Open
Resolution: None
>Priority: 3
Submitted By: David M. Grimes (dmgrime)
Assigned to: M.-A. Lemburg (lemburg)
Summary: String formatting operation Unicode problem.

Initial Comment:
When a string formatting operation uses %s with a
unicode argument, the argument is evaluated more than
once.  With nested formatting (see example) the repeated
lookups multiply, which leads to excessive calls.

It seems Python-2.2.2:Objects/stringobject.c:3394 is
where PyObject_GetItem is used (for dictionary-like
formatting args).  Later, at :3509, there is a "goto
unicode" when a string argument is actually unicode.
At this point, everything resets and we do it all over
again in PyUnicode_Format.
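
The restart described above can be modelled in a few lines of
Python.  This is only a sketch of the control flow, not the C code:
FakeUnicode, format_str, format_unicode and CountingMap are made-up
names, and FakeUnicode merely stands in for the 2.x unicode type.

```python
class FakeUnicode(str):
    """Hypothetical stand-in for the Python 2 unicode type."""


def format_unicode(key, mapping):
    # Models PyUnicode_Format: it starts from scratch, so it fetches
    # the argument from the mapping itself.
    return FakeUnicode(mapping[key])


def format_str(key, mapping):
    # Models the 8-bit string path: the argument is fetched first...
    value = mapping[key]
    if isinstance(value, FakeUnicode):
        # ...and on finding unicode the work is thrown away and the
        # whole operation restarts (the "goto unicode" step).
        return format_unicode(key, mapping)
    return str(value)


class CountingMap:
    """Mapping that counts how often __getitem__ is called."""

    def __init__(self, value):
        self.value = value
        self.calls = 0

    def __getitem__(self, key):
        self.calls += 1
        return self.value


plain = CountingMap('v1')
format_str('v1', plain)

mixed = CountingMap(FakeUnicode('v1'))
format_str('v1', mixed)

print(plain.calls, mixed.calls)  # prints: 1 2
```

The plain value costs one lookup; the unicode-like value costs two,
because the second pass cannot reuse what the first pass fetched.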

There is an underlying assumption that the cost of the
call to PyObject_GetItem is very low (since we're going
to do them all again for unicode).  We've got a
Python-based templating system which uses a very simple
Mix-In class to facilitate flexible page generation. 
At the core is a simple __getitem__ implementation
which maps calls to getattr():

class mixin:
    def __getitem__(self, name):
        print '%r::__getitem__(%s)' % (self, name)
        hook = getattr(self, name)
        if callable(hook):
            return hook()
        else:
            return hook

Obviously, the print is diagnostic.  So, this basic
mechanism allows one to write hierarchical templates
filling in content found in "%(xxxx)s" escapes with
functions returning strings.  It has worked extremely
well for us.
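
For readers unfamiliar with the pattern, a minimal usage sketch
follows.  The Page class, its title attribute and its body method are
hypothetical and not part of our templating system; the diagnostic
print is left out.

```python
class Mixin:
    def __getitem__(self, name):
        # Look the key up as an attribute; call it if it is a method.
        hook = getattr(self, name)
        if callable(hook):
            return hook()
        return hook


class Page(Mixin):
    title = 'Home'              # plain attribute, used as-is

    def body(self):             # method, called on lookup
        return '<p>Welcome</p>'


html = '<h1>%(title)s</h1>%(body)s' % Page()
print(html)
```

Each "%(xxxx)s" escape triggers one __getitem__ call on the object
to the right of the % operator, which is why the cost of that call
matters here.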

BUT, we recently did some XML-based work which
uncovered this strange unicode behaviour.  Given the
following classes:

class w1u(mixin):
    v1 = u'v1'

class w2u(mixin):
    def v2(self):
        return '%(v1)s' % w1u()

class w3u(mixin):
    def v3(self):
        return '%(v2)s' % w2u()

class w1(mixin):
    v1 = 'v1'

class w2(mixin):
    def v2(self):
        return '%(v1)s' % w1()

class w3(mixin):
    def v3(self):
        return '%(v2)s' % w2()

And test case:

print 'All string:'
print '%(v3)s' % w3()
print

print 'Unicode injected at w1u:'
print '%(v3)s' % w3u()
print


As we can see, the only difference between the w{1,2,3}
and w{1,2,3}u classes is that w1u defines v1 as unicode
where w1 uses a "normal" string.

The string-based version shows 3 calls, as expected:

All string:
<__main__.w3 instance at 0x8150524>::__getitem__(v3)
<__main__.w2 instance at 0x814effc>::__getitem__(v2)
<__main__.w1 instance at 0x814f024>::__getitem__(v1)
v1

But the unicode version makes 14 calls, in a tree-like recursion:

Unicode injected at w1u:
<__main__.w3u instance at 0x8150524>::__getitem__(v3)
<__main__.w2u instance at 0x814effc>::__getitem__(v2)
<__main__.w1u instance at 0x814f024>::__getitem__(v1)
<__main__.w1u instance at 0x814f024>::__getitem__(v1)
<__main__.w2u instance at 0x814effc>::__getitem__(v2)
<__main__.w1u instance at 0x814f024>::__getitem__(v1)
<__main__.w1u instance at 0x814f024>::__getitem__(v1)
<__main__.w3u instance at 0x8150524>::__getitem__(v3)
<__main__.w2u instance at 0x814effc>::__getitem__(v2)
<__main__.w1u instance at 0x814f024>::__getitem__(v1)
<__main__.w1u instance at 0x814f024>::__getitem__(v1)
<__main__.w2u instance at 0x814effc>::__getitem__(v2)
<__main__.w1u instance at 0x814f024>::__getitem__(v1)
<__main__.w1u instance at 0x814f024>::__getitem__(v1)
v1
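
The 14 lines of output above are consistent with every level doing
its own lookup twice and re-running everything beneath it each time.
A small sketch of that count, assuming this reading of the trace is
right (the function name lookups is made up for illustration):

```python
def lookups(depth):
    """Number of __getitem__ calls for `depth` nested templates when
    every level is formatted twice (string pass, then unicode pass)."""
    if depth == 0:
        return 0
    # Each pass does one lookup at this level plus all lookups below.
    return 2 * (1 + lookups(depth - 1))


for depth in (1, 2, 3):
    print(depth, lookups(depth))
```

For depths 1, 2 and 3 this gives 2, 6 and 14, matching the trace for
w1u, w2u and w3u.  The closed form is 2**(depth + 1) - 2, so the cost
roughly doubles with each extra level of nesting.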

I'm sure this isn't a "common" use of the string
formatting mechanism, but it seems that evaluating the
arguments multiple times could be a bad thing.  It
certainly is for us 8^)

We're running this on a RedHat 7.3/8.0 setup, not that
it appears to matter (from looking in stringobject.c).
Also appears to still be a problem in 2.3a1.

Any comments?  Help?  Questions?


----------------------------------------------------------------------

>Comment By: M.-A. Lemburg (lemburg)
Date: 2003-01-28 23:23

Message:
Logged In: YES 
user_id=38388

I don't see how you can avoid fetching the Unicode
argument a second time without restructuring the
formatting code altogether.

If you know that your arguments can be Unicode, you
should start with a Unicode formatting string to begin
with. That's faster and doesn't involve a fallback
solution.
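
Applied to the classes from the report, that suggestion might look
like the sketch below.  The class names are capitalised copies of the
originals, and the calls list replaces the diagnostic print so the
lookups can be inspected; both are illustrative choices.  On Python 3,
where every string is Unicode, the u prefix is accepted but changes
nothing; it only matters on 2.x.

```python
calls = []  # records every __getitem__ lookup, in order


class Mixin:
    def __getitem__(self, name):
        calls.append((self.__class__.__name__, name))
        hook = getattr(self, name)
        if callable(hook):
            return hook()
        return hook


class W1u(Mixin):
    v1 = u'v1'


class W2u(Mixin):
    def v2(self):
        # Unicode format string: no string-to-unicode fallback needed.
        return u'%(v1)s' % W1u()


class W3u(Mixin):
    def v3(self):
        return u'%(v2)s' % W2u()


result = u'%(v3)s' % W3u()
print(result)
print(len(calls))
```

With Unicode format strings at every level, each key should be
looked up once, giving one call per level instead of the tree-like
recursion.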

If you still want to see this fixed, I'd suggest submitting
a patch.

