[Python-Dev] RE: [Spambayes] Question (or possibly a bug report)

Tim Peters tim.one@comcast.net
Wed, 23 Jul 2003 22:59:01 -0400

[copying to Python-Dev because it relates to a recent thread there too]

[Meyer, Tony]
> No doubt I should ask this on python-list not here, but since Tim
> would probably be the one to answer it anyway... :)  Why does the
> setlocale() function give that impression?  If LC_NUMERIC should
> always be "C", shouldn't setlocale(LC_NUMERIC, x) raise some sort of
> exception?

Reading the locale module docs should make it clearer.  If it's still
unclear after reading the docs, ask again <wink>.  For a concrete example of
why it's still useful to "pretend" to change LC_NUMERIC, see below (the
locale module functions are sensitive to the change, and the code below
couldn't be written otherwise).

> Maybe Outlook is at fault here?  I've certainly seen that some of the
> Outlook/COM/MAPI calls make changes to the locale.  In particular,
> mapi.MAPILogonEx() does - it changes the locale to whatever Outlook
> (i.e. Windows) thinks it is.  Could this then be screwing things up
> for us?

Oh yes.  A .pyc file contains a compiled form of Python code objects.  Part
of what's in a .pyc file is the "marshal" form of numeric literals
referenced by the source code.  It so happens that marshal stores float
literals as strings (repr-style).  The unmarshaling code (executed when a
.pyc file is loaded) is written in C, and uses C's atof() to convert these
strings back to C doubles.  atof() is locale-sensitive, so can screw up
royally if LC_NUMERIC isn't "C" at the time a module is loaded.

Here's a little program we can use to predict this kind of damage:

import marshal, locale

def damage(afloat, lcnumeric):
    s = marshal.dumps(afloat)
    print repr(afloat), "->", repr(s)
    # Now emulate unmarshaling that under a given locale.
    # Strip the type code and byte count.
    assert s[0] == 'f'
    raw = s[2:]
    print "Under locale %r that loads as" % lcnumeric,
    locale.setlocale(locale.LC_NUMERIC, lcnumeric)
    print repr(locale.atof(raw))
    locale.setlocale(locale.LC_NUMERIC, "C")

For example, running damage(0.001, "German") displays:

0.001 -> 'f\x050.001'
Under locale 'German' that loads as 1.0

while damage(0.001, "C") displays what Python needs to happen instead:

0.001 -> 'f\x050.001'
Under locale 'C' that loads as 0.001
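
The same prediction can be sketched without touching the process locale at all. The version below is an illustration for a current Python 3, not the program above: emulated_atof and predict_damage are made-up names, the separators are passed in by hand (assuming German-style conventions of "." for grouping and "," for the decimal point), and marshal.dumps(x, 1) is used on the assumption that only the old marshal format versions still write floats as text.

```python
import marshal


def emulated_atof(text, thousands_sep, decimal_point):
    """Mimic what locale.atof() does: drop grouping, normalize the point."""
    if thousands_sep:
        text = text.replace(thousands_sep, "")
    if decimal_point:
        text = text.replace(decimal_point, ".")
    return float(text)


def predict_damage(afloat, thousands_sep, decimal_point):
    # Marshal version 1 stores a float as: b'f', a length byte, then text.
    s = marshal.dumps(afloat, 1)
    assert s[:1] == b"f"
    raw = s[2:].decode("ascii")
    return emulated_atof(raw, thousands_sep, decimal_point)


if __name__ == "__main__":
    print(predict_damage(0.001, ".", ","))  # German-style separators (assumed)
    print(predict_damage(0.001, "", "."))   # "C"-style separators
```

Passing the separators explicitly keeps the sketch independent of which locale names the platform happens to accept.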

So all kinds of bad things *can* happen.  I'm still baffled by the spambayes
logfile, though, because the failing assert is here:

    def set_stages(self, stages):
        self.stages = []
        start_pos = 0.0
        for name, prop in stages:
            stage = name, start_pos, prop
            start_pos += prop
        assert (abs(start_pos-1.0)) < 0.001, \
               "Proportions must add to 1.0 (%g,%r)" % (start_pos, stages)

and the failing call is here:

        self.set_stages( (("", 1.0),) )

Under a locale that ignores periods,

 string -> double
    0.0 -> 0.0
    1.0 -> 10.0
  0.001 -> 1.0

So the assert above would act like

        assert (abs(start_pos-10.0)) < 1.0, \

and the call would act like

        self.set_stages( (("", 10.0),) )

start_pos - 10.0 would still be 0 then, and the assert should not fail.
From the logfile, we also know that start_pos was actually 1.0 in the
failing case, and that the "1.0" at the call site also loaded as expected:

   AssertionError: Proportions must add to 1.0 (1,(('', 1.0),))

If the literals in this line alone got screwed up:

        assert (abs(start_pos-1.0)) < 0.001, \

that would fit all the symptoms.  Then start_pos would be 1.0, 10.0 would
get subtracted from that, and 9.0 is not less than 1.0.  So we should change
the assert to show us also the value of start_pos-1.0.  If that's -9.0, I'll
be baffled for a different reason.
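
To make the arithmetic behind that guess concrete, and to show one way the assert could report more, here is a small stand-alone sketch. It is not the spambayes code: check_proportions, its one and tolerance parameters, and the extra diff field in the message are all hypothetical, introduced only so the "damaged" constants can be passed in explicitly.

```python
def check_proportions(stages, one=1.0, tolerance=0.001):
    """Sum the proportions and apply the assert with explicit constants."""
    start_pos = 0.0
    for name, prop in stages:
        start_pos += prop
    diff = start_pos - one
    # Report the difference too, so a damaged constant shows up in the log.
    assert abs(diff) < tolerance, (
        "Proportions must add to 1.0 (%g,%r,diff=%r)" % (start_pos, stages, diff)
    )
    return diff


if __name__ == "__main__":
    stages = (("", 1.0),)
    print(check_proportions(stages))  # constants loaded correctly
    try:
        # Hypothesis: only the literals on the assert line loaded as 10.0 and 1.0.
        check_proportions(stages, one=10.0, tolerance=1.0)
    except AssertionError as exc:
        print(exc)
```

If the hypothesis is right, a message of this shape would end in diff=-9.0.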

PS:  For fun, look at what this displays:

damage(0.1, "German")

If you guessed 1.0 was the final loaded result, you're not even close to the
right universe <wink>.
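
For anyone without a German locale handy, here is a sketch of where the surprise comes from. It assumes marshal wrote 0.1 in its 17-significant-digit repr-style form (what "%.17g" produces) and that the locale simply drops periods, as in the table earlier. The helper name drop_periods is made up for the illustration.

```python
def drop_periods(text):
    """Parse a float string the way a period-ignoring locale would."""
    return float(text.replace(".", ""))


text = "%.17g" % 0.1        # 17 significant digits, repr-style
loaded = drop_periods(text)

if __name__ == "__main__":
    print(text)
    print(loaded)
```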