[spambayes-dev] RE: [Python-Dev] RE: [Spambayes] Question (or possibly a bug report)

Thu Jul 24 15:53:32 EDT 2003

[Tony Meyer]
> ...
> (Glad you posted this - I was wading through the progress of
> marshalling (PyOS_snprintf etc) and getting rapidly lost).

It's the unmarshalling code that's relevant -- that just passes a string to
atof().

>> 1. When LC_NUMERIC is "german", MS C's atof() stops at the first
>>    period it sees.

> This is the case:
> """
> #include <locale.h>
> #include <stdio.h>
> #include <stdlib.h>
>
> int main()
> {
>     float f;
>     setlocale(LC_NUMERIC, "german");
>     f = atof("0.1");
>     printf("%f\n", f);
> }
> """
>
> Gives me with gcc version 3.2 20020927 (prerelease):
> 	0.100000

It's possible that glibc doesn't recognize "german" as a legitimate locale
name (so that the setlocale() call had no effect).

> Gives me with Microsoft C++ Builder (I don't have Visual C++ handy,
> but I suppose it would be the same):
>       0,00000
>
> The help file for Builder does say that this is the correct behaviour
> - it will stop when it finds an unrecognised character - here '.' is
> unrecognised (because we are in German), so it stops.

atof does have to stop at the first unrecognized character, but atof is
locale-dependent, so which characters are and aren't recognized depends on
the locale.
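
To make the "stops at the first unrecognized character" point concrete, here
is a toy model of that scanning rule. It is a sketch only, not MS's or
glibc's implementation: the name atof_prefix is made up for illustration, the
decimal point is passed in explicitly instead of being read from the C
locale, and exponents, hex floats, inf and nan are all ignored.

```python
import re

def atof_prefix(text, decimal_point="."):
    """Toy model of C atof(): convert the longest leading numeric prefix.

    Hypothetical helper for illustration; the decimal point is a parameter
    here, whereas the real atof() takes it from LC_NUMERIC.
    """
    dp = re.escape(decimal_point)
    # Optional whitespace and sign, then "digits[dp[digits]]" or "dp digits".
    pattern = r"\s*([+-]?(?:\d+(?:%s\d*)?|%s\d+))" % (dp, dp)
    match = re.match(pattern, text)
    if match is None:
        # Nothing numeric at the start: atof() yields 0.0 in that case.
        return 0.0
    # Scanning stopped at the first character that didn't fit the pattern.
    return float(match.group(1).replace(decimal_point, "."))

print(atof_prefix("0.1", "."))   # '.' is the decimal point: whole string used
print(atof_prefix("0.1", ","))   # '.' is unrecognized: only "0" is used
```

With ',' as the decimal point, the scan of "0.1" ends at the period, which is
consistent with the zero that Builder printed above.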

After I set locale to "german" on Win2K:

>>> import locale
>>> locale.setlocale(locale.LC_NUMERIC, "german")
'German_Germany.1252'

MS tells me that the decimal_point character is ',' and the thousands_sep
character is '.':

>>> import pprint
>>> pprint.pprint(locale.localeconv())
{'currency_symbol': '',
 'decimal_point': ',',                     HERE
 'frac_digits': 127,
 'grouping': [3, 0],
 'int_curr_symbol': '',
 'int_frac_digits': 127,
 'mon_decimal_point': '',
 'mon_grouping': [],
 'mon_thousands_sep': '',
 'n_cs_precedes': 127,
 'n_sep_by_space': 127,
 'n_sign_posn': 127,
 'negative_sign': '',
 'p_cs_precedes': 127,
 'p_sep_by_space': 127,
 'p_sign_posn': 127,
 'positive_sign': '',
 'thousands_sep': '.'}                     AND HERE
>>>

Python believes that the locale-specified thousands_sep character should be
ignored, and that's what locale.atof() does.  It may well be a bug in MS's
atof() that it doesn't ignore the current thousands_sep character -- I don't
have time now to look up the rules in the C standard, and it doesn't matter
to spambayes either way (whether we load .001 as 0.0 or as 1.0, it's a
disaster either way).
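
As a rough sketch of the "ignore thousands_sep" rule described above, the
function below strips the grouping character and then swaps the decimal
point for '.'. The name locale_style_atof is hypothetical, and the two
separators are passed in by hand (in real use they would come from
locale.localeconv()) so the example doesn't depend on a German locale being
installed.

```python
def locale_style_atof(text, decimal_point, thousands_sep):
    """Parse a float the way the paragraph above describes locale.atof().

    Hypothetical stand-in for illustration: the separators are parameters
    rather than values read from locale.localeconv().
    """
    if thousands_sep:
        # Grouping characters carry no value, so drop them entirely.
        text = text.replace(thousands_sep, "")
    if decimal_point:
        # float() only understands '.', so normalize the decimal point.
        text = text.replace(decimal_point, ".")
    return float(text)

# German conventions from the localeconv() output: ',' decimal, '.' grouping.
print(locale_style_atof("1.234,5", ",", "."))
print(locale_style_atof(".001", ",", "."))
```

Under the German conventions the period in ".001" is discarded as a grouping
character, leaving "001", which is where the 1.0 reading comes from; a
converter that instead stops at the period gives 0.0.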

> Does this then mean that this is a Python bug?

That Microsoft's atof() doesn't ignore the thousands_sep character is
certainly not Python's bug <wink>.

> Or because Python tells us not to change the c locale and we (Outlook)
> are, it's our fault/problem?

The way we're using Python with Outlook doesn't meet the documented
requirements for using Python, so for now everything that goes wrong here is
our problem.  It would be better if Python didn't use locale-dependent
string<->float conversions internally, but that's just not the case (yet).

> Presumably what we'll have to do for a solution is just what Mark is
> doing now - find the correct place to put a call that (re)sets the c
> locale to English.

Python requires that the (true -- from the C library's POV) LC_NUMERIC
category be the "C" locale.  That isn't English (although it looks a lot like it
to Germans <wink>), and we don't care about any category other than
LC_NUMERIC here.
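
A minimal sketch of that reset, touching only LC_NUMERIC.  The helper name
force_c_numeric is made up for illustration, and where such a call belongs
in the Outlook addin is exactly the open question above, so nothing here
should be read as the actual fix.

```python
import locale

def force_c_numeric():
    """Put LC_NUMERIC back to the "C" locale; return the previous setting.

    Hypothetical helper: only LC_NUMERIC is touched, all other categories
    are left as the host application set them.
    """
    # Calling setlocale() with no second argument only queries the setting.
    previous = locale.setlocale(locale.LC_NUMERIC)
    if previous != "C":
        locale.setlocale(locale.LC_NUMERIC, "C")
    return previous

old = force_c_numeric()
print("LC_NUMERIC was %r, decimal_point is now %r"
      % (old, locale.localeconv()["decimal_point"]))
```

Returning the previous setting lets a caller notice when the host had
changed the locale; the host may change it again afterwards, which is why a
single call at startup may not be enough.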