Python and the Unicode Character Database
Two recently reported issues [1,2] brought to light the fact that the Python language definition is closely tied to character properties maintained by the Unicode Consortium. For example, when Python switches to Unicode 6.0.0 (planned for the upcoming 3.2 release), we will gain two additional characters that Python can use in identifiers. [3] With Python 3.1:
>>> exec('\u0CF1 = 1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1
    ೱ = 1
    ^
SyntaxError: invalid character in identifier
but with Python 3.2a4:
>>> exec('\u0CF1 = 1')
>>> eval('\u0CF1')
1
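For readers on a current interpreter, the change Alexander describes can be checked directly with unicodedata and str.isidentifier(); a minimal sketch, assuming Python 3.2 or later (where the bundled UCD is 6.0.0+):

```python
import unicodedata

# U+0CF1 (KANNADA SIGN JIHVAMULIYA) was recategorized in Unicode 6.0.0
# from So (Symbol, other) to Lo (Letter, other), making it a valid
# identifier character.
print(unicodedata.category('\u0CF1'))   # 'Lo' with Unicode >= 6.0.0
print('\u0CF1'.isidentifier())          # True on Python 3.2 and later

ns = {}
exec('\u0CF1 = 1', ns)                  # raises SyntaxError on Python 3.1
print(eval('\u0CF1', ns))               # 1
```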
Of course, the likelihood is low that this change will affect any user, but the change in str.isspace() reported in [1] is likely to cause some trouble. Python 2.6.5:

>>> u'A\u200bB'.split()
[u'A', u'B']

Python 2.7:

>>> u'A\u200bB'.split()
[u'A\u200bB']
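The underlying cause is a property change in the UCD: U+200B ZERO WIDTH SPACE moved from general category Zs to Cf, so str.isspace() stopped matching it. A small check, written for Python 3:

```python
import unicodedata

# U+200B ZERO WIDTH SPACE: general category Zs through Unicode 4.0.0,
# Cf (format character) from 4.0.1 onward, so it no longer counts
# as whitespace for str.isspace() and str.split().
print(unicodedata.category('\u200b'))   # 'Cf'
print('\u200b'.isspace())               # False
print('A\u200bB'.split())               # ['A\u200bB'] -- no longer split
```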
While we have little choice but to follow the UCD in defining str.isidentifier(), I think Python can promise users more stability in what it treats as a space or as a digit in its builtins. For example, I don't think that supporting

>>> float('١٢٣٤.٥٦')
1234.56

is more important than assuring users that once their program has accepted some text as a number, they can assume that the text is ASCII.

[1] http://bugs.python.org/issue10567
[2] http://bugs.python.org/issue10557
[3] http://www.unicode.org/versions/Unicode6.0.0/#Database_Changes
On Sun, 28 Nov 2010 15:24:37 -0500 Alexander Belopolsky <alexander.belopolsky@gmail.com> wrote:
While we have little choice but to follow UCD in defining str.isidentifier(), I think Python can promise users more stability in what it treats as space or as a digit in its builtins.
Well, if "unicode support" means "support the latest version of the Unicode standard", I'm not sure we have a choice. We can make exceptions, but that would only confuse users even more, wouldn't it?
For example, I don't think that supporting
>>> float('١٢٣٤.٥٦')
1234.56
is more important than to assure users that once their program accepted some text as a number, they can assume that the text is ASCII.
Why would they assume the text is ASCII?

Regards

Antoine.
On Sun, Nov 28, 2010 at 3:43 PM, Antoine Pitrou <solipsis@pitrou.net> wrote: ..
For example, I don't think that supporting
>>> float('١٢٣٤.٥٦')
1234.56
is more important than to assure users that once their program accepted some text as a number, they can assume that the text is ASCII.
Why would they assume the text is ASCII?
def deposit(self, amountstr):
    self.balance += float(amountstr)
    audit_log("Deposited: " + amountstr)

Auditor:

$ cat numbered-account.log
Deposited: ?????.??
...
On Sun, 28 Nov 2010 15:58:33 -0500 Alexander Belopolsky <alexander.belopolsky@gmail.com> wrote:
On Sun, Nov 28, 2010 at 3:43 PM, Antoine Pitrou <solipsis@pitrou.net> wrote: ..
For example, I don't think that supporting
>>> float('١٢٣٤.٥٦')
1234.56
is more important than to assure users that once their program accepted some text as a number, they can assume that the text is ASCII.
Why would they assume the text is ASCII?
def deposit(self, amountstr):
    self.balance += float(amountstr)
    audit_log("Deposited: " + amountstr)
Auditor:
$ cat numbered-account.log
Deposited: ?????.??
I'm not sure that's how banking applications are written :)

Antoine.
On Sun, Nov 28, 2010 at 7:04 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Sun, 28 Nov 2010 15:58:33 -0500 Alexander Belopolsky <alexander.belopolsky@gmail.com> wrote:
On Sun, Nov 28, 2010 at 3:43 PM, Antoine Pitrou <solipsis@pitrou.net> wrote: ..
For example, I don't think that supporting
>>> float('١٢٣٤.٥٦')
1234.56
is more important than to assure users that once their program accepted some text as a number, they can assume that the text is ASCII.
Why would they assume the text is ASCII?
def deposit(self, amountstr):
    self.balance += float(amountstr)
    audit_log("Deposited: " + amountstr)
Auditor:
$ cat numbered-account.log
Deposited: ?????.??
I'm not sure that's how banking applications are written :)
+1 for this being bogus - I see no reason whatsoever why numbers inside Unicode text have to be "ASCII" when we have surpassed all the technical barriers that once required behaving like that. ASCII is an oversimplification of human communication needed for computing devices not complex enough to represent it fully. Let novice C programmers in English speaking countries deal with the fact that 1 character is not 1 byte anymore. We are past this point.

js
-><-
Antoine.

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/jsbueno%40python.org.br
On Sun, Nov 28, 2010 at 4:12 PM, Joao S. O. Bueno <jsbueno@python.org.br> wrote: ..
Let novice C programmers in English speaking countries deal with the fact that 1 character is not 1 byte anymore. We are past this point.
If you are, please contribute your expertise here: http://bugs.python.org/issue2382
On 11/28/2010 3:58 PM, Alexander Belopolsky wrote:
On Sun, Nov 28, 2010 at 3:43 PM, Antoine Pitrou<solipsis@pitrou.net> wrote: ..
For example, I don't think that supporting
>>> float('١٢٣٤.٥٦')
1234.56
Even if this is somehow an accident or something that someone snuck in, I think it a good idea that *users* be able to input amounts with their native digits. That is different from requiring *programmers* to write literals with euro-ascii-digits
is more important than to assure users that once their program accepted some text as a number, they can assume that the text is ASCII.
Why would they assume the text is ASCII?
def deposit(self, amountstr):
    self.balance += float(amountstr)
    audit_log("Deposited: " + amountstr)
If the programmer wants to assure ASCII, he can produce a string, possibly formatted, from the amount:

depform = "Deposited: ${:14.2f}".format

def deposit(self, amountstr):
    amount = float(amountstr)
    self.balance += amount
    # audit_log("Deposited: " + str(amount))  # simple version
    audit_log(depform(amount))

Given that amountstr could be something like ' 182.33 ', I think the programmer should plan to format it.

--
Terry Jan Reedy
>>> float('١٢٣٤.٥٦')
1234.56
Even if this is somehow an accident or something that someone snuck in, I think it a good idea that *users* be able to input amounts with their native digits. That is different from requiring *programmers* to write literals with euro-ascii-digits
So one question is what kind of data float() is aimed at. I claim that it is about "programmer" data, not "user" data. If it supported "user" data, it probably would have to support "1,000" to denote 1e3 in the U.S., and to denote 1e0 in Germany. Our users are generally confused about whether they should use the full stop or the comma as the decimal separator.

As not even the locale-dependent issues are considered in float(), it is clear to me that entering local numbers cannot possibly be the objective of the function. Instead, following a wide-spread Python convention, it is meant to be the reverse of repr().

Can you name a single person who actually wants to write '١٢٣٤.٥٦' as a number? I'm fairly skeptical that users of Arabic-Indic digits do. Instead, http://en.wikipedia.org/wiki/Decimal_separator suggests that they would rather use U+066B, i.e. '١٢٣٤٫٥٦', which isn't supported by Python.

Regards,
Martin
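Martin's distinction between float() and locale-aware parsing is easy to demonstrate: float() never honors locale separators, while locale.atof() does. A sketch; the German locale name used here varies by platform, so the example guards against its absence:

```python
import locale

# float() is not locale-aware: a thousands separator is always rejected
try:
    float('1,000')
except ValueError:
    print('float() rejects "1,000"')

# locale.atof() is the locale-aware counterpart
try:
    locale.setlocale(locale.LC_NUMERIC, 'de_DE.UTF-8')  # name varies by OS
except locale.Error:
    print('German locale not installed; skipping')
else:
    print(locale.atof('1,5'))   # 1.5 -- comma is the German decimal point
    locale.setlocale(locale.LC_NUMERIC, 'C')
```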
On 28/11/2010 23:33, "Martin v. Löwis" wrote:
>>> float('١٢٣٤.٥٦')
1234.56

Even if this is somehow an accident or something that someone snuck in, I think it a good idea that *users* be able to input amounts with their native digits. That is different from requiring *programmers* to write literals with euro-ascii-digits.

So one question is what kind of data float() is aimed at. I claim that it is about "programmer" data, not "user" data. If it supported "user" data, it probably would have to support "1,000" to denote 1e3 in the U.S., and to denote 1e0 in Germany. Our users are generally confused about whether they should use the full stop or the comma as the decimal separator.
FWIW the C# equivalent is locale aware *unless* you pass in a specific culture. (System.Double.Parse): http://msdn.microsoft.com/en-us/library/fd84bdyt.aspx If you're not aware that your code may be run on non-US computers this is a trap for the unwary. If you *are* aware then it is very useful. An alternative overload allows you to specify the culture used to do the conversion: http://msdn.microsoft.com/en-us/library/t9ebt447.aspx Michael
As not even the locale-dependent issues are considered in float(), it is clear to me that entering local numbers cannot possibly be the objective of the function.
Instead, following a wide-spread Python convention, it is meant to be the reverse of repr().
Can you name a single person who actually wants to write '١٢٣٤.٥٦' as a number? I'm fairly skeptical that users of Arabic-Indic digits do. Instead,
http://en.wikipedia.org/wiki/Decimal_separator
suggests that they would rather use U+066B, i.e. '١٢٣٤٫٥٦', which isn't supported by Python.
Regards,
Martin
-- http://www.voidspace.org.uk/ READ CAREFULLY. By accepting and reading this email you agree, on behalf of your employer, to release me from all obligations and waivers arising from any and all NON-NEGOTIATED agreements, licenses, terms-of-service, shrinkwrap, clickwrap, browsewrap, confidentiality, non-disclosure, non-compete and acceptable use policies (”BOGUS AGREEMENTS”) that I have entered into with your employer, its partners, licensors, agents and assigns, in perpetuity, without prejudice to my ongoing rights and privileges. You further represent that you have the authority to release me from any BOGUS AGREEMENTS on behalf of your employer.
FWIW the C# equivalent is locale aware *unless* you pass in a specific culture. (System.Double.Parse):
That's not quite the equivalent of float(), I would say: this one apparently is locale-aware, so it is more the equivalent of locale.atof. The next question then is whether it supports indo-arabic digits in any locale (or more specifically in an arabic locale).

Regards,
Martin
On Sun, Nov 28, 2010 at 6:59 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote: ..
The next question then is if it supports indo-arabic digits in any locale (or more specifically in an arabic locale).
And once you answered that question, does it support Devanagari or Bengali digits? And if so, an arbitrary mix of those and indo-arabic digits?
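As it happens, CPython's own answer to this question is "yes to all of the above": int() and float() accept any character carrying the Unicode decimal-digit property, in any mix of scripts within one string. A quick check on Python 3:

```python
# int() accepts any Unicode decimal digits -- even mixed scripts in one string
print(int('١٢٣'))    # Arabic-Indic digits                -> 123
print(int('१२३'))    # Devanagari digits                  -> 123
print(int('১২৩'))    # Bengali digits                     -> 123
print(int('1٢३'))    # ASCII + Arabic-Indic + Devanagari  -> 123
```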
On 29/11/2010 00:04, Alexander Belopolsky wrote:
On Sun, Nov 28, 2010 at 6:59 PM, "Martin v. Löwis"<martin@v.loewis.de> wrote: ..
The next question then is if it supports indo-arabic digits in any locale (or more specifically in an arabic locale).

And once you answered that question, does it support Devanagari or Bengali digits? And if so, an arbitrary mix of those and indo-arabic digits?

Haha. Go and try it yourself. :-)
Michael

--
http://www.voidspace.org.uk/
On 28/11/2010 23:59, "Martin v. Löwis" wrote:
FWIW the C# equivalent is locale aware *unless* you pass in a specific culture. (System.Double.Parse):

That's not quite the equivalent of float(), I would say: this one apparently is locale-aware, so it is more the equivalent of locale.atof.
Right. It is *the* standard way of getting a float from a string though, whereas in Python we have two depending on whether or not you want to be locale aware. The standard way in C# is locale aware. To be non-locale aware you pass in a specific culture or number format.
The next question then is if it supports indo-arabic digits in any locale (or more specifically in an arabic locale).
I don't think so actually. The float parse formatting rules are defined like this:

[ws][$][sign][integral-digits[,]]integral-digits[.[fractional-digits]][E[sign]exponential-digits][ws]

(From http://msdn.microsoft.com/en-us/library/7yd1h1be.aspx )

integral-digits, fractional-digits and exponential-digits are all defined as "A series of digits ranging from 0 to 9". Arguably this is not conclusive. In fact the NumberFormatInfo class seems to hint that it may be otherwise:

http://msdn.microsoft.com/en-us/library/system.globalization.numberformatinf...

See DigitSubstitution on that page. I would have to try it to be sure and I don't have a Windows VM in convenient reach right now.

All the best,

Michael
Regards, Martin
--
http://www.voidspace.org.uk/
>>> float('١٢٣٤.٥٦')
1234.56
I think it's a bug that this works. The definition of the float builtin says

    Convert a string or a number to floating point. If the argument is a
    string, it must contain a possibly signed decimal or floating point
    number, possibly embedded in whitespace. The argument may also be
    '[+|-]nan' or '[+|-]inf'.

Now, one may wonder what precisely a "possibly signed floating point number" is, but most likely, this refers to

    floatnumber   ::= pointfloat | exponentfloat
    pointfloat    ::= [intpart] fraction | intpart "."
    exponentfloat ::= (intpart | pointfloat) exponent
    intpart       ::= digit+
    fraction      ::= "." digit+
    exponent      ::= ("e" | "E") ["+" | "-"] digit+
    digit         ::= "0"..."9"

Regards,
Martin
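The gap between that grammar and the implementation is easy to exhibit; the Arabic-Indic input below is not derivable from digit ::= "0"..."9", yet it is accepted (checked on Python 3):

```python
# Accepted despite not matching the documented grammar's digit rule
print(float('١٢٣٤.٥٦'))    # Arabic-Indic digits -> 1234.56
# These two the prose docs do allow (whitespace, inf/nan spellings),
# but the grammar quoted above does not cover them either:
print(float('  1.5  '))     # surrounding whitespace -> 1.5
print(float('Infinity'))    # spelled-out infinity   -> inf
```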
On Sun, Nov 28, 2010 at 5:17 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
>>> float('١٢٣٤.٥٦')
1234.56
I think it's a bug that this works. The definition of the float builtin says
Convert a string or a number to floating point. If the argument is a string, it must contain a possibly signed decimal or floating point number, possibly embedded in whitespace. The argument may also be '[+|-]nan' or '[+|-]inf'.
This definition fails long before we get beyond the 127th code point:
>>> float('infinity')
inf
Am 28.11.2010 23:31, schrieb Alexander Belopolsky:
On Sun, Nov 28, 2010 at 5:17 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
>>> float('١٢٣٤.٥٦')
1234.56
I think it's a bug that this works. The definition of the float builtin says
Convert a string or a number to floating point. If the argument is a string, it must contain a possibly signed decimal or floating point number, possibly embedded in whitespace. The argument may also be '[+|-]nan' or '[+|-]inf'.
This definition fails long before we get beyond 127-th code point:
>>> float('infinity')
inf
What do you infer from that? That the definition is wrong, or the code is wrong?

Regards,
Martin
On Sun, Nov 28, 2010 at 5:56 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote: ..
This definition fails long before we get beyond 127-th code point:
>>> float('infinity')
inf
What do you infer from that? That the definition is wrong, or the code is wrong?
The development version of the reference manual is more detailed, but as far as I can tell, it still defines digit as 0-9. http://docs.python.org/dev/py3k/library/functions.html#float
Am 29.11.2010 00:01, schrieb Alexander Belopolsky:
On Sun, Nov 28, 2010 at 5:56 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote: ..
This definition fails long before we get beyond 127-th code point:
>>> float('infinity')
inf
What do you infer from that? That the definition is wrong, or the code is wrong?
The development version of the reference manual is more detailed, but as far as I can tell, it still defines digit as 0-9.
http://docs.python.org/dev/py3k/library/functions.html#float
I wasn't asking about 0..9, but about "infinity". According to the spec, it shouldn't accept that (and neither should it accept 'infinitY'). However, whether that's a spec bug or an implementation bug - it seems like a minor issue to me (i.e. easily fixed).

Regards,
Martin
On Sun, Nov 28, 2010 at 6:08 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
Am 29.11.2010 00:01, schrieb Alexander Belopolsky:
On Sun, Nov 28, 2010 at 5:56 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote: ..
This definition fails long before we get beyond 127-th code point:
>>> float('infinity')
inf
What do you infer from that? That the definition is wrong, or the code is wrong?
The development version of the reference manual is more detailed, but as far as I can tell, it still defines digit as 0-9.
http://docs.python.org/dev/py3k/library/functions.html#float
I wasn't asking about 0..9, but about "infinity". According to the spec, it shouldn't accept that (and neither should it accept 'infinitY').
According to the link that I mentioned,

    infinity ::= "Infinity" | "inf"

and "Case is not significant, so, for example, “inf”, “Inf”, “INFINITY” and “iNfINity” are all acceptable spellings for positive infinity."

I completely agree with your arguments, and the reference manual has been improved a lot in recent years.
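The case-insensitive spellings Alexander quotes are straightforward to verify on any Python 3 interpreter:

```python
# All of these spellings are accepted, case-insensitively
for s in ('inf', 'Inf', 'INFINITY', 'iNfINity'):
    assert float(s) == float('inf')

# 'nan' is handled the same way; NaN compares unequal to itself
nan = float('NaN')
assert nan != nan
print('all spellings accepted')
```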
"Martin v. Löwis" wrote:
>>> float('١٢٣٤.٥٦')
1234.56
I think it's a bug that this works. The definition of the float builtin says
Convert a string or a number to floating point. If the argument is a string, it must contain a possibly signed decimal or floating point number, possibly embedded in whitespace. The argument may also be '[+|-]nan' or '[+|-]inf'.
Now, one may wonder what precisely a "possibly signed floating point number" is, but most likely, this refers to
floatnumber   ::= pointfloat | exponentfloat
pointfloat    ::= [intpart] fraction | intpart "."
exponentfloat ::= (intpart | pointfloat) exponent
intpart       ::= digit+
fraction      ::= "." digit+
exponent      ::= ("e" | "E") ["+" | "-"] digit+
digit         ::= "0"..."9"
I don't see why the language spec should limit the wealth of number formats supported by float(). It is not uncommon for Asians and other non-Latin script users to use their own native script symbols for numbers. Just because these digits may look strange to someone doesn't mean that they are meaningless or should be discarded.

Please also remember that Python3 now allows Unicode names for identifiers for much the same reasons.

Note that the support in float() (and the other numeric constructors) for working with Unicode code points was explicitly added when Unicode support was added to Python, and has been available since Python 1.6. It is not a bug by any definition of "bug", even though the feature may occasionally bug someone into reading up a bit on what else the world has to offer other than Arabic numerals :-)

http://en.wikipedia.org/wiki/Numeral_system

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Nov 28 2010)
Python/Zope Consulting and Support ... http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/
On Sun, Nov 28, 2010 at 5:42 PM, M.-A. Lemburg <mal@egenix.com> wrote: ..
I don't see why the language spec should limit the wealth of number formats supported by float().
The Language Spec (whatever it is) should not, but hopefully the Library Reference should. If you follow the http://docs.python.org/dev/py3k/library/functions.html#float link and the references therein, you'll end up with

    digit ::= "0"..."9"

http://docs.python.org/dev/py3k/reference/lexical_analysis.html#grammar-toke...
On 11/28/2010 5:51 PM, Alexander Belopolsky wrote:
The Language Spec (whatever it is) should not, but hopefully the Library Reference should. If you follow http://docs.python.org/dev/py3k/library/functions.html#float link and the references therein, you'll end up with
digit ::= "0"..."9"
http://docs.python.org/dev/py3k/reference/lexical_analysis.html#grammar-toke...
So fix the doc for builtin float() and perhaps int().

--
Terry Jan Reedy
Alexander Belopolsky wrote:
On Sun, Nov 28, 2010 at 5:42 PM, M.-A. Lemburg <mal@egenix.com> wrote: ..
I don't see why the language spec should limit the wealth of number formats supported by float().
The Language Spec (whatever it is) should not, but hopefully the Library Reference should. If you follow http://docs.python.org/dev/py3k/library/functions.html#float link and the references therein, you'll end up with
... the language spec again :-)
digit ::= "0"..."9"
http://docs.python.org/dev/py3k/reference/lexical_analysis.html#grammar-toke...
That's obviously a bug in the documentation, since the Python 2.7 docs don't mention any such relationship to the language spec:

http://docs.python.org/library/functions.html#float

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Nov 29 2010)
Now, one may wonder what precisely a "possibly signed floating point number" is, but most likely, this refers to
floatnumber   ::= pointfloat | exponentfloat
pointfloat    ::= [intpart] fraction | intpart "."
exponentfloat ::= (intpart | pointfloat) exponent
intpart       ::= digit+
fraction      ::= "." digit+
exponent      ::= ("e" | "E") ["+" | "-"] digit+
digit         ::= "0"..."9"
I don't see why the language spec should limit the wealth of number formats supported by float().
If it doesn't, there should be some other specification of what is correct and what is not. It must not be unspecified.
It is not uncommon for Asians and other non-Latin script users to use their own native script symbols for numbers. Just because these digits may look strange to someone doesn't mean that they are meaningless or should be discarded.
Then these users should speak up and indicate their need, or somebody should speak up and confirm that there are users who actually want '١٢٣٤.٥٦' to denote 1234.56. To my knowledge, there is no writing system in which '١٢٣٤.٥٦e4' means 12345600.0.
Please also remember that Python3 now allows Unicode names for identifiers for much the same reasons.
No no no. Addition of Unicode identifiers has a well-designed, deliberate specification, with a PEP and all. The support for non-ASCII digits in float appears to be ad-hoc, and not founded on actual needs of actual users.
Note that the support in float() (and the other numeric constructors) to work with Unicode code points was explicitly added when Unicode support was added to Python and has been available since Python 1.6.
That doesn't necessarily make it useful. Alexander's complaint is that it makes Python unstable (i.e. changing as the UCD changes).
It is not a bug by any definition of "bug"
Most certainly it is: the documentation is either underspecified, or deviates from the implementation (when taking the most plausible interpretation). This is the very definition of "bug". Regards, Martin
+1 on all points below.

On Sun, Nov 28, 2010 at 6:03 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
Now, one may wonder what precisely a "possibly signed floating point number" is, but most likely, this refers to
floatnumber   ::= pointfloat | exponentfloat
pointfloat    ::= [intpart] fraction | intpart "."
exponentfloat ::= (intpart | pointfloat) exponent
intpart       ::= digit+
fraction      ::= "." digit+
exponent      ::= ("e" | "E") ["+" | "-"] digit+
digit         ::= "0"..."9"
I don't see why the language spec should limit the wealth of number formats supported by float().
If it doesn't, there should be some other specification of what is correct and what is not. It must not be unspecified.
It is not uncommon for Asians and other non-Latin script users to use their own native script symbols for numbers. Just because these digits may look strange to someone doesn't mean that they are meaningless or should be discarded.
Then these users should speak up and indicate their need, or somebody should speak up and confirm that there are users who actually want '١٢٣٤.٥٦' to denote 1234.56. To my knowledge, there is no writing system in which '١٢٣٤.٥٦e4' means 12345600.0.
Please also remember that Python3 now allows Unicode names for identifiers for much the same reasons.
No no no. Addition of Unicode identifiers has a well-designed, deliberate specification, with a PEP and all. The support for non-ASCII digits in float appears to be ad-hoc, and not founded on actual needs of actual users.
Note that the support in float() (and the other numeric constructors) to work with Unicode code points was explicitly added when Unicode support was added to Python and has been available since Python 1.6.
That doesn't necessarily make it useful. Alexander's complaint is that it makes Python unstable (i.e. changing as the UCD changes).
It is not a bug by any definition of "bug"
Most certainly it is: the documentation is either underspecified, or deviates from the implementation (when taking the most plausible interpretation). This is the very definition of "bug".
Regards, Martin
On Sun, Nov 28, 2010 at 6:03 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote: ..
Note that the support in float() (and the other numeric constructors) to work with Unicode code points was explicitly added when Unicode support was added to Python and has been available since Python 1.6.
That doesn't necessarily make it useful. Alexander's complaint is that it makes Python unstable (i.e. changing as the UCD changes).
What makes it worse is that, while superficially Unicode versions follow the same X.Y.Z format as Python versions, the stability promises are completely different. For example, it appears that the general category for the ZERO WIDTH SPACE was changed in Unicode 4.0.1. I don't think a change affecting str.split(), int(), float() and probably numerous other library functions would be acceptable in a Python micro release.
What makes it worse, is that while superficially, Unicode versions follow the same X.Y.Z format as Python versions, the stability promises are completely different. For example, it appears that the general category for the ZERO WIDTH SPACE was changed in Unicode 4.0.1. I don't think a change affecting str.split(), int(), float() and probably numerous other library functions would be acceptable in a Python micro release.
Well, we managed to completely break Unicode normalization between 2.6.5 and 2.6.6, due to a bug. You can see the Unicode Consortium's stability policy at

http://unicode.org/policies/stability_policy.html

In a sense, this is stronger than Python's backwards compatibility promises (which allow for certain incompatible changes to occur over time, whereas Unicode makes promises about all future versions).

Regards,
Martin
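Alexander's ZERO WIDTH SPACE example can be checked against both database snapshots that CPython ships: unicodedata exposes the current UCD plus a frozen Unicode 3.2.0 view (kept around for IDNA), which still predates the 4.0.1 category change:

```python
import unicodedata

print(unicodedata.unidata_version)               # version of the current UCD
# U+200B was Zs (space separator) in Unicode 3.2.0 ...
print(unicodedata.ucd_3_2_0.category('\u200b'))  # 'Zs'
# ... but has been Cf (format) since Unicode 4.0.1
print(unicodedata.category('\u200b'))            # 'Cf'
```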
On Sun, Nov 28, 2010 at 6:19 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote: ..
You can see the Unicode Consortium's stability policy at
http://unicode.org/policies/stability_policy.html
From the link above:

"""
As more experience is gathered in implementing the characters,
adjustments in the properties may become necessary. Examples of such
properties include, but are not limited to, the following:

    General_Category
    ...
"""
In a sense, this is stronger than Python's backwards compatibility promises (which allow for certain incompatible changes to occur over time, whereas Unicode makes promises about all future versions).
I would say it is *different* and should be taken into account when tying language features to Unicode specifications. This was done in PEP 3131. Note that one of the stated objections was "Unicode is young; its problems are not yet well understood and solved;" (It is still true.)
On Sun, Nov 28, 2010 at 6:03 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote: ..
No no no. Addition of Unicode identifiers has a well-designed, deliberate specification, with a PEP and all. The support for non-ASCII digits in float appears to be ad-hoc, and not founded on actual needs of actual users.
I wonder how carefully right-to-left scripts were considered when PEP 3131 was discussed. Try the following on the python prompt:
>>> ڦ = int('١٢٣')
>>> ڦ
123
In my OSX Terminal window, entering ڦ flips the >>> prompt and the session looks like this: ('???')int = ? <<<
Am 29.11.2010 00:56, schrieb Alexander Belopolsky:
On Sun, Nov 28, 2010 at 6:03 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote: ..
No no no. Addition of Unicode identifiers has a well-designed, deliberate specification, with a PEP and all. The support for non-ASCII digits in float appears to be ad-hoc, and not founded on actual needs of actual users.
I wonder how carefully right-to-left scripts were considered when PEP 3131 was discussed.
IIRC, some Hebrew users have spoken in favor of the PEP, despite the obvious difficulties it would create. I may misremember, but I think someone pointed out that they had these difficulties all the time, and that it wasn't really a burden. Unicode specifies that one should always use "logical order" in memory, and that's what the PEP does. Rendering is then a tool issue. Regards, Martin
"Martin v. Löwis" wrote:
Now, one may wonder what precisely a "possibly signed floating point number" is, but most likely, this refers to
floatnumber   ::= pointfloat | exponentfloat
pointfloat    ::= [intpart] fraction | intpart "."
exponentfloat ::= (intpart | pointfloat) exponent
intpart       ::= digit+
fraction      ::= "." digit+
exponent      ::= ("e" | "E") ["+" | "-"] digit+
digit         ::= "0"..."9"
I don't see why the language spec should limit the wealth of number formats supported by float().
If it doesn't, there should be some other specification of what is correct and what is not. It must not be unspecified.
True.
It is not uncommon for Asians and other non-Latin script users to use their own native script symbols for numbers. Just because these digits may look strange to someone doesn't mean that they are meaningless or should be discarded.
Then these users should speak up and indicate their need, or somebody should speak up and confirm that there are users who actually want '١٢٣٤.٥٦' to denote 1234.56. To my knowledge, there is no writing system in which '١٢٣٤.٥٦e4' means 12345600.0.
I'm not sure what you're after here.
Please also remember that Python3 now allows Unicode names for identifiers for much the same reasons.
No no no. Addition of Unicode identifiers has a well-designed, deliberate specification, with a PEP and all. The support for non-ASCII digits in float appears to be ad-hoc, and not founded on actual needs of actual users.
Please note that we didn't have PEPs and the PEP process at the time. The Unicode proposal predates and in some respects inspired the PEP process. The decision to add this support was deliberate, based on the desire to support as many of the nice features of Unicode in Python as we could. At least that was what was driving me at the time.

Regarding actual needs of actual users: I don't buy that as an argument when it comes to supporting a standard that is meant to attract users with non-ASCII origins. Some references you may want to read up on:

http://en.wikipedia.org/wiki/Numbers_in_Chinese_culture
http://en.wikipedia.org/wiki/Vietnamese_numerals
http://en.wikipedia.org/wiki/Korean_numerals
http://en.wikipedia.org/wiki/Japanese_numerals

Even MS Office supports them:

http://languages.siuc.edu/Chinese/Language_Settings.html
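A caveat worth noting (my own check, not stated in the thread): the numeral systems on those pages are mostly *not* covered by int()/float(), which only accept characters carrying the Unicode decimal-digit property. CJK ideographic numerals have a numeric value in the UCD but are not decimal digits:

```python
import unicodedata

# U+4E94 (五, "five") has a numeric value but no decimal-digit value
print(unicodedata.numeric('\u4e94'))        # 5.0
try:
    int('\u4e94')
except ValueError:
    print('int() rejects ideographic numerals')
```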
Note that the support in float() (and the other numeric constructors) to work with Unicode code points was explicitly added when Unicode support was added to Python and has been available since Python 1.6.
That doesn't necessarily make it useful. Alexander's complaint is that it makes Python unstable (i.e. changing as the UCD changes).
If that were true, then all Unicode database (UCD) changes would make Python unstable. However, most changes to existing code points in the UCS are bug fixes, so they actually have a stabilizing quality more than a destabilizing one.
It is not a bug by any definition of "bug".
Most certainly it is: the documentation is either underspecified, or deviates from the implementation (when taking the most plausible interpretation). This is the very definition of "bug".
The implementation is not a bug and neither was this a bug in the 2.x series of the Python documentation. The Python 3.x docs apparently introduced a reference to the language spec which is clearly not capturing the wealth of possible inputs. So, yes, we're talking about a documentation bug, but not an implementation bug.

-- 
Marc-Andre Lemburg
eGenix.com Professional Python Services directly from the Source (#1, Nov 29 2010)
Python/Zope Consulting and Support ... http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/
Then these users should speak up and indicate their need, or somebody should speak up and confirm that there are users who actually want '١٢٣٤.٥٦' to denote 1234.56. To my knowledge, there is no writing system in which '١٢٣٤.٥٦e4' means 12345600.0.
I'm not sure what you're after here.
That the current float() constructor accepts tons of bogus character strings and accepts them as numbers, and that it should stop doing so.
The decision to add this support was deliberate based on the desire to support as much of the nice features of Unicode in Python as we could. At least that was what was driving me at the time.
At the time, this may have been the right thing to do. With the experience gained, we should now conclude to revert this particular aspect.
Some references you may want to read up on:
http://en.wikipedia.org/wiki/Numbers_in_Chinese_culture
http://en.wikipedia.org/wiki/Vietnamese_numerals
http://en.wikipedia.org/wiki/Korean_numerals
http://en.wikipedia.org/wiki/Japanese_numerals
I don't question that people use non-ASCII characters to denote numbers. I claim that the specific support in Python for that has no connection to reality. I further claim that the use of non-ASCII numbers is a local convention, and that if you provide a library to parse numbers, users (of that library) will somehow have to specify which notational convention(s) is reasonable for the input they have.
Even MS Office supports them:
That's printing, though, not parsing. Notice that Python does *not* currently support printing numbers in other scripts - even though this may actually be more useful than parsing.
Note that the support in float() (and the other numeric constructors) to work with Unicode code points was explicitly added when Unicode support was added to Python and has been available since Python 1.6.
That doesn't necessarily make it useful. Alexander's complaint is that it makes Python unstable (i.e. changing as the UCD changes).
If that were true, then all Unicode database (UCD) changes would make Python unstable.
That's indeed the case - they do (see the recent bug report on white space processing). However, any change makes Python unstable (in the sense that it can potentially break existing applications), and, in many cases, the risk of breaking something is well worth it. In the case of number parsing, I think Python would be better if float() rejected non-ASCII strings, and any support for such parsing should be redone correctly in a different place (preferably along with printing of numbers).
Most certainly it is: the documentation is either underspecified, or deviates from the implementation (when taking the most plausible interpretation). This is the very definition of "bug".
The implementation is not a bug and neither was this a bug in the 2.x series of the Python documentation.
Of course the 2.x documentation is wrong, in that it is severely underspecified, and the most straight-forward interpretation of the specific wording gives an incorrect impression of the implementation.
The Python 3.x docs apparently introduced a reference to the language spec which is clearly not capturing the wealth of possible inputs.
Right - but only because the 2.x documentation *already* suggested that the supported syntax matches the literal syntax - as that's the most natural thing to assume. Regards, Martin
Martin v. Löwis wrote:
Then these users should speak up and indicate their need, or somebody should speak up and confirm that there are users who actually want '١٢٣٤.٥٦' to denote 1234.56. To my knowledge, there is no writing system in which '١٢٣٤.٥٦e4' means 12345600.0. I'm not sure what you're after here.
That the current float() constructor accepts tons of bogus character strings and accepts them as numbers, and that it should stop doing so.
What bogus characters do the float() and int() constructors accept? As far as I can see, they only accept numerals. [...]
Notice that Python does *not* currently support printing numbers in other scripts - even though this may actually be more useful than parsing.
Lack of one function, even if more useful, does not imply that an existing function should be removed. [...]
In the case of number parsing, I think Python would be better if float() rejected non-ASCII strings, and any support for such parsing should be redone correctly in a different place (preferably along with printing of numbers).
So your problems with the current behaviour are:

(1) in some unspecified way, it's not done correctly;
(2) it belongs somewhere other than float() and int().

That second is awfully close to bike-shedding. Since you accept that Python *should* have the current behaviour, and Python *already* has the current behaviour, it seems strange that you are kicking up such a fuss merely to *move* the implementation of that behaviour out of the numeric constructors into some unspecified "different place".

I think it would be constructive to explain:

- how the current behaviour is incorrect;
- your suggestions for correcting it; and
- a concrete suggestion for where you would like to see the behaviour moved to, and why that would be better than where it currently is.

-- 
Steven
Am 02.12.2010 22:30, schrieb Steven D'Aprano:
Martin v. Löwis wrote:
Then these users should speak up and indicate their need, or somebody should speak up and confirm that there are users who actually want '١٢٣٤.٥٦' to denote 1234.56. To my knowledge, there is no writing system in which '١٢٣٤.٥٦e4' means 12345600.0. I'm not sure what you're after here.
That the current float() constructor accepts tons of bogus character strings and accepts them as numbers, and that it should stop doing so.
What bogus characters do the float() and int() constructors accept? As far as I can see, they only accept numerals.
Not bogus characters, but bogus character strings. E.g. strings that mix digits from different scripts, and mix them with the Python decimal separator.
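[Editorial illustration of such a string; behavior as observed on CPython 3.x, where each category-Nd digit is translated individually, whatever script it came from:]

```python
# One ASCII digit, one Arabic-Indic digit (U+0662), one Devanagari digit
# (U+0969), glued together with the Python decimal separator -- a string
# that denotes a number in no writing system, yet float() accepts it:
mixed = '1\u0662.\u0969'
print(float(mixed))   # 12.3
```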
Notice that Python does *not* currently support printing numbers in other scripts - even though this may actually be more useful than parsing.
Lack of one function, even if more useful, does not imply that an existing function should be removed.
No. But if the specific function(ality) is not useful and underspecified, it should be removed.
So your problems with the current behaviour are:
(1) in some unspecified way, it's not done correctly;
No. My main concern is that it is not properly specified. If it was specified, I could then tell you what precisely is wrong about it. Right now, I can only give examples for input that it should not accept, and examples of input that it should, but does not accept.
(2) it belongs somewhere other than float() and int().
That's only because it also needs a parameter to specify what syntax to follow, somehow. That parameter could be explicit or implicit, and it could be to float or to some other function. But it must be available, and is not.
That second is awfully close to bike-shedding. Since you accept that Python *should* have the current behaviour
No, I don't. I think it behaves incorrectly, accepting garbage input and guessing some meaning out of it.
- how the current behaviour is incorrect;
See above: it accepts strings that do not denote real numbers in any writing system, and, despite the claim that the feature is there to support other writing systems, actually does not truly support other writing systems.
- your suggestions for correcting it; and
Make the current implementation exactly match the current documentation. I think the documentation is correct; the implementation is wrong.
- a concrete suggestion for where you would like to see the behaviour moved to, and why that would be better than where it currently is.
The current behavior should go nowhere; it is not useful. Something very similar to the current behavior (but done correctly) should go into the locale module. Regards, Martin
On 12/2/2010 4:48 PM, "Martin v. Löwis" wrote:
Am 02.12.2010 22:30, schrieb Steven D'Aprano:
Martin v. Löwis wrote:
Then these users should speak up and indicate their need, or somebody should speak up and confirm that there are users who actually want '١٢٣٤.٥٦' to denote 1234.56. To my knowledge, there is no writing system in which '١٢٣٤.٥٦e4' means 12345600.0. I'm not sure what you're after here.
That the current float() constructor accepts tons of bogus character strings and accepts them as numbers, and that it should stop doing so.
What bogus characters do the float() and int() constructors accept? As far as I can see, they only accept numerals.
Not bogus characters, but bogus character strings. E.g. strings that mix digits from different scripts, and mix them with the Python decimal separator.
Notice that Python does *not* currently support printing numbers in other scripts - even though this may actually be more useful than parsing.
Lack of one function, even if more useful, does not imply that an existing function should be removed.
No. But if the specific function(ality) is not useful and underspecified, it should be removed.
So your problems with the current behaviour are:
(1) in some unspecified way, it's not done correctly;
No. My main concern is that it is not properly specified. If it was specified, I could then tell you what precisely is wrong about it. Right now, I can only give examples for input that it should not accept, and examples of input that it should, but does not accept.
(2) it belongs somewhere other than float() and int().
That's only because it also needs a parameter to specify what syntax to follow, somehow. That parameter could be explicit or implicit, and it could be to float or to some other function. But it must be available, and is not.
That second is awfully close to bike-shedding. Since you accept that Python *should* have the current behaviour
No, I don't. I think it behaves incorrectly, accepting garbage input and guessing some meaning out of it.
- how the current behaviour is incorrect;
See above: it accepts strings that do not denote real numbers in any writing system, and, despite the claim that the feature is there to support other writing systems, actually does not truly support other writing systems.
- your suggestions for correcting it; and
Make the current implementation exactly match the current documentation. I think the documentation is correct; the implementation is wrong.
- a concrete suggestion for where you would like to see the behaviour moved to, and why that would be better than where it currently is.
The current behavior should go nowhere; it is not useful. Something very similar to the current behavior (but done correctly) should go into the locale module.
I agree with everything Martin says here. I think the basic premise is: you won't find strings "in the wild" that use non-ASCII digits but do use the ASCII dot as a decimal point. And that's what float() is looking for. (And that doesn't even begin to address what it expects for an exponent 'e'.)

Eric.
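[Editorial aside: that premise is easy to check interactively. Behavior as observed on CPython 3.x, using ARABIC DECIMAL SEPARATOR U+066B as one example of a native separator float() knows nothing about:]

```python
# Arabic-Indic digits combined with the ASCII dot: float() accepts this.
hybrid = '\u0661\u0662\u0663\u0664.\u0665\u0666'     # looks like ١٢٣٤.٥٦
print(float(hybrid))                                  # 1234.56

# The same digits with their native decimal separator: float() rejects it.
native = '\u0661\u0662\u0663\u0664\u066b\u0665\u0666'
try:
    float(native)
except ValueError as exc:
    print('rejected:', type(exc).__name__)
```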
Eric Smith wrote:
The current behavior should go nowhere; it is not useful. Something very similar to the current behavior (but done correctly) should go into the locale module.
I agree with everything Martin says here. I think the basic premise is: you won't find strings "in the wild" that use non-ASCII digits but do use the ASCII dot as a decimal point. And that's what float() is looking for. (And that doesn't even begin to address what it expects for an exponent 'e'.)
http://en.wikipedia.org/wiki/Decimal_mark

"In China, comma and space are used to mark digit groups because dot is used as decimal mark."

Note that float() can also parse integers, it just returns them as floats :-)

-- 
Marc-Andre Lemburg, eGenix.com (#1, Dec 02 2010)
On 12/2/2010 5:43 PM, M.-A. Lemburg wrote:
Eric Smith wrote:
The current behavior should go nowhere; it is not useful. Something very similar to the current behavior (but done correctly) should go into the locale module.
I agree with everything Martin says here. I think the basic premise is: you won't find strings "in the wild" that use non-ASCII digits but do use the ASCII dot as a decimal point. And that's what float() is looking for. (And that doesn't even begin to address what it expects for an exponent 'e'.)
http://en.wikipedia.org/wiki/Decimal_mark
"In China, comma and space are used to mark digit groups because dot is used as decimal mark."
Is that an ASCII dot? That page doesn't say.
Note that float() can also parse integers, it just returns them as floats :-)
:)
Eric Smith wrote:
On 12/2/2010 5:43 PM, M.-A. Lemburg wrote:
Eric Smith wrote:
The current behavior should go nowhere; it is not useful. Something very similar to the current behavior (but done correctly) should go into the locale module.
I agree with everything Martin says here. I think the basic premise is: you won't find strings "in the wild" that use non-ASCII digits but do use the ASCII dot as a decimal point. And that's what float() is looking for. (And that doesn't even begin to address what it expects for an exponent 'e'.)
http://en.wikipedia.org/wiki/Decimal_mark
"In China, comma and space are used to mark digit groups because dot is used as decimal mark."
Is that an ASCII dot? That page doesn't say.
Yes, but to be fair: I think the page actually refers to the use of the Arabic numeral format in China, rather than to their own script symbols.
Note that float() can also parse integers, it just returns them as floats :-)
:)
-- 
Marc-Andre Lemburg, eGenix.com (#1, Dec 02 2010)
Am 02.12.2010 23:43, schrieb M.-A. Lemburg:
Eric Smith wrote:
The current behavior should go nowhere; it is not useful. Something very similar to the current behavior (but done correctly) should go into the locale module.
I agree with everything Martin says here. I think the basic premise is: you won't find strings "in the wild" that use non-ASCII digits but do use the ASCII dot as a decimal point. And that's what float() is looking for. (And that doesn't even begin to address what it expects for an exponent 'e'.)
http://en.wikipedia.org/wiki/Decimal_mark
"In China, comma and space are used to mark digit groups because dot is used as decimal mark."
I may be misinterpreting that, but I think it refers to the case of writing numbers using Arabic digits. "Chinese" digits are, e.g., used in the Suzhou numerals: http://en.wikipedia.org/wiki/Suzhou_numerals

This system doesn't have a decimal point at all. Instead, the second line (below or to the left of the actual digits) gives the power of ten and the unit of measurement (i.e. similar to scientific notation, but with ideographs for the powers of ten).

In another writing system, 点 (U+70B9) is used as the decimal separator; see http://en.wikipedia.org/wiki/Chinese_numerals#Fractional_values In the same system, the integral part uses multipliers, i.e. 12345 is [1][10000][2][1000][3][100][4][10][5]; the fractional part uses regular digits.

Regards, Martin
On Thu, Dec 2, 2010 at 8:23 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
In the case of number parsing, I think Python would be better if float() rejected non-ASCII strings, and any support for such parsing should be redone correctly in a different place (preferably along with printing of numbers).
+1. The set of strings currently accepted by the float constructor just seems too ad hoc to be at all useful. Apart from the decimal separator issue, and the question of exactly which decimal digits are accepted and which aren't, there are issues like this one:

>>> x = '\uff11\uff25\uff0b\uff11\uff10'
>>> x
'1E+10'
>>> float(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'decimal' codec can't encode character '\uff25' in position 1: invalid decimal Unicode string

>>> y = '\uff11E+\uff11\uff10'
>>> y
'1E+10'
>>> float(y)
10000000000.0
That is, fullwidth *digits* are allowed, but none of the other characters can be fullwidth variants. Unfortunately, a float string doesn't consist solely of digits, and it seems to me to make little sense to allow variation in the digits without allowing corresponding variations in the other characters that might appear ('.', 'e', 'E', '+', '-').

A couple of slightly trickier decisions:

(1) The float constructor currently does accept leading and trailing whitespace; should it allow any Unicode whitespace characters here? I'd say yes.

(2) For int() rather than float(), there's a bit more value in allowing the variant digits, since it provides an easy way to interpret those digits. The decimal module currently makes use of this, for example (the decimal spec requires that non-European digits be accepted). I'd be happier if this functionality were moved elsewhere, though.

The int constructor is, if anything, currently worse off than float, thanks to its attempts to support non-decimal bases. There's value in having an easy-to-specify, easy-to-maintain API for these basic builtin functions. For one thing, it helps non-CPython implementations.

[MAL]
The Python 3.x docs apparently introduced a reference to the language spec which is clearly not capturing the wealth of possible inputs.
That documentation update was my fault; I was motivated to make the update by issues unrelated to this one (mostly to do with Python 3's more consistent handling of inf and nan, as a result of all the new float<->string conversion code). If I'd been thinking harder, I would have remembered that float accepted the non-European digits and added a note to that effect. This (unintentional) omission does underline the point that it's difficult right now to document and understand exactly what the float constructor does or doesn't accept. Mark
On Thu, Dec 2, 2010 at 4:57 PM, Mark Dickinson <dickinsm@gmail.com> wrote: ..
(the decimal spec requires that non-European digits be accepted).
Mark,

I think *requires* is too strong a word to describe what the spec says. The decimal module documentation refers to two authorities:

1. IBM's General Decimal Arithmetic Specification
2. IEEE standard 854-1987

The IEEE standard predates Unicode and unsurprisingly does not have anything related to the issue. IBM's spec says the following in the Conversions section:

"""
It is recommended that implementations also provide additional number formatting routines (including some which are locale-dependent), and if available should accept non-European decimal digits in strings.
"""

http://speleotrove.com/decimal/daconvs.html

This cannot possibly be interpreted as normative text. The emphasis is clearly on "formatting routines", with "non-European decimal digits" added as an afterthought. This recommendation can reasonably be interpreted as a requirement that conversion routines should accept what formatting routines can produce. In Python there are no formatting routines to produce non-European numerals, so there is no requirement to accept them in conversions.

I don't think the decimal module should support non-European decimal digits. The only place where it can make some sense is in int(), because here we have a fighting chance of producing a reasonable definition. The motivating use case is conversion of numerical data extracted from text using simple '\d+' regex matches.

Here is how I would do it:

1. A string x of non-European decimal digits is only accepted by int(x), but not by int(x, 0) or int(x, 10).

2. If x contains one or more non-European digits, then

   (a) all digits must be from the same block:

       def basepoint(c):
           return ord(c) - unicodedata.digit(c)
       all(basepoint(c) == basepoint(x[0]) for c in x) -> True

   (b) a '+' or '-' sign is not allowed.

3. A character c is a digit if it matches the '\d' regex. I think this means unicodedata.category(c) -> 'Nd'.
Condition 2(b) is important because there is no clear way to define what is acceptable as '+' or '-' using Unicode character properties, and not all number systems even support a local form of negation. (It is also YAGNI.)
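[The rules above can be sketched as a standalone function. `basepoint` is taken from the message itself; the name `native_int` is mine, and rule 1's distinction between int(x) and int(x, 10) is not modeled by a plain function:]

```python
import unicodedata

def basepoint(c):
    # Offset of the script block: identical for all ten digits of one
    # contiguous Nd run, so it identifies which script a digit came from.
    return ord(c) - unicodedata.digit(c)

def native_int(x):
    # Rule 3: every character must be a decimal digit (category 'Nd').
    if not x or any(unicodedata.category(c) != 'Nd' for c in x):
        raise ValueError('not a plain run of decimal digits: %r' % x)
    # Rule 2(a): all digits must come from the same block.
    if any(basepoint(c) != basepoint(x[0]) for c in x):
        raise ValueError('digits from more than one script: %r' % x)
    # Rule 2(b) holds implicitly: '+' and '-' are not category 'Nd'.
    value = 0
    for c in x:
        value = value * 10 + unicodedata.digit(c)
    return value
```

With this sketch, native_int('١٢٣') and native_int('123') both yield 123, while a mixed-script string like '١' + '1', or a signed string like '+123', is rejected.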
On Fri, Dec 3, 2010 at 12:10 AM, Alexander Belopolsky <alexander.belopolsky@gmail.com> wrote: ..
I don't think decimal module should support non-European decimal digits. The only place where it can make some sense is in int() because here we have a fighting chance of producing a reasonable definition. The motivating use case is conversion of numerical data extracted from text using simple '\d+' regex matches.
It turns out, this use case does not quite work in Python either:

>>> re.compile(r'\s+(\d+)\s+').match(' \u2081\u2082\u2083 ').group(1)
'₁₂₃'
>>> int(_)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'decimal' codec can't encode character '\u2081' in position 0: invalid decimal Unicode string

This may actually be a bug in Python's regex implementation, because the Unicode standard seems to recommend that '\d' be interpreted as gc = Decimal_Number (Nd):

http://unicode.org/reports/tr18/#Compatibility_Properties

I actually wonder if Python's re module can claim to provide even Basic Unicode Support.

http://unicode.org/reports/tr18/#Basic_Unicode_Support
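[Editorial aside: the subscript digits in that example carry the Unicode digit property (Numeric_Type=Digit) but have general category No, not Nd. The unicodedata module and the str methods show the distinction directly; whether '\d' matches such characters has depended on the re implementation:]

```python
import unicodedata

s = '\u2081\u2082\u2083'            # SUBSCRIPT ONE, TWO, THREE
print(unicodedata.category(s[0]))   # 'No': numeric, but not a decimal digit
print(s.isdigit(), s.isdecimal())   # True False: digit property, no decimal value
try:
    int(s)                          # int() only transforms category-Nd digits
except ValueError as exc:           # UnicodeEncodeError is a ValueError subclass
    print('int() rejects it:', type(exc).__name__)
```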
Here is how I would do it:
1. String x of non-European decimal digits is only accepted in int(x), but not by int(x, 0) or int(x, 10).
2. If x contains one or more non-European digits, then
(a) all digits must be from the same block:
def basepoint(c):
    return ord(c) - unicodedata.digit(c)
all(basepoint(c) == basepoint(x[0]) for c in x) -> True
(b) a '+' or '-' sign is not allowed.
3. A character c is a digit if it matches '\d' regex. I think this means unicodedata.category(c) -> 'Nd'.
Condition 2(b) is important because there is no clear way to define what is acceptable as '+' or '-' using Unicode character properties, and not all number systems even support a local form of negation. (It is also YAGNI.)
On Sat, Dec 4, 2010 at 5:58 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
I actually wonder if Python's re module can claim to provide even Basic Unicode Support.
Do you really wonder? Most definitely it does not.
Were you more optimistic four years ago? http://bugs.python.org/issue1528154#msg54864 I was hoping that regex syntax would be useful in explaining/documenting Python text processing routines (including string to number conversions) without a heavy dose of Unicode terminology.
2010/12/7 Alexander Belopolsky <alexander.belopolsky@gmail.com>:
On Sat, Dec 4, 2010 at 5:58 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
I actually wonder if Python's re module can claim to provide even Basic Unicode Support.
Do you really wonder? Most definitely it does not.
Were you more optimistic four years ago?
http://bugs.python.org/issue1528154#msg54864
I was hoping that regex syntax would be useful in explaining/documenting Python text processing routines (including string to number conversions) without a heavy dose of Unicode terminology.
The new regex version http://bugs.python.org/issue2636 supports many more features, including Unicode properties, the mentioned POSIX classes etc., but definitely not all of the requirements of that rather "generous" list: http://www.unicode.org/reports/tr18/

It seems, e.g. in Perl, there are some omissions too: http://perldoc.perl.org/perlunicode.html#Unicode-Regular-Expression-Support-...

Do you know of any re engine fully complying with TR18, even at the first level, "Basic Unicode Support"?

vbr
On Tue, Dec 7, 2010 at 8:02 AM, Vlastimil Brom <vlastimil.brom@gmail.com> wrote: ..
It seems, e.g. in Perl, there are some omissions too http://perldoc.perl.org/perlunicode.html#Unicode-Regular-Expression-Support-...
Do you know of any re engine fully complying with TR18, even at the first level: "Basic Unicode Support"?
I would say Perl comes very close. At least it explicitly documents the missing features and offers workarounds in its reference manual. I am actually not as concerned about missing features as I am about non-conformance in widely used features, such as matching digits with '\d'.
On Tue, Dec 7, 2010 at 8:02 AM, Vlastimil Brom <vlastimil.brom@gmail.com> wrote: ..
Do you know of any re engine fully complying with TR18, even at the first level: "Basic Unicode Support"?
""" ICU Regular Expressions conform to Unicode Technical Standard #18 , Unicode Regular Expressions, level 1, and in addition include Default Word boundaries and Name Properties from level 2. """ http://userguide.icu-project.org/strings/regexp
2010/12/7 Alexander Belopolsky <alexander.belopolsky@gmail.com>:
On Tue, Dec 7, 2010 at 8:02 AM, Vlastimil Brom <vlastimil.brom@gmail.com> wrote: ..
Do you know of any re engine fully complying with TR18, even at the first level: "Basic Unicode Support"?
""" ICU Regular Expressions conform to Unicode Technical Standard #18 , Unicode Regular Expressions, level 1, and in addition include Default Word boundaries and Name Properties from level 2. """ http://userguide.icu-project.org/strings/regexp
Thanks for the pointer, I wasn't aware of that project. Anyway I am quite happy with the mentioned regex library and can live with trading this full compliance for some non-unicode goodies (like unbounded lookbehinds etc.), but I see, it's beyond the point here. Not that my opinion matters, but I can't think of, say, "union, intersection and set-difference of Unicode sets" as a basic feature or consider it a part of "a minimal level for useful Unicode support." vbr
Am 07.12.2010 04:03, schrieb Alexander Belopolsky:
On Sat, Dec 4, 2010 at 5:58 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
I actually wonder if Python's re module can claim to provide even Basic Unicode Support.
Do you really wonder? Most definitely it does not.
Were you more optimistic four years ago?
Not at all. I thought back then, and think now, that Python should, but doesn't, support TR#18. I don't view that lack as a severe problem, though, and apparently none of the other contributors did so, either. Regards, Martin
2010/11/28 M.-A. Lemburg <mal@egenix.com>:
"Martin v. Löwis" wrote:
> float('١٢٣٤.٥٦')
> 1234.56
I think it's a bug that this works. The definition of the float builtin says
Convert a string or a number to floating point. If the argument is a string, it must contain a possibly signed decimal or floating point number, possibly embedded in whitespace. The argument may also be '[+|-]nan' or '[+|-]inf'.
Now, one may wonder what precisely a "possibly signed floating point number" is, but most likely, this refers to
floatnumber ::= pointfloat | exponentfloat pointfloat ::= [intpart] fraction | intpart "." exponentfloat ::= (intpart | pointfloat) exponent intpart ::= digit+ fraction ::= "." digit+ exponent ::= ("e" | "E") ["+" | "-"] digit+ digit ::= "0"..."9"
I don't see why the language spec should limit the wealth of number formats supported by float().
It is not uncommon for Asians and other non-Latin script users to use their own native script symbols for numbers. Just because these digits may look strange to someone doesn't mean that they are meaningless or should be discarded.
That's different. Python doesn't assign any semantic meaning to the characters in identifiers. Non-Latin support for numerals, though, could change the meaning of a program dramatically and needs to be well-specified. Whether int() should do this is debatable. I, for one, think this kind of support belongs in the locale module.

-- 
Regards,
Benjamin
On Sun, 28 Nov 2010 17:23:01 -0600 Benjamin Peterson <benjamin@python.org> wrote:
2010/11/28 M.-A. Lemburg <mal@egenix.com>:
"Martin v. Löwis" wrote:
>> float('١٢٣٤.٥٦')
>> 1234.56
I think it's a bug that this works. The definition of the float builtin says
Convert a string or a number to floating point. If the argument is a string, it must contain a possibly signed decimal or floating point number, possibly embedded in whitespace. The argument may also be '[+|-]nan' or '[+|-]inf'.
Now, one may wonder what precisely a "possibly signed floating point number" is, but most likely, this refers to
floatnumber ::= pointfloat | exponentfloat pointfloat ::= [intpart] fraction | intpart "." exponentfloat ::= (intpart | pointfloat) exponent intpart ::= digit+ fraction ::= "." digit+ exponent ::= ("e" | "E") ["+" | "-"] digit+ digit ::= "0"..."9"
I don't see why the language spec should limit the wealth of number formats supported by float().
It is not uncommon for Asians and other non-Latin script users to use their own native script symbols for numbers. Just because these digits may look strange to someone doesn't mean that they are meaningless or should be discarded.
That's different. Python doesn't assign any semantic meaning to the characters in identifiers. Non-Latin support for numerals, though, could change the meaning of a program dramatically and needs to be well-specified. Whether int() should do this is debatable.
Perhaps int(), float(), Decimal() and friends could take an optional parameter indicating whether non-ASCII digits are considered. It would then satisfy all parties.

Antoine.
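[Sketched as a wrapper; the function and the keyword name `allow_nonascii` are purely hypothetical illustrations of the suggestion, not an existing or proposed API:]

```python
def parse_float(s, allow_nonascii=False):
    # Hypothetical opt-in: behave ASCII-only by default, and only consider
    # non-ASCII digits when the caller explicitly asks for them.
    if isinstance(s, str) and not allow_nonascii and not s.isascii():
        raise ValueError('non-ASCII input not enabled: %r' % s)
    return float(s)   # delegate to the existing (permissive) parser
```

Under this sketch, parse_float('1.5') behaves like today's float() on ASCII input, while parse_float('١٢٣٤.٥٦', allow_nonascii=True) opts in to the current Unicode-digit behavior.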
On Sun, Nov 28, 2010 at 7:01 PM, Antoine Pitrou <solipsis@pitrou.net> wrote: ..
That's different. Python doesn't assign any semantic meaning to the characters in identifiers. The non-latin support for numerals, though, could change the meaning of a program dramatically and needs to be well-specified. Whether int() should do this is debatable.
Perhaps int(), float(), Decimal() and friends could take an optional parameter indicating whether non-ascii digits are considered. It would then satisfy all parties.
What parties? I don't think anyone has claimed to actually have used non-ASCII digits with float(). Of course it is fun that Python can process Bengali numerals, but so would be allowing Roman numerals. There is a reason why after careful consideration, PEP 313 was ultimately rejected. BTW, it is common in Russia to specify months using Roman numerals. Maybe we should consider allowing datetime.date() to accept '1.IV.2011'.
Perhaps int(), float(), Decimal() and friends could take an optional parameter indicating whether non-ASCII digits are considered. It would then satisfy all parties.
What parties? I don't think anyone has claimed to actually have used non-ASCII digits with float().
Have you done a poll of all Python 3 users?
Of course it is fun that Python can process Bengali numerals, but so would be allowing Roman numerals. There is a reason why after careful consideration, PEP 313 was ultimately rejected.
That's mostly irrelevant. This feature exists and someone, somewhere, may be using it. We normally don't remove stuff without deprecation. Antoine.
Alexander Belopolsky <alexander.belopolsky@gmail.com> writes:
On Sun, Nov 28, 2010 at 7:01 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Perhaps int(), float(), Decimal() and friends could take an optional parameter indicating whether non-ASCII digits are considered. It would then satisfy all parties.
What parties? I don't think anyone has claimed to actually have used non-ASCII digits with float().
Rather, it has been pointed out that there is an unknown amount of existing code which does that. You're not going to know how much or how little from this forum.
Of course it is fun that Python can process Bengali numerals, but so would be allowing Roman numerals. There is a reason why after careful consideration, PEP 313 was ultimately rejected.
Rejecting a proposed *new* capability is a different matter from disabling an *existing* capability which works in existing Python releases. -- Ben Finney
On Sun, Nov 28, 2010 at 7:55 PM, Ben Finney <ben+python@benfinney.id.au> wrote: ..
Of course it is fun that Python can process Bengali numerals, but so would be allowing Roman numerals. There is a reason why after careful consideration, PEP 313 was ultimately rejected.
Rejecting a proposed *new* capability is a different matter from disabling an *existing* capability which works in existing Python releases.
Was this capability ever documented? It does not feel like a deliberate feature. If it was, '\N{ARABIC DECIMAL SEPARATOR}' would be accepted in Arabic-Indic notation. It feels more like a CPython implementation detail, similar to, say:
int('10') is 10
True
int('10000') is 10000
False
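The point about the decimal separator is easy to check (observed behaviour of current CPython 3; the exact error message varies between versions):

```python
# Arabic-Indic digits are accepted...
print(float('\u0661\u0662\u0663\u0664.\u0665\u0666'))   # 1234.56

# ...but U+066B ARABIC DECIMAL SEPARATOR is not, so the "feature"
# cannot actually parse a number written fully in Arabic-Indic notation.
try:
    float('\u0661\u0662\u0663\u0664\u066b\u0665\u0666')
except ValueError as exc:
    print('rejected:', exc)
```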
Note that the underlying PyUnicode_EncodeDecimal() function is described in the unicodeobject.h header file as follows:

/* --- Decimal Encoder ---------------------------------------------------- */
/* Takes a Unicode string holding a decimal value and writes it into an
   output buffer using standard ASCII digit codes.
   ..
   The encoder converts whitespace to ' ', decimal characters to their
   corresponding ASCII digit and all other Latin-1 characters except \0
   as-is. Characters outside this range (Unicode ordinals 1-256) are
   treated as errors. This includes embedded NULL bytes. */

So the support for non-ASCII digits is accidental and should be treated as a bug.
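The transform described in that header comment can be sketched in pure Python (an illustrative reimplementation for clarity, not the actual C code):

```python
import unicodedata

def encode_decimal(s):
    # Sketch of the documented transform: whitespace -> ' ', any Unicode
    # decimal digit -> its ASCII digit, other Latin-1 characters pass
    # through, everything else is an error.
    out = []
    for ch in s:
        if ch.isspace():
            out.append(' ')
            continue
        d = unicodedata.decimal(ch, None)
        if d is not None:
            out.append(str(d))
        elif ord(ch) < 256:
            out.append(ch)
        else:
            raise ValueError('invalid decimal Unicode string')
    return ''.join(out)

print(encode_decimal('\u0661\u0662\u0663\u0664.\u0665\u0666'))  # 1234.56
```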
Perhaps int(), float(), Decimal() and friends could take an optional parameter indicating whether non-ASCII digits are considered. It would then satisfy all parties.
Not really. I still would want to see what the actual requirement is: i.e. do any users actually have the desire to have these digits accepted, yet the alternative decimal points rejected? Regards, Martin
M.-A. Lemburg writes:
It is not uncommon for Asians and other non-Latin script users to use their own native script symbols for numbers.
Japanese don't, in computational or scientific work where float() would be used. Japanese numerals are used for dates and for certain felicitous ages (and even there, so-called "Arabic" numerals are perfectly acceptable). Otherwise, it's all ASCII (although they might be "full-width" compatibility variants).
Please also remember that Python3 now allows Unicode names for identifiers for much the same reasons.
I don't think it's the same reason, not for Japanese, anyway. I agree that Python should make it easy for the programmer to get numerical values of native numeric strings, but it's not at all clear to me that there is any point to having float() recognize them by default.
On Mon, Nov 29, 2010 at 1:39 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
I agree that Python should make it easy for the programmer to get numerical values of native numeric strings, but it's not at all clear to me that there is any point to having float() recognize them by default.
Indeed, as someone else suggested earlier in the thread, supporting non-ASCII digits sounds more like a job for the locale module than for the builtin types. Deprecating non-ASCII support in the latter, while ensuring it is properly supported in the former sounds like a better way forward than maintaining the status quo (starting in 3.3 though, with the first beta just around the corner, we don't want to be monkeying with this in 3.2) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Nick Coghlan wrote:
On Mon, Nov 29, 2010 at 1:39 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
I agree that Python should make it easy for the programmer to get numerical values of native numeric strings, but it's not at all clear to me that there is any point to having float() recognize them by default.
Indeed, as someone else suggested earlier in the thread, supporting non-ASCII digits sounds more like a job for the locale module than for the builtin types.
Deprecating non-ASCII support in the latter, while ensuring it is properly supported in the former sounds like a better way forward than maintaining the status quo (starting in 3.3 though, with the first beta just around the corner, we don't want to be monkeying with this in 3.2)
Since when do we only support certain Unicode features in specific locales? If we would go down that road, we would also have to disable other Unicode features based on locale, e.g. whether to apply non-ASCII case mappings, what to consider whitespace, etc. We don't do that for a good reason: Unicode is supposed to be universal and not limited to a single locale. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Nov 29 2010)
Python/Zope Consulting and Support ... http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/
On Mon, Nov 29, 2010 at 9:02 PM, M.-A. Lemburg <mal@egenix.com> wrote:
If we would go down that road, we would also have to disable other Unicode features based on locale, e.g. whether to apply non-ASCII case mappings, what to consider whitespace, etc.
We don't do that for a good reason: Unicode is supposed to be universal and not limited to a single locale.
Because parsing numbers is about more than just the characters used for the individual digits. There are additional semantics associated with digit ordering (for any number) and decimal separators and exponential notation (for floating point numbers), and those vary by locale. We deliberately chose to make the builtin numeric parsers unaware of all of those things, and assuming that we can simply parse other digits as if they were their ASCII equivalents and otherwise assume a C locale seems questionable.

If the existing semantics can be adequately defined, documented and defended, then retaining them would be fine. However, the language reference needs to define the behaviour properly so that other implementations know what they need to support and what can be chalked up as being just an implementation accident of CPython.

(As a point in the plus column, both decimal.Decimal and fractions.Fraction were able to handle the '١٢٣٤.٥٦' example in a manner consistent with the int and float handling)

Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
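The consistency Nick mentions is easy to observe in current CPython (all three constructors end up accepting Unicode Nd digits; the Fraction result below is simply 123456/100 reduced):

```python
from decimal import Decimal
from fractions import Fraction

s = '\u0661\u0662\u0663\u0664.\u0665\u0666'   # '١٢٣٤.٥٦'
print(float(s))      # 1234.56
print(Decimal(s))    # Decimal('1234.56')
print(Fraction(s))   # Fraction(30864, 25)
```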
Nick Coghlan wrote:
On Mon, Nov 29, 2010 at 9:02 PM, M.-A. Lemburg <mal@egenix.com> wrote:
If we would go down that road, we would also have to disable other Unicode features based on locale, e.g. whether to apply non-ASCII case mappings, what to consider whitespace, etc.
We don't do that for a good reason: Unicode is supposed to be universal and not limited to a single locale.
Because parsing numbers is about more than just the characters used for the individual digits. There are additional semantics associated with digit ordering (for any number) and decimal separators and exponential notation (for floating point numbers) and those vary by locale. We deliberately chose to make the builtin numeric parsers unaware of all of those things, and assuming that we can simply parse other digits as if they were their ASCII equivalents and otherwise assume a C locale seems questionable.
Sure, and those additional semantics are locale dependent, even between ASCII-only locales. However, that does not apply to the basic building blocks, the decimal digits themselves.
If the existing semantics can be adequately defined, documented and defended, then retaining them would be fine. However, the language reference needs to define the behaviour properly so that other implementations know what they need to support and what can be chalked up as being just an implementation accident of CPython. (As a point in the plus column, both decimal.Decimal and fractions.Fraction were able to handle the '١٢٣٤.٥٦' example in a manner consistent with the int and float handling)
The support is built into the C API, so there's not really much surprise there. Regarding documentation, we'd just have to add that numbers may be made up of Unicode code points in the category "Nd". See http://www.unicode.org/versions/Unicode5.2.0/ch04.pdf, section 4.6 for details....

""" Decimal digits form a large subcategory of numbers consisting of those digits that can be used to form decimal-radix numbers. They include script-specific digits, but exclude characters such as Roman numerals and Greek acrophonic numerals. (Note that <1, 5> = 15 = fifteen, but <I, V> = IV = four.) Decimal digits also exclude the compatibility subscript or superscript digits to prevent simplistic parsers from misinterpreting their values in context. """

int(), float() and long() (in Python2) are such simplistic parsers. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Nov 29 2010)
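The distinctions drawn in that passage map directly onto the general categories exposed by unicodedata (an illustrative check against the current UCD shipped with CPython):

```python
import unicodedata

# Nd = decimal digit; Nl/No cover the excluded cases the quote mentions.
for ch in ('7',        # ASCII digit
           '\u0667',   # ARABIC-INDIC DIGIT SEVEN
           '\u2167',   # ROMAN NUMERAL EIGHT -- excluded from Nd
           '\u00b2'):  # SUPERSCRIPT TWO    -- excluded from Nd
    print('U+%04X %s' % (ord(ch), unicodedata.category(ch)))
# prints Nd, Nd, Nl, No respectively
```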
On 11/29/2010 10:19 AM, M.-A. Lemburg wrote:
Nick Coghlan wrote:
On Mon, Nov 29, 2010 at 9:02 PM, M.-A. Lemburg<mal@egenix.com> wrote:
If we would go down that road, we would also have to disable other Unicode features based on locale, e.g. whether to apply non-ASCII case mappings, what to consider whitespace, etc.
We don't do that for a good reason: Unicode is supposed to be universal and not limited to a single locale.
Because parsing numbers is about more than just the characters used for the individual digits. There are additional semantics associated with digit ordering (for any number) and decimal separators and exponential notation (for floating point numbers) and those vary by locale. We deliberately chose to make the builtin numeric parsers unaware of all of those things, and assuming that we can simply parse other digits as if they were their ASCII equivalents and otherwise assume a C locale seems questionable.
Sure, and those additional semantics are locale dependent, even between ASCII-only locales. However, that does not apply to the basic building blocks, the decimal digits themselves.
If the existing semantics can be adequately defined, documented and defended, then retaining them would be fine. However, the language reference needs to define the behaviour properly so that other implementations know what they need to support and what can be chalked up as being just an implementation accident of CPython. (As a point in the plus column, both decimal.Decimal and fractions.Fraction were able to handle the '١٢٣٤.٥٦' example in a manner consistent with the int and float handling)
The support is built into the C API, so there's not really much surprise there.
Regarding documentation, we'd just have to add that numbers may be made up of an Unicode code point in the category "Nd".
See http://www.unicode.org/versions/Unicode5.2.0/ch04.pdf, section 4.6 for details....
""" Decimal digits form a large subcategory of numbers consisting of those digits that can be used to form decimal-radix numbers. They include script-specific digits, but exclude char- acters such as Roman numerals and Greek acrophonic numerals. (Note that<1, 5> = 15 = fifteen, but<I, V> = IV = four.) Decimal digits also exclude the compatibility subscript or superscript digits to prevent simplistic parsers from misinterpreting their values in context. """
int(), float() and long() (in Python2) are such simplistic parsers.
Since you are the knowledgeable advocate of the current behavior, perhaps you could open an issue and propose a doc patch, even if not .rst formatted. -- Terry Jan Reedy
On Mon, Nov 29, 2010 at 2:23 PM, Terry Reedy <tjreedy@udel.edu> wrote: ..
Since you are the knowledgeable advocate of the current behavior, perhaps you could open an issue and propose a doc patch, even if not .rst formatted.
I am not an advocate of the current behavior, but an issue for doc patches is at <http://bugs.python.org/issue10581>.
Terry Reedy wrote:
Since you are the knowledgeable advocate of the current behavior, perhaps you could open an issue and propose a doc patch, even if not .rst formatted.
Good suggestion. I tried to collect as much context as possible in http://bugs.python.org/issue10610. I'll leave the rst-magic to someone else, but will certainly help if you have more questions about the details. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Dec 02 2010)
On Mon, 29 Nov 2010 13:58:05 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
On Mon, Nov 29, 2010 at 1:39 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
I agree that Python should make it easy for the programmer to get numerical values of native numeric strings, but it's not at all clear to me that there is any point to having float() recognize them by default.
Indeed, as someone else suggested earlier in the thread, supporting non-ASCII digits sounds more like a job for the locale module than for the builtin types.
Not sure, really. For example, "\d" in a regular expression will match all Unicode digits, unless you pass the re.ASCII flag. The C locale mechanism generally does a poor job of supporting what MS seems to call "culture-specific" characteristics. Regards Antoine.
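Antoine's point about the re module is easy to demonstrate (current CPython 3 behaviour):

```python
import re

text = 'x = \u0661\u0662\u0663'   # Arabic-Indic digits

# \d matches any Unicode Nd digit by default on str patterns...
print(re.findall(r'\d+', text))            # ['١٢٣']

# ...and re.ASCII opts back into ASCII-only matching.
print(re.findall(r'\d+', text, re.ASCII))  # []
```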
Martin v. Löwis wrote:
float('١٢٣٤.٥٦')
1234.56
I think it's a bug that this works. The definition of the float builtin says [...]
I think that's a documentation bug rather than a coding bug. If Python wishes to limit the digits allowed in numeric *literals* to ASCII 0...9, that's one thing, but I think that the digits allowed in numeric *strings* should allow the full range of digits supported by the Unicode standard. The former ensures that literals in code are always readable; the latter allows users to enter numbers in their own number system. How could that be a bad thing? -- Steven
Steven D'Aprano <steve@pearwood.info> writes:
If Python wishes to limit the digits allowed in numeric *literals* to ASCII 0...9, that's one thing, but I think that the digits allowed in numeric *strings* should allow the full range of digits supported by the Unicode standard.
I assume you specifically mean that the numeric class constructors, like ‘int’ and ‘float’, should parse their input string such that any character Unicode defines as a numeric digit is mapped to the corresponding digit. That sounds attractive, but it raises questions about mixed notations, mixing digits from different writing systems, and probably other questions I haven't thought of. It's not something to make a simple yes-or-no decision on now, IMO. This sounds best suited to a PEP, which someone who cares enough can champion in ‘python-ideas’. -- Ben Finney
The former ensures that literals in code are always readable; the latter allows users to enter numbers in their own number system. How could that be a bad thing?
It's YAGNI, feature bloat. It gives the illusion of supporting something that actually isn't supported very well (namely, parsing local number strings). I claim that there is no meaningful application of this feature. Regards, Martin
On Mon, Nov 29, 2010 at 2:22 AM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
The former ensures that literals in code are always readable; the latter allows users to enter numbers in their own number system. How could that be a bad thing?
It's YAGNI, feature bloat. It gives the illusion of supporting something that actually isn't supported very well (namely, parsing local number strings). I claim that there is no meaningful application of this feature.
Speaking of YAGNI, does anyone want to defend
complex('١٢٣٤.٥٦j')
1234.56j

? Especially given that we reject complex('1234.56i'): http://bugs.python.org/issue10562
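For the record, the asymmetry looks like this in current CPython (the exact error message varies between versions):

```python
# Arabic-Indic digits are accepted in the complex() constructor...
print(complex('\u0661\u0662\u0663\u0664.\u0665\u0666j'))   # 1234.56j

# ...while the mathematicians' 'i' suffix is rejected outright.
try:
    complex('1234.56i')
except ValueError as exc:
    print('rejected:', exc)
```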
Alexander Belopolsky wrote:
On Mon, Nov 29, 2010 at 2:22 AM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
The former ensures that literals in code are always readable; the latter allows users to enter numbers in their own number system. How could that be a bad thing?
It's YAGNI, feature bloat. It gives the illusion of supporting something that actually isn't supported very well (namely, parsing local number strings). I claim that there is no meaningful application of this feature.
This is not about parsing local number strings, it's about parsing number strings represented using different scripts - besides en-US is a locale as well, ye know :-)
Speaking of YAGNI, does anyone want to defend
complex('١٢٣٤.٥٦j')
1234.56j
?
Yes. The same arguments apply. Just because ASCII-proponents may have a hard time reading such literals, doesn't mean that script users have the same trouble.
Especially given that we reject complex('1234.56i'):
We've had that discussion long before we had Unicode in Python. The main reason was that 'i' looked too similar to 1 in a number of fonts, which is why it was rejected for Python source code. However, I don't see any reason why we shouldn't accept both i and j for complex(), since the input to that constructor doesn't have to originate in Python source code. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Nov 29 2010)
M.-A. Lemburg writes:
Just because ASCII-proponents may have a hard time reading such literals,
That's not the point.
doesn't mean that script users have the same trouble.
The script users may have no trouble reading them, but that doesn't mean it's not a YAGNI. In Japanese, it's a YAGNI except in addresses on New Year cards and in dates, which could be handled by specialized modules, or by a generic module for extracting numeric information from general (as opposed to program) text. Neither of those is likely to appear in program text in a context where they would be used as a numeric literal.

In fact, Python *does* consider it a YAGNI for Han! Although my apartment number would be written "七〇四" on a New Year card, Python won't parse it as 704: unicodedata considers those digits to be Lo, except for "〇" which fails anyway because it's Nl, not Nd. (To add insult to injury, it doesn't even return numeric values for those characters, even though any Han-user would consider them numeric when used in isolation, except that Japanese would be likely to consider "〇" to be the non-numeric "maru" symbol, ie, circle, meaning "OK"!)

The whole concept of numeric in Unicode is a mess; why import that mess into Python? Can you give any examples where people do computation, keep books, or do nuclear physics in non-Arabic numerals? I suppose Arabic users might, but even there I suspect not.
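Stephen's observation about the Han characters can be verified with unicodedata in current CPython (the category assignments come straight from the UCD):

```python
import unicodedata

print(unicodedata.category('\u3007'))   # Nl: IDEOGRAPHIC NUMBER ZERO, '〇'
print(unicodedata.category('\u4e03'))   # Lo: '七' (seven)

# Neither character is in Nd, so int() refuses the apartment number.
try:
    int('\u4e03\u3007\u56db')           # '七〇四'
except ValueError as exc:
    print('not parsed:', exc)
```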
Alexander Belopolsky wrote:
Speaking of YAGNI, does anyone want to defend
complex('١٢٣٤.٥٦j')
1234.56j
*If* we allow float('١٢٣٤.٥٦') (as we currently do, though this is being disputed by some), then we should allow complex('١٢٣٤.٥٦j'). It would be silly for complex to be more restrictive than float.
Especially given that we reject complex('1234.56i'):
I don't understand why you use 'i' when Python uses 'j' as the symbol for imaginary numbers.
complex('1234.56j')
1234.56j
works fine. I have no problem with Python choosing one of i/j as the symbol for imaginary-1 and rejecting the other. I prefer i rather than j because my background is in maths rather than electrical engineering, but I can live with either. But in any case, please don't conflate the question of whether Python should accept j and/or i for complex numbers with the question of supporting non-arabic numerals. The two issues are unrelated. -- Steven
On Mon, Nov 29, 2010 at 5:09 PM, Steven D'Aprano <steve@pearwood.info> wrote: ..
But in any case, please don't conflate the question of whether Python should accept j and/or i for complex numbers with the question of supporting non-arabic numerals. The two issues are unrelated.
The two issues are related because they are both about how strict numerical constructors should be. If we want to accept wide variations in how numbers can be spelled, then surely using i for the imaginary unit is much more common than using ७ for the digit 7.

I see two problems with supporting non-ASCII spellings: 1. Support costs. 2. User confusion. The two are related because when users are confused, they will report invalid bugs when Python does not meet their expectations. For example, why

int('１２３', 10)
123

works, but

int('１２３ＡＢＣ', 16)
Traceback (most recent call last):
..
UnicodeEncodeError: 'decimal' codec can't encode character '\uff21' in position 3: invalid decimal Unicode string

does not? And if 'decimal' is a codec, why does

'123'.encode('decimal')
Traceback (most recent call last):
...
LookupError: unknown encoding: decimal

fail with an unknown-encoding error?
Before anyone suggests that int(.., 16) should consult the new Hex_Digit property in the UCD, let me remind you that int() supports bases from 2 through 36.

I thought Python design was primarily driven by practicality. Here the only plausible argument that one can make is that if Unicode says it is a digit, we should treat it as a digit. Purity over practicality.

In practical terms, UCD comes at a price. The unicodedata module size is over 700K on my machine. This is almost half the size of the python executable and by far the largest extension module. (Only the CJK encodings come close.) Making builtins depend on the largest extension module for operation does not strike me as sound design.
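For reference, the bases beyond 10 simply reuse the ASCII letters as extra digits, which is why a Hex_Digit-style Unicode property would not generalize to the full 2..36 range:

```python
# int() accepts bases 2 through 36; digits beyond 9 are the ASCII
# letters a-z (case-insensitive).
print(int('z', 36))    # 35
print(int('ff', 16))   # 255
print(int('777', 8))   # 511
print(int('10', 2))    # 2
```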
On Mon, 29 Nov 2010 22:46:33 -0500 Alexander Belopolsky <alexander.belopolsky@gmail.com> wrote:
In practical terms, UCD comes at a price. The unicodedata module size is over 700K on my machine. This is almost half the size of the python executable and by far the largest extension module. (only CJK encodings come close.) Making builtins depend on the largest extension module for operation does not strike me as sound design.
Well, do they depend on it? _PyUnicode_EncodeDecimal seems to depend only on Objects/unicodectype.c.

$ size Objects/unicode*.o
   text    data     bss     dec     hex filename
  60398       0       0   60398    ebee Objects/unicodectype.o
 130440   13559    2208  146207   23b1f Objects/unicodeobject.o

Antoine.
On Tue, Nov 30, 2010 at 8:38 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Mon, 29 Nov 2010 22:46:33 -0500 Alexander Belopolsky <alexander.belopolsky@gmail.com> wrote:
In practical terms, UCD comes at a price. The unicodedata module size is over 700K on my machine. This is almost half the size of the python executable and by far the largest extension module. (only CJK encodings come close.) Making builtins depend on the largest extension module for operation does not strike me as sound design.
Well, do they depend on it? _PyUnicode_EncodeDecimal seems to depend only on Objects/unicodectype.c.
My mistake. That was a late night post. I wonder why unicodedata.so is so big then. It must be character names:

$ python -v
'\N{DIGIT ONE}'
dlopen("/.../unicodedata.so", 2);
import unicodedata # dynamically loaded from /.../unicodedata.so
'1'
On Tue, 30 Nov 2010 at 09:32 -0500, Alexander Belopolsky wrote:
On Tue, Nov 30, 2010 at 8:38 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Mon, 29 Nov 2010 22:46:33 -0500 Alexander Belopolsky <alexander.belopolsky@gmail.com> wrote:
In practical terms, UCD comes at a price. The unicodedata module size is over 700K on my machine. This is almost half the size of the python executable and by far the largest extension module. (only CJK encodings come close.) Making builtins depend on the largest extension module for operation does not strike me as sound design.
Well, do they depend on it? _PyUnicode_EncodeDecimal seems to depend only on Objects/unicodectype.c.
My mistake. That was a late night post. I wonder why unicodedata.so is so big then.
It must be character names:

$ python -v
'\N{DIGIT ONE}'
dlopen("/.../unicodedata.so", 2);
import unicodedata # dynamically loaded from /.../unicodedata.so
'1'
From a quick peek using hexdump, character names seem to only account for 1/4 of the module size. That said, I don't think the size is very important. For any non-trivial Python application, the size of unicodedata will be negligible compared to the size of Python objects. Regards Antoine.
On Tue, Nov 30, 2010 at 09:41, Antoine Pitrou <solipsis@pitrou.net> wrote:
That said, I don't think the size is very important. For any non-trivial Python application, the size of unicodedata will be negligible compared to the size of Python objects.
That depends very much on the platform and the application. For our embedded use of Python, static data size (like the text segment of a shared object) is far dearer than the heap space used by Python objects, which is why we've had to excise both the UCD and the CJK codecs in our builds. -- Tim Lesher <tlesher@gmail.com>
Steven D'Aprano writes:
But in any case, please don't conflate the question of whether Python should accept j and/or i for complex numbers with the question of supporting non-arabic numerals. The two issues are unrelated.
Different, yes, unrelated, no. They're both about whether variant forms of universally used literals should be allowed in a programming language, or whether only the canonical form is allowed.

Note that *nobody* is saying that Python should have no facility for parsing these numbers, only that by default literal decimal numerals should be encoded as ASCII digits. For example, I would not object to int() getting a Boolean flag meaning "consult unicodedata for non-ASCII digits", just as it has an optional parameter meaning "decode in base other than 10".[1]

OTOH, until somebody says "Yes, in Mecca the bazaar traders keep books on their Lenovos using ISO-8859-6 numerals, and it would be painful for them to switch to what we call 'Arabic' numerals", I'm going to consider it a YAGNI. Just as even though mathematicians clearly prefer "i" as the imaginary unit, there's not enough pain involved in them switching to "j" to make it worth supporting both. (BTW, my first reaction to the "j" notation was "cool, Python supports quaternions out of the box!" It took only a second or so to return to reality, but that was my first reaction.)

Footnotes:
[1] That might not be a good idea on other grounds, but in principle I would be OK with such built-ins accepting non-ASCII digits on request.
On Mon, 29 Nov 2010 08:22:46 +0100 "Martin v. Löwis" <martin@v.loewis.de> wrote:
The former ensures that literals in code are always readable; the latter allows users to enter numbers in their own number system. How could that be a bad thing?
It's YAGNI, feature bloat. It gives the illusion of supporting something that actually isn't supported very well (namely, parsing local number strings). I claim that there is no meaningful application of this feature.
Still, if it's not detrimental and it's not difficult to support, then why do you care? You aren't even maintaining that part of the code. I don't think "remove feature bloat" is part of our development goals or practices. Given the diversity of our user base, such removal should be done carefully and only for serious reasons. Regards Antoine.
On Mon, Nov 29, 2010 at 1:33 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Mon, 29 Nov 2010 08:22:46 +0100 "Martin v. Löwis" <martin@v.loewis.de> wrote:
The former ensures that literals in code are always readable; the latter allows users to enter numbers in their own number system. How could that be a bad thing?
It's YAGNI, feature bloat. It gives the illusion of supporting something that actually isn't supported very well (namely, parsing local number strings). I claim that there is no meaningful application of this feature.
Still, if it's not detrimental and it's not difficult to support, then why do you care?
It is difficult to support. A fix for issue10557 would be much simpler if we did not support non-European digits. I have now added a patch that handles non-ASCII digits, so you can see what's involved. Note that when the Unicode Consortium inevitably adds more Nd characters to the non-BMP planes, we will have to add surrogate pair support to this code. In any case, there is little we can do about it in 3.2 other than fix bugs like issue10557 without breaking currently valid code, so I created a separate issue to continue this debate in the context of 3.3. [issue10581]

Now, I would like to bring this thread back to its subject. Given that the UCD now affects the language definition and the standard library behavior, how should changes to the UCD be handled?

- Should the Python documentation refer to the specific version of Unicode that it supports? Current documentation refers to old versions. Should the version be updated, or removed to imply the latest?

- How should UCD updates be handled during the language moratorium? During the PEP 3003 discussion, it was suggested to handle them on a case-by-case basis, but I don't see discussion of the upgrade to 6.0.0 in PEP 3003. Should this upgrade be backported to 2.7?

- How specific should the library reference manual be in defining methods affected by the UCD, such as str.upper()?

- What is an acceptable level of variation between Python implementations? For example, if '\UXXXXXXXX'.isalpha() returns true in one implementation, can it return false in another? Note that even CPython narrow and wide builds are presently not consistent in this respect.

[issue10581] http://bugs.python.org/issue10581
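The digit support under discussion can be illustrated directly (a minimal sketch; behavior as observed on current CPython releases):

```python
import unicodedata

# ARABIC-INDIC DIGITS ONE..FOUR (U+0661..U+0664); all carry the general
# category Nd, so int() accepts them just like ASCII digits.
s = '\u0661\u0662\u0663\u0664'
assert all(unicodedata.category(c) == 'Nd' for c in s)
print(int(s))  # parsed as the number 1234
```

It is exactly this acceptance of every Nd character, BMP or not, that the patches referenced above have to handle.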
- Should Python documentation refer to the specific version of Unicode that it supports?
You mean, mention it somewhere? Sure (although it would be nice if the documentation generator would automatically extract it from the source, just as it extracts the Python version number). Of course, such mentioning should explain that this is specific to CPython, and not an aspect of Python-the-language.
Current documentation refers to old versions. Should version be updated or removed to imply the latest?
What specific reference are you referring to?
- How UCD updates should be handled during the language moratorium?
It's clearly not affected.
During PEP 3003 discussion, it was suggested to handle it on a case by case basis, but I don't see discussion of the upgrade to 6.0.0 in PEP 3003.
It's covered by "As the standard library is not directly tied to the language definition it is not covered by this moratorium."
Should this upgrade be backported to 2.7?
No, it's a new feature.
- How specific should library reference manual be in defining methods affected by UCD such as str.upper()?
It should specify what this actually does in Unicode terminology (probably in addition to a layman's rephrase of that)
- What is an acceptable level of variation between Python implementations? For example, if '\UXXXXXXXX'.isalpha() returns true in one implementation, can it return false in another?
Implementations are free to use any version of the UCD. Regards, Martin
During PEP 3003 discussion, it was suggested to handle it on a case by case basis, but I don't see discussion of the upgrade to 6.0.0 in PEP 3003.
It's covered by "As the standard library is not directly tied to the language definition it is not covered by this moratorium."
How is this restricted to the stdlib if it defines the set of valid identifiers? - Hagen
Am 30.11.2010 09:15, schrieb Hagen Fürstenau:
During PEP 3003 discussion, it was suggested to handle it on a case by case basis, but I don't see discussion of the upgrade to 6.0.0 in PEP 3003.
It's covered by "As the standard library is not directly tied to the language definition it is not covered by this moratorium."
How is this restricted to the stdlib if it defines the set of valid identifiers?
The language does not change. The language specification says Python 3.0 introduces additional characters from outside the ASCII range (see PEP 3131). For these characters, the classification uses the version of the Unicode Character Database as included in the unicodedata module. That remains unchanged. It was a deliberate design decision of PEP 3131 to not codify a fixed set of characters that can be used in identifiers. Regards, Martin
On Mon, Nov 29, 2010 at 4:13 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
- Should Python documentation refer to the specific version of Unicode that it supports?
You mean, mention it somewhere? Sure (although it would be nice if the documentation generator would automatically extract it from the source, just as it extracts the Python version number).
Of course, such mentioning should explain that this is specific to CPython, and not an aspect of Python-the-language.
Current documentation refers to old versions. Should version be updated or removed to imply the latest?
What specific reference are you referring to?
I found two places: A reference to Unicode 3.0 (!) in the Data Model section and a reference to 5.2.0 in unicodedata docs. See http://mail.python.org/pipermail/docs/2010-November/002074.html
- How UCD updates should be handled during the language moratorium?
It's clearly not affected.
This is not what Guido said last year: """
One question:
There are currently number of patch waiting on the tracker for additional Unicode feature support and it's also likely that we'll want to upgrade to a more recent Unicode version within the next few years.
How would such indirect changes be seen under the moratorium ?
That would fall under the Case-by-Case Exemptions section. "Within the next few years" sounds like it might well wait until the moratorium is ended though. :-) """ http://mail.python.org/pipermail/python-dev/2009-November/093666.html I don't see it as a big deal, but technically speaking, with Unicode 6.0 changing the properties of two characters so that they become valid in identifiers, the Python language definition is affected. For example, an alternative implementation based on 5.2.0 will not accept a valid CPython program that uses one of these characters.
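The implementation difference is observable from Python itself (a small sketch; the isidentifier() result shown assumes a build shipping UCD 6.0.0 or later):

```python
import unicodedata

# U+0CF1 was recategorized in Unicode 6.0.0, making it eligible for
# identifiers; on a 5.2.0-based build the second line prints False.
print(unicodedata.unidata_version)   # which UCD this interpreter uses
print('\u0CF1'.isidentifier())       # True with UCD >= 6.0.0
```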
During PEP 3003 discussion, it was suggested to handle it on a case by case basis, but I don't see discussion of the upgrade to 6.0.0 in PEP 3003.
It's covered by "As the standard library is not directly tied to the language definition it is not covered by this moratorium."
See above. Also, it has been suggested that semantics of built-ins cannot change. (If that was so, it would put int('١٢٣٤') debate to rest at least for the time being.:-)
Should this upgrade be backported to 2.7?
No, it's a new feature.
Given that 2.7 will be maintained for 5 years, and that the Unicode Consortium arguably takes backward compatibility very seriously, wouldn't it make sense to consider a backport at some point? I am sure we will soon see a bug report that the following does not work in 2.7: :-)
ord('\N{CAT FACE WITH WRY SMILE}') 128572
- How specific should library reference manual be in defining methods affected by UCD such as str.upper()?
It should specify what this actually does in Unicode terminology (probably in addition to a layman's rephrase of that)
I opened an issue for this: http://bugs.python.org/issue10587
.. For example, if '\UXXXXXXXX'.isalpha() returns true in one implementation, can it return false in another?
Implementations are free to use any version of the UCD.
I was more concerned about wide and narrow Unicode CPython builds. Is it a bug that '\UXXXXXXXX'.isalpha() may disagree even when the two implementations are based on the same version of the UCD? Thanks for your answers.
On 11/30/2010 10:05 AM, Alexander Belopolsky wrote:

My general answers to the questions you have raised are as follows:

1. Each new feature release should use the latest version of the UCD as of the first beta release (or perhaps a week or so before). New chars are new features, and the beta period can be used to (hopefully) iron out any bugs introduced by a new UCD version.

2. The language specification should not be UCD version specific. Martin pointed out that the definition of identifiers was intentionally written not to be, by referring to the 'current version' or some such. On the other hand, the UCD version used should be programmatically discoverable, perhaps as an attribute of sys or str.

3. The UCD should not change in bugfix releases. New chars are new features. Adding them in bugfix releases will introduce gratuitous incompatibilities between releases. People who want the latest Unicode should either upgrade to the latest Python version or patch an older version (but not expect core support for any problems that creates).
Given that 2.7 will be maintained for 5 years and arguably Unicode Consortium takes backward compatibility very seriously, wouldn't it make sense to consider a backport at some point?
I am sure we will soon see a bug report that the following does not work in 2.7: :-)
ord('\N{CAT FACE WITH WRY SMILE}') 128572
3 (cont). 2.7 is no different in that regard. It is feature frozen just like all other x.y releases. And that is the answer to any such report. If that code became valid in 2.7.2, for instance, it would still not work in 2.7 and 2.7.1. Not working is not a bug; working is a new feature introduced after 2.7 was released.
- How specific should library reference manual be in defining methods affected by UCD such as str.upper()?
It should specify what this actually does in Unicode terminology (probably in addition to a layman's rephrase of that)
I opened an issue for this:
1,2 (cont). Good idea in general.
I was more concerned about wide an narrow unicode CPython builds. Is it a bug that '\UXXXXXXXX'.isalpha() may disagree even when the two implementations are based on the same version of UCD?
4. While the difference between narrow/wide builds of (CPython) x.y (which should have one constant UCD) cannot be completely masked, I appreciate and generally agree with your efforts to minimize them. In some cases, there will be a conflict/tradeoff between eliminating this difference versus that. -- Terry Jan Reedy
Terry Reedy wrote:
On 11/30/2010 10:05 AM, Alexander Belopolsky wrote:
My general answers to the questions you have raised are as follows:
1. Each new feature release should use the latest version of the UCD as of the first beta release (or perhaps a week or so before). New chars are new features and the beta period can be used to (hopefully) iron out any bugs introduced by a new UCD version.
The UCD is versioned just like Python is, so if the Unicode Consortium decides to ship a 5.2.1 version of the UCD, we can add that to Python 2.7.x, since Python 2.7 started out with 5.2.0.
2. The language specification should not be UCD version specific. Martin pointed out that the definition of identifiers was intentionally written not to be, by referring to the 'current version' or some such. On the other hand, the UCD version used should be programmatically discoverable, perhaps as an attribute of sys or str.
It already is and has been for while, e.g. Python 2.5:
import unicodedata unicodedata.unidata_version '4.1.0'
3. The UCD should not change in bugfix releases. New chars are new features. Adding them in bugfix releases will introduce gratuitous incompatibilities between releases. People who want the latest Unicode should either upgrade to the latest Python version or patch an older version (but not expect core support for any problems that creates).
See above. Patch level revisions of the UCD are fine for patch level releases of Python, since those patch level revisions of the UCD fix bugs just like we do in Python. Note that each new UCD major.minor version is a new standard on its own, so it's perfectly ok to stick with one such standard version per Python version. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Dec 01 2010)
Python/Zope Consulting and Support ... http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/
On Mon, Nov 29, 2010 at 2:38 PM, Alexander Belopolsky <alexander.belopolsky@gmail.com> wrote: ..
Still, if it's not detrimental and it's not difficult to support, then why do you care?
It is difficult to support. A fix for issue10557 would be much simpler if we did not support non-European digits. I have now added a patch that handles non-ASCII digits, so you can see what's involved. Note that when the Unicode Consortium inevitably adds more Nd characters to the non-BMP planes, we will have to add surrogate pair support to this code.
It turns out that this did in fact happen:

    # Newly assigned in Unicode 3.1.0 (March, 2001)
    ..
    1D7CE..1D7FF ; 3.1 # [50] MATHEMATICAL BOLD DIGIT ZERO..MATHEMATICAL MONOSPACE DIGIT NINE

See http://unicode.org/Public/UNIDATA/DerivedAge.txt And of course,
unicodedata.digit('\U0001D7CE') 0
but
int('\U0001D7CE') .. UnicodeEncodeError: 'decimal' codec can't encode character '\ud835' ..
on a narrow Unicode build. (Note the character reported in the error message!)

If you think non-ASCII digits are not difficult to support, please contribute to the following tracker issues:

http://bugs.python.org/issue10581 (Review and document string format accepted in numeric data type constructors)
http://bugs.python.org/issue10557 (Malformed error message from float())
http://bugs.python.org/issue10435 (Document unicode C-API in reST - Specifically, PyUnicode_EncodeDecimal)
http://bugs.python.org/issue8646 (PyUnicode_EncodeDecimal is undocumented)
http://bugs.python.org/issue6632 (Include more fullwidth chars in the decimal codec)

and back to the issue of user confusion:

http://bugs.python.org/issue652104 [closed/invalid] (int(u"\u1234") raises UnicodeEncodeError by Guido van Rossum)
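For reference, the non-BMP digit case can be sketched as follows (the successful int() call assumes a wide build, where no surrogate pair is involved; on a narrow build it raises the UnicodeEncodeError quoted above):

```python
import unicodedata

c = '\U0001D7CE'  # MATHEMATICAL BOLD DIGIT ZERO, a non-BMP Nd character
assert unicodedata.digit(c) == 0
print(int(c))  # succeeds on wide builds only
```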
On 30/11/2010 16:40, Alexander Belopolsky wrote:
[snip...] And of course,
unicodedata.digit('\U0001D7CE') 0
but
int('\U0001D7CE') .. UnicodeEncodeError: 'decimal' codec can't encode character '\ud835' ..
on a narrow Unicode build. (Note the character reported in the error message!)
If you think non-ASCII digits are not difficult to support, please contribute to the following tracker issues:
Would moving this functionality to the locale module make the issues any easier to fix? Michael
http://bugs.python.org/issue10581 (Review and document string format accepted in numeric data type constructors)
http://bugs.python.org/issue10557 (Malformed error message from float())
http://bugs.python.org/issue10435 (Document unicode C-API in reST - Specifically, PyUnicode_EncodeDecimal)
http://bugs.python.org/issue8646 (PyUnicode_EncodeDecimal is undocumented)
http://bugs.python.org/issue6632 (Include more fullwidth chars in the decimal codec)
and back to the issue of user confusion
http://bugs.python.org/issue652104 [closed/invalid] (int(u"\u1234") raises UnicodeEncodeError by Guido van Rossum)
On Tue, Nov 30, 2010 at 12:40 PM, Michael Foord <fuzzyman@voidspace.org.uk> wrote: ..
If you think non-ASCII digits are not difficult to support, please contribute to the following tracker issues:
Would moving this functionality to the locale module make the issues any easier to fix?
Sure, if we code it in Python, supporting it will be much easier:

    import re
    import unicodedata

    def normalize_digits(s):
        digits = {m.group(1) for m in re.finditer(r'(\d)', s)}
        trtab = {ord(d): str(unicodedata.digit(d)) for d in digits}
        return s.translate(trtab)
normalize_digits('١٢٣٤.٥٦') '1234.56'
I am not sure this belongs to the locale module, however. It seems to me, something like 'unicodealgo' for unicode algorithms would be more appropriate.
Sure, if we code it in Python, supporting it will be much easier:
def normalize_digits(s):
    digits = {m.group(1) for m in re.finditer(r'(\d)', s)}
    trtab = {ord(d): str(unicodedata.digit(d)) for d in digits}
    return s.translate(trtab)
normalize_digits('١٢٣٤.٥٦') '1234.56'
I am not sure this belongs to the locale module, however. It seems to me, something like 'unicodealgo' for unicode algorithms would be more appropriate.
It could simply be in unicodedata if you split the implementation into a core C part and some Python bits. Regards Antoine.
On Tue, Nov 30, 2010 at 1:29 PM, Antoine Pitrou <solipsis@pitrou.net> wrote: ..
I am not sure this belongs to the locale module, however. It seems to me, something like 'unicodealgo' for unicode algorithms would be more appropriate.
It could simply be in unicodedata if you split the implementation into a core C part and some Python bits.
Splitting unicodedata may not be a bad idea. There are many more pieces in the UCD than are covered by unicodedata. [1] Hardcoding them all into the unicodedata module is hard to justify, but some are quite useful. For example, PropertyValueAliases.txt is quite useful for those like myself who cannot remember what the Pd or Zl category names stand for. SpecialCasing.txt is required for proper casing, but is not currently included in Python. I would not want to change str.upper or str.title because of this, but providing the raw info to someone who wants to implement proper case mappings may not be a bad idea. Blocks.txt is certainly useful for any language-dependent processing. On the other hand, I think we should keep Unicode data and Unicode algorithms separate. And the latter may not even belong in the Python stdlib. [1] http://unicode.org/Public/UNIDATA/
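The Pd/Zl example above can be made concrete (a sketch; CATEGORY_ALIASES is a tiny hand-copied excerpt, not shipped by any module - the full mapping lives in PropertyValueAliases.txt at unicode.org):

```python
import unicodedata

# Hand-copied excerpt of PropertyValueAliases.txt for general categories:
CATEGORY_ALIASES = {'Pd': 'Dash_Punctuation', 'Zl': 'Line_Separator'}

cat = unicodedata.category('\u2028')   # U+2028 LINE SEPARATOR
print(cat, CATEGORY_ALIASES.get(cat))  # short code plus its long alias
```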
Le mardi 30 novembre 2010 à 20:16 +0100, "Martin v. Löwis" a écrit :
Would moving this functionality to the locale module make the issues any easier to fix?
You could delegate it to the C library, so: yes.
I hope you don't suggest delegating it to the C locale functions. Do you?
Am 30.11.2010 20:23, schrieb Antoine Pitrou:
Le mardi 30 novembre 2010 à 20:16 +0100, "Martin v. Löwis" a écrit :
Would moving this functionality to the locale module make the issues any easier to fix?
You could delegate it to the C library, so: yes.
I hope you don't suggest delegating it to the C locale functions. Do you?
Yes, I do. Why do you hope I don't? Regards, Martin
Le mardi 30 novembre 2010 à 20:40 +0100, "Martin v. Löwis" a écrit :
Am 30.11.2010 20:23, schrieb Antoine Pitrou:
Le mardi 30 novembre 2010 à 20:16 +0100, "Martin v. Löwis" a écrit :
Would moving this functionality to the locale module make the issues any easier to fix?
You could delegate it to the C library, so: yes.
I hope you don't suggest delegating it to the C locale functions. Do you?
Yes, I do. Why do you hope I don't?
Because we all know how locale is a pile of cr*p, both in specification and in implementations. Our unit tests for it are a clear proof of that. Actually, I remember you saying that locale should ideally be replaced with a wrapper around the ICU library. Regards Antoine.
Because we all know how locale is a pile of cr*p, both in specification and in implementations. Our unit tests for it are a clear proof of that.
I wouldn't use expletives, but rather claim that the locale module is highly platform-dependent.
Actually, I remember you saying that locale should ideally be replaced with a wrapper around the ICU library.
By that, I stand - however, I have given up the hope that this will happen anytime soon. Wrt. local number parsing, I think that the locale module would be way better than the nonsense that Python currently does. In the locale module, somebody at least has thought about what specifically constitutes a number. The current not-ASCII-but-not-local-either approach is just useless. Maintaining a reasonable implementation is a burden, so deferring to the C library is more attractive than having to maintain an unreasonable implementation. Regards, Martin
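The locale-based parsing being advocated here already exists in the stdlib (a minimal sketch; the "C" locale is used because it is the one locale guaranteed to be available everywhere):

```python
import locale

# locale.atof/atoi parse according to the current LC_NUMERIC setting;
# under the "C" locale that means plain ASCII digits and '.' as the point.
locale.setlocale(locale.LC_NUMERIC, 'C')
print(locale.atof('1234.56'))
print(locale.atoi('1234'))
```

With a different LC_NUMERIC setting, the same calls honor that locale's thousands separator and decimal point, which is exactly the platform-dependence Antoine objects to below.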
Le mardi 30 novembre 2010 à 20:55 +0100, "Martin v. Löwis" a écrit :
Wrt. to local number parsing, I think that the locale module would be way better than the nonsense that Python currently does. In the locale module, somebody at least has thought about what specifically constitutes a number. The current not-ASCII-but-not-local-either approach is just useless.
It depends what you need. If you parse integers it's probably good enough. And it's better to have a trustable standard (unicode) than a myriad of ad-hoc, possibly buggy or incomplete, often unavailable, cultural specifications drafted by OS vendors who have no business (and no expertise) in drafting them. At least you can build more sophisticated routines on the simple information given to you by the unicode database. You cannot build anything solid on the C locale functions (and even then you are limited by various issues inherent in the locale semantics, such as the fact that it relies on process-wide state, which would only be ok, at best, for single-user applications). There's a reason that e.g. Babel (*) reimplements locale-like functionality from scratch. (*) http://pypi.python.org/pypi/Babel/ Regards Antoine.
Oh, about ICU:
Actually, I remember you saying that locale should ideally be replaced with a wrapper around the ICU library.
By that, I stand - however, I have given up the hope that this will happen anytime soon.
Perhaps this could be made a GSOC topic. Regards Antoine.
On Tue, Nov 30, 2010 at 3:13 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Oh, about ICU:
Actually, I remember you saying that locale should ideally be replaced with a wrapper around the ICU library.
By that, I stand - however, I have given up the hope that this will happen anytime soon.
Perhaps this could be made a GSOC topic.
Incidentally, this may also address another of Python's Achilles' heels: timezone support. http://icu-project.org/download/icutzu.html
On Wed, Dec 1, 2010 at 8:45 PM, Alexander Belopolsky <alexander.belopolsky@gmail.com> wrote:
On Tue, Nov 30, 2010 at 3:13 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Oh, about ICU:
Actually, I remember you saying that locale should ideally be replaced with a wrapper around the ICU library.
By that, I stand - however, I have given up the hope that this will happen anytime soon.
Perhaps this could be made a GSOC topic.
Incidentally, this may also address another of Python's Achilles' heels: timezone support.
I work with people who speak highly of ICU, so I want to encourage work in this area. At the same time, I'm skeptical -- IIRC, ICU is a large amount of C++ code. I don't know how easy it will be to integrate this into our build processes for various platforms, nor how "Pythonic" the resulting APIs will look to the experienced Python user. Still, those are not roadblocks, the benefits are potentially great, so it's definitely worth investigating! -- --Guido van Rossum (python.org/~guido)
2010/12/2 Guido van Rossum <guido@python.org>:
On Wed, Dec 1, 2010 at 8:45 PM, Alexander Belopolsky <alexander.belopolsky@gmail.com> wrote:
On Tue, Nov 30, 2010 at 3:13 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Oh, about ICU:
Actually, I remember you saying that locale should ideally be replaced with a wrapper around the ICU library.
By that, I stand - however, I have given up the hope that this will happen anytime soon.
Perhaps this could be made a GSOC topic.
Incidentally, this may also address another of Python's Achilles' heels: timezone support.
I work with people who speak highly of ICU, so I want to encourage work in this area.
At the same time, I'm skeptical -- IIRC, ICU is a large amount of C++ code. I don't know how easy it will be to integrate this into our build processes for various platforms, nor how "Pythonic" the resulting APIs will look to the experienced Python user.
There's a nice C-API. -- Regards, Benjamin
On Dec 1, 2010, at 11:45 PM, Alexander Belopolsky wrote:
On Tue, Nov 30, 2010 at 3:13 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Oh, about ICU:
Actually, I remember you saying that locale should ideally be replaced with a wrapper around the ICU library.
By that, I stand - however, I have given up the hope that this will happen anytime soon.
Perhaps this could be made a GSOC topic.
Incidentally, this may also address another of Python's Achilles' heels: timezone support.
Does ICU do anything regarding timezones that datetime + pytz doesn't already do? Wouldn't it make more sense to integrate the already-existing-and-pythonic pytz into Python than to make a new wrapper based on ICU? James
Am 29.11.2010 19:33, schrieb Antoine Pitrou:
On Mon, 29 Nov 2010 08:22:46 +0100 "Martin v. Löwis" <martin@v.loewis.de> wrote:
The former ensures that literals in code are always readable; the latter allows users to enter numbers in their own number system. How could that be a bad thing?
It's YAGNI, feature bloat. It gives the illusion of supporting something that actually isn't supported very well (namely, parsing local number strings). I claim that there is no meaningful application of this feature.
Still, if it's not detrimental and it's not difficult to support, then why do you care? You aren't even maintaining that part of the code.
I sure do maintain the Unicode database implementation in Python - the one that is being used (IMO incorrectly) to implement the conversion in question (and also the one that triggered this thread).
I don't think "remove feature bloat" is part of our development goals or practices. Given the diversity of our user base, such removal should be done carefully and only for serious reasons.
I think it's a serious reason that the intuitive expectation of many people (including committers) deviates from the actual implementation - so much so that they clarify the documentation in a way that makes the difference explicit. Having a mismatch between the expected behavior and the actual behavior is a serious problem because it could lead to security issues, e.g. when someone relies on float() to perform certain syntactic checking, making it then possible to sneak in values that cause corruption later on (speaking theoretically, of course - I'm not aware of an application that is vulnerable in this manner). Regards, Martin
Alexander Belopolsky wrote:
Two recently reported issues brought into light the fact that Python language definition is closely tied to character properties maintained by the Unicode Consortium. [1,2] For example, when Python switches to Unicode 6.0.0 (planned for the upcoming 3.2 release), we will gain two additional characters that Python can use in identifiers. [3]
With Python 3.1:
exec('\u0CF1 = 1') Traceback (most recent call last): File "<stdin>", line 1, in <module> File "<string>", line 1 ೱ = 1 ^ SyntaxError: invalid character in identifier
but with Python 3.2a4:
exec('\u0CF1 = 1') eval('\u0CF1') 1
Such changes are not new, but I agree that they should probably be highlighted in the "What's new in Python x.x".
Of course, the likelihood is low that this change will affect any user, but the change in str.isspace() reported in [1] is likely to cause some trouble:
Python 2.6.5:
u'A\u200bB'.split() [u'A', u'B']
Python 2.7:
u'A\u200bB'.split() [u'A\u200bB']
That's a classical bug fix.
While we have little choice but to follow UCD in defining str.isidentifier(), I think Python can promise users more stability in what it treats as space or as a digit in its builtins.
Why should we divert from the work done by the Unicode Consortium ? After all, most of their changes are in fact bug fixes as well.
For example, I don't think that supporting
float('١٢٣٤.٥٦') 1234.56
is more important than to assure users that once their program accepted some text as a number, they can assume that the text is ASCII.
Sorry, but I don't agree. If ASCII numerals are an important aspect of an application, the application should make sure that only those numerals are used (e.g. by using a regular expression for checking). In a Unicode world, not accepting non-Arabic numerals would be a limitation, not a feature. Besides, Python has had this support since Python 1.6.
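The regular-expression check suggested above might look like this (a sketch; is_ascii_number is a hypothetical helper name, and the pattern only covers plain signed decimals, not exponents):

```python
import re

def is_ascii_number(text):
    # Accept only ASCII digits, with an optional sign and fraction part;
    # [0-9] deliberately avoids \d, which also matches other Unicode Nd digits.
    return re.fullmatch(r'[+-]?[0-9]+(\.[0-9]+)?', text) is not None

print(is_ascii_number('1234.56'))                # ASCII digits: accepted
print(is_ascii_number('\u0661\u0662\u0663.\u0664'))  # Arabic-Indic: rejected
```

An application can run this check first and still pass the string to float() afterwards, getting ASCII-only behavior without changing the builtins.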
[1] http://bugs.python.org/issue10567 [2] http://bugs.python.org/issue10557 [3] http://www.unicode.org/versions/Unicode6.0.0/#Database_Changes
-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Nov 28 2010)
On Sun, Nov 28, 2010 at 5:48 PM, M.-A. Lemburg <mal@egenix.com> wrote: ..
With Python 3.1:
exec('\u0CF1 = 1') Traceback (most recent call last): File "<stdin>", line 1, in <module> File "<string>", line 1 ೱ = 1 ^ SyntaxError: invalid character in identifier
but with Python 3.2a4:
exec('\u0CF1 = 1') eval('\u0CF1') 1
Such changes are not new, but I agree that they should probably be highlighted in the "What's new in Python x.x".
As of today, "What’s New In Python 3.2" [1] does not even mention the unicodedata upgrade to 6.0.0. Here are the features from the unicode.org summary [2] that I think should be reflected in Python's "What's New" document:

"""
* adds 2,088 characters, including over 1,000 additional symbols—chief among them the additional emoji symbols, which are especially important for mobile phones;
* corrects character properties for existing characters including
  - a general category change to two Kannada characters (U+0CF1, U+0CF2), which has the effect of making them newly eligible for inclusion in identifiers;
  - a general category change to one New Tai Lue numeric character (U+19DA), which would have the effect of disqualifying it from inclusion in identifiers unless grandfathering measures are in place for the defining identifier syntax.
"""

The above may be too verbose for inclusion in "What’s New In Python 3.2", but I think we should add a possibly shorter summary with a link to unicode.org for details.

PS: Yes, I think everyone should know about the Python 3.2 killer feature: '\N{CAT FACE WITH WRY SMILE}'!

[1] http://docs.python.org/dev/whatsnew/3.2.html
[2] http://www.unicode.org/versions/Unicode6.0.0/
On 12/1/2010 12:55 PM, Alexander Belopolsky wrote:
On Sun, Nov 28, 2010 at 5:48 PM, M.-A. Lemburg<mal@egenix.com> wrote: ..
With Python 3.1:
>>> exec('\u0CF1 = 1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1
    ೱ = 1
        ^
SyntaxError: invalid character in identifier
but with Python 3.2a4:
>>> exec('\u0CF1 = 1')
>>> eval('\u0CF1')
1
Such changes are not new, but I agree that they should probably be highlighted in the "What's new in Python x.x".
As of today, "What’s New In Python 3.2" [1] does not even mention the unicodedata upgrade to 6.0.0. Here are the features from the unicode.org summary [2] that I think should be reflected in Python's "What's New" document:
""" * adds 2,088 characters, including over 1,000 additional symbols—chief among them the additional emoji symbols, which are especially important for mobile phones;
* corrects character properties for existing characters, including
  - a general category change to two Kannada characters (U+0CF1, U+0CF2), which has the effect of making them newly eligible for inclusion in identifiers;
  - a general category change to one New Tai Lue numeric character (U+19DA), which would have the effect of disqualifying it from inclusion in identifiers unless grandfathering measures are in place for the defining identifier syntax.
"""
The above may be too verbose for inclusion in "What’s New In Python 3.2",
I think those 11 lines are pretty good. Put them in, '\N{CAT FACE WITH WRY SMILE}'! Plus give a link to the Unicode site (issue numbers are implicit links). -- Terry Jan Reedy
Am 01.12.2010 23:39, schrieb "Martin v. Löwis":
As of today, "What’s New In Python 3.2" [1] does not even mention the unicodedata upgrade to 6.0.0.
One reason was that I was instructed not to change "What's New" a few years ago.
Maybe all past, present and future whatsnew maintainers can agree on these rules, which I copied directly from whatsnew/3.2.rst?

Rules for maintenance:

* Anyone can add text to this document. Do not spend very much time on the wording of your changes, because your text will probably get rewritten to some degree.

* The maintainer will go through Misc/NEWS periodically and add changes; it's therefore more important to add your changes to Misc/NEWS than to this file.

* This is not a complete list of every single change; completeness is the purpose of Misc/NEWS. Some changes I consider too small or esoteric to include. If such a change is added to the text, I'll just remove it. (This is another reason you shouldn't spend too much time on writing your addition.)

* If you want to draw your new text to the attention of the maintainer, add 'XXX' to the beginning of the paragraph or section.

* It's OK to just add a fragmentary note about a change. For example: "XXX Describe the transmogrify() function added to the socket module." The maintainer will research the change and write the necessary text.

* You can comment out your additions if you like, but it's not necessary (especially when a final release is some months away).

* Credit the author of a patch or bugfix. Just the name is sufficient; the e-mail address isn't necessary. It's helpful to add the issue number:

    XXX Describe the transmogrify() function added to the socket module.
    (Contributed by P.Y. Developer; :issue:`12345`.)

  This saves the maintainer the effort of going through the SVN log when researching a change.

Georg
Maybe all past, present and future whatsnew maintainers can agree on these rules, which I copied directly from whatsnew/3.2.rst?
I don't think all past maintainers can (I'm pretty certain that AMK would disagree), but if that's the current policy, I can certainly try following it (I didn't know it exists because I never look at the file). Regards, Martin
Am 02.12.2010 20:40, schrieb "Martin v. Löwis":
Maybe all past, present and future whatsnew maintainers can agree on these rules, which I copied directly from whatsnew/3.2.rst?
I don't think all past maintainers can
Yes, and the same goes for the future ones, since they may not even know yet that they will be whatsnew maintainers. Or maybe they aren't born yet (let's hope for a long life of Python 3...).
(I'm pretty certain that AMK would disagree), but if that's the current policy, I can certainly try following it (I didn't know it exists because I never look at the file).
The large chunk of rules appeared in 2.6, where AMK still was maintainer. But even in the whatsnew for 2.4, there is this:

.. Don't write extensive text for new sections; I'll do that.

.. Feel free to add commented-out reminders of things that need
.. to be covered.  --amk

But in any case, they are certainly valid for the current whatsnew -- even if Raymond likes to grumble about too expansive commits :)

Georg
Alexander Belopolsky wrote:
Two recently reported issues brought into light the fact that Python language definition is closely tied to character properties maintained by the Unicode Consortium. [1,2] For example, when Python switches to Unicode 6.0.0 (planned for the upcoming 3.2 release), we will gain two additional characters that Python can use in identifiers. [3] [...]
Why do you consider this a problem? It would be a problem if previously valid identifiers *stopped* being valid, but not the other way around.
Of course, the likelihood is low that this change will affect any user, but the change in str.isspace() reported in [1] is likely to cause some trouble:
Looking at the thread here: http://bugs.python.org/issue10567 I interpret it as indicating that Python's isspace() has been buggy for many years, and is only now being fixed. It's always unfortunate when people rely on bugs, but I'm not sure we should be promising to support bug-for-bug compatibility from one version to the next :)
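The behavior change tracked in that issue is easy to demonstrate on a current interpreter (a sketch of the post-fix behavior; on the older Unicode data that 2.6.5 shipped, the split produced two pieces):

```python
# U+200B ZERO WIDTH SPACE lost its whitespace property when its
# character data was corrected, so str.split() no longer breaks on it.
s = 'A\u200bB'
print('\u200b'.isspace())  # False with the corrected data
print(s.split())           # ['A\u200bB'] rather than ['A', 'B']
```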
While we have little choice but to follow UCD in defining str.isidentifier(), I think Python can promise users more stability in what it treats as space or as a digit in its builtins. For example, I don't think that supporting
>>> float('١٢٣٤.٥٦')
1234.56
is more important than to assure users that once their program accepted some text as a number, they can assume that the text is ASCII.
Seems like a pretty foolish assumption, if you ask me, pretty much akin to assuming that if string.isalpha() returns true that string is ASCII. Support for non-Arabic numerals in number strings goes back to at least Python 2.4:

[steve@sylar ~]$ python2.4
Python 2.4.6 (#1, Mar 30 2009, 10:08:01)
[GCC 4.1.2 20070925 (Red Hat 4.1.2-27)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> float(u'١٢٣٤.٥٦')
1234.5599999999999
The fact that this is (apparently) only being raised now means that it isn't actually a problem in real life. I'd even say that it's a feature, and that if Python didn't support non-Arabic numerals, it should. -- Steven
On Sun, Nov 28, 2010 at 6:43 PM, Steven D'Aprano <steve@pearwood.info> wrote: ..
is more important than to assure users that once their program accepted some text as a number, they can assume that the text is ASCII.
Seems like a pretty foolish assumption, if you ask me, pretty much akin to assuming that if string.isalpha() returns true that string is ASCII.
It is not foolish to the 99.9% of Python users whose code is written for 2.x. Their strings are byte strings, and string.isdigit() does imply ASCII even if string.isalpha() does not in many locales. ..
The fact that this is (apparently) only being raised now means that it isn't actually a problem in real life. I'd even say that it's a feature, and that if Python didn't support non-Arabic numerals, it should.
I raised this problem because I found a bug that is related to this feature. The bug is also a regression from 2.x. In 2.7:
>>> float(u'1234\xa1')
...
ValueError: invalid literal for float(): 1234?
The last character is lost, but the error message is still meaningful. In 3.x, however:
>>> float('1234\xa1')
...
ValueError
See http://bugs.python.org/issue10557 While investigating this issue I found that by the time the string gets to the number parser (_Py_dg_strtod), all non-ascii characters are dropped by PyUnicode_EncodeDecimal(), so it cannot produce a meaningful diagnostic. Of course, PyUnicode_EncodeDecimal() can be fixed by making it pass non-ascii chars through as UTF-8 bytes, but I was wondering if preserving the ability to parse exotic numerals was worth the effort.
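For illustration, here is a rough Python-level model of that transcoding step (to_ascii_digits is a hypothetical helper, not the actual C code; it mimics the mapping PyUnicode_EncodeDecimal performs by using the UCD decimal-digit property):

```python
import unicodedata

def to_ascii_digits(s):
    # Hypothetical helper: map any character that carries a Unicode
    # decimal-digit value to the corresponding ASCII digit, pass other
    # ASCII characters (sign, decimal point, exponent) through
    # unchanged, and reject everything else.
    out = []
    for ch in s:
        d = unicodedata.decimal(ch, None)
        if d is not None:
            out.append(str(d))
        elif ch < '\x80':
            out.append(ch)
        else:
            raise ValueError('invalid character %r' % ch)
    return ''.join(out)

print(to_ascii_digits('١٢٣٤.٥٦'))  # '1234.56'
```

This is why float('١٢٣٤.٥٦') succeeds: the Arabic-Indic digits are rewritten to ASCII before the parser ever sees them.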
On Sun, 28 Nov 2010 21:32:15 -0500 Alexander Belopolsky <alexander.belopolsky@gmail.com> wrote:
On Sun, Nov 28, 2010 at 6:43 PM, Steven D'Aprano <steve@pearwood.info> wrote: ..
is more important than to assure users that once their program accepted some text as a number, they can assume that the text is ASCII.
Seems like a pretty foolish assumption, if you ask me, pretty much akin to assuming that if string.isalpha() returns true that string is ASCII.
It is not to 99.9% of Python users whose code is written for 2.x. Their strings are byte strings and string.isdigit() does imply ASCII even if string.isalpha() does not in many locales.
We are not talking about string.isdigit(), we are talking about the float() constructor when given a unicode string. Constructing a float from a unicode string is certainly a common thing, even in 2.x. Regards Antoine.
On Sun, Nov 28, 2010 at 21:24, Alexander Belopolsky <alexander.belopolsky@gmail.com> wrote:
While we have little choice but to follow UCD in defining str.isidentifier(), I think Python can promise users more stability in what it treats as space or as a digit in its builtins.
Why? I can see this is a problem if one character that earlier was allowed no longer is. That breaks backwards compatibility. This doesn't.
float('١٢٣٤.٥٦') 1234.56
is more important than to assure users that once their program accepted some text as a number, they can assume that the text is ASCII.
*I* think it is more important. In Python 3, you can never ever assume anything is ASCII any more. ASCII is practically dead and buried as far as Python goes, unless you explicitly encode to it.
def deposit(self, amountstr):
    self.balance += float(amountstr)
    audit_log("Deposited: " + amountstr)
Auditor:
$ cat numbered-account.log
Deposited: ?????.??
That log reasonably should be in UTF-8 or something else, in which case this is not a problem. And that's ignoring that it makes way more sense to log the numerical amount. -- Lennart Regebro: http://regebro.wordpress.com/ Python 3 Porting: http://python3porting.com/ +33 661 58 14 64
Lennart Regebro writes:
*I* think it is more important. In python 3, you can never ever assume anything is ASCII any more.
Sure you can. In Python program text, all keywords will be ASCII (English, even, though it may be en_NL.UTF-8 <wink>) for the foreseeable future. I see no reason not to make a similar promise for numeric literals. I see no good reason to allow compatibility full-width Japanese "ASCII" numerals or Arabic cursive numerals in "for i in range(...)" for example. As soon as somebody gives an example of a culture, however minor, that uses computers but actively prefers to use non-ASCII numerals to express numbers in an IT context, I'll review my thinking. But at the moment it's 101% YAGNI.
hi, I agree with this. I have never seen anyone in China use Chinese number literals (there are at least two kinds, 一 and 壹, both meaning 1) in a Python program, except for UI output. They can do some mapping when they want to output these non-ascii numbers. Example:

if 1:
    print "一"

I think it is a little ugly to have code like this: num = float("一.一"), expected result is: num = 1.1

br, khy

On Tue, Nov 30, 2010 at 4:23 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Lennart Regebro writes:
> *I* think it is more important. In python 3, you can never ever assume > anything is ASCII any more.
Sure you can. In Python program text, all keywords will be ASCII (English, even, though it may be en_NL.UTF-8 <wink>) for the foreseeable future.
I see no reason not to make a similar promise for numeric literals. I see no good reason to allow compatibility full-width Japanese "ASCII" numerals or Arabic cursive numerals in "for i in range(...)" for example.
As soon as somebody gives an example of a culture, however minor, that uses computers but actively prefers to use non-ASCII numerals to express numbers in an IT context, I'll review my thinking. But at the moment it's 101% YAGNI.
haiyang kang wrote:
hi,
I agree with this.
I have never seen anyone in China use Chinese number literals (there are at least two kinds, 一 and 壹, both meaning 1) in a Python program, except for UI output.
They can do some mapping when they want to output these non-ascii numbers. Example:

if 1:
    print "一"
I think it is a little ugly to have code like this: num = float("一.一"), expected result is: num = 1.1
I don't expect that anyone would sensibly write code like that, except for testing. You wouldn't write num = float("1.1") instead of just num = 1.1 either. But you should be able to write: text = input("Enter a number using your preferred digits: ") num = float(text) without caring whether the user enters 一.一 or 1.1 or something else. -- Steven
On Tue, Nov 30, 2010 at 7:59 AM, Steven D'Aprano <steve@pearwood.info> wrote: ..
But you should be able to write:
text = input("Enter a number using your preferred digits: ")
num = float(text)
without caring whether the user enters 一.一 or 1.1 or something else.
I find it ironic that people who argue for preservation of the current behavior do it without checking what it actually is:
>>> float('一.一')
...
UnicodeEncodeError: 'decimal' codec can't encode character '\u4e00' ...
This one of the biggest problems with this feature. It does not fit user's expectations. Even the original author of the decimal "codec" expected the above to work. [1]
Python can already do this, and has been able to for many years:
>>> int(u'٣')
3
but you can do this without support from int() as well:
>>> import unicodedata
>>> unicodedata.digit('٣')
3
and for Unihan numbers, you can do
>>> unicodedata.numeric('一')
1.0
and
>>> unicodedata.numeric('ⅷ')
8.0
and if you are so inclined,
>>> [unicodedata.numeric(c) for c in "ↂ ↁ ⅗ ⅞ 𐄳".split()]
[10000.0, 5000.0, 0.6, 0.875, 90000.0]
Do you want to see all these supported by float()? [1] "makeunicodedata.py does not support Unihan digit data" http://bugs.python.org/issue10575
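The line Python draws here can be stated precisely: int() and float() accept only characters carrying the Unicode *decimal* property (the Nd category), while characters that merely have a *numeric* value are rejected. A quick sketch of that distinction:

```python
import unicodedata

assert unicodedata.decimal('٣') == 3            # Nd: accepted by int()
assert int('٣') == 3
assert unicodedata.decimal('一', None) is None  # numeric value only...
assert unicodedata.numeric('一') == 1.0
try:
    int('一')                                   # ...so int() rejects it
except ValueError:
    print('int() rejects Han numerals')
```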
But you should be able to write:
text = input("Enter a number using your preferred digits: ")
num = float(text)
without caring whether the user enters 一.一 or 1.1 or something else.
yes, from a logical point of view this can happen. But I really doubt there are users who would want to input a number like that. It means they would first use the Google Pinyin input method to enter 一, then switch to an English input method to enter the full stop, then switch back to Google Pinyin for the other 一; or maybe you mean they input the whole 一.一 with the Google Pinyin input method. To input 1, a user only needs one keystroke, but to input 一, they need three (yi SPACE). Of course, users can also input something accidentally, but then we just need to give them some kind of reminder. At least the coders around me restrict their users to inputting numbers in ASCII, and it seems that users are still happy with ASCII-style numbers :).

br, khy
On Tue, Nov 30, 2010 at 9:56 AM, haiyang kang <cornsea@gmail.com> wrote:
But you should be able to write:
text = input("Enter a number using your preferred digits: ")
num = float(text)
without caring whether the user enters 一.一 or 1.1 or something else.
yes. from logical point of view, this can happen. ...
Please stop discussing a non-feature. Python's float *does not* accept ' 一.一'. This was reported as a bug and closed as invalid. See "makeunicodedata.py does not support Unihan digit data" http://bugs.python.org/issue10575
Alexander Belopolsky <alexander.belopolsky@gmail.com> wrote:
On Tue, Nov 30, 2010 at 9:56 AM, haiyang kang <cornsea@gmail.com> wrote:
But you should be able to write:
text = input("Enter a number using your preferred digits: ")
num = float(text)
without caring whether the user enters 一.一 or 1.1 or something else.
yes. from logical point of view, this can happen. ...
Please stop discussing a non-feature. Python's float *does not* accept ' 一.一'. This was reported as a bug and closed as invalid.
That seems irrelevant to me. One of the main topics of this thread is whether actual native speakers would be happy with ascii-only input for float(). haiyang kang confirmed that this is the case. I hope that more native speakers will contribute their views. Stefan Krah
haiyang kang <cornsea@gmail.com> writes:
I think it is a little ugly to have code like this: num = float("一.一"), expected result is: num = 1.1
That's a straw man, though. The string need not be a literal in the program; it can be input to the program. num = float(input_from_the_external_world) Does that change your assessment of whether non-ASCII digits are used? -- \ “The greatest tragedy in mankind's entire history may be the | `\ hijacking of morality by religion.” —Arthur C. Clarke, 1991 | _o__) | Ben Finney
Am 30.11.2010 21:24, schrieb Ben Finney:
haiyang kang <cornsea@gmail.com> writes:
I think it is a little ugly to have code like this: num = float("一.一"), expected result is: num = 1.1
That's a straw man, though. The string need not be a literal in the program; it can be input to the program.
num = float(input_from_the_external_world)
Does that change your assessment of whether non-ASCII digits are used?
I think the OP (haiyang kang) already indicated that he finds it quite unlikely that anybody would possibly want to enter that. You would need a number of key strokes to enter each individual ideograph, plus you have to press the keys for keyboard layout switching to enter the Latin decimal separator (which you normally wouldn't use along with the Han numerals). Regards, Martin
"Martin v. Löwis" <martin@v.loewis.de> writes:
Am 30.11.2010 21:24, schrieb Ben Finney:
The string need not be a literal in the program; it can be input to the program.
num = float(input_from_the_external_world)
Does that change your assessment of whether non-ASCII digits are used?
I think the OP (haiyang kang) already indicated that he finds it quite unlikely that anybody would possibly want to enter that.
Who's talking about *entering* it into the program at a keyboard directly, though? Input to a program can come from all kinds of crazy sources. Just because it wasn't typed by the person at the keyboard using this program doesn't stop it being input to the program. A concrete example, but certainly not the only possible case: non-ASCII digit characters representing integers, stored as text in a file. Note that I'm not saying this is common. Nor am I saying it's a desirable situation. I'm saying it is a feasible use case, to be dismissed only if there is strong evidence that it's not used by existing Python code. -- \ “When a well-packaged web of lies has been sold to the masses | `\ over generations, the truth will seem utterly preposterous and | _o__) its speaker a raving lunatic.” —Dresden James | Ben Finney
I think the OP (haiyang kang) already indicated that he finds it quite unlikely that anybody would possibly want to enter that.
Who's talking about *entering* it into the program at a keyboard directly, though? Input to a program can come from all kinds of crazy sources. Just because it wasn't typed by the person at the keyboard using this program doesn't stop it being input to the program.
I think haiyang kang claimed exactly that - it won't ever be input to a program. I trust him on that - and so should you, unless you have sufficient experience with the Chinese language and writing system.
Note that I'm not saying this is common. Nor am I saying it's a desirable situation. I'm saying it is a feasible use case, to be dismissed only if there is strong evidence that it's not used by existing Python code.
And indeed, for the Chinese numerals, we have such strong evidence. Regards, Martin
Martin v. Löwis wrote:
I think the OP (haiyang kang) already indicated that he finds it quite unlikely that anybody would possibly want to enter that. Who's talking about *entering* it into the program at a keyboard directly, though? Input to a program can come from all kinds of crazy sources. Just because it wasn't typed by the person at the keyboard using this program doesn't stop it being input to the program.
I think haiyang kang claimed exactly that - it won't ever be input to a program. I trust him on that - and so should you, unless you have sufficient experience with the Chinese language and writing system.
Note that I'm not saying this is common. Nor am I saying it's a desirable situation. I'm saying it is a feasible use case, to be dismissed only if there is strong evidence that it's not used by existing Python code.
And indeed, for the Chinese numerals, we have such strong evidence.
With full respect to haiyang kang, hear-say from one person can hardly be described as "strong" evidence -- particularly, as Alexander Belopolsky pointed out, the use-case described isn't currently supported by Python. Given that what haiyang kang describes *can't* be done, the fact that people don't do it is hardly surprising -- nor is it a good reason for taking away functionality that does exist. -- Steven
Steven D'Aprano writes:
With full respect to haiyang kang, hear-say from one person can hardly be described as "strong" evidence
That's *disrespectful* nonsense. What Haiyang reported was not hearsay, it's direct observation of what he sees around him and personal experience, plus extrapolation. Look up "hearsay," please. Furthermore, he provided good *objective* reason (excessive cost, to which I can also testify, in several different input methods for Japanese) why numbers simply would not be input that way. What's left is copy/paste via the mouse. I assure you, every day I see dozens of Japanese copy/pasting *only* ASCII numerals, and the sales figures for Microsoft Excel (not to mention the download numbers for Open Office) strongly suggest that 30 million Japanese salarymen are similarly dedicated to ASCII. (That's not "hearsay" either, that's direct observation and extrapolation, which is more than the "we need float to translate Arabic" supporters can offer.) I have seen only *one* use case: it's a toy for sophisticated programmers who want to think of themselves as broadminded. We've seen several examples of that in this thread, so I can't deny that is a real use case. Please, give us just *one* more real use case that isn't "somebody might".
"Stephen J. Turnbull" <stephen@xemacs.org> writes:
Furthermore, he provided good *objective* reason (excessive cost, to which I can also testify, in several different input methods for Japanese) why numbers simply would not be input that way.
What's left is copy/paste via the mouse.
For direct entry by an interactive user, yes. Why are some people in this discussion thinking only of direct entry by an interactive user? Input to a program comes from various sources other than direct entry by the interactive user, as has been pointed out many times.
Please, give us just *one* more real use case that isn't "somebody might".
Input from an existing text file, as I said earlier. Or any other way of text data making its way into a Python program. Direct entry at the console is a red herring. -- \ “First things first, but not necessarily in that order.” —The | `\ Doctor, _Doctor Who_ | _o__) | Ben Finney
Ben Finney writes:
Input from an existing text file, as I said earlier. Or any other way of text data making its way into a Python program.
Direct entry at the console is a red herring.
I don't think it is. Not at all. Here's why: '''print "%d" % some_integer''' doesn't now, and never will (unless Kristan gets his Python 2.8<wink>), produce Arabic or Han numerals. Not in any language I know of, not in Microsoft Excel, and definitely not in Python 2. *Somebody* typed that text at some point. If it's Han, that somebody had *way* too much time on his hands, not a working accountant nor a graduate assistant in a research lab for sure. How about old archived texts, copied and recopied? At least for Japanese, old archival (text) data will *all* be in ASCII, because the earliest implementations of Japanese language text used JIS X 0201 (or its predecessor), which doesn't have Han digits (and kana digits don't exist even if you write with a brush and ink AFAIK). Ditto Arabic, I would imagine; ISO 8859/6 (aka Latin/Arabic) does not contain the Arabic digits that have been presented here earlier AFAICT. Note that there's plenty of space for them in that code table (eg, 0xB0-0xB9 is empty). Apparently nobody *ever* thought it was useful to have them! So, which culture, using which script and in which application, inputs numeric data in other than ASCII digits? Or would want to, if only somebody would tell them they can do it in Python? Hearsay will do, for starters.
Stephen J. Turnbull:
Here's why: '''print "%d" % some_integer''' doesn't now, and never will (unless Kristan gets his Python 2.8<wink>), produce Arabic or Han numerals. Not in any language I know of, not in Microsoft Excel, and definitely not in Python 2.
While I don't have Excel to test with, OpenOffice.org Calc will display in Arabic or Han numerals using the NatNum format codes. http://www.scintilla.org/ArabicNumbers.png
Ditto Arabic, I would imagine; ISO 8859/6 (aka Latin/Arabic) does not contain the Arabic digits that have been presented here earlier AFAICT. Note that there's plenty of space for them in that code table (eg, 0xB0-0xB9 is empty). Apparently nobody *ever* thought it was useful to have them!
DOS code page 864 does use 0xB0-0xB9 for ٠ .. ٩. http://www.ascii.ca/cp864.htm Neil
Neil Hodgson writes:
While I don't have Excel to test with, OpenOffice.org Calc will display in Arabic or Han numerals using the NatNum format codes.
Display is different from input, but at least this is concrete evidence. Will it accept Arabic on input? (Han might be too much to ask for since Unicode considers Han digits to be "impure".)
Ditto Arabic, I would imagine; ISO 8859/6 (aka Latin/Arabic) does not contain the Arabic digits that have been presented here earlier AFAICT.
DOS code page 864 does use 0xB0-0xB9
OK, Microsoft thought it would be useful. I'd still like to know whether people actually use them for input (or output, for that matter -- anybody have a corpus of Arabic Form 10-Ks to grep through?), but that's more concrete evidence than we've seen before. Thank you!
Stephen J. Turnbull:
Will it accept Arabic on input? (Han might be too much to ask for since Unicode considers Han digits to be "impure".)
I couldn't find a direct way to input Arabic digits into OO Calc, the normal use of Alt+number didn't work in Calc although it did in WordPad where Alt+1632 is ٠ and so on. OO Calc does have settings in the Complex Text Layout section for choosing different numerals but I don't understand the interaction of choices here. Neil
Am 02.12.2010 03:01, schrieb Ben Finney:
"Stephen J. Turnbull" <stephen@xemacs.org> writes:
Furthermore, he provided good *objective* reason (excessive cost, to which I can also testify, in several different input methods for Japanese) why numbers simply would not be input that way.
What's left is copy/paste via the mouse.
For direct entry by an interactive user, yes. Why are some people in this discussion thinking only of direct entry by an interactive user?
Ultimately, somebody will have entered the data.
Input from an existing text file, as I said earlier.
Which *specific* existing text file? Have you actually *seen* such a text file?
Direct entry at the console is a red herring.
And we don't need powerhouses because power comes out of the socket. Regards, Martin
"Martin v. Löwis" wrote:
[...] For direct entry by an interactive user, yes. Why are some people in this discussion thinking only of direct entry by an interactive user?
Ultimately, somebody will have entered the data.
I don't think you really believe that all data processed by a computer was eventually manually entered by someone :-) I already gave you a couple of examples of how such data can end up being input for Python number constructors. If you are still curious, please see the Wikipedia pages I linked to, or have a look at these keyboards: http://en.wikipedia.org/wiki/File:KB_Arabic_MAC.svg http://en.wikipedia.org/wiki/File:Keyboard_Layout_Sanskrit.png http://en.wikipedia.org/wiki/File:800px-KB_Thai_Kedmanee.png http://en.wikipedia.org/wiki/File:Tibetan_Keyboard.png http://en.wikipedia.org/wiki/File:KBD-DZ-noshift-2009.png (all referenced on http://en.wikipedia.org/wiki/Keyboard_layout) and then compare these to: http://www.unicode.org/Public/5.2.0/ucd/extracted/DerivedNumericType.txt Arabic numerals are being used a lot nowadays in Asian countries, but that doesn't mean that the native script versions are not being used anymore. Furthermore, data can well originate from texts that were written hundreds or even thousands of years ago, so there is plenty of material available for processing. Even if not entered directly, there are plenty of ways to convert Arabic numerals (or other numeral systems) to the above forms, e.g. in MS Office for Thai: http://office.microsoft.com/en-us/excel-help/convert-arabic-numbers-to-thai-... Anyway, as mentioned before: all this is really besides the point: If we want to support Unicode in Python, we have to also support conversion of numerals declared in Unicode into a form that can be processed by Python. Regardless of where such data originates. If we were not to follow this approach, we could just as well decide not to support reading Egyptian Hieroglyphs based on the argument that there's no keyboard to enter them... http://www.unicode.org/charts/PDF/U13000.pdf :-) (from http://www.unicode.org/charts/)
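As a side note, the digits found on several of the layouts above all carry the Unicode decimal property, so the current int() already accepts them directly (a sketch; the characters are ARABIC-INDIC, DEVANAGARI, THAI and TIBETAN DIGIT FIVE):

```python
import unicodedata

# Each of these has UCD numeric type 'decimal' (category Nd),
# so int() converts them without any locale machinery.
for five in ('٥', '५', '๕', '༥'):
    print(unicodedata.name(five), unicodedata.decimal(five))

assert all(int(five) == 5 for five in ('٥', '५', '๕', '༥'))
```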
Input from an existing text file, as I said earlier.
Which *specific* existing text file? Have you actually *seen* such a text file?
Have you tried Google ? http://www.google.com/search?q=١٢٣ http://www.google.com/search?q=٣+site%3Agov.lb Some examples: http://www.bdl.gov.lb/circ/intpdf/int123.pdf http://www.cdr.gov.lb/study/sdatl/Arabic/Chapter3.PDF http://www.batroun.gov.lb/PDF/Waredat2006.pdf (these all use http://en.wikipedia.org/wiki/Eastern_Arabic_numerals)
Direct entry at the console is a red herring.
And we don't need powerhouses because power comes out of the socket.
Martin, the argument simply doesn't fit well with the discussion about Python and Unicode. We introduced Unicode in Python not because there was a need for each and every code point in Unicode, but because we wanted to adopt a standard which doesn't prefer any one way of writing things over another. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Dec 02 2010)
Python/Zope Consulting and Support ... http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/
Arabic numerals are being used a lot nowadays in Asian countries, but that doesn't mean that the native script versions are not being used anymore.
I never claimed that people are not using their local scripts to enter numbers. However, none of your examples is about Chinese numerals using an ASCII full stop as a decimal point. The only thing I claimed about usage (actually only repeating haiyang kang's earlier claim) is that nobody would enter Chinese numerals with a keyboard and then use full stop as the decimal separator. So all your counter-examples just don't apply - I don't deny them. Regards, Martin
On Thu, Dec 2, 2010 at 4:14 PM, M.-A. Lemburg <mal@egenix.com> wrote: ..
Have you tried Google?
I tried Google and I could not find any plain text or HTML file that uses Arabic-Indic numerals. What was interesting, though, was that a search for "quran unicode" (without quotes) brought me to http://www.sacred-texts.com, which says that they've been using Unicode since 2002 in their archives. Interestingly enough, their version of the Qur'an uses ordinary digits for ayah numbers. See, for example, <http://www.sacred-texts.com/isl/uq/050.htm>.

I will change my mind on this issue when you present a machine-readable file with Arabic-Indic numerals and a program capable of reading it, and show that this program uses the same number-parsing algorithm as Python's int() or float().
Alexander Belopolsky wrote:
On Thu, Dec 2, 2010 at 4:14 PM, M.-A. Lemburg <mal@egenix.com> wrote: ..
Have you tried Google?
I tried Google and I could not find any plain text or HTML file that uses Arabic-Indic numerals. What was interesting, though, was that a search for "quran unicode" (without quotes) brought me to http://www.sacred-texts.com, which says that they've been using Unicode since 2002 in their archives. Interestingly enough, their version of the Qur'an uses ordinary digits for ayah numbers. See, for example, <http://www.sacred-texts.com/isl/uq/050.htm>.

I will change my mind on this issue when you present a machine-readable file with Arabic-Indic numerals and a program capable of reading it, and show that this program uses the same number-parsing algorithm as Python's int() or float().
Have you had a look at the examples I posted? They include texts and tables with numbers written using Eastern Arabic numerals.

Here's an example of a famous Chinese text using Chinese numerals: http://ctext.org/nine-chapters

Unfortunately, the Chinese numerals are not listed in the category "Nd", so Python won't be able to parse them. This has various reasons, it seems, one of them being that the numeral code points were not defined as a range of code points. I'm sure you can find other books on mathematics in Sanskrit or Arabic scripts as well.

But this whole branch of the discussion is not going to go anywhere. The point is that we support all of Unicode in Python, not just a fragment, and therefore the numeric constructors support all of Unicode. Using them, it's very easy to support numbers in all kinds of variants, whether bound to a locale or not.

Adding more locale-aware numeric parsers and formatters to the locale module, based on these APIs, is certainly a good idea, but orthogonal to the ongoing discussion, IMO.

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Dec 02 2010)
On Thu, Dec 2, 2010 at 5:58 PM, M.-A. Lemburg <mal@egenix.com> wrote: ..
I will change my mind on this issue when you present a machine-readable file with Arabic-Indic numerals and a program capable of reading it and show that this program uses the same number parsing algorithm as Python's int() or float().
Have you had a look at the examples I posted? They include texts and tables with numbers written using Eastern Arabic numerals.
Yes, but this was all about output. I am pretty sure TeX was able to typeset the Qur'an in all its glory long before Unicode was invented. Yet, in machine-readable form it would be something like {\quran 1} (an invented directive). I have asked for a file that is intended for machine processing, not for human enjoyment in print or on a display. I claim that if such a file exists, the program that reads it does not use the same rules as Python, and converting non-ASCII digits would be a tiny portion of what that program does.
Alexander Belopolsky wrote:
On Thu, Dec 2, 2010 at 5:58 PM, M.-A. Lemburg <mal@egenix.com> wrote: ..
I will change my mind on this issue when you present a machine-readable file with Arabic-Indic numerals and a program capable of reading it and show that this program uses the same number parsing algorithm as Python's int() or float().
Have you had a look at the examples I posted? They include texts and tables with numbers written using Eastern Arabic numerals.
Yes, but this was all about output. I am pretty sure TeX was able to typeset the Qur'an in all its glory long before Unicode was invented. Yet, in machine-readable form it would be something like {\quran 1} (an invented directive). I have asked for a file that is intended for machine processing, not for human enjoyment in print or on a display. I claim that if such a file exists, the program that reads it does not use the same rules as Python, and converting non-ASCII digits would be a tiny portion of what that program does.
Well, programs that take input from the keyboards I posted in this thread will have to deal with the digits. Since Python's input() accepts keyboard input, you have your use case :-)

Seriously, I find the distinction between input and output forms of numerals somewhat misguided. Any output can also serve as input. For books and other printed material, images, etc. you have scanners and OCR. For screen output you have screen readers. For spreadsheets and data, you have CSV, TSV, XML, etc. etc. etc.

Just for the fun of it, I created a CSV file with Thai and Dzongkha numerals (in addition to Arabic ones) using OpenOffice. Here's the cut and paste version:

"""
Numbers in various scripts
Arabic Thai Dzongkha
1 ๑ ༡
2 ๒ ༢
3 ๓ ༣
4 ๔ ༤
5 ๕ ༥
6 ๖ ༦
7 ๗ ༧
8 ๘ ༨
9 ๙ ༩
10 ๑๐ ༡༠
11 ๑๑ ༡༡
12 ๑๒ ༡༢
13 ๑๓ ༡༣
14 ๑๔ ༡༤
15 ๑๕ ༡༥
16 ๑๖ ༡༦
17 ๑๗ ༡༧
18 ๑๘ ༡༨
19 ๑๙ ༡༩
20 ๒๐ ༢༠
"""

And here's the script that goes with it:

import csv
c = csv.reader(open('Numbers-in-various-scripts.csv'))
headers = [c.next() for i in range(3)]
while c:
    print [int(unicode(x, 'utf-8')) for x in c.next()]

and the output using Python 2.7:

[1, 1, 1]
[2, 2, 2]
[3, 3, 3]
[4, 4, 4]
[5, 5, 5]
[6, 6, 6]
[7, 7, 7]
[8, 8, 8]
[9, 9, 9]
[10, 10, 10]
[11, 11, 11]
[12, 12, 12]
[13, 13, 13]
[14, 14, 14]
[15, 15, 15]
[16, 16, 16]
[17, 17, 17]
[18, 18, 18]
[19, 19, 19]
[20, 20, 20]

If you need more such files, I can generate as many as you like ;-) I can send the OOo file as well, if you like to play around with it.

I'd say: case closed :-)

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Dec 03 2010)
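For readers following along on Python 3, here is a minimal sketch of the same experiment. The in-memory data below is an illustrative stand-in for the OpenOffice-generated file (shortened and comma-separated for the sake of the sketch), not the actual file from the email:

```python
import csv
import io

# Shortened stand-in for the "Numbers-in-various-scripts" CSV file.
data = """\
Arabic,Thai,Dzongkha
1,๑,༡
2,๒,༢
10,๑๐,༡༠
20,๒๐,༢༠
"""

rows = list(csv.reader(io.StringIO(data)))
header, body = rows[0], rows[1:]

# int() accepts any string of Unicode category-Nd digits, so each
# script's column parses the same way.
parsed = [[int(cell) for cell in row] for row in body]
```

The point carries over unchanged: no locale configuration is needed for int() to read the Thai and Dzongkha columns.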
The point is that we support all of Unicode in Python, not just a fragment, and therefore the numeric constructors support all of Unicode.
That conclusion is as false today as it was in Python 1.6; only now are people starting to care about it.

a) We don't support all of Unicode in the numeric constructors. There are lots of things that you can write down that readers would recognize as a real/rational/integral number that float() won't parse.

b) If float() restricted itself to the scientific notation of real numbers (as it should), Python could still claim to support all of Unicode.
Adding more locale aware numeric parsers and formatters to the locale module, based on these APIs is certainly a good idea, but orthogonal to the ongoing discussion, IMO.
Not at all. The concept of "Unicode numbers" is flawed: Unicode does *not* prescribe any specific way to denote numbers. Unicode is about characters, and Python supports the Unicode characters for digits as well as it supports all the other Unicode characters. Instead, support for non-scientific notation of real numbers should be based on user needs, which probably can be approximated by looking at actual scripts. This, in turn, is inherently locale-dependent. Regards, Martin
On Thu, Dec 2, 2010 at 4:14 PM, M.-A. Lemburg <mal@egenix.com> wrote: ..
Some examples:
I looked at this one more closely. While I cannot understand what it says, it appears that Arabic numerals are used in dates. It looks like Python won't be able to deal with those:
datetime.strptime('١٩٩٩/١٠/٢٩', '%Y/%m/%d')
..
ValueError: time data '١٩٩٩/١٠/٢٩' does not match format '%Y/%m/%d'
Interestingly,
datetime.strptime('١٩٩٩', '%Y')
datetime.datetime(1999, 1, 1, 0, 0)
which further suggests that support of such numerals is accidental. As I think more about it, though, I am becoming less averse to accepting these numerals for base 10 integers. Integers can be easily extracted from text using a simple regex, and '\d' accepts all category Nd characters. I would require, though, that all digits be from the same block, which is not hard because Unicode now promises to only have them in contiguous blocks of 10. This rule addresses some of the security issues, because it is unlikely that a system that can display some of the local digits would not be able to display all of them properly. I still don't think it makes any sense to accept them in float().
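The regex extraction and the proposed same-block rule described above can be sketched in a few lines. The `same_block` helper name is illustrative, not an existing API; it relies on the stated guarantee that decimal digits sit in contiguous runs of 10, so every digit of a well-formed number shares the code point of its script's DIGIT ZERO:

```python
import re
import unicodedata

# In Python 3, '\d' in a str pattern matches any Unicode
# category-Nd digit by default, so Eastern Arabic digits are
# extracted alongside ASCII ones.
text = 'dated ١٩٩٩ and 2010'
numbers = re.findall(r'\d+', text)

def same_block(digits):
    # Subtract each digit's decimal value from its code point to
    # recover the DIGIT ZERO anchoring its run; a single-script
    # number yields exactly one anchor.
    zero_points = {ord(c) - unicodedata.decimal(c) for c in digits}
    return len(zero_points) == 1
```

A mixed string such as '١9' fails the check, while both '١٩٩٩' and '2010' pass and convert with int().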
On 12/2/2010 6:54 PM, Alexander Belopolsky wrote:
On Thu, Dec 2, 2010 at 4:14 PM, M.-A. Lemburg<mal@egenix.com> wrote: ..
Some examples:
I looked at this one more closely. While I cannot understand what it says, it appears that Arabic numerals are used in dates. It looks like Python won't be able to deal with those:
When I travelled in S. Asia around 25 years ago, Arabic and Indic numerals were in obvious use in stores, road signs, and banks (as with money exchange receipts). I learned the digits partly for self-protection ;-). I have no real idea of what is done *now* in computerized business, but I assume the native digits are used. It may well be that there is no Python software yet that operates with native digits. The lack of direct output capability would hinder that. Of course, someone could run both input and output through language-specific str.translate digit translators.
datetime.strptime('١٩٩٩/١٠/٢٩', '%Y/%m/%d')
Googling ١٩٩٩ gets about 83,000 hits.
.. ValueError: time data '١٩٩٩/١٠/٢٩' does not match format '%Y/%m/%d'
Interestingly,
datetime.strptime('١٩٩٩', '%Y') datetime.datetime(1999, 1, 1, 0, 0)
which further suggests that support of such numerals is accidental.
As I think more about it, though, I am becoming less averse to accepting these numerals for base 10 integers.
Both input and output are needed for educational programming, though translation tables might be enough.
Integers can be easily extracted from text using simple regex and '\d' accepts all category Nd characters. I would require though that all digits be from the same block, which is not hard because Unicode now promises to only have them in contiguous blocks of 10.
That seems sensible.
This rule seems to address some of security issues because it is unlikely that a system that can display some of the local digits would not be able to display all of them properly.
I still don't think it makes any sense to accept them in float().
For the present, I would pretty well agree with that, at least until we know more. You have raised an important issue. It is a bit of a chicken-and-egg problem, though: we will not really know what is needed until Python is used more in non-English/non-European contexts, while such usage may await better support.

-- Terry Jan Reedy
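The language-specific str.translate digit translators Terry mentions can be sketched as follows. The Eastern Arabic digit row is real; the helper names are made up for illustration:

```python
# Translation tables between ASCII digits and Eastern Arabic digits.
ARABIC_DIGITS = '٠١٢٣٤٥٦٧٨٩'
to_ascii = str.maketrans(ARABIC_DIGITS, '0123456789')
to_arabic = str.maketrans('0123456789', ARABIC_DIGITS)

def read_int(text):
    # Normalize native digits to ASCII before parsing.
    return int(text.translate(to_ascii))

def show_int(n):
    # Render the ASCII repr back in native digits for output.
    return str(n).translate(to_arabic)
```

The same pattern works for any script whose digits Unicode lists as a contiguous Nd run; only the ten-character source string changes.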
Furthermore, data can well originate from texts that were written hundreds or even thousands of years ago, so there is plenty of material available for processing.
Hmm, for this I think we need a specially tuned language-processing system to handle it, with one subsystem per language :)... (sometimes a single word is not enough; we also need context).

Take pi for example: in modern math it is written as 3.1415...; in old China it was sometimes written as 三一四一五 or 三点一四一五 or 叁点壹肆壹伍.

And if these texts are extracted through a scanner (OCR or other image-processing tech), in my POV it is the job of this image-processing subsystem (or some other subsystem between the image processing and the database) to do the mapping between number and raw text data. Example table in a DB:

text   | raw data   | raw image data
-------|------------|---------------
3.1415 | 三一四一五 | image...

br, khy
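The digit-by-digit style in khy's example (三一四一五) is in fact mechanically convertible even though int() rejects Han numerals: they are not category Nd, but the UCD still records numeric values for them. A hedged sketch (the helper name is illustrative, and positional forms using 十/百/千 place markers would need a real parser and are not handled):

```python
import unicodedata

def han_digit_string_to_int(s):
    # Map each Han numeral to its recorded numeric value, then
    # reassemble the decimal string: 三一四一五 -> '31415' -> 31415.
    return int(''.join(str(int(unicodedata.numeric(c))) for c in s))
```

This illustrates the point made later in the thread: the characters carry numeric data in the UCD, yet the builtins cannot use it because they key off the Nd category.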
Stephen J. Turnbull wrote:
Steven D'Aprano writes:
With full respect to haiyang kang, hear-say from one person can hardly be described as "strong" evidence
That's *disrespectful* nonsense. What Haiyang reported was not hearsay, it's direct observation of what he sees around him and personal experience, plus extrapolation. Look up "hearsay," please.
Fair enough. I chose my words poorly and apologise. A better description would be anecdotal evidence.

-- Steven
On Wed, Dec 1, 2010 at 5:36 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote: ..
Note that I'm not saying this is common. Nor am I saying it's a desirable situation. I'm saying it is a feasible use case, to be dismissed only if there is strong evidence that it's not used by existing Python code.
And indeed, for the Chinese numerals, we have such strong evidence.
Indeed: in the over 10 years that Python's int() has accepted Arabic-Indic numerals, nobody has complained that it *did not* accept Chinese.
"Martin v. Löwis" wrote:
Am 30.11.2010 21:24, schrieb Ben Finney:
haiyang kang <cornsea@gmail.com> writes:
I think it is a little ugly to have code like this: num = float("一.一"), expected result is: num = 1.1
That's a straw man, though. The string need not be a literal in the program; it can be input to the program.
num = float(input_from_the_external_world)
Does that change your assessment of whether non-ASCII digits are used?
I think the OP (haiyang kang) already indicated that he finds it quite unlikely that anybody would possibly want to enter that. You would need a number of key strokes to enter each individual ideograph, plus you have to press the keys for keyboard layout switching to enter the Latin decimal separator (which you normally wouldn't use along with the Han numerals).
That's a somewhat limited view, IMHO. Numbers are not always entered using a computer keyboard; you have tools like cash registers, special numeric keypads, scanners, OCR, etc. for external entry, and you also have other programs producing such output, e.g. MS Office if configured that way. The argument about the decimal point doesn't work well either, since it's obvious that float() and int() do not support localized input. E.g. in Germany we write 3,141 instead of 3.141:
float('3,141')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for float(): 3,141
No surprise there. The localization of the input data, e.g. removal of thousands separators and conversion of decimal marks to the dot, has to be done by the application, just like you have to do now for German floating point number literals. The locale module already has locale.atof() and locale.atoi() for just this purpose.

FYI, here's the list of decimal digits supported by Python 2.7, from http://www.unicode.org/Public/5.2.0/ucd/extracted/DerivedNumericType.txt:

"""
0030..0039   ; Decimal # Nd [10] DIGIT ZERO..DIGIT NINE
0660..0669   ; Decimal # Nd [10] ARABIC-INDIC DIGIT ZERO..ARABIC-INDIC DIGIT NINE
06F0..06F9   ; Decimal # Nd [10] EXTENDED ARABIC-INDIC DIGIT ZERO..EXTENDED ARABIC-INDIC DIGIT NINE
07C0..07C9   ; Decimal # Nd [10] NKO DIGIT ZERO..NKO DIGIT NINE
0966..096F   ; Decimal # Nd [10] DEVANAGARI DIGIT ZERO..DEVANAGARI DIGIT NINE
09E6..09EF   ; Decimal # Nd [10] BENGALI DIGIT ZERO..BENGALI DIGIT NINE
0A66..0A6F   ; Decimal # Nd [10] GURMUKHI DIGIT ZERO..GURMUKHI DIGIT NINE
0AE6..0AEF   ; Decimal # Nd [10] GUJARATI DIGIT ZERO..GUJARATI DIGIT NINE
0B66..0B6F   ; Decimal # Nd [10] ORIYA DIGIT ZERO..ORIYA DIGIT NINE
0BE6..0BEF   ; Decimal # Nd [10] TAMIL DIGIT ZERO..TAMIL DIGIT NINE
0C66..0C6F   ; Decimal # Nd [10] TELUGU DIGIT ZERO..TELUGU DIGIT NINE
0CE6..0CEF   ; Decimal # Nd [10] KANNADA DIGIT ZERO..KANNADA DIGIT NINE
0D66..0D6F   ; Decimal # Nd [10] MALAYALAM DIGIT ZERO..MALAYALAM DIGIT NINE
0E50..0E59   ; Decimal # Nd [10] THAI DIGIT ZERO..THAI DIGIT NINE
0ED0..0ED9   ; Decimal # Nd [10] LAO DIGIT ZERO..LAO DIGIT NINE
0F20..0F29   ; Decimal # Nd [10] TIBETAN DIGIT ZERO..TIBETAN DIGIT NINE
1040..1049   ; Decimal # Nd [10] MYANMAR DIGIT ZERO..MYANMAR DIGIT NINE
1090..1099   ; Decimal # Nd [10] MYANMAR SHAN DIGIT ZERO..MYANMAR SHAN DIGIT NINE
17E0..17E9   ; Decimal # Nd [10] KHMER DIGIT ZERO..KHMER DIGIT NINE
1810..1819   ; Decimal # Nd [10] MONGOLIAN DIGIT ZERO..MONGOLIAN DIGIT NINE
1946..194F   ; Decimal # Nd [10] LIMBU DIGIT ZERO..LIMBU DIGIT NINE
19D0..19DA   ; Decimal # Nd [11] NEW TAI LUE DIGIT ZERO..NEW TAI LUE THAM DIGIT ONE
1A80..1A89   ; Decimal # Nd [10] TAI THAM HORA DIGIT ZERO..TAI THAM HORA DIGIT NINE
1A90..1A99   ; Decimal # Nd [10] TAI THAM THAM DIGIT ZERO..TAI THAM THAM DIGIT NINE
1B50..1B59   ; Decimal # Nd [10] BALINESE DIGIT ZERO..BALINESE DIGIT NINE
1BB0..1BB9   ; Decimal # Nd [10] SUNDANESE DIGIT ZERO..SUNDANESE DIGIT NINE
1C40..1C49   ; Decimal # Nd [10] LEPCHA DIGIT ZERO..LEPCHA DIGIT NINE
1C50..1C59   ; Decimal # Nd [10] OL CHIKI DIGIT ZERO..OL CHIKI DIGIT NINE
A620..A629   ; Decimal # Nd [10] VAI DIGIT ZERO..VAI DIGIT NINE
A8D0..A8D9   ; Decimal # Nd [10] SAURASHTRA DIGIT ZERO..SAURASHTRA DIGIT NINE
A900..A909   ; Decimal # Nd [10] KAYAH LI DIGIT ZERO..KAYAH LI DIGIT NINE
A9D0..A9D9   ; Decimal # Nd [10] JAVANESE DIGIT ZERO..JAVANESE DIGIT NINE
AA50..AA59   ; Decimal # Nd [10] CHAM DIGIT ZERO..CHAM DIGIT NINE
ABF0..ABF9   ; Decimal # Nd [10] MEETEI MAYEK DIGIT ZERO..MEETEI MAYEK DIGIT NINE
FF10..FF19   ; Decimal # Nd [10] FULLWIDTH DIGIT ZERO..FULLWIDTH DIGIT NINE
104A0..104A9 ; Decimal # Nd [10] OSMANYA DIGIT ZERO..OSMANYA DIGIT NINE
1D7CE..1D7FF ; Decimal # Nd [50] MATHEMATICAL BOLD DIGIT ZERO..MATHEMATICAL MONOSPACE DIGIT NINE
"""

The Chinese and Japanese ideographs are not supported because of the way they are defined in the Unihan database. I'm currently investigating how we could support them as well.

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Dec 01 2010)
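To make the Nd list above concrete, a few spot checks run under Python 3 (the sample characters are picked arbitrarily from the listed ranges):

```python
import unicodedata

# One digit from several of the Nd ranges listed above.
samples = {
    '٣': 3,   # U+0663 ARABIC-INDIC DIGIT THREE
    '๗': 7,   # U+0E57 THAI DIGIT SEVEN
    '７': 7,  # U+FF17 FULLWIDTH DIGIT SEVEN
}
for ch, value in samples.items():
    assert unicodedata.category(ch) == 'Nd'
    assert unicodedata.decimal(ch) == value
    assert int(ch) == value        # int() accepts any Nd digit

# A numeral in Eastern Arabic digits with an ASCII decimal point
# also parses via float():
val = float('١٢٣٤.٥٦')
```

This is exactly the behaviour the thread is debating: the constructors accept any digit the UCD classifies as Decimal, with no locale involvement.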
Stephen J. Turnbull wrote:
Lennart Regebro writes:
*I* think it is more important. In python 3, you can never ever assume anything is ASCII any more.
Sure you can. In Python program text, all keywords will be ASCII (English, even, though it may be en_NL.UTF-8 <wink>) for the foreseeable future.
I see no reason not to make a similar promise for numeric literals. I see no good reason to allow compatibility full-width Japanese "ASCII" numerals or Arabic cursive numerals in "for i in range(...)" for example.
I agree with you that numeric *literals* should be restricted to the ASCII digits. I don't think anyone here is arguing differently -- if they are, they should speak up and try to make the case for allowing numeric literals in arbitrary scripts. Python doesn't currently allow non-ASCII numeric literals, and even if such a change were desirable, it would run up against the moratorium. So let's just forget the specter of code like:

x = math.sqrt(١٢٣٤.٥٦ ** 一.一)

It ain't gonna happen :)

But I think there is a good case for allowing the constructors int, float and complex to continue to accept numeric *strings* with non-ASCII digits. The code already exists, there are probably people out there who rely on it, and in the absence of any convincing demonstration that the existing behaviour is causing widespread difficulty, we should leave well enough alone.

Various people have suggested that there should be a function in the locale module that handles numeric string input in non-ASCII digits. This is a de facto admission that there are use-cases for taking user input like the string '٣' and turning it into the int 3. Python can already do this, and has been able to for many years:

[steve@sylar ~]$ python2.4
Python 2.4.6 (#1, Mar 30 2009, 10:08:01)
[GCC 4.1.2 20070925 (Red Hat 4.1.2-27)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

int(u'٣')
3

It seems to me that there's no need to move this functionality into locale.

-- Steven
On Wed, 01 Dec 2010 00:23:22 +1100 Steven D'Aprano <steve@pearwood.info> wrote:
But I think there is a good case for allowing the constructors int, float and complex to continue to accept numeric *strings* with non-ASCII digits. The code already exists, there's probably people out there who rely on it, and in the absence of any convincing demonstration that the existing behaviour is causing widespread difficulty, we should leave well-enough alone.
+1
It seems to me that there's no need to move this functionality into locale.
Not only that, but moving it into locale won't make it easier to maintain anyway.

Regards
Antoine.
On 11/30/2010 3:23 AM, Stephen J. Turnbull wrote:
I see no reason not to make a similar promise for numeric literals. I see no good reason to allow compatibility full-width Japanese "ASCII" numerals or Arabic cursive numerals in "for i in range(...)" for example.
I do not think that anyone, at least not me, has argued for anything other than 0-9 digits (or 0-f for hex) in literals in program code. The only issue is whether non-programmer *users* should be able to use their native digits in applications in response to input prompts. -- Terry Jan Reedy
Am 30.11.2010 23:43, schrieb Terry Reedy:
On 11/30/2010 3:23 AM, Stephen J. Turnbull wrote:
I see no reason not to make a similar promise for numeric literals. I see no good reason to allow compatibility full-width Japanese "ASCII" numerals or Arabic cursive numerals in "for i in range(...)" for example.
I do not think that anyone, at least not me, has argued for anything other than 0-9 digits (or 0-f for hex) in literals in program code. The only issue is whether non-programmer *users* should be able to use their native digits in applications in response to input prompts.
And here, my observation stands: if they wanted to, they currently couldn't - at least not for real numbers (and also not for integers if they want to use grouping). So the presumed application of this feature doesn't actually work, despite the presence of the feature it was supposedly meant to enable. Regards, Martin
Martin v. Löwis wrote:
Am 30.11.2010 23:43, schrieb Terry Reedy:
On 11/30/2010 3:23 AM, Stephen J. Turnbull wrote:
I see no reason not to make a similar promise for numeric literals. I see no good reason to allow compatibility full-width Japanese "ASCII" numerals or Arabic cursive numerals in "for i in range(...)" for example. I do not think that anyone, at least not me, has argued for anything other than 0-9 digits (or 0-f for hex) in literals in program code. The only issue is whether non-programmer *users* should be able to use their native digits in applications in response to input prompts.
And here, my observation stands: if they wanted to, they currently couldn't - at least not for real numbers (and also not for integers if they want to use grouping). So the presumed application of this feature doesn't actually work, despite the presence of the feature it was supposedly meant to enable.
By that argument, English speakers wanting to enter integers using Arabic numerals can't either!

I'd like to use grouping for large literals, if only I could think of a half-decent syntax, and if only Python supported it. This fails on both counts:

x = 123_456_789_012_345

The lack of grouping and the lack of a native decimal point doesn't mean that the feature "doesn't work" -- it merely means the feature requires some compromise before it can be used. In the same way, if I wanted to enter a number using non-Arabic digits, it works provided I compromise by using the Anglo-American decimal point instead of the European comma or the native decimal point I might prefer.

The lack of support for non-dot decimal points is arguably a bug that should be fixed, not a reason to remove functionality.

-- Steven
And here, my observation stands: if they wanted to, they currently couldn't - at least not for real numbers (and also not for integers if they want to use grouping). So the presumed application of this feature doesn't actually work, despite the presence of the feature it was supposedly meant to enable.
By that argument, English speakers wanting to enter integers using Arabic numerals can't either!
That's correct, and the key point here for the argument. It's just not *meant* to support localized number forms, but deliberately constrains them to a formal grammar which users using it must be aware of in order to use it.
I'd like to use grouping for large literals, if only I could think of a half-decent syntax, and if only Python supported it. This fails on both counts:
x = 123_456_789_012_345
Here you are confusing issues, though: this fragment uses the syntax of the Python programming language. Whether or not the syntax of the float() constructor arguments matches that syntax is also a subject of the debate. I take it that you speak in favor of the float syntax also being used for the float() constructor.
The lack of grouping and the lack of a native decimal point doesn't mean that the feature "doesn't work" -- it merely means the feature requires some compromise before it can be used.
No, it means that the Python programming language syntax for floating point numbers just doesn't take local notation into account *at all*. This is not a flaw - it just means that this feature is non-existent. Now, for the float() constructor, some people in this thread have claimed that it *is* aimed at people who want to enter numbers in their local spellings. I claim that this feature either doesn't work, or is absent also.
In the same way, if I wanted to enter a number using non-Arabic digits, it works provided I compromise by using the Anglo-American decimal point instead of the European comma or the native decimal point I might prefer.
Why would you want that, if, what you really wanted, could not be done. There certainly *is* a way to convert strings into floats, and there would be a way if that restricted itself to the digits 0..9. So it can't be the mere desire to convert strings to float that make you ask for non-ASCII digits.
The lack of support for non-dot decimal points is arguably a bug that should be fixed, not a reason to remove functionality.
I keep repeating my two concerns:

a) If that was a feature, it is not specified at all in the documentation. In fact, the documentation was recently clarified to deny the existence of that feature.

b) Fixing it will be much more difficult than you apparently think.

Regards, Martin
Martin v. Löwis wrote:
And here, my observation stands: if they wanted to, they currently couldn't - at least not for real numbers (and also not for integers if they want to use grouping). So the presumed application of this feature doesn't actually work, despite the presence of the feature it was supposedly meant to enable. By that argument, English speakers wanting to enter integers using Arabic numerals can't either!
That's correct, and the key point here for the argument. It's just not *meant* to support localized number forms, but deliberately constrains them to a formal grammar which users using it must be aware of in order to use it.
You're *agreeing* that English speakers can't enter integers using Arabic numerals? What do you think I'm doing when I do this?
int("1234")
1234
Ah wait... did you think I meant Arabic numerals in the sense of digits used by Arabs in Arabia? I meant Arabic numerals as opposed to Roman numerals. Sorry for the confusion. Your argument was that even though Python's int() supports many non-ASCII digits, the lack of grouping means that it "doesn't actually work". If that argument were correct, then it applies equally to ASCII digits as well. It's clearly nonsense to say that int("1234") "doesn't work" just because of the lack of grouping. It's equally nonsense to say that int("١٢٣٤") "doesn't work" because of the lack of grouping. [...]
I take it that you speak in favor of the float syntax also being used for the float() constructor.
I'm sorry, I don't understand what you mean here. I've repeatedly said that the syntax for numeric literals should remain constrained to the ASCII digits, as it currently is:

n = ١٢٣٤

gives a SyntaxError, and I don't want to see that change. But I've also argued that the numeric constructors currently accept non-ASCII strings:

n = int("١٢٣٤")

and we should continue to support the existing behaviour. None of the arguments against it seem convincing to me, particularly since the opponents of the current behaviour admit that there is a use-case for it, but they just want it to move elsewhere, such as the locale module. We've even heard from one person -- I forget who, sorry -- who claimed that C++ has the same behaviour, and if you want ASCII-only digits, you have to explicitly ask for it.

For what it's worth, Microsoft warns developers not to assume users will enter numeric data using ASCII digits:

"Number representation can also use non-ASCII native digits, so your application may encounter characters other than 0-9 as inputs. Avoid filtering on U+0030 through U+0039 to prevent frustration for users who are trying to enter data using non-ASCII digits."

http://msdn.microsoft.com/en-us/magazine/cc163506.aspx

There was a similar discussion going on in Perl-land recently:

http://www.nntp.perl.org/group/perl.perl5.porters/2010/07/msg162400.html

although, being Perl, the discussion was dominated by concerns about regexes and implicit conversions, rather than an explicit call to float() or int() as we are discussing here.

[...]
In the same way, if I wanted to enter a number using non-Arabic digits, it works provided I compromise by using the Anglo-American decimal point instead of the European comma or the native decimal point I might prefer.
Why would you want that, if, what you really wanted, could not be done. There certainly *is* a way to convert strings into floats, and there would be a way if that restricted itself to the digits 0..9. So it can't be the mere desire to convert strings to float that make you ask for non-ASCII digits.
Why do Europeans use programming languages that force them to use a dot instead of a comma for the decimal place? Why do I misspell string.centre as string.center? Because if you want to get something done, you use the tools you have and not the tools you'd like to have. -- Steven
On Wed, Dec 1, 2010 at 7:17 PM, Steven D'Aprano <steve@pearwood.info> wrote: ..
we should continue to support the existing behaviour. None of the arguments against it seem convincing to me, particularly since the opponents of the current behaviour admit that there is a use-case for it, but they just want it to move elsewhere, such as the locale module.
I don't remember who made this argument, but I think you misunderstood it. The argument was that if there was a use case for parsing Eastern Arabic numerals, it would be better served by a module written by someone who speaks one of the Arabic languages and knows the details of how Eastern Arabic numerals are written. So far nobody has even claimed to know conclusively that Arabic-Indic digits are always written left-to-right.
unicodedata.bidirectional('٤') 'AN'
is not very helpful because it means "any Arabic-Indic digit" according to unicode.org. (To me, a special category hints that it may be written in either direction and the proper interpretation may depend on context.) I have not seen a real use case reported in this thread and for theoretical use cases, the current implementation is either outright wrong or does not solve the problem completely. Given that a function that replaces all Unicode digits in a string with 0-9 can be written in 3 lines of Python code, it is very unlikely that anyone would prefer to rely on undocumented behavior of Python builtins instead of having explicit control over parsing of their data.
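[Editor's note: the "3 lines of Python" mentioned above might look something like this sketch, built on unicodedata; the function name is invented for illustration.]

```python
import unicodedata

def ascii_digits(text):
    # Map every character that has a Unicode decimal value to the
    # corresponding ASCII digit; pass all other characters through.
    return ''.join(str(unicodedata.decimal(ch))
                   if unicodedata.decimal(ch, None) is not None else ch
                   for ch in text)

print(ascii_digits('\u0661\u0662\u0663\u0664.\u0665\u0666'))  # 1234.56
```

This gives the programmer explicit control over which characters are normalized, rather than relying on the builtins' undocumented behaviour.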
On 12/1/2010 7:44 PM, Alexander Belopolsky wrote:
it. The argument was that if there was a use case for parsing Eastern Arabic numerals, it would be better served by a module written by someone who speaks one of the Arabic languages and knows the details of how Eastern Arabic numerals are written. So far nobody has even claimed to know conclusively that Arabic-Indic digits are always written left-to-right.
Both my personal observations when travelling from Turkey to India and Wikipedia say yes. "When representing a number in Arabic, the lowest-valued position is placed on the right, so the order of positions is the same as in left-to-right scripts." https://secure.wikimedia.org/wikipedia/en/wiki/Arabic_language#Numerals -- Terry Jan Reedy
On Wed, Dec 1, 2010 at 10:11 PM, Terry Reedy <tjreedy@udel.edu> wrote:
On 12/1/2010 7:44 PM, Alexander Belopolsky wrote:
it. The argument was that if there was a use case for parsing Eastern Arabic numerals, it would be better served by a module written by someone who speaks one of the Arabic languages and knows the details of how Eastern Arabic numerals are written. So far nobody has even claimed to know conclusively that Arabic-Indic digits are always written left-to-right.
Both my personal observations when travelling from Turkey to India and Wikipedia say yes. "When representing a number in Arabic, the lowest-valued position is placed on the right, so the order of positions is the same as in left-to-right scripts." https://secure.wikimedia.org/wikipedia/en/wiki/Arabic_language#Numerals
This matches my limited research on this topic as well. However, I am not sure that when these codes are embedded in Arabic text, their logical order always matches their display order. It seems to me that it can go either way depending on the surrounding text and/or presence of explicit formatting codes. Also, I don't understand why Eastern Arabic-Indic digits have the same Bidi-Class as European digits, but Arabic-Indic digits, Arabic decimal and thousands separators have Bidi-Class "AN". http://www.unicode.org/reports/tr9/tr9-23.html#Bidirectional_Character_Types
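[Editor's note: the Bidi-Class inconsistency described above is easy to reproduce with unicodedata; values shown are per the Unicode Character Database.]

```python
import unicodedata

# European digits and Extended (Eastern) Arabic-Indic digits are
# bidi class 'EN', but Arabic-Indic digits and the Arabic decimal
# separator get the special class 'AN'.
print(unicodedata.bidirectional('1'))       # 'EN'  DIGIT ONE
print(unicodedata.bidirectional('\u06f1'))  # 'EN'  EXTENDED ARABIC-INDIC DIGIT ONE
print(unicodedata.bidirectional('\u0661'))  # 'AN'  ARABIC-INDIC DIGIT ONE
print(unicodedata.bidirectional('\u066b'))  # 'AN'  ARABIC DECIMAL SEPARATOR
```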
On Wed, 1 Dec 2010 22:28:49 -0500 Alexander Belopolsky <alexander.belopolsky@gmail.com> wrote:
Both my personal observations when travelling from Turkey to India and Wikipedia say yes. "When representing a number in Arabic, the lowest-valued position is placed on the right, so the order of positions is the same as in left-to-right scripts." https://secure.wikimedia.org/wikipedia/en/wiki/Arabic_language#Numerals
This matches my limited research on this topic as well. However, I am not sure that when these codes are embedded in Arabic text, their logical order always matches their display order.
That shouldn't matter, since unicode text follows logical order. The display order is up to the graphical representation library. Regards Antoine.
On Thu, Dec 2, 2010 at 8:36 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Wed, 1 Dec 2010 22:28:49 -0500 Alexander Belopolsky <alexander.belopolsky@gmail.com> wrote: ..
This matches my limited research on this topic as well. However, I am not sure that when these codes are embedded in Arabic text, their logical order always matches their display order.
That shouldn't matter, since unicode text follows logical order. The display order is up to the graphical representation library.
I am not so sure. On my Mac, U+200F (RIGHT-TO-LEFT MARK) affects 0-9 and Arabic-Indic decimals differently:
print('\u200F123') 123 print('\u200F\u0661\u0662\u0663') 231
I replaced Arabic-Indic decimals with 0-9 in the output to demonstrate the point. Cut-n-paste does not work well in the presence of RTL directives. And U+202E (RIGHT-TO-LEFT OVERRIDE) reverses the display order for both:
print('\u202E123') 321 print('\u202E\u0661\u0662\u0663') 321
(Again, the displayed output is simulated, not copied.) I don't know if explicit RTL directives are ever used in Arabic texts, but it is quite possible that texts converted from older formats would use them for efficiency. Note that my point is not to find the correct answer here, but to demonstrate that we as a group don't have the expertise to get parsing of Arabic text right. If we've got it right for Arabic, it is by chance and not by design. This still leaves us with 41 other types of digits for at least 30 different languages. Nobody will ever assume that Python builtins are suitable for use with all these variants. This "feature" is only good for nefarious purposes such as hiding extra digits in innocent-looking files or smuggling binary data through naive interfaces. PS: BTW, shouldn't int('\u0661\u0662\u06DD') be valid? Or is it int('\u06DD\u0661\u0662')?
Le jeudi 02 décembre 2010 à 11:41 -0500, Alexander Belopolsky a écrit :
Note that my point is not to find the correct answer here, but to demonstrate that we as a group don't have the expertise to get parsing of Arabic text right.
I don't understand why you think Arabic or Hebrew text is any different from Western text. Surely right-to-left isn't more conceptually complicated than left-to-right, is it? The fact that mixed rtl + ltr can render bizarrely or is awkward to cut and paste is quite off-topic for our discussion.
If we've got it right for Arabic, it is by chance and not by design. This still leaves us with 41 other types of digits for at least 30 different languages.
So why do you trust the Unicode standard on other things and not on this one? Regards Antoine.
On Thu, Dec 2, 2010 at 11:56 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Le jeudi 02 décembre 2010 à 11:41 -0500, Alexander Belopolsky a écrit :
Note that my point is not to find the correct answer here, but to demonstrate that we as a group don't have the expertise to get parsing of Arabic text right.
I don't understand why you think Arabic or Hebrew text is any different from Western text. Surely right-to-left isn't more conceptually complicated than left-to-right, is it?
No, but a mix of LTR and RTL is certainly more difficult than either of the two. I invite you to digest Unicode Standard Annex #9 before we continue this discussion. See <http://unicode.org/reports/tr9/>.
The fact that mixed rtl + ltr can render bizarrely or is awkward to cut and paste is quite off-topic for our discussion.
No, it is not. One of the invented use cases in this thread was naive users' desire to enter numbers using their preferred local decimals. The same users may want to be able to cut and paste their decimals as well. More importantly, however, legacy formats may not have support for mixed-direction text and may require that "John is 41" be stored as "41 si nhoJ"; a Unicode converter would turn that into "[RTL]John is 14", which will still display as "41 si nhoJ", but int(s[-2:]) will return 14, not 41.
If we've got it right for Arabic, it is by chance and not by design. This still leaves us with 41 other types of digits for at least 30 different languages.
So why do you trust the Unicode standard on other things and not on this one?
What other things? As far as I understand, the only str method that was designed to comply with Unicode recommendations was str.isidentifier(). And we have some really bizarre results:
'\u2164'.isidentifier() True '\u2164'.isalpha() False
and can you describe the difference between str.isdigit() and str.isdecimal()? According to the reference manual, """ str.isdecimal() Return true if all characters in the string are decimal characters and there is at least one character, false otherwise. Decimal characters include digit characters, and all characters that can be used to form decimal-radix numbers, e.g. U+0660, ARABIC-INDIC DIGIT ZERO. str.isdigit() Return true if all characters in the string are digits and there is at least one character, false otherwise. """ http://docs.python.org/dev/library/stdtypes.html#str.isdecimal Since U+0660 is mentioned in the first definition and not in the second, I might conclude that it is not a digit, but
'\u0660'.isdigit() True
If you know the correct answer, please contribute it here: <http://bugs.python.org/issue10587>.
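[Editor's note: for reference, here is how the three predicates actually behave on a few of the characters mentioned in this message, checked against CPython 3.x.]

```python
# Roughly: decimal digits are a subset of digits (which add e.g.
# superscripts), which are a subset of numeric characters (which add
# e.g. Roman numerals and fractions).
for ch, name in [('5', 'DIGIT FIVE'),
                 ('\u0660', 'ARABIC-INDIC DIGIT ZERO'),
                 ('\u00b2', 'SUPERSCRIPT TWO'),
                 ('\u2164', 'ROMAN NUMERAL FIVE')]:
    print(name, ch.isdecimal(), ch.isdigit(), ch.isnumeric())
# DIGIT FIVE              True  True  True
# ARABIC-INDIC DIGIT ZERO True  True  True
# SUPERSCRIPT TWO         False True  True
# ROMAN NUMERAL FIVE      False False True
```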
Le jeudi 02 décembre 2010 à 13:14 -0500, Alexander Belopolsky a écrit :
I don't understand why you think Arabic or Hebrew text is any different from Western text. Surely right-to-left isn't more conceptually complicated than left-to-right, is it?
No, but a mix of LTR and RTL is certainly more difficult than either of the two. I invite you to digest Unicode Standard Annex #9 before we continue this discussion.
“This annex describes specifications for the *positioning* of characters flowing from right to left” (emphasis mine) Looks like something for implementors of rendering engines, which python-dev is not AFAICT.
Same users may want to be able to cut and paste their decimals as well. More importantly, however, legacy formats may not have support for mixed-direction text and may require that "John is 41" be stored as "41 si nhoJ" and Unicode converter would turn it into "[RTL]John is 14" that will still display as "41 si nhoJ", but int(s[-2:]) will return 14, not 41.
The legacy format argument looks like a red herring to me. When converting from one format to another, it is the programmer's job to do his/her job right.
If we've got it right for Arabic, it is by chance and not by design. This still leaves us with 41 other types of digits for at least 30 different languages.
So why do you trust the Unicode standard on other things and not on this one?
What other things?
Everything which the Unicode database stores and that we already rely on.
As far as I understand the only str method that was designed to comply with Unicode recomendations was str.isidentifier().
I don't think so. str.split() and str.splitlines() are also defined in conformance to the SPEC, AFAIK. They certainly try to. And, outside of str itself, the re module tries to follow Unicode categories as well (for example, "\d" should match non-ASCII digits). Regards Antoine.
On Thu, Dec 2, 2010 at 1:55 PM, Antoine Pitrou <solipsis@pitrou.net> wrote: ..
I don't think so. str.split() and str.splitlines() are also defined in conformance to the SPEC, AFAIK. They certainly try to.
You are joking, right? Where exactly does Unicode specify something like this:
''.join('𐌀𐌁𐌂'.split('\udf00\ud800')) '𐌁𐌂' ?
OK, splitting on a given separator has very little to do with Unicode or the UCD, but str.splitlines() makes absolutely no attempt to conform to Unicode Standard Annex #14 ("Unicode Line Breaking Algorithm"). Wait, UAX #14 is actually relevant to the textwrap module, which has seen very little change since the 2.x days. So, what exactly does str.splitlines() do? And which part of the Unicode standard defines how it is different from str.split(..., '\n')? The reference manual does not help me here either: """ str.splitlines([keepends]) Return a list of the lines in the string, breaking at line boundaries. Line breaks are not included in the resulting list unless keepends is given and true. """ http://docs.python.org/dev/library/stdtypes.html#str.splitlines
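[Editor's note: the difference is observable even if the docs don't spell it out -- str.splitlines() breaks on a whole set of line-boundary characters, not just '\n'. A quick check in CPython 3.x:]

```python
# '\n' newline, U+2028 LINE SEPARATOR, U+001D GROUP SEPARATOR:
# splitlines() treats all three as line boundaries; split('\n') only one.
s = 'a\nb\u2028c\x1dd'
print(s.splitlines())  # ['a', 'b', 'c', 'd']
print(s.split('\n'))   # ['a', 'b\u2028c\x1dd']
```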
Le jeudi 02 décembre 2010 à 16:34 -0500, Alexander Belopolsky a écrit :
On Thu, Dec 2, 2010 at 1:55 PM, Antoine Pitrou <solipsis@pitrou.net> wrote: ..
I don't think so. str.split() and str.splitlines() are also defined in conformance to the SPEC, AFAIK. They certainly try to.
You are joking, right?
Perhaps you could look at the implementation.
Antoine Pitrou writes:
The legacy format argument looks like a red herring to me. When converting from one format to another, it is the programmer's job to do his/her job right.
Uhmmmmmm, the argument *for* this "feature" proposed by several people is that Python's numeric constructors do it (right) so that the programmer doesn't have to. If Python *doesn't* do it right, why should Python do it at all?
Le vendredi 03 décembre 2010 à 13:58 +0900, Stephen J. Turnbull a écrit :
Antoine Pitrou writes:
The legacy format argument looks like a red herring to me. When converting from one format to another, it is the programmer's job to do his/her job right.
Uhmmmmmm, the argument *for* this "feature" proposed by several people is that Python's numeric constructors do it (right) so that the programmer doesn't have to.
As far as I understand, Alexander was talking about a legacy pre-unicode text format. We don't have to support this. Regards Antoine.
Antoine Pitrou writes:
Le vendredi 03 décembre 2010 à 13:58 +0900, Stephen J. Turnbull a écrit :
Antoine Pitrou writes:
The legacy format argument looks like a red herring to me. When converting from one format to another, it is the programmer's job to do his/her job right.
Uhmmmmmm, the argument *for* this "feature" proposed by several people is that Python's numeric constructors do it (right) so that the programmer doesn't have to.
As far as I understand, Alexander was talking about a legacy pre-unicode text format. We don't have to support this.
*I* didn't say we *should* support it. I'm saying that *others'* argument for not restricting the formats accepted by string-to-number converters to something well-defined and, AFAIK, universally understood by users (developers of Python programs *and* end-users) is that we *already* support this. Alexander, Martin, and I are basically just pointing out that no, the "support" we have via the built-in numeric constructors is incomplete and nonconforming. We feel that is a bug to be fixed by (1) implementing the definition as currently found in the documents, and (2) moving the non-ASCII support to another module (or, as a compromise, supporting non-ASCII digits via an argument to the built-ins -- that was my proposal; I don't know if Alexander or Martin would find it acceptable). Given that some committers (MAL, you?) don't even consider that accepting and converting a string containing digits from multiple scripts as a single number is a bug, I'd really rather that this bug/feature not be embedded in the interpreter. I suppose that, as a built-in rather than syntax, technically it doesn't fall under the moratorium, but it makes me nervous....
Le samedi 04 décembre 2010 à 17:13 +0900, Stephen J. Turnbull a écrit :
Antoine Pitrou writes:
Le vendredi 03 décembre 2010 à 13:58 +0900, Stephen J. Turnbull a écrit :
Antoine Pitrou writes:
The legacy format argument looks like a red herring to me. When converting from one format to another, it is the programmer's job to do his/her job right.
Uhmmmmmm, the argument *for* this "feature" proposed by several people is that Python's numeric constructors do it (right) so that the programmer doesn't have to.
As far as I understand, Alexander was talking about a legacy pre-unicode text format. We don't have to support this.
*I* didn't say we *should* support it. I'm saying that *others'* argument for not restricting the formats accepted by string-to-number converters to something well-defined and AFAIK universally understood by users (developers of Python programs *and* end-users) is that we *already* support this.
As far as I can parse your sentence, I think you are mistaken. Regards Antoine.
Terry Reedy wrote:
On 11/30/2010 3:23 AM, Stephen J. Turnbull wrote:
I see no reason not to make a similar promise for numeric literals. I see no good reason to allow compatibility full-width Japanese "ASCII" numerals or Arabic cursive numerals in "for i in range(...)" for example.
I do not think that anyone, at least not me, has argued for anything other than 0-9 digits (or 0-f for hex) in literals in program code. The only issue is whether non-programmer *users* should be able to use their native digits in applications in response to input prompts.
Me neither. This is solely about Python being able to parse numeric input in the float(), int() and complex() constructors. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Dec 01 2010)
On Tue, Nov 30, 2010 at 09:23, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Sure you can. In Python program text, all keywords will be ASCII
Yes, yes, sure, but not the contents of variables,
I see no reason not to make a similar promise for numeric literals.
Wait what, literals? The example was
float('١٢٣٤.٥٦')
Which doesn't have any numeric literals in it at all. Does that work as a literal? Nope, it's a syntax error. Too bad, that would have been cool, but whatever. Why would this be a problem:
T1234 = float('١٢٣٤.٥٦') T1234 1234.56
But this OK?
T١٢٣٤ = float('1234.56') T١٢٣٤ 1234.56
I don't see that. Should we bother to implement ١٢٣٤.٥٦ as a literal equivalent to 1234.56? Well, not unless somebody askes for it, or it turns out to be easy. :-) But that's another question.
Lennart Regebro writes:
On Tue, Nov 30, 2010 at 09:23, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Sure you can. In Python program text, all keywords will be ASCII
Yes, yes, sure, but not the contents of variables,
Irrelevant, you're not converting these to a string representation. If you're generating numerals for internal use, I don't see why you would want to do arithmetic on them; conversion is a YAGNI. This is only interesting to allow naive users to input in a comfortable way. As yet there is no evidence that there are *any* such naive users, 1.3 billion "possibles" are shut out, and at least two cultures which use non-ASCII numerals every day, representing 1.3 billion naive users (the coincidence of numbers is no coincidence), have reported that nobody in their right mind would *input* the numbers that way, and at least for Japanese, the use cases are not really numeric anyway.
I see no reason not to make a similar promise for numeric literals.
Wait what, literals?
Sorry, my bad.
Why would this be a problem:
T1234 = float('١٢٣٤.٥٦') T1234 1234.56
But this OK?
T١٢٣٤ = float('1234.56') T١٢٣٤ 1234.56
(Sorry, the Arabic is going to get munged, my mailer is beta and somebody screwed up.) Because the characters in the identifier are uninterpreted and have no syntactic content other than their identity. They're arbitrary. That's not true of numerics. Because that works, but print(T1234) doesn't (it prints ASCII). You can't round-trip, but users will want/expect that. Because that works but this doesn't: T1000 = float('一.◯◯◯') Violates TOOWTDI. If you're proposing to fix the numeric parsers, I still don't like it but I could go to -0 on it. However as Alexander points out and MAL admits, it's apparently not so easy to do that.
2010/12/2 Stephen J. Turnbull <stephen@xemacs.org>:
Because that works, but
print(T1234)
doesn't (it prints ASCII). You can't round-trip, but users will want/expect that.
You should be able to round-trip, absolutely. I don't think you should expect print() to do that. str(56) possibly. :) That's an argument for it to be in a module, as you would then need to pass in a parameter saying which decimal digits you want.
T1000 = float('一.◯◯◯')
That was already discussed here, and it's clear that unicode does not consider these characters to be something you can use in a decimal number, and hence it's not broken.
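[Editor's note: the module approach Lennart describes, with the digit set as an explicit parameter, might look like this hypothetical sketch -- the name and API are invented for illustration, not proposed in the thread.]

```python
# The ten Arabic-Indic digits U+0660..U+0669, used as a default digit set.
ARABIC_INDIC = '\u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669'

def format_number(n, digits=ARABIC_INDIC):
    # str(n) produces ASCII digits; map each one into the chosen
    # digit set, leaving '.' and '-' untouched.
    return ''.join(digits[int(ch)] if ch.isdigit() else ch for ch in str(n))

print(format_number(1234))     # ١٢٣٤
print(format_number(1234.56))  # ١٢٣٤.٥٦
```

Paired with an equally explicit parser, this would give the round-trip behaviour without baking any particular digit set into the builtins.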
Lennart Regebro writes:
2010/12/2 Stephen J. Turnbull <stephen@xemacs.org>:
T1000 = float('一.◯◯◯')
That was already discussed here, and it's clear that unicode does not consider these characters to be something you can use in a decimal number, and hence it's not broken.
Huh? IOW, use Unicode features just because they're there, what the users want and use doesn't matter? The only evidence I've seen so far that this feature is anything but a toy for a small faction of developers is Neil Hodgson's information that OOo will generate these kinds of digits (note that it *will* do Han! so the evidence is as good for users demanding Han numerals as for any other kind, Unicode.org definitions notwithstanding), and that DOS CP 864 contains the Indo-Arabic versions. Of course, it's quite possible that those were toys for the developers of those software packages too.
participants (24)
- "Martin v. Löwis"
- Alexander Belopolsky
- Antoine Pitrou
- Ben Finney
- Benjamin Peterson
- Eric Smith
- Georg Brandl
- Guido van Rossum
- Hagen Fürstenau
- haiyang kang
- James Y Knight
- Joao S. O. Bueno
- Lennart Regebro
- M.-A. Lemburg
- Mark Dickinson
- Michael Foord
- Neil Hodgson
- Nick Coghlan
- Stefan Krah
- Stephen J. Turnbull
- Steven D'Aprano
- Terry Reedy
- Tim Lesher
- Vlastimil Brom