Re: When is a number not a number

Looks like a bug to me. For reasons I don't yet understand, the int type check in objectify's type guesser (see https://lxml.de/objectify.html#how-data-types-are-matched) does not fail for this input:
However:
Probably a bug in _checkNumber(): https://github.com/lxml/lxml/blob/d01872ccdf7e1e5e825b6c6292b43e7d27ae5fc4/s...

Best regards,
Holger

On March 1, 2023 3:15:22 PM UTC, Holger.Joukl@LBBW.de wrote:
Probably a bug in _checkNumber(): https://github.com/lxml/lxml/blob/d01872ccdf7e1e5e825b6c6292b43e7d27ae5fc4/s...
Ah, yes, it might be the isdigit() check, actually. That could be too broad. Not every digit is a valid part of a number.

Thanks for the report and the investigation. I'll try a fix when I get to it.

Stefan

On 2 March 2023 at 08:50, Stefan Behnel wrote:
According to the XML Schema 1.1 spec, it's really just [0-9] that we should detect: https://www.w3.org/TR/xmlschema11-2/#decimal

I'll remove the ".isdigit()" check altogether and only leave the '0-9' comparison in there. Even when we're parsing Unicode strings, we should only care about XML numbers, not everything that Python accepts.

Stefan
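To illustrate the idea of comparing against '0'-'9' only, here is a minimal sketch in plain Python. The helper name looks_like_xsd_integer is hypothetical and this is not the code of lxml's _checkNumber(); it only assumes the rule stated above, that an XML integer consists of an optional sign followed by ASCII digits.

```python
def looks_like_xsd_integer(text: str) -> bool:
    """Hypothetical helper, NOT lxml's actual _checkNumber().

    Accepts an optional sign followed by one or more ASCII digits,
    ignoring surrounding XML whitespace (space, tab, CR, LF).
    """
    s = text.strip(" \t\r\n")
    if s and s[0] in "+-":
        s = s[1:]
    # Compare against the ASCII range explicitly instead of using
    # str.isdigit(), which also matches other Unicode digits.
    return bool(s) and all("0" <= ch <= "9" for ch in s)


print(looks_like_xsd_integer("42"))       # True
print(looks_like_xsd_integer(" -7\n"))    # True
print(looks_like_xsd_integer("\u00b2"))   # False (SUPERSCRIPT TWO)
print(looks_like_xsd_integer("\u0663"))   # False (ARABIC-INDIC DIGIT THREE)
```

Note that the last input is one that Python's int() would happily parse, which is the "not everything that Python accepts" case.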

On 3 March 2023 at 09:00, Stefan Behnel wrote:
https://github.com/lxml/lxml/commit/3d4e60f2835e4d85fd357c182656d3eca534f2ff

Stefan

On Wed, Mar 01, 2023 at 03:15:22PM +0000, Holger.Joukl@LBBW.de wrote:
ValueError: invalid literal for int() with base 10: '²²²²²²²²²²'
Probably a bug in _checkNumber(): https://github.com/lxml/lxml/blob/d01872ccdf7e1e5e825b6c6292b43e7d27ae5fc4/s...
str.isdigit() accepts many Unicode characters classified as digits that int() rejects.

Marius Gedminas
--
Please note that I only check linux-utf8 on Tuesdays when they happen on a tenth of December, so please CC me with any replies.
-- Juliusz Chroboczek
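The mismatch between str.isdigit() and int() is easy to reproduce with the standard library alone. The small helper int_accepts below is a name made up for this illustration; the sample characters are an ASCII digit, the superscript two from the traceback, and an Arabic-Indic digit.

```python
def int_accepts(text: str) -> bool:
    """Return True if int() can parse the text (illustrative helper)."""
    try:
        int(text)
    except ValueError:
        return False
    return True


# Columns: character, str.isdigit(), str.isdecimal(), int() accepts it
for ch in ("7", "\u00b2", "\u0663"):
    print(repr(ch), ch.isdigit(), ch.isdecimal(), int_accepts(ch))

# '7' True True True
# '²' True False False
# '٣' True True True
```

So the superscript two passes isdigit() but is rejected by int(), while the Arabic-Indic digit is accepted by both even though it is not in [0-9].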
