Re: When is a number not a number
Hi, does anyone think this needs to be posted to the bug tracker?
lxml.objectify seems to identify a string of superscript digits as an integer, but then throws an exception when the value is actually parsed.
Thanks
Alex
from lxml import objectify

xml = """
<types>
    <mysuperscript>²²²²²²²²²²</mysuperscript>
</types>
"""

doc = objectify.fromstring(xml)
print(objectify.dump(doc))
Traceback (most recent call last):
  File "**********.py", line 11, in <module>
    print(objectify.dump(doc))
          ^^^^^^^^^^^^^^^^^^^
  File "src/lxml/objectify.pyx", line 1521, in lxml.objectify.dump
  File "src/lxml/objectify.pyx", line 1549, in lxml.objectify._dump
  File "src/lxml/objectify.pyx", line 1526, in lxml.objectify._dump
  File "src/lxml/objectify.pyx", line 646, in lxml.objectify.NumberElement.__repr__
  File "src/lxml/objectify.pyx", line 946, in lxml.objectify._parseNumber
ValueError: invalid literal for int() with base 10: '²²²²²²²²²²'
Looks like a bug to me. For reasons I don't yet understand, the int type check in objectify's type guesser (see https://lxml.de/objectify.html#how-data-types-are-matched) does not fail for this input:
>>> objectify.getRegisteredTypes()
[PyType(int, IntElement), PyType(float, FloatElement), PyType(bool, BoolElement), PyType(long, IntElement), PyType(str, StringElement), PyType(NoneType, NoneElement), PyType(none, NoneElement)]
>>> objectify.getRegisteredTypes()[0]
PyType(int, IntElement)
>>> print(objectify.getRegisteredTypes()[0].type_check("222"))
None
>>> print(objectify.getRegisteredTypes()[0].type_check("²²²²²²²²²²"))  # Should raise!
None
>>> print(objectify.getRegisteredTypes()[0].type_check("abcd"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "stringsource", line 67, in cfunc.to_py.__Pyx_CFunc_object____object___to_py.wrap
  File "src/lxml/objectify.pyx", line 1054, in lxml.objectify._checkInt
  File "src/lxml/objectify.pyx", line 1047, in lxml.objectify._checkNumber
ValueError
However:
>>> int("²²²²²²²²²²")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '²²²²²²²²²²'
Probably a bug in _checkNumber(): https://github.com/lxml/lxml/blob/d01872ccdf7e1e5e825b6c6292b43e7d27ae5fc4/s...
Best regards,
Holger
On March 1, 2023 3:15:22 PM UTC, Holger.Joukl@LBBW.de wrote:
> Probably a bug in _checkNumber(): https://github.com/lxml/lxml/blob/d01872ccdf7e1e5e825b6c6292b43e7d27ae5fc4/s...
Ah, yes, it might actually be the isdigit() check. That could be too broad: not every character that counts as a digit is a valid part of a number.
Thanks for the report and the investigation. I'll try a fix when I get to it.
Stefan
On 02.03.23 at 08:50, Stefan Behnel wrote:
> Ah, yes, it might be the isdigit() check, actually. That could be too broad. Not every digit is a valid part of a number.
> Thanks for the report and the investigation. I'll try a fix when I get to it.
According to the XML Schema 1.1 spec, it's really just [0-9] that we should detect.
https://www.w3.org/TR/xmlschema11-2/#decimal
I'll remove the ".isdigit()" check altogether and leave only the '0-9' comparison in there. Even when we're parsing Unicode strings, we should only care about XML numbers, not everything that Python accepts.
Stefan
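For readers who want to see what an ASCII-only check along those lines could look like, here is a minimal sketch. It is not the code from lxml: the helper name `looks_like_xsd_integer` and the regular expression are assumptions made for illustration, based on the optional-sign-plus-[0-9] lexical form of integers in XML Schema. Surrounding whitespace is deliberately not handled.

```python
import re

# Hypothetical helper for illustration only; NOT lxml's implementation.
# [0-9] is spelled out on purpose: in str patterns, \d also matches
# non-ASCII Unicode decimal digits, which is exactly what we want to avoid.
_XSD_INTEGER = re.compile(r"[+-]?[0-9]+")


def looks_like_xsd_integer(text: str) -> bool:
    """Return True if text is an optional sign followed by ASCII digits only."""
    return _XSD_INTEGER.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ["222", "-42", "\u00b2" * 10, "abcd", ""]:
        print(repr(candidate), looks_like_xsd_integer(candidate))
```

With a check like this, the superscript string from the report would be rejected up front and treated as a plain string instead of failing later in int().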
On 03.03.23 at 09:00, Stefan Behnel wrote:
> According to the XML Schema 1.1 spec, it's really just [0-9] that we should detect.
> https://www.w3.org/TR/xmlschema11-2/#decimal
> I'll remove the ".isdigit()" check altogether and leave only the '0-9' comparison in there. Even when we're parsing Unicode strings, we should only care about XML numbers, not everything that Python accepts.
https://github.com/lxml/lxml/commit/3d4e60f2835e4d85fd357c182656d3eca534f2ff
Stefan
On Wed, Mar 01, 2023 at 03:15:22PM +0000, Holger.Joukl@LBBW.de wrote:
> ValueError: invalid literal for int() with base 10: '²²²²²²²²²²'
> Probably a bug in _checkNumber(): https://github.com/lxml/lxml/blob/d01872ccdf7e1e5e825b6c6292b43e7d27ae5fc4/s...
str.isdigit() accepts many Unicode characters classified as digits that int() rejects.
Marius Gedminas
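The mismatch is easy to reproduce with nothing but built-in string methods. The following is a small sketch; the sample characters other than '²' and the helper name `int_accepts` are chosen here for illustration and do not come from the thread.

```python
# Compare how str.isdigit(), str.isdecimal() and int() treat a few characters.
samples = {
    "ASCII digit": "2",
    "superscript two": "\u00b2",           # the character from the report
    "Arabic-Indic digit three": "\u0663",
}


def int_accepts(text):
    """Return True if int() can parse text, False if it raises ValueError."""
    try:
        int(text)
    except ValueError:
        return False
    return True


# Map each sample name to (isdigit, isdecimal, int() accepts).
results = {
    name: (ch.isdigit(), ch.isdecimal(), int_accepts(ch))
    for name, ch in samples.items()
}

for name, (is_digit, is_decimal, parses) in results.items():
    print(f"{name}: isdigit={is_digit} isdecimal={is_decimal} int()={parses}")
```

Only the superscript passes isdigit() while failing int(). Note that int() still accepts non-ASCII decimal digits such as the Arabic-Indic one, so switching to isdecimal() alone would remain broader than the [0-9] range that XML Schema allows.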
participants (3)
- Holger.Joukl@LBBW.de
- Marius Gedminas
- Stefan Behnel