[Python-checkins] bpo-31170: Write unit test for Expat 2.2.4 UTF-8 bug (#3570)

Victor Stinner webhook-mailer at python.org
Mon Sep 25 04:27:37 EDT 2017


https://github.com/python/cpython/commit/e6d9fcbb8d0c325e57df08ae8781aafedb71eca2
commit: e6d9fcbb8d0c325e57df08ae8781aafedb71eca2
branch: master
author: Victor Stinner <victor.stinner at gmail.com>
committer: GitHub <noreply at github.com>
date: 2017-09-25T01:27:34-07:00
summary:

bpo-31170: Write unit test for Expat 2.2.4 UTF-8 bug (#3570)

Non-regression tests for the Expat 2.2.3 UTF-8 decoder bug.

files:
A Lib/test/xmltestdata/expat224_utf8_bug.xml
M Lib/test/test_xml_etree.py

diff --git a/Lib/test/test_xml_etree.py b/Lib/test/test_xml_etree.py
index baa4e1f5342..661ad8b9d4d 100644
--- a/Lib/test/test_xml_etree.py
+++ b/Lib/test/test_xml_etree.py
@@ -34,6 +34,7 @@
 except UnicodeEncodeError:
     raise unittest.SkipTest("filename is not encodable to utf8")
 SIMPLE_NS_XMLFILE = findfile("simple-ns.xml", subdir="xmltestdata")
+UTF8_BUG_XMLFILE = findfile("expat224_utf8_bug.xml", subdir="xmltestdata")
 
 SAMPLE_XML = """\
 <body>
@@ -1739,6 +1740,37 @@ def __eq__(self, other):
         self.assertIsInstance(e[0].tag, str)
         self.assertEqual(e[0].tag, 'changed')
 
+    def check_expat224_utf8_bug(self, text):
+        xml = b'<a b="%s"/>' % text
+        root = ET.XML(xml)
+        self.assertEqual(root.get('b'), text.decode('utf-8'))
+
+    def test_expat224_utf8_bug(self):
+        # bpo-31170: Expat 2.2.3 had a bug in its UTF-8 decoder.
+        # Check that Expat 2.2.4 fixed the bug.
+        #
+        # Test buffer bounds at odd and even positions.
+
+        text = b'\xc3\xa0' * 1024
+        self.check_expat224_utf8_bug(text)
+
+        text = b'x' + b'\xc3\xa0' * 1024
+        self.check_expat224_utf8_bug(text)
+
+    def test_expat224_utf8_bug_file(self):
+        with open(UTF8_BUG_XMLFILE, 'rb') as fp:
+            raw = fp.read()
+        root = ET.fromstring(raw)
+        xmlattr = root.get('b')
+
+        # "Parse" manually the XML file to extract the value of the 'b'
+        # attribute of the <a b='xxx' /> XML element
+        text = raw.decode('utf-8').strip()
+        text = text.replace('\r\n', ' ')
+        text = text[6:-4]
+        self.assertEqual(root.get('b'), text)
+
+
 
 # --------------------------------------------------------------------
 
diff --git a/Lib/test/xmltestdata/expat224_utf8_bug.xml b/Lib/test/xmltestdata/expat224_utf8_bug.xml
new file mode 100644
index 00000000000..d66a8e6b50f
--- /dev/null
+++ b/Lib/test/xmltestdata/expat224_utf8_bug.xml
@@ -0,0 +1,2 @@
+<a b='01234567890123456古人咏雪抽幽思骋妍辞竞险韵偶得一编奇绝辄擅美当时流声后代是以北门之风南山之雅梁园之简黄台之赋至今为作家称述尚矣及至洛阳之卧剡溪之兴灞桥之思亦皆传为故事钱塘沈履德先生隐居西湖两峰间孤高贞洁与雪同调方大雪满天皴肤粟背之际先生乃鹿中豹舄端居闭门或扶童曳杖踏遍六桥三竺时取古人诗讽咏之合唐宋元诸名家集句成诗得二百四十章联络通穿如出一人如呵一气气立于言表格备于篇中略无掇拾补凑之形非胸次包罗壮阔笔底驱走鲍谢欧苏诸公不能为此世称王荆公为集句擅长观其在钟山对雪仅题数篇未见有此噫嘻奇矣哉亦富矣哉予慕先生有袁安之节愧不能为慧可之立乃取新集命工传写使海内同好者知先生为博古传述之士而一新世人之耳目他日必有慕潜德阐幽光而剞劂以传者余实为之执殳矣
+弘治戊午仲冬望日慈溪杨子器衵于海虞官舍序毕诗部' />



More information about the Python-checkins mailing list