[New-bugs-announce] [issue24036] GB2312 codec is using a wrong covert table
Ma Lin
report at bugs.python.org
Thu Apr 23 11:29:41 CEST 2015
New submission from Ma Lin:
While trying to optimize the GB2312/GBK/GB18030-2000 codecs (three encodings that are widely used in China), I found a bug.
The relation among the three encodings should be: GB2312 ⊂ GBK ⊂ GB18030-2000.
However, in Python's implementation: GB2312 ⊄ GBK ⊂ GB18030-2000.
GBK should be backward compatible with GB2312, but in Python's implementation it is not.
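One way to check the subset claim is to decode every two-byte sequence in the GB2312 range with both codecs and list the sequences where the results differ. The sketch below is only an illustration and is not part of the patch; the helper name gb2312_gbk_mismatches is made up for this example, and what it reports depends on the conversion tables shipped with the interpreter that runs it.

```python
def gb2312_gbk_mismatches():
    """Return (bytes, gb2312_char, gbk_char) for pairs the two codecs decode differently."""
    mismatches = []
    # GB2312 double-byte sequences use lead and trail bytes in 0xA1..0xFE.
    for lead in range(0xA1, 0xFF):
        for trail in range(0xA1, 0xFF):
            raw = bytes([lead, trail])
            try:
                as_gb2312 = raw.decode("gb2312")
            except UnicodeDecodeError:
                continue  # not assigned in GB2312, so nothing to compare
            try:
                as_gbk = raw.decode("gbk")
            except UnicodeDecodeError:
                as_gbk = None  # GBK rejects a sequence that GB2312 accepts
            if as_gb2312 != as_gbk:
                mismatches.append((raw, as_gb2312, as_gbk))
    return mismatches


if __name__ == "__main__":
    # ascii() keeps the output printable on consoles with a limited encoding.
    for raw, as_gb2312, as_gbk in gb2312_gbk_mismatches():
        print(raw.hex().upper(), ascii(as_gb2312), ascii(as_gbk))
```

If GBK were a strict superset of GB2312 in the implementation, the returned list would be empty.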
----
I dug into this and found that Python's GB2312 codec is using a wrong conversion table.
The file /Modules/cjkcodecs/_codecs_cn.c contains a comment block, which I paste here:
/* GBK and GB2312 map differently in few code points that are listed below:
 *
 *         gb2312                        gbk
 * A1A4    U+30FB KATAKANA MIDDLE DOT    U+00B7 MIDDLE DOT
 * A1AA    U+2015 HORIZONTAL BAR         U+2014 EM DASH
 * A844    undefined                     U+2015 HORIZONTAL BAR
 */
In fact, the second column (the GB2312 column) is wrong and should be deleted.
The four Unicode code points involved are:
U+30FB ・ KATAKANA MIDDLE DOT
U+00B7 · MIDDLE DOT
U+2015 ― HORIZONTAL BAR
U+2014 — EM DASH
So, the GB2312 codec decodes b'\xA1\xA4' to U+30FB.
U+30FB is a Japanese symbol, but it looks quite similar to U+00B7.
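To see what each codec does with the two affected byte sequences, the following sketch decodes them with both codecs and reports the code point and its Unicode name. The function name decode_both is hypothetical, and the GB2312 result depends on whether the running interpreter still carries the table described above.

```python
import unicodedata


def decode_both(raw):
    """Decode `raw` with the gb2312 and gbk codecs and describe each result."""
    described = {}
    for codec in ("gb2312", "gbk"):
        ch = raw.decode(codec)
        described[codec] = f"U+{ord(ch):04X} {unicodedata.name(ch)}"
    return described


for raw in (b"\xa1\xa4", b"\xa1\xaa"):
    print(raw.hex().upper(), decode_both(raw))
```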
I searched Google for "GB2312 Unicode Table"; both a right version and a wrong version circulate on the Internet, and unfortunately we are using the wrong version.
libiconv-1.14 is also using the wrong version.
----
Here is an example of the bad behavior.
Encoding U+30FB to bytes with the GBK encoder raises UnicodeEncodeError, because U+30FB is not in GBK.
In the Simplified Chinese version of Microsoft Windows, the console's default encoding is GBK[1].
If b'\xA1\xA4' is decoded with the GB2312 decoder and the resulting U+30FB is then printed to the console, UnicodeEncodeError is raised.
Since the dash is a common character, this bug is annoying.
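The failure described above can be reproduced without a Windows console by encoding to GBK explicitly. This is a rough sketch under that assumption; reencode_for_console is a hypothetical helper that stands in for what printing to a GBK console would do, and the backslashreplace fallback is only there to make the failing character visible.

```python
def reencode_for_console(raw, source="gb2312", console="gbk"):
    """Decode `raw` with `source`, then encode it the way a `console` codec would."""
    text = raw.decode(source)
    try:
        return text.encode(console)
    except UnicodeEncodeError as exc:
        # Report the offending character, then fall back to an escaped form.
        bad = exc.object[exc.start:exc.end]
        print(f"cannot encode {ascii(bad)} as {console}: {exc.reason}")
        return text.encode(console, errors="backslashreplace")


# U+30FB itself is rejected by the GBK encoder.
try:
    "\u30fb".encode("gbk")
except UnicodeEncodeError as exc:
    print("UnicodeEncodeError:", exc.reason)

print(reencode_for_console(b"\xa1\xa4"))
```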
----
If we fix this, I don't know how many disasters will happen.
However, if we don't fix this, it's a bug.
I have already made a patch, but I think we need a discussion: should we fix this?
-----------------------
Note:
[1] In fact, the console's default encoding is cp936. cp936 is almost the same as GBK, but not entirely. Using GBK here is not a problem.
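As a side check, CPython registers cp936 as an alias of its gbk codec, so within Python the two names resolve to the same tables. The snippet below only looks up the codec names; it says nothing about how the Windows code page itself differs from the GBK specification.

```python
import codecs

# Both lookups should resolve to the same codec entry.
print(codecs.lookup("cp936").name)
print(codecs.lookup("gbk").name)
```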
----------
components: Unicode
files: fixgb2312.patch
keywords: patch
messages: 241858
nosy: Ma Lin, ezio.melotti, haypo
priority: normal
severity: normal
status: open
title: GB2312 codec is using a wrong covert table
type: behavior
versions: Python 3.5, Python 3.6
Added file: http://bugs.python.org/file39182/fixgb2312.patch
_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue24036>
_______________________________________