[ python-Feature Requests-1001895 ] Adding missing ISO 8859 codecs, especially Thai

Thu Aug 5 15:02:18 CEST 2004

Feature Requests item #1001895, was opened at 2004-08-02 11:48
Message generated for change (Comment added) made by loewis
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=355470&aid=1001895&group_id=5470

>Category: None
Group: None
>Status: Open
>Resolution: None
Priority: 5
Submitted By: Peter Jacobi (peter_jacobi)
Assigned to: M.-A. Lemburg (lemburg)
Summary: Adding missing ISO 8859 codecs, especially Thai

Initial Comment:
As the missing ISO 8859 codecs (11: Thai, 16: Romanian) 
can be automatically generated from the Unicode 
mapping files (via gencodec.py), I'd like to ask for 
their inclusion in the next version.
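
To illustrate the idea of deriving a codec table from a Unicode mapping
file, here is a minimal sketch. It is not gencodec.py itself; the
function name parse_mapping and the SAMPLE text are made up for this
example, and the sample lines follow the tab-separated layout of the
CP874.TXT excerpt quoted elsewhere in this report.

```python
def parse_mapping(text):
    """Parse 'byte<TAB>code point<TAB>#comment' lines into {byte: character}."""
    table = {}
    for line in text.splitlines():
        # Drop the trailing comment and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            # A byte listed without a code point is treated as undefined.
            continue
        table[int(fields[0], 16)] = chr(int(fields[1], 16))
    return table


# Hypothetical sample input in the mapping-file layout.
SAMPLE = (
    "# sample mapping lines\n"
    "0x80\t0x20AC\t#EURO SIGN\n"
    "0x81\t      \t#UNDEFINED\n"
    "0x85\t0x2026\t#HORIZONTAL ELLIPSIS\n"
)

if __name__ == "__main__":
    for byte, char in sorted(parse_mapping(SAMPLE).items()):
        print(f"0x{byte:02x} -> U+{ord(char):04X}")
```

A real generator would also have to emit the encoding direction and the
codec boilerplate; this sketch only covers reading the table.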

----------------------------------------------------------------------

>Comment By: Martin v. Löwis (loewis)
Date: 2004-08-05 15:02

Message:
Logged In: YES 
user_id=21627

Code page 874 differs from the 8859 one in the definition of
\x80..\x9f. 

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP874.TXT

says

0x80	0x20AC	#EURO SIGN
0x85	0x2026	#HORIZONTAL ELLIPSIS
0x91	0x2018	#LEFT SINGLE QUOTATION MARK
0x92	0x2019	#RIGHT SINGLE QUOTATION MARK
0x93	0x201C	#LEFT DOUBLE QUOTATION MARK
0x94	0x201D	#RIGHT DOUBLE QUOTATION MARK
0x95	0x2022	#BULLET
0x96	0x2013	#EN DASH
0x97	0x2014	#EM DASH

I assume the Thai version of Windows is likely to generate
"windows-874". Debian offers the th_TH locale, with TIS-620,
and a th_TH.UTF-8 locale (i.e. no ISO-8859-11 one).

If ISO 8859-11 is understood as published by ISO (i.e. no
control characters at all), then CP 874 is a strict
extension (adding C0, plus the characters above).

Google gives these frequencies:
tis-620 16,200
windows-874 7,290
iso-8859-11  5,880
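
The \x80..\x9f difference described above can be checked directly
against the codecs. This is a small sketch, assuming a Python 3
interpreter that ships both the cp874 and iso8859_11 codecs; the helper
names decode_byte and c1_differences are invented for the example.

```python
def decode_byte(value, encoding):
    """Return the character for one byte, or None if the codec leaves it undefined."""
    try:
        return bytes([value]).decode(encoding)
    except UnicodeDecodeError:
        return None


def c1_differences(first="cp874", second="iso8859_11"):
    """Collect the bytes in 0x80..0x9f that the two codecs decode differently."""
    result = {}
    for value in range(0x80, 0xA0):
        a = decode_byte(value, first)
        b = decode_byte(value, second)
        if a != b:
            result[value] = (a, b)
    return result


if __name__ == "__main__":
    for value, (a, b) in sorted(c1_differences().items()):
        print(f"0x{value:02x}  cp874={a!r}  iso8859_11={b!r}")
```

Bytes that a codec leaves undefined show up as None, so a byte that is
a control character in one codec and unassigned in the other is also
reported as a difference.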

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2004-08-05 14:44

Message:
Logged In: YES 
user_id=38388

Checking in Misc/NEWS;
/cvsroot/python/python/dist/src/Misc/NEWS,v  <--  NEWS
new revision: 1.1073; previous revision: 1.1072
done
Checking in Lib/encodings/aliases.py;
/cvsroot/python/python/dist/src/Lib/encodings/aliases.py,v 
<--  aliases.py
new revision: 1.27; previous revision: 1.26
done
RCS file:
/cvsroot/python/python/dist/src/Lib/encodings/iso8859_11.py,v
done
Checking in Lib/encodings/iso8859_11.py;
/cvsroot/python/python/dist/src/Lib/encodings/iso8859_11.py,v
 <--  iso8859_11.py
initial revision: 1.1
done
RCS file:
/cvsroot/python/python/dist/src/Lib/encodings/iso8859_16.py,v
done
Checking in Lib/encodings/iso8859_16.py;
/cvsroot/python/python/dist/src/Lib/encodings/iso8859_16.py,v
 <--  iso8859_16.py
initial revision: 1.1
done
RCS file:
/cvsroot/python/python/dist/src/Lib/encodings/tis_620.py,v
done
Checking in Lib/encodings/tis_620.py;
/cvsroot/python/python/dist/src/Lib/encodings/tis_620.py,v 
<--  tis_620.py
initial revision: 1.1
done

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2004-08-05 14:33

Message:
Logged In: YES 
user_id=38388

Never mind. I'll also add a proper tis_620.py codec.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2004-08-05 14:15

Message:
Logged In: YES 
user_id=38388

I found these references for iso-8859-11:

iso_8859-11:1992 (try searching for this in Google :-)
http://mnogosearch.kn.vutbr.cz/Download/snapshot/mnogosearch32/src/uconv-alias.c

windows-874
http://www.memecode.com/site/ver.php?id=94

thai
windows-874
tis-620
iso-8859-11:2001
http://de.wikipedia.org/wiki/ISO_8859-11

The last URL suggests that iso-8859-11 is the same as tis-620,
but only the "basis" for windows-874. It also quotes the
year 2001 as the last revision of the mapping, which
corresponds to the header of the Unicode mapping file.

I think it's safe to add the alias for tis-620 even though
the iso mapping has one more character. According to Google
that encoding name is much more popular than the iso one.
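
The "one more character" remark can be examined by comparing which
byte values each codec accepts. The following is a sketch only,
assuming a Python 3 interpreter in which both iso8859_11 and tis_620
are available as codec names; defined_bytes and extra_bytes are
illustrative helper names, not part of any library.

```python
def defined_bytes(encoding):
    """Return the set of single-byte values the codec is able to decode."""
    defined = set()
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            continue  # undefined position in this codec
        defined.add(value)
    return defined


def extra_bytes(larger="iso8859_11", smaller="tis_620"):
    """List the bytes decodable by `larger` but not by `smaller`."""
    return sorted(defined_bytes(larger) - defined_bytes(smaller))


if __name__ == "__main__":
    print([hex(value) for value in extra_bytes()])
```

If the two codecs really differ by a single position, the printed list
has exactly one entry.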

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2004-08-05 13:41

Message:
Logged In: YES 
user_id=21627

The unfortunate problem is that ISO-8859-11 is not an
IANA-registered character set. For ISO-8859-16,

http://www.iana.org/assignments/character-sets

lists:

Name: ISO-8859-16
MIBenum: 112
Source: ISO
Alias: iso-ir-226
Alias: ISO_8859-16:2001
Alias: ISO_8859-16
Alias: latin10
Alias: l10 

I believe ISO-8859-11 does not have any aliases. Some people
may claim TIS-620 is an alias, but it is not (as it does not
contain \xa0).
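
Whether a given interpreter resolves the IANA names listed above to the
same codec can be tested with codecs.lookup. This is a hedged sketch:
it assumes a Python version that already includes the iso8859_16 codec,
and the resolve helper is made up for the example. Names the
interpreter does not know are reported as None rather than guessed.

```python
import codecs

# Name and aliases copied from the IANA entry for ISO-8859-16.
IANA_NAMES = [
    "ISO-8859-16",
    "iso-ir-226",
    "ISO_8859-16:2001",
    "ISO_8859-16",
    "latin10",
    "l10",
]


def resolve(name):
    """Return the canonical codec name for `name`, or None if it is unknown."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


if __name__ == "__main__":
    for name in IANA_NAMES:
        print(f"{name:20} -> {resolve(name)}")
```

The same loop can be pointed at candidate Thai names such as tis-620 or
iso-8859-11 to see how they are treated.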

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2004-08-05 13:15

Message:
Logged In: YES 
user_id=38388

Thank you. 

Please also provide suitable aliases (I couldn't find any on
the IANA site), then I'll add them to Python 2.4.

----------------------------------------------------------------------

Comment By: Peter Jacobi (peter_jacobi)
Date: 2004-08-04 00:58

Message:
Logged In: YES 
user_id=845149

Attached is the output of gencodec.py for ISO-8859-11 and 
ISO-8859-16 and, for reference, the original mapping files.

Peter

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2004-08-03 16:34

Message:
Logged In: YES 
user_id=38388

Peter, could you attach the generated codecs to this report?

Thanks.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2004-08-02 13:14

Message:
Logged In: YES 
user_id=38388

Martin, I think it's a good idea to add the codecs for
completeness. 

We should probably also review the mapping files posted on
the unicode.org site every now and then and update the
codecs in Python accordingly. Sticking to the Unicode
Consortium's view of things is a good way to ensure
compatibility with other applications, IMO.

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2004-08-02 12:30

Message:
Logged In: YES 
user_id=21627

Marc-Andre, should we add these?

----------------------------------------------------------------------

Comment By: Peter Jacobi (peter_jacobi)
Date: 2004-08-02 12:16

Message:
Logged In: YES 
user_id=845149

In a thread on news://comp.lang.python I was asked by 
Martin v. Löwis to provide evidence on the correctness of the 
ISO 8859-11 Unicode mapping file, as found on 
ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-11.TXT 
(due to the disclaimer boilerplate in these files).

So far I can provide these three points:
a) ISO 8859-n vs ISO-8859-n
If the information at
http://en.wikipedia.org/wiki/ISO_8859-1#ISO_8859-1_vs_ISO-8859-1
is correct, Python 8859-n 
codecs do implement the ISO standard charsets ISO 8859-n 
in the specialized IANA forms ISO-8859-n (and in agreement 
with the Unicode mapping files). So any difficult C0/C1 
wording in the original ISO standard can be disregarded.

b) libiconv ISO 8859-11
The implementation by Bruno Haible in libiconv does agree 
with the Unicode mapping file:
http://cvs.sourceforge.net/viewcvs.py/libiconv/libiconv/lib/

c) IBM ICU4C
The implementation in ICU4C does agree with the Unicode 
mapping file:
http://oss.software.ibm.com/cvs/icu/charset/data/ucm/

----------------------------------------------------------------------
