[ python-Bugs-969415 ] CJK codecs list incomplete

Sat Jul 17 16:49:32 CEST 2004

Bugs item #969415, was opened at 2004-06-09 15:54
Message generated for change (Comment added) made by perky
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=969415&group_id=5470

Category: Documentation
Group: Python 2.4
>Status: Closed
>Resolution: Fixed
Priority: 5
Submitted By: Mike Brown (mike_j_brown)
Assigned to: Nobody/Anonymous (nobody)
Summary: CJK codecs list incomplete

Initial Comment:
http://www.python.org/dev/doc/devel/whatsnew/node7.html
states that various CJK encodings have been added, but the
list given there does not match the list on
http://www.python.org/dev/doc/devel/lib/node128.html.

In particular, missing from the latter list are all of the 
aliases with hyphens:

shift-jis, shift-jisx0213, euc-jp, euc-jisx0213, iso-2022-jp,
iso-2022-jp-1, iso-2022-jp-2, iso-2022-jp-3, iso-2022-jp-ext,
euc-kr, iso-2022-kr

Since I successfully ran codecs.lookup() tests on a few 
of the hyphenated aliases, I assume that the omission 
of the hyphenated versions in the docs is merely an 
oversight.
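
The kind of codecs.lookup() check described above can be sketched as
follows. This is an illustration rather than the test that was actually
run: it assumes a current Python 3 interpreter, where the object returned
by codecs.lookup() has a `name` attribute, and the helper `same_codec` is
a hypothetical name introduced here.

```python
import codecs

# Hyphenated aliases taken from the list in this report.
hyphenated = [
    "shift-jis", "shift-jisx0213", "euc-jp", "euc-jisx0213",
    "iso-2022-jp", "iso-2022-jp-1", "iso-2022-jp-2", "iso-2022-jp-3",
    "iso-2022-jp-ext", "euc-kr", "iso-2022-kr",
]


def same_codec(alias):
    """Return True if the hyphen and underscore spellings resolve alike.

    Hypothetical helper: compares the canonical codec name reported for
    both spellings. codecs.lookup() raises LookupError for unknown names.
    """
    underscored = alias.replace("-", "_")
    return codecs.lookup(alias).name == codecs.lookup(underscored).name


# Map each hyphenated alias to whether it matches its underscore form.
results = {alias: same_codec(alias) for alias in hyphenated}
```

If every value in `results` is true, the hyphenated names are recognized
and only the documentation is missing them.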

----------------------------------------------------------------------

>Comment By: Hye-Shik Chang (perky)
Date: 2004-07-17 23:49

Message:
Logged In: YES 
user_id=55188

For aliases that are more commonly written with hyphens than with
underscores, I changed the documented spelling from _ to -, for
consistency with the iso-8859 aliases.

Doc/lib/libcodecs.tex 1.31

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2004-06-15 05:28

Message:
Logged In: YES 
user_id=21627

Assigning to somebody else without asking for permission is
impolite, IMO; unassigning the report from anybody.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2004-06-12 21:04

Message:
Logged In: YES 
user_id=38388

I think it might be a good idea to document, at the top of that
page, how the standard search function of the encodings package
works, namely that it normalizes encoding names before doing the
lookup:

"""
        Normalization works as follows: all non-alphanumeric
        characters except the dot used for Python package names are
        collapsed and replaced with a single underscore, e.g. '  -;#'
        becomes '_'. Leading and trailing underscores are removed.

        Note that encoding names should be ASCII only; if they do use
        non-ASCII characters, these must be Latin-1 compatible.
"""

The table should then only list normalized encoding names (which
I think is already the case).
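
The normalization rule quoted above can be sketched in a few lines. This
is a minimal illustration of the rule as worded in the docstring, not the
encodings package's own code; `sketch_normalize` is a hypothetical name,
and the sketch assumes ASCII-only encoding names.

```python
import re


def sketch_normalize(encoding):
    """Illustrative only: apply the quoted normalization rule.

    Every run of characters that are not ASCII letters, digits, or dots
    is collapsed into a single underscore; leading and trailing
    underscores are then removed. Case is left unchanged.
    """
    collapsed = re.sub(r"[^0-9A-Za-z.]+", "_", encoding)
    return collapsed.strip("_")
```

Under this sketch, 'iso-2022-jp' and 'iso_2022_jp' both normalize to the
same string, which is why a table of normalized names would cover the
hyphenated spellings.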

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2004-06-12 20:59

Message:
Logged In: YES 
user_id=21627

Actually, the top of the page does already say

Notice that spelling alternatives that only differ in case
or use a hyphen instead of an underscore are also valid aliases.

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2004-06-12 20:54

Message:
Logged In: YES 
user_id=21627

It is just not feasible to list all recognized aliases. For
example, for ISO-8859-1, there are 31 trivial aliases, including
Iso_8859-1 and iSO-8859_1. For shift_jisx0213, there are
1023 trivial aliases.

The aliases column in the documentation should only list
non-trivial aliases, and for these, it should list a form
that people are most likely to encounter. So if "s-jis"
would be more common than "s_jis", this is what should be
listed. If s-JIS is even more common, this should be listed.

The top of the page should say that case in encoding names
does not matter, and that _ and - can be freely substituted.
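
The counts above can be reproduced by enumerating every spelling that
differs only in letter case or in the choice between _ and -. The
following is a sketch of that counting argument; `trivial_aliases` is a
hypothetical helper, and it says nothing about how the lookup itself is
implemented.

```python
from itertools import product


def trivial_aliases(name):
    """Return all spellings of name that differ only trivially.

    Each letter may be lower or upper case, and each '-' or '_' may be
    either separator. The spelling passed in is excluded from the result.
    """
    choices = []
    for ch in name:
        if ch.isalpha():
            choices.append((ch.lower(), ch.upper()))
        elif ch in "-_":
            choices.append(("-", "_"))
        else:
            # Digits and other characters have only one spelling.
            choices.append((ch,))
    variants = {"".join(combo) for combo in product(*choices)}
    variants.discard(name)
    return variants


iso_count = len(trivial_aliases("ISO-8859-1"))        # 2**5 - 1
sjis_count = len(trivial_aliases("shift_jisx0213"))   # 2**10 - 1
```

ISO-8859-1 has three letters and two separators, giving 2**5 spellings;
shift_jisx0213 has nine letters and one separator, giving 2**10.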

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2004-06-12 16:05

Message:
Logged In: YES 
user_id=80475

Mark, would you pronounce on this one?

----------------------------------------------------------------------

Comment By: Mike Brown (mike_j_brown)
Date: 2004-06-09 17:25

Message:
Logged In: YES 
user_id=371366

I see no reason to omit any aliases that are recognized, 
especially when the aliases in question are, more often than 
not, the IANA's preferred MIME name as shown at 
http://www.iana.org/assignments/character-sets.

I was looking in the docs to see if Python 2.4 was going to 
support 'euc-jp', and was dismayed to see 'euc_jp' and 
variants but no 'euc-jp'. I had to obtain and install 2.4a0 to 
test to find out that it was just a documentation problem.

Please consider listing all realnames and aliases.

----------------------------------------------------------------------

Comment By: Hye-Shik Chang (perky)
Date: 2004-06-09 16:10

Message:
Logged In: YES 
user_id=55188

Reopened to consider consistency with the non-CJK codecs.
All the non-CJK codecs are documented with a hyphen even if their
real name uses an underscore (e.g. iso8859-1 and iso8859_1.py).
Would changing the CJK codec and alias names to use hyphens
instead of underscores make the docs friendlier?

----------------------------------------------------------------------

Comment By: Hye-Shik Chang (perky)
Date: 2004-06-09 16:01

Message:
Logged In: YES 
user_id=55188

All hyphens are translated to underscores in encoding lookups,
so we may not need to list the hyphenated encoding names
separately.

----------------------------------------------------------------------
