[Patches] [ python-Patches-670715 ] Universal Unicode Codec for POSIX iconv

Wed, 02 Apr 2003 21:04:51 -0800

Patches item #670715, was opened at 2003-01-19 17:51
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=670715&group_id=5470

Category: Library (Lib)
Group: Python 2.3
Status: Closed
Resolution: Accepted
Priority: 5
Submitted By: Hye-Shik Chang (perky)
Assigned to: Martin v. Löwis (loewis)
Summary: Universal Unicode Codec for POSIX iconv

Initial Comment:
Here's a Unicode codec that uses the POSIX iconv(3) library.

Tested on these platforms and seems to work:
  FreeBSD/i386, FreeBSD/alpha, FreeBSD/ia64,
  FreeBSD/sparc64, MacOS X/ppc, HP-UX/pa-risc2

This codec implementation also supports PEP 293.

----------------------------------------------------------------------

>Comment By: Martin v. Löwis (loewis)
Date: 2003-04-03 07:04

Message:
Logged In: YES 
user_id=21627

I have reverted this patch because of problems with

setup.py 1.159
__init__.py 1.17
iconv_codec.py delete
regrtest.py 1.137
test_iconv_codecs.py delete
NEWS 1.712
Setup.dist 1.39
_iconv_codec.c delete

----------------------------------------------------------------------

Comment By: Christos Georgiou (tzot)
Date: 2003-02-04 21:10

Message:
Logged In: YES 
user_id=539787

It was a misspelling; ISO8859-7 is what I used in the module, 
the underscore came from my using it with 
str.decode("iso8859_7").
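
The underscore spelling works with str.decode because Python normalizes codec names before lookup; iconv implementations need not accept the same spellings (Walter reports below that "ISO8859_7" is not known on Linux). A minimal sketch in present-day Python 3, not the 2.3a1 build used here; the helper name same_codec is hypothetical:

```python
import codecs


def same_codec(a: str, b: str) -> bool:
    """Return True if both spellings resolve to the same Python codec."""
    return codecs.lookup(a).name == codecs.lookup(b).name


# Both spellings reach the same codec on the Python side.
print(same_codec("iso8859_7", "ISO8859-7"))

# Decoding one byte gives the same character with either spelling.
print(b"\xe1".decode("iso8859_7") == b"\xe1".decode("ISO8859-7"))
```

This only shows the Python side of the lookup; it says nothing about which names a given iconv accepts.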

o2k 348# ls -l python
-rwxr-xr-x    1 root     sys       1976320 Feb  4 21:21 python
o2k 349# ./python
Python 2.3a1 (#12, Feb  4 2003, 21:14:08) [C] on irix646
Type "help", "copyright", "credits" or "license" for more 
information.
>>> "a".decode("ascii")
Fatal Python error: can't initialize the _iconv_codec module: 
iconv_open() failed
Abort (core dumped)

I then changed it to:
    iconv_t hdl = iconv_open("UCS-2", "LATIN1");
(sys.maxunicode is 65535)
and it still fails at the same line (I ran "dbx python core").  I 
made sure that I use the freshly built _iconv_codec.so (ie no 
other exists on the system).
I added some code to show the errno, and it's 22 (EINVAL).
I ran iconv -l at the prompt, and used UCS-2 and ISO8859-1 
(LATIN1 didn't show up in the list).  Now, the code runs fine 
for str.decode and unicode.encode, and test.test_codecs and 
test.test_263 pass fine.

(BTW, if you're used to vi keystrokes, never press ESC while 
typing with IE in the "Add a comment:" text box... :( )

I then applied the diff-debug patch, and got:
>>> "a".decode("ascii")
init_iconv_codec: 0x0030
u'a'

Please note that on Irix, UCS-2-INTERNAL is not available, 
only UCS-2, so that had to be changed too.

It's an initialisation thing; I could provide a special case of 
defines for Irix, but how can you know which codecs are 
available on a platform during the configure process?  
Perhaps running iconv -l and trying to decode the output? 
iconv -l on GNU/linux shows a comma separated list,  while 
on Irix it shows pairs of available conversions, one on each 
line...
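
One alternative to parsing iconv -l would be to probe a list of candidate names and keep the first one that opens. The following is only a sketch of that idea in present-day Python 3: first_supported and irix_probe are hypothetical names, and the probe is a stand-in for a real iconv_open() test, modelled on the Irix behaviour described above.

```python
from typing import Callable, Iterable, Optional


def first_supported(candidates: Iterable[str],
                    can_open: Callable[[str], bool]) -> Optional[str]:
    """Return the first encoding name the probe accepts, or None."""
    for name in candidates:
        if can_open(name):
            return name
    return None


# Fake probe modelling the Irix results reported in this comment:
# UCS-2 and ISO8859-1 are known; ASCII, LATIN1 and UCS-2-INTERNAL are not.
# A real probe would attempt iconv_open() with the given name instead.
IRIX_KNOWN = {"UCS-2", "ISO8859-1"}


def irix_probe(name: str) -> bool:
    return name in IRIX_KNOWN


# Preferred names come first; fallbacks follow.
unicode_name = first_supported(["UCS-2-INTERNAL", "UCS-2"], irix_probe)
source_name = first_supported(["ASCII", "LATIN1", "ISO8859-1"], irix_probe)
print(unicode_name, source_name)
```

Probing at module initialisation avoids depending on the output format of iconv -l, at the cost of a few extra open attempts.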

Thanks for the direction about encodings/__init__.py, Walter.

Thanks guys, now I must find if pymalloc has changed and 
occasionally dumps core in dictresize or listextend_internal...

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2003-02-04 19:57

Message:
Logged In: YES 
user_id=89016

"ISO8859_7" is not known on Linux, can you try "ISO8859-7"
or better "ISO8859-1" or "LATIN1"?

Also I wonder whether it is a good thing to test iconv()
with the character '\x01'. Can you try diff-char.txt and see
what happens?

If all this fails, try diff-debug.txt and report what the
output is.

----------------------------------------------------------------------

Comment By: Christos Georgiou (tzot)
Date: 2003-02-04 18:02

Message:
Logged In: YES 
user_id=539787

I am afraid that SGI Irix' iconv must be added to the list of 
scary commercial implementations... at first the module did 
not compile due to two missing typecasts (see patch 
680146), but even after that, the module fails (and python 
dumps core):

Fatal Python error: can't initialize the _iconv_codec module: 
iconv_open() failed
Abort (core dumped)

(message at line 674 of the module)

This is because Irix iconv knows nothing about ASCII 
encoding...

I changed the "ASCII" argument to something 
existing, "ISO8859_7" which is my country's encoding, and 
then python dumps core with:

Fatal Python error: can't initialize the _iconv_codec module: 
mixed endianess
Abort (core dumped)

MIPS processors are big endian.

python works fine in all my programs that do not use 
str.decode or unicode.encode.
To make sure that the problem exists only in this module, I 
need to compile without the _iconv_codec .  Do I do that by 
changing setup.py?  This seems the way, but I haven't 
succeeded yet.

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2003-01-30 21:04

Message:
Logged In: YES 
user_id=89016

I checked in a version of iconvcodec-3.txt that does a byte
swapping check in the init function as
Modules/_iconv_codec.c 1.5

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2003-01-30 19:26

Message:
Logged In: YES 
user_id=89016

iconvcodec-3.txt does byteswapping under the following
conditions:
    #ifndef WORDS_BIGENDIAN
    #ifdef __GNU_LIBRARY__
Byteswapping is applied to the whole input before encoding,
and to every piece returned from iconv() when decoding.

Detecting whether byteswapping is necessary might not work
reliably with the above tests. If this is the case, a test
call to iconv() should probably be done in
Modules/_iconv_codec.c::init_iconv_codec() to determine
whether to byte swap or not.
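
The test-call idea can be sketched as follows, in present-day Python 3 rather than in the C module. This is an illustration under assumptions, not the code in _iconv_codec.c: the function names are hypothetical, the test character U+0030 is chosen only for the example, and Python's own utf-16 codecs stand in for the output of iconv().

```python
import sys


def detect_byteswap(converted: bytes, expected: int = 0x0030) -> bool:
    """Return True if 2-byte converter output must be swapped to native order.

    `converted` is the result of converting one known character with the
    (hypothetical) converter under test. Raises ValueError if the bytes
    match the expected code point in neither byte order.
    """
    if len(converted) != 2:
        raise ValueError("expected exactly one 2-byte code unit")
    if int.from_bytes(converted, sys.byteorder) == expected:
        return False
    if int.from_bytes(converted[::-1], sys.byteorder) == expected:
        return True
    raise ValueError("unexpected conversion result")


def swap16(data: bytes) -> bytes:
    """Swap the two bytes of every 16-bit unit in `data`."""
    if len(data) % 2:
        raise ValueError("length must be a multiple of 2")
    out = bytearray(data)
    out[0::2], out[1::2] = data[1::2], data[0::2]
    return bytes(out)


# Stand-in for a converter that always emits big-endian output.
big_endian_output = "0".encode("utf-16-be")
needs_swap = detect_byteswap(big_endian_output)
print("byte swap needed:", needs_swap)
```

On a little-endian host the big-endian stand-in reports that a swap is needed; on a big-endian host it does not, which is the decision the #ifdef tests above try to make at compile time.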

Another possibility might be to use utf-16/utf-32 instead of
ucs-2/ucs-4.

One test still fails: test_sane(), because it uses the
internal encoding in Python, where the real endianness is of
course unknown. The test also assumes a narrow Python build.

----------------------------------------------------------------------

Comment By: Hye-Shik Chang (perky)
Date: 2003-01-26 16:19

Message:
Logged In: YES 
user_id=55188

Thank you very much for your work.
I'm working on UCS endian detection and a transparent UCS 
wrapper for UTF-{8,16}. I'll submit a new patch in a week.
Please feel free to change my code, because I'm not familiar 
with Python coding conventions and culture. :-)

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2003-01-26 12:32

Message:
Logged In: YES 
user_id=21627

I have committed it now with minimal changes so that it
works on Linux, as

setup.py 1.138
__init__.py 1.15
iconv_codec.py 1.1
regrtest.py 1.117
NEWS 1.627

I will make further changes; please watch the CVS. If you
would like to make further changes, please submit patches
against the CVS.

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2003-01-22 02:07

Message:
Logged In: YES 
user_id=21627

Hmm. I see that Solaris does support conversion of CJK to
UTF-8. So even though we cannot convert into the internal
encoding, we could still convert through UTF-8.

Looking at /usr/lib/nls/iconv/config.iconv of HP-UX 11.11, I
see conversions from and to ucs2, for iso-8859, eucJP, sjis,
eucTW, big5, roc15, kore5, hp15CN, and many IBM code pages.

So I think the iconv codec should convert into the Python
internal representation if possible. If no encoding name for
that is known, it should convert to ucs2 (be) if possible,
or else to UTF-8; in all cases, it will then construct a
Unicode object from the resulting string.
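
A minimal sketch of this fallback order, in present-day Python 3. The tier labels and function names are hypothetical, and a narrow (16-bit) internal representation is assumed; Python's utf-16 codecs approximate UCS-2 here, which matches for characters that need no surrogates.

```python
import sys
from typing import Iterable

# Preference order described above: internal form, then big-endian
# UCS-2, then UTF-8. These labels are illustrative, not iconv names.
TIERS = ("internal", "ucs2-be", "utf-8")


def choose_tier(supported: Iterable[str]) -> str:
    """Pick the most preferred intermediate form the platform offers."""
    available = set(supported)
    for tier in TIERS:
        if tier in available:
            return tier
    raise LookupError("no usable intermediate encoding")


def to_unicode(raw: bytes, tier: str) -> str:
    """Build a Unicode string from converter output in the given tier."""
    if tier == "internal":
        # Assumption: native-order 16-bit units (a narrow build).
        native = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
        return raw.decode(native)
    if tier == "ucs2-be":
        return raw.decode("utf-16-be")
    if tier == "utf-8":
        return raw.decode("utf-8")
    raise ValueError("unknown tier: " + tier)


# A platform offering only UCS-2 (big endian) and UTF-8 conversions.
tier = choose_tier(["utf-8", "ucs2-be"])
print(tier, to_unicode(b"\x03\xb1", tier))
```

Whichever tier is chosen, the caller ends up with the same Unicode object; only the intermediate bytes differ.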

----------------------------------------------------------------------

Comment By: Hye-Shik Chang (perky)
Date: 2003-01-22 01:21

Message:
Logged In: YES 
user_id=55188

iconv implementations on commercial UNIXes are very scary. 

Solaris implementation:
  no support for CJK <-> UCS conversion.
  They support UCS[24] only for iso-8859 and UTFs.

HP-UX implementation:
  They have useless iconv. HP-UX iconv has no unicode support.

BSD implementation (Konstantin Chuguev's):
  compatible with this patch (provides ucs-[24]-internal)

GLIBC implementation:
  provides ucs-[24], which are the same as GNU iconv's
ucs-[24]-internal.
  Because ucs-[24] in the GNU/BSD implementations is always
big endian, we can't use ucs-[24] on every platform.

In conclusion, we must use 3rd party iconv on Solaris or HP-UX.
And, we need to detect whether the linked iconv is GNU/BSD
iconv or GLIBC iconv. (how?)

I'll investigate how to detect them, but ... :)

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2003-01-21 23:27

Message:
Logged In: YES 
user_id=21627

I'm quite happy with this patch, and will apply it shortly.
However, I am concerned that it is specific for GNU iconv.
IMO, there should be machinery to find out the "internal"
encoding, in case the native iconv implementation is
used instead of GNU iconv.

----------------------------------------------------------------------

Comment By: Hye-Shik Chang (perky)
Date: 2003-01-20 14:33

Message:
Logged In: YES 
user_id=55188

Thank you for comments. :->
I uploaded a new revised patch with unittest and some code
style fixes.

I saw Martin v. Loewis's iconvcodecs about a year ago.
His implementation is very neat, but its error handling
was limited by a recursive call.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2003-01-19 23:56

Message:
Logged In: YES 
user_id=38388

The patch looks good, but you'll need to add some form
of testing to underline the "seems to work" :-)

Some docs on how to use the codec would also be needed.

Martin von Loewis has written a similar codec some months ago.
Perhaps you two could get in touch and sort out the details ?!

----------------------------------------------------------------------
