[Python-bugs-list] [Bug #116285] Unicode encoders don't report errors properly

Wed, 03 Jan 2001 13:27:08 -0800

Bug #116285, was updated on 2000-Oct-06 17:32
Here is a current snapshot of the bug.

Project: Python
Category: Unicode
Status: Closed
Resolution: Remind
Bug Group: None
Priority: 3
Submitted by: loewis
Assigned to: lemburg
Summary: Unicode encoders don't report errors properly

Details: In current CVS, u"\366".encode("koi8-r") gives '\366'. This is
incorrect: koi8-r does not support LATIN SMALL LETTER O WITH DIAERESIS,
so it should raise a UnicodeError instead.
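
The expected behaviour can be checked with a short script. This is only a
sketch and assumes a current Python 3 interpreter rather than the CVS tree
the report refers to. There, u"\366" is written "\xf6", and the error
raised is UnicodeEncodeError, a subclass of UnicodeError. The variable
names are illustrative.

```python
# Sketch: assumes a modern Python 3 interpreter, not the 2.0-era CVS tree.
char = "\xf6"  # LATIN SMALL LETTER O WITH DIAERESIS (u"\366" in the report)

try:
    result = char.encode("koi8-r")
except UnicodeError as exc:
    # Expected: koi8-r has no code for this character.
    outcome = "error"
    print("raised", type(exc).__name__)
else:
    # The behaviour reported as a bug: the character passes through.
    outcome = "encoded"
    print("encoded as", repr(result))
```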

Follow-Ups:

Date: 2001-Jan-03 13:27
By: lemburg

Comment:
Patch checked in.
-------------------------------------------------------

Date: 2000-Dec-22 06:06
By: loewis

Comment:
A fix for that bug is in
http://sourceforge.net/patch/?func=detailpatch&patch_id=103002&group_id=5470

Set group back to None since we are in the 2.1 cycle now.
-------------------------------------------------------

Date: 2000-Oct-12 13:26
By: lemburg

Comment:
Reopened so that the bug doesn't get forgotten in 2.1. Instead
of closing the bug, I will set the priority to 3, which should signal
"not vital for the Python 2.0 release".
-------------------------------------------------------

Date: 2000-Oct-12 11:24
By: lemburg

Comment:
Closed for 2.0. This request should be reopened for the 2.1 cycle.

As Martin pointed out in private mail, the situation with correct
error handling is not all that bad: the encoders default to Latin-1
mappings (i.e. 1-1) when converting Unicode to the encoding
if no mapping is given for the character.

The fix would be to add explicit encoding mappings for all supplied
standard codecs which map all Latin-1 characters which do not
have a corresponding character in the encoding to None. This will
then cause the codec to raise an error saying that the mapping is
undefined.
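
A rough sketch of that fix, in present-day Python 3 syntax, might look
like the following. The three-entry table and the toy_encode helper are
hypothetical stand-ins for illustration; they are not the real codec
tables or the C-level charmap encoder. The pass-through for missing
entries follows the behaviour described in this report.

```python
# Toy table, not the real codec data: byte value -> Unicode code point.
decoding_map = {
    0x41: 0x0041,  # LATIN CAPITAL LETTER A
    0xC1: 0x0430,  # CYRILLIC SMALL LETTER A
    0xF6: 0x0416,  # CYRILLIC CAPITAL LETTER ZHE
}

# Reverse the decoding map to get the encoding map.
encoding_map = {cp: byte for byte, cp in decoding_map.items()}

# Proposed fix: every Latin-1 code point without an entry maps to None.
for cp in range(256):
    encoding_map.setdefault(cp, None)


def toy_encode(text, mapping):
    """Hypothetical helper: encode text using a code point -> byte map."""
    out = bytearray()
    for ch in text:
        # A missing entry falls through as Latin-1 (the old behaviour).
        target = mapping.get(ord(ch), ord(ch))
        if target is None or target > 0xFF:
            raise UnicodeError("character maps to <undefined>")
        out.append(target)
    return bytes(out)


print(toy_encode("\u0416", encoding_map))  # mapped Cyrillic letter
try:
    toy_encode("\xf6", encoding_map)       # explicit None entry
except UnicodeError as exc:
    print("raised:", exc)
```

With the None entries in place, the unmapped Latin-1 character is rejected
instead of being copied through unchanged.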
-------------------------------------------------------

Date: 2000-Oct-09 01:14
By: lemburg

Comment:
Note that this is due to the way the character mapping codec
works: if the dictionary doesn't include a mapping for a certain
character, it simply copies that character without raising an
error.

All standard codecs in Python 2.0 which use the generic character
codec only contain explicit mappings from the encoding to Unicode
(for the decoding part). When encoding from Unicode to the encoding,
the decoding map is simply reversed.
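
The effect can be illustrated with a toy model in present-day Python 3
syntax. Everything here is an assumption made for illustration: the
three-entry table is not the real koi8-r data, and toy_encode is a
hypothetical helper that only mimics the copy-through behaviour described
above. How the original code treated unmapped characters above 0xFF is
not stated here, so the sketch simply rejects them.

```python
# Toy decoding table, not the real codec data: byte -> Unicode code point.
decoding_map = {
    0x41: 0x0041,  # LATIN CAPITAL LETTER A
    0xC1: 0x0430,  # CYRILLIC SMALL LETTER A
    0xF6: 0x0416,  # CYRILLIC CAPITAL LETTER ZHE
}

# The encoding map is just the decoding map reversed.
encoding_map = {cp: byte for byte, cp in decoding_map.items()}


def toy_encode(text, mapping):
    """Hypothetical helper: copy unmapped characters through unchanged."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp in mapping:
            out.append(mapping[cp])
        elif cp <= 0xFF:
            # No entry: the character is copied as if it were Latin-1.
            out.append(cp)
        else:
            # Assumption of this sketch: larger code points are rejected.
            raise UnicodeError("ordinal not in range(256)")
    return bytes(out)


zhe = toy_encode("\u0416", encoding_map)    # mapped Cyrillic letter
o_umlaut = toy_encode("\xf6", encoding_map)  # unmapped, copied through
print(zhe, o_umlaut, zhe == o_umlaut)
```

Both characters end up as the same byte, which is why the encoder's output
for the unmapped character is wrong rather than merely lenient.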

To produce correct error output in all possible cases, the reverse
mapping would have to include all Unicode characters which cannot be
mapped to an encoding character (and map these to None). This is not
feasible, so the "bug" is hard to fix... certainly not for Python 2.0.

I'm setting the bug report to "Feature Request" meaning that it should be
reopened for the 2.1 cycle.
-------------------------------------------------------

For detailed info, follow this link:
http://sourceforge.net/bugs/?func=detailbug&bug_id=116285&group_id=5470