[Python-bugs-list] [ python-Bugs-433882 ] UTF-8: unpaired surrogates mishandled

noreply@sourceforge.net noreply@sourceforge.net
Wed, 06 Feb 2002 10:11:05 -0800


Bugs item #433882, was opened at 2001-06-17 04:27
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=433882&group_id=5470

Category: Unicode
Group: None
Status: Open
Resolution: None
>Priority: 3
Submitted By: Nobody/Anonymous (nobody)
Assigned to: M.-A. Lemburg (lemburg)
Summary: UTF-8: unpaired surrogates mishandled

Initial Comment:
Two bugs:

1. UTF-8 encoding of an unpaired high surrogate produces 
an invalid UTF-8 byte sequence.

2. UTF-8 decoding of any unpaired surrogate raises
an exception ("illegal encoding") instead of returning the 
corresponding 16-bit scalar value.

See attached file utf8bugs.py for example plus detailed
remarks.
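
To make the two cases concrete, here is a minimal sketch in present-day
Python 3 (an assumption; the report predates it and the attached
utf8bugs.py is not reproduced here). The helper names are hypothetical.
It shows only what the generic 3-byte UTF-8 bit pattern yields for a lone
surrogate, and how a proper high/low pair combines into one scalar value.
The report does not say which bytes the codec actually emitted.

```python
# Hypothetical helpers for illustration; not taken from utf8bugs.py.

def utf8_three_byte(cp):
    """Apply the generic 3-byte UTF-8 bit pattern to a 16-bit value.

    No validity check is done, so surrogates pass through as-is.
    """
    assert 0x800 <= cp <= 0xFFFF
    return bytes([
        0xE0 | (cp >> 12),          # leading byte: 1110xxxx
        0x80 | ((cp >> 6) & 0x3F),  # continuation: 10xxxxxx
        0x80 | (cp & 0x3F),         # continuation: 10xxxxxx
    ])

def combine_surrogates(high, low):
    """Combine a high/low surrogate pair into one scalar value."""
    assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# A lone high surrogate pushed through the 3-byte pattern.
lone = utf8_three_byte(0xD800)

# A proper pair becomes a single scalar value with a 4-byte encoding.
paired = chr(combine_surrogates(0xD800, 0xDC00)).encode("utf-8")

print(lone.hex(), paired.hex())
```

Whether a decoder should accept the 3-byte form of a lone surrogate is
exactly what "bug 2" is asking about.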

----------------------------------------------------------------------

>Comment By: M.-A. Lemburg (lemburg)
Date: 2002-02-06 10:11

Message:
Logged In: YES 
user_id=38388

I've checked in a patch which fixes bug 1 in the report.

I am unsure about "bug 2": I think that raising an exception is better than silently accepting bogus input data.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2001-08-16 03:50

Message:
Logged In: YES 
user_id=38388

I'll look into this after I'm back from vacation on September 10.

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2001-06-17 19:03

Message:
Logged In: YES 
user_id=21627

I think the codec should reject unpaired surrogates both 
when encoding and when decoding. I don't have a copy of 
ISO 10646, but Unicode 3.1 points out:

# ISO/IEC 10646 does not allow mapping of unpaired 
surrogates, nor U+FFFE and U+FFFF (but it does allow other 
noncharacters).

So apparently, encoding unpaired surrogates as UTF-8 is 
not allowed according to ISO 10646. I think Python should 
follow this rule, instead of the Unicode one.
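
For comparison, the strict behaviour argued for here can be observed in
current Python 3, which postdates this thread. The sketch below assumes a
Python 3 interpreter and says nothing about how the codec behaved at the
time of this report. The default "strict" handler rejects a lone surrogate
in both directions, while the opt-in "surrogatepass" handler lets it
through.

```python
lone = "\ud800"  # a str holding one unpaired high surrogate

# Encoding with the default strict handler is rejected.
try:
    lone.encode("utf-8")
    encode_rejected = False
except UnicodeEncodeError:
    encode_rejected = True

# Decoding the 3-byte form of the same surrogate is rejected too.
try:
    b"\xed\xa0\x80".decode("utf-8")
    decode_rejected = False
except UnicodeDecodeError:
    decode_rejected = True

# Callers who really need lone surrogates must opt in explicitly.
raw = lone.encode("utf-8", "surrogatepass")
restored = raw.decode("utf-8", "surrogatepass")

print(encode_rejected, decode_rejected, restored == lone)
```

Making the permissive path an explicit opt-in keeps the default codec from
silently accepting or producing ill-formed UTF-8.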


----------------------------------------------------------------------
