[Python-bugs-list] [ python-Bugs-433882 ] UTF-8: unpaired surrogates mishandled

Sun, 17 Jun 2001 19:03:24 -0700

Bugs item #433882, was updated on 2001-06-17 04:27
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=433882&group_id=5470

Category: Unicode
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Nobody/Anonymous (nobody)
Assigned to: Nobody/Anonymous (nobody)
Summary: UTF-8: unpaired surrogates mishandled

Initial Comment:
Two bugs:

1. UTF-8 encoding of an unpaired high surrogate produces 
an invalid UTF-8 byte sequence.

2. UTF-8 decoding of any unpaired surrogate raises
an exception ("illegal encoding") instead of producing the 
corresponding 16-bit scalar value.

See the attached file utf8bugs.py for an example plus detailed
remarks.
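
For readers without the attachment, here is a minimal sketch of the two
operations the report describes. It is not the contents of utf8bugs.py,
and it assumes a current CPython 3 interpreter, whose behaviour differs
from the interpreter this report was filed against. The variable names
are made up for illustration.

```python
# A lone (unpaired) high surrogate as a one-character string.
lone_high = "\ud800"

# 1. Encoding: the strict UTF-8 codec in Python 3 refuses the surrogate.
try:
    lone_high.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# The "surrogatepass" error handler emits the three-byte form instead.
raw = lone_high.encode("utf-8", "surrogatepass")

# 2. Decoding: the strict codec rejects those three bytes as well.
try:
    raw.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

# With "surrogatepass" the bytes decode back to the lone surrogate.
round_trip = raw.decode("utf-8", "surrogatepass")
```

In this sketch the strict codec rejects the surrogate in both directions,
while "surrogatepass" gives the round trip that item 2 asks for.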

----------------------------------------------------------------------

>Comment By: Martin v. Löwis (loewis)
Date: 2001-06-17 19:03

Message:
I think the codec should reject unpaired surrogates both 
when encoding and when decoding. I don't have a copy of 
ISO 10646, but Unicode 3.1 points out:

# ISO/IEC 10646 does not allow mapping of unpaired 
surrogates, nor U+FFFE and U+FFFF (but it does allow other 
noncharacters).

So apparently, encoding unpaired surrogates as UTF-8 is 
not allowed according to ISO 10646. I think Python should 
follow this rule, instead of the Unicode one.
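
A rough sketch of the kind of check this would need, assuming the input
is available as a sequence of 16-bit code units. The function name
unpaired_surrogate_indexes is hypothetical and is not part of the codec;
a codec following the rule above could raise an error whenever the
returned list is non-empty.

```python
def unpaired_surrogate_indexes(units):
    """Return the indexes of surrogate code units that lack a partner.

    `units` is a sequence of integers in the range 0..0xFFFF.
    Hypothetical helper for illustration only.
    """
    bad = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: valid only if a low surrogate follows.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # skip the well-formed pair
                continue
            bad.append(i)
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate that was not consumed as part of a pair.
            bad.append(i)
        i += 1
    return bad


paired = unpaired_surrogate_indexes([0x41, 0xD800, 0xDC00, 0x42])
lone_high = unpaired_surrogate_indexes([0xD800, 0x41])
reversed_pair = unpaired_surrogate_indexes([0xDC00, 0xD800])
```

Here paired is empty, lone_high flags index 0, and reversed_pair flags
both units, since a low surrogate followed by a high one is not a pair.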

----------------------------------------------------------------------