[Python-bugs-list] [ python-Bugs-433882 ] UTF-8: unpaired surrogates mishandled
noreply@sourceforge.net
Sun, 17 Jun 2001 19:03:24 -0700
Bugs item #433882, was updated on 2001-06-17 04:27
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=433882&group_id=5470
Category: Unicode
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Nobody/Anonymous (nobody)
Assigned to: Nobody/Anonymous (nobody)
Summary: UTF-8: unpaired surrogates mishandled
Initial Comment:
Two bugs:
1. UTF-8 encoding of an unpaired high surrogate produces
an invalid UTF-8 byte sequence.
2. UTF-8 decoding of any unpaired surrogate raises
an exception ("illegal encoding") instead of returning the
corresponding 16-bit scalar value.
See attached file utf8bugs.py for example plus detailed
remarks.
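
For readers without the attachment, here is a minimal sketch of the
bit arithmetic in question. It is not the attached utf8bugs.py; the
helper names encode_three_byte and decode_three_byte are hypothetical,
and the code only shows what the three-byte UTF-8 bit pattern for a
lone surrogate code unit looks like, assuming one chooses to map it at
all. Whether a codec should accept such input is the open question of
this report.

```python
def encode_three_byte(code_unit: int) -> bytes:
    """Pack a 16-bit value in 0x0800..0xFFFF into the pattern
    1110xxxx 10xxxxxx 10xxxxxx (hypothetical helper, not a codec)."""
    if not 0x0800 <= code_unit <= 0xFFFF:
        raise ValueError("value is outside the three-byte range")
    return bytes([
        0xE0 | (code_unit >> 12),           # top 4 bits
        0x80 | ((code_unit >> 6) & 0x3F),   # middle 6 bits
        0x80 | (code_unit & 0x3F),          # low 6 bits
    ])


def decode_three_byte(data: bytes) -> int:
    """Inverse of encode_three_byte: recover the 16-bit value."""
    if len(data) != 3 or (data[0] & 0xF0) != 0xE0:
        raise ValueError("not a three-byte sequence")
    if any((b & 0xC0) != 0x80 for b in data[1:]):
        raise ValueError("bad continuation byte")
    return ((data[0] & 0x0F) << 12) | ((data[1] & 0x3F) << 6) | (data[2] & 0x3F)


# An unpaired high surrogate (U+D800) and an unpaired low surrogate (U+DFFF).
high = encode_three_byte(0xD800)
low = encode_three_byte(0xDFFF)
print(high.hex(), low.hex())
print(hex(decode_three_byte(high)))
```

The round trip returns the original 16-bit value, which is the result
that item 2 above asks the decoder to produce.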
----------------------------------------------------------------------
>Comment By: Martin v. Löwis (loewis)
Date: 2001-06-17 19:03
Message:
Logged In: YES
user_id=21627
I think the codec should reject unpaired surrogates both
when encoding and when decoding. I don't have a copy of
ISO 10646, but Unicode 3.1 points out:
  # ISO/IEC 10646 does not allow mapping of unpaired
  surrogates, nor U+FFFE and U+FFFF (but it does allow other
  noncharacters).
So apparently, encoding unpaired surrogates as UTF-8 is
not allowed according to ISO 10646. I think Python should
follow this rule, instead of the Unicode one.
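
The rejecting behaviour proposed here can be sketched as follows. This
assumes a current Python 3 interpreter, whose strict UTF-8 codec
refuses lone surrogates in both directions and offers the
"surrogatepass" error handler as an explicit opt-in; it does not
describe the 2001 codec that this report is about. The helper names
encodes_strictly and decodes_strictly are hypothetical.

```python
LONE_HIGH = "\ud800"            # unpaired high surrogate
LONE_BYTES = b"\xed\xa0\x80"    # its three-byte bit pattern


def encodes_strictly(text: str) -> bool:
    """True if the default (strict) UTF-8 encoder accepts the text."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


def decodes_strictly(data: bytes) -> bool:
    """True if the default (strict) UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(encodes_strictly(LONE_HIGH))    # strict encoder rejects it
print(decodes_strictly(LONE_BYTES))   # strict decoder rejects it

# Explicit opt-in for callers that really need to carry lone surrogates.
print(LONE_HIGH.encode("utf-8", "surrogatepass") == LONE_BYTES)

# A character outside the BMP is unaffected: it is one code point
# and encodes as a four-byte sequence.
print("\U00010000".encode("utf-8").hex())
```

Rejecting by default and requiring an explicit handler keeps
ill-formed data from passing silently through the codec.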
----------------------------------------------------------------------