[Python-bugs-list] [ python-Bugs-765036 ] Unicode non-characters

SourceForge.net noreply@sourceforge.net
Wed, 02 Jul 2003 18:52:59 -0700


Bugs item #765036, was opened at 2003-07-03 01:52
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=765036&group_id=5470

Category: Unicode
Group: Python 2.3
Status: Open
Resolution: None
Priority: 5
Submitted By: Gnosis Software (gnosis)
Assigned to: M.-A. Lemburg (lemburg)
Summary: Unicode non-characters

Initial Comment:
The alleged codepoints unichr(0xFFFE) and
unichr(0xFFFF) are not unicode characters.  This document:

  http://www.unicode.org/charts/PDF/UFFF0.pdf

Contains:

  Noncharacters
  These codes are intended for process internal uses, but
  are not permitted for interchange.

  FFFE  <not a character>
        * the value FFFE is guaranteed not to be
          a Unicode character at all
        * may be used to detect byte order by
          contrast with FEFF which is a character
        → FEFF zero width no-break space

  FFFF  <not a character>
        * the value FFFF is guaranteed not to be
          a Unicode character at all

In particular, an XML document that contains such an
alleged unicode entity is not well-formed.
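
To make the XML point concrete, here is a minimal sketch of the Char
production from XML 1.0 (section 2.2), which stops at #xFFFD within the
Basic Multilingual Plane. The function name is_xml_char is hypothetical,
chosen only for illustration, and the sketch is written for Python 3.

```python
# Char production from XML 1.0, section 2.2:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# is_xml_char is a hypothetical helper name, not part of any library.
def is_xml_char(cp):
    """Return True if the integer code point cp matches the Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


if __name__ == "__main__":
    # FEFF and FFFD are allowed; FFFE and FFFF fall outside every range.
    for cp in (0xFEFF, 0xFFFD, 0xFFFE, 0xFFFF):
        print("U+%04X" % cp, is_xml_char(cp))
```

A document containing either code point, literally or as a character
reference, therefore cannot match the production.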

All unicode-aware versions of Python treat these
codepoints in the same manner as other codepoints, e.g.
both unichr(0xFFFE) and u'\uffff' pass without complaint.

I believe the correct behavior would be for Python to
raise an exception, or at least a warning, on access to
these spurious characters.
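
As a rough sketch of the behavior requested above, the check could look
something like the following. This is only an illustration, not a
proposed patch: the names is_noncharacter and checked_chr and the strict
flag are hypothetical, and the code is written for Python 3, where chr
takes the place of unichr. It covers only the two code points named in
this report.

```python
import warnings


def is_noncharacter(cp):
    """Return True for the two code points discussed in this report."""
    return cp in (0xFFFE, 0xFFFF)


def checked_chr(cp, strict=True):
    """Like chr(), but complain about U+FFFE and U+FFFF.

    With strict=True a ValueError is raised; otherwise a UnicodeWarning
    is issued and the character is still returned.
    """
    if is_noncharacter(cp):
        msg = "U+%04X is not a Unicode character" % cp
        if strict:
            raise ValueError(msg)
        warnings.warn(msg, UnicodeWarning)
    return chr(cp)


if __name__ == "__main__":
    print(repr(checked_chr(0xFEFF)))   # an ordinary character passes
    try:
        checked_chr(0xFFFF)
    except ValueError as exc:
        print("rejected:", exc)
```

Whether an exception or a warning is the better default is exactly the
open question; the strict flag is there only to show both options.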



----------------------------------------------------------------------
