PEP 393 vs UTF-8 Everywhere

Marko Rauhamaa marko at pacujo.net
Sat Jan 21 14:52:42 EST 2017


Pete Forman <petef4+usenet at gmail.com>:

> Surrogates only exist in UTF-16. They are expressly forbidden in UTF-8
> and UTF-32.

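Also, they can never be characters in their own right: the range
U+D800–U+DFFF is reserved for UTF-16 and excluded from the code points
available for use as characters. Python shouldn't allow surrogate
characters in strings.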

   Thus the range of code points that are available for use as
   characters is U+0000–U+D7FF and U+E000–U+10FFFF (1,112,064 code
   points).

   <URL: https://en.wikipedia.org/wiki/Unicode>
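
The 1,112,064 figure is easy to check: it is the full code space minus
the 2,048 surrogates. A quick sketch in Python (the variable names are
mine, purely for illustration):

```python
# The Unicode code space runs from U+0000 to U+10FFFF inclusive.
total_code_points = 0x10FFFF + 1

# The surrogate block U+D800..U+DFFF is excluded from character use.
surrogate_count = 0xDFFF - 0xD800 + 1

usable = total_code_points - surrogate_count

# The same number, computed from the two ranges quoted above.
low_range = 0xD7FF - 0x0000 + 1       # U+0000..U+D7FF
high_range = 0x10FFFF - 0xE000 + 1    # U+E000..U+10FFFF

print(total_code_points, surrogate_count, usable)
print(low_range + high_range == usable)
```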


   The Unicode Character Database is basically a table of characters
   indexed using integers called ’code points’. Valid code points are in
   the ranges 0 to #xD7FF inclusive or #xE000 to #x10FFFF inclusive,
   which is about 1.1 million code points.

   <URL: https://www.gnu.org/software/guile/docs/master/guile.html/Characters.html>

Guile does the right thing:

   scheme@(guile-user)> #\xd7ff
   $1 = #\153777
   scheme@(guile-user)> #\xe000
   $2 = #\160000
   scheme@(guile-user)> #\xd812
   While reading expression:
   ERROR: In procedure scm_lreadr: #<unknown port>:5:8: out-of-range hex character escape: xd812

> py> low = '\uDC37'

That should raise a SyntaxError exception.
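
For comparison, here is a minimal sketch of what CPython 3 does
instead, as far as I can tell: the literal is accepted as a
one-character string, and the complaint only comes later, when the
string is encoded to UTF-8. The helper is_surrogate is a made-up name
for illustration, not anything from the standard library.

```python
def is_surrogate(ch):
    """Return True if ch is a code point in the surrogate block."""
    return 0xD800 <= ord(ch) <= 0xDFFF

# CPython 3 accepts a lone surrogate in a string literal.
low = '\uDC37'
accepted = len(low) == 1 and is_surrogate(low)

# The strict UTF-8 codec refuses to encode it.
try:
    low.encode('utf-8')
except UnicodeEncodeError as exc:
    encode_error = exc          # keep a reference; exc is unbound after the block
else:
    encode_error = None

# The 'surrogatepass' error handler lets it through explicitly.
passed = low.encode('utf-8', 'surrogatepass')

print(accepted, encode_error, passed)
```

So the check exists, but at the codec boundary rather than in the
string type itself, which is the behavior being argued against here.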


Marko


