PEP 393 vs UTF-8 Everywhere
Marko Rauhamaa
marko at pacujo.net
Sat Jan 21 14:52:42 EST 2017
Pete Forman <petef4+usenet at gmail.com>:
> Surrogates only exist in UTF-16. They are expressly forbidden in UTF-8
> and UTF-32.
Also, they are not Unicode scalar values: the surrogate code points
U+D800–U+DFFF are reserved for UTF-16 and never denote characters.
Python shouldn't allow surrogate characters in strings.
Thus the range of code points that are available for use as
characters is U+0000–U+D7FF and U+E000–U+10FFFF (1,112,064 code
points).
<URL: https://en.wikipedia.org/wiki/Unicode>
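The figure quoted above can be checked with a few lines of arithmetic. This is only a sanity check; the variable names are made up for illustration.

```python
# Every code point in U+0000..U+10FFFF, minus the surrogate block
# U+D800..U+DFFF, leaves the code points usable as characters.
total = 0x10FFFF + 1                # all code points
surrogates = 0xDFFF - 0xD800 + 1    # size of the surrogate block
usable = total - surrogates

print(total, surrogates, usable)
```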
The Unicode Character Database is basically a table of characters
indexed using integers called 'code points'. Valid code points are in
the ranges 0 to #xD7FF inclusive or #xE000 to #x10FFFF inclusive,
which is about 1.1 million code points.
<URL: https://www.gnu.org/software/guile/docs/master/guile.html/Characters.html>
Guile does the right thing:
scheme@(guile-user)> #\xd7ff
$1 = #\153777
scheme@(guile-user)> #\xe000
$2 = #\160000
scheme@(guile-user)> #\xd812
While reading expression:
ERROR: In procedure scm_lreadr: #<unknown port>:5:8: out-of-range hex character escape: xd812
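For comparison, a rough sketch of what the same check could look like in Python. The helper name scalar_chr is hypothetical, not an existing Python API; the built-in chr() accepts surrogate code points as it stands.

```python
def scalar_chr(code_point):
    """Like chr(), but refuse surrogate code points (hypothetical helper)."""
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(
            "surrogate code point U+%04X is not a character" % code_point
        )
    # chr() itself already raises ValueError outside 0..0x10FFFF.
    return chr(code_point)


print(repr(scalar_chr(0xD7FF)))
print(repr(scalar_chr(0xE000)))
try:
    scalar_chr(0xD812)
except ValueError as exc:
    print("rejected:", exc)
```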
> py> low = '\uDC37'
That should raise a SyntaxError exception.
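What CPython 3 does today, as far as I can tell: the literal is accepted, and the trouble only shows up when the string is encoded. A small sketch; can_encode is a name invented for this example.

```python
low = '\uDC37'  # compiles without complaint: a one-element string

def can_encode(text, encoding="utf-8"):
    """Return True if text encodes with the codec's default strict handling."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

print(len(low), hex(ord(low)))
print(can_encode(low))    # the lone surrogate is refused by the UTF-8 codec
print(can_encode("abc"))
```

So the check happens at the encoding boundary rather than when the string is created, which is exactly the complaint.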
Marko