[New-bugs-announce] [issue45105] Incorrect handling of unicode character \U00010900

Sun Sep 5 07:12:09 EDT 2021

New submission from Max Bachmann <kontakt at maxbachmann.de>:

I noticed unexpected behavior with the Unicode character \U00010900 when it is inserted into a string as a literal character.
Here is the result in the Python console for both 3.6 and 3.9:
```
>>> s = '0𐤀00'
>>> s
'0𐤀00'
>>> ls = list(s)
>>> ls
['0', '𐤀', '0', '0']
>>> s[0]
'0'
>>> s[1]
'𐤀'
>>> s[2]
'0'
>>> s[3]
'0'
>>> ls[0]
'0'
>>> ls[1]
'𐤀'
>>> ls[2]
'0'
>>> ls[3]
'0'
```

It appears that, for some reason, in this specific case the character is actually stored in a different position than the one shown when printing the complete string. Note that the string already behaves strangely when selecting it in the console: selecting the special character immediately highlights the last 3 characters (probably because it already considers this character to be in the second position).
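
One way to check the stored order independently of how the console renders the string is to print each index together with its code point and bidirectional class. This is only a minimal sketch, not part of the original session: the helper name `dump_code_points` is hypothetical, and the string is rebuilt with an escape on the assumption that it matches the literal used above.

```python
import unicodedata

def dump_code_points(s):
    """Return (index, code point, bidi class) for every character in s.

    Hypothetical helper for illustration only.
    """
    return [(i, "U+%04X" % ord(ch), unicodedata.bidirectional(ch))
            for i, ch in enumerate(s)]

# Assumed equivalent of the literal string in the session above, written
# with an escape so that display reordering cannot affect the source text.
s = "0\U00010900" + "00"

for row in dump_code_points(s):
    print(row)
```

If the class reported for U+10900 is a right-to-left one, the visual order shown by a terminal may differ from the index order, which could be relevant to the selection behavior described above.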

The same behavior does not occur when using the Unicode code point escape directly:
```
>>> s='000\U00010900'
>>> s
'000𐤀'
>>> s[0]
'0'
>>> s[1]
'0'
>>> s[2]
'0'
>>> s[3]
'𐤀'
```

This was tested using the following Python versions:
```
Python 3.6.0 (default, Dec 29 2020, 02:18:14) 
[GCC 10.2.1 20201125 (Red Hat 10.2.1-9)] on linux

Python 3.9.6 (default, Jul 16 2021, 00:00:00) 
[GCC 11.1.1 20210531 (Red Hat 11.1.1-3)] on linux
```
on Fedora 34

----------
components: Unicode
messages: 401078
nosy: ezio.melotti, maxbachmann, vstinner
priority: normal
severity: normal
status: open
title: Incorrect handling of unicode character \U00010900
type: behavior
versions: Python 3.6, Python 3.9

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue45105>
_______________________________________