[Python-ideas] Processing surrogates in

M.-A. Lemburg mal at egenix.com
Fri May 8 15:00:02 CEST 2015


On 08.05.2015 14:41, Chris Angelico wrote:
> On Fri, May 8, 2015 at 10:32 PM, Serhiy Storchaka <storchaka at gmail.com> wrote:
>> On 08.05.15 15:28, Chris Angelico wrote:
>>>
>>> On Fri, May 8, 2015 at 9:18 PM, Serhiy Storchaka <storchaka at gmail.com>
>>> wrote:
>>>>>
>>>>> Can you give a simple example of a Python 2 program that provides output
>>>>> that Python 3 will read as surrogates?
>>>>
>>>>
>>>>
>>>> f.write(u'𝄞'[:1].encode('utf-8'))
>>>> json.dump(u'𝄞'[:1], f)
>>>> pickle.dump(u'𝄞'[:1], f)
>>>
>>>
>>> Not for me. In my Python 2, u'𝄞'[:1] == u'𝄞'. I suppose you're
>>> talking only about the (buggy) narrow builds, in which case you don't
>>> need to use string slicing at all. But in that case, all you're doing
>>> is using a single "\uNNNN" escape code to create an unmatched
>>> surrogate.
>>
>>
>> I want to say that it is easy to unintentionally get data with an
>> encoded lone surrogate in Python 2.
> 
> Only on Windows, where the standard builds are narrow ones. (Also, how
> hard and how bad would it be to change that, and have all python.org
> installers produce wide builds?)

Not only on Windows. The default Python 2 build is a narrow build.

Most Unix distributions explicitly switch on UCS4 support,
so you usually get UCS4 builds on Unix, but the default is still
UCS2.

In Python 3.3+ this doesn't matter anymore, since Python selects
the storage type based on the string content, so you get
UCS2/UCS4 as needed on all platforms.
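
A minimal sketch of how to check which behaviour an interpreter has,
assuming Python 3.3+ for the stated results (the variable names are
purely illustrative):

```python
import sys

clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, a code point outside the BMP

# sys.maxunicode is 0x10FFFF on Python 3.3+ and on wide Python 2 builds,
# and 0xFFFF on narrow Python 2 builds.
is_wide = sys.maxunicode > 0xFFFF

# One code point on Python 3.3+; a narrow build would report 2 here,
# because the character is stored as a surrogate pair.
clef_length = len(clef)
```

On a narrow build, slicing such a string with [:1] is what yields the
lone high surrogate from the quoted example.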

All that said, it's still possible to work with lone surrogates
in Python, so Serhiy's example still applies in concept.
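
As a rough illustration under Python 3 (the names below are made up for
the example), a lone surrogate can be created with a plain escape, is
rejected by the strict UTF-8 codec, and can still be read or written
with the "surrogatepass" error handler:

```python
lone = "\ud834"  # high surrogate with no low surrogate following it

# The strict UTF-8 codec refuses to encode a lone surrogate.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The 'surrogatepass' error handler lets the surrogate through in both
# directions, which is one way to read such data written elsewhere.
raw = lone.encode("utf-8", "surrogatepass")
roundtrip = raw.decode("utf-8", "surrogatepass")
```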

And slicing surrogates is only one way to break Unicode strings.
The many combining characters and annotations offer plenty
more :-)
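
A small sketch of the combining-character case, assuming Python 3 and
using only the standard unicodedata module (the variable names are
illustrative):

```python
import unicodedata

# 'e' followed by COMBINING ACUTE ACCENT: two code points, one visible glyph.
decomposed = "e\u0301"

# Slicing by code point silently drops the accent.
sliced = decomposed[:1]

# NFC normalization merges the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)
```

Normalization only helps where a precomposed character exists; many
base-plus-combining sequences have none, so slicing can still split
them.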

-- 
Marc-Andre Lemburg
eGenix.com


