Mailman 3 Draft PEP: Remove wstr from Unicode - Python-Dev

18 Jun 2020

PEP: 9999
Title: Remove wstr from Unicode
Author: Inada Naoki  <songofacandy@gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 18-Jun-2020
Python-Version: TBD

Abstract
========

PEP 393 deprecated some Unicode APIs and, for backward compatibility with
them, added the ``wchar_t *wstr`` and ``Py_ssize_t wstr_length`` members to
the Unicode implementation. [1]_

This PEP proposes removing ``wstr`` and ``wstr_length``, together with the
deprecated APIs that use these members.

Motivation
==========

Memory usage
------------

``str`` is one of the most used types in Python.  Even the simplest ASCII
strings have a ``wstr`` member, which consumes 8 bytes on 64-bit systems.
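
As a rough illustration of where those 8 bytes come from, the sketch below models a simplified string header with ``ctypes`` and compares its size with and without a ``wstr`` pointer. The field list is an assumption that only loosely follows the PEP 393 layout, and the class names are hypothetical. It is not CPython's actual structure definition.

```python
import ctypes

# Simplified, hypothetical model of an ASCII string header.
# This is an assumption for illustration, not CPython's real struct.
COMMON_FIELDS = [
    ("ob_refcnt", ctypes.c_ssize_t),
    ("ob_type", ctypes.c_void_p),
    ("length", ctypes.c_ssize_t),
    ("hash", ctypes.c_ssize_t),
    ("state", ctypes.c_uint),  # stand-in for the packed state flags
]


class HeaderWithWstr(ctypes.Structure):
    _fields_ = COMMON_FIELDS + [("wstr", ctypes.c_void_p)]


class HeaderWithoutWstr(ctypes.Structure):
    _fields_ = COMMON_FIELDS


# Difference in size between the two modelled headers, padding included.
saved = ctypes.sizeof(HeaderWithWstr) - ctypes.sizeof(HeaderWithoutWstr)
print(f"bytes saved per string in this model: {saved}")
```

In this model the saving equals the size of one pointer on the platform running the code.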

Runtime overhead
----------------

To support legacy Unicode objects created by
``PyUnicode_FromUnicode(NULL, length)``, many Unicode APIs have a
``PyUnicode_READY()`` check.

When we drop support for legacy Unicode objects, we can reduce this overhead
too.

Simplicity
----------

Supporting legacy Unicode objects makes the Unicode implementation complex.
Until we drop legacy Unicode objects, it is very hard to try other Unicode
implementations, such as the UTF-8 based implementation in PyPy.

Specification
=============

Affected APIs
--------------

The ``wstr`` and ``wstr_length`` members are removed from the Unicode
implementation.

Macros and functions to be removed:

* PyUnicode_GET_SIZE
* PyUnicode_GET_DATA_SIZE
* Py_UNICODE_WSTR_LENGTH
* PyUnicode_AS_UNICODE
* PyUnicode_AS_DATA
* PyUnicode_AsUnicode
* PyUnicode_AsUnicodeAndSize
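
This draft does not name replacements for the APIs listed above. The CPython C API documentation points to functions such as ``PyUnicode_GetLength`` and ``PyUnicode_AsWideCharString``. The sketch below calls them through ``ctypes.pythonapi`` only so that it can be run from Python. Real extension code would call them directly from C, and the helper name ``wide_copy`` is hypothetical.

```python
import ctypes

api = ctypes.pythonapi  # calls into the running interpreter with the GIL held

api.PyUnicode_GetLength.argtypes = [ctypes.py_object]
api.PyUnicode_GetLength.restype = ctypes.c_ssize_t

api.PyUnicode_AsWideCharString.argtypes = [
    ctypes.py_object,
    ctypes.POINTER(ctypes.c_ssize_t),
]
api.PyUnicode_AsWideCharString.restype = ctypes.c_void_p

api.PyMem_Free.argtypes = [ctypes.c_void_p]
api.PyMem_Free.restype = None


def wide_copy(text):
    """Return (contents, length) read from a new wchar_t copy of text."""
    size = ctypes.c_ssize_t()
    buf = api.PyUnicode_AsWideCharString(text, ctypes.byref(size))
    try:
        return ctypes.wstring_at(buf, size.value), size.value
    finally:
        # The caller owns this buffer and must free it; it is not cached
        # inside the string object the way wstr is.
        api.PyMem_Free(buf)


print(api.PyUnicode_GetLength("héllo"))
print(wide_copy("héllo"))
```

The main difference from ``PyUnicode_AsUnicode`` is ownership: the buffer belongs to the caller, so nothing has to be stored in the string object itself.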

Behaviors to be removed:

* PyUnicode_FromUnicode -- ``PyUnicode_FromUnicode(NULL, size)`` where
  ``size > 0`` raises RuntimeError instead of creating a legacy Unicode
  object. Although this API is deprecated by PEP 393, it will be kept
  when ``wstr`` is removed and will be removed later.

* PyUnicode_FromStringAndSize -- Like PyUnicode_FromUnicode,
  ``PyUnicode_FromStringAndSize(NULL, size)`` raises RuntimeError
  instead of creating a legacy Unicode object.

* PyArg_ParseTuple, PyArg_ParseTupleAndKeywords -- The 'u', 'u#', 'Z', and
  'Z#' formats will be removed.
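
One possible way to avoid the "allocate with ``NULL``, fill in later" pattern is to fill a caller-owned buffer first and create the string in a single step. The sketch below uses ``PyUnicode_FromWideChar`` through ``ctypes.pythonapi`` purely so it can be run from Python. The helper name ``build_string`` is hypothetical, and this draft does not prescribe this particular migration path.

```python
import ctypes

api = ctypes.pythonapi
api.PyUnicode_FromWideChar.argtypes = [ctypes.c_wchar_p, ctypes.c_ssize_t]
api.PyUnicode_FromWideChar.restype = ctypes.py_object


def build_string(chars):
    """Fill a caller-owned wchar_t buffer, then create the str in one call."""
    size = len(chars)
    buf = ctypes.create_unicode_buffer(size)  # plain memory, not a str object
    for index, ch in enumerate(chars):
        buf[index] = ch
    # The string object is created fully initialized, so no legacy
    # (not yet ready) object ever exists.
    return api.PyUnicode_FromWideChar(buf, size)


print(build_string("wstr"))
```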

Deprecation
-----------

All APIs to be removed should have a compiler deprecation warning
(e.g. ``Py_DEPRECATED(3.3)``) from Python 3.9. [2]_

All APIs to be changed should raise DeprecationWarning for the behavior to be
removed. Note that ``PyUnicode_FromUnicode`` has both a compiler deprecation
warning and a runtime DeprecationWarning. [3]_, [4]_
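
To find code that still reaches a deprecated path, maintainers can promote the runtime ``DeprecationWarning`` to an error while running their tests. The sketch below shows the idea with the standard ``warnings`` module. ``legacy_call`` is a hypothetical stand-in for an extension function, and its warning message is made up for this example.

```python
import warnings


def legacy_call():
    """Hypothetical stand-in for a function that hits a deprecated path."""
    warnings.warn("legacy Unicode API used", DeprecationWarning, stacklevel=2)
    return "ok"


def find_deprecated_use(func):
    """Run func with DeprecationWarning as an error; return its message or None."""
    with warnings.catch_warnings():
        warnings.simplefilter("error", DeprecationWarning)
        try:
            func()
        except DeprecationWarning as exc:
            return str(exc)
    return None


print(find_deprecated_use(legacy_call))
```

A whole test run can be checked the same way by starting the interpreter with ``-W error::DeprecationWarning``.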

Plan
-----

All deprecations will be implemented in Python 3.10.
Some deprecations will be backported to Python 3.9.

Actual removal will happen in Python 3.12.

Alternative Ideas
=================

Advanced Schedule
-----------------

Backport warnings to 3.9, and do the removal in the early development phase
of Python 3.11. If many third-party packages are broken by this change, we
will revert the change and go back to the regular schedule.

Pros: There is a chance to remove ``wstr`` in Python 3.11. Even if we need
to revert it, third-party maintainers will have more time to prepare for the
removal, and we can get feedback from the community early.

Cons: Adding warnings during the beta period may cause some confusion. Note
that we need to avoid triggering the warnings from CPython core and the
stdlib.

Use hashtable to store wstr
---------------------------

Store ``wstr`` in a hashtable instead of in the Unicode structure.

Pros: We can reduce memory usage starting from Python 3.10, and we can have
a longer timeline to remove ``wstr``.

Cons: This will increase the complexity of the Unicode implementation.
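
A toy model of this alternative, written in Python only to show the shape of the idea: wide-character buffers live in a side table keyed by object identity instead of in each string. The class and method names are hypothetical, and UTF-32 bytes stand in for a ``wchar_t`` buffer. A real implementation would live in C.

```python
class WstrSideTable:
    """Toy model: wide buffers stored outside the string objects."""

    def __init__(self):
        self._buffers = {}

    def get(self, text):
        """Return the cached wide buffer for text, creating it on first use."""
        key = id(text)
        buf = self._buffers.get(key)
        if buf is None:
            buf = text.encode("utf-32-le")  # stand-in for a wchar_t buffer
            self._buffers[key] = buf
        return buf

    def discard(self, text):
        """Must run when text is deallocated, or a reused id sees stale data."""
        self._buffers.pop(id(text), None)

    def __len__(self):
        return len(self._buffers)


table = WstrSideTable()
sample = "abc"
first = table.get(sample)
second = table.get(sample)  # served from the table, no new buffer
table.discard(sample)
```

Only strings that actually request a wide buffer pay for one, but the extra bookkeeping at deallocation time is part of the added complexity noted above.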

References
==========

A collection of URLs used as references throughout the PEP.

.. [1] PEP 393 -- Flexible String Representation
       (https://www.python.org/dev/peps/pep-0393/)

.. [2] GH-20878 -- Add Py_DEPRECATED to deprecated unicode APIs
       (https://github.com/python/cpython/pull/20878)

.. [3] GH-20933 -- Raise DeprecationWarning when creating legacy Unicode
       (https://github.com/python/cpython/pull/20933)

.. [4] GH-20927 -- Raise DeprecationWarning for getargs with 'u', 'Z'
       (https://github.com/python/cpython/pull/20927)

Copyright
=========

This document has been placed in the public domain.

-- 
Inada Naoki  <songofacandy@gmail.com>
