[Python-Dev] Optimize Unicode strings in Python 3.3

Fri May 4 01:45:15 CEST 2012

Hi,

Several people are working on improving the performance of Unicode
strings in Python 3.3. This version is very different from
Python 3.2 because of PEP 393, and it is still unclear to me what
the best way to create a new Unicode string is.

There are different approaches:

 * Use the legacy (Py_UNICODE) API; PyUnicode_READY() then converts the
result to the canonical form. The CJK codecs still use this API.
 * Use a Py_UCS4 buffer and then convert to the canonical form (ASCII,
UCS1 or UCS2). This is the approach taken by io.StringIO. io.StringIO
is used not only to write but also to read, so a Py_UCS4 buffer is a
good compromise.
 * PyAccu API: optimized version of chunks=[]; for ...: ...
chunks.append(text); return ''.join(chunks).
 * Two steps: compute the length and maximum character of the output
string, allocate the output string and then write the characters.
str%args used to work this way.
 * Optimistic approach. Start with an ASCII buffer, then enlarge and
widen the buffer (to UCS2 and then UCS4) when new characters are written.
Approach used by the UTF-8 decoder and by str%args since today.
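
To make the optimistic approach concrete, here is a rough Python model of
it. It is only a sketch: the class name OptimisticWriter and its methods
are hypothetical, a bytearray stands in for the realloc()-ed C buffer, and
the model also passes through the one-byte Latin-1 (UCS1) kind from
PEP 393. It is not the actual C implementation.

```python
class OptimisticWriter:
    """Toy model: start narrow (ASCII), widen only when a character needs it."""

    # (largest code point, bytes per character) for each storage kind.
    KINDS = [(0x7F, 1), (0xFF, 1), (0xFFFF, 2), (0x10FFFF, 4)]

    def __init__(self):
        self.kind = 0            # index into KINDS; 0 means ASCII
        self.buf = bytearray()   # stands in for the realloc()-ed buffer
        self.length = 0          # number of characters written
        self.widen_count = 0     # how often the kind had to change

    def _width(self):
        return self.KINDS[self.kind][1]

    def _widen(self, new_kind):
        old_width = self._width()
        new_width = self.KINDS[new_kind][1]
        if new_width != old_width:
            # Re-pack every character already written at the new width.
            widened = bytearray()
            for i in range(0, len(self.buf), old_width):
                cp = int.from_bytes(self.buf[i:i + old_width], "little")
                widened += cp.to_bytes(new_width, "little")
            self.buf = widened
        self.kind = new_kind
        self.widen_count += 1

    def write(self, text):
        for ch in text:
            cp = ord(ch)
            if cp > self.KINDS[self.kind][0]:
                # Pick the narrowest kind able to hold this character.
                new_kind = next(k for k, (limit, _) in enumerate(self.KINDS)
                                if cp <= limit)
                self._widen(new_kind)
            self.buf += cp.to_bytes(self._width(), "little")
            self.length += 1

    def finish(self):
        # Decode the buffer back into a str to check the model.
        width = self._width()
        return "".join(chr(int.from_bytes(self.buf[i:i + width], "little"))
                       for i in range(0, len(self.buf), width))
```

For pure ASCII input the buffer is never re-packed, which is the case the
optimistic approach bets on; each widening costs one pass over what has
been written so far.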

The optimistic approach uses realloc() to resize the string. It is
faster than the PyAccu approach (at least for short ASCII strings),
maybe because it avoids creating temporary short strings.
realloc() appears to be efficient on Linux and Windows (at least
Windows 7).
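
For reference, the two shapes being compared look roughly like this at the
Python level. This is only an illustration: join_chunks and
write_to_buffer are hypothetical names, and io.StringIO merely stands in
for a single growing buffer; it is not the realloc()-based writer itself.

```python
import io


def join_chunks(pieces):
    # Accumulator pattern: keep every piece, join once at the end.
    chunks = []
    for piece in pieces:
        chunks.append(piece)
    return "".join(chunks)


def write_to_buffer(pieces):
    # Single-buffer pattern: append each piece to one growing buffer.
    out = io.StringIO()
    for piece in pieces:
        out.write(piece)
    return out.getvalue()
```

Both functions return the same string; they differ only in whether the
pieces are kept alive as separate objects until the final join.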

Various notes:
 * PyUnicode_READ() is slower than reading a Py_UNICODE array.
 * Some decoders unroll the main loop to process 4 or 8 bytes (on a
32-bit or 64-bit CPU) at each step.
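
The word-at-a-time idea from the last note can be sketched in Python as
follows. The function name ascii_prefix_length is hypothetical and the
real decoders do this in C on aligned machine words; the sketch only shows
the mask test that lets a loop skip 4 or 8 ASCII bytes per step.

```python
import sys

# Bytes handled per step: 8 on a 64-bit build, 4 on a 32-bit build.
WORD = 8 if sys.maxsize > 2**32 else 4
# Mask with the high bit (0x80) set in every byte of the word.
HIGH_BITS = int.from_bytes(b"\x80" * WORD, "little")


def ascii_prefix_length(data):
    """Return how many leading bytes of `data` are pure ASCII."""
    i = 0
    n = len(data)
    # Fast path: test WORD bytes at once; any high bit set means non-ASCII.
    while i + WORD <= n:
        word = int.from_bytes(data[i:i + WORD], "little")
        if word & HIGH_BITS:
            break
        i += WORD
    # Slow path: finish byte by byte (the tail, or the word that
    # contains the first non-ASCII byte).
    while i < n and data[i] < 0x80:
        i += 1
    return i
```

A decoder can copy the ASCII prefix straight into the output and fall back
to the general loop only from the first non-ASCII byte onwards.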

I would like to hear about any other tricks you know to optimize Unicode
strings in Python, or whether you are interested in working on this topic.

There are open issues related to optimizing Unicode:

#11313: Speed up default encode()/decode()
#12807: Optimization/refactoring for {bytearray, bytes, unicode}.strip()
#14419: Faster ascii decoding
#14624: Faster utf-16 decoder
#14625: Faster utf-32 decoder
#14654: More fast utf-8 decoding
#14716: Use unicode_writer API for str.format()

Victor