[New-bugs-announce] [issue41377] memoryview of str (unicode)

Thu Jul 23 16:46:27 EDT 2020

New submission from jakirkham <jakirkham at gmail.com>:

When working with lower-level C/C++ code, the Python Buffer Protocol[1] has been immensely useful, as it allows common Python `bytes`-like objects to expose their underlying memory buffer as a pointer that C/C++ code can work with easily, without copying. In fact, `memoryview` objects can be quite handy for coercing Python objects that support the Python Buffer Protocol into something that Python and/or C/C++ code can use easily. This works with several Python objects and many Python APIs, and is relied on heavily by many performance-conscious third-party libraries.
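As a minimal sketch of the zero-copy behaviour described above (the variable names are only illustrative), a `memoryview` over a `bytearray` or an `array.array` shares memory with the original object:

```python
import array

# A bytearray exposes its memory through the buffer protocol.
buf = bytearray(b"Hello World!")
mv = memoryview(buf)

# Writing through the view changes the original object: no copy was made.
mv[0] = ord("J")
print(buf)  # bytearray(b'Jello World!')

# Typed arrays work too; the view reports the element format and item size.
arr = array.array("i", [1, 2, 3])
arr_mv = memoryview(arr)
print(arr_mv.format, arr_mv.itemsize, arr_mv.tolist())
```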

However, one heavily used Python object that doesn't support this API is `str` (previously `unicode`), as the code below shows.

```python
In [1]: s = "Hello World!"

In [2]: mv = memoryview(s)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-3403c1ca3811> in <module>
----> 1 mv = memoryview(s)

TypeError: memoryview: a bytes-like object is required, not 'str'
```

The canonical answer today is [to encode to `bytes` first](https://stackoverflow.com/a/54449407) and decode to `str` later. While this is fine for a smallish piece of text, it can slow down considerably for larger pieces of text. So being able to skip this encode/decode step can be quite impactful.

```python
In [1]: s = "Hello World!"

In [2]: %timeit s.encode();
54.9 ns ± 0.0788 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [3]: s = 100_000_000 * "Hello World!"

In [4]: %timeit s.encode();
729 ms ± 1.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
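For reference, the encode/decode workaround mentioned above looks roughly like the following. This is only a sketch: UTF-8 is chosen here as an assumption, and any codec would show the same two copies.

```python
s = "Hello World!"

# Encoding copies the text into a new bytes object, which does
# support the buffer protocol.
encoded = s.encode("utf-8")
mv = memoryview(encoded)

# Decoding copies the data again on the way back to str.
# str() accepts any bytes-like object when an encoding is given.
round_tripped = str(mv, "utf-8")
assert round_tripped == s
```

Both the `encode` and the decode step allocate and fill a new object, which is where the time goes for large strings.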

AIUI (though I could be misunderstanding things, and if so apologies) `str` objects do use some kind of typed array of Unicode characters under the hood (in CPython 3.3 and later, 1, 2, or 4 bytes per character depending on the string's contents, rather than the older 16-bit narrow or 32-bit wide builds). So it seems like it *should* be possible to expose this as a 1-D contiguous array that C/C++ code could use.
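To illustrate what such a 1-D contiguous typed array could look like from the consumer's side, here is a sketch that still goes through an encode step. It assumes UTF-32 in native byte order as a stand-in fixed-width layout and assumes the `"I"` format is 4 bytes wide, which holds on mainstream platforms; it does not show CPython's actual internal representation.

```python
import sys

s = "Hello World!"

# Pick the UTF-32 variant matching the machine's byte order so that the
# native "I" format reads each unit correctly (and no BOM is written).
codec = "utf-32-le" if sys.byteorder == "little" else "utf-32-be"
data = s.encode(codec)  # still a copy; this is the step the proposal would avoid

# Reinterpret the bytes as a 1-D array of unsigned 32-bit code points.
code_points = memoryview(data).cast("I")
print(code_points.ndim, code_points.itemsize, len(code_points))
assert code_points.tolist() == [ord(c) for c in s]
```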

It would be quite helpful to bypass this encoding/decoding step and instead work directly with the underlying buffer in situations where C/C++ is involved, to help performance-critical code.

[1]: https://docs.python.org/3/c-api/buffer.html

----------
components: Library (Lib)
messages: 374147
nosy: jakirkham
priority: normal
severity: normal
status: open
title: memoryview of str (unicode)
type: enhancement
versions: Python 3.10, Python 3.8, Python 3.9

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue41377>
_______________________________________