[Python-Dev] thoughts on the bytes/string discussion

Wed Jul 7 12:56:18 CEST 2010

M.-A. Lemburg wrote:

> Note that using UTF-8 as internal storage format would not work
> in Python, since Python is a Unicode producer, i.e. it needs to
> be able to generate and work with code points that are not allowed
> in UTF-8, e.g. lone surrogates.

Well, it wouldn't strictly be UTF-8, any more than the
2-byte build is strictly UTF-16: in both cases lone
surrogates can be produced.
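
As a rough illustration of the "not strictly UTF-8" point, here is a
small sketch in Python 3. The helper name strict_utf8_ok is made up
for this example; the "surrogatepass" error handler is a standard
codec error handler. It shows a lone surrogate being rejected by the
strict codec but still representable as a UTF-8-style byte sequence:

```python
# A lone high surrogate: legal inside a Python str, but not valid UTF-8.
lone = "\ud800"


def strict_utf8_ok(s):
    """Hypothetical helper: True if s encodes under the strict UTF-8 codec."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# The relaxed handler emits the 3-byte pattern a UTF-8-like storage
# format would need in order to hold the surrogate.
relaxed = lone.encode("utf-8", "surrogatepass")

print(strict_utf8_ok(lone))   # the strict codec refuses the surrogate
print(relaxed)                # the bytes of the relaxed encoding
```

This is only meant to show that a relaxed variant of the encoding can
carry such code points, not how an actual 1-byte build would store them.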

> Another reason not to use UTF-8 encoded code units is that slicing
> based on code units could easily create invalid UTF-8 which would
> then render the data unusable. This is a lot less likely to happen
> with UCS2 or UCS4.

The use cases I had in mind for a 1-byte build are those for
which the alternative would be keeping everything in bytes.
Applications using a 1-byte build would need to be aware of
the fact and take care to slice strings only at code point
boundaries. If they were using bytes, they would have to face
exactly the same issues.
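
To make "taking care" a little more concrete, here is a minimal sketch
of what such an application might do when working on UTF-8 code units.
The names is_boundary and safe_prefix are invented for the example,
and it assumes the buffer holds valid UTF-8 to begin with:

```python
def is_boundary(buf, i):
    """True if index i does not fall inside a multi-byte UTF-8 sequence."""
    # Continuation bytes have the bit pattern 10xxxxxx.
    return i <= 0 or i >= len(buf) or (buf[i] & 0xC0) != 0x80


def safe_prefix(buf, n):
    """Return at most n bytes of buf, cut at a code point boundary."""
    n = max(0, min(n, len(buf)))
    while not is_boundary(buf, n):
        n -= 1  # back off until we are no longer mid-character
    return buf[:n]


data = "na\u00efve".encode("utf-8")  # the i-diaeresis takes two bytes

naive_cut = data[:3]               # splits the two-byte character
careful_cut = safe_prefix(data, 3)  # backs off to the previous boundary
```

Here naive_cut no longer decodes as UTF-8, while careful_cut does. The
same check would be needed whether the data lives in a 1-byte string
or in a bytes object.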

> And finally: RAM is cheap and today's CPUs work better with 16- or
> 32-bit values than 8-bit characters.

Yet some people have reported significant performance benefits
for some applications from using a 2-byte build instead of a
4-byte build. I was just speculating whether a 1-byte build
might be of further advantage in a few specialised cases.
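
For a rough feel of the sizes involved, the following sketch uses
encoded lengths as a stand-in for the storage each kind of build would
need. This is an assumption made for illustration: it measures codec
output, not the actual internal buffers of any build, and the function
name storage_bytes is made up.

```python
def storage_bytes(text):
    """Approximate per-build storage for text, using codec output sizes."""
    return {
        "1-byte": len(text.encode("utf-8")),
        "2-byte": len(text.encode("utf-16-le")),
        "4-byte": len(text.encode("utf-32-le")),
    }


# Mostly-ASCII data, the kind that might otherwise be kept in bytes.
sample = "GET /index.html HTTP/1.1"
sizes = storage_bytes(sample)

for build, nbytes in sizes.items():
    print(build, nbytes)
```

For pure ASCII input the three figures come out in the ratio 1:2:4,
which is the sort of specialised case I had in mind.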

No matter how much RAM or processing speed you have, it's always
possible to find an application that stresses the limits.

-- 
Greg