[pypy-dev] PyPy 2 unicode class

Fri Jan 24 00:06:48 CET 2014

On Thu, Jan 23, 2014 at 10:45:25PM +0200, Elefterios Stamatogiannakis wrote:
> >But having said all this, I know that using UTF-8 internally for strings
> >is quite common (e.g. Haskell does it, without even an index cache, and
> >documents that indexing operations can be slow). CPython's FSR has
> >received much (in my opinion, entirely uninformed) criticism from one
> >vocal person in particular for not using UTF-8 internally. If PyPy goes
> >ahead with using UTF-8 internally, I look forward to comparing memory
> >and time benchmarks of string operations between CPython and PyPy.
> >
> 
> I have to admit that due to my work (databases and data processing),
> i'm biased towards I/O (UTF-8 is better due to size) rather than
> CPU.
> 
> At least from my use cases, the most frequent operations that i do
> on strings are read, write, store, use them as keys in dicts,
> concatenate and split.
> 
> For most of above things (with the exception of split maybe?), an
> index cache would not be needed, and UTF-8 due to its smaller size
> would be faster than wide unicode encodings.

I hear Steven's points, but my experience matches Elefterios' -
smaller data is faster[1].  I'll also note that although many string
processing algorithms can be written in terms of indexing, many (most?)
are really stream processing algorithms which do not need efficient
conversion between character offsets and byte offsets.  For
example, split works by walking the entire string in a single pass,
outputting substrings as it goes.
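
To illustrate the single-pass point, here is a minimal sketch in plain
Python, not PyPy's actual implementation.  The name split_utf8 is
hypothetical, and the sketch assumes the separator is itself a complete,
valid UTF-8 sequence.  Because UTF-8 is self-synchronizing, a byte-level
search for such a separator can only match at a character boundary, so
the loop works purely on byte offsets and never computes a character
index:

```python
def split_utf8(data: bytes, sep: bytes = b" "):
    """Yield the pieces of UTF-8 encoded `data` separated by `sep`.

    Hypothetical helper for illustration only.  It walks the buffer
    once, front to back, using byte offsets exclusively.
    """
    if not sep:
        raise ValueError("empty separator")
    start = 0
    while True:
        # bytes.find scans forward from `start` and returns a byte offset.
        end = data.find(sep, start)
        if end == -1:
            # No more separators: the rest of the buffer is the last piece.
            yield data[start:]
            return
        yield data[start:end]
        start = end + len(sep)


text = "naïve café ünïcödé"
pieces = [p.decode("utf-8") for p in split_utf8(text.encode("utf-8"))]
```

Here pieces should match what text.split(" ") gives on the decoded
string, even though several of the characters occupy two bytes each.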

regards,
njh
[1] which suggests that lz77ing longer strings by default is not a
terrible idea.