[Python-Dev] PEP 393 Summer of Code Project

Glenn Linderman v+python at g.nevcal.com
Thu Sep 1 11:55:22 CEST 2011


On 9/1/2011 12:59 AM, Stephen J. Turnbull wrote:
> Glenn Linderman writes:
>
>   >  We can either artificially constrain ourselves to minor tweaks of
>   >  the legal conforming bytestreams,
>
> It's not artificial.  Having the internal representation be the same
> as a standard encoding is very useful for a large number of minor
> usages (urgently saving buffers in a text editor that knows its
> internal state is inconsistent, viewing strings in the debugger, PEP
> 393-style space optimization is simpler if text properties are
> out-of-band, etc).

Saving buffers urgently when the internal state is inconsistent sounds 
like carefully preserving a bug.  Windows 7 64-bit on one of my 
computers happily crashes several times a day when it detects 
inconsistent internal state... under the theory, I guess, that losing 
work is better than saving bad work.  You seem to take the opposite view.

I'm actually very grateful that Firefox and emacs recover gracefully 
from Windows crashes, and I lose very little data from them, but I 
cannot recommend Windows 7 (this machine being my only experience with 
it) for stability.

In any case, the operations you mention still require the data to be 
processed, if ever so slightly, and I'll admit that a more complex 
representation would require a bit more processing.  It's not clear that 
it would be huge or problematic for these cases.

Except, I'm not sure how the PEP 393 space optimization fits with the 
other operations.  It may even be that an application-wide 
complex-grapheme cache would save significant space, although if it uses 
high bits in the string representation to reference the cache, PEP 393 
would jump immediately to something > 16 bits per grapheme... but it 
likely would anyway, if complex graphemes are in the data stream.
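[Editor's note: the width jump described above is observable in CPython 3.3+, where PEP 393 eventually landed: a string's per-character storage is chosen from its widest code point. A minimal sketch:]

```python
import sys

# Under PEP 393 (as shipped in CPython 3.3), a str's per-character width
# is chosen from its widest code point: 1 byte (Latin-1 range), 2 bytes
# (rest of the BMP), or 4 bytes (astral planes).
latin  = "a" * 100           # all <= U+00FF  -> 1 byte/char
bmp    = "\u4e2d" * 100      # <= U+FFFF      -> 2 bytes/char
astral = "\U0001F600" * 100  # >  U+FFFF      -> 4 bytes/char

# A single wide character anywhere in the string forces the whole
# string to the wider representation.
assert sys.getsizeof(bmp) > sys.getsizeof(latin)
assert sys.getsizeof(astral) > sys.getsizeof(bmp)
```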

>   >  or we can invent a representation (whether called str or something
>   >  else) that is useful and efficient in practice.
>
> Bring on the practice, then.  You say that a bit to identify lone
> surrogates might be useful or efficient.  In what application?  How
> much time or space does it save?

I didn't attribute any efficiency to flagging lone surrogates (BI-5).  
Since Windows uses a non-validated UCS-2 or UTF-16 character type, any 
Python program that obtains data from Windows APIs may be confronted 
with lone surrogates or inappropriate combining characters at any time.  
Round-tripping that data seems useful, even though the data itself may 
not be as useful as validated Unicode characters would be.  Accidentally 
combining the characters due to slicing and dicing the data, or doing 
normalizations, or whatnot, would likely not be appropriate.  However, 
returning modified forms of it to Windows as UCS-2 or UTF-16 data may 
still cause other applications to later accidentally combine the 
characters, if the modifications juxtaposed things to make them look 
reasonable, even if accidentally.  If intentionally, of course, the bit 
could be turned off.  This exact sort of problem with non-validated 
UTF-8 bytes was already addressed in Python, mostly for Linux, allowing 
round-tripping of the byte stream even though it is not valid.  BI-6 
suggests a different scheme for that, without introducing lone 
surrogates (which might accidentally get combined with other lone 
surrogates).
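[Editor's note: the prior art referred to here is the surrogateescape error handler of PEP 383, which lets undecodable bytes round-trip through str as lone surrogates. A minimal sketch:]

```python
# PEP 383 round-tripping: an invalid byte in the input becomes a lone
# surrogate on decode, and encodes back to the exact original byte.
raw = b"caf\xe9.txt"  # 0xE9 is Latin-1 e-acute, invalid as UTF-8

text = raw.decode("utf-8", errors="surrogateescape")
# The invalid byte is smuggled through as lone surrogate U+DCE9.
assert text == "caf\udce9.txt"

# Encoding with the same handler restores the original bytes exactly.
assert text.encode("utf-8", errors="surrogateescape") == raw
```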

> You say that a bit to cache a
> property might be useful or efficient.  In what application?  Which
> properties?  Are those properties a set fixed by the language, or
> would some bits be available for application-specific property
> caching?  How much time or space does that save?

The brainstorming ideas I presented were just that... ideas.  And they 
were independent.  The use of many high-order bits for properties was 
one of the independent ones.  When I wrote that one, I was assuming a 
UTF-32 representation (which wastes 11 bits of each 32).  One thing I 
did have in mind with the high-order bits, for that representation, was 
to flag the start, middle, or end of the codes that are included in a 
grapheme.  That would be redundant with some of the Unicode codepoint 
property databases, if I understand them properly... whether it would 
make iterators enough more efficient to be worth the complexity would 
have to be benchmarked.  After writing all those ideas down, I actually 
preferred some of the others, which achieved O(1) real grapheme 
indexing, rather than caching character properties.
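[Editor's note: the start/middle flagging described above can be prototyped in pure Python with the unicodedata module. The sketch below uses a rough boundary rule (a code point starts a new grapheme unless it is a combining mark), which is only an approximation of full UAX #29 segmentation; `grapheme_starts` is a name invented here:]

```python
import unicodedata

def grapheme_starts(s):
    """Indices of code points that start a (approximate) grapheme.

    Rough rule: a code point is a grapheme start unless it is a
    combining mark.  Full UAX #29 segmentation has many more rules.
    """
    return [i for i, ch in enumerate(s)
            if unicodedata.combining(ch) == 0]

# "e" + COMBINING ACUTE ACCENT forms one user-perceived character.
s = "cafe\u0301s"
# The accent at index 4 is not a start, so 6 code points -> 5 graphemes.
assert grapheme_starts(s) == [0, 1, 2, 3, 5]
```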

> What are the costs to applications that don't want the cache?  How is
> the bit-cache affected by PEP 393?

If it is a separate type from str, then it costs nothing except the 
extra code space to implement the cache for those applications that do 
want it... most of which wouldn't be loaded for applications that don't, 
if done as a module or C extension.

> I know of no answers (none!) to those questions that favor
> introduction of a bit-cache representation now.  And those bits aren't
> going anywhere; it will always be possible to use a "wide" build and
> change the representation later, if the optimization is valuable
> enough.  Now, I'm aware that my experience is limited to the
> implementations of one general-purpose language (Emacs Lisp) of
> restricted applicability.  But its primary use *is* in text processing,
> so I'm moderately expert.
>
> *Moderately*.  Always interested in learning more, though.  If you
> know of relevant use cases, I'm listening!  Even if Guido doesn't find
> them convincing for Python, we might find them interesting at XEmacs.

OK... ignore the bit-cache idea (BI-1), and reread the others without 
having your mind clogged with that one, and see if any of them make 
sense to you then.  But you may be too biased by the "minor" needs of 
keeping the internal representation similar to the stream representation 
to see any value in them.  I rather like BI-2, since it allows O(1) 
indexing of graphemes.
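[Editor's note: the O(1) grapheme indexing BI-2 aims at can be approximated today by precomputing boundary offsets once, then indexing into the offset table. `GraphemeView` is a hypothetical name, and the boundary rule (combining marks are non-starters) is only a stand-in for full UAX #29 segmentation:]

```python
import unicodedata

class GraphemeView:
    """Hypothetical wrapper giving O(1) grapheme indexing by
    precomputing boundary offsets (combining-mark approximation)."""

    def __init__(self, s):
        self._s = s
        # O(n) once, up front: offsets of approximate grapheme starts.
        self._starts = [i for i, ch in enumerate(s)
                        if unicodedata.combining(ch) == 0]

    def __len__(self):
        return len(self._starts)

    def __getitem__(self, k):
        # O(1): two table lookups and a slice.
        begin = self._starts[k]
        end = (self._starts[k + 1] if k + 1 < len(self._starts)
               else len(self._s))
        return self._s[begin:end]

g = GraphemeView("cafe\u0301s")   # "e" + combining acute = 1 grapheme
assert len(g) == 5
assert g[3] == "e\u0301"
```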