[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug
Tom Christiansen
report at bugs.python.org
Sun Aug 14 03:07:32 CEST 2011
Tom Christiansen <tchrist at perl.com> added the comment:
Matthew Barnett <report at bugs.python.org> wrote
on Sat, 13 Aug 2011 20:57:40 -0000:
> There are occasions when you want to do string slicing, often of the form:
> pos = my_str.index(x)
> endpos = my_str.index(y)
> substring = my_str[pos : endpos]
Me, I would probably give the second call to index the first
index position to guarantee the end comes after the start:
    str = "for finding the biggest of all the strings"
    x_at = str.index("big")
    y_at = str.index("the", x_at)
    some = str[x_at:y_at]
    print("GOT", some)
But here's a serious question: is that *actually* a common usage pattern
for accessing strings in Python? I ask because it wouldn't even *occur* to
me to go at such a problem in that way. I would have always just written
it this way instead:
    import re
    str = "for finding the biggest of all the strings"
    some = re.search("(big.*?)the", str).group(1)
    print("GOT", some)
I know I would use the pattern approach, just because that's
how I always do such things in Perl:
    $str = "for finding the biggest of all the strings";
    ($some) = $str =~ /(big.*?)the/;
    print "GOT $some\n";
Which is obviously a *whole* lot simpler than the index approach:
    $str = "for finding the biggest of all the strings";
    $x_at = index($str, "big");
    $y_at = index($str, "the", $x_at);
    $len  = $y_at - $x_at;
    $some = substr($str, $x_at, $len);
    print "GOT $some\n";
With no arithmetic and no temporary variables (in the index version you
can't really escape needing $x_at to pass to the second call to index),
it's all a lot more WYSIWYG. See how much easier that is?
Sure, it's a bit cleaner and less noisy in Perl than it is in Python by
virtue of Perl's integrated pattern matching, but I would still use
patterns in Python for this, not index.
I honestly find the equivalent pattern operations a lot easier to read and write
and maintain than I find the index/substring version. It's a visual thing.
I find patterns a win in maintainability over all that busy index monkeywork.
The index/rindex and substring approach is one I almost never ever turn to.
I bet I use pattern matching somewhere between 100 and 500 times for each
time I use index, and maybe even more often than that.
I happen to think in patterns. I don't expect other people to do so. But
because of this, I usually end up picking patterns even if they might be a
little bit slower, because I think the gain in flexibility and especially
maintainability more than makes up for any minor performance concerns.
This might also show you why patterns are so important to me: they're one
of the most important tools we have for processing text. Index isn't,
which is why I really don't care about whether it has O(1) access.
> To me that suggests that if UTF-8 is used then it may be worth
> profiling to see whether caching the last 2 positions would be
> beneficial.
Notice how with the pattern approach, which is inherently sequential, you don't
have all that concern about running over the string more than once. Once you
have the first piece (here, "big"), you proceed directly from there looking for
the second piece in a straightforward, WYSIWYG way. Going at it this way, there
is no need to keep an extra index or even two around on the string structure itself.
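If it helps, here's a rough sketch of that sequential style in Python. The
variable names are just for illustration, but compiled patterns really do let
you hand search() a starting position:

    import re

    s   = "for finding the biggest of all the strings"
    big = re.compile("big")
    the = re.compile("the")

    m1 = big.search(s)            # first scan stops where "big" starts
    m2 = the.search(s, m1.end())  # resume from where the last match ended
    print("GOT", s[m1.start():m2.start()])   # "biggest of all ", as before

The engine never has to answer "what is character N?" at random; it just
keeps moving forward from wherever it already is.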
I would be pretty surprised if Perl could gain any speed by caching a pair of
MRU index values against its UTF-8 [but see footnote], because again, I think
the normal access pattern wouldn't make use of them. Maybe Python programmers
don't think of strings the same way, though. That, I really couldn't tell you.
But here's something to think about:
If it *is* true that you guys do all this index stuff that Perl programmers
just never see or do because of our differing comfort levels with regexes,
and so you think Python might still benefit from that sort of caching
because its culture has promoted a different access pattern, then that caching
benefit would still apply even if you were to retain the current UTF-16
representation instead of going to UTF-8 (which might want it) or to UTF-32
(which wouldn't).
After all, you have the same variable-width issue with UTF-16 as with
UTF-8, so if it makes sense to have an MRU cache mapping character indices to
byte (or code-unit) offsets, then it doesn't matter whether you use UTF-8 or UTF-16!
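To make that concrete, here's a toy sketch of the idea, and nothing more than
that: the class is made up, and CPython does not actually work this way. It
keeps a single MRU (character index, byte offset) pair for a UTF-8 buffer, so
nearby lookups scan only the gap instead of the whole string. Swap the
width-of-one-character rule for a surrogate-pair check and the identical trick
works on UTF-16 code units:

    class Utf8String:
        """Toy illustration only: one cached (char index, byte offset) pair."""

        def __init__(self, text):
            self._buf = text.encode("utf-8")
            self._cache = (0, 0)        # (character index, byte offset)

        def _byte_offset(self, n):
            ci, bo = self._cache
            if n < ci:                  # seeking backward: restart at the front
                ci, bo = 0, 0
            while ci < n:               # scan only the gap, a character at a time
                lead = self._buf[bo]    # UTF-8 lead byte encodes the width
                bo += 1 if lead < 0x80 else 2 if lead < 0xE0 else \
                      3 if lead < 0xF0 else 4
                ci += 1
            self._cache = (ci, bo)      # remember where we stopped
            return bo

        def char_at(self, n):
            start = self._byte_offset(n)
            end = self._byte_offset(n + 1)   # one step past the cached spot
            return self._buf[start:end].decode("utf-8")

    s = Utf8String("élan vital")
    print(s.char_at(1))   # 'l'; scanned from the cached offset, not byte 0

Even this stripped-down version pays for two extra words in every string,
which brings me to the cost question.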
However, I'd want some passive comparative benchmarks using real programs with
real data, because I would be suspicious of incurring the memory cost of two
more pointers in every string in the whole program. That's serious.
--tom
FOOTNOTE: The Perl 6 people are thinking about clever ways to set up byte
    offset indices. You have to do this if you want O(1) access to the
    Nth element whenever the elements are not simple code points, even if
    you use UTF-32. That's because they want the default string element to
    be a user-visible grapheme, not a code point. I know they have clever
    ideas, but I don't know how critical O(1) access truly is, nor whether
    it's worth the overhead this would require. But perhaps it would be
    extensible to other sorts of string elements, like locale-based
    alphabetic units (such as <ch>, <dz>, <ll>) or even words, and so would
    prove interesting to try nonetheless.
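For the curious, here's a toy Python sketch of that precomputed-offsets idea.
It uses a deliberately simplified boundary rule (a nonzero combining class
extends the previous grapheme); real UAX #29 segmentation handles far more
(ZWJ emoji sequences, Hangul jamo, regional indicators, and so on):

    import unicodedata

    def grapheme_starts(s):
        # Simplified rule: combining marks attach to the preceding base
        # character. Real grapheme segmentation (UAX #29) is much richer.
        starts = [i for i, ch in enumerate(s)
                  if i == 0 or unicodedata.combining(ch) == 0]
        starts.append(len(s))            # sentinel: one past the end
        return starts

    def grapheme_at(s, starts, n):
        # O(1) once the offsets table has been built
        return s[starts[n]:starts[n + 1]]

    s = "cafe\u0301 noir"                # "café noir" with a combining acute
    starts = grapheme_starts(s)
    print(grapheme_at(s, starts, 3))     # é: base plus mark, as one element

Swap in a different boundary function and the same offsets table could just
as well index locale units like <ch> or even whole words.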
----------
_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue12729>
_______________________________________