[Python-3000] String comparison

Rauli Ruohonen rauli.ruohonen at gmail.com
Fri Jun 8 00:47:07 CEST 2007


On 6/8/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> How would you expect them to work on arrays of code points?

Just like they do with Python 2.5 unicode objects, as long as the
"array of code points" is str, not e.g. a numpy array or tuple of ints,
which I don't expect to grow string methods :-)

> What sort of answer should the following produce?

That depends on what Python does when it reads in the source code.
I think it should normalize to NFC (which Python 2.5 does not do).

>     # matches by codepoints, but doesn't look like it
>     "Lo&#0308wis".startswith("Lo")
>     # if the above did match, then people will assume ö folds to o
>     "L&#00F6wis".startswith("Lo")
>     # looks like it matches.  Matches as text.  Does not match as bytes.
>     "Lo&#0308wis".startswith("L&#00F6")

Normalized to NFC:

"L&#00F6;wis".startswith("Lo")
"L&#00F6;wis".startswith("Lo")
"L&#00F6;wis".startswith("L&#00F6;")

After this, Python lexes, parses, and executes. The first two are false,
the last one true. All of the examples should look the same in your editor
(at least ideally). The following would, OTOH, be true, false, false:

"Lo\u0308wis".startswith("Lo")
"L\u00F6wis".startswith("Lo")
"Lo\u0308wis".startswith("L\u00F6")

Since the source code here is pure ASCII, it's WYSIWYG everywhere.
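
A minimal sketch of both cases, written with Python 3 str literals. The
helper nfc() is a hypothetical name, and calling unicodedata.normalize()
explicitly only stands in for the source-level normalization proposed
above; it is not something the interpreter is assumed to do on its own.

```python
import unicodedata

# Pure-ASCII source: the escapes make the code points explicit.
decomposed = "Lo\u0308wis"   # 'o' followed by U+0308 COMBINING DIAERESIS
precomposed = "L\u00F6wis"   # U+00F6 LATIN SMALL LETTER O WITH DIAERESIS


def nfc(s):
    """Hypothetical helper: return s normalized to NFC."""
    return unicodedata.normalize("NFC", s)


# Code-point comparison on the strings exactly as written.
raw_results = (
    decomposed.startswith("Lo"),
    precomposed.startswith("Lo"),
    decomposed.startswith("L\u00F6"),
)

# The same comparisons after both operands are normalized to NFC.
nfc_results = (
    nfc(decomposed).startswith(nfc("Lo")),
    nfc(precomposed).startswith(nfc("Lo")),
    nfc(decomposed).startswith(nfc("L\u00F6")),
)

print(raw_results)  # (True, False, False)
print(nfc_results)  # (False, False, True)
```

The first tuple matches the escaped examples, the second the
NFC-normalized ones.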

Python 2.5's output with each:

>>> u"Löwis".startswith(u"Lo")
True
>>> u"Löwis".startswith(u"Lo")
False
>>> u"Löwis".startswith(u"Lö")
False
>>> u"Lo\u0308wis".startswith(u"Lo")
True
>>> u"L\u00F6wis".startswith(u"Lo")
False
>>> u"Lo\u0308wis".startswith(u"L\u00F6")
False
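
Since the first three literals look identical on screen, listing the code
points is one way to tell which form a string is actually in. This is
only a sketch for Python 3; describe() is a hypothetical helper name, not
part of any library.

```python
import unicodedata


def describe(s):
    """Hypothetical helper: one 'U+XXXX NAME' entry per code point in s."""
    return ["U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>"))
            for ch in s]


# Decomposed form: three code points.
print(describe("Lo\u0308"))
# Precomposed form: two code points.
print(describe("L\u00F6"))
```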

