[docs] copy&waste problem

Hauke Rehr homo_laber at yahoo.de
Tue Mar 13 13:17:37 CET 2012

Hello again,

I’d rather use the ticket you started, but I couldn’t find where to post an answer/where the discussion is tracked.

So once again, to clarify what I had in mind:

As for the positive (lowercase) classes: yes, that’s a union (the first one is tried and, if it doesn’t match, the other one is tried). For the negative (uppercase) classes, De Morgan’s law applies:
complement(union(A, B)) = intersection(complement(A), complement(B))
So we have
\<uppercase> = complement(\<lowercase>)
= complement(union(\<lowercase_locale>, \<lowercase_unicode>))
= intersection(complement(\<lowercase_locale>), complement(\<lowercase_unicode>))
= intersection(\<uppercase_locale>, \<uppercase_unicode>).
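
As a minimal sketch of the identity above, the same steps can be replayed with Python sets. The universe and the two membership sets are made up for illustration; they are not what any real locale or the Unicode database defines.

```python
# Hypothetical toy universe and class memberships, for illustration only.
universe = set("ab \t\u00a0\u2003")
space_locale = {" ", "\t", "\u00a0"}     # assumed: what some locale calls space
space_unicode = {" ", "\t", "\u2003"}    # assumed: what Unicode calls space

# Positive class: \s as the union of both definitions.
space = space_locale | space_unicode

# Negative class: \S as the complement of \s within the universe.
non_space = universe - space

# De Morgan: the complement of a union equals the intersection of the complements.
de_morgan = (universe - space_locale) & (universe - space_unicode)

print(sorted(non_space) == sorted(de_morgan))
```

Under these assumptions, only the characters that neither definition treats as space end up in the negative class.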

At least, that’s how it should be, and it is what the code you quoted means
(it doesn’t matter which one you try first: union is symmetric),
because a match of an uppercase class means nothing but
a char that doesn’t match the corresponding lowercase class.
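
The complement relationship itself can be checked empirically. The sketch below assumes a current Python 3 interpreter, where str patterns use Unicode semantics by default (the flags discussed in this thread behave differently there), and it only samples code points up to U+3000:

```python
import re

# Sample of code points to test; the upper bound is an arbitrary choice.
chars = [chr(cp) for cp in range(0x3000 + 1)]

# Characters matched by the positive class and by the negative class.
ws = {c for c in chars if re.fullmatch(r"\s", c)}
non_ws = {c for c in chars if re.fullmatch(r"\S", c)}

# \S should match exactly the characters that \s does not match.
print(ws.isdisjoint(non_ws), (ws | non_ws) == set(chars))
```

If the two classes are true complements, every sampled character falls into exactly one of the two sets.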

So I still believe my corrections to be - well, correct; and your suggestion to be erroneous.


--- Senthil Kumaran <senthil at uthcode.com> wrote on Mon, 12 Mar 2012:

From: Senthil Kumaran <senthil at uthcode.com>
Subject: Re: [docs] copy&waste problem
To: "Hauke Rehr" <homo_laber at yahoo.de>
CC: docs at python.org
Date: Monday, 12 March 2012, 04:09

Hello Hauke,
I think you are mistaken about the meaning of the re.LOCALE flag for whitespace. It is not the intersection but the union of the locale's space characters with the ASCII space characters.

For \S, with the LOCALE flag set, it will match [^ \t\n\r\f\v] plus any non-whitespace characters defined by that locale.

+   In case both ``re.LOCALE`` and ``re.UNICODE`` are specified alongside,
+   these character classes will behave as if the union was given.

Where did you find this logic? As far as I can see, the locale flag is checked first and then the Unicode flag.
In Modules\_sre.c:

    if (pattern->flags & SRE_FLAG_LOCALE)
        state->lower = sre_lower_locale;
    else if (pattern->flags & SRE_FLAG_UNICODE)

I am going ahead with the changes as I suggested previously and am also opening a bug report. Further discussion and changes can be tracked there. Yes, sometimes doc changes go through discussion and iteration too. :(

-- Senthil

On Fri, Mar 9, 2012 at 6:12 AM, Hauke Rehr <homo_laber at yahoo.de> wrote:

Hello again,

I can’t agree with your rewrite either, sorry - my suggestion based on yours:

+   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified,
+   matches any non-whitespace character; this is equivalent to the set
+   ``[^ \t\n\r\f\v]``. With :const:`LOCALE`, it will match those elements of the above set
+   not defined as space in the current locale. If :const:`UNICODE` is set, those elements
+   of ``[^ \t\n\r\f\v]`` not marked as space in the Unicode character properties database
+   will be matched.

If I don’t get the meaning of \S (that is: anything but \s) wrong, this should be correct.
The same applies to \W:

+   this will match anything other than ``[0-9_]`` not classified as
+   alphanumeric in the Unicode character properties database.

For the additional sentence, I’d prefer:

+   In case both ``re.LOCALE`` and ``re.UNICODE`` are specified alongside,
+   these character classes will behave as if the union was given.

because that is the logic behind it.


