[docs] copy&waste problem

Senthil Kumaran senthil at uthcode.com
Mon Mar 12 04:09:11 CET 2012


Hello Hauke,

I think you have misunderstood what the re.LOCALE flag means for whitespace.
It is not the intersection but the union of the locale's space characters
with the ASCII space characters.

For \S with the LOCALE flag set, it will match [^ \t\n\r\f\v] plus any
non-whitespace characters defined by that locale.
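
A quick way to see how the \s and \S sets relate under different flags is to
test a handful of characters at the interpreter. The sketch below is only an
illustration: it assumes Python 3, and it compares re.ASCII with re.UNICODE
rather than re.LOCALE, because LOCALE results depend on the locale installed
on the machine. The sample characters and the helper name `classify` are made
up for this example.

```python
import re

# A few ASCII characters plus two non-ASCII spaces (NBSP and EM SPACE).
samples = [" ", "\t", "\n", "a", "_", "\u00a0", "\u2003"]

def classify(flags):
    """Map each sample to (matches \\s, matches \\S) under the given flags."""
    space = re.compile(r"\s", flags)
    non_space = re.compile(r"\S", flags)
    return {
        ch: (bool(space.fullmatch(ch)), bool(non_space.fullmatch(ch)))
        for ch in samples
    }

ascii_only = classify(re.ASCII)       # \s limited to [ \t\n\r\f\v]
unicode_aware = classify(re.UNICODE)  # \s also covers Unicode whitespace

for ch in samples:
    print(repr(ch), ascii_only[ch], unicode_aware[ch])
```

For every sample exactly one of \s and \S matches, so under a given flag \S
behaves as the complement of \s for these characters.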



+   In case both ``re.LOCALE`` and ``re.UNICODE`` are specified alongside,
+   these character classes will behave as if the union was given.

Where did you find this logic? As far as I can see, the locale flag is
checked first and then the unicode flag.

In Modules/_sre.c:

    if (pattern->flags & SRE_FLAG_LOCALE)
        state->lower = sre_lower_locale;
    else if (pattern->flags & SRE_FLAG_UNICODE)

I am going ahead with the changes I suggested previously and am also
opening a bug report. Further discussion and changes can be tracked there.
Yeah, sometimes doc changes go through discussions and iterations too. :(

-- 
Senthil



On Fri, Mar 9, 2012 at 6:12 AM, Hauke Rehr <homo_laber at yahoo.de> wrote:

> Hello again,
>
> I can’t agree with your rewrite either, sorry - my suggestion based on
> yours:
>
>
> +   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified,
> +   matches any non-whitespace character; this is equivalent to the set
> +   ``[^ \t\n\r\f\v]``. With :const:`LOCALE`, it will match those elements of
> +   the above set not defined as space in the current locale. If
> +   :const:`UNICODE` is set, those elements of ``[^ \t\n\r\f\v]`` not marked
> +   as space in the Unicode character properties database will be matched.
>
> If I don’t get the meaning of \S (that is: anything but \s) wrong, this
> should be correct.
> The same applies to \W:
>
> +   this will match anything other than ``[0-9_]`` not classified as
> +   alphanumeric in the Unicode character properties database.
>
>
> For the additional sentence, I’d prefer:
>
> +   In case both ``re.LOCALE`` and ``re.UNICODE`` are specified alongside,
> +   these character classes will behave as if the union was given.
>
> since that is the logic behind it.
>
> Hauke
>
> --- Senthil Kumaran <senthil at uthcode.com> wrote on Fri, 9 Mar 2012:
>
> From: Senthil Kumaran <senthil at uthcode.com>
> Subject: Re: [docs] copy&waste problem
> To: "Hauke Rehr" <homo_laber at yahoo.de>
> CC: docs at python.org
> Date: Friday, 9 March 2012, 09:18
>
> Hello Hauke,
>
> Yeah, it was pretty confusing. Thanks for catching this. How does this
> change sound?
>
> -   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
> -   any non-whitespace character; this is equivalent to the set ``[^ \t\n\r\f\v]``
> -   With :const:`LOCALE`, it will match any character not in this set, and not
> -   defined as space in the current locale. If :const:`UNICODE` is set, this will
> -   match anything other than ``[ \t\n\r\f\v]`` and characters marked as space in
> -   the Unicode character properties database.
> +   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified,
> +   matches any non-whitespace character; this is equivalent to the set
> +   ``[^ \t\n\r\f\v]`` With :const:`LOCALE`, it will match the above set and any
> +   non-space character in the current locale. If :const:`UNICODE` is set, the
> +   above set ``[^ \t\n\r\f\v]`` and characters not marked as space in the
> +   Unicode character properties database.
>
> ``\w``
>     When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
> @@ -381,8 +381,8 @@
>     any non-alphanumeric character; this is equivalent to the set ``[^a-zA-Z0-9_]``.
>     With :const:`LOCALE`, it will match any character not in the set ``[0-9_]``, and
>     not defined as alphanumeric for the current locale. If :const:`UNICODE` is set,
> -   this will match anything other than ``[0-9_]`` and characters marked as
> -   alphanumeric in the Unicode character properties database.
> +   this will match anything other than ``[0-9_]`` plus characters classified as
> +   not alphanumeric in the Unicode character properties database.
>
>
> Hope the rewrite is less confusing.
>
> We can also include this sentence somewhere.
>
> When both re.LOCALE and re.UNICODE are specified together, re.LOCALE
> is matched first and then re.UNICODE.
>
>
> --
> Senthil
>