[docs] copy&waste problem

Hauke Rehr homo_laber at yahoo.de
Thu Mar 15 10:11:44 CET 2012


Hello Senthil

Sorry, but your program doesn’t prove anything beyond the correct behaviour of sets in Python.
Maybe there was a misunderstanding. Your code shows exactly the behaviour I would have expected. As I said earlier: for the positive (lowercase) classes it’s union, for the negative (uppercase) ones it’s intersection. That is precisely what your program shows, so I don’t see where you got me wrong.
However, your code doesn’t deal with the UNICODE or LOCALE flags at all. I wanted to see whether they behave as correctly as the builtin set data structure you used in your code. You’d first have to set up the UNICODE and LOCALE definitions of space characters and then check whether they show correct results, which in my opinion are:
s(uni, loc) = union(s(uni), s(loc)) and S(uni, loc) = intersection(S(uni), S(loc))
analogous to the example in your code. I’d still expect them to work like this.
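
The identity can be checked directly with Python sets. This is only a minimal sketch: the toy universe and the names s_uni, s_loc, S_uni, S_loc are made up for illustration and say nothing about what the re module actually does with the flags.

```python
# Hypothetical toy alphabet; 's' plays the role of an extra locale-only space.
universe = {'a', 'b', 'c', 1, 2, 3, '\t', '\n', 's'}
s_uni = {'\t', '\n'}          # stand-in for \s under UNICODE (assumption)
s_loc = {'\t', '\n', 's'}     # stand-in for \s under LOCALE (assumption)

# The negative classes are the complements within the universe.
S_uni = universe - s_uni
S_loc = universe - s_loc

s_both = s_uni | s_loc        # positive class: union
S_both = S_uni & S_loc        # negative class: intersection

# De Morgan: the complement of the union is the intersection of the complements.
assert S_both == universe - s_both

# The union of the complements would let 's' through as a non-space.
assert 's' in (S_uni | S_loc)
assert 's' not in S_both
```

With union for \S, the character 's' would match both \s and \S, which is the contradiction I am pointing at.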

regards,
Hauke

--- Senthil Kumaran <senthil at uthcode.com> wrote on Thu, 15.3.2012:

From: Senthil Kumaran <senthil at uthcode.com>
Subject: Re: [docs] copy&waste problem
To: "Hauke Rehr" <homo_laber at yahoo.de>
Date: Thursday, 15 March 2012, 04:09

Hi Hauke,

Thanks for persisting. I see the logic and I stand corrected. Here is
a simple program which I used to test the theory. In my first
version, I tried it incorrectly.


ascii_allchars = set(['a', 'b', 'c', '\t', '\n'])
kling_allchars = set([1, 2, 3, '\t', '\n', 's'])  # 's' is space in kling
universe = ascii_allchars.union(kling_allchars)
ascii_s = set(['\t', '\n'])
kling_s = set(['\t', '\n', 's'])
ascii_S = universe.difference(ascii_s)
kling_S = universe.difference(kling_s)

# my claim was: combining locale kling with ascii gives
s = ascii_s.union(kling_s)
S = ascii_S.union(kling_S)  # this is wrong.

print(r"\s matches")
print(s)

print(r"\S matches")
print(S)

# for intersection - it is correct
print(ascii_S.intersection(kling_S))


# INCORRECT way of doing this which I was trying.

ascii_allchars = ['a', 'b', 'c', '\t', '\n']
kling_allchars = [1, 2, 3, '\t', '\n', 's']  # 's' is space in kling
ascii_s = ['\t', '\n']
kling_s = ['\t', '\n', 's']
ascii_S = ['a', 'b', 'c']
kling_S = [1, 2, 3]

s = ascii_s + kling_s
S = ascii_S + kling_S

print(r"\s matches")
print(s)

print(r"\S matches")
print(S)

# The other claim is for intersection it should be
print(set(ascii_S).intersection(set(kling_S)))


Having concluded this, we have another problem, namely the behavior
of Python 2.7. Looking at the code, I find that re.L is completely
ignored for whitespace and non-whitespace matching. For re.U, the logic
is: if the char is less than 128, the ASCII non-whitespace characters are
checked, and if it's greater than 128, the Unicode non-whitespace
characters are checked. I think the documentation should be updated to
describe the actual behavior rather than the supposed logic.
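
One way to observe the behavior empirically, rather than by reading the C
source, is to probe which code points \s actually matches under each flag.
The sketch below is an assumption-laden illustration, not the 2.7 code I
looked at: it targets Python 3, where str patterns use Unicode matching by
default and re.ASCII restricts \s to ASCII whitespace. The helper name
whitespace_matches and the scan limit are made up for this example.

```python
import re

def whitespace_matches(flags=0, limit=0x3000):
    """Return the code points up to `limit` (inclusive) that \\s matches."""
    pattern = re.compile(r'\s', flags)
    return {cp for cp in range(limit + 1) if pattern.fullmatch(chr(cp))}

ascii_ws = whitespace_matches(re.ASCII)   # ASCII-only whitespace
unicode_ws = whitespace_matches()         # Unicode matching (Python 3 default for str)

# Code points that count as whitespace only under Unicode matching.
unicode_only = unicode_ws - ascii_ws
print(sorted(hex(cp) for cp in unicode_only))
```

Comparing the two sets shows which characters are affected by the flag; the
same kind of probe could be run on 2.7 with re.U and re.L to document what
the module really does.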

Thank you,
Senthil

On Wed, Mar 14, 2012 at 04:38:41PM +0000, Hauke Rehr wrote:
> Hello,
> 
> you wrote:
> If the intersection logic were to be followed then, it would
> completely remove those "<space>, <form-feed>, <newline>,
> <carriage-return>, <tab>, and <vertical-tab>" from the match as they
> are included as space characters in the locale definition too.
> 
> I don’t see where you get this from: the negative (uppercase) classes - to
> which I still think the intersection logic should apply - don’t have those
> space characters for they’re the exact complement: everything but those space
> chars.
> 
> I say, to build the set \S you either build the union of the positive
> (lowercase) classes and complement the result, or equivalently intersect the
> complements \S(unicode) and \S(locale) which don’t contain any spaces (and so
> does the result which is what we want when specifying \S).
> 
> To cut it short: if characters a, b are in \s(unicode) and a, c are in
> \s(locale), then \S(unicode, locale) should contain exactly those characters
> that are in neither \s(unicode) nor \s(locale). In particular, it should match
> none of a, b, c.
> Did your example script show a different behaviour? Please send it so I can
> see if I’m wrong about that.
> 
> I guess you’ll agree that \s(uni, loc) is the union of \s(uni) and \s(loc).
> \S(uni, loc) is meant to be the exact complement, so it’s the complement of the
> union, that is, the intersection of the complements \S(uni) and \S(loc).
> 
> Union would not make sense to me:
> if I want a non-space, I don’t want it to be considered space by any of uni or
> loc, so I want it to be in both complements and thus in their (the
> complements’) intersection.
> 
> I hope your script will either show I got this wrong or prove my understanding
> to be correct.
> 
> --- Senthil Kumaran <senthil at uthcode.com> wrote on Tue, 13.3.2012:
> 
> 
>     From: Senthil Kumaran <senthil at uthcode.com>
>     Subject: Re: [docs] copy&waste problem
>     To: "Hauke Rehr" <homo_laber at yahoo.de>
>     CC: docs at python.org
>     Date: Tuesday, 13 March 2012, 17:16
> 
>     I understand your points. My reasoning was based on
> 
>     man 5 locale, which states the following for the locale definition:
> 
>     space  followed by a list of characters defined as white-space
>     characters. Characters also specified as upper, lower,
>     alpha, digit, graph, or xdigit are not allowed. The
>     characters <space>, <form-feed>, <newline>, <carriage-return>,
>     <tab>, and <vertical-tab> are automatically included.
> 
>     If the intersection logic were to be followed then, it would
>     completely remove those "<space>, <form-feed>, <newline>,
>     <carriage-return>, <tab>, and <vertical-tab>" from the match as they
>     are included as space characters in the locale definition too.
>     Isn't that so?
> 
>     Here is the bug report:
> 
>     http://bugs.python.org/issue14258
> 
>     Thanks,
>     Senthil
> 
> 
> 