[Tutor] lc_ctype and re.LOCALE

Albert-Jan Roskam sjeik_appie at hotmail.com
Sun Jan 31 16:41:37 EST 2016


> From: sjeik_appie at hotmail.com
> To: oscar.j.benjamin at gmail.com
> Date: Sun, 31 Jan 2016 19:56:21 +0000
> Subject: Re: [Tutor] lc_ctype and re.LOCALE
> CC: tutor at python.org
> 
> 
> > From: oscar.j.benjamin at gmail.com
> > Date: Fri, 29 Jan 2016 16:32:57 +0000
> > Subject: Re: [Tutor] lc_ctype and re.LOCALE
> > To: sjeik_appie at hotmail.com
> > CC: tutor at python.org
> > 
> > On 28 January 2016 at 20:23, Albert-Jan Roskam <sjeik_appie at hotmail.com> wrote:
> > >
> > > Out of curiosity, I wrote the throw-away script below to find a character that is classified (--> LC_CTYPE) as digit in one locale, but not in another.
> > > I ran it with 5000 locale combinations in Python 2 but did not find any (somebody shut down my computer!). I just modified the code so it also
> > > runs in Python 3. Is this the correct way to find such locale-dependent regex matches?
> > 
> > Eryk already gave you a better explanation of the locale stuff than I
> > could but I have a separate comment about the algorithmic performance
> > of your code (since you mentioned that it took a long time).
> > 
> > You're looping over all pairs of locales:
> > 
> > ...
> > > for n, (locale1, locale2) in enumerate(itertools.combinations(locales, 2), >
> > ...
> > >     for i in xrange(sys.maxunicode + 1):   # 1114111
> > >         s = unichr(i)  #.encode("utf8")
> > >         try:
> > >             locale.setlocale(locale.LC_CTYPE, locale1)
> > >             m1 = bool(regex.match(s))
> > >             locale.setlocale(locale.LC_CTYPE, locale2)
> > >             m2 = bool(regex.match(s))
> > >             if m1 ^ m2:  # m1 != m2
> > 
> > Suppose there are N locales and M is sys.maxunicode. The number of
> > pairs of locales is N*(N-1)/2 which grows like N**2. For each pair you
> > loop over M characters so the innermost loop body is repeated
> > something like M*N**2 times.
> > 
> > Assume that f(locale, c) is the function that gets e.g. m1 or m2 in
> > your code above. We can swap the loops around so that the outer loop
> > is over unicode characters. Then the inner loop can be over the
> > locales but we only loop over all N locales once rather than over all
> > N**2 pairs of locales. This looks like this:
> > 
> >     for c in unicode_characters:
> >         matched = f(locales[0], c) # Check the first locale
> >         assert all(f(locale, c) == matched for locale in locales)
> > 
> > This way you call f(locale, c) M*N times which if N is not small
> > should be a lot faster than M*N**2 times.
> 
> Hi Oscar,
> 

Oh, it seems NoScript or something messed up Hotmail. Here is what I intended to send:

I blindly followed a code tuning tip I once read in Code Complete (McConnell 2004; page 623 [1]):
 "Putting the Busiest Loop on the Inside
 
 When you have nested loops, think about which loop you want on the outside and
 which you want on the inside. Following is an example of a nested loop that can be
 improved:
 ...
 The key to improving the loop is that the outer loop executes much more often than the
 inner loop. Each time the loop executes, it has to initialize the loop index, increment it
 on each pass through the loop, and check it after each pass" 
 
 [1] https://khmerbamboo.files.wordpress.com/2014/09/code-complete-2nd-edition-v413hav.pdf
 
 Your advice makes perfect sense, though. So McConnell's code tuning tip may be a rule of thumb, with exceptions, right?
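
 A minimal sketch of the two loop orders, for comparison. Everything in it is made up for illustration: the locale names, the FAKE_DIGITS table and the stand-in f() are assumptions, not real locale behaviour. A real f() would call locale.setlocale(locale.LC_CTYPE, ...) and then match a re.LOCALE pattern. It is written for Python 3 and only counts how often f() would be called in each version.

```python
import itertools

# Hypothetical data: made-up locale names and the characters each one
# "classifies" as a digit. This is NOT real locale behaviour.
FAKE_DIGITS = {
    "loc_a": set("0123456789"),
    "loc_b": set("0123456789"),
    "loc_c": set("0123456789\u0661"),  # pretend one extra digit here
}


def f(locale_name, char):
    """Stand-in for the real check (setlocale + regex match)."""
    return char in FAKE_DIGITS[locale_name]


def differing_by_pairs(locales, chars):
    """Locale pairs outside, characters inside: roughly M*N**2 calls."""
    calls = 0
    found = set()
    for loc1, loc2 in itertools.combinations(locales, 2):
        for c in chars:
            calls += 2  # one call to f() per locale in the pair
            if f(loc1, c) != f(loc2, c):
                found.add(c)
    return found, calls


def differing_by_char(locales, chars):
    """Characters outside, locales inside: M*N calls."""
    calls = 0
    found = set()
    for c in chars:
        results = set()
        for loc in locales:
            calls += 1
            results.add(f(loc, c))
        if len(results) > 1:  # the locales disagree about this character
            found.add(c)
    return found, calls


locales = sorted(FAKE_DIGITS)
chars = [chr(i) for i in range(0x700)]  # small range to keep it quick
found_pairs, calls_pairs = differing_by_pairs(locales, chars)
found_char, calls_char = differing_by_char(locales, chars)
print(found_pairs == found_char, calls_pairs, calls_char)
```

 Both versions report the same characters, but per character the pairwise version calls f() N*(N-1) times against N times for the swapped version, so the gap widens as the number of locales grows. As far as I can tell, the Code Complete tip is about loop overhead when the total number of loop body executions stays the same, whereas here swapping the loops changes how often f() runs at all.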




> 
> Thanks!