"All your byte are belong to us! " (TM) (was Re: There's got to be an easy way to do this (fwd))

Fri Jul 6 02:38:32 EDT 2001

Lulu of the Lotus-Eaters <mertz at gnosis.cx> wrote in message news:<mailman.994369989.11026.python-list at python.org>...
> Still dying to win the performance prize, I believe Skip Montanaro's
> performance tip brings the general filter stategy back up to winning
> status (including James Logajan 'traditional' function)
> 
>   def str_join2(iters):
>       has = {0:1,1:1,2:1,3:1,4:1,5:1,6:1,7:1,8:1,9:1}.has_key
>       for i in iters:
>           "".join([x for x in '(123)/456-7890' if has(x)])
> 
>   def flt_lmbda3(iters):
>       isdigit = {0:1,1:1,2:1,3:1,4:1,5:1,6:1,7:1,8:1,9:1}.has_key
[snip]

Sigh. Looks like it's going to be a long hot [northern hemisphere]
summer.

D:\junk>type mertz.py
def str_join2(iters):
   has = {0:1,1:1,2:1,3:1,4:1,5:1,6:1,7:1,8:1,9:1}.has_key
   for i in iters:
      result = "".join([x for x in '(123)/456-7890' if has(x)])
   return result

def str_join2_fixed(iters):
   has = {'0':1,'1':1,'2':1,'3':1,'4':1,'5':1,'6':1,'7':1,'8':1,'9':1}.has_key
   for i in iters:
      result = "".join([x for x in '(123)/456-7890' if has(x)])
   return result

print repr(str_join2(range(1)))
print repr(str_join2_fixed(range(1)))

D:\junk>python mertz.py
''
'1234567890'

Unfortunately, the routine that doesn't do the "All your byte are
belong to us!"
caper is going to be a bit slower; that's what you have pay for
correctness.

Hey, what a wonderful title for a book on text processing :-)

In any case, the fastest routine available in standard Python would
have to be one based on the suggestion in an earlier post from Tim
Peters, namely to use
source_string.translate(identity_map_string, chars_to_delete).

Note that translate() is O(m+n) where n = len(source_string)
and m = len(chars_to_delete). The O(m) comes about from the fact that
chars_to_delete is unstructured and has to be made into a map on each
call -- if it wasn't you'd have O(m*n). You can get better, with a
precomputed map, but only by jumping into C.

Some more observations:
(1) The various solutions using re do poorly because re.sub is written
in Python, not C, as /F pointed out a week or so ago.
(2) You can knock 20% to 25% off the re times by using "[^0-9]+"
instead of "[^0-9]"
(3) Most contestants could squeeze out a little more performance by
moving attribute lookups outside the loop -- like this:

nulljoin = "".join
for x in y:
   result = nulljoin(etc etc)

(4) Getting away from standard-issue Python, mx.TextTools.setstrip()
looked at first glance like a goer but that turned out to be a snare
and a delusion -- as its doc makes plain, it "strips" in the same way
as standard Python strip(), lstrip() and rstrip(): only at the ends of
the string, not in the middle; IOW, "some your byte that you would
like to belong to us are not". Perhaps if MAL is reading this, he
might consider adding a setremove() to his next version.

Regards,
John