permuting letters and fairy tales

Thu Nov 11 17:37:26 EST 2004

Johannes Nix <Johannes.Nix <at> gmx.net> writes:
> Now I have two innocent questions:
> 
> - Is it possible to make it a bit more concise )) ?
> 
> - Can it coerced to run a little bit faster ?
>   (on my oldish, 300 MHz-AMD K6 , run time looks like this
>   for a famous, 2663-word-long fairy tale from the Grimm's brothers:

So, I don't have Numeric, but I believe I've translated your code to the
equivalent numarray code.  I've also written a version of it using only the
standard Python libraries:

------------------------------------------------------------
import sys
import locale
import string
import re
import numarray
import numarray.strings
import numarray.random_array
import random

def f1(file_in, file_out):
    wordsep = re.compile('([^%s])' % string.letters)
    for line in file_in.xreadlines():
        for word in wordsep.split(line):
            if word and word[0] in string.letters:
                word = string.lower(word)
                wlen = len(word)
                if wlen > 3:
                    wa = numarray.strings.array(list(word))
                    perm = numarray.random_array.permutation(wlen-2)
                    wa[1:wlen-1] = numarray.take(wa[1:wlen-1],perm)
                    word = wa.tostring()
            file_out.write('%s' % word)
    file_out.write('\n')

def f2(file_in, file_out):
    word_sep_matcher = re.compile(r'(\W+)')
    for line in file_in:
        for word in word_sep_matcher.split(line):
            if word and word.isalpha():
                word = word.lower()
                if len(word) > 3:
                    inner = list(word[1:-1])
                    random.shuffle(inner)
                    word = '%s%s%s' % (word[0], ''.join(inner), word[-1])
            file_out.write('%s' % word)
    file_out.write('\n')
------------------------------------------------------------

As far as conciceness goes, a couple things of note:
(1) I believe r'([^A-z])' does the same thing as your re
(2) xreadlines is deprecated in Python 2.3
(3) str.isalpha() is probably a good substitute for s[0] in string.letters.  (If
you want to be strict about the translation, you can do s[0].isalpha())
(4) string.lower(word) is deprecated in favor of word.lower()
(5) perhaps Numeric doesn't support this, but I believe you can replace
wa[1:wlen-1] with wa[1:-1]
(6) you can do something much like what your numarray code does by using
random.shuffle

As far as speed goes, here's the timings I got using your email (minus the code)
as input (your code is saved in 'temp.txt'):

>python -m timeit -n 10000 -s "import temp; file_in = file('temp.txt'); file_out
= file('out.txt', 'w')" "temp.f1(file_in, file_out)"
10000 loops, best of 3: 31.6 usec per loop

>python -m timeit -n 10000 -s "import temp; file_in = file('temp.txt'); file_out
= file('out.txt', 'w')" "temp.f2(file_in, file_out)"
10000 loops, best of 3: 8.02 usec per loop

Looks to me like the one without numarray is much quicker, but I can't be sure
that Numeric would behave in the same manner.

> - It's a good example how powerful libraries, like
>   Numeric, make one's life easier. (BTW, why is Numeric 
>   and stuff like take() still not included in the standard
>   Library ? Batteries included, but calculator not ?)

Well, at least in numarray, take is just the functional form of indexing by an
array:

>>> arr = na.arange(20)*2
>>> na.take(arr, na.array([3, 5, 13]))
array([ 6, 10, 26])
>>> arr[na.array([3, 5, 13])]
array([ 6, 10, 26])

While it's true that Python's builtin lists don't support this directly, list
comprehensions make this pretty easy to reproduce:

>>> lst = [2*x for x in range(20)]
>>> [lst[i] for i in [3, 5, 13]]
[6, 10, 26]

> - Perhaps it's useful to protect messages in some
>   regions with not-so-democratic forms of government
>   against automatic scanning by making the message 
>   machine-unreadable, causing some Orwellian Confusion  ?

Unfortunately, this 'encoding scheme' ;) doesn't work in all languages -- what's
the first 'letter' of a two-character word in, say, Mandarin?  ;)

Steve