[Tutor] About Perl's Integer module

Fri Dec 17 22:49:35 CET 2004

Mohamed Lrhazi wrote:

> Hello all,
> 
> I ported to python a little Perl script that applies some math algorithm 
> that I do not understand... My version seems to give the same results as 
> the Perl version... but just to make sure I am asking the following:
> 
> The Perl version starts by testing whether Perl is in integer mode or 
> not, and if not it exists! Is there an equivalent module in Python?

No, there isn't.  However, the only operation in Python that should 
convert an integer into a float is division, and you should be able to 
use // (a double slash) to indicate integer division.  But there's no 
division in your script, so it shouldn't matter...  (Integers *will* 
automatically be converted to longs if they get too large, but this 
should be harmless unless you specifically need ints to 'wrap around'.)

> Given the lack of "use integer" from my code... can anyone tell these 
> two programs are equivalent?

I don't know perl, so I can't tell for certain, but I think so. 
However, there are many ways in which this could become more idiomatic 
Python code, and more efficient.

> def complex(domain):
>     h=0
>     res=""
>     domain=domain.lower()
>     prefix=['x','x','/','x','x','/']
>     conv=[
>         'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
>         'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
>         'Y', 'Z',
>         'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l',
>         'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x',
>         'y', 'z',
>         '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
>         '-', '_'
>     ]

In Python, strings are sequences, so you can index and slice and 
iterate over them.  This means that you can replace these lists of 
single characters with one string, instead.  Not only that, but given 
that you're using all letters and numbers, you can just get them from 
predefined lists in the string module --

.>> import string
.>> string.ascii_uppercase
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
.>> string.ascii_lowercase
'abcdefghijklmnopqrstuvwxyz'
.>> string.digits
'0123456789'
.>>

With that in mind, the above line can be changed to:

     conv = string.uppercase + string.lowercase + \
                string.digits + '-_'

(There's also a string.letters which contains both uppercase and 
lowercase, but it has the lowercase first while you have the uppercase 
first.  This will yield different, but equally valid, characters.)

> 
>     for i in range(0,len(domain)):
>         h*=129
>         h+=ord(domain[i])
>         h+=987654321

Because you can iterate over strings, you can do this simpler by 
dropping the integer index and the range().  You can even chain the 
earlier call to lower() in-line here:

     for ch in domain.lower():
         h *= 129
         h += ord(ch)
         h += 987654321

>     if h == 0:
>         h=1

>     prefix[0] = conv[ h & 0x3f ]; h = h >> 6
>     prefix[1] = conv[ h & 0x3f ]; h = h >> 6
>     prefix[3] = conv[ h & 0x3f ]; h = h >> 6
>     prefix[4] = conv[ h & 0x3f ];

Here, instead of predefining prefix list and then replacing certain 
elements, I'd just create a blank list and append to it.  Given that 
your operation is identical each time (except when you need a /), I'd 
put it into a loop.

     prefix = []
     for i in range(6):
         if i in (2, 5):
             prefix.append('/')
         else:
             prefix.append(conv[ h & 0x3f ])
             h = h >> 6

>     
>     return "".join(prefix) + domain
>     
> print complex(sys.argv[1])

So, once more, all together:

.>> def complex(domain):
... 	h = 0
... 	conv = string.uppercase + string.lowercase + \
...                  string.digits + '-_'
... 	for ch in domain.lower():
... 		h *= 129
... 		h += ord(ch)
... 		h += 987654321
... 	if h == 0:
... 		h = 1
... 	prefix = []
... 	for i in range(6):
... 		if i in (2, 5):
... 			prefix.append('/')
... 		else:
... 			prefix.append(conv[ h & 0x3f ])
... 			h = h >> 6
... 	return "".join(prefix) + domain
...
.>>

So, let's see how this works:

.>> complex('site.company.com')
'X3/32/site.company.com'
.>> complex('site2.company.com')
'6U/zv/site2.company.com'
.>> complex('www.python.org')
'ZF/4R/www.python.org'
.>>

So, each character is generated by looking at the hash (h), grabbing 
the least-significant six bits, and using the resulting number (which 
will be 0-63) to look up a character in conv.  The hash is then 
shifted six bits to drop the 'used' bits before grabbing the next 
chunk.  Two of these generated characters are used for each of two 
directory names.  Any given sitename will consistently produce the 
same four characters.

Now, here's a 'batteries-included' function that does much the same 
thing (though it'll be different characters).

.>> def complex2(domain):
... 	import md5
... 	digest = md5.new(domain)
... 	digeststring = digest.hexdigest()
... 	names = (digeststring[:2], digeststring[2:4], domain)
... 	return '/'.join(names)
...
.>> complex2('site.company.com')
'b2/37/site.company.com'
.>> complex2('site2.company.com')
'75/5c/site2.company.com'
.>> complex2('www.python.org')
'16/95/www.python.org'
.>>

This uses the md5 module to generate a 'fingerprint' for each domain 
name.  It gets that fingerprint as a long hexadecimal number, and then 
slices off the first few characters to make the directory names.  Now, 
this method is going to be heavily weighted towards numbers instead of 
letters, and it can create a maximum of 16^4 (65,536) different 
directories instead of 64^4 (16,777,216), so you're somewhat more 
likely to have collisions (multiple sites in a single directory), but 
it's still not very likely.  (That chance can be reduced by using 
longer names, too.  Using three-character directory names gives 16^6 
possibilities -- which is equal to 64^4, the same as your perl script.)

Jeff Shannon
Technician/Programmer
Credit International