[Tutor] About Perl's Integer module
Jeff Shannon
jeff at ccvcorp.com
Fri Dec 17 22:49:35 CET 2004
Mohamed Lrhazi wrote:
> Hello all,
>
> I ported to python a little Perl script that applies some math algorithm
> that I do not understand... My version seems to give the same results as
> the Perl version... but just to make sure I am asking the following:
>
> The Perl version starts by testing whether Perl is in integer mode or
> not, and if not it exists! Is there an equivalent module in Python?
No, there isn't. However, the only operation in Python that should
convert an integer into a float is division, and you should be able to
use // (a double slash) to indicate integer division. But there's no
division in your script, so it shouldn't matter... (Integers *will*
automatically be converted to longs if they get too large, but this
should be harmless unless you specifically need ints to 'wrap around'.)
> Given the lack of "use integer" from my code... can anyone tell these
> two programs are equivalent?
I don't know perl, so I can't tell for certain, but I think so.
However, there are many ways in which this could become more idiomatic
Python code, and more efficient.
> def complex(domain):
> h=0
> res=""
> domain=domain.lower()
> prefix=['x','x','/','x','x','/']
> conv=[
> 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
> 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',
> 'Y', 'Z',
> 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l',
> 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x',
> 'y', 'z',
> '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
> '-', '_'
> ]
In Python, strings are sequences, so you can index and slice and
iterate over them. This means that you can replace these lists of
single characters with one string, instead. Not only that, but given
that you're using all letters and numbers, you can just get them from
predefined lists in the string module --
.>> import string
.>> string.ascii_uppercase
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
.>> string.ascii_lowercase
'abcdefghijklmnopqrstuvwxyz'
.>> string.digits
'0123456789'
.>>
With that in mind, the above line can be changed to:
conv = string.uppercase + string.lowercase + \
string.digits + '-_'
(There's also a string.letters which contains both uppercase and
lowercase, but it has the lowercase first while you have the uppercase
first. This will yield different, but equally valid, characters.)
>
> for i in range(0,len(domain)):
> h*=129
> h+=ord(domain[i])
> h+=987654321
Because you can iterate over strings, you can do this simpler by
dropping the integer index and the range(). You can even chain the
earlier call to lower() in-line here:
for ch in domain.lower():
h *= 129
h += ord(ch)
h += 987654321
> if h == 0:
> h=1
> prefix[0] = conv[ h & 0x3f ]; h = h >> 6
> prefix[1] = conv[ h & 0x3f ]; h = h >> 6
> prefix[3] = conv[ h & 0x3f ]; h = h >> 6
> prefix[4] = conv[ h & 0x3f ];
Here, instead of predefining prefix list and then replacing certain
elements, I'd just create a blank list and append to it. Given that
your operation is identical each time (except when you need a /), I'd
put it into a loop.
prefix = []
for i in range(6):
if i in (2, 5):
prefix.append('/')
else:
prefix.append(conv[ h & 0x3f ])
h = h >> 6
>
> return "".join(prefix) + domain
>
> print complex(sys.argv[1])
So, once more, all together:
.>> def complex(domain):
... h = 0
... conv = string.uppercase + string.lowercase + \
... string.digits + '-_'
... for ch in domain.lower():
... h *= 129
... h += ord(ch)
... h += 987654321
... if h == 0:
... h = 1
... prefix = []
... for i in range(6):
... if i in (2, 5):
... prefix.append('/')
... else:
... prefix.append(conv[ h & 0x3f ])
... h = h >> 6
... return "".join(prefix) + domain
...
.>>
So, let's see how this works:
.>> complex('site.company.com')
'X3/32/site.company.com'
.>> complex('site2.company.com')
'6U/zv/site2.company.com'
.>> complex('www.python.org')
'ZF/4R/www.python.org'
.>>
So, each character is generated by looking at the hash (h), grabbing
the least-significant six bits, and using the resulting number (which
will be 0-63) to look up a character in conv. The hash is then
shifted six bits to drop the 'used' bits before grabbing the next
chunk. Two of these generated characters are used for each of two
directory names. Any given sitename will consistently produce the
same four characters.
Now, here's a 'batteries-included' function that does much the same
thing (though it'll be different characters).
.>> def complex2(domain):
... import md5
... digest = md5.new(domain)
... digeststring = digest.hexdigest()
... names = (digeststring[:2], digeststring[2:4], domain)
... return '/'.join(names)
...
.>> complex2('site.company.com')
'b2/37/site.company.com'
.>> complex2('site2.company.com')
'75/5c/site2.company.com'
.>> complex2('www.python.org')
'16/95/www.python.org'
.>>
This uses the md5 module to generate a 'fingerprint' for each domain
name. It gets that fingerprint as a long hexadecimal number, and then
slices off the first few characters to make the directory names. Now,
this method is going to be heavily weighted towards numbers instead of
letters, and it can create a maximum of 16^4 (65,536) different
directories instead of 64^4 (16,777,216), so you're somewhat more
likely to have collisions (multiple sites in a single directory), but
it's still not very likely. (That chance can be reduced by using
longer names, too. Using three-character directory names gives 16^6
possibilities -- which is equal to 64^4, the same as your perl script.)
Jeff Shannon
Technician/Programmer
Credit International
More information about the Tutor
mailing list