[Tutor] [whitelist] Re: regular expressions question

nimrodx nimrodx at slingshot.co.nz
Thu Aug 17 13:04:59 CEST 2006


Hi Alan,

I found a fairly complicated way to do it (Alan's way is much more elegant).
In case someone searches the archive, maybe they will find something
useful in it.
It uses the regular expressions module.

import re

def dehexlify_websites(fle):
   # get binary data
   inpt = open(fle,'rb')
   dat = inpt.read()
   inpt.close()
   #strip out the NUL bytes (hex 00)
   pattern = r"\x00"
   res = re.sub(pattern, "", dat)
   #-----------------------------------------
   #it seemed easier to do it in two passes
   #create the pattern regular expression for the stuff we want to keep
   web = re.compile(
                    r"(?P<addr>[/a-zA-Z0-9\.\-:\_%\?&=]+)"
                    )
   #grab them all and put them in temp variable
   res = re.findall(web,res)
   tmp = ""
   #oops, need a newline at the end of each one to mark the end of
   #the web address,
   #and need it all as one string
   for i in res:
       tmp = tmp + i+'\n'
   #compile reg expr for everything between :/ and the newline
   web2 = re.compile(r":/(?P<address>[^\n]+)")
   #find the websites
   #make them into an object we can pass
   res2 = re.findall(web2,tmp)
   #return 'em
   return res2
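
For anyone reading this on Python 3, where reading a file opened with 'rb'
gives bytes rather than a str, here is a minimal sketch of the same two-pass
idea. It is not the code I ran above: the name dehexlify_bytes is made up for
illustration, the sample is copied from the start of the file quoted below,
and it assumes the addresses are plain ASCII.

```python
import re

def dehexlify_bytes(data):
    """Two-pass extraction on a bytes object (hypothetical helper)."""
    # Drop the NUL padding bytes.
    stripped = data.replace(b"\x00", b"")
    # Pass 1: keep runs of characters that can appear in an address.
    runs = re.findall(rb"[/a-zA-Z0-9.\-:_%?&=]+", stripped)
    # Pass 2: within each run, keep whatever follows the first ":/".
    found = []
    for run in runs:
        match = re.search(rb":/(.+)", run)
        if match:
            # Assumption: the kept text is ASCII.
            found.append(match.group(1).decode("ascii"))
    return found

# First section of the file, as quoted in the thread.
sample = (b"\x00\x00\x00\x02\xb8,\x08\x9f\x00\x00z\xa8\x00\x00\x01\xf4"
          b"\x00\x00\x01\xf4\x00\x00\x00t\x00f\x00i\x00l\x00e\x00:\x00/"
          b"\x00h\x00o\x00m\x00e\x00/\x00a\x00l")

print(dehexlify_bytes(sample))  # ['home/al']
```

Working on each run separately does the same job as joining them with
newlines, so the temporary string is not needed.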


Thanks Alan,

Matt


Alan Gauld wrote:
>> if you look carefully at the string below, you see
>> that in amongst the "\x" stuff you have the text I want:
>> z tfile://home/alpha
>
> OK, those characters are obviously string data and it looks
> like it's using 16 bit characters, so yes some kind of
> unicode string. In between and at the end lies the binary
> data in whatever format it is.
>
>>>> Here is the first section of the file:
>>>> '\x00\x00\x00\x02\xb8,\x08\x9f\x00\x00z\xa8\x00\x00\x01\xf4\x00\x00\x01\xf4\x00\x00\x00t\x00f\x00i\x00l\x00e\x00:\x00/\x00h\x00o\x00m\x00e\x00/\x00a\x00l' 
>>>>
>
>
>> In a hex editor it turns out to be readable and sensible url's with 
>> spaces between each digit, and a bit of crud at the end of url's, 
>> just as above.
>
> Here's a fairly drastic approach:
>
> >>> s = '\x00\x00\x00\x02\xb8,\x08\x9f\x00\x00z\xa8\x00\x00\x01\xf4\x00\x00\x01\xf4\x00\x00\x00t\x00f\x00i\x00l\x00e\x00:\x00/\x00h\x00o\x00m\x00e\x00/\x00a\x00l'
> >>> ''.join([c for c in s if c.isalnum() or c in '/: '])
> 'ztfile:/home/al'
> >>>
>
> But it gets close...
>
> Alan g.
>
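
Following up on Alan's point about 16 bit characters, here is a rough sketch
(Python 3) that looks for runs of a zero byte followed by a printable ASCII
byte and decodes each run as big-endian UTF-16. The assumptions are mine: the
byte order is guessed from the sample, only ASCII-range text is matched, and
the name utf16_ascii_runs and the minimum run length of 4 characters are
arbitrary choices for illustration.

```python
import re

# A run of at least 4 (zero byte, printable ASCII byte) pairs.
# Assumption: text is stored as big-endian UTF-16 in the ASCII range.
PAIR_RUN = re.compile(rb"(?:\x00[\x20-\x7e]){4,}")

def utf16_ascii_runs(data):
    """Return the decoded text of every matching run (hypothetical helper)."""
    return [run.decode("utf-16-be") for run in PAIR_RUN.findall(data)]

# First section of the file, as quoted in the thread.
sample = (b"\x00\x00\x00\x02\xb8,\x08\x9f\x00\x00z\xa8\x00\x00\x01\xf4"
          b"\x00\x00\x01\xf4\x00\x00\x00t\x00f\x00i\x00l\x00e\x00:\x00/"
          b"\x00h\x00o\x00m\x00e\x00/\x00a\x00l")

print(utf16_ascii_runs(sample))  # ['tfile:/home/al']
```

Because short runs are skipped, the stray 'z' from the binary data is left
out. I can't tell from the sample alone whether the leading 't' belongs to
the address.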


