[Tutor] [whitelist] Re: regular expressions question

Tue Aug 22 17:36:55 CEST 2006

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Nimrodx,

In case you haven't found a solution yet, I developed a program to
encode/decode stuff similar to this.

You may want to take a look at it at

 http://home.townisp.com/~arobert/python/file_encoder.py

nimrodx wrote:
> Hi Alan,
> 
> I found a pretty complicated way to do it (Alan's way is way more elegant).
> In case someone is searching the archive, maybe they will find something 
> in it that is useful.
> It uses the regular expressions module.
> 
> import re
> 
> def dehexlify_websites(fle):
>    # get binary data
>    inpt = open(fle,'rb')
>    dat = inpt.read()
>    inpt.close()
>    #strip out the NUL bytes (hex 00)
>    pattern = r"\x00"
>    res = re.sub(pattern, "", dat)
>    #-----------------------------------------
>    #it seemed easier to do it in two passes
>    #create the pattern regular expression for the stuff we want to keep
>    web = re.compile(
>                     r"(?P<addr>[/a-zA-Z0-9\.\-:\_%\?&=]+)"
>                     )
>    #grab them all and put them in temp variable
>    res = re.findall(web,res)
>    tmp = ""
>    #oops need some new lines at the end of each one to mark end of
>    #web address,
>    #and need it all as one string
>    for i in res:
>        tmp = tmp + i+'\n'
>    #compile reg expr for everything between ":/" and the newline
>    web2 = re.compile(r":/(?P<address>[^\n]+)")
>    #find the websites
>    #make them into an object we can pass
>    res2 = re.findall(web2,tmp)
>    #return 'em
>    return res2
> 
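For anyone running this under Python 3: a file opened with 'rb' yields
bytes there, so the patterns have to be bytes as well. Below is a minimal
sketch of the same two-pass idea, not Matt's exact code. The names
dehexlify_bytes and sample are my own, and the sample is just the fragment
quoted further down in this thread.

```python
import re

# Same character set as the quoted pattern, written as a bytes pattern.
ADDR_RE = re.compile(rb"[/a-zA-Z0-9.\-:_%?&=]+")
# Everything that follows the first ":/" inside a candidate run.
PATH_RE = re.compile(rb":/(.+)")

def dehexlify_bytes(data):
    """Return the text after ':/' for each address-like run in data."""
    stripped = data.replace(b"\x00", b"")  # drop the NUL padding bytes
    found = []
    for run in ADDR_RE.findall(stripped):
        match = PATH_RE.search(run)
        if match:
            found.append(match.group(1).decode("ascii"))
    return found

# Fragment of the file as posted in this thread.
sample = (b"\x00\x00\x00\x02\xb8,\x08\x9f\x00\x00z\xa8\x00\x00\x01\xf4"
          b"\x00\x00\x01\xf4\x00\x00\x00t\x00f\x00i\x00l\x00e\x00:\x00/"
          b"\x00h\x00o\x00m\x00e\x00/\x00a\x00l")
print(dehexlify_bytes(sample))
```

To use it on a real file, read the contents with open(fle, 'rb') and pass
the resulting bytes to the function.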
> 
> Thanks Alan,
> 
> Matt
> 
> 
> Alan Gauld wrote:
>>> if you look carefully at the string below, you see
>>> that in amongst the "\x" stuff you have the text I want:
>>> z tfile://home/alpha
>> OK, those characters are obviously string data and it looks
>> like it's using 16-bit characters, so yes, some kind of
>> unicode string. In between and at the end lies the binary
>> data in whatever format it is.
>>
>>>>> Here is the first section of the file:
>>>>> '\x00\x00\x00\x02\xb8,\x08\x9f\x00\x00z\xa8\x00\x00\x01\xf4\x00\x00\x01\xf4\x00\x00\x00t\x00f\x00i\x00l\x00e\x00:\x00/\x00h\x00o\x00m\x00e\x00/\x00a\x00l' 
>>>>>
>>
>>> In a hex editor it turns out to be readable and sensible URLs with 
>>> spaces between each character, and a bit of crud at the end of the URLs, 
>>> just as above.
>> Here's a fairly drastic approach:
>>
>>>>> s = '\x00\x00\x00\x02\xb8,\x08\x9f\x00\x00z\xa8\x00\x00\x01\xf4\x00\x00\x01\xf4\x00\x00\x00t\x00f\x00i\x00l\x00e\x00:\x00/\x00h\x00o\x00m\x00e\x00/\x00a\x00l'
>>>>> ''.join([c for c in s if c.isalnum() or c in '/: '])
>> 'ztfile:/home/al'
>> But it gets close...
>>
>> Alan g.
>>
> 
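Following Alan's remark that the text looks like 16-bit characters,
another option is to pull out those runs and decode them as UTF-16. This
is only a sketch under assumptions: big-endian byte order (the NUL comes
before each letter in the sample), plain printable ASCII text, and a
minimum run of four characters, which is an arbitrary cutoff of mine. The
name utf16_strings is hypothetical.

```python
import re

# Assumption: each character is stored as a NUL byte followed by one
# printable ASCII byte (UTF-16 big-endian); require at least 4 in a row.
UTF16_RUN = re.compile(rb"(?:\x00[\x20-\x7e]){4,}")

def utf16_strings(data):
    """Return every run of UTF-16-BE printable ASCII text found in data."""
    return [m.group().decode("utf-16-be") for m in UTF16_RUN.finditer(data)]

# Fragment of the file as posted in this thread.
sample = (b"\x00\x00\x00\x02\xb8,\x08\x9f\x00\x00z\xa8\x00\x00\x01\xf4"
          b"\x00\x00\x01\xf4\x00\x00\x00t\x00f\x00i\x00l\x00e\x00:\x00/"
          b"\x00h\x00o\x00m\x00e\x00/\x00a\x00l")
print(utf16_strings(sample))
```

Unlike stripping every NUL, this leaves the binary fields between the
strings alone, so stray bytes such as the lone "z" are not picked up.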
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (MingW32)
Comment: GnuPT 2.7.2

iD8DBQFE6ySXDvn/4H0LjDwRApntAJ0Wd0ecE/KFUSbbKQSRmrV72yyvfwCeOwAQ
Gjg5IK0WG0YT6keGlDw0q94=
=7QB2
-----END PGP SIGNATURE-----