[Tutor] [whitelist] Re: regular expressions question
Andrew Robert
andrew.arobert at gmail.com
Tue Aug 22 17:36:55 CEST 2006
Hi Nimrodx,
In case you haven't found a solution yet, I developed a program to
encode/decode stuff similar to this.
You may want to take a look at it at
http://home.townisp.com/~arobert/python/file_encoder.py
nimrodx wrote:
> Hi Alan,
>
> I found a pretty complicated way to do it (Alan's way is way more elegant).
> In case someone is searching the archive, maybe they will find something
> in it that is useful.
> It uses the regular expressions module.
>
> import re
>
> def dehexlify_websites(fle):
>     # get binary data
>     inpt = open(fle, 'rb')
>     dat = inpt.read()
>     inpt.close()
>     # strip out the NUL ("\x00") bytes
>     pattern = r"\x00"
>     res = re.sub(pattern, "", dat)
>     #-----------------------------------------
>     # it seemed easier to do it in two passes
>     # create the pattern regular expression for the stuff we want to keep
>     web = re.compile(
>         r"(?P<addr>[/a-zA-Z0-9\.\-:\_%\?&=]+)"
>     )
>     # grab them all and put them in temp variable
>     res = re.findall(web, res)
>     tmp = ""
>     # oops need some new lines at the end of each one to mark end of
>     # web address,
>     # and need it all as one string
>     for i in res:
>         tmp = tmp + i + '\n'
>     # compile reg expr for everything between ":/" and the newline
>     web2 = re.compile(r":/(?P<address>[^\n]+)")
>     # find the websites
>     # make them into an object we can pass
>     res2 = re.findall(web2, tmp)
>     # return 'em
>     return res2
>
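For anyone reading this in the archive with Python 3, where a file opened with 'rb' yields bytes rather than str, the same two-pass idea can be sketched roughly as follows. This is only an illustration, not the code from the thread: the name extract_addresses is hypothetical, the character class is copied from the quoted pattern, and the sample bytes are the excerpt posted in this thread.

```python
import re

# Same character class as the quoted pattern, written as a bytes regex.
URL_RUN = re.compile(rb"[/a-zA-Z0-9.\-:_%?&=]+")

def extract_addresses(data):
    """Return the text after the first ':/' in each URL-like run of data."""
    # Drop the NUL bytes so the 16-bit characters collapse to plain ASCII.
    stripped = data.replace(b"\x00", b"")
    results = []
    for run in URL_RUN.findall(stripped):
        # Keep only runs that contain ':/' followed by at least one byte.
        head, sep, tail = run.partition(b":/")
        if sep and tail:
            results.append(tail.decode("ascii"))
    return results

sample = (b'\x00\x00\x00\x02\xb8,\x08\x9f\x00\x00z\xa8\x00\x00\x01\xf4'
          b'\x00\x00\x01\xf4\x00\x00\x00t\x00f\x00i\x00l\x00e\x00:\x00/'
          b'\x00h\x00o\x00m\x00e\x00/\x00a\x00l')
print(extract_addresses(sample))  # ['home/al']
```

Using bytes.partition instead of joining everything with newlines and running a second regex avoids building the temporary string, but it should give the same result for input like the sample.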
>
> Thanks Alan,
>
> Matt
>
>
> Alan Gauld wrote:
>>> if you look carefully at the string below, you see
>>> that in amongst the "\x" stuff you have the text I want:
>>> z tfile://home/alpha
>> OK, those characters are obviously string data and it looks
>> like it's using 16 bit characters, so yes some kind of
>> unicode string. In between and at the end lies the binary
>> data in whatever format it is.
>>
>>>>> Here is the first section of the file:
>>>>> '\x00\x00\x00\x02\xb8,\x08\x9f\x00\x00z\xa8\x00\x00\x01\xf4\x00\x00\x01\xf4\x00\x00\x00t\x00f\x00i\x00l\x00e\x00:\x00/\x00h\x00o\x00m\x00e\x00/\x00a\x00l'
>>>>>
>>
>>> In a hex editor it turns out to be readable and sensible URLs with
>>> spaces between each digit, and a bit of crud at the end of the URLs,
>>> just as above.
>> Here's a fairly drastic approach:
>>
>> >>> s = '\x00\x00\x00\x02\xb8,\x08\x9f\x00\x00z\xa8\x00\x00\x01\xf4\x00\x00\x01\xf4\x00\x00\x00t\x00f\x00i\x00l\x00e\x00:\x00/\x00h\x00o\x00m\x00e\x00/\x00a\x00l'
>> >>> ''.join([c for c in s if c.isalnum() or c in '/: '])
>> 'ztfile:/home/al'
>> But it gets close...
>>
>> Alan g.
>>
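Since the text appears to be stored as 16 bit characters, another option is to look for runs of "NUL byte plus printable ASCII byte" pairs and decode each run as big-endian UTF-16. The sketch below assumes Python 3 and assumes the strings are ASCII text in UTF-16-BE; the name find_utf16_strings and the minimum run length of four characters are choices made for this example, not something from the thread.

```python
import re

# Runs of at least four "NUL + printable ASCII" byte pairs, i.e. ASCII text
# stored as big-endian 16-bit characters. The minimum of four is an
# arbitrary threshold chosen for this sketch.
UTF16_RUN = re.compile(rb"(?:\x00[\x20-\x7e]){4,}")

def find_utf16_strings(data):
    """Return each matching run decoded as UTF-16 big-endian text."""
    return [run.decode("utf-16-be") for run in UTF16_RUN.findall(data)]

sample = (b'\x00\x00\x00\x02\xb8,\x08\x9f\x00\x00z\xa8\x00\x00\x01\xf4'
          b'\x00\x00\x01\xf4\x00\x00\x00t\x00f\x00i\x00l\x00e\x00:\x00/'
          b'\x00h\x00o\x00m\x00e\x00/\x00a\x00l')
print(find_utf16_strings(sample))  # ['tfile:/home/al']
```

Unlike the character filter, this drops the stray 'z' because it is not part of a long enough run. It still keeps any neighbouring byte pair that happens to look like a character, so a run can pick up bytes that really belong to the surrounding binary data.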
>