[Tutor] [whitelist] Re: regular expressions question
nimrodx at slingshot.co.nz
Thu Aug 17 13:04:59 CEST 2006
I found a pretty complicated way to do it (Alan's way is way more elegant).
In case someone is searching the archive, maybe they will find something
in it that is useful.
It uses the regular expressions module (re).
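import re

# get binary data (fle is the path of the file to read)
inpt = open(fle,'rb')
dat = inpt.read()
inpt.close()
# strip out the NUL bytes (hex 00)
# (on Python 3, dat is bytes, so use rb"\x00" and b"" here)
pattern = r"\x00"
res = re.sub(pattern, "", dat)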
#it seemed easier to do it in two passes
#create the pattern regular expression for the stuff we want to keep
web = re.compile(
#grab them all and put them in temp variable
res = re.findall(web,res)
tmp = ""
#oops need a newline at the end of each one to mark where it ends
#and need it all as one string
for i in res:
    tmp = tmp + i+'\n'
#compile reg expr for everything after :/ up to the newline
#(as written, the captured address keeps the second slash)
web2 = re.compile(r":/(?P<address>[^\n]+)")
#find the websites
#make them into an object we can pass
res2 = re.findall(web2,tmp)
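
For anyone reading the archive with Python 3, the passes above can be sketched roughly as follows. This is only a sketch under assumptions: the function name `extract_addresses`, the sample bytes and the first-pass pattern are made up for illustration (the pattern used in the post is not shown above). Only the NUL stripping and the second pattern are taken from the code in this message.

```python
import re

def extract_addresses(data: bytes) -> list:
    """Two-pass extraction, assuming Python 3 and bytes input."""
    # Drop the NUL bytes that pad each 16-bit character.
    stripped = re.sub(rb"\x00", b"", data)
    # Pass 1: hypothetical pattern for the chunks worth keeping
    # (a lowercase scheme, "://", then path-like characters).
    web = re.compile(rb"[a-z]+://[\w./-]+")
    chunks = web.findall(stripped)
    # Join the chunks, one per line, so the end of each one is marked.
    tmp = b"\n".join(chunks) + b"\n"
    # Pass 2: same pattern as in the post, applied to bytes.
    web2 = re.compile(rb":/(?P<address>[^\n]+)")
    return [m.decode("ascii") for m in web2.findall(tmp)]

# Made-up sample: a URL stored as 16-bit characters, with crud around it.
sample = b"z\x00\x01\x02" + "file://home/alpha".encode("utf-16-le") + b"\x07\x00\xff"
print(extract_addresses(sample))  # ['/home/alpha']
```

Because the second pattern starts at ":/", the captured address begins with a slash. Changing it to "://" would drop that slash, if that is what is wanted.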
Alan Gauld wrote:
>> if you look carefully at the string below, you see
>> that in amongst the "\x" stuff you have the text I want:
>> z tfile://home/alpha
> OK, those characters are obviously string data and it looks
> like it's using 16 bit characters, so yes some kind of
> unicode string. In between and at the end lies the binary
> data in whatever format it is.
>>>> Here is the first section of the file:
>> In a hex editor it turns out to be readable and sensible url's with
>> spaces between each digit, and a bit of crud at the end of url's,
>> just as above.
> Here's a fairly drastic approach:
>>>> s =
>>>> ''.join([c for c in s if c.isalnum() or c in '/: '])
> But it gets close...
> Alan g.
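
Alan's filter can be tried on its own with a small made-up string. The name `keep_readable` and the sample data are assumptions for illustration, and the sketch assumes the data is already a Python 3 str rather than raw bytes.

```python
def keep_readable(s: str) -> str:
    """Keep only letters, digits and the characters '/', ':' and ' '."""
    return ''.join([c for c in s if c.isalnum() or c in '/: '])

# Made-up sample: a URL with NULs between the characters and trailing crud.
sample = "f\x00i\x00l\x00e\x00:\x00/\x00/\x00h\x00o\x00m\x00e\x00\x07\x01"
print(keep_readable(sample))  # file://home
```

Characters such as '.' and '-' are filtered out as well, so they would need adding to the kept set if the addresses contain them.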