Extracting patterns after matching a regex

Tue Sep 8 11:33:04 EDT 2009

Mart. wrote:
> On Sep 8, 3:53 pm, MRAB <pyt... at mrabarnett.plus.com> wrote:
>> Mart. wrote:
>>> On Sep 8, 3:14 pm, "Andreas Tawn" <andreas.t... at ubisoft.com> wrote:
>>>>>>> Hi,
>>>>>>> I need to extract a string after matching a regular expression. For
>>>>>>> example I have the string...
>>>>>>> s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
>>>>>>> and once I match "FTPHOST" I would like to extract
>>>>>>> "e4ftl01u.ecs.nasa.gov". I am not sure as to the best approach to the
>>>>>>> problem, I had been trying to match the string using something like
>>>>>>> this:
>>>>>>> m = re.findall(r"FTPHOST", s)
>>>>>>> But I couldn't then work out how to return the "e4ftl01u.ecs.nasa.gov"
>>>>>>> part. Perhaps I need to find the string and then split it? I had some
>>>>>>> help with a similar problem, but now I don't seem to be able to
>>>>>>> transfer that to this problem!
>>>>>>> Thanks in advance for the help,
>>>>>>> Martin
>>>>>> No need for regex.
>>>>>> s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
>>>>>> if "FTPHOST" in s:
>>>>>>     return s[9:]
>>>>>> Cheers,
>>>>>> Drea
>>>>> Sorry perhaps I didn't make it clear enough, so apologies. I only
>>>>> presented the example  s = "FTPHOST: e4ftl01u.ecs.nasa.gov" as I
>>>>> thought this easily encompassed the problem. The solution presented
>>>>> works fine for this i.e. re.search(r'FTPHOST: (.*)',s).group(1). But
>>>>> when I used this on the actual file I am trying to parse I realised it
>>>>> is slightly more complicated as this also pulls out other information,
>>>>> for example it prints
>>>>> e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
>>>>> 'Ftp Pull Download Links: \r\n',
>>>>> 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n',
>>>>> 'Down load ZIP file of packaged order:\r\n',
>>>>> etc. So I need to find a way to stop it before the \r.
>>>>> Slicing the string wouldn't work in this scenario as I can envisage a
>>>>> situation where the string length increases and I would prefer not to
>>>>> keep having to change the string.
>>>> If, as Terry suggested, you do have a tuple of strings and the first element has FTPHOST, then s[0].split(":")[1].strip() will work.
>>> It is an email which contains information before and after the main
>>> section I am interested in, namely...
>>> FINISHED: 09/07/2009 08:42:31
>>> MEDIATYPE: FtpPull
>>> MEDIAFORMAT: FILEFORMAT
>>> FTPHOST: e4ftl01u.ecs.nasa.gov
>>> FTPDIR: /PullDir/0301872638CySfQB
>>> Ftp Pull Download Links:
>>> ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB
>>> Down load ZIP file of packaged order:
>>> ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB.zip
>>> FTPEXPR: 09/12/2009 08:42:31
>>> MEDIA 1 of 1
>>> MEDIAID:
>>> I have been doing this to turn the email into a string
>>> email = sys.argv[1]
>>> f = open(email, 'r')
>>> s = str(f.readlines())
>> To me that seems a strange thing to do. You could just read the entire
>> file as a string:
>>
>>      f = open(email, 'r')
>>      s = f.read()
>>
>>> so FTPHOST isn't the first element, it is just part of a larger
>>> string. When I turn the email into a string it looks like...
>>> 'FINISHED: 09/07/2009 08:42:31\r\n', '\r\n', 'MEDIATYPE: FtpPull\r\n',
>>> 'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n',
>>> 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r\n',
>>> 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n',
>>> 'Down load ZIP file of packaged order:\r\n',
>>> So not sure splitting it like you suggested works in this case.
>>
> 
> Within the file is a list of files, e.g.
> 
> TOTAL FILES: 2
> 		FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf
> 		FILESIZE: 11028908
> 
> 		FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml
> 		FILESIZE: 18975
> 
> and what I want to do is get the ftp address from the file and collect
> these files to pull down from the web e.g.
> 
> MOD13A2.A2007033.h17v08.005.2007101023605.hdf
> MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml
> 
> Thus far I have
> 
> #!/usr/bin/env python
> 
> import sys
> import re
> import urllib
> 
> email = sys.argv[1]
> f = open(email, 'r')
> s = str(f.readlines())
> m = re.findall(r"MOD....\.........\.h..v..\.005\..............\....\....", s)
> 
> ftphost = re.search(r'FTPHOST: (.*?)\\r',s).group(1)
> ftpdir  = re.search(r'FTPDIR: (.*?)\\r',s).group(1)
> url = 'ftp://' + ftphost + ftpdir
> 
> for i in xrange(len(m)):
> 
> 	print i, ':', len(m)
> 	file1 = m[i][:-4]		# remove xml bit.
> 	file2 = m[i]
> 
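> 	# fetch each file by its own URL (the directory URL alone is not a file)
> 	urllib.urlretrieve(url + '/' + file1, file1)
> 	urllib.urlretrieve(url + '/' + file2, file2)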
> 
> which works. Clearly my match for the MOD13A2* files isn't ideal, I
> guess, but they will always occupy those dimensions, so it should
> work. Any suggestions on how to improve this are appreciated.
> 
Suppose the file contains your example text above. Using 'readlines'
returns a list of the lines:

 >>> f = open(email, 'r')
 >>> lines = f.readlines()
 >>> lines
['TOTAL FILES: 2\n',
 '\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf\n',
 '\t\tFILESIZE: 11028908\n',
 '\n',
 '\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\n',
 '\t\tFILESIZE: 18975\n']

Using 'str' on that list then converts it to a string _representation_
of that list:

 >>> str(lines)
"['TOTAL FILES: 2\\n',
 '\\t\\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf\\n',
 '\\t\\tFILESIZE: 11028908\\n',
 '\\n',
 '\\t\\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\\n',
 '\\t\\tFILESIZE: 18975\\n']"

That just makes parsing a lot more difficult.
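
To make the difference concrete, here is a small sketch (written in
Python 3 syntax; the 'lines' list is an abridged copy of the output
above) comparing the string representation of the list with the actual
text of the file. The variable names are only for illustration:

```python
lines = [
    'TOTAL FILES: 2\n',
    '\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf\n',
    '\t\tFILESIZE: 11028908\n',
]

as_repr = str(lines)      # text of the list literal: brackets, quotes, escapes
as_text = ''.join(lines)  # the original file contents, one real string

# The representation contains no real newline characters, only the two
# characters backslash and 'n', which is why a pattern then has to look
# for a literal backslash (as in r'\\r') instead of a line ending.
real_newline_in_repr = '\n' in as_repr
escaped_newline_in_repr = '\\n' in as_repr
real_newline_in_text = '\n' in as_text

print(real_newline_in_repr, escaped_newline_in_repr, real_newline_in_text)
```

With the joined text, ordinary line-based patterns work again; with the
representation, every pattern has to account for the extra quotes,
commas and escaped backslashes.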

It's much easier to just read the entire file as a single string and
then parse that:

 >>> f = open(email, 'r')
 >>> s = f.read()
 >>> s
'TOTAL FILES: 2\n\t\tFILENAME: 
MOD13A2.A2007033.h17v08.005.2007101023605.hdf\n\t\tFILESIZE: 
11028908\n\n\t\tFILENAME: 
MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\n\t\tFILESIZE: 18975\n'
 >>> import re
 >>> re.findall(r"FILENAME: (.+)", s)
['MOD13A2.A2007033.h17v08.005.2007101023605.hdf', 
'MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml']
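
The same approach should also cover the FTPHOST and FTPDIR lines from
earlier in the thread. The following is only a sketch, in Python 3
syntax, under a few assumptions: the 'sample' text is an abridged copy
of the email with the \r\n line endings shown earlier, 'field' is a
hypothetical helper of my own, and I am assuming the listed files live
directly under the FTPDIR directory. Note that '.' does not match '\n'
but does match '\r', so the captured value is stripped afterwards:

```python
import re

# Abridged sample of the email body, with the \r\n line endings seen earlier.
sample = (
    "FTPHOST: e4ftl01u.ecs.nasa.gov\r\n"
    "FTPDIR: /PullDir/0301872638CySfQB\r\n"
    "TOTAL FILES: 2\r\n"
    "\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf\r\n"
    "\t\tFILESIZE: 11028908\r\n"
    "\r\n"
    "\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\r\n"
    "\t\tFILESIZE: 18975\r\n"
)


def field(name, text):
    """Return the value after 'NAME:' on its own line (hypothetical helper)."""
    pattern = r"^[ \t]*" + re.escape(name) + r":[ \t]*(.*)$"
    match = re.search(pattern, text, re.MULTILINE)
    if match is None:
        raise ValueError("field not found: " + name)
    # '.' matches '\r', so the capture may end in '\r'; strip() removes it.
    return match.group(1).strip()


host = field("FTPHOST", sample)
ftpdir = field("FTPDIR", sample)

# One entry per FILENAME line, with any trailing '\r' stripped.
filenames = [name.strip() for name in re.findall(r"FILENAME:[ \t]*(.+)", sample)]

# Assumption: each file sits directly inside the FTPDIR directory.
urls = ["ftp://" + host + ftpdir + "/" + name for name in filenames]

for url in urls:
    print(url)
```

Each URL could then be passed, together with its file name, to
urllib.urlretrieve (urllib.request.urlretrieve in Python 3). Because the
file names come straight from the FILENAME lines, the fixed-width
MOD13A2 pattern is no longer needed.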