Extracting patterns after matching a regex

Wed Sep 9 07:55:22 EDT 2009

Mart. wrote:
> On Sep 8, 4:33 pm, MRAB <pyt... at mrabarnett.plus.com> wrote:
>> Mart. wrote:
>>> On Sep 8, 3:53 pm, MRAB <pyt... at mrabarnett.plus.com> wrote:
>>>> Mart. wrote:
>>>>> On Sep 8, 3:14 pm, "Andreas Tawn" <andreas.t... at ubisoft.com> wrote:
>>>>>>>>> Hi,
>>>>>>>>> I need to extract a string after matching a regular expression. For
>>>>>>>>> example I have the string...
>>>>>>>>> s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
>>>>>>>>> and once I match "FTPHOST" I would like to extract
>>>>>>>>> "e4ftl01u.ecs.nasa.gov". I am not sure as to the best approach to the
>>>>>>>>> problem, I had been trying to match the string using something like
>>>>>>>>> this:
>>>>>>>>> m = re.findall(r"FTPHOST", s)
>>>>>>>>> But I couldn't then work out how to return the "e4ftl01u.ecs.nasa.gov"
>>>>>>>>> part. Perhaps I need to find the string and then split it? I had some
>>>>>>>>> help with a similar problem, but now I don't seem to be able to
>>>>>>>>> transfer that to this problem!
>>>>>>>>> Thanks in advance for the help,
>>>>>>>>> Martin
>>>>>>>> No need for regex.
>>>>>>>> s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
>>>>>>>> if "FTPHOST" in s:
>>>>>>>>     return s[9:]
>>>>>>>> Cheers,
>>>>>>>> Drea
>>>>>>> Sorry perhaps I didn't make it clear enough, so apologies. I only
>>>>>>> presented the example  s = "FTPHOST: e4ftl01u.ecs.nasa.gov" as I
>>>>>>> thought this easily encompassed the problem. The solution presented
>>>>>>> works fine for this i.e. re.search(r'FTPHOST: (.*)',s).group(1). But
>>>>>>> when I used this on the actual file I am trying to parse I realised it
>>>>>>> is slightly more complicated as this also pulls out other information,
>>>>>>> for example it prints
>>>>>>> e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
>>>>>>> 'Ftp Pull Download Links: \r\n',
>>>>>>> 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n',
>>>>>>> 'Down load ZIP file of packaged order:\r\n',
>>>>>>> etc. So I need to find a way to stop it before the \r.
>>>>>>> Slicing the string wouldn't work in this scenario, as I can envisage a
>>>>>>> situation where the string length increases, and I would prefer not to
>>>>>>> keep having to change the string.
>>>>>> If, as Terry suggested, you do have a tuple of strings and the first element has FTPHOST, then s[0].split(":")[1].strip() will work.
>>>>> It is an email which contains information before and after the main
>>>>> section I am interested in, namely...
>>>>> FINISHED: 09/07/2009 08:42:31
>>>>> MEDIATYPE: FtpPull
>>>>> MEDIAFORMAT: FILEFORMAT
>>>>> FTPHOST: e4ftl01u.ecs.nasa.gov
>>>>> FTPDIR: /PullDir/0301872638CySfQB
>>>>> Ftp Pull Download Links:
>>>>> ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB
>>>>> Down load ZIP file of packaged order:
>>>>> ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB.zip
>>>>> FTPEXPR: 09/12/2009 08:42:31
>>>>> MEDIA 1 of 1
>>>>> MEDIAID:
>>>>> I have been doing this to turn the email into a string
>>>>> email = sys.argv[1]
>>>>> f = open(email, 'r')
>>>>> s = str(f.readlines())
>>>> To me that seems a strange thing to do. You could just read the entire
>>>> file as a string:
>>>>      f = open(email, 'r')
>>>>      s = f.read()
>>>>> so FTPHOST isn't the first element, it is just part of a larger
>>>>> string. When I turn the email into a string it looks like...
>>>>> 'FINISHED: 09/07/2009 08:42:31\r\n', '\r\n', 'MEDIATYPE: FtpPull\r\n',
>>>>> 'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n',
>>>>> 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r\n',
>>>>> 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n',
>>>>> 'Down load ZIP file of packaged order:\r\n',
>>>>> So not sure splitting it like you suggested works in this case.
>>> Within the file is a list of files, e.g.
>>> TOTAL FILES: 2
>>>            FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf
>>>            FILESIZE: 11028908
>>>            FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml
>>>            FILESIZE: 18975
>>> and what I want to do is get the ftp address from the file and collect
>>> these files to pull down from the web e.g.
>>> MOD13A2.A2007033.h17v08.005.2007101023605.hdf
>>> MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml
>>> Thus far I have
>>> #!/usr/bin/env python
>>> import sys
>>> import re
>>> import urllib
>>> email = sys.argv[1]
>>> f = open(email, 'r')
>>> s = str(f.readlines())
>>> m = re.findall(r"MOD....\.........\.h..v..\.005\..............\....\....", s)
>>> ftphost = re.search(r'FTPHOST: (.*?)\\r',s).group(1)
>>> ftpdir  = re.search(r'FTPDIR: (.*?)\\r',s).group(1)
>>> url = 'ftp://' + ftphost + ftpdir
>>> for i in xrange(len(m)):
>>>    print i, ':', len(m)
>>>    file1 = m[i][:-4]               # remove xml bit.
>>>    file2 = m[i]
>>>    urllib.urlretrieve(url, file1)
>>>    urllib.urlretrieve(url, file2)
>>> which works. Clearly my match for the MOD13A2* files isn't ideal, I
>>> guess, but they will always occupy those dimensions, so it should
>>> work. Any suggestions on how to improve this are appreciated.
>> Suppose the file contains your example text above. Using 'readlines'
>> returns a list of the lines:
>>
>>  >>> f = open(email, 'r')
>>  >>> lines = f.readlines()
>>  >>> lines
>> ['TOTAL FILES: 2\n', '\t\tFILENAME:
>> MOD13A2.A2007033.h17v08.005.2007101023605.hdf\n', '\t\tFILESIZE:
>> 11028908\n', '\n', '\t\tFILENAME:
>> MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\n', '\t\tFILESIZE:
>> 18975\n']
>>
>> Using 'str' on that list then converts it to a string _representation_
>> of that list:
>>
>>  >>> str(lines)
>> "['TOTAL FILES: 2\\n', '\\t\\tFILENAME:
>> MOD13A2.A2007033.h17v08.005.2007101023605.hdf\\n', '\\t\\tFILESIZE:
>> 11028908\\n', '\\n', '\\t\\tFILENAME:
>> MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\\n', '\\t\\tFILESIZE:
>> 18975\\n']"
>>
>> That just makes parsing a lot more difficult.
>>
>> It's much easier to just read the entire file as a single string and
>> then parse that:
>>
>>  >>> f = open(email, 'r')
>>  >>> s = f.read()
>>  >>> s
>> 'TOTAL FILES: 2\n\t\tFILENAME:
>> MOD13A2.A2007033.h17v08.005.2007101023605.hdf\n\t\tFILESIZE:
>> 11028908\n\n\t\tFILENAME:
>> MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\n\t\tFILESIZE: 18975\n'
>>  >>> import re
>>  >>> re.findall(r"FILENAME: (.+)", s)
>> ['MOD13A2.A2007033.h17v08.005.2007101023605.hdf',
>> 'MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml']
> 
> If I do it this way I can't seem to avoid extracting the \r at the end
> of the line.
> 
> In [26]: m = re.search(r"FTPHOST: (.+)", s)
> 
> In [27]: m.group(1)
> Out[27]: 'e4ftl01u.ecs.nasa.gov\r'
> 
> But if I insert \\r at the end, as was previously suggested, the match fails:
> 
> In [28]: m = re.search(r"FTPHOST: (.+)\\r", s)
> 
> In [29]: m.group(1)
> 
> AttributeError: 'NoneType' object has no attribute 'group'
> 
> Any thoughts?
> 
Try opening the file with universal line-ending mode:

     f = open(email, 'rU')
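
For anyone reading this later: the r"...\\r" pattern only matched the
str(f.readlines()) version because that string contains a literal backslash
followed by "r". After f.read() the string contains a real carriage return,
so the pattern finds nothing. Also, on Python 3 text mode already translates
line endings by default, and the 'U' flag was removed in Python 3.11, so plain
open(email) should behave the same way there. Below is a minimal sketch of the
whole extraction working on a string. It normalises the line endings by hand
so that it runs without a file; the function name extract_order_fields and the
sample text are made up for illustration, not taken from the real email.

```python
import re


def extract_order_fields(text):
    """Return (ftphost, ftpdir, filenames) found in the email text."""
    # Normalise "\r\n" and bare "\r" to "\n". Reading the file in
    # universal-newline mode does the same job for you.
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # "." never matches "\n", so "(.+)" stops at the end of the line.
    host = re.search(r"FTPHOST: (.+)", text)
    ftpdir = re.search(r"FTPDIR: (.+)", text)
    filenames = re.findall(r"FILENAME: (.+)", text)

    return (
        host.group(1).strip() if host else None,
        ftpdir.group(1).strip() if ftpdir else None,
        [name.strip() for name in filenames],
    )


# Hypothetical sample with Windows line endings, modelled on the email above.
sample = (
    "FTPHOST: e4ftl01u.ecs.nasa.gov\r\n"
    "FTPDIR: /PullDir/0301872638CySfQB\r\n"
    "TOTAL FILES: 2\r\n"
    "\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf\r\n"
    "\t\tFILESIZE: 11028908\r\n"
    "\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\r\n"
    "\t\tFILESIZE: 18975\r\n"
)

host, ftpdir, filenames = extract_order_fields(sample)
print(host)
print(ftpdir)
print(filenames)
```

If you would rather not touch the line endings at all, a pattern such as
r"FTPHOST: ([^\r\n]+)" should also stop before the \r.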