Extracting patterns after matching a regex

Wed Sep 9 11:58:43 EDT 2009

Mart. wrote:
> On Sep 8, 4:33 pm, MRAB <pyt... at mrabarnett.plus.com> wrote:
>> Mart. wrote:
>>> On Sep 8, 3:53 pm, MRAB <pyt... at mrabarnett.plus.com> wrote:
>>>> Mart. wrote:
>>>>> On Sep 8, 3:14 pm, "Andreas Tawn" <andreas.t... at ubisoft.com> wrote:
>>>>>>>>> Hi,
>>>>>>>>> I need to extract a string after matching a regular expression. For
>>>>>>>>> example I have the string...
>>>>>>>>> s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
>>>>>>>>> and once I match "FTPHOST" I would like to extract
>>>>>>>>> "e4ftl01u.ecs.nasa.gov". I am not sure as to the best approach to the
>>>>>>>>> problem, I had been trying to match the string using something like
>>>>>>>>> this:
>>>>>>>>> m = re.findall(r"FTPHOST", s)
>>>>>>>>> But I couldn't then work out how to return the "e4ftl01u.ecs.nasa.gov"
>>>>>>>>> part. Perhaps I need to find the string and then split it? I had some
>>>>>>>>> help with a similar problem, but now I don't seem to be able to
>>>>>>>>> transfer that to this problem!
>>>>>>>>> Thanks in advance for the help,
>>>>>>>>> Martin
>>>>>>>> No need for regex.
>>>>>>>> s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
>>>>>>>> if "FTPHOST" in s:
>>>>>>>>     return s[9:]
>>>>>>>> Cheers,
>>>>>>>> Drea
>>>>>>> Sorry perhaps I didn't make it clear enough, so apologies. I only
>>>>>>> presented the example  s = "FTPHOST: e4ftl01u.ecs.nasa.gov" as I
>>>>>>> thought this easily encompassed the problem. The solution presented
>>>>>>> works fine for this i.e. re.search(r'FTPHOST: (.*)',s).group(1). But
>>>>>>> when I used this on the actual file I am trying to parse I realised it
>>>>>>> is slightly more complicated as this also pulls out other information,
>>>>>>> for example it prints
>>>>>>> e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
>>>>>>> 'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/
>>>>>>> 0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n',
>>>>>>> etc. So I need to find a way to stop it before the \r.
>>>>>>> Slicing the string wouldn't work in this scenario as I can envisage a
>>>>>>> situation where the string length increases and I would prefer not to
>>>>>>> keep having to change the string.
>>>>>> If, as Terry suggested, you do have a tuple of strings and the first element has FTPHOST, then s[0].split(":")[1].strip() will work.
>>>>> It is an email which contains information before and after the main
>>>>> section I am interested in, namely...
>>>>> FINISHED: 09/07/2009 08:42:31
>>>>> MEDIATYPE: FtpPull
>>>>> MEDIAFORMAT: FILEFORMAT
>>>>> FTPHOST: e4ftl01u.ecs.nasa.gov
>>>>> FTPDIR: /PullDir/0301872638CySfQB
>>>>> Ftp Pull Download Links:
>>>>> ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB
>>>>> Down load ZIP file of packaged order:
>>>>> ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB.zip
>>>>> FTPEXPR: 09/12/2009 08:42:31
>>>>> MEDIA 1 of 1
>>>>> MEDIAID:
>>>>> I have been doing this to turn the email into a string
>>>>> email = sys.argv[1]
>>>>> f = open(email, 'r')
>>>>> s = str(f.readlines())
>>>> To me that seems a strange thing to do. You could just read the entire
>>>> file as a string:
>>>>      f = open(email, 'r')
>>>>      s = f.read()
>>>>> so FTPHOST isn't the first element, it is just part of a larger
>>>>> string. When I turn the email into a string it looks like...
>>>>> 'FINISHED: 09/07/2009 08:42:31\r\n', '\r\n', 'MEDIATYPE: FtpPull\r\n',
>>>>> 'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n',
>>>>> 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r
>>>>> \n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n', 'Down
>>>>> load ZIP file of packaged order:\r\n',
>>>>> So not sure splitting it like you suggested works in this case.
>>> Within the file is a list of files, e.g.
>>> TOTAL FILES: 2
>>>            FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf
>>>            FILESIZE: 11028908
>>>            FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml
>>>            FILESIZE: 18975
>>> and what I want to do is get the ftp address from the file and collect
>>> these files to pull down from the web e.g.
>>> MOD13A2.A2007033.h17v08.005.2007101023605.hdf
>>> MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml
>>> Thus far I have
>>> #!/usr/bin/env python
>>> import sys
>>> import re
>>> import urllib
>>> email = sys.argv[1]
>>> f = open(email, 'r')
>>> s = str(f.readlines())
>>> m = re.findall(r"MOD....\.........\.h..v..\.005\..............\....\....", s)
>>> ftphost = re.search(r'FTPHOST: (.*?)\\r',s).group(1)
>>> ftpdir  = re.search(r'FTPDIR: (.*?)\\r',s).group(1)
>>> url = 'ftp://' + ftphost + ftpdir
>>> for i in xrange(len(m)):
>>>    print i, ':', len(m)
>>>    file1 = m[i][:-4]               # remove xml bit.
>>>    file2 = m[i]
>>>    urllib.urlretrieve(url, file1)
>>>    urllib.urlretrieve(url, file2)
>>> which works. Clearly my match for the MOD13A2* files isn't ideal, I
>>> guess, but they will always occupy those dimensions, so it should
>>> work. Any suggestions on how to improve this are appreciated.
>> Suppose the file contains your example text above. Using 'readlines'
>> returns a list of the lines:
>>
>>  >>> f = open(email, 'r')
>>  >>> lines = f.readlines()
>>  >>> lines
>> ['TOTAL FILES: 2\n', '\t\tFILENAME:
>> MOD13A2.A2007033.h17v08.005.2007101023605.hdf\n', '\t\tFILESIZE:
>> 11028908\n', '\n', '\t\tFILENAME:
>> MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\n', '\t\tFILESIZE:
>> 18975\n']
>>
>> Using 'str' on that list then converts it to a string _representation_
>> of that list:
>>
>>  >>> str(lines)
>> "['TOTAL FILES: 2\\n', '\\t\\tFILENAME:
>> MOD13A2.A2007033.h17v08.005.2007101023605.hdf\\n', '\\t\\tFILESIZE:
>> 11028908\\n', '\\n', '\\t\\tFILENAME:
>> MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\\n', '\\t\\tFILESIZE:
>> 18975\\n']"
>>
>> That just makes parsing a lot more difficult.
>>
>> It's much easier to just read the entire file as a single string and
>> then parse that:
>>
>>  >>> f = open(email, 'r')
>>  >>> s = f.read()
>>  >>> s
>> 'TOTAL FILES: 2\n\t\tFILENAME:
>> MOD13A2.A2007033.h17v08.005.2007101023605.hdf\n\t\tFILESIZE:
>> 11028908\n\n\t\tFILENAME:
>> MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\n\t\tFILESIZE: 18975\n'
>>  >>> import re
>>  >>> re.findall(r"FILENAME: (.+)", s)
>> ['MOD13A2.A2007033.h17v08.005.2007101023605.hdf',
>> 'MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml']
> 
> If I do it this way I can't seem to avoid extracting the \r at the end of
> the line.
> 
> In [26]: m = re.search(r"FTPHOST: (.+)", s)
> 
> In [27]: m.group(1)
> Out[27]: 'e4ftl01u.ecs.nasa.gov\r'
> 
> But if I insert \\r at the end, as was previously suggested, the match fails:
> 
> In [28]: m = re.search(r"FTPHOST: (.+)\\r", s)
> 
> In [29]: m.group(1)
> 
> AttributeError: 'NoneType' object has no attribute 'group'
> 
> Any thoughts?
> 
> Thanks

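Just use \r at the end, not \\r. In a raw-string pattern, \r matches the
carriage return character, which ends the line, whereas \\r matches two
characters: the character backslash "\", followed by the character "r".
That only worked earlier because str(f.readlines()) gave you a string
containing a literal backslash followed by "r".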