Extracting patterns after matching a regex
python at mrabarnett.plus.com
Tue Sep 8 16:53:37 CEST 2009
> On Sep 8, 3:14 pm, "Andreas Tawn" <andreas.t... at ubisoft.com> wrote:
>>>>> I need to extract a string after matching a regular expression. For
>>>>> example I have the string...
>>>>> s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
>>>>> and once I match "FTPHOST" I would like to extract
>>>>> "e4ftl01u.ecs.nasa.gov". I am not sure as to the best approach to the
>>>>> problem, I had been trying to match the string using something like
>>>>> m = re.findall(r"FTPHOST", s)
>>>>> But I couldn't then work out how to return the "e4ftl01u.ecs.nasa.gov"
>>>>> part. Perhaps I need to find the string and then split it? I had some
>>>>> help with a similar problem, but now I don't seem to be able to
>>>>> transfer that to this problem!
>>>>> Thanks in advance for the help,
>>>> No need for regex.
>>>> s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
>>>> if "FTPHOST" in s:
>>>>     print(s[9:])
>>> Sorry perhaps I didn't make it clear enough, so apologies. I only
>>> presented the example s = "FTPHOST: e4ftl01u.ecs.nasa.gov" as I
>>> thought this easily encompassed the problem. The solution presented
>>> works fine for this i.e. re.search(r'FTPHOST: (.*)',s).group(1). But
>>> when I used this on the actual file I am trying to parse I realised it
>>> is slightly more complicated as this also pulls out other information,
>>> for example it prints
>>> e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
>>> 'Ftp Pull Download Links: \r\n',
>>> 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n',
>>> 'Down load ZIP file of packaged order:\r\n',
>>> etc. So I need to find a way to stop it before the \r.
>>> Slicing the string wouldn't work in this scenario, as I can envisage a
>>> situation where the string length increases and I would prefer not to
>>> keep having to change the string.
>> If, as Terry suggested, you do have a tuple of strings and the first element, s, contains FTPHOST, then s.split(":", 1)[1].strip() will work.
> It is an email which contains information before and after the main
> section I am interested in, namely...
> FINISHED: 09/07/2009 08:42:31
> MEDIATYPE: FtpPull
> MEDIAFORMAT: FILEFORMAT
> FTPHOST: e4ftl01u.ecs.nasa.gov
> FTPDIR: /PullDir/0301872638CySfQB
> Ftp Pull Download Links:
> Down load ZIP file of packaged order:
> FTPEXPR: 09/12/2009 08:42:31
> MEDIA 1 of 1
> I have been doing this to turn the email into a string
> email = sys.argv[1]
> f = open(email, 'r')
> s = str(f.readlines())
To me that seems a strange thing to do. You could just read the entire
file as a string:
f = open(email, 'r')
s = f.read()
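With the whole file in one string, the earlier regex only needs to be told where to stop. Here is a minimal sketch, assuming the text looks like the excerpt quoted above; the sample string stands in for the result of f.read(), and the name host is just for illustration:

```python
import re

# Sample text standing in for the result of f.read(); the \r\n line
# endings mirror the ones shown in the quoted output.
s = (
    "MEDIAFORMAT: FILEFORMAT\r\n"
    "FTPHOST: e4ftl01u.ecs.nasa.gov\r\n"
    "FTPDIR: /PullDir/0301872638CySfQB\r\n"
)

# [^\r\n]+ stops at the first carriage return or newline, so only the
# rest of the FTPHOST line is captured. [ \t]* skips the blanks after
# the colon without running on to the next line.
m = re.search(r"FTPHOST:[ \t]*([^\r\n]+)", s)
host = m.group(1) if m else None
print(host)
```

Checking m before calling group() avoids an AttributeError when the file has no FTPHOST line.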
> so FTPHOST isn't the first element, it is just part of a larger
> string. When I turn the email into a string it looks like...
> 'FINISHED: 09/07/2009 08:42:31\r\n', '\r\n', 'MEDIATYPE: FtpPull\r\n',
> 'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n',
> 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r\n',
> 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n',
> 'Down load ZIP file of packaged order:\r\n',
> So not sure splitting it like you suggested works in this case.
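Splitting can still work if it is done one line at a time rather than on the whole string. The following is only a sketch under a couple of assumptions: parse_fields is a made-up helper name, and the rule that field names are all upper case is a guess based on the excerpt above (it keeps the "ftp://..." link line from being treated as a field).

```python
def parse_fields(text):
    """Collect 'KEY: value' lines from text into a dict."""
    fields = {}
    # splitlines() copes with \r\n, \n and \r line endings alike.
    for line in text.splitlines():
        # partition() splits at the first colon only, so values that
        # contain colons (times, for instance) stay intact.
        key, sep, value = line.partition(":")
        key = key.strip()
        value = value.strip()
        # Assumption: real field names are upper case, e.g. FTPHOST.
        if sep and value and key.isupper():
            fields[key] = value
    return fields


sample = (
    "FINISHED: 09/07/2009 08:42:31\r\n"
    "\r\n"
    "MEDIATYPE: FtpPull\r\n"
    "FTPHOST: e4ftl01u.ecs.nasa.gov\r\n"
    "FTPDIR: /PullDir/0301872638CySfQB\r\n"
    "Ftp Pull Download Links: \r\n"
    "ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n"
)

fields = parse_fields(sample)
print(fields["FTPHOST"])
```

This gives every header in one pass, so FTPDIR and the others can be looked up the same way without writing a separate regex for each.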