[Tutor] Extract strings from a text file

Fri Feb 27 08:22:32 CET 2009

On Thu, 26 Feb 2009 21:53:43 -0800,
Mohamed Hassan <linuxlover00 at gmail.com> wrote:

> Hi all,
> 
> I am new to Python and still trying to figure out some things. Here is the
> situation:
> 
> There is a text file that looks like this:
> 
> text text text <ID>Joseph</text text text>
> text text text text text text text text text text text
> text text text text text text text text text text text
> text text text text text text text text text text text
> text text text text text text text text text text text
> text text text text text text text text text text text
> text text text text text text text text text text text
> text text text text text text text text text text text
> text text text <Full name> Joseph Smith</text text text>
> text text text <Rights> 1</text text text>
> text text text <LDAP> 0</text text text>
> 
> 
> This text file is very long, but all the entries in it look the same as
> the above.
> 
> What I am trying to do is:
> 
> 1. I need to extract the name and the full name from this text file. For
> example: ( ID is Joseph & Full name is Joseph Smith).
> 
> 
> - I am thinking I need to write something that will check the whole text
> file line by line which I have done already.
> - Now what I am trying to figure out is: how can I write a function that
> will check to see if the line contains the word ID between < > then copy the
> letters after > until < and dump them to a text file.
> 
> Can somebody help please. I know this might sound easy for some people, but
> again I am new to Python and still figuring out things.
> 
> Thank you

This is a typical text parsing job, and there are tools for that. To point you to the most appropriate one, though, we would need a bit more information about the real structure of the text and, above all, about what you want to do with the data afterwards. I guess there is a higher-level structure that nests ID, name, rights etc. in a section, and that you will need to keep them together for further processing.
Anyway, for a first exploration you can use regular expressions (regexes) to extract individual data items. For instance:

import re

# (.+) captures the text between <ID> and the next <...> tag.
pattern = re.compile(r""".*<ID>(.+)<.+>.*""")

# A single line gives a single match.
line = "text text text <ID>Joseph</text text text>"
print(pattern.findall(line))

# '.' does not match a newline, so each line is matched separately.
text = """\
text text text <ID>Joseph</text text text>
text text text <ID>Jodia</text text text>
text text text <ID>Joobawap</text text text>
"""
print(pattern.findall(text))

==>
['Joseph']
['Joseph', 'Jodia', 'Joobawap']

The Python documentation includes a good tutorial on regexes (the "Regular Expression HOWTO"). Key points on this example are:

	r""".*<ID>(.+)<.+>.*"""
* the pattern between """...""" expresses the overall format to be matched
* everything between (...) will be extracted by findall
* '.' means 'any character' (except a newline, by default); '*' means zero or more of what comes just before; '+' means one or more of what comes just before.
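
To make these points concrete, here is a small sketch comparing a pattern with and without parentheses, and showing that '.' stops at a newline. The sample strings with "Ann" and "Bob" are invented for illustration only.

```python
import re

line = "text text text <ID>Joseph</text text text>"

# Without parentheses, findall returns the whole matched text.
whole = re.findall(r"<ID>.+<.+>", line)

# With parentheses, findall returns only the captured part.
captured = re.findall(r"<ID>(.+)<.+>", line)

# '.' stops at a newline, so a two-line string yields one match per line.
two_lines = "a <ID>Ann</x>\nb <ID>Bob</y>\n"
per_line = re.findall(r".*<ID>(.+)<.+>.*", two_lines)

print(whole)
print(captured)
print(per_line)
```

Note that '+' and '*' are greedy: they take as many characters as they can while still letting the rest of the pattern match. That is harmless here because each sample line contains only one closing <...> tag after <ID>.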

So the pattern will look for strings that contain a sequence formed of:

1. possible start chars
2. <ID> literally
3. one or more chars -- the part that findall returns
4. something between <...>
5. possible end chars
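
Applied to your original question, the same idea might look like the sketch below. It assumes that every <ID> line is followed, somewhere before the next <ID>, by the matching <Full name> line, as in your sample; the names ID_PATTERN, NAME_PATTERN and extract_records are made up for this example.

```python
import re

# Hypothetical patterns for the two fields shown in the sample data.
ID_PATTERN = re.compile(r"<ID>(.+)<.+>")
NAME_PATTERN = re.compile(r"<Full name>(.+)<.+>")

def extract_records(lines):
    """Return a list of (id, full_name) pairs found in the given lines."""
    records = []
    current_id = None
    for line in lines:
        id_match = ID_PATTERN.search(line)
        if id_match:
            # Remember the ID until its full name shows up.
            current_id = id_match.group(1).strip()
            continue
        name_match = NAME_PATTERN.search(line)
        if name_match and current_id is not None:
            records.append((current_id, name_match.group(1).strip()))
            current_id = None
    return records

sample = """\
text text text <ID>Joseph</text text text>
text text text text text
text text text <Full name> Joseph Smith</text text text>
text text text <Rights> 1</text text text>
""".splitlines()

print(extract_records(sample))
```

Since a file object yields its lines one by one, you should be able to pass an open file to extract_records directly instead of the sample list, and then write each pair to your output file. If the real file does not always pair an ID with a full name, the logic will need adjusting.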

Denis
------
la vita e estrany