[Tutor] using re to match text and extract info
Dave Angel
davea at ieee.org
Thu Dec 31 19:19:19 CET 2009
Norman Khine wrote:
> hello,
>
>
>>>> import re
>>>> line = "ALSACE 67000 Strasbourg 24 rue de la Division Leclerc 03 88 23 05 66 strasbourg@artisansdumonde.org"
>>>> m = re.search('[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}', line)
>>>> emailAddress .search(r"(\d+)", line)
>>>> phoneNumber = re.compile(r'(\d{2}) (\d{2}) (\d{2}) (\d{2}) (\d{2})')
>>>> phoneNumber.search(line)
>>>>
>
> but this jumbles the phone number and also includes the 67000.
>
> how can i split the 'line' into a list?
>
> thanks
> norman
>
>
lst = line.split() will split the line strictly by whitespace.
Before you can write code to parse a line, you have to know for sure the
syntax of that line. This particular one has 15 fields, delimited by
spaces. So you can parse it with str.split(), and use a slice to get the
particular set of numbers representing the phone number (elements 9
through 13, i.e. lst[9:14]).
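
A minimal sketch of the split-and-slice approach, assuming the email
address is a single token containing "@" (that is what makes the count
come out to 15 fields). The names phone_words, phone and email are only
illustrative:

```python
# Assumption: the address is one whitespace-free token containing "@".
line = ("ALSACE 67000 Strasbourg 24 rue de la Division Leclerc "
        "03 88 23 05 66 strasbourg@artisansdumonde.org")

lst = line.split()             # split on runs of whitespace
phone_words = lst[9:14]        # elements 9 through 13: the five digit pairs
phone = " ".join(phone_words)  # glue them back into one string
email = lst[14]                # the last of the 15 fields

print(len(lst))
print(phone)
print(email)
```

Note that the 67000 and the 24 never get near the phone number here,
because the slice is chosen by position and not by what the fields look
like.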
If the address portion might be a variable number of words, then you
could still use split and slice, but use negative slice parameters to
get the phone number relative to the end (elements -6 through -2, i.e.
lst[-6:-1]).
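
Counting from the end could look something like this. It is a sketch
that assumes every line ends with exactly five phone "words" followed by
one email token; the function name phone_and_email and the second,
longer sample line are made up for illustration:

```python
def phone_and_email(line):
    """Return (phone, email), assuming the line ends with five phone
    groups followed by a single email token."""
    lst = line.split()
    phone = " ".join(lst[-6:-1])   # elements -6 through -2
    email = lst[-1]                # last element
    return phone, email

original = ("ALSACE 67000 Strasbourg 24 rue de la Division Leclerc "
            "03 88 23 05 66 strasbourg@artisansdumonde.org")
# Hypothetical record with a longer address portion (invented data).
longer = ("ALSACE 67000 Strasbourg 24 bis rue de la Grande Division "
          "Leclerc 03 88 23 05 66 someone@example.org")

print(phone_and_email(original))
print(phone_and_email(longer))
```

Both calls give the same phone number, since the length of the address
no longer matters.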
If the email address might have a space within it, then you have to get
fancier.
If the phone number might have more or fewer than 5 "words", you have to
get fancier.
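
One possible way of getting "fancier", purely as a sketch: walk
backwards from the email token and collect every two-digit word. This
assumes (and it is only an assumption, not a spec) that the email is the
last token, that each phone group is exactly two digits, and that the
address never ends in a two-digit word. The name split_record is
hypothetical:

```python
def split_record(line):
    """Split a line into (address_words, phone_words, email).

    Assumptions: email is the last token; the phone number is the run
    of two-digit tokens immediately before it.
    """
    tokens = line.split()
    email = tokens[-1]
    i = len(tokens) - 1            # index of the email token
    # Move left while the preceding token looks like a phone group.
    while i > 0 and len(tokens[i - 1]) == 2 and tokens[i - 1].isdigit():
        i -= 1
    return tokens[:i], tokens[i:-1], email

line = ("ALSACE 67000 Strasbourg 24 rue de la Division Leclerc "
        "03 88 23 05 66 strasbourg@artisansdumonde.org")
address_words, phone_words, email = split_record(line)
print(" ".join(phone_words))
```

If an address happened to end with a house number like 24, this would
swallow it into the phone number, which is exactly why the spec has to
come first.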
Without a spec, all the regular expressions in the world are just noise.
DaveA