[Tutor] using re to match text and extract info

Thu Dec 31 19:19:19 CET 2009

Norman Khine wrote:
> hello,
>
>   
>>>> import re
>>>> line = "ALSACE 67000 Strasbourg 24 rue de la Division Leclerc 03 88 23 05 66 strasbourg@artisansdumonde.org"
>>>> m = re.search('[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}', line)
>>>> emailAddress .search(r"(\d+)", line)
>>>> phoneNumber = re.compile(r'(\d{2}) (\d{2}) (\d{2}) (\d{2}) (\d{2})')
>>>> phoneNumber.search(line)
>>>>         
>
> but this jumbles the phone number and also includes the 67000.
>
> how can i split the 'line' into a list?
>
> thanks
> norman
>
>   
lst = line.split()    will split the line strictly by whitespace.

Before you can write code to parse a line, you have to know for sure the 
syntax of that line.  This particular one has 15 fields, delimited by 
spaces.  So you can parse it with str.split(), and use a slice to get the 
particular set of numbers representing the phone number.  (lst[9:14], 
that is, elements 9 through 13)
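
A minimal sketch of that split-and-slice approach, assuming the e-mail 
address is a single token containing "@" (so the line really does have 
15 whitespace-delimited fields):

```python
# Sample record; the e-mail address is assumed to be one token with "@".
line = ("ALSACE 67000 Strasbourg 24 rue de la Division Leclerc "
        "03 88 23 05 66 strasbourg@artisansdumonde.org")

lst = line.split()            # split on any run of whitespace
phone_parts = lst[9:14]       # slice end is exclusive: elements 9..13
phone = " ".join(phone_parts)

print(len(lst))   # 15
print(phone)      # 03 88 23 05 66
```

Note that the 67000 postal code never enters the result, because the 
slice picks fields by position rather than by looking for digits.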

If the address portion might be a variable number of words, then you 
could still use split and slice, but use negative slice parameters to 
get the phone number relative to the end.  (lst[-6:-1], that is, 
elements -6 through -2)
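
Counting from the end might look like the sketch below.  It assumes the 
last field is always the e-mail address and the five fields before it 
are always the phone number; phone_from_end and the shorter sample line 
are made up for illustration.  Because a slice excludes its end index, 
elements -6 through -2 are written as [-6:-1].

```python
def phone_from_end(line):
    """Return the phone number, counting fields from the end of the line.

    Assumes: last field is the e-mail address, and the five fields
    before it are the two-digit groups of the phone number.
    """
    fields = line.split()
    return " ".join(fields[-6:-1])   # elements -6 through -2

# The record from the original post (e-mail assumed to be one token).
long_line = ("ALSACE 67000 Strasbourg 24 rue de la Division Leclerc "
             "03 88 23 05 66 strasbourg@artisansdumonde.org")

# Hypothetical variant with a shorter street address.
short_line = ("ALSACE 67000 Strasbourg 24 rue Leclerc "
              "03 88 23 05 66 strasbourg@artisansdumonde.org")

print(phone_from_end(long_line))    # 03 88 23 05 66
print(phone_from_end(short_line))   # 03 88 23 05 66
```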

If the email address might have a space within it, then you have to get 
fancier.

If the phone number might have more or less than 5 "words", you have to 
get fancier.
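
One possible way of getting fancier, only as a sketch: write down an 
assumed spec first, then encode it in a regular expression anchored to 
the end of the line.  The spec here is a guess, not something the data 
guarantees: the record ends with two or more two-digit groups separated 
by single spaces, followed by an e-mail address with no spaces in it.  
TAIL and parse_tail are hypothetical names.

```python
import re

# Assumed spec: "... <dd dd [dd ...]> <user@host>" at the end of the line.
TAIL = re.compile(r"\b((?:\d{2} )+\d{2}) (\S+@\S+)$")

def parse_tail(line):
    """Return (phone, email) taken from the end of the line, or None."""
    m = TAIL.search(line)
    if m is None:
        return None
    return m.group(1), m.group(2)

line = ("ALSACE 67000 Strasbourg 24 rue de la Division Leclerc "
        "03 88 23 05 66 strasbourg@artisansdumonde.org")
print(parse_tail(line))
# ('03 88 23 05 66', 'strasbourg@artisansdumonde.org')

# Made-up record with only four groups in the phone number.
other = "PARIS 75001 Paris 1 rue Exemple 01 23 45 67 info@example.org"
print(parse_tail(other))
# ('01 23 45 67', 'info@example.org')
```

Even this breaks if a two-digit number in the address sits directly in 
front of the phone number, since it would be swallowed as one more 
group, which is exactly why the spec has to come first.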

Without a spec, all the regular expressions in the world are just noise.

DaveA