local greediness ???
John Machin
sjmachin at lexicon.net
Wed Apr 19 03:11:57 EDT 2006
On 19/04/2006 3:09 PM, tygerc at gmail.com wrote:
> hi, all. I need to process a file with the following format:
> $ cat sample
> [(some text)2.3(more text)4.5(more text here)]
> [(aa bb ccc)-1.2(kdk)12.0(xxxyyy)]
> [(xxx)11.0(bbb\))8.9(end here)]
> .......
>
> my goal here is for each line, extract every '(.*)' (including the round
> brackets), put them in a list, and extract every float on the same line
> and put them in a list. here is my code:
>
> import re
>
> p = re.compile(r'\[.*\]$')
> num = re.compile(r'[-\d]+[.\d]*')
> brac = re.compile(r'\(.*?\)')
>
> for line in ifp:
>     if p.match(line):
>         x = num.findall(line)
>         y = brac.findall(line)
>         print x, y, len(x), len(y)
>
> Now, this works for most of the lines. However, I'm having problems with
> lines such as line 3 above (in the sample file). Here, (bbb\)) contains an
> escaped ')' and the re I use will match it (because of the non-greedy '?').
> But I want this to be ignored since it's escaped. Is there such a thing as
> local greediness? Can anyone suggest a way to deal with this here?
> thanks.
>
For a start, your brac pattern is better rewritten to avoid the
non-greedy ? qualifier: r'\([^)]*\)' -- this says the middle part is
zero or more occurrences of a single character that is not a ')'.

To handle the pesky backslash-as-escape, we need to extend that to: zero
or more occurrences of either (a) the two-character string r"\)" or
(b) a single character that is not a ')'. The escaped form has to be
tried first, otherwise the lone backslash is consumed by the character
class. This gives us something like this:
#>>> brac = re.compile(r'\((?:\\\)|[^)])*\)')
#>>> tests = r"(xxx)123.4(bbb\))5.6(end\Zhere)7.8()9.0(\))1.2(ab\)cd)"
#>>> brac.findall(tests)
['(xxx)', '(bbb\\))', '(end\\Zhere)', '()', '(\\))', '(ab\\)cd)']
#>>>
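
For anyone who wants to run this as a script rather than at the prompt,
here is a minimal sketch. It spells the same pattern out with re.VERBOSE
so that each piece can carry a comment, and compares it with the simpler
r'\([^)]*\)' on line 3 of the sample file. The names BRAC_SIMPLE and
BRAC_ESCAPED are illustrative only and do not come from the original post.

```python
import re

# Simple form: '(' then any run of characters that are not ')', then ')'.
BRAC_SIMPLE = re.compile(r'\([^)]*\)')

# Escape-aware form: the same pattern as in the session above,
# written with re.VERBOSE so the parts can be commented.
BRAC_ESCAPED = re.compile(r"""
    \(            # opening bracket
    (?:
        \\\)      # either a backslash followed by a closing bracket
      | [^)]      # or any single character that is not a closing bracket
    )*
    \)            # closing bracket
""", re.VERBOSE)

line = r'[(xxx)11.0(bbb\))8.9(end here)]'   # line 3 of the sample file

simple = BRAC_SIMPLE.findall(line)
escaped = BRAC_ESCAPED.findall(line)

print(simple)    # ['(xxx)', '(bbb\\)', '(end here)']   stops at the escaped ')'
print(escaped)   # ['(xxx)', '(bbb\\))', '(end here)']  keeps the escaped ')'
```

Note that this pattern treats only the sequence \) specially. Whether a
doubled backslash before ')' should count as an escaped backslash is not
stated in the original question, so that case is left untouched here.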
Pretty, isn't it? Maybe better done with a hand-coded state machine.
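
A hand-coded state machine along those lines might look like the sketch
below. It is one possible reading, not a definitive implementation: the
function name scan is made up for illustration, it assumes a backslash
inside brackets escapes whatever single character follows it (slightly
broader than the regex above, which only special-cases \)), and it assumes
that everything between two bracketed groups is a single float.

```python
def scan(line):
    """Split one line into bracketed groups and the floats between them.

    Hypothetical helper: assumes a backslash inside a group escapes the
    next character, and that text between groups parses as one float.
    """
    groups, numbers = [], []
    buf = []            # characters of the piece currently being collected
    inside = False      # True while between '(' and its closing ')'
    escaped = False     # True right after a backslash inside a group

    def flush_number():
        # Drop the surrounding square brackets and whitespace, then convert.
        text = ''.join(buf).strip('[] \n')
        if text:
            numbers.append(float(text))
        del buf[:]

    for ch in line:
        if inside:
            buf.append(ch)
            if escaped:
                escaped = False          # escaped character, taken literally
            elif ch == '\\':
                escaped = True
            elif ch == ')':
                groups.append(''.join(buf))
                del buf[:]
                inside = False
        elif ch == '(':
            flush_number()               # whatever came before is a number
            buf.append(ch)
            inside = True
        else:
            buf.append(ch)

    if inside:
        raise ValueError('unterminated bracketed group')
    flush_number()
    return groups, numbers


groups, numbers = scan(r'[(xxx)11.0(bbb\))8.9(end here)]')
print(groups)    # ['(xxx)', '(bbb\\))', '(end here)']
print(numbers)   # [11.0, 8.9]
```

A side effect of tracking state this way is that digits appearing inside
a bracketed group are never mistaken for one of the floats, which the
separate num pattern in the quoted code cannot guarantee.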