Reading plain text file database tables
Darrell
news at dorb.com
Thu Sep 9 21:56:55 EDT 1999
Assuming your data can have the seperator embeded, you have a more
complicated problem.
I saved this note from Tim Petters.
Re: Finding nested expressions in regexs
> I need to extract nested expressions contained within paired symbols e.g
> "{ .. }" or "( .. )"
>
> input => matched
> 123{abc}4567 => {abc}
> 123{abc}456}7 => {abc}
> hij{123{abc}4568}def => {123{abc}4568}
> hij{123{a\{bc}45\}68}def => {123{a\{bc}45\}68}
>
> Note:
> - "\{" and "\}" should be completely ignored
> - Expressions may be nested as above but also repeating:
>
> ab{cd{ef}gh}ijkl{mn{o}}qrstu => {cd{ef}gh}, {mn{o}}
>
> Is there an easy way to do this with regular expressions?
Sorry, not even a hard way. "Regular expressions can't count" -- nested
brackets are beyond their abilities. Easier to see if you simplify it *way*
down, and just try to come up with a regexp that matches these (& only
these) strings:
()
(())
((()))
(((())))
...
"(" * n + ")" * n
> The last point makes it difficult to use a straight-forward
> non-greedy match which would find:
> {cd{ef}
> in the last example.
Actually, it was impossible long before the last point.
I'll attach a program that solves the problem without regexps. As is often
the case, it's much wordier than a solution using hairy regexps. OTOH, it
gives much better error msgs, and also works <wink>.
Note that for your second example it raises an error:
ValueError:
123{abc}456}7
^
Error in findtop: no matching open bracket for }
If you want to try to make some sort of sense out of cases like that, think
twice <0.9 wink>.
no-regexp-no-problem-ly y'rs - tim
def findtoperror(msg, s, i):
emsg = "\n" + s
if s[-1:] != "\n":
emsg = emsg + "\n"
emsg = emsg + " " * i + "^\n"
emsg = emsg + "Error in findtop: " + msg + "\n"
raise ValueError(emsg)
# Return list of (start, stop) indices for top-level bracketed expressions.
def findtop(s, openers="([{", closers=")]}"):
import string
if len(openers) != len(closers):
raise ValueError("openers and closers must be same length")
startstack, answer = [], []
i, n = 0, len(s)
while i < n:
ch = s[i]
if ch == "\\":
i = i + 1
if i >= n:
findtoperror("lone backslash at end", s, i-1)
elif ch in openers:
startstack.append(i)
elif ch in closers:
if len(startstack) == 0:
findtoperror("no matching open bracket for " + ch, s, i)
starti = startstack.pop()
opener = s[starti]
expected_closer = closers[string.find(openers, opener)]
if expected_closer != ch:
findtoperror("expected " + expected_closer, s, i)
if len(startstack) == 0:
answer.append( (starti, i+1) )
i = i + 1
if startstack:
msg = ["unclosed"]
for i in startstack:
msg.append("%s at %d," % (s[i], i))
findtoperror(string.join(msg)[:-1], s, n-1)
return answer
# like findtop, except return list of substrings instead of index pairs
def findtopstring(s, openers="([{", closers=")]}"):
answer = []
for i, j in findtop(s, openers, closers):
answer.append(s[i:j])
return answer
More information about the Python-list
mailing list