[Tutor] strings & splitting
Liam Clarke
ml.cyresse at gmail.com
Wed Jan 25 14:18:46 CET 2006
Hi all,
I have a large string that I'm attempting to manipulate, and calling
large_string.split(" ") on it is a very convenient way to tokenise it.
The exception is the double-quoted strings within my string, which
contain spaces and so get broken apart.
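For example, with a made-up line like this, where I'd want the quoted
part to stay together as one token:

line = 'alpha "bravo charlie" delta'
print(line.split(" "))
# ['alpha', '"bravo', 'charlie"', 'delta']  -- the quoted part is split up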
At the moment I'm splitting by \n and looping line by line, splitting
each line by spaces and then reuniting the double-quoted strings: I
iterate over the split line looking for mismatched quotation marks,
store the indexes of each matching pair, and then:
for (l, r) in pairs:
    sub_string = q[l:r + 1]   # up to and including r
    rejoined_string = " ".join(sub_string)
    indices = range(l, r + 1)
    indices.reverse()
    for i in indices:
        q.pop(i)
    q.insert(l, rejoined_string)
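In essence, each pass of that loop collapses one (l, r) span of q back
into a single token. The same step can also be written as one slice
assignment per pair; here's a rough sketch with made-up q and pairs in
the shape described above:

# Made-up example data: q is the space-split line, pairs holds the
# (left, right) indexes of the tokens carrying the matching quote marks.
q = ['alpha', '"bravo', 'charlie"', 'delta', '"echo', 'foxtrot"']
pairs = [(1, 2), (4, 5)]

# Going right to left keeps the earlier indexes valid as q shrinks.
for (l, r) in reversed(pairs):
    q[l:r + 1] = [" ".join(q[l:r + 1])]

print(q)
# ['alpha', '"bravo charlie"', 'delta', '"echo foxtrot"']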
I'm doing this one split line at a time, extending each result into a
big flat list, because I found that Python doesn't cope well with the
above when it's an 800,000-item list; I think it was mainly the insert.
My question is, is there a more Pythonic solution to this?
I was thinking of using a regex to pluck the qualifying quoted,
space-containing substrings out and then trying to remember where they
went based on context, but that sounds error-prone. So I thought of
doing the same thing with a unique placeholder token of my own that I
can find once the list is created and then substitute the original
string back in, but I wonder whether calling index() repeatedly would
be any faster.
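Something along these lines is what I mean by the regex direction (just
a rough sketch, assuming the quotes are always balanced and never
nested or escaped):

import re

# Treat a double-quoted run (spaces and all) as one token, and
# anything else as a run of non-whitespace characters.
token_pattern = re.compile(r'"[^"]*"|\S+')

line = 'alpha "bravo charlie" delta'
print(token_pattern.findall(line))
# ['alpha', '"bravo charlie"', 'delta']

(The standard library's shlex.split handles quoted substrings in a
similar way, though it strips the quote characters out.)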
I've got it down to 3 seconds now, but I'm trying to get a stable
solution, and if possible an elegant one. The current one is prone to
breaking on funny whitespace, and it's just ugly and prickly looking.
Regards,
Liam Clarke