[Tutor] strings & splitting

Wed Jan 25 14:38:03 CET 2006

Liam Clarke wrote:
> Hi all,
> 
> I have a large string that I'm trying to manipulate. I find it very
> convenient to tokenise it by calling large_string.split(" ").
> 
> The exception is the double-quoted strings within my string, which
> contain spaces.
> 
> At the moment I'm doing a split by \n, and then looping line by line,
> splitting by spaces and then reuniting double quoted strings by
> iterating over the split line, looking for mismatched quotation marks,
> storing the indexes of each matching pair and then:

I'm pretty sure you can do this with the csv module. You can configure 
the field delimiter to be a space and I think it will do what you want 
line-by-line, then you just have to join the lines together.

Pyparsing also has built-in support for quoted strings, so you could use 
that instead.
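
Here is a minimal sketch of the csv idea, assuming Python 3 and that the quoted strings use plain double quotes. The function name tokenise and the sample text are made up for illustration; they are not from the original data.

```python
import csv
import io


def tokenise(text):
    """Split text on single spaces, keeping double-quoted runs together."""
    tokens = []
    # csv.reader accepts any iterable of lines; StringIO yields them one by one.
    reader = csv.reader(io.StringIO(text), delimiter=" ", quotechar='"')
    for row in reader:
        # Each row is the list of fields for one line; flatten as we go.
        tokens.extend(row)
    return tokens


sample = 'alpha "beta gamma" delta\nepsilon "zeta eta"\n'
print(tokenise(sample))
```

Note that the reader strips the surrounding quotes from each quoted field, and that two spaces in a row yield an empty field, just as split(" ") would.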
> 
> for (l, r) in pairs:
>     sub_string = q[l:r+1]  # Up to and including r.
>     rejoined_string = " ".join(sub_string)
>     indices = list(range(l, r+1))  # list() so reverse() works on Python 3 too
>     indices.reverse()
>     for i in indices: q.pop(i)
>     q.insert(l, rejoined_string)
> 
> I'm doing it split line by split line, extending the resulting line
> into a big flat list, as I found out that Python doesn't cope overly
> well with stuff like the above when it's an 800,000-item list. I think
> it was mainly the insert.

The pops and the inserts will get expensive as the list grows. You could 
do them all in one operation with something like this:
q[l:r+1] = [rejoined_string]

That should speed it up a bit. A list is implemented as an array of 
references. Each time you pop or insert, the elements of the list after 
the point of change have to be copied to add or remove a space in the 
array. Doing it in one operation reduces this to a single copy per 
iteration, instead of two or more.
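
To make the slice assignment concrete, here is a small sketch, assuming Python 3. The helper name rejoin_quoted and the sample tokens are hypothetical. It also assumes the (l, r) pairs were computed against the unmodified list, which is why it works from the right-most pair backwards: each replacement shortens the list, and going right to left keeps the remaining indices valid.

```python
def rejoin_quoted(tokens, pairs):
    """Collapse each inclusive (l, r) index pair in tokens into one string.

    The list is modified in place and also returned for convenience.
    """
    # Right-most pair first, so earlier indices are not shifted by the
    # shrinking list (assumes pairs refer to the original, unmodified list).
    for l, r in sorted(pairs, reverse=True):
        # One slice assignment replaces all the pops and the insert.
        tokens[l:r + 1] = [" ".join(tokens[l:r + 1])]
    return tokens


q = ['say', '"hello', 'big', 'world"', 'and', '"good', 'bye"']
print(rejoin_quoted(q, [(1, 3), (5, 6)]))
# prints ['say', '"hello big world"', 'and', '"good bye"']
```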

Kent

> 
> My question is, is there a more Pythonic solution to this?
> 
> I was thinking of using a regex to pluck qualifying
> quoted-space-including sentences out, and then trying to remember
> where they went based on context, but that sounds prone to error; so I
> thought of perhaps the same thing with a unique token of my own that I
> can find once the list is created and then sub the original string
> back in, but I wonder if calling index() repeatedly would be any
> faster.
> 
> I've got it down to 3 seconds now, but I'm trying to get... a stable
> solution, if possible an elegant solution. The current one is prone to
> breaking based on funny whitespace and is just ugly and prickly
> looking.
> 
> Regards,
> 
> Liam Clarke