[Tutor] strings & splitting

Wed Jan 25 21:42:54 CET 2006

Hi,

Thanks Kent, I'll check out the csv module. I had a go with Pyparsing
a while ago, and it's clocking in at the 3 minute mark also.

Alan - the data is of the form -

a = {
b = 1
c = 2
d = { e =  { f  = 4 g = "Ultimate Showdown of Ultimate Destiny" } h =
{ i j k } }
}

Everything is whitespace delimited. I'd like to turn it into

["a", "=", "{", "b", "=", "1",
 "c", "=", "2", "d", "=", "{",
 "e", "=", "{", "f", "=", "4", "g", "=",
 "\"Ultimate Showdown of Ultimate Destiny\"",
 "}", "h", "=", "{", "i", "j", "k", "}", "}", "}"]

Regards,

Liam Clarke
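
For reference, one possible way to get the token list above is a single
regex pass instead of split-then-rejoin. This is only a sketch: the name
tokenise() is made up for illustration, and it assumes that quoted strings
never contain an escaped double quote and that every token really is
separated from its neighbours by whitespace.

```python
import re

# Either a complete double-quoted string (no embedded quotes assumed),
# or a run of non-whitespace characters.
TOKEN_RE = re.compile(r'"[^"]*"|\S+')


def tokenise(text):
    """Return whitespace-delimited tokens, keeping quoted strings whole."""
    return TOKEN_RE.findall(text)


sample = '''a = {
b = 1
c = 2
d = { e =  { f  = 4 g = "Ultimate Showdown of Ultimate Destiny" } h =
{ i j k } }
}'''

tokens = tokenise(sample)
print(tokens)
```

Because the quoted alternative comes first in the pattern, a quoted string
is consumed in one piece before \S+ gets a chance to stop at its first
space. Newlines and runs of spaces are simply skipped between matches.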

On 1/26/06, Alan Gauld <alan.gauld at freenet.co.uk> wrote:
> Hi Liam,
>
> I'm not sure I really understand what you are trying
> to get to here.
>
> Can you provide a short sample of before/after data
> so we can see what we are trying to achieve?
>
> Alan G
>
> ----- Original Message -----
> From: "Liam Clarke" <ml.cyresse at gmail.com>
> To: "Python Tutor" <tutor at python.org>
> Sent: Wednesday, January 25, 2006 1:18 PM
> Subject: [Tutor] strings & splitting
>
>
> Hi all,
>
> I have a large string which I'm attempting to manipulate, and I find
> it very convenient to tokenise by calling
> large_string.split(" ") on it.
>
> The exception is the double quoted strings within my string, which
> contain spaces.
>
> At the moment I'm doing a split by \n, and then looping line by line,
> splitting by spaces and then reuniting double quoted strings by
> iterating over the split line, looking for mismatched quotation marks,
> storing the indexes of each matching pair and then:
>
> for (l, r) in pairs:
> .    # Join tokens l..r (r included) back into one quoted string.
> .    rejoined_string = " ".join(q[l:r+1])
> .    # Slice assignment swaps the whole run for the single joined
> .    # token, replacing the repeated pop() calls and the insert().
> .    q[l:r+1] = [rejoined_string]
>
> I'm doing it one split line at a time, extending a big flat list with
> each resulting line, as I found out that Python doesn't cope overly
> well with stuff like the above on an 800,000 item list. I think
> it was mainly the insert.
>
> My question is, is there a more Pythonic solution to this?
>
> I was thinking of using a regex to pluck qualifying
> quoted-space-including sentences out, and then trying to remember
> where they went based on context, but that sounds prone to error; so I
> thought of perhaps the same thing with a unique token of my own that I
> can find once the list is created and then sub the original string
> back in, but I wonder if calling index() repeatedly would be any
> faster.
>
> I've got it down to 3 seconds now, but I'm trying to get... a stable
> solution, if possible an elegant solution. The current one is prone to
> breaking based on funny whitespace and is just ugly and prickly
> looking.
>
> Regards,
>
> Liam Clarke
>
>
>
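
If the split-then-rejoin approach from the quoted message is kept, the
rejoining can be done in one pass that builds a new list, which avoids
calling pop() and insert() on a very large list. This is a sketch only:
rejoin_quoted() is a made-up name, and it assumes a quoted string starts
on a token beginning with a double quote and ends on a token ending with
one.

```python
def rejoin_quoted(tokens):
    """Merge runs of tokens that belong to one double-quoted string."""
    out = []
    pending = None  # pieces of a quoted string still missing its closing quote
    for tok in tokens:
        if pending is not None:
            pending.append(tok)
            if tok.endswith('"'):
                out.append(" ".join(pending))
                pending = None
        elif tok.startswith('"') and not (len(tok) > 1 and tok.endswith('"')):
            # Opening quote without a closing quote on the same token.
            pending = [tok]
        else:
            out.append(tok)
    if pending is not None:
        # Unterminated quote: keep the pieces as separate tokens.
        out.extend(pending)
    return out


line = 'g = "Ultimate Showdown of Ultimate Destiny" }'
print(rejoin_quoted(line.split()))
```

Each token is visited once and appended once, so the cost grows linearly
with the number of tokens. Note that split() followed by " ".join()
collapses any runs of whitespace inside a quoted string to single spaces.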