[Tutor] strings & splitting

Wed Jan 25 23:44:07 CET 2006

Liam wrote:
------------------------
Alan - the data is of the form -

a = {
b = 1
c = 2
d = { e =  { f  = 4 g = "Ultimate Showdown of Ultimate Destiny" } h =
{ i j k } }
}

Everything is whitespace delimited. I'd like to turn it into

["a", "=", "{", "b", "=", "1",
  "c", "=", "2", "d", "=", "{",
 "e", "=", "{", "f", "=", "4", "g", "=",
"\"Ultimate Showdown of Ultimate Destiny\"",
 "}", "h", "=", "{", "i", "j", "k", "}", "}"]
-----------------------

Liam,

I'd probably tackle this on a character-by-character basis using 
traditional tokenising code, i.e. build a state machine to determine 
what kind of token I'm in and keep reading chars until the token 
completes. Most tokens are single-char tokens; the others are quoted tokens.

Set token-type appropriately and read till end-of-token.
I might even use some classes to define the token types, 
but I'd keep them as simple as possible.

Rough pseudo code:

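separators = '\n\t ,.'   # whitespace plus other non-token chars
tokens = []              # completed tokens
token = []               # chars of the token being built
tokenType = None         # None, 'quoted' or 'simple'

for char in tokenString:
    if tokenType == 'quoted':
        token.append(char)
        if char == '"':              # closing quote completes the token
            tokens.append(''.join(token))
            token, tokenType = [], None
    elif tokenType == 'simple':
        if char in separators:       # a separator completes the token
            tokens.append(''.join(token))
            token, tokenType = [], None
        else:
            token.append(char)
    else:   # not in a token
        if char in separators:
            continue
        elif char == '"':
            tokenType = 'quoted'
        else:
            tokenType = 'simple'
        token.append(char)

if token:   # flush a token that runs to the end of the data
    tokens.append(''.join(token))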

You can tidy that up with functions and a proper state jump table.
Even in this rough form it might be faster than trying to build complex 
pattern matches and doing lots of insertions into lists etc. It does, 
however, rely on the data having no more variety of token types than 
your sample.
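
As an illustration of the "functions and a state jump table" idea, here 
is a minimal sketch. The state names, the handler functions and the 
JUMP_TABLE dictionary are my own illustrative choices, and it assumes 
that only whitespace separates tokens and that quoted strings contain 
no escaped quotes. Treat it as one possible shape, not a finished 
tokeniser.

```python
# Sketch only: state names, handlers and JUMP_TABLE are illustrative
# assumptions. Each handler consumes one char and returns the next state.
SEPARATORS = " \t\n"

def outside(char, token, tokens):
    """Not in a token: skip separators, otherwise start a new token."""
    if char in SEPARATORS:
        return "outside"
    token.append(char)
    return "quoted" if char == '"' else "simple"

def simple(char, token, tokens):
    """In an unquoted token: a separator completes it."""
    if char in SEPARATORS:
        tokens.append("".join(token))
        del token[:]
        return "outside"
    token.append(char)
    return "simple"

def quoted(char, token, tokens):
    """In a quoted token: keep everything up to the closing quote."""
    token.append(char)
    if char == '"':
        tokens.append("".join(token))
        del token[:]
        return "outside"
    return "quoted"

# The jump table maps each state name to the function that handles it.
JUMP_TABLE = {"outside": outside, "simple": simple, "quoted": quoted}

def tokenize(text):
    tokens, token, state = [], [], "outside"
    for char in text:
        state = JUMP_TABLE[state](char, token, tokens)
    if token:  # flush a token that runs to the end of the input
        tokens.append("".join(token))
    return tokens

sample = 'd = { f = 4 g = "Ultimate Showdown of Ultimate Destiny" } h =\n{ i j k }'
print(tokenize(sample))
```

Adding a new kind of token then means writing one more handler and 
adding it to the table, rather than growing the if/elif chain.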

HTH,

Alan G
Author of the learn to program web tutor
http://www.freenetpages.co.uk/hp/alan.gauld