tcl-style string parsing

Guido van Rossum guido at cnri.reston.va.us
Tue Oct 19 17:13:59 EDT 1999


Dev <devnull at fleetingimage.com> writes:

> I have a string consisting of multiple strings---
> 
>   a = '"This, m\'dear,"  is  "an example" "of a parsing problem."'
> 
> I would like to efficiently convert this to a list (or tuple):
> 
>   b = ["This, m'dear,", "is", "an example", "of a parsing problem."]
> 
> In the source string, note that whitespace
> within a quoted string should be retained,
> whitespace outside a quoted string should be ignored,
> and strings without whitespace don't need to be quoted.
> 
> I'm converting some code from TCL, where this is trivial.
> (The original string can be treated and indexed as a list.)
> I've not found a suitable re expression for this, nor a set
> of string replacements.

Your problem seems to be designed with Tcl in mind, but it is really a
parsing problem.  It so happens that Tcl stole some of its lexing ideas
from the shell, and Python 1.5.2 happens to have a handy module, shlex
by Eric Raymond, that nearly solves your problem:

    >>> import shlex
    >>> import StringIO
    >>> f = StringIO.StringIO(a)
    >>> s = shlex.shlex(f)
    >>> l = []
    >>> while 1:
    ...     t = s.get_token()
    ...     if not t: break
    ...     l.append(t)
    ...     print `t`
    ...
    '"This, m\'dear,"'
    'is'
    '"an example"'
    '"of a parsing problem."'
    >>> print l
    ['"This, m\'dear,"', 'is', '"an example"', '"of a parsing problem."']
    >>>

Note that shlex leaves the string quotes around the quoted tokens;
these are easily removed by adding something like the following to
the loop, before the l.append(t) call:

        if len(t) >= 2 and t[0] == '"' == t[-1]:
            t = t[1:-1]
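
For readers on a current Python, here is a minimal sketch of the same
idea with the quote stripping folded into the loop.  It assumes Python 3,
where shlex.shlex accepts a string directly and can be iterated over;
the function name split_quoted is made up for illustration and is not
part of shlex.  The last line shows shlex.split, a later addition to the
module that tokenizes in POSIX mode and drops the quotes itself, and
which should give the same list for this input.

```python
import shlex


def split_quoted(text):
    """Tokenize text with shlex and strip one pair of surrounding double quotes.

    Hypothetical helper for illustration; only the shlex calls are library API.
    """
    tokens = []
    # In the default (non-POSIX) mode, shlex keeps the quote characters
    # on quoted tokens, so they are removed by hand below.
    for t in shlex.shlex(text):
        if len(t) >= 2 and t[0] == '"' == t[-1]:
            t = t[1:-1]
        tokens.append(t)
    return tokens


a = '"This, m\'dear,"  is  "an example" "of a parsing problem."'

print(split_quoted(a))
# ["This, m'dear,", 'is', 'an example', 'of a parsing problem.']

# POSIX-mode splitting removes the quotes without the manual step.
print(shlex.split(a))
```

The hand-written check only strips double quotes, matching the example
string; single-quoted tokens would need the same treatment.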

--Guido van Rossum (home page: http://www.python.org/~guido/)



