[Python-Dev] tokenize string literal problem

Sat Oct 24 08:04:25 CEST 2009

BACKGROUND
    I'm trying to modify the doctest DocTestParser so it will parse docstring code snippets out of a *py file. (Although doctest can parse these with another method out of *pyc, it is missing certain decorated functions and we would also like to insist of import of needed modules rather and that method automatically loads everything from the module containing the code.)

PROBLEM
    I need to find code snippets which are located in docstrings. Docstrings, being string literals should be able to be parsed out with tokenize. But tokenize is giving the wrong results (or I am doing something wrong) for this (pathological) case:

foo.py:
+----
def bar():
    """
    A quoted triple quote is not a closing
    of this docstring:
    >>> print '"""'
    """
    """ # <-- this is the closing quote
    pass
+----

Here is how I tokenize the file:

###
import re, tokenize
DOCSTRING_START_RE = re.compile('\s+[ru]*("""|' + "''')")

o=open('foo.py','r')
for ti in tokenize.generate_tokens(o.next):
    typ = ti[0]
    text = ti[-1]
    if typ == tokenize.STRING:
        if DOCSTRING_START_RE.match(text):
            print "DOCSTRING:",repr(text)
o.close()
###

which outputs:

DOCSTRING: '    """\n    A quoted triple quote is not a closing\n    of this docstring:\n    >>> print \'"""\'\n'
DOCSTRING: '    """\n    """ # <-- this is the closing quote\n'

There should be only one string tokenized, I believe. The PythonWin editor parses (and colorizes) this correctly, but tokenize (or I) are making an error.

Thanks for any help,
Chris