[Doc-SIG] non-ascii docstrings

Fri Mar 17 22:10:17 CET 2006

I've been working on epydoc, and the question has come up of how I 
should treat non-unicode docstrings that contain non-ascii characters. 
An example of such a file is "python2.4/encodings/string_escape.py", 
whose module docstring contains an 'o' with an umlaut.

In particular, the question is whether I should assume that the 
docstring is encoded with the encoding specified by the "-*- coding -*-" 
directive at the top of the file.
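
As a rough sketch of how a tool could find that directive: modern Python 3 
provides tokenize.detect_encoding, which applies the PEP 263 rules to the 
first two lines of a source file.  The helper name declared_encoding and 
the sample source below are made up for illustration; this is not a 
description of how epydoc does it.

```python
import io
import tokenize

def declared_encoding(source_bytes):
    """Return the encoding named by the coding directive (hypothetical helper)."""
    # detect_encoding() reads at most two lines and looks for the
    # coding cookie (or a UTF-8 BOM); without either it reports "utf-8".
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding

# Made-up module source: a latin-1 file whose docstring contains an o-umlaut.
sample = b'# -*- coding: latin-1 -*-\n"""K\xf6ln"""\n'
print(declared_encoding(sample))
```

Note that detect_encoding returns a normalized codec name, so the spelling 
it reports may differ from the one written in the directive.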

The reason why we *wouldn't* use the encoding is that PEP 263 [1], which 
defines the coding directive, says that it does *not* apply to 
non-unicode string literals.  In particular, PEP 263 says that the 
entire file should be read and tokenized using the specified coding, but 
that once string objects are created, they should be re-encoded back into 
8-bit strings using the file encoding.
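
To make that round trip concrete, here is a small sketch of the steps, 
written in modern Python 3, where bytes stands in for the 8-bit string 
type.  The sample bytes and variable names are assumptions chosen for 
illustration.

```python
# Bytes of a docstring body as stored in a latin-1 encoded source file.
file_encoding = "latin-1"
raw_in_file = b"K\xf6ln"          # 'o' with an umlaut is the single byte 0xF6

# Step 1: the file contents are decoded using the declared encoding.
as_text = raw_in_file.decode(file_encoding)

# Step 2: a non-unicode string literal is re-encoded with the same encoding,
# so the resulting 8-bit string holds exactly the bytes found in the file.
docstring_8bit = as_text.encode(file_encoding)
assert docstring_8bit == raw_in_file

# The 8-bit string carries no record of its encoding; reading it as ASCII fails.
try:
    docstring_8bit.decode("ascii")
except UnicodeDecodeError as exc:
    print("not ASCII:", exc.reason)
```

The net effect is that the docstring object contains the original bytes, 
and anyone looking at it later has to guess which encoding they are in.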

So the "correct" fix is for the author of the module to use unicode 
literals instead of string literals for docstrings that contain 
non-ascii characters.  This has the advantage that if a user tries to 
look at the docstring via introspection, it will be correct.

On the other hand, epydoc is often used by people other than the author 
of a module, and requiring them to go through and replace all string 
literal docstrings with unicode literals seems a bit unreasonable.

In a way, this is similar to a mistake I've seen many times: using 
unescaped backslashes inside docstrings.  For example:

def wc(filename):
     """
     Count the number of words in the given file. E.g.:
         >>> wc("c:\test\new.txt")
         100
     """

This looks fine in the source file, but looks quite broken if you print 
its __doc__:

 >>> print wc.__doc__
     Count the number of words in the given file. E.g.:
         >>> wc("c:     est
ew.txt")
         100

(The right fix in that case is probably to use a raw string.)
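
For comparison, a sketch of the same example with a raw-string docstring. 
The only change is the r prefix; the function still has no body beyond 
its docstring, as in the example above.

```python
def wc(filename):
    r"""
    Count the number of words in the given file. E.g.:
        >>> wc("c:\test\new.txt")
        100
    """

# The r prefix disables escape processing, so "\t" and "\n" stay as
# a backslash followed by a letter instead of becoming tab and newline.
print(wc.__doc__)
```

Printing __doc__ now shows the path exactly as it was typed in the source.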

So the question is: should epydoc (and other tools like it) be 
compliant with PEP 263 (and consistent with Python), or should they "do 
what I mean, not what I say" and treat non-ascii docstrings as if they 
were encoded using the module's encoding?
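
The two options can be sketched side by side.  This is only an 
illustration of the choice, not epydoc's implementation: both function 
names are hypothetical, showing undecodable bytes as escapes is just one 
possible rendering, and bytes in modern Python 3 stands in for the 8-bit 
string type.

```python
def docstring_strict(doc, module_encoding):
    """PEP 263-compliant reading: the coding directive is ignored for
    8-bit docstrings, so non-ASCII bytes are shown as escapes."""
    if isinstance(doc, str):      # already unicode: nothing to decide
        return doc
    return doc.decode("ascii", "backslashreplace")

def docstring_dwim(doc, module_encoding):
    """'Do what I mean' reading: assume an 8-bit docstring uses the
    module's declared encoding, falling back to escapes if that fails."""
    if isinstance(doc, str):
        return doc
    try:
        return doc.decode(module_encoding)
    except (UnicodeDecodeError, LookupError):
        return doc.decode("ascii", "backslashreplace")

raw = b"K\xf6ln"   # made-up 8-bit docstring from a latin-1 module
# ascii() keeps the printed result readable on any terminal.
print(ascii(docstring_strict(raw, "latin-1")))
print(ascii(docstring_dwim(raw, "latin-1")))
```

Under the first option the reader sees an escape sequence where the 
umlaut should be; under the second the text comes out right whenever the 
directive matches the bytes actually in the file.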

-Edward

[1] http://www.python.org/doc/peps/pep-0263/