[Doc-SIG] non-ascii docstrings
goodger at python.org
Fri Mar 24 14:53:47 CET 2006
> I've been working on epydoc, and the question has come up of how I
> should treat non-unicode docstrings that contain non-ascii
> characters. An example of such a file is
> "python2.4/encodings/string_escape.py", whose module docstring
> contains an 'o' with an umlaut.
> In particular, the question is whether I should assume that the
> docstring is encoded with the encoding specified by the "-*- coding
> -*-" directive at the top of the file.
I think that although it's the only possible assumption, it's also
potentially a wrong assumption. IOW, don't assume anything.
> The reason why we *wouldn't* use the encoding is that PEP 263,
> which defines the coding directive, says that it does *not* apply to
> non-unicode string literals. In particular, PEP 263 says that the
> entire file should be read & tokenized using the specified coding,
> but once string objects are created, they should be reencoded back
> into 8-bit strings using the file encoding.
One reason is that the module code may expect such string literals to
have their original encoding. String literals can contain arbitrary
8-bit data (strings are bytes, not characters). Attempting to decode
such strings is inviting misinterpretation.
Another reason is simple: "In the face of ambiguity, refuse the
temptation to guess."
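
To make the ambiguity concrete, here is a minimal sketch. It assumes
Python 3, where "bytes" stands in for the 8-bit string type discussed
above, and the sample bytes are made up for illustration. The same
byte sequence decodes to different text, or fails to decode,
depending on which encoding you guess:

```python
# Assumption: Python 3, with bytes standing in for an 8-bit str literal.
# The docstring bytes are hypothetical; 0xF6 is o-umlaut in Latin-1.
raw = b"Schr\xf6dinger"

# Guess 1: Latin-1 yields the text the author probably meant.
as_latin1 = raw.decode("latin-1")

# Guess 2: cp437 maps the same byte to a different character.
as_cp437 = raw.decode("cp437")

# Guess 3: UTF-8 rejects the byte sequence outright.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(repr(as_latin1), repr(as_cp437), utf8_ok)
```

Nothing in the bytes themselves says which guess is right.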
> So the "correct" fix is for the author of the module to use unicode
> literals instead of string literals for docstrings that contain
> non-ascii characters. This has the advantage that if a user tries
> to look at the docstring via introspection, it will be correct.
> On the other hand, epydoc is often used by people other than the
> author of a module, and requiring them to go through and replace all
> string literal docstrings with unicode literals seems a bit
> unreasonable.
Yes, it's unreasonable. But such code is buggy IMO. It's also
unreasonable to expect Epydoc to correctly interpret garbage input.
Don't do it.
> So the question is: should epydoc (and other tools like it) be
> compliant with PEP 263 (and consistent with Python); or should they
> "do what I mean, not what I say" and treat non-ascii docstrings as
> if they were encoded using the module's encoding?
Be compliant with PEP 263, issue a warning (PEP 263, Implementation,
step 1), and either ignore such string literals or represent them as
strings of bytes (using "\xYY" notation).
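
A rough sketch of the second option, again assuming Python 3 "bytes"
in place of 8-bit strings. The function name and the "source"
parameter are hypothetical, not part of Epydoc. It warns and falls
back to "\xYY" escapes rather than guessing an encoding:

```python
import warnings


def render_docstring(raw, source="<unknown>"):
    """Return displayable text for a byte-string docstring.

    Hypothetical helper, not Epydoc's API. Pure-ASCII input is
    returned as text; anything else triggers a warning and has each
    non-ASCII byte shown in \\xYY notation.
    """
    try:
        return raw.decode("ascii")
    except UnicodeDecodeError:
        warnings.warn(
            "non-ASCII bytes in byte-string docstring of %s; "
            "showing them as escapes" % source
        )
        # Iterating over bytes yields ints in Python 3.
        return "".join(
            chr(b) if b < 0x80 else "\\x%02x" % b for b in raw
        )


print(render_docstring(b"plain ascii"))
print(render_docstring(b"Schr\xf6dinger", source="example.py"))
```

The escaped form is lossless for the non-ASCII bytes and makes no
claim about what characters they were meant to be.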
David Goodger <http://python.net/~goodger>