[Doc-SIG] Non-ASCII Characters in reST (was: [Doc-SIG] docutils feedback)

David Goodger goodger@users.sourceforge.net
Mon, 17 Jun 2002 22:27:29 -0400


Ueli Schlaepfer wrote:
> In a recent thread, Simon posted the following traceback::
>
>   Traceback (most recent call last):
...
> "...docutils/writers/__init__.py", line 86, in recordfile
>       output = output.encode('raw-unicode-escape')    # @@@ temporary
>   UnicodeError: ASCII decoding error: ordinal not in range(128)
>
> To which David replied:
>
>   >> This is a bug in the Docutils code, not a problem with the data,
>   >> so it's appropriate to see a Python traceback.  The code crashed!
>   >>
>   >> However, I believe the problem is solved in current CVS.  Try
>   >> installing from the CVS snapshot:
>   >> http://docutils.sf.net/docutils-snapshot.tgz
...
> Now I just checked with yesterday's (June 16) CVS.  The `minimal
> document`_ below still triggers the bug, contrary to David's
> statement.

This is an input encoding issue, the solution to which (you guessed
it!) hasn't been implemented yet.  I'm not even sure what the solution
should be.  Although I've worked with different "character sets" in
the past (such as Japanese SJIS, and Chinese and Korean encodings),
the encoding was always known beforehand.  With Docutils, it won't be.
Anyone with Unicode encoding/decoding experience, I'd appreciate some
advice.

I've got an "--encoding" option in docutils/frontend.py, but it's
commented out as unimplemented.  I've heard that such a command-line
option is a **bad thing**, as are inline magic comments or directives.
There's a comment, "Take default encoding & language from locale?".  I
don't know how best to proceed.  I'd like to make the *right* decision
here, not just "good enough for now".  That's one of the reasons I
haven't implemented encoding support yet.  I've seen this debated on
Python-Dev and elsewhere, but I have yet to be shown "the one true
way" or convinced that it *is* the right way.

Another reason is that there are really two encodings at play here:
the input encoding and the output encoding.  Is it reasonable to
require that both be specified, with UTF-8 as the default for each?
Western Europeans will want Latin-1 as the default, but that's not
friendly to the rest of the world.  Is locale the answer?  Or are
explicit command-line options the best way?  A combination?
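
Concretely, the two-encoding pipeline amounts to something like this
(the option and variable names are only illustrative)::

    input_encoding = 'latin-1'  # e.g. from an --input-encoding option
    output_encoding = 'utf-8'   # e.g. from an --output-encoding option

    source = open('document.txt').read()    # 8-bit string off disk
    text = unicode(source, input_encoding)  # decode to Unicode
    # ... all parsing & transforming happens on the Unicode text ...
    output = text.encode(output_encoding)   # encode for output
    open('document.html', 'w').write(output)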

My brain hurts.

> I ran into this bug before (I happen to have an 'ä' in my last name,
> so it occurred at least once per document) and tracked it to line 98
> in docutils/writers/__init__.py, which reads (note the
> ``temporary``!)::
>
>   output = output.encode('utf-8') # @@@ temporary; must not hard-code
>
> Apparently, ``'something'.encode()`` expects ``'something'`` to
> contain clean 7-bit ASCII text.

My understanding is as follows (please correct me if I'm wrong).
``output.encode('utf-8')`` actually expects ``output`` to be a Unicode
string, *not* 7-bit ASCII.  If it's a regular (8-bit, non-Unicode)
string, the operation will try to coerce it (decode it using the
default encoding) into a Unicode string, then encode that into a
UTF-8-encoded 8-bit string.  It's the *decoding* stage that's raising
the exception, because the default encoding is 7-bit "ascii", and
``output`` is a regular string containing non-ASCII (the a+umlaut).
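
To make that concrete, here is roughly how it plays out at a Python 2
interactive prompt (the a-umlaut is ``'\xe4'`` in Latin-1)::

    >>> output = '\xe4'         # 8-bit string, *not* Unicode
    >>> output.encode('utf-8')  # implicitly decodes as ASCII first
    Traceback (most recent call last):
      ...
    UnicodeError: ASCII decoding error: ordinal not in range(128)
    >>> unicode(output, 'latin-1').encode('utf-8')  # decode explicitly
    '\xc3\xa4'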

As an experiment, could you try editing your site.py file?  Try
enabling the locale-detecting mechanism (search for "locale"), and/or
the explicit default encoding above it.  What happens then?
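
(For reference, the section of site.py I mean looks roughly like
this; condensed, and the exact text varies by Python version)::

    encoding = "ascii"  # Default value set by _PyUnicode_Init()

    if 0:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]

    if encoding != "ascii":
        sys.setdefaultencoding(encoding)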

> My quick-and-dirty fix to the problem was to comment out this line.
> The offending characters make it through to the HTML, but as my
> documents are for internal use only, this doesn't matter so far.

As I said, the problem isn't encoding the output, it's decoding the
input.  Since the HTML being produced says it's using UTF-8, if there
are non-ASCII characters in your output they may be misinterpreted by
your browser.

> Fixing this kind of thing can be nasty, I think -- for a simple
> text file, there's (AFAIK) no way to know the encoding but guessing.

Yes.  The question is, how best to guess?  And how best to override
the guesswork?
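
One sketch (not anything that exists in Docutils) is to try a list of
candidate encodings in order.  UTF-8 is a reasonable first guess,
since non-UTF-8 8-bit text rarely decodes as valid UTF-8 by accident,
and Latin-1 is a reasonable last resort, since every byte sequence
decodes under it::

    def guess_decode(data, encodings=('utf-8', 'latin-1')):
        # Hypothetical helper: return the decoded text plus the
        # encoding that worked, trying each candidate in turn.
        for enc in encodings:
            try:
                return unicode(data, enc), enc
            except UnicodeError:
                pass
        raise UnicodeError('no candidate encoding matched')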

> One way to handle it would probably be to have such an educated
> guess based on the system docutils is running on (assuming the file
> was typed in a dumb editor on the same machine), and to allow the
> encoding to be stated explicitly with a directive along the lines::
>
>   .. encoding: cp1250

That's an alternative (already on the "To Do" list).  It's not my
favorite, since it's ugly and can lie: when a file is saved from a
text editor, it could be re-encoded without the author noticing,
leaving the "encoding" directive wrong.

--
David Goodger  <goodger@users.sourceforge.net>  Open-source projects:
  - Python Docutils: http://docutils.sourceforge.net/
    (includes reStructuredText: http://docutils.sf.net/rst.html)
  - The Go Tools Project: http://gotools.sourceforge.net/