[Doc-SIG] Non-ASCII Characters in reST (was: [Doc-SIG] docutils feedback)

Schlaepfer, Ueli (ESEC ZG) Ueli.Schlaepfer@esec.com
Mon, 17 Jun 2002 18:48:03 +0200


Hi all,

In a recent thread, Simon posted the following traceback::

  Traceback (most recent call last):
    File "/home/hefti/local_in_archive/src/docutils-0.1/tools/html.py", line 26, in ?
      reporter=reporter)
    File "/usr/local/Python-2.2b1/lib/python2.2/site-packages/docutils/core.py", line 85, in publish
      pub.publish(source, destination)
    File "/usr/local/Python-2.2b1/lib/python2.2/site-packages/docutils/core.py", line 67, in publish
      self.writer.write(document, destination)
    File "/usr/local/Python-2.2b1/lib/python2.2/site-packages/docutils/writers/__init__.py", line 56, in write
      self.record()
    File "/usr/local/Python-2.2b1/lib/python2.2/site-packages/docutils/writers/html4css1.py", line 36, in record
      self.recordfile(self.output, self.destination)
    File "/usr/local/Python-2.2b1/lib/python2.2/site-packages/docutils/writers/__init__.py", line 86, in recordfile
      output = output.encode('raw-unicode-escape')    # @@@ temporary
  UnicodeError: ASCII decoding error: ordinal not in range(128)

To which David replied:

  >> This is a bug in the Docutils code, not a problem with the data, so
  >> it's appropriate to see a Python traceback.  The code crashed!
  >>
  >> However, I believe the problem is solved in current CVS.  Try
  >> installing from the CVS snapshot:
  >> http://docutils.sf.net/docutils-snapshot.tgz

I ran into this bug before (I happen to have an 'ä' in my last name, so
it occurred at least once per document) and tracked it to line 98 in
docutils/writers/__init__.py, which reads (note the ``temporary``!)::

  output = output.encode('utf-8') # @@@ temporary; must not hard-code

Apparently, ``'something'.encode()`` expects ``'something'`` to contain
clean 7-bit ASCII text.  My quick-and-dirty fix to the problem was to
comment out this line.  The offending characters make it through to the
HTML, but as my documents are for internal use only, this doesn't matter
so far.
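
The failure mode can be reproduced in a few lines.  This is a minimal
sketch for current Python 3 (an assumption; the traceback above comes
from Python 2.2, where calling ``.encode()`` on a byte string first
decodes it implicitly as ASCII).  The helper ``py2_style_encode`` and the
sample name are hypothetical and only imitate that behaviour; they are
not Docutils code.

```python
# Illustrative input: 'Schläpfer' as bytes in Latin-1 (0xE4 is 'ä').
raw = b"Schl\xe4pfer"


def py2_style_encode(data, target="utf-8"):
    """Imitate Python 2's str.encode(): implicit ASCII decode, then encode."""
    return data.decode("ascii").encode(target)


# The implicit ASCII step is what fails on any byte above 127.
try:
    py2_style_encode(raw)
    failed = False
except UnicodeDecodeError:
    failed = True

# Decoding with the real input encoding first avoids the error.
fixed = raw.decode("latin-1").encode("utf-8")
```

The point of the sketch is that the output encoding is not the problem;
the input has to be decoded with the right encoding before it can be
re-encoded at all.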

I just checked with yesterday's (June 16) CVS.  The `minimal document`_
below still triggers the bug, contrary to David's statement.

.. _`minimal document`:

   === 8< ==========================================
   ä
   === 8< ==========================================

   ;-)

Fixing this kind of thing can be nasty, I think -- for a simple text
file, there's (AFAIK) no way to know the encoding other than by guessing.
It's the raison d'être for LaTeX's "inputenc" package (and for a few more,
I think).  This package provides a way to state the encoding in the source
file.

One way to handle it would probably be to make an educated guess
based on the system docutils is running on (assuming the file was typed
in a dumb editor on the same machine), and to allow the encoding to be
stated explicitly with a directive along these lines::

  .. encoding: cp1250
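
A rough sketch of how the two ideas could fit together, in Python 3.
Everything here is an assumption for illustration: ``guess_encoding`` and
``read_source`` are hypothetical names, the ``.. encoding:`` marker is
only the syntax proposed above (not an existing Docutils directive), and
the scan assumes an ASCII-compatible encoding so that the marker line
itself can be found in the raw bytes.

```python
import codecs
import locale
import re

# Hypothetical marker, following the proposal above.
ENCODING_RE = re.compile(
    rb"^\.\. encoding:\s*([A-Za-z0-9_.-]+)\s*$", re.MULTILINE
)


def guess_encoding(data):
    """Return the encoding declared in ``data`` (bytes), else the locale's."""
    match = ENCODING_RE.search(data)
    if match:
        name = match.group(1).decode("ascii")
        try:
            # Normalise the name and check that Python knows the codec.
            return codecs.lookup(name).name
        except LookupError:
            pass  # unknown codec name: fall back to the guess below
    # Educated guess based on the system the tool is running on.
    return locale.getpreferredencoding(False)


def read_source(data):
    """Decode raw source bytes using the declared or guessed encoding."""
    return data.decode(guess_encoding(data))


sample = b".. encoding: cp1250\n\nSchl\xe4pfer\n"
text = read_source(sample)
```

An explicit declaration wins over the guess, so a file moved to another
machine still decodes the same way; the locale fallback only covers the
"typed in a dumb editor on the same machine" case.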

That's it for now... (I was hoping to provide a fix, but my current work
situation doesn't allow for substantial contributions :-( )


-- 

Ueli