[Doc-SIG] Non-ASCII Characters in reST (was: [Doc-SIG] docutils feedback)
Schlaepfer, Ueli (ESEC ZG)
Ueli.Schlaepfer@esec.com
Mon, 17 Jun 2002 18:48:03 +0200
Hi all,
In a recent thread, Simon posted the following traceback::
    Traceback (most recent call last):
      File "/home/hefti/local_in_archive/src/docutils-0.1/tools/html.py", line 26, in ?
        reporter=reporter)
      File "/usr/local/Python-2.2b1/lib/python2.2/site-packages/docutils/core.py", line 85, in publish
        pub.publish(source, destination)
      File "/usr/local/Python-2.2b1/lib/python2.2/site-packages/docutils/core.py", line 67, in publish
        self.writer.write(document, destination)
      File "/usr/local/Python-2.2b1/lib/python2.2/site-packages/docutils/writers/__init__.py", line 56, in write
        self.record()
      File "/usr/local/Python-2.2b1/lib/python2.2/site-packages/docutils/writers/html4css1.py", line 36, in record
        self.recordfile(self.output, self.destination)
      File "/usr/local/Python-2.2b1/lib/python2.2/site-packages/docutils/writers/__init__.py", line 86, in recordfile
        output = output.encode('raw-unicode-escape') # @@@ temporary
    UnicodeError: ASCII decoding error: ordinal not in range(128)

To which David replied:
>> This is a bug in the Docutils code, not a problem with the data, so
>> it's appropriate to see a Python traceback. The code crashed!
>>
>> However, I believe the problem is solved in current CVS. Try
>> installing from the CVS snapshot:
>> http://docutils.sf.net/docutils-snapshot.tgz
I ran into this bug before (I happen to have an 'ä' in my last name, so
it occurred at least once per document) and tracked it to line 98 in
docutils/writers/__init__.py, which reads (note the ``temporary``!)::
    output = output.encode('utf-8') # @@@ temporary; must not hard-code
Apparently, ``'something'.encode()`` expects ``'something'`` to contain
clean 7-bit ASCII text (presumably the plain string is first decoded
with the default ASCII codec before being encoded). My quick-and-dirty
fix to the problem was to comment out this line. The offending
characters make it through to the HTML, but as my documents are for
internal use only, this doesn't matter so far.
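To illustrate the failure mode, here is a minimal sketch in present-day Python 3, where the decode step has to be written out by hand. It assumes (this is not taken from the Docutils source) that the input arrives as Latin-1 bytes; ``reencode`` is a hypothetical helper, not a Docutils function.

```python
# Sketch only: spell out the implicit ASCII decode that Python 2 performed
# when .encode() was called on a plain (byte) string.
raw = b"\xe4\n"  # the one-character document; 0xE4 is 'ä' in Latin-1


def reencode(data, source_encoding="ascii", target_encoding="utf-8"):
    """Decode bytes with source_encoding, then encode as target_encoding."""
    return data.decode(source_encoding).encode(target_encoding)


try:
    reencode(raw)  # the ASCII decode step rejects byte 0xE4
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Naming the real source encoding (assumed Latin-1 here) avoids the error.
fixed = reencode(raw, source_encoding="latin-1")
```

The point is that the encode step itself is harmless; the error comes from decoding non-ASCII bytes without knowing their encoding.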
Now I just checked with yesterday's (June 16) CVS. The `minimal
document`_ below still triggers the bug, contrary to David's statement.
.. _`minimal document`:
=== 8< ==========================================
ä
=== 8< ==========================================
;-)
Fixing this kind of thing can be nasty, I think -- for a simple text
file, there's (AFAIK) no way to know the encoding other than guessing.
It's the raison d'être for LaTeX's "inputenc" package (and for a few
more, I think). This package provides a way to state the encoding in the
source file.
One way to handle it would probably be to make an educated guess based
on the system docutils is running on (assuming the file was typed in a
dumb editor on the same machine), and to allow the encoding to be stated
explicitly with a directive along these lines::
    .. encoding: cp1250
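A rough sketch of how such a lookup could work, in modern Python. Everything here is an assumption for illustration: the ``.. encoding:`` marker is only the syntax proposed above (not an existing Docutils directive), and ``guess_encoding`` and ``read_source`` are hypothetical names.

```python
import locale
import re

# Hypothetical marker syntax taken from the proposal above.
ENCODING_MARKER = re.compile(rb"^\.\. encoding:\s*([-\w.]+)\s*$", re.MULTILINE)


def guess_encoding(data, default=None):
    """Return the encoding declared in data, else a system-based guess."""
    match = ENCODING_MARKER.search(data)
    if match:
        # An explicit declaration wins over any guess.
        return match.group(1).decode("ascii")
    # Educated guess: the preferred encoding of the machine we run on.
    return default or locale.getpreferredencoding(False)


def read_source(data, default=None):
    """Decode raw source bytes using the declared or guessed encoding."""
    return data.decode(guess_encoding(data, default))
```

For example, ``read_source(b".. encoding: cp1250\n\n\xe4\n")`` decodes the bytes as cp1250. Without a marker, the result depends on the locale of the machine running the code, which is exactly the guesswork described above.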
That's it for now... (I was hoping to provide a fix, but my current work
situation doesn't allow for substantial contributions :-( )
-- 
Ueli