Non-ASCII Characters in reST (was: [Doc-SIG] docutils feedback)
Hi all,

In a recent thread, Simon posted the following traceback::

    Traceback (most recent call last):
      File "/home/hefti/local_in_archive/src/docutils-0.1/tools/html.py", line 26, in ?
        reporter=reporter)
      File "/usr/local/Python-2.2b1/lib/python2.2/site-packages/docutils/core.py", line 85, in publish
        pub.publish(source, destination)
      File "/usr/local/Python-2.2b1/lib/python2.2/site-packages/docutils/core.py", line 67, in publish
        self.writer.write(document, destination)
      File "/usr/local/Python-2.2b1/lib/python2.2/site-packages/docutils/writers/__init__.py", line 56, in write
        self.record()
      File "/usr/local/Python-2.2b1/lib/python2.2/site-packages/docutils/writers/html4css1.py", line 36, in record
        self.recordfile(self.output, self.destination)
      File "/usr/local/Python-2.2b1/lib/python2.2/site-packages/docutils/writers/__init__.py", line 86, in recordfile
        output = output.encode('raw-unicode-escape') # @@@ temporary
    UnicodeError: ASCII decoding error: ordinal not in range(128)

To which David replied:
This is a bug in the Docutils code, not a problem with the data, so it's appropriate to see a Python traceback. The code crashed!
However, I believe the problem is solved in current CVS. Try installing from the CVS snapshot: http://docutils.sf.net/docutils-snapshot.tgz
I ran into this bug before (I happen to have an 'ä' in my last name, so it occurred at least once per document) and tracked it to line 98 in docutils/writers/__init__.py, which reads (note the ``temporary``!)::

    output = output.encode('utf-8') # @@@ temporary; must not hard-code

Apparently, ``'something'.encode()`` expects ``'something'`` to contain clean 7-bit ASCII text.

My quick-and-dirty fix to the problem was to comment out this line. The offending characters make it through to the HTML, but as my documents are for internal use only, this doesn't matter so far.

Now I just checked with yesterday's (June 16) CVS. The `minimal document`_ below still triggers the bug, contrary to David's statement.

.. _`minimal document`:

=== 8< ==========================================
ä
=== 8< ==========================================

;-)

Fixing this kind of thing can be nasty, I think -- for a simple text file, there's (AFAIK) no way to know the encoding but guessing. It's the raison d'être for TeX's "inputenc" package (and for a few more, I think). This package provides a way to state the encoding in the source file.

One way to handle it would probably be to have such an educated guess based on the system docutils is running on (assuming the file was typed in a dumb editor on the same machine), and to allow the encoding to be stated explicitly with a directive along the lines::

    .. encoding: cp1250

That's it for now... (I was hoping to provide a fix, but my current work situation doesn't allow for substantial contributions :-( )

-- Ueli
Ueli Schlaepfer wrote:
In a recent thread, Simon posted the following traceback::
    Traceback (most recent call last):
      ...
      "...docutils/writers/__init__.py", line 86, in recordfile
        output = output.encode('raw-unicode-escape') # @@@ temporary
    UnicodeError: ASCII decoding error: ordinal not in range(128)
To which David replied:
This is a bug in the Docutils code, not a problem with the data, so it's appropriate to see a Python traceback. The code crashed!
However, I believe the problem is solved in current CVS. Try installing from the CVS snapshot: http://docutils.sf.net/docutils-snapshot.tgz

...

Now I just checked with yesterday's (June 16) CVS. The `minimal document`_ below still triggers the bug, contrary to David's statement.
This is an input encoding issue, the solution to which (you guessed it!) hasn't been implemented yet. I'm not even sure what the solution should be. Although I've worked with different "character sets" in the past (such as Japanese SJIS, and Chinese and Korean encodings), the encoding was always known beforehand. With Docutils, it won't be.

Anyone with Unicode encoding/decoding experience, I'd appreciate some advice.

I've got an "--encoding" option in docutils/frontend.py, but it's commented out as unimplemented. I've heard that such a command-line option is a **bad thing**, as are inline magic comments or directives. There's a comment, "Take default encoding & language from locale?". I don't know how best to proceed.

I'd like to make the *right* decision here, not just "good enough for now". That's one of the reasons I haven't implemented the encoding issue yet. I've seen this debated, on Python-Dev and elsewhere, but I have yet to be shown "the one true way" or convinced that it *is* the right way.

Another reason is because there are really two encodings at play here: the input encoding and the output encoding. Is it reasonable to require that both be specified (with UTF-8 as defaults)? Western Europeans will want to use Latin-1 as the default, but that's not friendly to the rest of the world. Is locale the answer? Or are explicit command-line options the best way? A combination? My brain hurts.
I ran into this bug before (I happen to have an 'ä' in my last name, so it occurred at least once per document) and tracked it to line 98 in docutils/writers/__init__.py, which reads (note the ``temporary``!)::

    output = output.encode('utf-8') # @@@ temporary; must not hard-code
Apparently, ``'something'.encode()`` expects ``'something'`` to contain clean 7-bit ASCII text.
My understanding is as follows (please correct me if I'm wrong). ``output.encode('utf-8')`` actually expects ``output`` to be a Unicode string, *not* 7-bit ASCII. If it's a regular (8-bit, non-Unicode) string, the operation will try to coerce it (decode it using the default encoding) into a Unicode string, then encode that into a UTF-8-encoded 8-bit string. It's the *decoding* stage that's raising the exception, because the default encoding is 7-bit "ascii", and ``output`` is a regular string containing non-ASCII (the a+umlaut).

As an experiment, could you try editing your site.py file? Try enabling the locale-detecting mechanism (search for "locale"), and/or the explicit default encoding above it. What happens then?
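The two-stage behaviour described above can be made explicit. This is only a sketch in present-day Python 3, where bytes and text are separate types and nothing is coerced implicitly; the helper ``py2_style_encode`` and its ``default_encoding`` parameter are hypothetical names used for illustration, not part of Docutils or Python:

```python
# Sketch: emulate what Python 2 did implicitly when .encode() was called
# on an 8-bit string. All names here are hypothetical, for illustration only.

def py2_style_encode(data: bytes, target: str, default_encoding: str = "ascii") -> bytes:
    """Decode with the default encoding, then encode to the target encoding."""
    text = data.decode(default_encoding)  # stage 1: implicit coercion to Unicode
    return text.encode(target)            # stage 2: the encode that was asked for


# An a+umlaut as it would arrive from a Latin-1 encoded source file.
raw = "\u00e4".encode("latin-1")

try:
    py2_style_encode(raw, "utf-8")
    failed_stage = None
except UnicodeDecodeError:
    # The failure comes from stage 1 (decoding), not from the UTF-8 encode.
    failed_stage = "decode"

# With the right input encoding supplied, the same call succeeds.
fixed = py2_style_encode(raw, "utf-8", default_encoding="latin-1")
```

If this reading is right, changing the default encoding in site.py would correspond to changing ``default_encoding`` here: the crash goes away only when that value happens to match the file's real encoding.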
My quick-and-dirty fix to the problem was to comment out this line. The offending characters make it through to the HTML, but as my documents are for internal use only, this doesn't matter so far.
As I said, the problem isn't encoding the output, it's decoding the input. Since the HTML being produced says it's using UTF-8, if there are non-ASCII characters in your output they may be misinterpreted by your browser.
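A small Python 3 sketch of that misinterpretation, assuming the input file was Latin-1 and its bytes reach the HTML unchanged. The sample string is made up, and ``errors="replace"`` merely stands in for whatever recovery a real browser applies:

```python
# Bytes as they would sit in the HTML if Latin-1 input passed through untouched.
latin1_bytes = "x\u00e4y".encode("latin-1")

# A reader that trusts the page's UTF-8 declaration cannot decode the lone
# 0xE4 byte; with errors="replace" it becomes U+FFFD (the replacement character).
as_seen = latin1_bytes.decode("utf-8", errors="replace")

# What the output would need to contain for the UTF-8 declaration to be true.
utf8_bytes = latin1_bytes.decode("latin-1").encode("utf-8")
```

The second conversion is only possible when the input encoding is known, which is the open question in this thread.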
Fixing this kind of thing can be nasty, I think -- for a simple text file, there's (AFAIK) no way to know the encoding but guessing.
Yes. The question is, how best to guess? And how best to override the guesswork?
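One possible shape for "guess, with an override", purely as a sketch and not a proposal for the Docutils API. The function ``guess_decode`` and the order of candidates (strict UTF-8 first, then the locale's preferred encoding, then Latin-1 as a last resort) are assumptions made for illustration:

```python
import locale
from typing import Optional, Tuple


def guess_decode(data: bytes, explicit: Optional[str] = None) -> Tuple[str, str]:
    """Return (text, encoding_used). Hypothetical helper, not Docutils API."""
    if explicit:
        # An explicit override is trusted and never second-guessed.
        candidates = [explicit]
    else:
        candidates = [
            "utf-8",                               # strict, so false positives are rare
            locale.getpreferredencoding(False),    # the system's own guess
            "latin-1",                             # never fails, but may be wrong
        ]
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding: try the next candidate
    raise ValueError("could not decode input with: %s" % ", ".join(candidates))


text, used = guess_decode(b"\xc3\xa4")                         # valid UTF-8 input
forced, forced_used = guess_decode(b"\xe4", explicit="latin-1")  # caller overrides
```

The Latin-1 fallback shows the weakness of guessing: it always "succeeds", so a wrong guess produces wrong text rather than an error.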
One way to handle it would probably be to have such an educated guess based on the system docutils is running on (assuming the file was typed in a dumb editor on the same machine), and to allow the encoding to be stated explicitly with a directive along the lines::
    .. encoding: cp1250
That's an alternative (already in the "To Do" list). It's not my favorite, since it's ugly and can lie. When saving from a text editor a file could be re-encoded without the author noticing, and the "encoding" directive would become wrong.

-- 
David Goodger <goodger@users.sourceforge.net>
Open-source projects:
- Python Docutils: http://docutils.sourceforge.net/
  (includes reStructuredText: http://docutils.sf.net/rst.html)
- The Go Tools Project: http://gotools.sourceforge.net/
On Mon, Jun 17, 2002, David Goodger wrote:
I've got an "--encoding" option in docutils/frontend.py, but it's commented out as unimplemented. I've heard that such a command-line option is a **bad thing**, as are inline magic comments or directives. There's a comment, "Take default encoding & language from locale?". I don't know how best to proceed. I'd like to make the *right* decision here, not just "good enough for now". That's one of the reasons I haven't implemented the encoding issue yet. I've seen this debated, on Python-Dev and elsewhere, but I have yet to be shown "the one true way" or convinced that it *is* the right way.
See PEP 263. That way, at worst you're compatible with Python.
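For readers who haven't looked at it: PEP 263 has the reader scan the first two lines of a source file for a comment containing ``coding:`` or ``coding=`` followed by an encoding name. A rough sketch of that kind of detection follows; the regular expression is modelled on the one given in the PEP, while the function name ``declared_encoding`` is hypothetical:

```python
import re
from typing import Optional

# Pattern modelled on the one described in PEP 263; applied to raw bytes
# because the encoding is not yet known when the declaration is read.
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(data: bytes) -> Optional[str]:
    """Return the encoding named in the first two lines, or None."""
    for line in data.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return None


found = declared_encoding(b"# -*- coding: latin-1 -*-\nx = 1\n")
missing = declared_encoding(b"x = 1\n")
```

A reST equivalent would need its own marker syntax, since ``#`` is not a comment in reStructuredText.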
Another reason is because there are really two encodings at play here: the input encoding and the output encoding. Is it reasonable to require that both be specified (with UTF-8 as defaults)? Western Europeans will want to use Latin-1 as the default, but that's not friendly to the rest of the world. Is locale the answer? Or are explicit command-line options the best way? A combination?
I'd suggest a directive as the answer to output encoding; someone who wants to implement dynamic output encoding can do so with an extension directive of some kind. Just make the API clear.

-- 
Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/
Project Vote Smart: http://www.vote-smart.org/
On Mon, Jun 17, 2002, David Goodger wrote:
I've got an "--encoding" option in docutils/frontend.py, but it's commented out as unimplemented. I've heard that such a command-line option is a **bad thing**, as are inline magic comments or directives. There's a comment, "Take default encoding & language from locale?". I don't know how best to proceed. I'd like to make the *right* decision here, not just "good enough for now". That's one of the reasons I haven't implemented the encoding issue yet. I've seen this debated, on Python-Dev and elsewhere, but I have yet to be shown "the one true way" or convinced that it *is* the right way.
Aahz wrote:
See PEP 263. That way, at worst you're compatible with Python.
That's what I'm talking about; I'm not convinced by PEP 263. I have strong reservations about the magic comment it proposes. I hesitate to add similar recognition logic to Docutils/reStructuredText. I'm no expert, but it seems flawed somehow. It could very well be the best solution, but it doesn't seem that way to me.

Does anybody know of any good, authoritative references on the subject?

-- 
David Goodger <goodger@users.sourceforge.net>
Open-source projects:
- Python Docutils: http://docutils.sourceforge.net/
  (includes reStructuredText: http://docutils.sf.net/rst.html)
- The Go Tools Project: http://gotools.sourceforge.net/
participants (3)
- Aahz
- David Goodger
- Schlaepfer, Ueli (ESEC ZG)