RE: [Doc-SIG] Non-ASCII Characters in reST (was: [Doc-SIG] docutils feedback)
Hi again,
David Goodger wrote:
[...]
This is an input encoding issue, the solution to which (you guessed it!) hasn't been implemented yet. I'm not even sure what the solution should be. Although I've worked with different "character sets" in the past (such as Japanese SJIS, and Chinese and Korean encodings), the encoding was always known beforehand. With Docutils, it won't be. Anyone with Unicode encoding/decoding experience, I'd appreciate some advice.
Emacs does some guesswork concerning file encoding -- should we have a look at that for a starter?
I've got an "--encoding" option in docutils/frontend.py, but it's commented out as unimplemented. I've heard that such a command-line option is a **bad thing**, as are inline magic comments or directives. There's a comment, "Take default encoding & language from locale?". I don't know how best to proceed. I'd like to make the *right* decision here, not just "good enough for now". That's one of the reasons I haven't implemented the encoding issue yet. I've seen this debated, on Python-Dev and elsewhere, but I have yet to be shown "the one true way" or convinced that it *is* the right way.
I'm way off base here, but as I mentioned -- one condition is that the result is not "anything-centric", but defaults to a reasonable value. If the locale is wrong -- well, blame the site administrator... The language is much less of an issue, I think. Stating it in the document if it's not what the default would be is easy enough to understand, and such a statement won't turn into a lie as easily as the encoding. A command-line option is a must, though; I don't expect an American to state that his documents are in English, but I won't state it if mine are in German either. So I need a way to tell the frontend what language it should use.
Another reason is because there are really two encodings at play here: the input encoding and the output encoding. Is it reasonable to require that both be specified (with UTF-8 as defaults)? Western Europeans will want to use Latin-1 as the default, but that's not friendly to the rest of the world. Is locale the answer? Or are explicit command-line options the best way? A combination?
A sensible default is definitely required. Explicit command or another mechanism should be there, too, but as a last resort only and for people who (unlike me ;-) know what they're doing.
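The "sensible default, explicit override as a last resort" idea could be sketched roughly as follows. This is only an illustration: ``resolve_encoding`` is a hypothetical helper, not part of Docutils, and the choice of UTF-8 as the final fallback is an assumption taken from the question above.

```python
import codecs
import locale


def resolve_encoding(explicit=None, fallback="utf-8"):
    """Hypothetical helper: explicit override first, then locale, then fallback."""
    if explicit:
        # An explicit choice is trusted; an unknown name raises LookupError.
        return codecs.lookup(explicit).name
    guessed = locale.getpreferredencoding(False)
    try:
        # Normalize whatever name the locale reports to a codec Python knows.
        return codecs.lookup(guessed).name
    except LookupError:
        return fallback


# Input and output encodings are resolved independently.
input_encoding = resolve_encoding()            # locale-derived default
output_encoding = resolve_encoding("latin-1")  # explicit override wins
```

Keeping the two calls separate reflects the point that the input and output encodings are distinct settings that merely share the same defaulting rule.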
My brain hurts.
I ran into this bug before (I happen to have an 'ä' in my last name, so it occurred at least once per document) and tracked it to line 98 in docutils/writers/__init__.py, which reads (note the ``temporary``!)::
    output = output.encode('utf-8') # @@@ temporary; must not hard-code
Apparently, ``'something'.encode()`` expects ``'something'`` to contain clean 7-bit ASCII text.
My understanding is as follows (please correct me if I'm wrong). ``output.encode('utf-8')`` actually expects ``output`` to be a Unicode string, *not* 7-bit ASCII. If it's a regular (8-bit, non-Unicode) string, the operation will try to coerce it (decode it using the default encoding) into a Unicode string, then encode that into a UTF-8-encoded 8-bit string. It's the *decoding* stage that's raising the exception, because the default encoding is 7-bit "ascii", and ``output`` is a regular string containing non-ASCII (the a+umlaut).
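The two-stage behaviour described above can be made visible with a small sketch. Modern Python 3 has no implicit coercion (``bytes`` has no ``.encode()``), so the hypothetical helper ``py2_style_encode`` spells out the decode step that Python 2 performed silently; the sample word is an arbitrary choice.

```python
def py2_style_encode(raw, target="utf-8", default_encoding="ascii"):
    """Emulate Python 2's str.encode(): implicit decode, then encode."""
    # Stage 1: coerce the 8-bit string to Unicode using the default encoding.
    text = raw.decode(default_encoding)
    # Stage 2: encode the Unicode string into the target encoding.
    return text.encode(target)


# An 8-bit string containing a+umlaut, as a Latin-1 editor would save it.
latin1_bytes = "B\u00e4r".encode("latin-1")

try:
    py2_style_encode(latin1_bytes)
    failed_stage = None
except UnicodeDecodeError:
    # The failure comes from stage 1 (decoding), not from encoding to UTF-8.
    failed_stage = "decode"

# With a default encoding that matches the data, the same call succeeds.
fixed = py2_style_encode(latin1_bytes, default_encoding="latin-1")
```

Changing the default encoding, which is what the site.py experiment does, affects only stage 1; the hard-coded UTF-8 target is untouched.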
As an experiment, could you try editing your site.py file? Try enabling the locale-detecting mechanism (search for "locale"), and/or the explicit default encoding above it. What happens then?
Thanks for the clarification! The experiment you suggested indicates that you're correct. I just tried enabling locale detection, and the error went away... My fix is a lot less dirty now :-)
As I said, the problem isn't encoding the output, it's decoding the input. Since the HTML being produced says it's using UTF-8, if there are non-ASCII characters in your output they may be misinterpreted by your browser.
I'm aware of that. But since it's running in the same environment, chances are that it'll work out all right anyway, and so far it did.
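What "misinterpreted" can look like is easy to show in isolation. The sketch below assumes the output bytes are really Latin-1 while the page declares UTF-8, and also shows the reverse mismatch; it says nothing about how any particular browser recovers from the error.

```python
text = "\u00e4"                         # a+umlaut
latin1_bytes = text.encode("latin-1")   # b"\xe4"
utf8_bytes = text.encode("utf-8")       # b"\xc3\xa4"

# Latin-1 bytes read under a UTF-8 label: the lone byte is not valid UTF-8,
# so a lenient decoder substitutes U+FFFD (the replacement character).
lenient = latin1_bytes.decode("utf-8", errors="replace")

# The reverse mismatch: UTF-8 bytes read as Latin-1 give two wrong characters.
misread = utf8_bytes.decode("latin-1")
```

Pure ASCII text decodes identically under both labels, which is one reason such a mismatch can go unnoticed for a long time.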
Fixing this kind of thing can be nasty, I think -- for a simple text file, there's (AFAIK) no way to know the encoding but guessing.
Yes. The question is, how best to guess? And how best to override the guesswork?
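One possible shape for "guess, but allow an override" is sketched here. The order of the guesses (UTF-8 signature, strict UTF-8, locale, then Latin-1) is an assumption made for illustration, and ``guess_decode`` is a hypothetical name rather than anything Docutils provides.

```python
import codecs
import locale


def guess_decode(data, override=None):
    """Hypothetical helper: return (text, encoding_used) for raw bytes."""
    if override:
        # An explicit override skips all guesswork.
        return data.decode(override), override
    if data.startswith(codecs.BOM_UTF8):
        # A UTF-8 signature is a strong hint; strip it before decoding.
        return data[len(codecs.BOM_UTF8):].decode("utf-8"), "utf-8-sig"
    try:
        # Strict UTF-8 rarely accepts text written in another 8-bit encoding.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    locale_encoding = locale.getpreferredencoding(False)
    try:
        return data.decode(locale_encoding), locale_encoding
    except (UnicodeDecodeError, LookupError):
        # Latin-1 maps every byte to a character, so this never fails,
        # though the result may still be the wrong text.
        return data.decode("latin-1"), "latin-1"


text, used = guess_decode("\u00e4".encode("utf-8"))
forced, forced_used = guess_decode(b"\xe4", override="cp1250")
```

The last fallback illustrates the limit of guessing: a decoder that always succeeds cannot tell you whether it chose correctly.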
One way to handle it would probably be to have such an educated guess based on the system docutils is running on (assuming the file was typed in a dumb editor on the same machine), and to allow the encoding to be stated explicitly with a directive along the lines::
    .. encoding: cp1250
That's an alternative (already in the "To Do" list). It's not my favorite, since it's ugly and can lie. When saving from a text editor a file could be re-encoded without the author noticing, and the "encoding" directive would become wrong.
Good point! ...and what do you get if you save from a browser window (say, webmail written in reST)? After the above experiment, I think (but I'm definitely no authority here!!) that using the locale should be the default way to go.
-- Ueli
On Tue, Jun 18, 2002, Schlaepfer, Ueli (ESEC ZG) wrote:
David Goodger wrote:
This is an input encoding issue, the solution to which (you guessed it!) hasn't been implemented yet. I'm not even sure what the solution should be. Although I've worked with different "character sets" in the past (such as Japanese SJIS, and Chinese and Korean encodings), the encoding was always known beforehand. With Docutils, it won't be. Anyone with Unicode encoding/decoding experience, I'd appreciate some advice.
Emacs does some guesswork concerning file encoding -- should we have a look at that for a starter?
Again, see PEP 263. As David said, there are a lot of problems with it, but it *does* start with Emacs as its base.
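For reference, the Emacs-style declaration that PEP 263 builds on can be detected with the regular expression given in the PEP. The sketch below applies it to the first two lines of a source file, as the PEP specifies; ``declared_encoding`` is a hypothetical helper name, and whether a similar comment would suit reST documents is exactly the open question in this thread.

```python
import re

# Pattern from PEP 263 for a coding declaration in a comment line.
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(source_lines):
    """Return the encoding named on line 1 or 2, or None if there is none."""
    for line in source_lines[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1)
    return None


example = ["#!/usr/bin/python", "# -*- coding: latin-1 -*-", "print('x')"]
found = declared_encoding(example)
```

Like the proposed ``encoding`` directive, such a comment is only a claim about the file: re-saving in another encoding leaves the comment unchanged.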
The language is much less of an issue, I think. Stating it in the document if it's not what the default would be is easy enough to understand, and such a statement won't turn into a lie as easily as the encoding. A command-line option is a must, though; I don't expect an American to state that his documents are in English, but I won't state it if mine are in German either. So I need a way to tell the frontend what language it should use.
You edit the document to add the language. Otherwise, what if you're processing multiple documents in a single run, all in different languages? BTW, right-justified text looks ugly in a monospaced font.
--
Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/
Project Vote Smart: http://www.vote-smart.org/