[Doc-SIG] Non-ASCII Characters in reST (was: [Doc-SIG] docutils feedback)
Schlaepfer, Ueli (ESEC ZG)
Ueli.Schlaepfer@esec.com
Tue, 18 Jun 2002 09:00:13 +0200
Hi again,
> David Goodger wrote:
[...]
> This is an input encoding issue, the solution to which (you guessed
> it!) hasn't been implemented yet. I'm not even sure what the solution
> should be. Although I've worked with different "character sets" in
> the past (such as Japanese SJIS, and Chinese and Korean encodings),
> the encoding was always known beforehand. With Docutils, it won't be.
> Anyone with Unicode encoding/decoding experience, I'd appreciate some
> advice.
Emacs does some guesswork concerning file encoding -- should we have a
look at that for a starter?
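Such guesswork can be as simple as trial decoding: try a list of candidate codecs and keep the first one that decodes without error. This is only a rough sketch of the idea -- the candidate list below is an illustration, not what Emacs actually does:

```python
# Naive encoding guess by trial decoding. Latin-1 accepts any byte
# sequence, so it must come last as the catch-all candidate.
def guess_encoding(data: bytes, candidates=("utf-8", "latin-1")):
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None

# UTF-8-encoded a-umlaut is valid UTF-8, so the first candidate wins:
print(guess_encoding("Schläpfer".encode("utf-8")))   # -> utf-8
# A lone 0xE4 byte (Latin-1 a-umlaut) is not valid UTF-8:
print(guess_encoding(b"Schl\xe4pfer"))               # -> latin-1
```

The obvious weakness is that many legacy encodings (cp1250, Latin-1, ...) decode *any* byte stream, so trial decoding alone can't tell them apart.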
> I've got an "--encoding" option in docutils/frontend.py, but it's
> commented out as unimplemented. I've heard that such a command-line
> option is a **bad thing**, as are inline magic comments or directives.
> There's a comment, "Take default encoding & language from locale?". I
> don't know how best to proceed. I'd like to make the *right* decision
> here, not just "good enough for now". That's one of the reasons I
> haven't implemented the encoding issue yet. I've seen this debated,
> on Python-Dev and elsewhere, but I have yet to be shown "the one true
> way" or convinced that it *is* the right way.
I'm way off base here, but as I mentioned -- one condition is that the
result is not "anything-centric", but defaults to a reasonable value.
If the locale is wrong -- well, blame the site administrator...
The language is much less of an issue, I think. Stating it in the
document if it's not what the default would be is easy enough to
understand, and such a statement won't turn into a lie as easily as the
encoding. A command-line option is a must, though; I don't expect an
American to state that his documents are in English, but I won't state
it if mine are in German either. So I need a way to tell the frontend
what language it should use.
> Another reason is because there are really two encodings at play here:
> the input encoding and the output encoding. Is it reasonable to
> require that both be specified (with UTF-8 as defaults)? Western
> Europeans will want to use Latin-1 as the default, but that's not
> friendly to the rest of the world. Is locale the answer? Or are
> explicit command-line options the best way? A combination?
A sensible default is definitely required. An explicit command-line
option or some other mechanism should be there, too, but as a last
resort only and for people who (unlike me ;-) know what they're doing.
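The "locale as default, explicit option as override" idea could look something like the sketch below. The function and option names are made up for illustration; this is not the actual Docutils frontend code:

```python
# Sketch: resolve input/output encodings, letting explicit options
# (e.g. from hypothetical --input-encoding/--output-encoding flags)
# take precedence over a locale-derived default.
import locale

def resolve_encodings(input_opt=None, output_opt=None):
    # locale.getpreferredencoding(False) reports the locale's codec
    # without calling setlocale(); fall back to UTF-8 just in case.
    fallback = locale.getpreferredencoding(False) or "utf-8"
    return (input_opt or fallback, output_opt or fallback)

# Both explicit: the user knows what they're doing.
print(resolve_encodings("latin-1", "utf-8"))
# Nothing explicit: both come from the locale.
inp, out = resolve_encodings()
```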
> My brain hurts.
>
> > I ran into this bug before (I happen to have an 'ä' in my last name,
> > so it occurred at least once per document) and tracked it to line 98
> > in docutils/writers/__init__.py, which reads (note the
> > ``temporary``!)::
> >
> >     output = output.encode('utf-8') # @@@ temporary; must not hard-code
> >
> > Apparently, ``'something'.encode()`` expects ``'something'`` to
> > contain clean 7-bit ASCII text.
>
> My understanding is as follows (please correct me if I'm wrong).
> ``output.encode('utf-8')`` actually expects ``output`` to be a Unicode
> string, *not* 7-bit ASCII. If it's a regular (8-bit, non-Unicode)
> string, the operation will try to coerce it (decode it using the
> default encoding) into a Unicode string, then encode that into a
> UTF-8-encoded 8-bit string. It's the *decoding* stage that's raising
> the exception, because the default encoding is 7-bit "ascii", and
> ``output`` is a regular string containing non-ASCII (the a-umlaut).
> As an experiment, could you try editing your site.py file? Try
> enabling the locale-detecting mechanism (search for "locale"), and/or
> the explicit default encoding above it. What happens then?
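The two-stage behaviour described above is a Python 2 implicit coercion; later Pythons make the same two stages explicit, which is perhaps the clearest way to see where the error comes from. A sketch (the byte string is just my surname in Latin-1):

```python
# Stage 1 (decode) and stage 2 (encode), written out explicitly.
raw = b"Schl\xe4pfer"          # Latin-1 bytes: 0xE4 is a-umlaut

try:
    raw.decode("ascii")        # stage 1 with the old "ascii" default...
except UnicodeDecodeError:
    pass                       # ...is exactly the failure seen above

text = raw.decode("latin-1")   # stage 1 with the right input encoding
utf8 = text.encode("utf-8")    # stage 2: encode the Unicode string
print(utf8)                    # -> b'Schl\xc3\xa4pfer'
```

So the hard-coded ``encode('utf-8')`` in the writer only ever fails because the *decode* half was left to the interpreter's default.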
Thanks for the clarification! The experiment you suggested indicates
that you're correct. I just tried enabling locale detection, and the
error went away... My fix is a lot less dirty now :-)
> As I said, the problem isn't encoding the output, it's decoding the
> input. Since the HTML being produced says it's using UTF-8, if there
> are non-ASCII characters in your output they may be misinterpreted by
> your browser.
I'm aware of that. But since it's running in the same environment,
chances are that it'll work out all right anyway, and so far it has.
> > Fixing this kind of thing can be nasty, I think -- for a simple
> > text file, there's (AFAIK) no way to know the encoding except by
> > guessing.
>
> Yes. The question is, how best to guess? And how best to override
> the guesswork?
>
> > One way to handle it would probably be to have such an educated
> > guess based on the system docutils is running on (assuming the file
> > was typed in a dumb editor on the same machine), and to allow the
> > encoding to be stated explicitly with a directive along the lines
> > of::
> >
> >     .. encoding: cp1250
>
> That's an alternative (already in the "To Do" list). It's not my
> favorite, since it's ugly and can lie. When saving from a text editor
> a file could be re-encoded without the author noticing, and the
> "encoding" directive would become wrong.
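For what it's worth, reading such a declaration is easy -- peek at the first couple of raw lines before decoding the rest of the file. The directive syntax and the two-line window below are pure assumptions on my part, not anything Docutils does:

```python
# Hypothetical reader for a ".. encoding: <name>" declaration near the
# top of a raw (undecoded) source file.
import re

_DIRECTIVE = re.compile(rb"^\.\.\s+encoding:\s*(\S+)")

def declared_encoding(data: bytes, max_lines=2):
    for line in data.splitlines()[:max_lines]:
        m = _DIRECTIVE.match(line)
        if m:
            # Codec names are ASCII, so this decode is safe.
            return m.group(1).decode("ascii")
    return None

print(declared_encoding(b".. encoding: cp1250\n\nHello"))  # -> cp1250
```

Of course the declaration is exactly the kind of statement that turns into a lie the moment an editor silently re-encodes the file.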
Good point! ...and what do you get if you save from a browser window
(say, webmail written in reST)? After the above experiment, I think
(but I'm definitely no authority here!!) that using the locale should be
the default way to go.
-- 
Ueli