[Doc-SIG] Non-ASCII Characters in reST (was: [Doc-SIG] docutils feedback)

Schlaepfer, Ueli (ESEC ZG) Ueli.Schlaepfer@esec.com
Tue, 18 Jun 2002 09:00:13 +0200


Hi again,

> David Goodger wrote:

[...]

> This is an input encoding issue, the solution to which (you guessed
> it!) hasn't been implemented yet.  I'm not even sure what the solution
> should be.  Although I've worked with different "character sets" in
> the past (such as Japanese SJIS, and Chinese and Korean encodings),
> the encoding was always known beforehand.  With Docutils, it won't be.
> Anyone with Unicode encoding/decoding experience, I'd appreciate some
> advice.

Emacs does some guesswork concerning file encoding -- should we have a
look at that for a starter?
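In its simplest form, such guesswork amounts to trying a fixed list of
encodings in order -- a minimal sketch in modern Python, nothing like
Emacs' full detection machinery, and with purely illustrative defaults:

```python
def guess_decode(data, candidates=("utf-8", "latin-1")):
    """Try each candidate encoding in order; return the decoded text
    and the name of the encoding that worked.  latin-1 accepts any
    byte sequence, so keeping it last guarantees a result."""
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding matched")

text, used = guess_decode(b"Schl\xe4pfer")  # Latin-1 bytes
print(text, used)  # Schläpfer latin-1
```

UTF-8 is a good first candidate because random 8-bit text is very
unlikely to decode as valid UTF-8 by accident.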

> I've got an "--encoding" option in docutils/frontend.py, but it's
> commented out as unimplemented.  I've heard that such a command-line
> option is a **bad thing**, as are inline magic comments or directives.
> There's a comment, "Take default encoding & language from locale?".  I
> don't know how best to proceed.  I'd like to make the *right* decision
> here, not just "good enough for now".  That's one of the reasons I
> haven't implemented the encoding issue yet.  I've seen this debated,
> on Python-Dev and elsewhere, but I have yet to be shown "the one true
> way" or convinced that it *is* the right way.

I'm way off base here, but as I mentioned -- one condition is that the
result is not "anything-centric", but defaults to a reasonable value.
If the locale is wrong -- well, blame the site administrator...

The language is much less of an issue, I think.  Stating it in the
document if it's not what the default would be is easy enough to
understand, and such a statement won't turn into a lie as easily as the
encoding.  A command-line option is a must, though; I don't expect an
American to state that his documents are in English, but I won't state
it if mine are in German either.  So I need a way to tell the frontend
what language it should use.

> Another reason is that there are really two encodings at play here:
> the input encoding and the output encoding.  Is it reasonable to
> require that both be specified (with UTF-8 as the default)?  Western
> Europeans will want to use Latin-1 as the default, but that's not
> friendly to the rest of the world.  Is locale the answer?  Or are
> explicit command-line options the best way?  A combination?

A sensible default is definitely required.  An explicit command-line
option or another mechanism should be there too, but as a last resort
only, for people who (unlike me ;-) know what they're doing.
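To make the two stages concrete, here is a rough sketch (in modern
Python) of how separate input and output encoding options could be
wired up -- the option names and defaults are assumptions for
illustration, not options Docutils actually has:

```python
import argparse

# Hypothetical frontend options; UTF-8 as the default on both ends.
parser = argparse.ArgumentParser()
parser.add_argument("--input-encoding", default="utf-8")
parser.add_argument("--output-encoding", default="utf-8")
opts = parser.parse_args(["--input-encoding", "latin-1"])

raw = b"Schl\xe4pfer"                   # bytes as read from disk
text = raw.decode(opts.input_encoding)  # decode stage: bytes -> text
out = text.encode(opts.output_encoding) # encode stage: text -> bytes
print(out)  # b'Schl\xc3\xa4pfer'
```

The key point is that the two options are independent: what matters
internally is the decoded (Unicode) text in between.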

> My brain hurts.
>
> > I ran into this bug before (I happen to have an 'ä' in my last name,
> > so it occurred at least once per document) and tracked it to line 98
> > in docutils/writers/__init__.py, which reads (note the
> > ``temporary``!)::
> >
> >   output = output.encode('utf-8') # @@@ temporary; must not hard-code
> >
> > Apparently, ``'something'.encode()`` expects ``'something'`` to
> > contain clean 7-bit ASCII text.
>
> My understanding is as follows (please correct me if I'm wrong).
> ``output.encode('utf-8')`` actually expects ``output`` to be a Unicode
> string, *not* 7-bit ASCII.  If it's a regular (8-bit, non-Unicode)
> string, the operation will try to coerce it (decode it using the
> default encoding) into a Unicode string, then encode that into a
> UTF-8-encoded 8-bit string.  It's the *decoding* stage that's raising
> the exception, because the default encoding is 7-bit "ascii", and
> ``output`` is a regular string containing non-ASCII (the a+umlaut).
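That two-stage behaviour can be demonstrated directly.  (In modern
Python the decode step must be explicit; in the Python of the time it
happened implicitly with the default "ascii" codec, which is exactly
where the exception came from.)

```python
latin1_bytes = "Schläpfer".encode("latin-1")  # 8-bit data with 0xe4

# Decoding with the ASCII codec fails on the a-umlaut byte -- the
# explicit equivalent of the implicit coercion raising an exception:
try:
    latin1_bytes.decode("ascii")
except UnicodeDecodeError as err:
    print("decode failed at byte", err.start)  # decode failed at byte 4

# The working pipeline: decode with the *input* encoding first,
# then encode with the *output* encoding.
text = latin1_bytes.decode("latin-1")
utf8_output = text.encode("utf-8")
print(utf8_output)  # b'Schl\xc3\xa4pfer'
```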

> As an experiment, could you try editing your site.py file?  Try
> enabling the locale-detecting mechanism (search for "locale"), and/or
> the explicit default encoding above it.  What happens then?

Thanks for the clarification!  The experiment you suggested confirms
that you're correct.  I just tried enabling locale detection, and the
error went away... My fix is a lot less dirty now :-)
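For reference, the locale-based default that experiment enabled boils
down to something like the following sketch (not the actual site.py
code; ``decode_input`` is a hypothetical helper):

```python
import locale

# Honour the user's environment (LANG/LC_*) instead of hard-coding
# 'ascii'; getpreferredencoding() then reports e.g. 'UTF-8' or 'cp1252'.
try:
    locale.setlocale(locale.LC_ALL, "")
except locale.Error:
    pass  # fall back to whatever the platform default is

default_encoding = locale.getpreferredencoding()

def decode_input(data, encoding=None):
    """Decode raw input bytes, falling back to the locale's encoding."""
    return data.decode(encoding or default_encoding)

print(default_encoding)
```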

> As I said, the problem isn't encoding the output, it's decoding the
> input.  Since the HTML being produced says it's using UTF-8, if there
> are non-ASCII characters in your output they may be misinterpreted by
> your browser.

I'm aware of that.  But since it's running in the same environment,
chances are that it'll work out all right anyway, and so far it has.

> > Fixing this kind of thing can be nasty, I think -- for a simple
> > text file, there's (AFAIK) no way to know the encoding except by
> > guessing.
>
> Yes.  The question is, how best to guess?  And how best to override
> the guesswork?
>
> > One way to handle it would probably be to have such an educated
> > guess based on the system docutils is running on (assuming the file
> > was typed in a dumb editor on the same machine), and to allow the
> > encoding to be stated explicitly with a directive along the lines::
> >
> >   .. encoding: cp1250
>
> That's an alternative (already in the "To Do" list).  It's not my
> favorite, since it's ugly and can lie.  When saving from a text editor
> a file could be re-encoded without the author noticing, and the
> "encoding" directive would become wrong.

Good point!  ...and what do you get if you save from a browser window
(say, webmail written in reST)?  After the above experiment, I think
(but I'm definitely no authority here!!) that using the locale should be
the default way to go.
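If the directive route were taken anyway, the detection would have to
run on the raw bytes *before* decoding, roughly like this (the directive
spelling is taken from the proposal above; the function and the scan
window are assumptions, and only ASCII-compatible encodings can be
probed this way):

```python
import re

# Look for ".. encoding: NAME" near the top of the raw, undecoded file.
DIRECTIVE = re.compile(rb"^\.\.\s+encoding:\s*(\S+)", re.MULTILINE)

def detect_declared_encoding(data, default="utf-8"):
    """Return the declared encoding from the first 1024 bytes,
    or the default if no directive is found."""
    match = DIRECTIVE.search(data[:1024])
    if match:
        return match.group(1).decode("ascii")
    return default

print(detect_declared_encoding(b".. encoding: cp1250\n\nHello"))  # cp1250
```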


-- 

Ueli