[I18n-sig] encoding support for Docutils: please review

Martin v. Loewis martin@v.loewis.de
29 Jun 2002 21:43:01 +0200

David Goodger <goodger@users.sourceforge.net> writes:

> - Try the encoding specified by a command-line option, if any.
> - Try the locale's encoding.
> - Try UTF-8.
> - Try platform-specific encodings: CP-1252 on Windows, Mac-Roman on
>   MacOS, perhaps Latin-9 (iso-8859-15) otherwise.
> Does this look right, or am I missing something?

I'd reorder this: after the command-line option, try ASCII first, then
UTF-8. If ASCII passes, it most likely is ASCII. If not, and UTF-8
passes, it most likely is UTF-8. Then try the locale's encoding.
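
The ordering above can be sketched roughly as follows. This is only an
illustration written for present-day Python 3, not the Python 2.2 of
this thread; the function name ``guess_decode`` and its parameters are
hypothetical and not part of Docutils.

```python
def guess_decode(data, cmdline_encoding=None, locale_encoding=None):
    """Decode bytes with the first candidate encoding that succeeds.

    Returns a (text, encoding_used) pair. The candidate order follows
    the suggestion above: command line, ASCII, UTF-8, then the locale.
    """
    candidates = [cmdline_encoding, "ascii", "utf-8", locale_encoding]
    for encoding in candidates:
        if not encoding:
            continue  # no command-line option or locale encoding given
        try:
            return data.decode(encoding), encoding
        except (UnicodeError, LookupError):
            # UnicodeError: the bytes are not valid in this encoding.
            # LookupError: Python has no codec under this name.
            continue
    raise ValueError("unable to determine the input encoding")
```

Strict ASCII and UTF-8 decoding reject most text in other encodings,
which is why a successful decode is good evidence; an 8-bit locale
encoding such as Latin-1 accepts any byte sequence, so it only makes
sense late in the list.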

> - Does the application have to call
>   ``locale.setlocale(locale.LC_ALL, '')``, and if so, where?  Is it OK
>   to call setlocale from within the decoding function, or should it be
>   left up to the client application?

At least on Solaris, you need this to get nl_langinfo to work correctly.

> - Should I use the result of ``locale.getlocale()``?  On
>   Win2K/Python2.2.1, I get this::
>       >>> import locale
>       >>> locale.getlocale()
>       (None, None)
>       >>> locale.getdefaultlocale()
>       ('en_US', 'cp1252')
>   Looks good so far.

No; this is broken beyond repair. On Unix, try nl_langinfo(CODESET)
(requires Python 2.2).  On Windows, try _getdefaultlocale. If either
fails, you may then fall back to getlocale, but expect it to raise
exceptions, and to give wrong answers.
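
A minimal sketch of that lookup order, again in present-day Python 3
and under stated assumptions: the function name
``detect_locale_encoding`` is hypothetical, ``_locale._getdefaultlocale``
is a private function that is assumed to be present only on Windows
builds, and the ``setlocale`` call is done inside the function purely
for illustration (it changes the process-wide locale as a side effect).

```python
import locale


def detect_locale_encoding():
    """Best-effort guess at the user's encoding; returns a name or None."""
    # Unix: nl_langinfo(CODESET) only reflects the user's settings after
    # setlocale has been called.
    if hasattr(locale, "nl_langinfo") and hasattr(locale, "CODESET"):
        try:
            locale.setlocale(locale.LC_ALL, "")
        except locale.Error:
            pass  # unsupported locale settings; keep the current locale
        codeset = locale.nl_langinfo(locale.CODESET)
        if codeset:
            return codeset

    # Windows: ask the private helper, if this build has it.
    try:
        import _locale
        encoding = _locale._getdefaultlocale()[1]
        if encoding:
            return encoding
    except (ImportError, AttributeError):
        pass

    # Last resort: getlocale, which may raise or give a wrong answer.
    try:
        return locale.getlocale()[1]
    except (ValueError, locale.Error):
        return None
```

The result is only a codeset name reported by the platform; whether
Python has a codec for it is a separate question.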

>   How can I use ``locale.getlocale()`` when it doesn't return a
>   known encoding?  Or put another way, how can I get a known
>   encoding out of ``locale.getlocale()``?

[Don't use getlocale]. If nl_langinfo gives an unknown codeset,
produce a warning message, asking the user to report that as a
bug. Keep a list of additional aliases for codesets that occur in the
wild and are aliases to known codecs, also keep a list of known
unsupported codesets (again, restrict yourself to those occurring in
the wild).
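
One way to organise such lists is sketched below in present-day
Python 3. The table entries are placeholders made up for illustration,
not codeset names actually observed in the wild, and the function name
``resolve_codeset`` is hypothetical.

```python
import codecs
import warnings

# Placeholder entries only; real tables would be filled from bug reports.
EXTRA_ALIASES = {"example-latin1-alias": "iso8859-1"}
KNOWN_UNSUPPORTED = {"example-unsupported"}


def resolve_codeset(name):
    """Map a codeset name from the platform to a Python codec name.

    Returns None if the codeset is known to be unsupported, or if it is
    unknown (in which case a warning asks the user to report it).
    """
    key = name.strip().lower()
    if key in KNOWN_UNSUPPORTED:
        return None  # known, but no codec exists: stay quiet
    key = EXTRA_ALIASES.get(key, key)
    try:
        return codecs.lookup(key).name
    except LookupError:
        warnings.warn(
            "unknown codeset %r; please report this as a bug" % name
        )
        return None
```

Keeping the unsupported list separate from the alias table lets the
warning fire only for names nobody has seen before.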

> - Does ``locale.getdefaultlocale()[1]`` reliably produce the
>   platform-specific encoding?