[Python-Dev] RE: [Patches] [ python-Patches-410465 ] Allow pre-encoded strings as filenames

Sun, 13 May 2001 22:52:22 -0400

[Mark Hammond]
> ...
> Where should the "real" documentation go?  It seems maybe we need a
> new sub-heading under the "6.1 - os -- Misc. OS Interface" - something
> like:
>
> 6.1.x - Unicode and the file system
>   - general discussion.
>   - Windows specific
>   - Mac specific should that appear.
>   - OS' with no special support (ie, "the rest")
>
> Does that make sense?

So far as it goes, yes.  I think the manual desperately needs a Unicode
section for other reasons, though:  from traffic on c.l.py, it's clear that
few people can figure out how to do *anything* with Unicode now unless their
first name begins with "M" (Mark, Martin, Marc -- definitely not Skip
<wink>).  There's no overview and there are no examples.  The primary string
method doesn't even mention Unicode (here paraphrasing questions that pop
up):

    encode([encoding[,errors]])
    Return an encoded version of the string.

What does "encoded version" mean?  Is that another string?  An encoding
object of some sort?  Etc.

    Default encoding is the current default string encoding.

What's the "current default string encoding"?  How can I find out?  Can't
even guess what *type* it has (string? magic object? little integer?).  If I
don't want the default encoding, how do I specify a different one?  What are
the possible values?  Again, can't even guess the type of the object that
needs to be passed for encoding.
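
For illustration only, here is a minimal sketch of the answers, assuming a
present-day Python 3 interpreter (not the 2.1-era one under discussion), where
text strings are Unicode and .encode() returns a bytes object.  The sample
text and variable names are made up for the example.

```python
import sys

# The default encoding is reported as a plain string naming a codec.
default_name = sys.getdefaultencoding()
print(type(default_name).__name__, default_name)

# A different encoding is requested by name, again as a plain string.
text = "caf\u00e9"  # 'cafe' with an accented final e
utf8_bytes = text.encode("utf-8")      # two bytes for the accented character
latin1_bytes = text.encode("latin-1")  # one byte for the accented character

print(utf8_bytes, latin1_bytes)
```

So both the default and any override are ordinary strings naming a codec, and
the "encoded version" is a sequence of bytes rather than some magic object.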

    errors may be given to set a different error handling scheme.
    The default for errors is 'strict', meaning that encoding
    errors raise a ValueError. Other possible values are 'ignore'
    and 'replace'.

So what do 'ignore' and 'replace' mean?
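
A sketch of the kind of example the docs could carry, again assuming
Python 3 semantics (the string and variable names are invented here):

```python
text = "caf\u00e9"  # the last character has no ASCII equivalent

# 'ignore' silently drops characters the codec cannot represent.
ignored = text.encode("ascii", "ignore")

# 'replace' substitutes a '?' for each such character.
replaced = text.encode("ascii", "replace")

# 'strict' (the default) raises instead.
try:
    text.encode("ascii", "strict")
    strict_error = None
except ValueError as exc:  # UnicodeEncodeError is a ValueError subclass
    strict_error = exc

print(ignored, replaced, type(strict_error).__name__)
# prints: b'caf' b'caf?' UnicodeEncodeError
```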

There's more left unsaid here than a single example could clarify, but
there's not even an example -- so people stare at this wholly
uncomprehending.

If they stumble into the unicode() builtin function (in a different part of
the manual, neither referencing nor referenced by the .encode() method), it's
no better:

    unicode(string[, encoding[, errors]])
    Decodes string using the codec for encoding.

What?  Hard to even guess what the function returns.  Maybe, from the name, a
Unicode string?

    Error handling is done according to errors.

What?

    The default behavior is to decode UTF-8 in strict mode,
    meaning that encoding errors raise ValueError.

How do encoding errors arise from a function that *de*codes?
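
What the docs presumably mean is that the input bytes may not be valid in
the named encoding.  A minimal sketch, assuming Python 3, where
bytes.decode() plays the role the unicode() builtin plays in this
discussion (the sample bytes are made up):

```python
# These bytes are 'cafe' with an accented e in Latin-1; a lone 0xE9 byte
# is not a complete UTF-8 sequence, so a strict UTF-8 decode must fail.
raw = b"caf\xe9"

try:
    raw.decode("utf-8")  # strict is the default error handling
    decode_error = None
except ValueError as exc:  # UnicodeDecodeError is a ValueError subclass
    decode_error = exc

# The other error schemes return a Unicode string instead of raising.
replaced = raw.decode("utf-8", "replace")  # bad byte becomes U+FFFD
ignored = raw.decode("utf-8", "ignore")    # bad byte is dropped

# Decoding with the encoding the bytes were actually written in works.
correct = raw.decode("latin-1")

print(type(decode_error).__name__, repr(replaced), repr(ignored))
```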

    See also the codecs module.

Which helps, but the relationship between the codecs module and the unicode()
function isn't spelled out there either.  Look up "encoding" in the index,
and you get pointers to base64, quoted-printable and the mimetypes module,
which only confuses things more.
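
For what it's worth, the relationship can be sketched like this, assuming
Python 3's codecs module (the 2.1-era interface differs in details, and the
names below are invented for the example):

```python
import codecs

raw = b"caf\xc3\xa9"  # UTF-8 bytes for 'cafe' with an accented e

# codecs.lookup() maps an encoding name to the codec that implements it.
info = codecs.lookup("utf-8")

# The codec's decode function returns (decoded text, bytes consumed).
decoded, consumed = info.decode(raw)

# The convenience spellings go through the same codec machinery.
via_method = raw.decode("utf-8")
via_module = codecs.decode(raw, "utf-8")

print(info.name, repr(decoded), consumed)
```

In other words, the encoding name passed to the string-level functions is
just a key for looking up a codec in the codecs registry.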

I don't expect you to fix this <wink>; I'm trying to get across that the
Unicode docs need work even without new gimmicks.  If Fred agrees, I'm sure
he'll think of a good place to put the new info too.

> I have made this change to Misc/NEWS.  Does this look OK
> (obviously once I know what to replace "[????]" with :)

Absolutely, and I don't even have to read it to say so <wink>:  once
*something* is checked in, we're assured it won't get dropped on the floor
come release time, and anyone who has any quibbles with it can check in
changes.  It's not like checking in a NEWS item can break the std test suite
or cause HP-UX to crash.

well-not-really-sure-about-the-latter-ly y'rs  - tim