RE: [Patches] [ python-Patches-410465 ] Allow pre-encoded strings as filenames
http://sourceforge.net/tracker/?func=detail&atid=305470&aid=410465&group_id=5470
Category: core (C code) Group: None
Status: Closed Resolution: Accepted Priority: 5 Submitted By: Mark Hammond (mhammond) Assigned to: Mark Hammond (mhammond) Summary: Allow pre-encoded strings as filenames
Initial Comment: This patch enables most filename parameters to use pre-encoded strings. On Windows, the default of "mbcs" is used. On all other platforms, the default filename encoding is the same as the general default encoding, which in reality means there is no functional change. However, other platforms can simply plug in their own encodings. ...
Mark (or anyone else who understands all this), were doc changes included? Can someone please add a briefer user-oriented blurb to Misc/NEWS too?
[Tim]
Mark (or anyone else who understands all this), were doc changes included? Can someone please add a briefer user-oriented blurb to Misc/NEWS too?
No problem. Where should the "real" documentation go? It seems maybe we need a new sub-heading under the "6.1 - os -- Misc. OS Interface" - something like:
6.1.x - Unicode and the file system
  - general discussion.
  - Windows specific
  - Mac specific should that appear.
  - OS' with no special support (ie, "the rest")
Does that make sense?
I have made this change to Misc/NEWS. Does this look OK (obviously once I know what to replace "[????]" with :)
And-I-will-do-the-registry-docs-at-the-same-time ly,
Mark.
Index: NEWS
===================================================================
RCS file: /cvsroot/python/python/dist/src/Misc/NEWS,v
retrieving revision 1.166
diff -r1.166 NEWS
4a5,21
- Some operating systems now support the concept of a default Unicode encoding for file system operations. Notably, Windows supports 'mbcs' as the default. The Macintosh will also adopt this concept in the medium term, although the default encoding for that platform will be other than 'mbcs'.
On operating systems that support non-ascii filenames, it is common for functions that return filenames (such as os.listdir()) to return Python string objects pre-encoded using the default file system encoding for the platform. As this encoding is likely to be different from Python's default encoding, converting this name to a Unicode object before passing it back to the Operating System would result in a Unicode error, as Python would attempt to use its default encoding (generally ASCII) rather than the default encoding for the file system.
In general, this change simply removes surprises when working with Unicode and the file system, making these operations work as you expect, increasing the transparency of Unicode objects in this context. See [????] for more details, including examples.
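For readers on a current interpreter, the distinction the NEWS entry draws between the file system encoding and the general default encoding can be sketched as follows. This is an illustration in Python 3, not the 2001 patch itself: sys.getfilesystemencoding(), os.fsdecode() and os.fsencode() are later standard-library additions, and the byte-string filename is a hypothetical example.

```python
import os
import sys

# Name of the codec Python uses for filenames; the value is platform dependent.
fs_encoding = sys.getfilesystemencoding()

# A hypothetical pre-encoded filename, as a low-level OS call might return it
# (these bytes happen to be the UTF-8 form of "cafe" with an accented e).
raw_name = b"caf\xc3\xa9.txt"

# Decode with the file system encoding rather than the general default ...
text_name = os.fsdecode(raw_name)

# ... and encode it again before handing it back to the operating system.
round_tripped = os.fsencode(text_name)

print(fs_encoding, repr(text_name), round_tripped == raw_name)
```

The point of the sketch is only that the name survives the trip to text and back when the file system codec is used on both legs.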
[Mark Hammond]
... Where should the "real" documentation go? It seems maybe we need a new sub-heading under the "6.1 - os -- Misc. OS Interface" - something like:
6.1.x - Unicode and the file system - general discussion. - Windows specific - Mac specific should that appear. - OS' with no special support (ie, "the rest")
Does that make sense?
So far as it goes, yes. I think the manual desperately needs a Unicode section for other reasons, though: from traffic on c.l.py, it's clear that few people can figure out how to do *anything* with Unicode now unless their first name begins with "M" (Mark, Martin, Marc -- definitely not Skip <wink>). There's no overview and there are no examples. The primary string method doesn't even mention Unicode (here paraphrasing questions that pop up):
    encode([encoding[,errors]])
    Return an encoded version of the string.
What does "encoded version" mean? Is that another string? An encoding object of some sort? Etc.
    Default encoding is the current default string encoding.
What's the "current default string encoding"? How can I find out? Can't even guess what *type* it has (string? magic object? little integer?). If I don't want the default encoding, how do I specify a different one? What are the possible values? Again, can't even guess the type of the object that needs to be passed for encoding.
    errors may be given to set a different error handling scheme. The default for errors is 'strict', meaning that encoding errors raise a ValueError. Other possible values are 'ignore' and 'replace'.
So what do 'ignore' and 'replace' mean? There's more left unsaid here than a single example could clarify, but there's not even an example -- so people stare at this wholly uncomprehending. If they stumble into the unicode() builtin function (in a different part of the manual, neither referencing nor referenced by the .encode() method), it's no better:
    unicode(string[, encoding[, errors]])
    Decodes string using the codec for encoding.
What? Hard to even guess what the function returns. Maybe, from the name, a Unicode string?
    Error handling is done according to errors.
What?
    The default behavior is to decode UTF-8 in strict mode, meaning that encoding errors raise ValueError.
How do encoding errors arise from a function that *de*codes?
    See also the codecs module.
Which helps, but the relationship between the codecs module and the unicode() function isn't spelled out there either. Look up "encoding" in the index, and you get pointers to base64, quoted-printable and the mimetypes module, which only confuses things more. I don't expect you to fix this <wink>, I'm trying to get across that the Unicode docs need work even without new gimmicks. If Fred agrees, I'm sure he'll think of a good place to put the new info too.
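Several of the questions above about .encode() can be answered by a short session. The following is a sketch in Python 3, so the details differ from the interpreter under discussion: the result here is a bytes object where the 2001 manual meant an 8-bit string, and the sample text is chosen purely for illustration.

```python
import sys

# The "current default string encoding" is an ordinary string naming a codec.
default_encoding = sys.getdefaultencoding()

text = "caf\u00e9"  # sample text with one non-ASCII character

# The encoding argument is likewise just a codec name passed as a string,
# and the result is a byte string, not an "encoding object".
utf8_bytes = text.encode("utf-8")

# 'strict' (the default) raises when the codec cannot represent a character.
strict_error = None
try:
    text.encode("ascii")
except ValueError as exc:  # UnicodeEncodeError is a subclass of ValueError
    strict_error = type(exc).__name__

# 'ignore' drops the offending character; 'replace' substitutes '?'.
ignored = text.encode("ascii", "ignore")
replaced = text.encode("ascii", "replace")

print(default_encoding, utf8_bytes, strict_error, ignored, replaced)
```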
I have made this change to Misc/NEWS. Does this look OK (obviously once I know what to replace "[????]" with :)
Absolutely, and I don't even have to read it to say so <wink>: once *something* is checked in, we're assured it won't get dropped on the floor come release time, and anyone who has any quibbles with it can check in changes. It's not like checking in a NEWS item can break the std test suite or cause HP-UX to crash. well-not-really-sure-about-the-latter-ly y'rs - tim
Tim Peters wrote:
[Mark Hammond]
... Where should the "real" documentation go? It seems maybe we need a new sub-heading under the "6.1 - os -- Misc. OS Interface" - something like:
6.1.x - Unicode and the file system - general discussion. - Windows specific - Mac specific should that appear. - OS' with no special support (ie, "the rest")
Does that make sense?
So far as it goes, yes. I think the manual desperately needs a Unicode section for other reasons, though: from traffic on c.l.py, it's clear that few people can figure out how to do *anything* with Unicode now unless their first name begins with "M" (Mark, Martin, Marc -- definitely not Skip <wink>). There's no overview and there are no examples. The primary string method doesn't even mention Unicode (here paraphrasing questions that pop up): [...]
True. The main source of documentation for Unicode still is the proposal itself (Misc/unicode.txt). It needs some reordering and a few examples, but does contain all the information needed to grasp what the implementation intends and how it works. If that's still not enough, there are numerous doc-strings in the codecs.py module, more technical docs in the API reference and finally the unicodeobject.h header file itself. Another source for documentation and examples is the i18n-sig page on python.org.
--
Marc-Andre Lemburg
______________________________________________________________________
Company & Consulting: http://www.egenix.com/
Python Software: http://www.lemburg.com/python/
"M" == M <mal@lemburg.com> writes:
M> True. The main source of documentation for Unicode still is the proposal itself (Misc/unicode.txt). It needs some reordering and a few examples, but does contain all the information needed to grasp what the implementation intends and how it works.
As a first step, why not PEP-ify that document, much as has been done with the DB-API (versions 1 & 2)? It can be an informational PEP.
-Barry
I don't know that the Unicode docs need massive work, but the docs that are there simply don't answer the technical questions people have: they're too thin. Let's keep it simple. Contrast the Library manual's:
    unicode(string[, encoding[, errors]])
    Decodes string using the codec for encoding. Error handling is done according to errors. The default behavior is to decode UTF-8 in strict mode, meaning that encoding errors raise ValueError. See also the codecs module.
with Andrew's description (from http://www.amk.ca/python/2.0/):
    unicode(string [, encoding] [, errors])
    Creates a Unicode string from an 8-bit string. encoding is a string naming the encoding to use. The errors parameter specifies the treatment of characters that are invalid for the current encoding; passing 'strict' as the value causes an exception to be raised on any encoding error, while 'ignore' causes errors to be silently ignored and 'replace' uses U+FFFD, the official replacement character, in case of any problems.
The latter addresses several *fundamental* questions untouched by the former, like what are the datatypes of the arguments and the result, what values does errors accept, and what do they mean? The first blurb answers some more, like what's the default encoding, and which exception is raised? Neither is complete on its own, but the reference manual should have a complete answer to all such questions. It doesn't have to go on at great length. A round-trip example would be invaluable. If Fred wanted to incorporate a brief overview too, a light rework of Andrew/Moshe's writeup would be an excellent start.
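A round-trip example of the kind asked for here might look like the sketch below. It assumes Python 3, where the unicode() builtin described above no longer exists and bytes.decode() plays its role; the sample text is chosen purely for illustration.

```python
original = "Gr\u00fc\u00dfe"  # sample text with two non-ASCII characters

# Encode to an 8-bit string, then decode with the same codec: a lossless trip.
encoded = original.encode("utf-8")
decoded = encoded.decode("utf-8")

# Errors arise while *de*coding when the bytes are not valid in the named codec.
strict_error = None
try:
    encoded.decode("ascii")  # errors='strict' is the default
except ValueError as exc:  # UnicodeDecodeError is a subclass of ValueError
    strict_error = type(exc).__name__

# 'ignore' silently drops undecodable bytes;
# 'replace' puts U+FFFD in place of each one.
ignored = encoded.decode("ascii", "ignore")
replaced = encoded.decode("ascii", "replace")

print(decoded == original, strict_error, repr(ignored), repr(replaced))
```

With the ASCII codec, each of the four non-ASCII bytes is handled separately, so 'replace' yields four replacement characters even though only two characters were lost.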
Tim Peters wrote:
I don't know that the Unicode docs need massive work, but the docs that are there simply don't answer the technical questions people have: they're too thin.
As much as I would like to work on this, I simply don't have the time... if someone wants to contribute more detailed docs, though, I'd be glad to review them and answer remaining questions. Note that I will give a talk at the upcoming Bordeaux conference about Python and Unicode. The slides will eventually go online after the conference (in July). BTW, are any python-devs attending the conference (they have some great wine in that part of France ;-) ?
--
Marc-Andre Lemburg
______________________________________________________________________
Company & Consulting: http://www.egenix.com/
Python Software: http://www.lemburg.com/python/
Tim Peters writes:
The latter addresses several *fundamental* questions untouched by the former, like what are the datatypes of the arguments and the result, what values does errors accept, and what do they mean? The first blurb answers some more, like what's the default encoding, and which exception is raised? Neither is complete on its own, but the reference manual should have a complete answer to all such questions. It doesn't have to go on at great length.
I've beefed up the description of the unicode() function by merging the information from AMK's document.
A round-trip example would be invaluable.
If Fred wanted to incorporate a brief overview too, a light rework of Andrew/Moshe's writeup would be an excellent start.
I'd love to have a contribution from someone with more knowledge of what's there than me.
-Fred
--
Fred L. Drake, Jr. <fdrake at acm.org>
PythonLabs at Digital Creations
participants (5)
- barry@digicool.com
- Fred L. Drake, Jr.
- M.-A. Lemburg
- Mark Hammond
- Tim Peters