<span style="font-family: courier new,monospace;">PEP: XXX</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">Title: Easy Text File Decoding</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">Version: $Revision$</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">Last-Modified: $Date$</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">Author: Paul Prescod &lt;<a href="mailto:paul@prescod.net">paul@prescod.net</a>&gt;</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">
Status: Draft</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">Type: Standards Track</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">
Content-Type: text/x-rst</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">Created: 09-Sep-2006</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">
Post-History: 09-Sep-2006</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">Python-Version: 3.0</span><br style="font-family: courier new,monospace;"><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">Abstract</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">========</span><br style="font-family: courier new,monospace;">
<br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">Python 3000 will use Unicode as the standard string type. This means that text files read from disk will be "decoded" into Unicode code points just as binary files might be decoded into integers and structures. This change brings to the fore a few issues that were previously ignorable.</span><br style="font-family: courier new,monospace;"><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">For example, in Python 2.x, it was possible to open a text file, read the data into a Python string, filter some lines and print the remaining lines to the console without ever considering what "encoding" the text was in. In Python 3000, the programmer will only get access to Python's powerful string manipulation functions after decoding the data to Unicode code points. This means that either the programmer or the Python runtime must select a decoding algorithm (by naming the encoding algorithm that was used to encode the data in the first place).
</span><br style="font-family: courier new,monospace;"><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">Often the programmer can do so based upon out-of-band knowledge ("this file format is always UCS-2" or "the protocol header says that this data is latin-1"). In other cases, the programmer may be more naive or simply wish to avoid thinking about it and would rather defer the issue to Python.
</span><br style="font-family: courier new,monospace;"><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">This document presents a proposal for algorithms and APIs that Python can use to simplify the programmer's life.
</span><br style="font-family: courier new,monospace;"><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">Issues outside the scope of this PEP</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">=====================================</span><br style="font-family: courier new,monospace;"><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">
Any programmer who wishes to take direct control of the encoding selection may of course ignore the features described in this PEP and choose an encoding explicitly. The PEP is not intended to constrain them in any way.</span>
<br style="font-family: courier new,monospace;"><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">Bytes received through means other than the file system are not addressed by this PEP. For example, the PEP does not address data directly read from a socket or returned from marshal functions.
</span><br style="font-family: courier new,monospace;"><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">Rationale</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">
==========</span><br style="font-family: courier new,monospace;"><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">The simplest possible use case for Python text processing involves a user maintaining some form of simple database (e.g. an address book) as a text file and processing it with Python. Unfortunately, this use case is not as simple as it should be because of the variety of encodings in the universe. For example, the file might be UTF-8, ISO-8859-1 or ISO-8859-2.
</span><br style="font-family: courier new,monospace;"><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">Professional programmers making widely distributed programs probably have no alternative but to deal with this variability head-on. But programmers working with data that originates and resides primarily on their own computer might wish to avoid dealing with it. They would like Python to just "try to do the right thing" with respect to the file. They would like to think about encodings if and only if Python fails to guess appropriately.
</span><br style="font-family: courier new,monospace;"><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">Proposal</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">
========</span><br style="font-family: courier new,monospace;"><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">The function to open a text file will tentatively be called textfile(), though the function name is not an integral part of this PEP. The function takes three arguments: the filename, the mode ("r", "w", "r+", etc.) and the type.
</span><br style="font-family: courier new,monospace;"><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">The type could be a true encoding or one of a small set of additional symbolic values. The main symbolic values are:
</span><br style="font-family: courier new,monospace;"><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> * "site" -- the default value, which invokes a site-specific algorithm. For example, a Japanese school teacher using Windows might default "site" to Shift-JIS. An organization dealing with a small number of encodings might default "site" to be equivalent to "guess". An organization with a strict internationalization policy might default "site" to "UTF-8". An important open issue is what Python's out-of-box interpretation of "site" should be. This is key because "site" is the default value so Python's out-of-box behaviour is the "default default".
</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"></span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> * "guess" -- the value to be used by encoding-inexpert programmers and experts who feel confident that Python's guessing algorithm will produce adequate results for their purposes. The guessing algorithm will necessarily be complicated and may change over time. It will take into account the following factors:
</span><br style="font-family: courier new,monospace;"><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> - the conventions dominant on the operating system of choice</span>
<br style="font-family: courier new,monospace;"><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> - any localization-relevant settings available</span><br style="font-family: courier new,monospace;">
<br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> - a certain number of bytes at the start of the file (perhaps start and end?). This sample will likely be on the order of thousands of bytes.
</span><br style="font-family: courier new,monospace;"><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> - filesystem metadata attached to the file (in strong preference to the above).
</span><br style="font-family: courier new,monospace;"><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> * "locale" -- the encoding suggested by the operating system's locale concept
</span><br style="font-family: courier new,monospace;"><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">Other symbolic values might allow the programmer to suggest specific encoding detection algorithms like XML [#XML-encoding-detection]_, HTML
</span><span style="font-family: courier new,monospace;">[#HTML-encoding-detection]_</span><span style="font-family: courier new,monospace;"> and the "coding:" comment convention. These would be specified in separate PEPs.
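</span><br style="font-family: courier new,monospace;"><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">As an illustration only, the following minimal sketch shows how the "type" argument might be resolved to a concrete encoding. It is not a specification: the names guess_encoding, resolve_type, site_default and SAMPLE_SIZE are hypothetical, the LookupError is just one possible exception, and the toy guesser (byte-order mark, then strict UTF-8, then the locale) ignores factors such as filesystem metadata that a real guessing algorithm would consider.</span>
<pre style="font-family: courier new,monospace;">
```python
import codecs
import locale

# Hypothetical stand-in for the site-wide setting; None models
# "no out-of-box default".
site_default = None

# Number of leading bytes sampled by the toy guesser (an assumption).
SAMPLE_SIZE = 4096

# Byte-order marks recognised by the toy guesser (an assumption).
BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def guess_encoding(sample, final=True):
    """Guess an encoding from bytes sampled at the start of a file."""
    for bom, name in BOMS:
        if sample.startswith(bom):
            return name
    try:
        # With final=False the incremental decoder tolerates a sample
        # that was cut off in the middle of a multi-byte sequence.
        codecs.getincrementaldecoder("utf-8")().decode(sample, final)
        return "utf-8"
    except UnicodeDecodeError:
        # Fall back to the encoding suggested by the locale.
        return locale.getpreferredencoding(False)


def resolve_type(filename, mode, type):
    """Turn the symbolic "type" argument into a concrete encoding name."""
    if type == "site":
        if site_default is None:
            raise LookupError("no encoding given and no site default set")
        type = site_default
    if type == "locale":
        return locale.getpreferredencoding(False)
    if type == "guess":
        if "r" not in mode:
            # Assumption: there is nothing to sample when creating a file.
            return locale.getpreferredencoding(False)
        with open(filename, "rb") as sample_file:
            sample = sample_file.read(SAMPLE_SIZE)
        return guess_encoding(sample, final=len(sample) != SAMPLE_SIZE)
    return type  # anything else is treated as a true encoding name


def textfile(filename, mode="r", type="site"):
    """Open a text file, resolving the encoding as sketched above."""
    return open(filename, mode, encoding=resolve_type(filename, mode, type))
```
</pre><span style="font-family: courier new,monospace;">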
<br><br></span><span style="font-family: courier new,monospace;">The Site Decoding Hook</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">========================</span>
<br style="font-family: courier new,monospace;"><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">The "sys" module could have a function called "setdefaultfileencoding". The encoding specified could be a true encoding name or one of the encoding detection scheme names (e.g. "guess" or "XML").</span><br style="font-family: courier new,monospace;"><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">In addition, it should be possible to register new encoding detection schemes using a function like "sys.registerencodingdetector". This function would take two arguments, a string (the scheme name) and a callable. The callable would accept a byte stream argument and return a text stream. The contract for these detection scheme implementations must allow them to peek ahead some bytes to use the content as a hint to the encoding.
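</span><br style="font-family: courier new,monospace;"><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">The contract described above might look roughly like the following sketch. It is written as plain module-level functions rather than as part of "sys"; the site_config dictionary and the bom_detector example are hypothetical, and falling back to strict UTF-8 when no byte-order mark is found is an assumption made only for the example. The detector peeks at the leading bytes without consuming them, then returns a text stream wrapped around the same byte stream.</span>
<pre style="font-family: courier new,monospace;">
```python
import codecs
import io

# Hypothetical state standing in for whatever "sys" would hold.
site_config = {"default": None, "detectors": {}}


def setdefaultfileencoding(name):
    """Set the encoding or detection scheme name that "site" resolves to."""
    site_config["default"] = name


def registerencodingdetector(name, detector):
    """Register a callable that maps a byte stream to a text stream."""
    if not callable(detector):
        raise TypeError("detector must be callable")
    site_config["detectors"][name] = detector


def bom_detector(byte_stream):
    """Example detector: peek at the leading bytes and look for a BOM."""
    buffered = io.BufferedReader(byte_stream)
    head = buffered.peek(4)[:4]  # peeking does not advance the stream
    if head.startswith(codecs.BOM_UTF8):
        encoding = "utf-8-sig"
    elif head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        encoding = "utf-16"
    else:
        encoding = "utf-8"  # assumption: default to strict UTF-8
    return io.TextIOWrapper(buffered, encoding=encoding)


# Usage: register the scheme, make it the site default, then decode
# a UTF-16 byte stream through it.
registerencodingdetector("bom", bom_detector)
setdefaultfileencoding("bom")
detector = site_config["detectors"][site_config["default"]]
text = detector(io.BytesIO(b"\xff\xfeh\x00i\x00")).read()  # "hi"
```
</pre><span style="font-family: courier new,monospace;">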
</span><br style="font-family: courier new,monospace;"><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">Alternatives and Open Issues</span><br style="font-family: courier new,monospace;">
<span style="font-family: courier new,monospace;">==============================</span><br style="font-family: courier new,monospace;"><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">
1. Guido proposes that the function be called merely "open". His proposal is that the binary open should be the alternative and should be invoked explicitly with a "b" mode switch. The PEP author feels, first, that changing the behaviour of an existing function is more confusing and disruptive than creating another. Backporting a change to the "open" function would be difficult and therefore it would be unnecessarily difficult to create file-manipulating libraries that work on both Python
2.x and 3.x.</span><br style="font-family: courier new,monospace;"><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">Second, the author feels that "open" is an unnecessarily cryptic name rooted only in Unix/C history. For a programmer coming from (for example) JavaScript, open() would tend to imply "open window". The PEP author believes that factory functions should say what they are creating.
</span><br style="font-family: courier new,monospace;"><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">2. There is substantial disagreement on the behaviour of the function when there is no encoding argument passed and no site override (i.e. the out-of-box default). Current proposals include ASCII (on the basis that it is a nearly universal subset of popular encodings), UTF-8 (on the basis that it is the dominant global standard encompassing all of Unicode), a locale-derived encoding (on the basis that this is what a naive user will generate in a text editor) and the guessing algorithm (on the basis that it is by definition designed to guess right more often than any more specific encoding name).
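</span><br style="font-family: courier new,monospace;"><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">The practical difference between strict and permissive candidates can be seen by decoding the same bytes with each of them. The short sketch below uses ISO-8859-1 as an example of what a locale-derived default might be; the helper try_decode and the sample word are chosen purely for illustration.</span>
<pre style="font-family: courier new,monospace;">
```python
def try_decode(raw, encoding):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


utf8_bytes = "caf\u00e9".encode("utf-8")         # b'caf\xc3\xa9'
latin1_bytes = "caf\u00e9".encode("iso-8859-1")  # b'caf\xe9'

candidates = ("ascii", "utf-8", "iso-8859-1")
utf8_results = {name: try_decode(utf8_bytes, name) for name in candidates}
latin1_results = {name: try_decode(latin1_bytes, name) for name in candidates}

# Strict codecs refuse bytes they cannot interpret:
#   utf8_results["ascii"] is None
#   latin1_results["utf-8"] is None
# ISO-8859-1 accepts any byte sequence, so UTF-8 input is silently
# turned into the wrong characters:
#   utf8_results["iso-8859-1"] == "caf\u00c3\u00a9"
```
</pre><span style="font-family: courier new,monospace;">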
</span><br style="font-family: courier new,monospace;"><br><span style="font-family: courier new,monospace;">The PEP author
strongly advocates a strict encoding like ASCII, UTF-8 or no default at
all (in which case the lack of an encoding would raise an exception). A
default like iso-8859-1 (even inferred from the environment) will
result in encodings like UTF-8, UCS-2 and even binary files being
"interpreted" as gibberish strings. This could result in document or
database corruption. A "guess" default will encourage the widespread creation of very unreliable code.<br><br>The current proposal is to have no out-of-box default until
some point in the future when a small set of auto-detectable encodings is globally
dominant. UTF-8 has gradually been gaining popularity through W3C and
other standards so it is possible that five years from now it will be
the "no-brainer" default. Until we can guess with substantial confidence, the absence of both an encoding declaration and a site override should raise an exception.<br><br>References<br>==========<br><br>
</span><span style="font-family: courier new,monospace;">.. [#XML-encoding-detection] XML Encoding Detection algorithm: </span><span style="font-family: courier new,monospace;"><a href="http://www.w3.org/TR/REC-xml/#sec-guessing">
http://www.w3.org/TR/REC-xml/#sec-guessing</a><br>
</span><span style="font-family: courier new,monospace;">.. [#HTML-encoding-detection] HTML Encoding Detection algorithm: </span><span style="font-family: courier new,monospace;"><a href="http://www.w3.org/TR/REC-xml/#sec-guessing">
http://www.w3.org/TR/REC-xml/#sec-guessing</a></span><br style="font-family: courier new,monospace;">
<br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"></span><span style="font-family: courier new,monospace;"></span><span style="font-family: courier new,monospace;">Copyright
</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">=========</span><br style="font-family: courier new,monospace;"><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">
This document has been placed in the public domain.</span><br style="font-family: courier new,monospace;"><br style="font-family: courier new,monospace;"><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">
</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">..</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> Local Variables:
</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> mode: indented-text</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">
indent-tabs-mode: nil</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> sentence-end-double-space: t</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">
fill-column: 70</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;"> coding: utf-8</span><br style="font-family: courier new,monospace;"><span style="font-family: courier new,monospace;">
End:</span><br style="font-family: courier new,monospace;"><br>