[Python-3000] Pre-PEP: Easy Text File Decoding

Paul Prescod paul at prescod.net
Sun Sep 10 21:02:44 CEST 2006


On 9/10/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
>
> Here is a very simple, reasonably (although not completely) safe, and much
> more predictable guessing algorithm, based on a generalization of
> <http://www.w3.org/TR/REC-xml/#sec-guessing>:


Your algorithm is more predictable, but it will frequently mistake BOM-less
UTF-8 for the system encoding. I haven't decided in my own mind whether that
trade-off is worth making. It will work well for:

 * Windows users, who will often find a BOM in their UTF-8

 * Western Unix/Linux users who will increasingly use UTF-8 as their system
encoding

It will not work well for:

 * Eastern Unix/Linux users using UTF-8 apps like gedit or apps "saving as"
UTF-8

 * Mac users using UTF-8 apps or saving as UTF-8.

I still haven't decided how I feel about that trade-off.
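
To make the failure mode concrete, here is a small sketch of what happens
when BOM-less UTF-8 is read as a non-UTF-8 system encoding. The quoted
algorithm itself is not reproduced in this message, so this only illustrates
the confusion described above; using "cp1252" as the system encoding is an
assumption chosen for the example.

```python
import codecs

text = "café"
raw = text.encode("utf-8")  # BOM-less UTF-8, e.g. a file "saved as" UTF-8

# Assumption: "cp1252" stands in for a non-UTF-8 system encoding.
# Decoding succeeds without any exception, but the text is wrong.
misread = raw.decode("cp1252")

# With a BOM in front, the encoding can be recognized unambiguously.
with_bom = codecs.BOM_UTF8 + raw
detected = with_bom.decode("utf-8-sig")  # the BOM is stripped on decode

print(repr(misread))   # 'cafÃ©'
print(repr(detected))  # 'café'
```

The misread case raises no error at all, which is why the guess is hard to
notice when it goes wrong.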

Maybe the guessing algorithm should read the WHOLE FILE. After all, we've
said repeatedly that it isn't for production use so making it a bit
inefficient is not a big problem and might even emphasize that point.

Modern I/O is astonishingly fast anyhow. On my computer it takes five
seconds to decode a quarter gigabyte of UTF-8 text through Python. That
would be a totally unacceptable waste for a production program, but for a
quick hack it wouldn't be bad. And it would guarantee that you would never
get an exception half-way through your parsing because of a bad character.
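
A rough sketch of what a whole-file guesser could look like, not a proposed
implementation. The function name guess_and_decode, the BOM table, and the
order of the fallbacks (BOM, then strict UTF-8, then the system encoding) are
all assumptions made for illustration.

```python
import codecs
import locale

# Hypothetical BOM table. UTF-32 is checked before UTF-16 because the
# UTF-32-LE BOM begins with the same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_and_decode(data, fallback=None):
    """Decode the complete contents of a file, guessing the encoding.

    Returns (text, encoding_name). Because all of `data` is decoded up
    front, any decoding error surfaces here rather than part-way through
    later parsing.
    """
    # 1. A BOM, if present, decides the encoding.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return data.decode(name), name
    # 2. Otherwise try strict UTF-8 over the whole file.
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    # 3. Fall back to the system encoding (assumption: the locale's
    #    preferred encoding unless the caller supplies one).
    name = fallback or locale.getpreferredencoding(False)
    return data.decode(name), name


def read_text_guessing(path, fallback=None):
    """Read a whole file in binary mode and decode it with a guess."""
    with open(path, "rb") as f:
        return guess_and_decode(f.read(), fallback)
```

Since non-UTF-8 text containing non-ASCII bytes rarely happens to be valid
UTF-8 from start to finish, checking the whole file should make step 2 a
much stronger test than checking only a prefix. The final fallback can still
raise, but it does so before any parsing begins.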

 Paul Prescod