[Python-3000] Pre-PEP: Easy Text File Decoding

Antoine Pitrou solipsis at pitrou.net
Sun Sep 10 21:36:48 CEST 2006

On Sunday, 10 September 2006 at 12:02 -0700, Paul Prescod wrote:
> Your algorithm is more predictable but will confuse BOM-less UTF-8
> with the system encoding frequently.

I don't think it is desirable to recognize only some kinds of UTF-8
(those with a BOM). It will confuse the hell out of programmers and users.

I'm not sure full-blown statistical analysis is necessary anyway. There
should be an ordered list of detectable encodings, which realistically
would be [all Unicode variants, system default]. Then if you have a file
which is syntactically valid UTF-8, it most likely /is/ UTF-8 and not
ISO-8859-1 (for example).
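
As a rough sketch of such an ordered list (the helper name guess_encoding,
its fallback parameter and the exact candidate order are assumptions made
for illustration, not an existing or proposed API): BOM-marked Unicode
variants are checked first, then BOM-less UTF-8 by strict decoding, and
only then the system default.

```python
import codecs
import locale

# Hypothetical candidate table. UTF-32 is listed before UTF-16 because the
# UTF-32-LE BOM begins with the same two bytes as the UTF-16-LE BOM.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data, fallback=None):
    """Return a codec name for the bytes in `data` (hypothetical helper)."""
    # 1. Unicode variants that announce themselves with a BOM.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # 2. BOM-less UTF-8: accepted only if the bytes decode strictly.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # 3. Last resort: the caller's fallback, else the system default.
    return fallback or locale.getpreferredencoding(False)
```

Note that pure ASCII input is reported as UTF-8 here, which is harmless
since ASCII is a subset of UTF-8. BOM-less UTF-16 and UTF-32 are not
handled by this sketch.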

> Modern I/O is astonishingly fast anyhow. On my computer it takes five
> seconds to decode a quarter gigabyte of UTF-8 text through Python.

Maybe we shouldn't be that presumptuous. Modern I/O is fast but memory
is not infinite. That quarter gigabyte will have swapped out other
data/code in order to make room in the filesystem cache.
Also, Python is often used on more modest hardware.
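
One way to bound the cost, shown only as a sketch and not something
proposed in this thread, is to validate a fixed-size sample from the start
of the file instead of the whole file. The function name sample_is_utf8
and the 64 KiB default are assumptions. An incremental decoder is used so
that a multi-byte sequence cut off at the end of the sample is not
mistaken for invalid UTF-8.

```python
import codecs


def sample_is_utf8(path, sample_size=64 * 1024):
    """Check whether the first `sample_size` bytes of a file are valid UTF-8.

    Hypothetical helper: only the sample is read, so memory use stays
    bounded regardless of the file size.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    try:
        # final=False lets the decoder buffer an incomplete trailing
        # sequence instead of raising on it.
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True
```

The trade-off is that a file whose first non-ASCII byte lies beyond the
sample will pass the check whatever its real encoding is.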
