On 9/10/06, <b class="gmail_sendername">David Hopwood</b> &lt;<a href="mailto:david.nospam.hopwood@blueyonder.co.uk">david.nospam.hopwood@blueyonder.co.uk</a>&gt; wrote:<div><span class="gmail_quote"></span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

Here is a very simple, reasonably (although not completely) safe, and much<br>more predictable guessing algorithm, based on a generalization of<br>&lt;<a href="http://www.w3.org/TR/REC-xml/#sec-guessing">http://www.w3.org/TR/REC-xml/#sec-guessing

</a>&gt;:</blockquote><div><br>Your algorithm is more predictable, but it will frequently mistake BOM-less UTF-8 for the system encoding. I haven't decided in my own mind whether that trade-off is worth making. It will work well for:

<br><br>&nbsp;* Windows users, who will often find a BOM in their UTF-8<br><br>&nbsp;* Western Unix/Linux users who will increasingly use UTF-8 as their system encoding<br><br>It will not work well for:<br><br>&nbsp;* Eastern Unix/Linux users using UTF-8 apps like gedit or apps &quot;saving as&quot; UTF-8

<br><br>&nbsp;* Mac users using UTF-8 apps or saving as UTF-8<br><br>I still haven't decided how I feel about that trade-off. Maybe the guessing algorithm should read the WHOLE FILE. After all, we've said repeatedly that it isn't for production use, so making it a bit inefficient is not a big problem and might even emphasize that point.

<br><br>Modern I/O is astonishingly fast anyhow. On my computer it takes five seconds to decode a quarter gigabyte of UTF-8 text through Python. That would be a totally unacceptable waste for a production program, but for a quick hack it wouldn't be bad. And because the whole file has already been checked against the guessed encoding, it would guarantee that you never get an exception halfway through your parsing because of a bad character.
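<br><br>To make the whole-file idea concrete, here is a rough sketch, not a proposal for the actual algorithm. The names guess_encoding, read_text and fallback are made up for illustration, and it assumes that anything which is not valid UTF-8 from start to finish should be treated as the system encoding. Pure ASCII files (and 7-bit encodings generally) will be reported as UTF-8, which may or may not be what we want.
<pre>
```python
import codecs
import locale


def guess_encoding(data, fallback=None):
    """Guess the encoding of a complete file's bytes (hypothetical helper)."""
    # A UTF-8 BOM is an unambiguous signal; Windows editors often write one.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    try:
        # Validate the WHOLE file, not just a prefix, so a bad byte
        # near the end is caught here rather than in mid-parse.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: assume the system encoding (an assumption of this sketch).
        return fallback or locale.getpreferredencoding(False)


def read_text(path):
    """Read a file fully and decode it with the guessed encoding."""
    with open(path, "rb") as f:
        data = f.read()
    return data.decode(guess_encoding(data))
```
</pre>
This does the decoding work twice for UTF-8 files, once to guess and once for real, which fits the "not for production use" message. Note that decoding with the fallback encoding can still raise if the file is not valid in that encoding either.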

<br><br>&nbsp;Paul Prescod<br><br></div></div>