On 9/10/06, <b class="gmail_sendername">David Hopwood</b> <<a href="mailto:david.nospam.hopwood@blueyonder.co.uk">david.nospam.hopwood@blueyonder.co.uk</a>> wrote:<div><span class="gmail_quote"></span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Here is a very simple, reasonably (although not completely) safe, and much<br>more predictable guessing algorithm, based on a generalization of<br><<a href="http://www.w3.org/TR/REC-xml/#sec-guessing">http://www.w3.org/TR/REC-xml/#sec-guessing
</a>>:</blockquote><div><br>Your algorithm is more predictable, but it will frequently mistake BOM-less UTF-8 for the system encoding. I haven't decided in my own mind whether that trade-off is worth making. It will work well for:
<br><br> * Windows users, who will often find a BOM in their UTF-8<br><br> * Western Unix/Linux users who will increasingly use UTF-8 as their system encoding<br><br>It will not work well for:<br><br> * Eastern Unix/Linux users using UTF-8 apps like gedit or apps "saving as" UTF-8
<br><br> * Mac users using UTF-8 apps or saving as UTF-8.<br><br>I still haven't decided how I feel about that trade-off.<br><br>Maybe the guessing algorithm should read the WHOLE FILE. After all, we've said repeatedly that it isn't for production use, so making it a bit inefficient is not a big problem and might even emphasize that point.
<br><br>Modern I/O is astonishingly fast anyhow. On my computer it takes five seconds to decode a quarter gigabyte of UTF-8 text through Python. That would be a totally unacceptable waste for a production program, but for a quick hack it wouldn't be bad. And it would guarantee that you would never get an exception half-way through your parsing because of a bad character.
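<br><br>To make the whole-file idea concrete, here is a minimal sketch in present-day Python. It is only an illustration under stated assumptions, not David's algorithm: the function name guess_and_decode and its fallback parameter are hypothetical, and the order of checks (BOM first, then strict UTF-8 over the entire contents, then the system encoding) is one plausible reading of the proposal.

```python
import codecs
import locale


def guess_and_decode(data, fallback=None):
    """Decode the complete contents of a file and report the encoding used.

    Hypothetical helper: `data` is the whole file as bytes, `fallback`
    is the encoding to try last (defaults to the system encoding).
    """
    # A BOM is the strongest hint. UTF-32 must be checked before UTF-16
    # because the UTF-32 LE BOM begins with the UTF-16 LE BOM.
    boms = [
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            # These codecs consume the BOM themselves.
            return data.decode(name), name

    # No BOM: accept UTF-8 only if every byte of the file decodes cleanly.
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass

    # Otherwise assume the system encoding. This can still raise, but it
    # does so up front rather than half-way through parsing.
    if fallback is None:
        fallback = locale.getpreferredencoding(False)
    return data.decode(fallback), fallback
```

Because the whole file is decoded before any parsing starts, a bad byte shows up immediately. Note that if the system encoding is itself UTF-8, the fallback cannot rescue a file that already failed strict UTF-8 decoding.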
<br><br> Paul Prescod<br><br></div></div>