On Mon, Sep 20, 2010 at 6:19 PM, Nick Coghlan <span dir="ltr">&lt;<a href="mailto:ncoghlan@gmail.com">ncoghlan@gmail.com</a>&gt;</span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">


<div><div class="h5">

&gt; What are the cases you believe will cause new mojibake?<br>

<br>

</div></div>Calling operations like urlsplit on byte sequences in non-ASCII<br>

compatible encodings and operations like urljoin on byte sequences<br>

that are encoded with different encodings. These errors differ from<br>

the URL escaping errors you cite, since they can produce true mojibake<br>

(i.e. a byte sequence without a single consistent encoding), rather<br>

than merely non-compliant URLs. However, if someone has let their<br>

encodings get that badly out of whack in URL manipulation they&#39;re<br>

probably doomed anyway...<br></blockquote><div><br>FWIW, while I understand the problems non-ASCII-compatible encodings can create, I&#39;ve never encountered them, perhaps because ASCII-compatible encodings are so dominant.<br>


<br>There are ways to get a URL (over HTTP specifically) that carry no notion of Unicode. I think the use case everyone has in mind here is that you get a URL from one of these sources and want to handle it. I have a hard time imagining a sequence of events that would lead to mojibake. Naive parsing of a document as bytes couldn&#39;t do it, because if the document is in a non-ASCII-compatible encoding, your ASCII-based parsing will fail as well (e.g., looking for b&#39;href=&quot;(.*?)&quot;&#39;). I suppose if you did urlparse.urlsplit(user_input.encode(sys.getdefaultencoding())) you could end up with the problem.<br>
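<br>As a rough sketch of the naive-parsing point: the sample fragment and variable names below are made up for illustration, and UTF-16 stands in for any non-ASCII-compatible encoding. The ASCII byte pattern matches the UTF-8 bytes but finds nothing in the UTF-16 bytes, so no URL bytes are extracted that could later be mangled.<br>

```python
import re

# Hypothetical sample: an attribute fragment with a non-ASCII path.
fragment = 'href="http://example.com/caf\u00e9"'

# The ASCII-based byte pattern mentioned above.
pattern = re.compile(rb'href="(.*?)"')

# UTF-8 is ASCII-compatible, so the byte pattern still finds the URL.
utf8_match = pattern.search(fragment.encode("utf-8"))

# UTF-16 interleaves NUL bytes with the ASCII characters, so the same
# pattern finds nothing and no URL bytes are ever pulled out.
utf16_match = pattern.search(fragment.encode("utf-16-le"))

print(utf8_match.group(1))
print(utf16_match)
```

<br>The failure here is a missed match rather than a corrupted result.<br>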


<br>All this is unrelated to the question, though -- a separate byte-oriented function won&#39;t help in any case I can think of. If the programmer writes something like urlparse.urlsplit(user_input.encode(sys.getdefaultencoding())), it&#39;s because they *want* to get bytes out. So if the function is named urlparse.urlsplit_bytes() they&#39;ll just use that, with the same corruption. Since bytes and text don&#39;t interact well, the choice of bytes in and bytes out will be a deliberate one. *Or*, bytes will come through unintentionally, but that will just delay the error until the bytes that come out don&#39;t work (e.g., urlparse.urljoin(text_url, urlparse.urlsplit(byte_url).path)). Delaying the error is a little annoying, but a delayed error doesn&#39;t lead to mojibake.<br>
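<br>A minimal sketch of the delayed-error case, assuming the polymorphic behavior as it exists in Python 3&#39;s urllib.parse (the module is named urlparse in the discussion above). The URLs and variable names are invented for illustration.<br>

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical inputs: one text URL, one URL that arrived as bytes.
text_url = "http://example.com/base/"
byte_url = b"http://example.com/other/page"

# Bytes in, bytes out: the path component is a bytes object.
path = urlsplit(byte_url).path

# Mixing the bytes path with a text base URL is rejected outright,
# so the mistake surfaces as an exception, not as a corrupted URL.
try:
    urljoin(text_url, path)
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(path, mixed_ok)
```

<br>The error shows up one call later than the actual mistake, but nothing is silently combined across bytes and text.<br>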


<br>Mojibake is caused by allowing bytes and text to intermix, and the polymorphic functions as proposed don&#39;t add new dangers in that regard.<br><br clear="all"></div></div>-- <br>Ian Bicking  |  <a href="http://blog.ianbicking.org">http://blog.ianbicking.org</a><br>