On Mon, Sep 20, 2010 at 6:19 PM, Nick Coghlan <span dir="ltr"><<a href="mailto:ncoghlan@gmail.com">ncoghlan@gmail.com</a>></span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<div><div class="h5">
> What are the cases you believe will cause new mojibake?<br>
<br>
</div></div>Calling operations like urlsplit on byte sequences in non-ASCII<br>
compatible encodings and operations like urljoin on byte sequences<br>
that are encoded with different encodings. These errors differ from<br>
the URL escaping errors you cite, since they can produce true mojibake<br>
(i.e. a byte sequence without a single consistent encoding), rather<br>
than merely non-compliant URLs. However, if someone has let their<br>
encodings get that badly out of whack in URL manipulation they're<br>
probably doomed anyway...<br></blockquote><div><br>FWIW, while I understand the problems non-ASCII-compatible encodings can create, I've never encountered them, perhaps because ASCII-compatible encodings are so dominant.<br>
<br>There are ways you can get a URL (over HTTP specifically) in which there is no notion of Unicode at all. I think the use case everyone has in mind here is getting a URL from one of these sources and wanting to handle it. I have a hard time imagining a sequence of events that would lead to mojibake. Naive parsing of a document as bytes couldn't do it, because with a non-ASCII-compatible document your ASCII-based parsing will also fail (e.g., searching for b'href="(.*?)"' will find nothing). I suppose if you did urlparse.urlsplit(user_input.encode(sys.getdefaultencoding())) you could end up with the problem.<br>
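To make that concrete, here is a minimal sketch (the document content and encoding choices are illustrative assumptions) of why naive byte-level parsing fails outright on a non-ASCII-compatible document rather than producing mojibake:

```python
import re

# An ASCII-based byte pattern for extracting hrefs, as described above.
pattern = re.compile(b'href="(.*?)"')

doc = '<a href="http://example.com/page">link</a>'

# ASCII-compatible encoding: the byte pattern matches as expected.
assert pattern.search(doc.encode('utf-8')).group(1) == b'http://example.com/page'

# Non-ASCII-compatible encoding: UTF-16 interleaves NUL bytes, so the
# ASCII byte pattern finds nothing -- parsing fails before any
# mojibake can be produced.
assert pattern.search(doc.encode('utf-16')) is None
```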
<br>All this is unrelated to the question, though -- a separate byte-oriented function won't help any case I can think of. If the programmer is implementing something like urlparse.urlsplit(user_input.encode(sys.getdefaultencoding())), it's because they *want* to get bytes out. So if it's named urlparse.urlsplit_bytes() they'll just use that, with the same corruption. Since bytes and text don't interact well, the choice of bytes in and bytes out will be a deliberate one. *Or*, bytes will unintentionally come through, but that will just delay the error a while, until the bytes out don't work (e.g., urlparse.urljoin(text_url, urlparse.urlsplit(byte_url).path)). Delaying the error is a little annoying, but a delayed error doesn't lead to mojibake.<br>
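A hedged sketch of that delayed-error behaviour, using Python 3's urllib.parse (the Python 3 rename of urlparse, with the polymorphic bytes support as eventually shipped in 3.2; the URL values are illustrative):

```python
from urllib.parse import urljoin, urlsplit

# Polymorphic: text in, text out; bytes in, bytes out.
assert isinstance(urlsplit("http://example.com/a").path, str)
assert isinstance(urlsplit(b"http://example.com/a").path, bytes)

# Mixing the two does not silently produce mojibake; it raises a
# TypeError when the bytes path finally meets the text URL.
byte_path = urlsplit(b"http://example.com/a/b").path  # b'/a/b'
try:
    urljoin("http://example.com/", byte_path)
except TypeError:
    print("delayed error, not mojibake")
```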
<br>Mojibake is caused by allowing bytes and text to intermix, and the polymorphic functions as proposed don't add new dangers in that regard.<br><br clear="all"></div></div>-- <br>Ian Bicking | <a href="http://blog.ianbicking.org">http://blog.ianbicking.org</a><br>