Encoding bug in Safari's XMLHttpRequest
Playing around with LiveSite I encountered a bug in Apple's WebKit, which is used by Safari, OmniWeb and a few other browsers. The XMLHttpRequest object used by those browsers (and probably by KHTML-based browsers too, if they've integrated the code) has a bug which causes the response to always be interpreted as ISO-8859-1 instead of in the encoding specified by the Content-Type header, the XML document or the parent document.

I sent a mail to Apple's webcore mailing list (http://lists.apple.com/archives/webcore-dev/2005/Feb/msg00001.html) and got a reply saying that the bug has been fixed in an unreleased version of the framework.

How would people feel about cluttering liveevil.js with code for checking for WebKit versions up to and including the current one, and attempt to re-interpret the string as UTF-8 if found? A version check and then something along the lines of _from_utf8() on http://homepage3.nifty.com/aokura/jscript/utf8.html should do it, but that would add dozens of lines to the beautifully simple script...

/ Sincerely, David
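To illustrate the re-interpretation step, here is a minimal sketch. It is not the _from_utf8() code from the linked page: the function name fromUtf8 is hypothetical, and it leans on the modern TextDecoder API, which a 2005 browser would not have (a hand-written decoder would be needed there). It assumes the response was misread strictly as ISO-8859-1, so every character stands for exactly one original byte.

```javascript
// Hypothetical helper: undo an ISO-8859-1 misreading of UTF-8 data.
// Assumption: each character of `misread` has a code below 0x100 and
// therefore represents exactly one byte of the original response.
function fromUtf8(misread) {
  const bytes = new Uint8Array(misread.length);
  for (let i = 0; i < misread.length; i++) {
    const code = misread.charCodeAt(i);
    if (code > 0xff) {
      throw new Error("not a byte-per-character string at index " + i);
    }
    bytes[i] = code;
  }
  // fatal: true makes the decoder throw on malformed UTF-8 instead of
  // silently substituting U+FFFD.
  return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
}

// U+00E9 is C3 A9 in UTF-8; read as ISO-8859-1 those two bytes appear
// as the two characters U+00C3 U+00A9.
const repaired = fromUtf8("\u00c3\u00a9");
```

Plain ASCII passes through unchanged, so applying the function to an unaffected ASCII-only response is harmless; applying it to a correctly decoded non-ASCII response is not, which is why some form of detection is needed first.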
On Feb 11, 2005, at 12:24 PM, David Remahl wrote:
> How would people feel about cluttering liveevil.js with code for checking for WebKit versions up to and including the current one, and attempt to re-interpret the string as UTF-8 if found?
> A version check and then something along the lines of _from_utf8() on http://homepage3.nifty.com/aokura/jscript/utf8.html should do it, but that would add dozens of lines to the beautifully simple script...
My opinion is that livepage should do what it has to, in order to work with existing browsers.

I have one suggestion for this case, though: don't do a version check, send a known "magic string" from server->client when first establishing the request. E.g. a single Unicode character. If the browser gets that known character, you're okay. If it shows up as two ISO-8859-1 characters that are the UTF-8 encoding of the known character, apply from_utf8. If it's something else entirely, die with an error message.

James
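The three-way check described above could be sketched roughly as follows. All names here (MAGIC, latin1Misreading, detectEncodingBug) are hypothetical and not taken from liveevil.js, and the choice of U+00E9 as the magic character is an assumption made for the example: its UTF-8 bytes C3 A9 both lie above 0x9F, so they look the same whether the browser assumes ISO-8859-1 or a Windows code page.

```javascript
// Hypothetical magic character known to both server and client.
const MAGIC = "\u00e9";

// Build the string a decoder would produce if it read the UTF-8 bytes of
// a single BMP character as one ISO-8859-1 character per byte.
function latin1Misreading(ch) {
  const cp = ch.charCodeAt(0);
  let bytes;
  if (cp < 0x80) {
    bytes = [cp];
  } else if (cp < 0x800) {
    bytes = [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)];
  } else {
    bytes = [0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)];
  }
  return String.fromCharCode(...bytes);
}

// Classify the first response, which the server prefixes with MAGIC.
function detectEncodingBug(response) {
  if (response.startsWith(MAGIC)) {
    return "ok"; // decoded correctly, nothing to do
  }
  if (response.startsWith(latin1Misreading(MAGIC))) {
    return "needs-from-utf8"; // bytes were read one character each
  }
  throw new Error("unexpected magic prefix; cannot interpret response");
}
```

The result only needs to be computed once per connection; later responses can be passed through from_utf8 or left alone based on the stored answer.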
On Feb 11, 2005, at 10:42 AM, James Y Knight wrote:
> On Feb 11, 2005, at 12:24 PM, David Remahl wrote:
>> How would people feel about cluttering liveevil.js with code for checking for WebKit versions up to and including the current one, and attempt to re-interpret the string as UTF-8 if found?
>> A version check and then something along the lines of _from_utf8() on http://homepage3.nifty.com/aokura/jscript/utf8.html should do it, but that would add dozens of lines to the beautifully simple script...
> My opinion is that livepage should do what it has to, in order to work with existing browsers.
> I have one suggestion for this case, though: don't do a version check, send a known "magic string" from server->client when first establishing the request. E.g. a single Unicode character. If the browser gets that known character, you're okay. If it shows up as two ISO8859-1 characters that are the UTF-8 encoding of the known character, apply from_utf8. If it's something else entirely, die with an error message.
+1 on this; liveevil.js should abstract all of these problems away from the developer. If nobody else generates a patch, I will do one in the manner which James suggests some weekend soon.

dp
On Fri, 11 Feb 2005 13:53:50 -0800, Donovan Preston wrote:
> +1 on this; liveevil.js should abstract all of these problems away from the developer. If nobody else generates a patch, I will do one in the manner which James suggests some weekend soon.
> dp
I've created a patch now. The problem turned out to be rather more difficult than originally anticipated.

I chose to go with the "magic" method suggested by James. The first time nevow_liveOutput is requested, a second argument, magicEcho, is passed. magicEcho is the URI-encoded version of "\u9b54\u8853" (Japanese for "magic", clever huh? ;-). nevow_liveOutput prefixes its reply with magicEcho. There are no extra round trips and very little overhead, since the magic is passed only on the first liveOutput query.

The problems started when I realized that AppleWebKit does not simply interpret each byte in the stream as \xXX. The encoding it defaults to is not ISO-8859-1 but Windows Latin-1 (cp1252). This means that, for example, \x91 becomes \u2018 (left single quotation mark). I ended up creating a lookup table for going back to something resembling the original stream (which could then be passed to from_utf8). Unfortunately, five bytes (\x81, \x8d, \x8f, \x90 and \x9d) map to the same character, namely \ufffd (undefined). This makes it impossible to perfectly reconstruct the original stream if it contained one of those bytes. This affects roughly 10% of Unicode characters smaller than 0x10000. In any case, allowing Safari to process 90% of all characters is better than getting erroneous output for 99.9% of them...

The only other workaround I can think of is for the client to request a re-send of the latest message in some 7-bit encoded form (base64 or something like that). The advantage is that the interpretation would always be accurate and that we would not have to include the cp1252 conversion table. The disadvantages are that it requires the server to remember the latest message, that it requires an extra XMLHttpRequest round trip, that it is relatively space inefficient, and that liveevil.js would have to include a base64 decoding function on top of from_utf8(). Magic would still be required to determine whether a re-transmission is necessary (i.e. if the JS implementation is buggy).

Does this seem like a reasonable compromise? If so, I'll clean up the patch, create some unit tests and submit it for consideration.

/ Sincerely, David Remahl
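For readers who want to see what such a lookup table might look like, here is a rough sketch. It is not the patch itself: the names CP1252_HIGH, BYTE_FOR_CHAR, toOriginalBytes and repair are hypothetical, the table is simply the standard cp1252 assignment for bytes 0x80 to 0x9f, and the final decoding step uses the modern TextDecoder API as a stand-in for from_utf8.

```javascript
// Code points that cp1252 assigns to bytes 0x80..0x9f. null marks the
// five undefined bytes (0x81, 0x8d, 0x8f, 0x90, 0x9d), which the message
// above reports as all arriving as U+FFFD.
const CP1252_HIGH = [
  0x20ac, null,   0x201a, 0x0192, 0x201e, 0x2026, 0x2020, 0x2021,
  0x02c6, 0x2030, 0x0160, 0x2039, 0x0152, null,   0x017d, null,
  null,   0x2018, 0x2019, 0x201c, 0x201d, 0x2022, 0x2013, 0x2014,
  0x02dc, 0x2122, 0x0161, 0x203a, 0x0153, null,   0x017e, 0x0178,
];

// Reverse table: misread character -> original byte.
const BYTE_FOR_CHAR = new Map();
CP1252_HIGH.forEach((cp, i) => {
  if (cp !== null) BYTE_FOR_CHAR.set(cp, 0x80 + i);
});

// Recover the original byte stream from a cp1252-misread string.
// Returns null when a character (such as U+FFFD) has no unique byte.
function toOriginalBytes(misread) {
  const bytes = [];
  for (const ch of misread) {
    const cp = ch.codePointAt(0);
    if (BYTE_FOR_CHAR.has(cp)) {
      bytes.push(BYTE_FOR_CHAR.get(cp));
    } else if (cp <= 0x7f || (cp >= 0xa0 && cp <= 0xff)) {
      bytes.push(cp); // these ranges are identical in cp1252 and ISO-8859-1
    } else {
      return null; // ambiguous or unexpected character
    }
  }
  return Uint8Array.from(bytes);
}

// Rebuild the bytes, then decode them as UTF-8 (stand-in for from_utf8).
function repair(misread) {
  const bytes = toOriginalBytes(misread);
  if (bytes === null) return null;
  return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
}
```

As a usage example, the magic string "\u9b54\u8853" is E9 AD 94 E8 A1 93 in UTF-8, which a cp1252 decoder turns into "\u00e9\u00ad\u201d\u00e8\u00a1\u201c"; repair() maps that back to the original two characters. A string containing \ufffd yields null, which is the case where a re-send in a 7-bit encoding would be the only way to recover the data.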
participants (3)
- David Remahl
- Donovan Preston
- James Y Knight