
"Stephen J. Turnbull" <stephen@xemacs.org>:
> Just read as bytes and decode piecewise in one way or another. For
> Oleg's HTML case, there's a well-understood structure that can be used
> to determine retry points

HTML and XML are interesting examples since their encoding is initially
unknown:

   <?xml version="1.0"?>
                       ^
                       +--- Now I know it is UTF-8

   <?xml version="1.0" encoding="UTF-16"?>
                                         ^
                                         +--- Now I know it was UTF-16
                                              all along!

Then we have:

   HTTP/1.1 200 OK
   Content-Type: text/html; charset=ISO-8859-1

   <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
   <html>
   <head>
   <meta http-equiv="Content-Type" content="text/html; charset=utf-16">

See how deep you have to parse the TCP stream before you realize the
content encoding is UTF-16.


Marko
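
For illustration only, here is a rough Python sketch of the "read as
bytes first, decide the encoding afterwards" idea for the XML case. The
function name sniff_xml_encoding is hypothetical and not from this
thread. It assumes the XML declaration itself is readable as ASCII, falls
back to UTF-8 when no encoding is declared, and ignores both the HTTP
Content-Type header and the HTML <meta> case shown above. It also does
not distinguish a UTF-32 byte order mark from a UTF-16 one.

```python
import re

# Byte order marks checked before looking at the declaration.
# Assumption: only UTF-8 and UTF-16 BOMs are of interest here.
_BOMS = [
    (b"\xef\xbb\xbf", "utf-8-sig"),
    (b"\xff\xfe", "utf-16"),
    (b"\xfe\xff", "utf-16"),
]

# Matches an encoding="..." pseudo-attribute inside a leading
# <?xml ...?> declaration, working on raw bytes.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(head: bytes) -> str:
    """Guess the encoding of an XML document from its first bytes."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    match = _DECL.match(head)
    if match:
        return match.group(1).decode("ascii").lower()
    # No BOM and no declared encoding: assume the XML default.
    return "utf-8"


if __name__ == "__main__":
    data = b'<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\xe9</p>'
    encoding = sniff_xml_encoding(data)
    # Only now is it safe to turn the bytes into text.
    print(encoding, data.decode(encoding))
```

The point of the sketch is that no text decoding happens until enough of
the byte stream has been inspected to pick a codec, which is exactly why
the stream cannot be opened in text mode with a fixed encoding up front.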