Re: [Python-Dev] Bytes path support

Aug. 23, 2014

      "Stephen J. Turnbull" <stephen@xemacs.org>:
> ...
> Just read as bytes and decode piecewise in one way or another. For
> Oleg's HTML case, there's a well-understood structure that can be used
> to determine retry points

HTML and XML are interesting examples since their encoding is initially
unknown:

  <?xml version="1.0"?>
                      ^
                      +--- Now I know it is UTF-8

  <?xml version="1.0" encoding="UTF-16"?>
                                      ^
                                      +--- Now I know it was UTF-16
                                           all along!

Then we have:

  HTTP/1.1 200 OK
  Content-Type: text/html; charset=ISO-8859-1

  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
  <html>
  <head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-16">

See how deep you have to parse the TCP stream before you find out that
the document declares its character encoding to be UTF-16.
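
As a rough sketch (in Python, working on bytes throughout), this is how
one might collect both declarations before deciding how to decode. The
function name declared_charsets and the 1024-byte scan window are
assumptions made for illustration, and the regular expressions are far
looser than a real HTML parser's prescan:

```python
import re

# Hypothetical helpers for illustration only.
_HEADER_CHARSET = re.compile(
    rb'(?im)^content-type:[^\r\n]*?charset=["\']?([\w.-]+)'
)
_META_CHARSET = re.compile(rb'(?i)<meta[^>]*?charset=["\']?([\w.-]+)')


def _label(match):
    """Return the captured charset label as lowercase text, or None."""
    return match.group(1).decode("ascii").lower() if match else None


def declared_charsets(raw: bytes):
    """Return (charset from the HTTP header, charset from a <meta> tag)."""
    # Headers end at the first empty line; everything after is the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    header = _HEADER_CHARSET.search(head)
    # Only scan the start of the body (window size is an assumption).
    meta = _META_CHARSET.search(body[:1024])
    return _label(header), _label(meta)


raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=ISO-8859-1\r\n"
    b"\r\n"
    b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\n'
    b"<html>\n<head>\n"
    b'<meta http-equiv="Content-Type" content="text/html; charset=utf-16">\n'
)
print(declared_charsets(raw))
```

The sketch only reports the two labels. Which one should win when they
disagree is a policy question it deliberately does not answer.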

Marko

Marko Rauhamaa