<div dir="ltr">Oh that's interesting. So it seems to be Python that's the exception here.<div><br></div><div>Would we really be able to add entries to character mappings that haven't changed since Python 2.0?</div></div><br><div class="gmail_quote"><div dir="ltr">On Tue, 9 Jan 2018 at 16:53 Ivan Pozdeev via Python-ideas <<a href="mailto:python-ideas@python.org">python-ideas@python.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
  
    
  
  <div text="#000000" bgcolor="#FFFFFF">
    <p>First of all, many thanks for such an excellently written letter.
      It was a real pleasure to read.<br>
    </p></div><div text="#000000" bgcolor="#FFFFFF">
    <div class="m_-2585226816780683907moz-cite-prefix">On 10.01.2018 0:15, Rob Speer wrote:<br>
    </div>
    <blockquote type="cite">
      <div dir="ltr">
        <div>
          <div>
            <div>
              <div>
                <div>
                  <div>
                    <div>
                      <div>
                        <div>
                          <div>
                            <div>
                              <div>Hi! I joined this list because I'm
                                interested in filling a gap in Python's
                                standard library, relating to text
                                encodings.<br>
                                <br>
                              </div>
                              There is an encoding with no name of its
                              own. It's supported by every current web
                              browser and standardized by WHATWG. It's
                              so prevalent that if you ask a Web browser
                              to decode "iso-8859-1" or "windows-1252",
                              you will get this encoding _instead_. It
                              is probably the second or third most
                              common text encoding in the world. And
                              Python doesn't quite support it.<br>
                              <br>
                            </div>
                            You can see the character table for this
                            encoding at:<br>
                            <a href="https://encoding.spec.whatwg.org/index-windows-1252.txt" target="_blank">https://encoding.spec.whatwg.org/index-windows-1252.txt</a><br>
                          </div>
                          <div><br>
                          </div>
                          <div>For the sake of discussion, let's call
                            this encoding "web-1252". WHATWG calls it
                            "windows-1252", but notice that it's subtly
                            different from Python's "windows-1252"
                            encoding. Python's windows-1252 has bytes
                            that are undefined:<br>
                          </div>
                          <br>
                          >>> b'\x90'.decode('windows-1252')<br>
                          UnicodeDecodeError: 'charmap' codec can't
                          decode byte 0x90 in position 0: character maps
                          to &lt;undefined&gt;<br>
                          <br>
                        </div>
                        In web-1252, the bytes that are undefined
                        according to windows-1252 map to the control
                        characters in those positions in iso-8859-1 --
                        that is, the Unicode codepoints with the same
                        number as the byte. In web-1252, b'\x90' would
                        decode as '\u0090'.<br>
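<div>A minimal sketch of that fallback, using only Python's existing codecs. The helper name <code>decode_web_1252</code> is made up for illustration; it is not part of Python, ftfy, or the WHATWG standard:</div>

```python
def decode_web_1252(data: bytes) -> str:
    """Decode byte by byte, falling back to ISO-8859-1 where
    Python's windows-1252 codec has no mapping (hypothetical helper)."""
    chars = []
    for value in data:
        raw = bytes([value])
        try:
            chars.append(raw.decode("windows-1252"))
        except UnicodeDecodeError:
            # Undefined in Python's table: use the code point with the
            # same number as the byte, which is what ISO-8859-1 gives.
            chars.append(raw.decode("iso-8859-1"))
    return "".join(chars)


euro = decode_web_1252(b"\x80")      # defined in windows-1252
control = decode_web_1252(b"\x90")   # undefined in Python's windows-1252
```

<div>Decoding one byte at a time is slow and is only meant to make the rule visible; a real codec would use a precomputed table.</div>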
                      </div>
                    </div>
                  </div>
                </div>
              </div>
            </div>
          </div>
        </div>
      </div>
    </blockquote></div><div text="#000000" bgcolor="#FFFFFF">
    According to <a href="https://en.wikipedia.org/wiki/Windows-1252" target="_blank">https://en.wikipedia.org/wiki/Windows-1252</a>
    , Windows does the same:<br>
    <p>    "According to the information on Microsoft's and the Unicode
      Consortium's websites, positions 81, 8D, 8F, 90, and 9D are
      unused; however, the Windows API <code><a rel="nofollow noopener
          noreferrer" class="m_-2585226816780683907external m_-2585226816780683907text" href="http://msdn.microsoft.com/en-us/library/windows/desktop/dd319072%28v=vs.85%29.aspx" target="_blank">MultiByteToWideChar</a></code> maps these to
      the corresponding <a href="https://en.wikipedia.org/wiki/C0_and_C1_control_codes" title="" target="_blank">C1 control codes</a>."</p>
    In ISO-8859-1, even the standard itself treats the otherwise unused
    code points the same way ( <a href="https://en.wikipedia.org/wiki/ISO/IEC_8859-1" target="_blank">https://en.wikipedia.org/wiki/ISO/IEC_8859-1</a>
    ) :<br>
    <p>    "<b>ISO-8859-1</b> is the <a href="https://en.wikipedia.org/wiki/Internet_Assigned_Numbers_Authority" title="Internet Assigned Numbers Authority" target="_blank">IANA</a> preferred
      name for this standard when supplemented with the <a href="https://en.wikipedia.org/wiki/C0_and_C1_control_codes" title="" target="_blank">C0 and C1 control codes</a> from <a href="https://en.wikipedia.org/wiki/ISO/IEC_6429" class="m_-2585226816780683907mw-redirect" title="ISO/IEC 6429" target="_blank">ISO/IEC 6429</a>"</p>
    And, as you might guess, these "C1 control codes" are also the
    Unicode code points with the same numbers! ( <a href="https://en.wikipedia.org/wiki/Latin-1_Supplement_%28Unicode_block%29" target="_blank">https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block)</a>
    )<br>
    <br>
    Since Windows is pretty much the reference implementation for
    "windows-xxxx" encodings, it even makes sense to alter the existing
    encodings rather than add new ones.<br>
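<div>A quick way to check both points from a Python prompt, as a rough sketch: list the bytes that Python's own <code>cp1252</code> codec rejects, and confirm that ISO-8859-1 decodes the C1 range to code points with the same numbers. The variable names are only illustrative:</div>

```python
# Collect the byte values that Python's cp1252 codec refuses to decode.
undefined_in_python = []
for value in range(256):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        undefined_in_python.append(value)

print([hex(value) for value in undefined_in_python])

# ISO-8859-1 maps every byte in the C1 range (0x80-0x9F) to the
# Unicode code point with the same number.
c1_bytes = bytes(range(0x80, 0xA0))
same_number = all(
    ord(char) == value
    for value, char in zip(c1_bytes, c1_bytes.decode("iso-8859-1"))
)
print(same_number)
```

<div>If the Wikipedia quote above is right, the first list should contain exactly the positions 81, 8D, 8F, 90, and 9D.</div>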
    <br>
    <blockquote type="cite"></blockquote></div><div text="#000000" bgcolor="#FFFFFF"><blockquote type="cite">
      <div dir="ltr">
        <div>
          <div>
            <div>
              <div>
                <div>
                  <div>
                    <div>
                      <div><br>
                      </div>
                      This may seem like a silly encoding that
                      encourages doing horrible things with text. That's
                      pretty much the case. But there's a reason every
                      Web browser implements it:<br>
                      <br>
                    </div>
                    - It's compatible with windows-1252<br>
                  </div>
                  - Any sequence of bytes can be round-tripped through
                  it without losing information<br>
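<div>The round-trip property can be sketched with two plain dictionaries. This assumes the "web-1252" table is Python's <code>cp1252</code> table with the gaps filled by same-numbered code points, as described earlier; the names <code>BYTE_TO_CHAR</code>, <code>CHAR_TO_BYTE</code>, <code>decode</code>, and <code>encode</code> are hypothetical:</div>

```python
def build_maps():
    """Build byte->char and char->byte maps for the assumed web-1252 table."""
    byte_to_char = {}
    for value in range(256):
        raw = bytes([value])
        try:
            byte_to_char[value] = raw.decode("cp1252")
        except UnicodeDecodeError:
            # Fill the gap with the code point of the same number.
            byte_to_char[value] = chr(value)
    char_to_byte = {char: value for value, char in byte_to_char.items()}
    return byte_to_char, char_to_byte


BYTE_TO_CHAR, CHAR_TO_BYTE = build_maps()


def decode(data: bytes) -> str:
    return "".join(BYTE_TO_CHAR[value] for value in data)


def encode(text: str) -> bytes:
    return bytes(CHAR_TO_BYTE[char] for char in text)


every_byte = bytes(range(256))
round_tripped = encode(decode(every_byte))

# For contrast: errors="replace" swaps each undefined byte for U+FFFD,
# so the original byte values cannot be recovered afterwards.
lossy = every_byte.decode("cp1252", errors="replace")
```

<div>Because all 256 bytes map to distinct characters, the inverse map is well defined and nothing is lost; with <code>errors="replace"</code> the undefined bytes all collapse into the same replacement character.</div>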
                  <br>
                </div>
                It's not just this one encoding. WHATWG's encoding
                standard (<a href="https://encoding.spec.whatwg.org/" target="_blank">https://encoding.spec.whatwg.org/</a>)
                contains modified versions of windows-1250 through
                windows-1258 and windows-874.<br>
                <br>
              </div>
              Support for these encodings matters to me, in part,
              because I maintain a Unicode data-cleaning library,
              "ftfy". One thing it does is to detect and undo
              encoding/decoding errors that cause mojibake, as long as
              they're detectable and reversible. Looking at real-world
              examples of text that has been damaged by mojibake, it's
              clear that lots of text is transferred through what I'm
              calling the "web-1252" encoding, in a way that's
              incompatible with Python's "windows-1252".</div>
            <div><br>
            </div>
            <div>In order to be able to work with and fix this kind of
              text, ftfy registers new codecs -- and I implemented this
              even before I knew that they were standardized in Web
              browsers. When ftfy is imported, you can decode text as
              "sloppy-windows-1252" (the name I chose for this
              encoding), for example.</div>
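<div>For readers who have not registered a codec before, here is a rough sketch of how it can be done with the standard <code>codecs</code> module. This is not ftfy's actual code; the codec name "web-1252" and the underscore-prefixed helpers are placeholders chosen for this example:</div>

```python
import codecs


def _build_decoding_table() -> str:
    """Return a 256-character string: index = byte value, item = character."""
    chars = []
    for value in range(256):
        raw = bytes([value])
        try:
            chars.append(raw.decode("cp1252"))
        except UnicodeDecodeError:
            chars.append(chr(value))  # fill gaps like ISO-8859-1 does
    return "".join(chars)


_DECODING_TABLE = _build_decoding_table()
_ENCODING_TABLE = codecs.charmap_build(_DECODING_TABLE)


def _encode(text, errors="strict"):
    return codecs.charmap_encode(text, errors, _ENCODING_TABLE)


def _decode(data, errors="strict"):
    return codecs.charmap_decode(data, errors, _DECODING_TABLE)


def _search(name):
    # Depending on the Python version, the lookup name may arrive with
    # hyphens or with underscores, so normalize before comparing.
    if name.replace("-", "_") == "web_1252":
        return codecs.CodecInfo(encode=_encode, decode=_decode, name="web-1252")
    return None


codecs.register(_search)

decoded = b"\x90".decode("web-1252")
encoded = "\u20ac\u0090".encode("web-1252")
```

<div>Note that <code>codecs.register</code> changes process-wide state, which is part of why doing this as a side effect of importing a library is less pleasant than having the codec built in.</div>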
            <div><br>
            </div>
            <div>ftfy can tell people a sequence of steps that they can
              use in the future to fix text that's like the text they
              provided. Very often, these steps require the
              sloppy-windows-1252 or sloppy-windows-1251 encoding, which
              means the steps only work with ftfy imported, even for
              people who are not using the features of ftfy.<br>
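<div>To illustrate what such a step looks like without importing ftfy, here is a small sketch. The helper <code>encode_web_1252</code> is hypothetical, and the set of undefined positions is the one listed on the Wikipedia page quoted above:</div>

```python
# Positions that Python's cp1252 leaves undefined (81, 8D, 8F, 90, 9D).
UNDEFINED_POSITIONS = {0x81, 0x8D, 0x8F, 0x90, 0x9D}


def encode_web_1252(text: str) -> bytes:
    """Encode with cp1252, passing the five gap characters through as bytes."""
    out = bytearray()
    for char in text:
        try:
            out += char.encode("cp1252")
        except UnicodeEncodeError:
            if ord(char) in UNDEFINED_POSITIONS:
                out.append(ord(char))
            else:
                raise
    return bytes(out)


# The UTF-8 bytes of U+00C1 (C3 81) that were wrongly decoded as web-1252.
mojibake = "\u00c3\u0081"

# The fix is: encode back to the original bytes, then decode as UTF-8.
fixed = encode_web_1252(mojibake).decode("utf-8")
print(fixed)
```

<div>With Python's strict <code>windows-1252</code> codec the encode step would raise instead, because U+0081 has no byte assigned to it there.</div>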
            </div>
            <br>
          </div>
          Support for these encodings also seems highly relevant to
          people who use Python for web scraping, as it would be
          desirable to maximize compatibility with what a Web browser
          would do.<br>
          <br>
          This really seems like it belongs in the standard library
          instead of being an incidental feature of my library. I know
          that code in the standard library has "one foot in the grave".
          I _want_ these legacy encodings to have one foot in the grave.
          But some of them are extremely common, and Python code should
          be able to deal with them.<br>
          <br>
        </div>
        Adding these encodings to Python would be straightforward to
        implement. Does this require a PEP, a pull request, or further
        discussion?<br>
      </div>
      <br>
      <br>
      </blockquote></div><div text="#000000" bgcolor="#FFFFFF"><blockquote type="cite"><pre>_______________________________________________
Python-ideas mailing list
<a class="m_-2585226816780683907moz-txt-link-abbreviated" href="mailto:Python-ideas@python.org" target="_blank">Python-ideas@python.org</a>
<a class="m_-2585226816780683907moz-txt-link-freetext" href="https://mail.python.org/mailman/listinfo/python-ideas" target="_blank">https://mail.python.org/mailman/listinfo/python-ideas</a>
Code of Conduct: <a class="m_-2585226816780683907moz-txt-link-freetext" href="http://python.org/psf/codeofconduct/" target="_blank">http://python.org/psf/codeofconduct/</a>
</pre>
    </blockquote>
    <br>
    <pre class="m_-2585226816780683907moz-signature" cols="72">-- 
Regards,
Ivan</pre>
  </div>

</blockquote></div>