[New-bugs-announce] [issue37071] HTMLParser mistakenly inventing new tags while parsing

Hoang Duy Tran report at bugs.python.org
Mon May 27 21:23:00 EDT 2019


New submission from Hoang Duy Tran <hoangduytran1960 at gmail.com>:

I have been working with some 'difficult' HTML files generated by Sphinx from RST sources. The following block of text is the original RST content:

----------------------------------------------------
Animation Playback Options
==========================

``-a`` ``<options>`` ``<file(s)>``
   Playback ``<file(s)>``, only operates this way when not running in background.

   ``-p`` ``<sx>`` ``<sy>``
      Open with lower left corner at ``<sx>``, ``<sy>``.
   ``-m``
      Read from disk (Do not buffer).
   ``-f`` ``<fps>`` ``<fps-base>``
      Specify FPS to start with.
   ``-j`` ``<frame>``
      Set frame step to ``<frame>``.
   ``-s`` ``<frame>``
      Play from ``<frame>``.
   ``-e`` ``<frame>``
      Play until ``<frame>``.
----------------------------------------------------

This is the HTML block that is generated by Sphinx:

----------------------------------------------------
<section ids="animation-playback-options" names="animation\ playback\ options"><title>Animation Playback Options</title><definition_list><definition_list_item><term><literal>-a</literal> <literal><options></literal> <literal><file(s)></literal></term><definition><paragraph>Playback <literal><file(s)></literal>, only operates this way when not running in background.</paragraph><definition_list><definition_list_item><term><literal>-p</literal> <literal><sx></literal> <literal><sy></literal></term><definition><paragraph>Open with lower left corner at <literal><sx></literal>, <literal><sy></literal>.</paragraph></definition></definition_list_item><definition_list_item><term><literal>-m</literal></term><definition><paragraph>Read from disk (Do not buffer).</paragraph></definition></definition_list_item><definition_list_item><term><literal>-f</literal> <literal><fps></literal> <literal><fps-base></literal></term><definition><paragraph>Specify FPS to start with.</paragraph></definition></definition_list_item><definition_list_item><term><literal>-j</literal> <literal><frame></literal></term><definition><paragraph>Set frame step to <literal><frame></literal>.</paragraph></definition></definition_list_item><definition_list_item><term><literal>-s</literal> <literal><frame></literal></term><definition><paragraph>Play from <literal><frame></literal>.</paragraph></definition></definition_list_item><definition_list_item><term><literal>-e</literal> <literal><frame></literal></term><definition><paragraph>Play until <literal><frame></literal>.</paragraph></definition></definition_list_item></definition_list></definition></definition_list_item></definition_list></section>
----------------------------------------------------

I then used BeautifulSoup, which uses HTMLParser, to parse and beautify the HTML document, and I noticed that every piece of data that starts with "<" and ends with ">", for example:

<options>
<file(s)>
....

is mistaken by the HTMLParser library for a TAG, and a CLOSING TAG is then INVENTED for it,

i.e.

      <literal>
       <options>
       </options>
      </literal>

and

       <literal>
        <file(s)>
        </file(s)>
       </literal>

which, when reversing the process, i.e. converting the HTML back to plain text, drops the original data, leading to TRUNCATION/LOSS of DATA.
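
The behaviour can be reproduced without BeautifulSoup. The following is only a minimal sketch using the standard library's html.parser directly; EventRecorder and record are hypothetical helper names introduced for illustration, and the two input strings are shortened from the markup above. It shows that an unescaped <options> is reported as a start tag with no text data, while the escaped form &lt;options&gt; is reported as data. The closing tags seen in the beautified output are presumably added later, when BeautifulSoup builds its tree, not by HTMLParser itself.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: records each callback the parser fires."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def record(markup):
    """Parse markup and return the list of recorded events."""
    parser = EventRecorder()
    parser.feed(markup)
    parser.close()
    return parser.events


# Unescaped: <options> is treated as a start tag, so no text is reported.
raw = record("<literal><options></literal>")

# Escaped: the same characters arrive as ordinary text data.
escaped = record("<literal>&lt;options&gt;</literal>")

print(raw)
print(escaped)
```

In the first case no ("data", ...) event is fired for <options>, which is consistent with the text disappearing when the document is converted back to plain text.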

Here is the beautified output; the problem lines are marked with '#**************************' to make them easier to identify.

----------------------------------------------------
  <section ids="animation-playback-options" names="animation\ playback\ options">
   <title>
    Animation Playback Options
   </title>
   <definition_list>
    <definition_list_item>
     <term>
      <literal>
       -a
      </literal>
      <literal>
       <options> #**************************
       </options> #**************************
      </literal>
      <literal>
       <file(s)> #**************************
       </file(s)> #**************************
      </literal>
     </term>
     <definition>
      <paragraph>
       Playback
       <literal>
        <file(s)> #**************************
        </file(s)> #**************************
       </literal>
       , only operates this way when not running in background.
      </paragraph>
      <definition_list>
       <definition_list_item>
        <term>
         <literal>
          -p
         </literal>
         <literal>
          <sx> #**************************
          </sx> #**************************
         </literal>
         <literal>
          <sy> #**************************
          </sy> #**************************
         </literal>
        </term>
        <definition>
         <paragraph>
          Open with lower left corner at
          <literal>
           <sx> #**************************
           </sx> #**************************
          </literal>
          ,
          <literal>
           <sy> #**************************
           </sy> #**************************
          </literal>
          .
         </paragraph>
        </definition>
       </definition_list_item>
       <definition_list_item>
        <term>
         <literal>
          -m
         </literal>
        </term>
        <definition>
         <paragraph>
          Read from disk (Do not buffer).
         </paragraph>
        </definition>
       </definition_list_item>
       <definition_list_item>
        <term>
         <literal>
          -f
         </literal>
         <literal>
          <fps> #**************************
          </fps> #**************************
         </literal>
         <literal>
          <fps-base> #**************************
          </fps-base> #**************************
         </literal>
        </term>
        <definition>
         <paragraph>
          Specify FPS to start with.
         </paragraph>
        </definition>
       </definition_list_item>
       <definition_list_item>
        <term>
         <literal>
          -j
         </literal>
         <literal>
          <frame/> #**************************
         </literal>
        </term>
        <definition>
         <paragraph>
          Set frame step to
          <literal>
           <frame/> #**************************
          </literal>
          .
         </paragraph>
        </definition>
       </definition_list_item>
       <definition_list_item>
        <term>
         <literal>
          -s
         </literal>
         <literal>
          <frame/> #**************************
         </literal>
        </term>
        <definition>
         <paragraph>
          Play from
          <literal>
           <frame/> #**************************
          </literal>
          .
         </paragraph>
        </definition>
       </definition_list_item>
       <definition_list_item>
        <term>
         <literal>
          -e
         </literal>
         <literal>
          <frame/> #**************************
         </literal>
        </term>
        <definition>
         <paragraph>
          Play until
          <literal>
           <frame/> #**************************
          </literal>
          .
         </paragraph>
        </definition>
       </definition_list_item>
      </definition_list>
     </definition>
    </definition_list_item>
   </definition_list>
  </section>
----------------------------------------------------
I have attached the HTML file generated by Sphinx so that you can test this issue with the actual data.

Here is the URL of the HTML file:

https://docs.blender.org/manual/en/dev/advanced/command_line/arguments.html


Kind Regards,
Hoang Tran

----------
components: Library (Lib)
files: arguments.html
messages: 343724
nosy: htran
priority: normal
severity: normal
status: open
title: HTMLParser mistakenly inventing new tags while parsing
type: behavior
versions: Python 3.6
Added file: https://bugs.python.org/file48367/arguments.html

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue37071>
_______________________________________

