[New-bugs-announce] [issue37071] HTMLParser mistakenly inventing new tags while parsing
Hoang Duy Tran
report at bugs.python.org
Mon May 27 21:23:00 EDT 2019
New submission from Hoang Duy Tran <hoangduytran1960 at gmail.com>:
I have been working with some 'difficult' HTML files generated by Sphinx from RST. The following block of text is the original RST content:
----------------------------------------------------
Animation Playback Options
==========================
``-a`` ``<options>`` ``<file(s)>``
Playback ``<file(s)>``, only operates this way when not running in background.
``-p`` ``<sx>`` ``<sy>``
Open with lower left corner at ``<sx>``, ``<sy>``.
``-m``
Read from disk (Do not buffer).
``-f`` ``<fps>`` ``<fps-base>``
Specify FPS to start with.
``-j`` ``<frame>``
Set frame step to ``<frame>``.
``-s`` ``<frame>``
Play from ``<frame>``.
``-e`` ``<frame>``
Play until ``<frame>``.
----------------------------------------------------
This is the HTML block that is generated by Sphinx:
----------------------------------------------------
<section ids="animation-playback-options" names="animation\ playback\ options"><title>Animation Playback Options</title><definition_list><definition_list_item><term><literal>-a</literal> <literal><options></literal> <literal><file(s)></literal></term><definition><paragraph>Playback <literal><file(s)></literal>, only operates this way when not running in background.</paragraph><definition_list><definition_list_item><term><literal>-p</literal> <literal><sx></literal> <literal><sy></literal></term><definition><paragraph>Open with lower left corner at <literal><sx></literal>, <literal><sy></literal>.</paragraph></definition></definition_list_item><definition_list_item><term><literal>-m</literal></term><definition><paragraph>Read from disk (Do not buffer).</paragraph></definition></definition_list_item><definition_list_item><term><literal>-f</literal> <literal><fps></literal> <literal><fps-base></literal></term><definition><paragraph>Specify FPS to start with.</paragraph></definition></definition_list_item><definition_list_item><term><literal>-j</literal> <literal><frame></literal></term><definition><paragraph>Set frame step to <literal><frame></literal>.</paragraph></definition></definition_list_item><definition_list_item><term><literal>-s</literal> <literal><frame></literal></term><definition><paragraph>Play from <literal><frame></literal>.</paragraph></definition></definition_list_item><definition_list_item><term><literal>-e</literal> <literal><frame></literal></term><definition><paragraph>Play until <literal><frame></literal>.</paragraph></definition></definition_list_item></definition_list></definition></definition_list_item></definition_list></section>
----------------------------------------------------
I then use BeautifulSoup, which relies on HTMLParser, to parse and prettify the HTML document. I noticed that every piece of data that starts with "<" and ends with ">", for example:
<options>
<file(s)>
....
is misinterpreted by the HTMLParser library as a TAG, and a CLOSING TAG is then INVENTED for it,
i.e.
<literal>
<options>
</options>
</literal>
and
<literal>
<file(s)>
</file(s)>
</literal>
which, when converting the HTML back to plain text, wipes out the original data, leading to TRUNCATION/LOSS of DATA.
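The behaviour described above can be reproduced without BeautifulSoup. The following is only a minimal sketch, not taken from the attached file: `EventLogger` and `parse_events` are hypothetical helper names, and the two input strings are shortened stand-ins for the `<literal>` elements shown earlier. It records the callbacks that `html.parser.HTMLParser` fires for an unescaped `<options>` and for the same text escaped with `html.escape`.

```python
from html import escape
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records parser callbacks in the order they fire."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def parse_events(markup):
    """Feed markup to a fresh parser and return the recorded events."""
    parser = EventLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


# Unescaped, as in the block above: "<options>" is reported as a start tag.
raw = "<literal><options></literal>"
# Escaped: "&lt;options&gt;" is reported as character data.
escaped = "<literal>" + escape("<options>") + "</literal>"

print(parse_events(raw))
print(parse_events(escaped))
```

In this sketch the parser reports `<options>` as a start tag with no end tag of its own, so the `</options>` seen in the prettified output is presumably added when the tree is built and serialized on top of these events. With the escaped input, the text arrives through `handle_data` and is preserved.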
Here is the prettified output; the problem lines are marked with '#**************************' to make them easier to identify.
----------------------------------------------------
<section ids="animation-playback-options" names="animation\ playback\ options">
<title>
Animation Playback Options
</title>
<definition_list>
<definition_list_item>
<term>
<literal>
-a
</literal>
<literal>
<options> #**************************
</options> #**************************
</literal>
<literal>
<file(s)> #**************************
</file(s)> #**************************
</literal>
</term>
<definition>
<paragraph>
Playback
<literal>
<file(s)> #**************************
</file(s)> #**************************
</literal>
, only operates this way when not running in background.
</paragraph>
<definition_list>
<definition_list_item>
<term>
<literal>
-p
</literal>
<literal>
<sx> #**************************
</sx> #**************************
</literal>
<literal>
<sy> #**************************
</sy> #**************************
</literal>
</term>
<definition>
<paragraph>
Open with lower left corner at
<literal>
<sx> #**************************
</sx> #**************************
</literal>
,
<literal>
<sy> #**************************
</sy> #**************************
</literal>
.
</paragraph>
</definition>
</definition_list_item>
<definition_list_item>
<term>
<literal>
-m
</literal>
</term>
<definition>
<paragraph>
Read from disk (Do not buffer).
</paragraph>
</definition>
</definition_list_item>
<definition_list_item>
<term>
<literal>
-f
</literal>
<literal>
<fps> #**************************
</fps> #**************************
</literal>
<literal>
<fps-base> #**************************
</fps-base> #**************************
</literal>
</term>
<definition>
<paragraph>
Specify FPS to start with.
</paragraph>
</definition>
</definition_list_item>
<definition_list_item>
<term>
<literal>
-j
</literal>
<literal>
<frame/> #**************************
</literal>
</term>
<definition>
<paragraph>
Set frame step to
<literal>
<frame/> #**************************
</literal>
.
</paragraph>
</definition>
</definition_list_item>
<definition_list_item>
<term>
<literal>
-s
</literal>
<literal>
<frame/> #**************************
</literal>
</term>
<definition>
<paragraph>
Play from
<literal>
<frame/> #**************************
</literal>
.
</paragraph>
</definition>
</definition_list_item>
<definition_list_item>
<term>
<literal>
-e
</literal>
<literal>
<frame/> #**************************
</literal>
</term>
<definition>
<paragraph>
Play until
<literal>
<frame/> #**************************
</literal>
.
</paragraph>
</definition>
</definition_list_item>
</definition_list>
</definition>
</definition_list_item>
</definition_list>
</section>
----------------------------------------------------
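The data loss when turning the markup back into plain text can be illustrated with a small sketch. It assumes the text is recovered by collecting only character data, which is one common way to do it and not necessarily how the original script works; `TextCollector` and `to_text` are hypothetical names, and the sample is the `<sx>`/`<sy>` paragraph from above.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: keeps character data and ignores all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def to_text(markup):
    """Return the concatenated character data found in markup."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.chunks)


# Unescaped placeholders are treated as tags, so they vanish from the text.
unescaped = (
    "<paragraph>Open with lower left corner at "
    "<literal><sx></literal>, <literal><sy></literal>.</paragraph>"
)
# The same paragraph with the placeholders written as character references.
escaped = (
    "<paragraph>Open with lower left corner at "
    "<literal>&lt;sx&gt;</literal>, <literal>&lt;sy&gt;</literal>.</paragraph>"
)

print(repr(to_text(unescaped)))
print(repr(to_text(escaped)))
```

Under these assumptions the first call loses both placeholders, while the second returns the sentence intact.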
I have attached the HTML file generated by Sphinx so that you can test this issue with the actual data.
Here is the URL of the HTML file:
https://docs.blender.org/manual/en/dev/advanced/command_line/arguments.html
Kind Regards,
Hoang Tran
----------
components: Library (Lib)
files: arguments.html
messages: 343724
nosy: htran
priority: normal
severity: normal
status: open
title: HTMLParser mistakenly inventing new tags while parsing
type: behavior
versions: Python 3.6
Added file: https://bugs.python.org/file48367/arguments.html
_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue37071>
_______________________________________