LXML getchildren() method does not return a child div tag with style="display:none"

I am parsing the page at https://irs.thsrc.com.tw/IMINT/. When the page is rendered, a confirmation box pops up in front of all other elements and asks for confirmation. That box is a div inside the page's only form tag, at xpath /html/body/div[1]/form/div[2]. Both this div and the one before it, at /html/body/div[1]/form/div[1], have a style attribute containing "display:none".

The weird thing is that when I call getchildren() on the form element, the second div is not in the returned list: lxml skips it. I also tried BeautifulSoup, and the problem is the same: the .children attribute of the form element does not contain the second div.

Below I have copied part of the HTML and the whole method that recursively scans all tags. Some Unicode data that could not pass the Stack Overflow filter has been changed to ASCII. I would appreciate it very much if anyone could tell me how to get the pop-up div that is waiting for confirmation.
Thanks

    <html>
    <head>
    <title> taiwan hsrc </title>
    </head>
    <body topmargin="0" rightmargin="0" bottommargin="0" bgcolor="#FFFFFF" leftmargin="0">
    <!----- error message ends ----->
    <form action="/IMINT/;jsessionid=4A74C40B8D68474DF0B6F49E953DD825?wicket:interface=:0:BookingS1Form::IFormSubmitListener" id="BookingS1Form" method="post">
      <div style="display:none">
        <input type="hidden" name="BookingS1Form:hf:0" id="BookingS1Form:hf:0" />
      </div>
      <div style="display:none; padding:3px 10px 5px;text-align:center;" id="dialogCookieInfo" title="Taiwan high-speed rail" wicket:message="title=bookingdialog_3">
        <div class="JCon">
          <div class="TCon">
            <div class="overDiffText">
              <div style="text-align: left;">
                <span>for better service <a target="_blank" class="c" style="color:#FF9900;" href="https://www.thsrc.com.tw/tw/Article/ArticleContent/d1fa3bcb-a016-47e2-88c6-7b7cbed00ed5?tabIndex=1"> privacy </a> 。 </span>
              </div>
            </div>
            <div class="action">
              <table border="0" cellpadding="0" cellspacing="0" align="center">
                <tr>
                  <td>
                    <input hidefocus="true" name="confirm" id="btn-confirm" type="button" class="button_main" value="我同意"/>
                  </td>
                </tr>
              </table>
            </div>
          </div>
        </div>
      </div>
      <div id="content" class="content">
        <!----- marquee starts ----->
        <marquee id="marqueeShow" behavior="scroll" scrollamount="1" direction="left" width="755">
        </marquee>
        <!----- marquee ends ----->
        <div class="tit">
          <span>一般訂票</span>
        </div>
    </form>
    </div>
    </body>
    </html>

My code with LXML for scanning the HTML is the following.
    def actionableLXML(cls, e):
        # Returns True if e or any of its descendants is actionable.
        global count
        print ("rec[", count, "], xpath: ", xmlTree.getpath(e))
        count += 1
        flagActionableInside = False
        if e.tag in cls._clickable_tags \
                or e.tag == 'input' or e.tag == 'select':
            flagActionableInside = True
        else:
            flagActionableInside = False
        for c in e.getchildren():
            flagActionableInside |= cls.actionableLXML(c)
        # Remove hidden elements that contain nothing actionable.
        if e.attrib and 'style' in e.attrib \
                and 'display:' in e.attrib['style'] \
                and 'none' in e.attrib['style']:
            if not flagActionableInside:
                e.getparent().remove(e)
        return flagActionableInside

王凡 wrote on 28.02.19 at 02:49:
I tried parsing the HTML snippet that you showed and it parses ok, with all divs included. Meaning: I cannot reproduce this.

"style" attributes have no impact on the parser in particular; they are just like any other attribute. "display:none" is just ordinary text and has no special meaning to the parser either.

Regarding your code:
    for c in e.getchildren():
        flagActionableInside |= cls.actionableLXML(c)
No need to call ".getchildren()" here. Just iterate over the element.
    if e.attrib and 'style' in e.attrib \
It's probably faster to just test "'style' in e.attrib" and not check if the attrib dict is empty before. I'd actually suggest using the even shorter "e.get('style', '')".

Stefan
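The two suggestions above can be sketched roughly as follows. This is only an illustration, not the code from the original post: the SAMPLE markup, the ACTIONABLE_TAGS set and the helper names is_hidden and prune_hidden are made up for the example, and the hidden-style test is a simplified stand-in for the original one. The element is iterated directly, the style is read with e.get('style', ''), and removal happens in the parent's loop over list(e), a snapshot, since getchildren() also returned a list. The import falls back to the standard library's ElementTree, which shares the calls used here, in case lxml is not installed.

```python
try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: the stdlib ElementTree offers the same calls used below.
    import xml.etree.ElementTree as etree

# Hypothetical, well-formed stand-in for the form in the question.
SAMPLE = """
<form id="BookingS1Form">
  <div style="display:none"><input type="hidden" name="hf"/></div>
  <div style="display:none; padding:3px" id="dialogCookieInfo">
    <input type="button" id="btn-confirm"/>
  </div>
  <div style="display:none" id="emptyHidden"><span>text</span></div>
  <div id="content"><span>visible</span></div>
</form>
"""

# Assumed set of actionable tags; the original class keeps its own list.
ACTIONABLE_TAGS = {"a", "button", "input", "select"}


def is_hidden(e):
    """True if the style attribute contains display:none (simplified test)."""
    style = e.get("style", "").replace(" ", "")
    return "display:none" in style


def prune_hidden(e):
    """Return True if e or a descendant is actionable.

    Hidden children that contain nothing actionable are removed.
    """
    actionable = e.tag in ACTIONABLE_TAGS
    # Iterate over the element itself; list() takes a snapshot so that
    # removing a child does not interfere with the loop.
    for child in list(e):
        child_actionable = prune_hidden(child)
        if is_hidden(child) and not child_actionable:
            e.remove(child)
        actionable |= child_actionable
    return actionable


form = etree.fromstring(SAMPLE)
ids_before = [child.get("id") for child in form]
prune_hidden(form)
ids_after = [child.get("id") for child in form]
print(ids_before)
print(ids_after)
```

In this sample both display:none divs that contain an input survive the walk, and only the hidden div without anything actionable inside is dropped. Doing the removal from the parent also avoids the need for getparent(), which the standard library's elements do not provide.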

participants (2): Stefan Behnel, 王凡