Assistance Needed: Python Code Crashes After Multiple Iterations When Extracting HTML Elements
Hello, everyone!

I am encountering an issue with my Python code, which opens multiple HTML pages and extracts the elements of a specific class from each one, if they exist. The code runs fine for a few iterations but crashes after enough of them, always on the same HTML page. Interestingly, if I process that page on its own, the code works without any problems.

This issue has been bothering me for some time, and I have tried many approaches. When debugging, I noticed that the code always crashes inside the *lxml* package, which is used by *requests_html*. This could be a bug in the library, but what could explain a crash that happens only after several iterations?

<https://mail.google.com/mail/u/0?ui=2&ik=50de717578&attid=0.1&permmsgid=msg-a:r8588811151197014763&th=190e667ff637bb2e&view=att&disp=safe&realattid=f_lz0ai9pe0>

Some additional information:

    Python           : sys.version_info(major=3, minor=8, micro=10, releaselevel='final', serial=0)
    lxml.etree       : (5, 2, 2, 0)
    libxml used      : (2, 12, 6)
    libxml compiled  : (2, 12, 6)
    libxslt used     : (1, 1, 39)
    libxslt compiled : (1, 1, 39)

I'm attaching the code I mentioned. As you can see, it is quite simple.

    import pandas as pd
    import requests_html

    def main():
        # links.csv has a 'hyperlink' column with one URL per row
        df_links = pd.read_csv('./links.csv')
        session = requests_html.HTMLSession()
        for i in range(0, len(df_links.index)):
            url = df_links.iloc[i]['hyperlink']
            print(f"[{i}/{len(df_links.index)}]: {url}", flush=True)
            try:
                response = session.get(url)
                if response.status_code == 200:
                    response_html = response.html
                    # Look up the <relative-time> elements, if any
                    dateList = response_html.find('relative-time')
            except Exception as e:
                print(f"Something went wrong: {e}", flush=True)

    if __name__ == "__main__":
        main()

The crash always happens at the following line:

    value = etree.fromstring(html, parser, **kw)

in the function document_fromstring in *lxml/html/__init__.py*.

Thank you for your time and help.
Best regards, Kadu
Carlos Eduardo de Schuller Banjar