Get html DOM tree by only basic builtin modules
Wesley
nispray at gmail.com
Fri Jun 5 19:24:31 EDT 2015
> On Fri, Jun 5, 2015 at 12:10 PM, Wesley <nispray at gmail.com> wrote:
> > Hi Laura,
> > Sure, I have a special requirement: parse an HTML file into a DOM tree using only the general basic modules, and then, based on that DOM tree structure, draw a bitmap.
> >
> > So, could you give me a direction on how to get the DOM tree?
> > Currently, the only idea I have is to use something like a stack: read the file line by line, push onto a stack data structure (a list, for example), and then derive the parent/child relations, etc.
> >
> > I don't know whether what I described is easy to achieve; I am just trying.
> > Any better suggestions would be greatly appreciated.
>
> If you want to recreate the same DOM structure that would be created
> by a browser, the standardized algorithm to do so is very complicated,
> but you can find it at
> http://www.w3.org/TR/2011/WD-html5-20110113/parsing.html.
>
> If you're not necessarily seeking perfect fidelity, I would encourage
> you to try to find some way to incorporate beautifulsoup into your
> project. It likely won't produce the same structure that a real
> browser would, but it should do well enough to scrape from even badly
> malformed html.
>
> I recommend against using an XML parser, because HTML isn't XML, and
> such a parser may choke even on perfectly valid HTML such as this:
>
> <!DOCTYPE html>
> <html>
> <head><title>Document</title></head>
> <body>
> First line
> <br>
> Second line
> </body>
> </html>
Hi,
Hmm, it's really complex.
For now, I don't need to handle every error case; I can assume the HTML is well formed and just generate the DOM tree from it.
HTML sample below:
<!DOCTYPE html>
<!-- saved from url=(0026)http://www.opera.com/about -->
<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="description" content="Opera is an independent Scandinavian company that's been in the business of making web browsers since 1994. Read more about Opera Software here.">
<title>About - Opera Software</title>
<link rel="apple-touch-icon" sizes="57x57" href="http://d2jc9zwbrclgz3.cloudfront.net/static-heap/da/dafd15591b35d4f81ca96cf7de6582d705850ff0/apple-touch-icon-57x57.png">
</head>
<body screen_capture_injected="true"><div style="position: fixed; top: 0px; left: 0px; height: 0px; width: 0px; z-index: 9999999;"><div style="position: fixed; top: 100%; height: 0px;"><div style="position: relative;"></div></div></div>
<!-- Google Tag Manager -->
<nav class="business-menu">
<ul>
<li><a data-action-id="header_item" href="http://operamediaworks.com/">Opera Mediaworks</a></li>
</ul>
</nav>
<main role="main" class="generic_landing_page">
<h1>Who we are, what we do</h1> <figure class="visuals">
<img src="./About - Opera Software_files/pro-kompaniyu.jpg" alt="" width="900" height="424">
</figure>
<ul class="blocks col3">
<li>
<h3>Vision</h3>
<p>We strive to develop superior products and services for our users around the world, through state-of-the-art technology, innovation, leadership and partnerships.</p><p><a href="http://www.operasoftware.com/company/vision" target="_self">Find out more</a>.</p>
</li>
<li>
</ul>
</main>
<footer class="ns--hf">
<aside>
<div class="hf--extra">
<h2 class="hf--visuallyhidden">Page language</h2>
<div id="language" class="hf--language hf--hover-enabled hf--popup-container">
<input id="language-toggle" class="hf--popup-toggle hf--visuallyhidden" type="checkbox" aria-haspopup="true">
<label for="language-toggle" class="hf--popup-toggle-label" tabindex="0">
<span class="hf--hide-overflow">
<span class="">Select your language:</span>
<span class="">English</span>
</span>
</label>
</div>
</div>
</aside>
<div class="hf--meta hf--clearfix">
<small class="hf--company">Copyright © 2014 Opera Software ASA. All rights reserved.
<a data-action-id="footer_item" href="http://www.opera.com/privacy">Privacy.</a> <a data-action-id="footer_item" href="http://www.opera.com/terms">Terms of Use.</a>
</small>
</div>
</footer>
</body></html>
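
For what it's worth, here is a minimal sketch of the stack idea, assuming the standard library's html.parser module counts as a basic built-in module. The names Node, TreeBuilder, VOID_TAGS and parse_html are made up for illustration. This is not the browser algorithm from the W3C document: it only skips pushing void elements such as <meta>, <img> and <br>, and pops back to the matching open element on an end tag, so it does not infer omitted end tags the way a browser would.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML, so they are never pushed.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "param", "source", "track", "wbr"}


class Node:
    """One element of the tree: tag name, attributes, children, text."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = []

    def dump(self, depth=0):
        """Return the subtree as a list of indented tag names."""
        lines = ["  " * depth + self.tag]
        for child in self.children:
            lines.extend(child.dump(depth + 1))
        return lines


class TreeBuilder(HTMLParser):
    """Builds a Node tree, using a list as the stack of open elements."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # The new element is a child of whatever is on top of the stack.
        node = Node(tag, attrs, parent=self.stack[-1])
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this name. Anything
        # still open above it (e.g. an unclosed <li>) is closed as well.
        # An end tag with no matching open element is ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Keep non-blank text on the element that is currently open.
        if data.strip():
            self.stack[-1].text.append(data.strip())


def parse_html(source):
    """Parse an HTML string and return the root Node of the tree."""
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    return builder.root


if __name__ == "__main__":
    sample = "<html><body><ul><li>One<br>Two</li><li></ul></body></html>"
    print("\n".join(parse_html(sample).dump()))
```

Run on that small sample, it prints the tag names indented by depth: #document, html, body, ul, then the two li elements as siblings, with br nested under the first one. The doctype and comments in the real page are simply ignored, because the corresponding HTMLParser handlers are left at their default no-op behaviour.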