[Tutor] parsing

Ismael Garrido ismaelgf at adinet.com.uy
Thu Jul 13 02:04:15 CEST 2006


Сергій wrote:
> First, please excuse my English. It is not my native language, but I hope
> that I will be able to describe my problem.
>
> I am new to Python for the web, and I want to do the following:
>
> Suppose I have an HTML page like this:
> """
> <title>TITLE</title>
> <body>
> body_1
> <h1>1_1</h1>
> <h2>2_1</h2>
> <div id=one>div_one_1</div>
> <p>p_1</p>
> <p>p_2</p>
> <div id=one>div_one_2</div>
> <span class=sp_1>
> sp_text
> <div id=one>div_one_2</div>
> <div id=one>div_one_3</div>
> </span>
> <h3>3_1</h3>
> <h2>2_2</h2>
> <p>p_3</p>
> body_2
> <h1>END</h1>
> <table>
> <tr><td>td_1</td>
> <td class=sp_2>td_2</td>
> <td>td_3</td>
> <td>td_4</td></tr>
> ...
> </body>
>
> """
>
> I want to collect all the information from this HTML into a list of
> dictionaries that looks like this:
>
> rezult = [{'title':['TITLE']},
> {'body':['body_1', 'body_2']},
> {'h1':['1_1', 'END']},
> {'h2':['2_1', '2_2']},
> {'h3':['3_1']},
> {'p':['p_1', 'p_2']},
> {'id_one':['div_one_1', 'div_one_2', 'div_one_3']},
> {'span_sp_1':['sp_text']},
> {'td':['td_1', 'td_3', 'td_4']},
> {'td_sp_2':['td_2']},
> ....
> ]
>
> I hope you understand what I need.
> Can you advise me on what approaches exist for solving tasks of this kind,
> and maybe show some practical examples?
> Thanks in advance for any help.

Try ElementTree or Amara.
http://effbot.org/zone/element-index.htm
http://uche.ogbuji.net/tech/4suite/amara/
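
A minimal sketch of the ElementTree route, under one loud assumption: the
page has first been made well-formed XML (quoted attribute values, every tag
closed, a single root element). The page as posted is not well-formed, and
ElementTree would reject it. PAGE and collect_text are hypothetical names
used only for illustration, and the key naming (id_<value>, <tag>_<class>)
is copied from the desired result in the question.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, well-formed variant of part of the page from the question:
# attribute values are quoted and every tag is closed.
PAGE = """<html>
<title>TITLE</title>
<body>
<h1>1_1</h1>
<div id="one">div_one_1</div>
<p>p_1</p>
<p>p_2</p>
<div id="one">div_one_2</div>
<h1>END</h1>
</body>
</html>"""


def collect_text(xml_source):
    """Group the leading text of each element under a key built from its tag."""
    result = defaultdict(list)
    for elem in ET.fromstring(xml_source).iter():
        text = (elem.text or "").strip()
        if not text:
            continue
        # Key naming follows the example in the question.
        if "id" in elem.attrib:
            key = "id_" + elem.attrib["id"]
        elif "class" in elem.attrib:
            key = elem.tag + "_" + elem.attrib["class"]
        else:
            key = elem.tag
        result[key].append(text)
    return dict(result)


result = collect_text(PAGE)
print(result)
```

Note that elem.text only holds the text before an element's first child.
Text that follows a child element (such as body_2 in the original page) is
stored in that child's .tail attribute and is not collected by this sketch.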

If you only care about the contents, BeautifulSoup is the answer.
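
To illustrate the "contents only" approach without installing anything, here
is a rough sketch that swaps in the standard library's html.parser module in
place of BeautifulSoup. It is not BeautifulSoup's API. TextCollector,
collect_text, VOID_TAGS and PAGE are hypothetical names, PAGE is a shortened
copy of the page from the question, and the key naming is taken from the
desired result there.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TextCollector(HTMLParser):
    """Hypothetical helper: files each piece of text under its enclosing tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # (tag, key) pairs for currently open elements
        self.result = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        # Key naming (id_<value>, <tag>_<class>) mirrors the question.
        if attrs.get("id"):
            key = "id_" + attrs["id"]
        elif attrs.get("class"):
            key = tag + "_" + attrs["class"]
        else:
            key = tag
        self.open_tags.append((tag, key))

    def handle_endtag(self, tag):
        # Close the nearest matching open tag, along with anything left
        # unclosed inside it; stray end tags are ignored.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                del self.open_tags[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.open_tags:
            self.result[self.open_tags[-1][1]].append(text)


def collect_text(html_source):
    parser = TextCollector()
    parser.feed(html_source)
    parser.close()
    return dict(parser.result)


PAGE = """
<title>TITLE</title>
<body>
body_1
<h1>1_1</h1>
<div id=one>div_one_1</div>
<p>p_1</p>
<span class=sp_1>
sp_text
<div id=one>div_one_2</div>
</span>
body_2
<h1>END</h1>
<table>
<tr><td>td_1</td>
<td class=sp_2>td_2</td></tr>
</table>
</body>
"""

result = collect_text(PAGE)
print(result)
```

This returns a single dictionary rather than the list of one-key
dictionaries shown in the question; a single dictionary is usually easier to
look things up in. Unlike ElementTree, this parser tolerates the unquoted
attributes in the original page.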

Ismael

