[Tutor] parsing

Сергій kyxaxa at gmail.com
Wed Jul 12 23:41:06 CEST 2006


First, please excuse my English. It is not my native language, but I hope
that I will be able to describe my problem.

I am new to Python for the web, and I want to do the following:

Suppose I have an HTML page like this:
"""
<title>TITLE</title>
<body>
body_1
<h1>1_1</h1>
<h2>2_1</h2>
<div id=one>div_one_1</div>
<p>p_1</p>
<p>p_2</p>
<div id=one>div_one_2</div>
<span class=sp_1>
sp_text
<div id=one>div_one_2</div>
<div id=one>div_one_3</div>
</span>
<h3>3_1</h3>
<h2>2_2</h2>
<p>p_3</p>
body_2
<h1>END</h1>
<table>
<tr><td>td_1</td>
<td class=sp_2>td_2</td>
<td>td_3</td>
<td>td_4</td></tr>
...
</body>

"""

I want to collect all the text from this HTML into a list of dictionaries that looks like this:

rezult = [{'title':['TITLE']},
{'body':['body_1', 'body_2']},
{'h1':['1_1', 'END']},
{'h2':['2_1', '2_2']},
{'h3':['3_1']},
{'p':['p_1', 'p_2']},
{'id_one':['div_one_1', 'div_one_2', 'div_one_3']},
{'span_sp_1':['sp_text']},
{'td':['td_1', 'td_3', 'td_4']},
{'td_sp_2':['td_2']},
....
]
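
To make the goal more concrete, here is a minimal sketch of one possible approach, assuming Python 3 and its standard `html.parser` module. The class name `TextByTag`, the helper `parse`, and the key-naming rule are hypothetical, guessed from the desired output above. The sketch returns a single dictionary rather than a list of one-key dictionaries, and it assumes every opened tag is eventually closed (an unclosed `<br>` or `<p>` would capture the text that follows it).

```python
from collections import defaultdict
from html.parser import HTMLParser


class TextByTag(HTMLParser):
    """Collect text under a key derived from the innermost open tag."""

    def __init__(self):
        super().__init__()
        self.stack = []                  # (tag, key) pairs of currently open tags
        self.result = defaultdict(list)  # key -> list of text fragments

    @staticmethod
    def make_key(tag, attrs):
        # Hypothetical naming rule guessed from the desired output:
        # id -> 'id_<id>', class -> '<tag>_<class>', otherwise the tag name.
        attrs = dict(attrs)
        if attrs.get("id"):
            return "id_" + attrs["id"]
        if attrs.get("class"):
            return tag + "_" + attrs["class"]
        return tag

    def handle_starttag(self, tag, attrs):
        self.stack.append((tag, self.make_key(tag, attrs)))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # File each non-blank line of text under the innermost open tag.
        if not self.stack:
            return
        key = self.stack[-1][1]
        for line in data.splitlines():
            if line.strip():
                self.result[key].append(line.strip())


def parse(html):
    parser = TextByTag()
    parser.feed(html)
    parser.close()
    return dict(parser.result)
```

Calling `parse(page)` on the page text would give a mapping such as `{'title': [...], 'body': [...], 'id_one': [...]}`. A third-party parser that tolerates broken markup may be a better fit for real pages; this sketch only shows the shape of the idea.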

I hope you understand what I need.
Can you advise me on what approaches exist for solving tasks of this type, and
maybe show some practical examples?
Thanks in advance for any help.

