[Tutor] parse text file

Norman Khine norman at khine.net
Tue Feb 2 10:16:15 CET 2010


thanks denis,

On Tue, Feb 2, 2010 at 9:30 AM, spir <denis.spir at free.fr> wrote:
> On Mon, 1 Feb 2010 16:30:02 +0100
> Norman Khine <norman at khine.net> wrote:
>
>> On Mon, Feb 1, 2010 at 1:19 PM, Kent Johnson <kent37 at tds.net> wrote:
>> > On Mon, Feb 1, 2010 at 6:29 AM, Norman Khine <norman at khine.net> wrote:
>> >
>> >> thanks, what about the whitespace problem?
>> >
>> > \s* will match any amount of whitespace, including newlines.
>>
>> thank you, this worked well.
>>
>> here is the code:
>>
>> ###
>> import re
>> file=open('producers_google_map_code.txt', 'r')
>> data =  repr( file.read().decode('utf-8') )
>>
>> block = re.compile(r"""openInfoWindowHtml\(.*?\\ticon: myIcon\\n""")
>> b = block.findall(data)
>> block_list = []
>> for html in b:
>>       namespace = {}
>>       t = re.compile(r"""<strong>(.*)<\/strong>""")
>>       title = t.findall(html)
>>       for item in title:
>>               namespace['title'] = item
>>       u = re.compile(r"""a href=\"\/(.*)\">En savoir plus""")
>>       url = u.findall(html)
>>       for item in url:
>>               namespace['url'] = item
>>       g = re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\\n\s*(\-?\d+\.\d*)\)""")
>>       lat = g.findall(html)
>>       for item in lat:
>>               namespace['LatLng'] = item
>>       block_list.append(namespace)
>>
>> ###
>>
>> can this be made better?
>
> The 3 regex patterns are constants: they can be moved out of the loop.
>
> You may also rename b to blocks, and find a more accurate name for block_list, e.g. block_records, where a record is a set of (named) fields.
>
> A short description and/or example of the overall and partial data formats can greatly help later review, since regex patterns alone are hard to decode.

here are the changes:

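import re

# work on the repr() of the text, so that newlines and tabs show up as
# the literal two-character sequences \n and \t that the patterns match
f = open('producers_google_map_code.txt', 'r')
data = repr(f.read().decode('utf-8'))
f.close()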

get_record = re.compile(r"""openInfoWindowHtml\(.*?\\ticon: myIcon\\n""")
get_title = re.compile(r"""<strong>(.*)<\/strong>""")
get_url = re.compile(r"""a href=\"\/(.*)\">En savoir plus""")
get_latlng = re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\\n\s*(\-?\d+\.\d*)\)""")

records = get_record.findall(data)
block_record = []
for record in records:
	namespace = {}
	titles = get_title.findall(record)
	for title in titles:
		namespace['title'] = title
	urls = get_url.findall(record)
	for url in urls:
		namespace['url'] = url
	latlngs = get_latlng.findall(record)
	for latlng in latlngs:
		namespace['latlng'] = latlng
	block_record.append(namespace)

print block_record
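
to document the data format, as suggested, here is a small sketch of what the patterns expect. the sample text is hypothetical: i reconstructed it from the patterns, it is not copied from the real file, and the real records may well be laid out differently.

```python
import re

# Hypothetical input, reconstructed from the patterns (not the real file).
sample = (
    "marker.openInfoWindowHtml('<strong>Example Farm</strong>"
    "<a href=\"/producers/example-farm\">En savoir plus</a>');\n"
    "var point = new GLatLng(45.5,\n"
    "    -1.25);\n"
    "var marker = new GMarker(point, {\n"
    "\ticon: myIcon\n"
    "});\n"
)

# repr() turns real newline and tab characters into the two-character
# sequences \n and \t; that is what \\n and \\t in the patterns match.
data = repr(sample)

get_record = re.compile(r"""openInfoWindowHtml\(.*?\\ticon: myIcon\\n""")
get_title = re.compile(r"""<strong>(.*)<\/strong>""")
get_url = re.compile(r"""a href=\"\/(.*)\">En savoir plus""")
get_latlng = re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\\n\s*(\-?\d+\.\d*)\)""")

record = get_record.findall(data)[0]
title = get_title.findall(record)[0]    # text inside <strong>...</strong>
url = get_url.findall(record)[0]        # link path without the leading /
latlng = get_latlng.findall(record)[0]  # (latitude, longitude) as strings
```

note that \s* only absorbs the real spaces after the literal \n here; a tab in that position would have become \t in the repr() and would not match.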
>
> The def of "namespace" would be clearer imo in a single line:
>    namespace = {'title': t, 'url': url, 'lat': g}

i am not sure how this will fit into the code!
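
one way it might fit, as a rough sketch only: build each dict in a single expression from the first match of each pattern. first_match and parse_records are names i made up for this example. it is not quite the same as the loops above, which keep the last match and leave the key out when nothing matches; here a missing field becomes None.

```python
import re

get_record = re.compile(r"""openInfoWindowHtml\(.*?\\ticon: myIcon\\n""")
get_title = re.compile(r"""<strong>(.*)<\/strong>""")
get_url = re.compile(r"""a href=\"\/(.*)\">En savoir plus""")
get_latlng = re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\\n\s*(\-?\d+\.\d*)\)""")


def first_match(pattern, text):
    """Return the first findall() result for pattern, or None if there is none."""
    found = pattern.findall(text)
    return found[0] if found else None


def parse_records(data):
    """Return one dict per record found in data (the repr() of the file text)."""
    records = []
    for record in get_record.findall(data):
        # The whole namespace is defined in one expression.
        records.append({
            'title': first_match(get_title, record),
            'url': first_match(get_url, record),
            'latlng': first_match(get_latlng, record),
        })
    return records
```

with the data variable from the script above, block_record = parse_records(data) would replace the whole for loop.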

> This also reveals a kind of name confusion, doesn't it?
>
>
> Denis

