[Tutor] parse text file

Tue Feb 2 23:36:26 CET 2010

On Tue, Feb 2, 2010 at 4:56 PM, Norman Khine <norman at khine.net> wrote:
> On Tue, Feb 2, 2010 at 10:11 PM, Kent Johnson <kent37 at tds.net> wrote:

>> Try this version:
>>
>> data = file.read()
>>
>> get_records = re.compile(r"""openInfoWindowHtml\(.*?\ticon: myIcon\n""", re.DOTALL).findall
>> get_titles = re.compile(r"""<strong>(.*)<\/strong>""").findall
>> get_urls = re.compile(r"""a href=\"\/(.*)\">En savoir plus""").findall
>> get_latlngs = re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\n\s*(\-?\d+\.\d*)\)""").findall
>>
>> then as before.
>>
>> Your repr() call is essentially removing newlines from the input by
>> converting them to literal '\n' pairs. This allows your regex to work
>> without the DOTALL modifier.
>>
>> Note you will get slightly different results with my version - it will
>> give you correct utf-8 text for the titles whereas yours gives \
>> escapes. For example one of the titles is "CGTSM (Satére Mawé)". Your
>> version returns
>>
>> {'url': 'cgtsm-satere-mawe.html', 'lating': ('-2.77804',
>> '-79.649735'), 'title': 'CGTSM (Sat\\xe9re Maw\\xe9)'}
>>
>> Mine gives
>> {'url': 'cgtsm-satere-mawe.html', 'lating': ('-2.77804',
>> '-79.649735'), 'title': 'CGTSM (Sat\xc3\xa9re Maw\xc3\xa9)'}
>>
>> This is showing the repr() of the title so they both have \ but note
>> that yours has two \\ indicating that the \ is in the text; mine has
>> only one \.
>
> i am no expert, but there seems to be a bigger difference.
>
> with repr(), i get:
> Sat\\xe9re Maw\\xe9
>
> whereas you get
>
> Sat\xc3\xa9re Maw\xc3\xa9
>
> repr()'s
> é == \\xe9
> whereas on your version
> é == \xc3\xa9

Right. Your version has four actual characters in the result: \, x,
e, 9. This is the escaped form of the Unicode code point for e-acute
(U+00E9). (The \ is doubled in the repr display.)

My version has two bytes in the result, with the values c3 and a9.
This is the utf-8 representation of e-acute.

If you want to accurately represent (i.e. print) the title at some
later time you probably want the utf-8 representation.
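
To make the difference concrete, here is a rough sketch. It is written
for Python 3 (not the Python 2 used elsewhere in this thread), and the
variable names are made up for illustration. It only compares the two
forms of the title; it is not the parsing code itself.

```python
# Python 3 sketch: escaped text versus UTF-8 bytes for e-acute.
title = "Sat\u00e9re"  # the text 'Satére', six characters

# Escaped form: a backslash followed by x, e, 9 (four characters for one letter).
escaped = "Sat\\xe9re"

# UTF-8 form: e-acute becomes the two bytes c3 a9.
utf8_bytes = title.encode("utf-8")

# The escaped form is plain text; it has to be un-escaped before it prints as é.
unescaped = escaped.encode("ascii").decode("unicode_escape")

print(repr(escaped))                # the backslash shows doubled in repr
print(utf8_bytes)                   # the two bytes show as \xc3\xa9
print(utf8_bytes.decode("utf-8"))   # decodes straight back to the title
```

The utf-8 bytes decode directly to the original title, whereas the
escaped form needs an extra un-escaping step first.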
>
>>
>> Kent
>>
>
> also, i still get an empty list when i run the code as suggested.

You didn't change the regexes. You have to change \\t and \\n to \t
and \n because the source text now has actual tabs and newlines, not
the escaped representations.
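
A small sketch of why the patterns have to change, assuming Python 3
and a made-up one-record sample in place of the real page source (the
sample string and variable names are hypothetical):

```python
import re

# Hypothetical sample standing in for one record of the real file.
raw = "openInfoWindowHtml(<strong>Title</strong>\n\ticon: myIcon\n"

# repr() turns each real newline/tab into the two characters \n and \t.
escaped = repr(raw)

# For the raw text: \t and \n match a real tab and newline,
# and DOTALL lets . match across the newline inside the record.
raw_pattern = re.compile(r"openInfoWindowHtml\(.*?\ticon: myIcon\n", re.DOTALL)

# For the repr() text: \\t and \\n match a literal backslash plus t or n.
escaped_pattern = re.compile(r"openInfoWindowHtml\(.*?\\ticon: myIcon\\n")

print(len(raw_pattern.findall(raw)))          # pattern and text agree
print(len(escaped_pattern.findall(escaped)))  # pattern and text agree
print(len(escaped_pattern.findall(raw)))      # mismatch: empty list
print(len(raw_pattern.findall(escaped)))      # mismatch: empty list
```

The third call is the situation described above: the old \\t / \\n
pattern applied to text that now contains real tabs and newlines finds
nothing.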

I know this is confusing. I'm sorry I don't have the time or patience
to explain more.

Kent