[Tutor] parse text file

Norman Khine norman at khine.net
Wed Feb 3 09:35:04 CET 2010


On Tue, Feb 2, 2010 at 11:36 PM, Kent Johnson <kent37 at tds.net> wrote:
> On Tue, Feb 2, 2010 at 4:56 PM, Norman Khine <norman at khine.net> wrote:
>> On Tue, Feb 2, 2010 at 10:11 PM, Kent Johnson <kent37 at tds.net> wrote:
>
>>> Try this version:
>>>
>>> data = file.read()
>>>
>>> get_records = re.compile(r"""openInfoWindowHtml\(.*?\ticon: myIcon\n""", re.DOTALL).findall
>>> get_titles = re.compile(r"""<strong>(.*)<\/strong>""").findall
>>> get_urls = re.compile(r"""a href=\"\/(.*)\">En savoir plus""").findall
>>> get_latlngs = re.compile(r"""GLatLng\((\-?\d+\.\d*)\,\n\s*(\-?\d+\.\d*)\)""").findall
>>>
>>> then as before.
>>>
>>> Your repr() call is essentially removing newlines from the input by
>>> converting them to literal '\n' pairs. This allows your regex to work
>>> without the DOTALL modifier.
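
A minimal sketch of the DOTALL point, assuming Python 3 and a made-up
sample string (not the actual page data). The names text, without_flag,
with_dotall and flattened are only for illustration:

```python
import re

# Hypothetical multi-line input standing in for the real page source.
text = "start(\nline one\nline two\n)end"

# Without DOTALL, '.' stops at a newline, so nothing matches.
without_flag = re.findall(r"start\(.*?\)end", text)

# With DOTALL, '.' also matches newlines, so the whole record matches.
with_dotall = re.findall(r"start\(.*?\)end", text, re.DOTALL)

# repr() turns each newline into the two characters '\' and 'n',
# so the same pattern matches even without DOTALL.
flattened = re.findall(r"start\(.*?\)end", repr(text))

print(without_flag)
print(with_dotall)
print(flattened)
```

The price of the repr() route is that every other non-printable or
non-ASCII character in the data gets escaped as well.
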
>>>
>>> Note you will get slightly different results with my version - it will
>>> give you correct utf-8 text for the titles whereas yours gives \
>>> escapes. For example one of the titles is "CGTSM (Satére Mawé)". Your
>>> version returns
>>>
>>> {'url': 'cgtsm-satere-mawe.html', 'lating': ('-2.77804',
>>> '-79.649735'), 'title': 'CGTSM (Sat\\xe9re Maw\\xe9)'}
>>>
>>> Mine gives
>>> {'url': 'cgtsm-satere-mawe.html', 'lating': ('-2.77804',
>>> '-79.649735'), 'title': 'CGTSM (Sat\xc3\xa9re Maw\xc3\xa9)'}
>>>
>>> This is showing the repr() of the title so they both have \ but note
>>> that yours has two \\ indicating that the \ is in the text; mine has
>>> only one \.
>>
>> i am no expert, but there seems to be a bigger difference.
>>
>> with repr(), i get:
>> Sat\\xe9re Maw\\xe9
>>
>> whereas you get
>>
>> Sat\xc3\xa9re Maw\xc3\xa9
>>
>> repr()'s
>> é == \\xe9
>> whereas on your version
>> é == \xc3\xa9
>
> Right. Your version has four actual characters in the result - \, x,
> e, 9. This is the escaped representation of the unicode representation
> of e-acute. (The \ is doubled in the repr display.)
>
> My version has two bytes in the result, with the values c3 and a9.
> This is the utf-8 representation of e-acute.
>
> If you want to accurately represent (i.e. print) the title at some
> later time you probably want the utf-8 representation.
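
The difference between the two results can be sketched as follows,
assuming Python 3 (the thread itself is about Python 2 byte strings).
The variable names are made up, and the unicode_escape codec is used
here only to reproduce the escaped form; it is not what the original
script did:

```python
# The title from the thread, written with \u escapes so the source
# file encoding does not matter.
title = "Sat\u00e9re Maw\u00e9"

# UTF-8: each e-acute becomes the two bytes c3 a9.
utf8_bytes = title.encode("utf-8")

# Escaped form: each e-acute becomes the four characters \ x e 9.
escaped_text = title.encode("unicode_escape").decode("ascii")

print(repr(utf8_bytes))    # shows \xc3\xa9 with a single backslash
print(repr(escaped_text))  # shows \\xe9, the backslash is real text
print(utf8_bytes.decode("utf-8") == title)
```

Only the UTF-8 bytes decode straight back to the original title; the
escaped text needs a separate un-escaping step first.
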
>>
>>>
>>> Kent
>>>
>>
>> also, i still get an empty list when i run the code as suggested.
>
> You didn't change the regexes. You have to change \\t and \\n to \t
> and \n because the source text now has actual tabs and newlines, not
> the escaped representations.
>
> I know this is confusing; I'm sorry I don't have time or patience to
> explain more.

thanks for your time, i did realise after i posted the email that the
regex needed to be changed.
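
A minimal sketch of that change, assuming Python 3 and a made-up
sample line rather than the real page data. The names raw, escaped,
pattern_for_raw and pattern_for_repr are only for illustration:

```python
import re

# Hypothetical sample with a real tab and a real newline.
raw = "\ticon: myIcon\n"

# repr() replaces them with the two-character sequences \t and \n.
escaped = repr(raw)

# For the raw text, \t and \n in the pattern mean tab and newline.
pattern_for_raw = re.compile(r"\ticon: myIcon\n")

# For the repr() text, \\t and \\n mean a literal backslash plus letter.
pattern_for_repr = re.compile(r"\\ticon: myIcon\\n")

print(bool(pattern_for_raw.search(raw)))       # matches
print(bool(pattern_for_repr.search(escaped)))  # matches
print(bool(pattern_for_repr.search(raw)))      # no match: the empty-list case
print(bool(pattern_for_raw.search(escaped)))   # no match
```
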

>
> Kent
>

