[Tutor] parse text file

Norman Khine norman at khine.net
Mon Feb 1 00:43:59 CET 2010


Hello,
I am still unable to get this to work correctly!

In [1]: file=open('producers_google_map_code.txt', 'r')

In [2]: data =  repr( file.read().decode('utf-8') )

In [3]: from BeautifulSoup import BeautifulStoneSoup

In [4]: soup = BeautifulStoneSoup(data)

In [6]: soup

http://paste.lisp.org/display/94195

In [7]: import re

In [8]: p = re.compile(r"""GLatLng\((\d+\.\d*)\, \n (\d+\.\d*)\)""")

In [9]: r = p.findall(data)

In [10]: r
Out[10]: []

see http://paste.lisp.org/+20BO/1

I can't seem to get the regex correct:

(r"""GLatLng\((\d+\.\d*)\, \n (\d+\.\d*)\)""")

The problem is that each entry looks like this, for example:

GLatLng(27.729912,\\n                                  85.31559)
GLatLng(-18.889851,\\n                                  -66.770897)

There is a long run of whitespace, and each group can have a negative
value. If I do this:

In [31]: p = re.compile(r"""GLatLng\((\d+\.\d*)\,\\n
               (\d+\.\d*)\)""")

In [32]: r = p.findall(data)

In [33]: r
Out[33]:
[('27.729912', '85.31559'),
 ('9.696333', '122.985992'),
 ('17.964625', '102.60040'),
 ('21.046439', '105.853043'),

but this does not account for data that has negative values. I am
also unsure how to pull it all together, i.e. how to return a CSV
file such as:

"ACP", "acp.html", "9.696333", "122.985992"
"ALTER TRADE CORPORATION", "alter-trade-corporation.html", "-18.889851", "-66.770897"
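
A minimal sketch of one way to cover both points, written for Python 3
rather than the Python 2 session above. It rests on two assumptions.
First, the text between the comma and the second number is some mix of
real whitespace and literal backslash-n sequences (repr() turns each
newline into the two characters backslash and n, which is why a
pattern containing \n found nothing). Second, the producer names and
page names have already been extracted in the same order as the
coordinates; find_coords, to_csv, names and pages are hypothetical
names used only for illustration.

```python
import csv
import io
import re

# Optional minus sign, digits, a dot, optional digits.
NUMBER = r"-?\d+\.\d*"

# After the comma, accept any mix of real whitespace and literal
# backslash-n sequences (one or more backslashes followed by "n").
COORDS = re.compile(
    r"GLatLng\(\s*(" + NUMBER + r")\s*,(?:\\+n|\s)*(" + NUMBER + r")\s*\)"
)


def find_coords(text):
    """Return a list of (latitude, longitude) string pairs found in text."""
    return COORDS.findall(text)


def to_csv(names, pages, coords):
    """Return CSV text with one fully quoted row per producer.

    names and pages are hypothetical inputs: how they are pulled out of
    the page is not shown, only that they line up with coords.
    """
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for name, page, (lat, lng) in zip(names, pages, coords):
        writer.writerow([name, page, lat, lng])
    return out.getvalue()


# Sample text imitating the escaped form shown above.
sample = (
    r"GLatLng(9.696333,\\n                 122.985992)"
    "\n"
    r"GLatLng(-18.889851,\\n                 -66.770897)"
)

coords = find_coords(sample)
print(coords)
print(to_csv(["ACP", "ALTER TRADE CORPORATION"],
             ["acp.html", "alter-trade-corporation.html"],
             coords))
```

The csv module writes no space after each comma, so the rows differ
slightly from the hand-written sample. If repr() is dropped when
reading the file, the data keeps its real newlines and the same
pattern still matches, because \s covers them.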

Thanks


On Sat, Jan 23, 2010 at 12:50 AM, spir <denis.spir at free.fr> wrote:
> On Sat, 23 Jan 2010 00:22:41 +0100
> Norman Khine <norman at khine.net> wrote:
>
>> Hi
>>
>> On Fri, Jan 22, 2010 at 7:44 PM, spir <denis.spir at free.fr> wrote:
>> > On Fri, 22 Jan 2010 14:11:42 +0100
>> > Norman Khine <norman at khine.net> wrote:
>> >
>> >> but my problem comes when i try to list the GLatLng:
>> >>
>> >> GLatLng(9.696333, 122.985992);
>> >>
>> >> >>> StartingWithGLatLng = soup.findAll(re.compile('GLatLng'))
>> >> >>> StartingWithGLatLng
>> >> []
>> >
>> > I don't know about soup's findAll, but the regex pattern string should rather be something like (untested):
>> >   r"""GLatLng\((\d+\.\d*)\, (\d+\.\d*)\)"""
>> > capturing both numbers.
>> >
>> > Denis
>> >
>> > PS: finally tested:
>> >
>> > import re
>> > s = "GLatLng(9.696333, 122.985992)"
>> > p = re.compile(r"""GLatLng\((\d+\.\d*)\, (\d+\.\d*)\)""")
>> > r = p.match(s)
>> > print r.group()         # --> GLatLng(9.696333, 122.985992)
>> > print r.groups()        # --> ('9.696333', '122.985992')
>> >
>> > s = "xGLatLng(1.1, 11.22)xxxGLatLng(111.111, 1111.2222)x"
>> > r = p.findall(s)
>> > print r                         # --> [('1.1', '11.22'), ('111.111', '1111.2222')]
>>
>> Thanks for the help, but I can't seem to get the RegEx to work correctly.
>>
>> Here is my input and output:
>>
>> http://paste.lisp.org/+20BO/1
>
> See my previous examples...
> If you use match:
>
> In [6]: r = p.match(data)
>
> Then the result is a regex match object (unlike when using findall). To get the matched string(s), you need to use the group() and/or groups() methods.
>
> >>> import re
> >>> p = re.compile('x')
> >>> print p.match("xabcx")
> <_sre.SRE_Match object at 0xb74de6e8>
> >>> print p.findall("xabcx")
> ['x', 'x']
>
> Denis
> ________________________________
>
> la vita e estrany
>
> http://spir.wikidot.com/
>
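
To make the match()/findall() distinction in the quoted reply concrete,
here is a small sketch in Python 3 syntax. The pattern is a variant of
the one quoted above, with an optional minus sign and flexible
whitespace added as assumptions. It also shows search(), which is not
mentioned in the thread but is the standard re call for finding the
first occurrence anywhere in the string.

```python
import re

p = re.compile(r"GLatLng\((-?\d+\.\d*),\s*(-?\d+\.\d*)\)")
s = "xGLatLng(1.1, 11.22)xxxGLatLng(-111.111, 1111.2222)x"

at_start = p.match(s)   # match() only tries at position 0, so this is None
first = p.search(s)     # search() scans forward and returns a match object
every = p.findall(s)    # findall() returns a list of group tuples

print(at_start)         # None
print(first.group())    # GLatLng(1.1, 11.22)
print(first.groups())   # ('1.1', '11.22')
print(every)            # [('1.1', '11.22'), ('-111.111', '1111.2222')]
```

Because match() returns None when the text does not begin with the
pattern, calling group() on its result without checking first raises
an AttributeError.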

