Strange re problem

Paul McGuire ptmcg at austin.rr.com
Fri Jun 20 10:35:52 EDT 2008


On Jun 20, 6:01 am, TYR <a.harrow... at gmail.com> wrote:
> OK, this ought to be simple. I'm parsing a large text file (originally
> a database dump) in order to process the contents back into a SQLite3
> database. The data looks like this:
>
> 'AAA','PF',-17.416666666667,-145.5,'Anaa, French Polynesia','Pacific/
> Tahiti','Anaa';'AAB','AU',-26.75,141,'Arrabury, Queensland,
> Australia','?','?';'AAC','EG',31.133333333333,33.8,'Al Arish,
> Egypt','Africa/Cairo','El Arish International';'AAE','DZ',
> 36.833333333333,8,'Annaba','Africa/Algiers','Rabah Bitat';
>
> which goes on for another 308 lines. As keen and agile minds will no
> doubt spot, the rows are separated by a ; so it should be simple to
> parse it using a regex. So, I establish a db connection and cursor,
> create the table, and open the source file.

Using pyparsing, you can skip all that "what happens if there is a
semicolon or comma inside a quoted string?" noise, and get the data in
a trice.  If you add results names (as I've done in the example), then
loading each record into your db should be equally simple.
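To see why the quoted commas matter, here is a small stdlib-only sketch (not pyparsing; the csv module is just a stand-in for any quote-aware parser). It uses the first record of the sample data with the line wrap removed.

```python
import csv

# One record from the sample data; the city field contains a comma.
row = ("'AAA','PF',-17.416666666667,-145.5,"
       "'Anaa, French Polynesia','Pacific/Tahiti','Anaa'")

# Naive splitting cuts the quoted city name in two: 8 pieces instead of 7.
naive = row.split(",")

# A quote-aware reader keeps the quoted field whole.
fields = next(csv.reader([row], quotechar="'"))

print(len(naive), len(fields))
print(fields[4])
```

The same problem applies to splitting rows on ';' if a semicolon ever shows up inside a quoted field.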

Here is a pyparsing extractor for you.  The parse actions already
convert the numbers to floats and strip the quotation marks.
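As a rough illustration of what the number pattern in the extractor accepts, here is the same regex with plain re. The to_float helper is hypothetical, written only for this sketch; in the extractor the conversion is done by the parse action.

```python
import re

# Same pattern as the extractor's number expression:
# optional minus sign, digits, optional fractional part.
NUM = re.compile(r'-?\d+(\.\d+)?')

def to_float(text):
    # Hypothetical helper: convert only if the whole string is a number.
    m = NUM.fullmatch(text)
    if m is None:
        raise ValueError("not a number: %r" % text)
    return float(m.group(0))

print(to_float("141"))
print(to_float("-145.5"))
```

Note that the pattern does not accept exponents or a leading '+', which is fine for the data shown here.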

-- Paul

data = """
'AAA','PF',-17.416666666667,-145.5,'Anaa, French Polynesia','Pacific/
Tahiti','Anaa';'AAB','AU',-26.75,141,'Arrabury, Queensland,
Australia','?','?';'AAC','EG',31.133333333333,33.8,'Al Arish,
Egypt','Africa/Cairo','El Arish International';'AAE','DZ',
36.833333333333,8,'Annaba','Africa/Algiers','Rabah Bitat';
""".splitlines()
data = "".join(data)

from pyparsing import *

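# A number: optional minus sign, digits, optional fractional part.
num = Regex(r'-?\d+(\.\d+)?')
num.setParseAction(lambda t: float(t[0]))
# A single-quoted string; the parse action strips the quotes.
qs = sglQuotedString.setParseAction(removeQuotes)
# Separators are matched but left out of the results.
CMA = Suppress(',')
SEMI = Suppress(';')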
# The first number in each row is the latitude, the second the longitude.
dataRow = qs("field1") + CMA + qs("field2") + CMA + \
    num("lat") + CMA + num("long") + CMA + qs("city") + CMA + \
    qs("tz") + CMA + qs("field7") + SEMI

for dr in dataRow.searchString(data):
    print dr.dump()
    print dr.city,dr.lat,dr.long

Prints:

['AAA', 'PF', -17.416666666666998, -145.5, 'Anaa, French Polynesia',
'Pacific/ Tahiti', 'Anaa']
- city: Anaa, French Polynesia
- field1: AAA
- field2: PF
- field7: Anaa
- lat: -17.4166666667
- long: -145.5
- tz: Pacific/ Tahiti
Anaa, French Polynesia -17.4166666667 -145.5
['AAB', 'AU', -26.75, 141.0, 'Arrabury, Queensland, Australia', '?',
'?']
- city: Arrabury, Queensland, Australia
- field1: AAB
- field2: AU
- field7: ?
- lat: -26.75
- long: 141.0
- tz: ?
Arrabury, Queensland, Australia -26.75 141.0
['AAC', 'EG', 31.133333333332999, 33.799999999999997, 'Al Arish,
Egypt', 'Africa/Cairo', 'El Arish International']
- city: Al Arish, Egypt
- field1: AAC
- field2: EG
- field7: El Arish International
- lat: 31.1333333333
- long: 33.8
- tz: Africa/Cairo
Al Arish, Egypt 31.1333333333 33.8
['AAE', 'DZ', 36.833333333333002, 8.0, 'Annaba', 'Africa/Algiers',
'Rabah Bitat']
- city: Annaba
- field1: AAE
- field2: DZ
- field7: Rabah Bitat
- lat: 8.0
- long: 36.8333333333
- tz: Africa/Algiers
Annaba 36.8333333333 8.0
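
Loading the parsed records into SQLite could then look roughly like this. It is only a sketch: the table name and column names are my own guesses (nothing in the original post defines them), and the two tuples are typed in from the sample data rather than taken from the parser, so the block runs without pyparsing.

```python
import sqlite3

# Two records from the sample data. With pyparsing, these values would
# come from the named results of each parsed row instead.
# The first number is the latitude (Anaa lies at about 17 degrees south).
records = [
    ("AAA", "PF", -17.416666666667, -145.5,
     "Anaa, French Polynesia", "Pacific/Tahiti", "Anaa"),
    ("AAB", "AU", -26.75, 141.0,
     "Arrabury, Queensland, Australia", "?", "?"),
]

# In-memory database for the sketch; a real script would open a file.
conn = sqlite3.connect(":memory:")

# Table and column names are assumptions made for this example.
conn.execute(
    "CREATE TABLE airports ("
    "code TEXT, country TEXT, latitude REAL, longitude REAL, "
    "city TEXT, tz TEXT, name TEXT)"
)

# Placeholders let sqlite3 handle quoting, so commas or quotes inside
# the city names need no special treatment.
conn.executemany("INSERT INTO airports VALUES (?, ?, ?, ?, ?, ?, ?)", records)
conn.commit()

for code, city in conn.execute("SELECT code, city FROM airports ORDER BY code"):
    print(code, city)
```

Using '?' placeholders rather than building the SQL string by hand avoids a second round of quoting trouble on the way into the database.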


