[SciPy-User] Column extraction from line based text

Fri May 2 05:24:41 EDT 2014

Hello Patrick

In general, you can use "Regular Expressions" to extract your fields and
re-format them in any way you like.

I don't think that this is a "homogeneous" structure though. The postal
address seems to have been formatted following a specific template of

number (maybe character, maybe numeric), street name (could go up to 3
space separated tokens) coma (sometimes) city, postcode

Furthermore: street name -> street (any number of space separated tokens)
coma(always?) location (Floor, Suite, other? followed by a number (always?))

The city and postcode fields are the easiest to extract. For example, you
could establish a regular expression like ".*? [,]? ("Los Angeles"|"New
York"), [A-Z]{2}[0-9]{5}" which is roughly translated to "Match in a
non-greedy way everything up to an optional coma followed by something that
looks like a city name followed by a coma followed by a postcode".

You can extract such fields with FROMREGEXP (
http://docs.scipy.org/doc/numpy/reference/generated/numpy.fromregex.html)
or pyparsing (http://pyparsing.wikispaces.com/)

As you can see, some things have to be kept or assumed "fixed".

For example, you might have to run a quick check of how many cities you
have in your dataset or at least, how many distinct words you expect a city
to occupy to be able to describe roughly how does a city name (or other
field) looks like. The same applies when you try to fix the way that the
location is being described. Is it always Floor? Could it be for example
"block, Block, Floor, floor, Apartment, Apt, apt, apartment, room, Room,
RM, rm, ....". Could it be 1st Floor? Maybe Floor 1, Floor 2 and so on.

Hope this helps.

All the best.

On Fri, May 2, 2014 at 9:56 AM, Patrick Westphal <
patrick.westphal at informatik.uni-leipzig.de> wrote:

> Hello Scipy users,
>
> I have to deal with messy line based text files that have certain
> fields, e.g.
>
> One Bowling Green, 3rd Floor New York, NY 10004
> 228 Park Ave S New York, NY 10003-1502
> 85 West Street at Albany Street  New York, NY 10006
> 13031 West Jefferson Boulevard, Suite 200 Los Angeles, CA 90094
> 550 S Flower St, Los Angeles, CA 90071
>
> and I'd like to have these columns extracted
>
>       | One Bowling Green, 3rd Floor        | New York    | NY | 10004
>   228| Park Ave S                          | New York    | NY | 10003-1502
>    85| West Street at Albany Street        | New York    | NY | 10006
> 13031 | West Jefferson Boulevard, Suite 200 | Los Angeles | CA | 90094
>   550 | S Flower St                         | Los Angeles | CA | 90071
>
> I searched quite a lot and found projects like Asciitable [0], but these
> were not built to find homogeneous column structures by combining
> multiple line tokens, as e.g. in 'One Bowling Green, 3rd Floor', but
> rather try to generate one column per token.
>
> My question is, if you are aware of any projects that better suit my
> task, or if you could give me some hints where and what to search.
>
> Thanks in advance,
> Patrick
>
> [0] http://cxc.cfa.harvard.edu/contrib/asciitable/
> _______________________________________________
> SciPy-User mailing list
> SciPy-User at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.scipy.org/pipermail/scipy-user/attachments/20140502/4e51d0a1/attachment.html>