[Tutor] Suggestions Please

Martin A. Brown martin at linux-ip.net
Tue Oct 7 06:17:15 CEST 2014


Greetings Phillip,

> I am trying to decide if Python is the right toolset for me.

It might be.  That depends on you and also the environment in which 
you operate.

> I do a lot of data analytics. Over the years I have used a lot of 
> SQL and VBA, but the data sources are getting bigger.

Data sets have a very annoying habit of growing.

> I am thinking Python may be what I need to use, but I am in the 
> early stages of getting to know Python.  Can you point me to a 
> really good, intuitive resource for getting up to speed with data. 
> Something that would help with the following.

Ooof.  So, the killer in your question is the word 'intuitive'. 
Intuitive for whom?  That depends entirely on you.  There are a 
variety of tutorials, and I point out one below, but I do not know 
if it will be intuitive for you.  Something that is intuitive for 
one person is opaque to the next.

> I have one text file that is 500,000 + records..

There's always somebody who has dealt with bigger files.  500k 
records (at only 300 chars per line)?  That is roughly 150 MB, so 
I'd read it straight into memory and do something with it.  Given 
recent CPUs and amounts of available RAM, the biggest cost I see 
here is disk I/O (and possibly your choice of algorithm).

> I need to read the file, move "structured" data around and then 
> write it to a new file. The txt file has several data elements and 
> is 300 characters per line. I am only interested in the first two 
> fields. The first data element is 19 characters. The second data 
> element is 6 characters. I want to rearrange the data by moving 
> the 6 characters data in front of the 19 characters data and then 
> write the 25 character data to a new file.

Dave Angel gave you some suggestions on how to start.  I'll make 
an attempt at translating his English into a Python block for you. 
Specifically, in Python, he suggested something like this:

   with open('yourfile.txt') as f:
       for line in f:
           first, second = line[:19], line[19:19+6]
           print second, first    # Python-2.x
           #print(second, first)  # Python-3.x

Try the above.  Does that do what you expected?  If not, then have a 
look at substring operations that you can do once you have open()'d 
up a file and have read a line of data into a variable, "line" 
above, which is of type string:

   https://docs.python.org/2/library/stdtypes.html#string-methods
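
If you want the rearranged data in a new file rather than on the
screen, here is a minimal sketch of one way to do it.  The function
names (swap_fields, rewrite_file) and the file names in the usage
note are placeholders I made up; the widths 19 and 6 come from your
description.  It should run unchanged on Python-2.7 and Python-3.x.

```python
def swap_fields(line, first_width=19, second_width=6):
    """Return the 6-character field followed by the 19-character field."""
    first = line[:first_width]
    second = line[first_width:first_width + second_width]
    return second + first


def rewrite_file(src_path, dst_path):
    """Write the rearranged 25 characters of each input line to dst_path."""
    with open(src_path) as src, open(dst_path, 'w') as dst:
        for line in src:
            # Only the first 25 characters are used, so the trailing
            # newline of a 300-character line never reaches the output.
            dst.write(swap_fields(line) + '\n')
```

You would call it as rewrite_file('yourfile.txt', 'newfile.txt').
Lines shorter than 25 characters are not checked here, so they would
simply produce shorter output lines.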

If you can control the serialized format of data on disk, that might help open
up a much richer set of tools for you.  Plain text has the benefit and
liability that it is amazingly flexible.  If you are accustomed to performing
data analytics with SQL and VBA, then here are some tools to examine.  For
people accustomed to working with data analytics in R, the Python pandas
toolkit is a great fit:

   http://pandas.pydata.org/pandas-docs/stable/tutorials.html
   http://pandas.pydata.org/

This sounds much more like strict text-handling than like data 
analytics, though.  Some of us may be able to help you more if you 
have a specific, known format you deal with regularly.  For example, 
Python has modules for handling JSON and CSV (or TSV) out of the 
box:

   https://docs.python.org/2/library/json.html
   https://docs.python.org/2/library/csv.html

Given that many SQL implementations (e.g. MySQL, Postgres, Oracle, 
SQLite) can produce outputs in CSV format, you may find generating 
exported data from your SQL engine of choice and then importing 
using the csv library is easier than parsing a fixed format.
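
As a rough sketch of the csv approach (written for Python-3.x),
assume your export has a header row.  The column names 'account' and
'code' and the sample data are invented for illustration, as is the
function name swap_columns; substitute whatever your export actually
contains.

```python
import csv
import io


def swap_columns(fileobj, first='account', second='code'):
    """Return (second, first) pairs from a CSV file object with a header row."""
    reader = csv.DictReader(fileobj)
    return [(row[second], row[first]) for row in reader]


# A small in-memory stand-in for an exported CSV file (made-up data).
sample = "account,code,other\n1234567890123456789,ABC123,ignored\n"
rows = swap_columns(io.StringIO(sample))
print(rows)
```

With a real file you would pass an open file object instead of the
io.StringIO stand-in.  The csv module takes care of quoting and
embedded commas, which is the main thing you gain over slicing fixed
columns by hand.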

Why did you quote the word "structured"?  It almost seems that you do not like
your (peculiar) fixed-width field format.

If you have a follow-up question, please post a small, relevant snippet of
your Python and your data, and explain what you expected.

Anyway, good luck.  I have found Python to be a fantastic, readable tool that
grew with my sophistication.  I still reach for SQL for many reasons, but I
like the flexibility, richness and tools that Python offers me.

-Martin

P.S. The Python online documentation is pretty good.  Though the
      language changed very little between Python-2.x and Python-3.x,
      there are a few stumbling blocks, so run 'python -V' to make
      sure you are reading the correct documentation:
        https://docs.python.org/2/  # -- Python-2.7.x
        https://docs.python.org/3/  # -- Python-3.4.x

-- 
Martin A. Brown
http://linux-ip.net/

