[Tutor] Suggestions Please
Martin A. Brown
martin at linux-ip.net
Tue Oct 7 06:17:15 CEST 2014
Greetings Phillip,
> I am trying to decide if Python is the right toolset for me.
It might be. That depends on you and also the environment in which
you operate.
> I do a lot of data analytics. Over the years I have used a lot of
> SQL and VBA, but the data sources are getting bigger.
Data sets have a very annoying habit of growing.
> I am thinking Python may be what I need to use, but I am in the
> early stages of getting to know Python. Can you point me to a
> really good, intuitive resource for getting up to speed with data.
> Something that would help with the following.
Ooof. So, the killer in your question is the word 'intuitive'.
Intuitive for whom? That depends entirely on you. There are a
variety of tutorials, and I point out one below, but I do not know
if it will be intuitive for you. Something that is intuitive for
one person is opaque to the next.
> I have one text file that is 500,000 + records..
There's always somebody who has dealt with bigger files. 500k
records (at only 300 chars per line)? I'd read that straight into
memory and do something with it. Given recent CPUs and amounts of
available RAM, the biggest cost I see here is disk (and possibly
algorithm).
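As a rough sketch of what "straight into memory" could look like (written for Python 3; the function name load_records is only illustrative, not something from your code): 500,000 lines of 300 characters is about 150 MB of raw text, and the in-memory list will be somewhat larger because each string carries some overhead.

```python
def load_records(path):
    """Read every line of the file into a list, dropping the line endings."""
    with open(path) as f:
        return [line.rstrip("\n") for line in f]


# Back-of-the-envelope size of the raw text described above.
records = 500000
chars_per_line = 300
approx_megabytes = records * chars_per_line / 1e6
```

If that turns out to be too much for the machine at hand, looping over the file one line at a time avoids holding everything at once.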
> I need to read the file, move "structured" data around and then
> write it to a new file. The txt file has several data elements and
> is 300 characters per line. I am only interested in the first two
> fields. The first data element is 19 characters. The second data
> element is 6 characters. I want to rearrange the data by moving
> the 6 characters data in front of the 19 characters data and then
> write the 25 character data to a new file.
Dave Angel gave you some suggestions on how to start. I'll make
an attempt at translating his English into a Python block for you.
Specifically, in Python, he suggested something like this:
  with open('yourfile.txt') as f:
      for line in f:
          first, second = line[:19], line[19:19+6]
          print second, first  # Python-2.x
          #print(second, first)  # Python-3.x
Try the above. Does that do what you expected? If not, then have a
look at substring operations that you can do once you have open()'d
up a file and have read a line of data into a variable, "line"
above, which is of type string:
https://docs.python.org/2/library/stdtypes.html#string-methods
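One detail worth knowing: print with a comma puts a space between the two fields, so each printed line is 26 characters rather than 25. A minimal sketch that concatenates the fields and writes them to a new file might look like the following. The function names and the file names passed in are hypothetical; the widths 19 and 6 come from the description of the data.

```python
def rearrange(line, first_width=19, second_width=6):
    """Return the second field followed by the first, with no separator."""
    first = line[:first_width]
    second = line[first_width:first_width + second_width]
    return second + first


def rewrite_file(src_path, dst_path):
    """Write one rearranged 25-character record per input line."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(rearrange(line) + "\n")
```

Calling rewrite_file('yourfile.txt', 'newfile.txt') would then produce the new file. Lines shorter than 25 characters are not checked here, so real data may need a length test first.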
If you can control the serialized format of data on disk, that might help open
up a much richer set of tools for you. Plain text has the benefit and
liability that it is amazingly flexible. If you are accustomed to performing
data analytics with SQL and VBA, then here are some tools to examine. For
people accustomed to working with data analytics in R, the Python pandas
toolkit is a great fit:
http://pandas.pydata.org/pandas-docs/stable/tutorials.html
http://pandas.pydata.org/
This sounds much more like strict text-handling than like data
analytics, though. Some of us may be able to help you more if you
have a specific known-format you deal with regularly. For example,
Python has modules for handling JSON and CSV (or TSV) out of the
box:
https://docs.python.org/2/library/json.html
https://docs.python.org/2/library/csv.html
Given that many SQL implementations (e.g. MySQL, Postgres, Oracle,
SQLite) can produce outputs in CSV format, you may find generating
exported data from your SQL engine of choice and then importing
using the csv library is easier than parsing a fixed format.
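To give a feel for the csv route, here is a small sketch for Python 3. The column names and the sample row are invented for illustration, and io.StringIO stands in for a file exported from the SQL engine.

```python
import csv
import io


def swap_first_two(rows):
    """Yield each row with its first two columns exchanged."""
    for row in rows:
        yield [row[1], row[0]] + row[2:]


# Made-up sample data; a real export would be opened with open().
sample = io.StringIO("account,code,amount\n1234567890123456789,ABC123,10.50\n")
reader = csv.reader(sample)
header = next(reader)            # first row holds the column names
swapped = list(swap_first_two(reader))
```

The csv module takes care of quoting and embedded commas, which is the part that tends to go wrong when splitting lines by hand.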
Why did you quote the word "structured"? It almost seems that you do not like
your (peculiar) fixed-width field format.
If you have a follow-up question, please post a small, relevant snippet of the
Python and the data, and explain what you expected.
Anyway, good luck. I have found Python to be a fantastic, readable tool that
grew with my sophistication. I still reach for SQL for many reasons, but I
like the flexibility, richness and tools that Python offers me.
-Martin
P.S. The Python online documentation is pretty good. Though the
language changed very little between Python-2.x and Python-3.x,
there are a few stumbling blocks, so run 'python -V' to make
sure you are reading the correct documentation:
https://docs.python.org/2/ # -- Python-2.7.x
https://docs.python.org/3/ # -- Python-3.4.x
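The same check can be made from inside the interpreter. This is only a sketch, and docs_url is a made-up helper name; sys.version_info is the standard way to read the running version.

```python
import sys


def docs_url(version_info=sys.version_info):
    """Return the documentation URL matching the major version number."""
    return "https://docs.python.org/%d/" % version_info[0]


print(docs_url())
```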
--
Martin A. Brown
http://linux-ip.net/