[Tutor] Suggestions Please (Martin A. Brown)
Dino Bektešević
ljetibo at gmail.com
Tue Oct 7 13:30:58 CEST 2014
Hey,
You've already received most of your answers. I'd just like to share
a short story of mine concerning big data and Python.
I've done computer vision on images roughly 15 MB each. There are 6
filters, each of which can reach 7 TB in total size. I ran detection
algorithms on 2 filters at a time, which totalled around 14 TB. I also
had to read a lot of "side data" in from SQL databases.
I chose Python for ease of writing and its
force-you-to-write-pretty-or-die syntax. I also chose it because it's
more easily portable than most other languages (some of which depend
even on the third digit of the compiler version!).
I did all per-pixel operations with wrapped C code. That means I only
wrote the loops from 0 to the width/height of the image, with minimal
calculations, in C, because numerics there are just inherently faster
than loops in Python.
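Just to illustrate the pattern (a minimal sketch only: the shared
library, function name and signature below are made up; my actual
wrapping went through SWIG and NumPy, mentioned further down), pushing
the per-pixel loop into C can look roughly like this:

    import ctypes
    import numpy as np

    # hypothetical C routine compiled into libpixel_ops.so:
    #   void threshold(unsigned char *img, int n, unsigned char cut) {
    #       for (int i = 0; i < n; i++) img[i] = img[i] > cut ? 255 : 0;
    #   }
    lib = ctypes.CDLL("./libpixel_ops.so")
    lib.threshold.argtypes = [ctypes.POINTER(ctypes.c_ubyte),
                              ctypes.c_int, ctypes.c_ubyte]
    lib.threshold.restype = None

    img = np.random.randint(0, 256, (1024, 1024)).astype(np.uint8)
    ptr = img.ctypes.data_as(ctypes.POINTER(ctypes.c_ubyte))
    lib.threshold(ptr, img.size, 128)   # the per-pixel loop runs in C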
With the new Python 3 versions a lot has been improved in that area,
and you no longer have to think about the implementation differences
behind things like zip/izip, range/xrange or read/readline/readlines,
for example.
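For instance, in Python 3 the built-ins are lazy out of the box, so
you get the old xrange/izip behaviour without asking for it:

    # nothing below allocates a billion-element list
    nums = range(10**9)          # a range object, not a list of ints
    pairs = zip(nums, nums)      # a zip object, items made on demand
    print(next(pairs))           # (0, 0) -- only one pair was built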
Output files were easily hitting 10 GB in size. I used various
generator expressions and iterators, which are heavily supported and
encouraged in Python 3 (unlike Python 2), and I never ran into space
problems. I never had issues with RAM either (swapping comes to
mind).
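A minimal sketch of the kind of thing I mean (the file names are made
up): chaining generator expressions so the whole pipeline only ever
holds one line in memory at a time:

    # stream a big result file; no list of lines is ever built
    with open("detections_raw.txt") as src, \
            open("detections_clean.txt", "w") as dst:
        stripped = (line.strip() for line in src)      # lazy
        kept = (line for line in stripped if line)     # still lazy
        dst.writelines(line + "\n" for line in kept)   # consumed here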
A short clarification: generators are functions that behave like
iterators. That means that for a generator reading a file, you only
allocate memory the size of that one single line, but you are still
able to perform calculations on it. These expressions are hardly ever
longer than a line of code. Even better, today you rarely have to
write them yourself; they've already been written for you and hidden
in the underbelly of the Python implementations.
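For example, an open file object is already a lazy iterator, and a
small generator on top of one stays just as lazy (the file name and
column layout here are invented):

    def field_values(path, column=0):
        # yields one value per line; only the current line is in memory
        with open(path) as fh:
            for line in fh:
                yield float(line.split()[column])

    # e.g. sum(field_values("huge_table.txt")) never loads the whole file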
Combine the amazing interoperability Python offers with other
programs (it's relatively easy to wrap and call outside programs
written in different languages, execute them and catch their return),
how easy it is to communicate with the OS in question (hardly any
worries if you use Python's own os functions rather than the
subprocess module), and how clean that code looks (compared to, say,
ever having to pass a function as an object in C, or writing C
documentation), and you're fairly well armed to combat anything.
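A quick sketch of both ideas (the "outside program" here is just
another Python interpreter, as a portable stand-in for whatever tool
you actually need to drive):

    import os
    import subprocess
    import sys

    # run an outside program and catch what it prints
    out = subprocess.check_output([sys.executable, "-c",
                                   "print('filter 1 done')"])
    print(out.decode().strip())

    # plain OS housekeeping through Python's own functions, no shell
    os.makedirs("results", exist_ok=True)
    print(sorted(os.listdir("results")))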
I've had my code run on Windows, Ubuntu and SUSE, I've run it on
clusters (hybrids/IBMs), and I've never once regretted choosing to
write my code in Python.
Now that we hopefully have the motivation question out of the way:
there's only one way to learn anything. Set aside some free time and
start coding; if you like it, it's good for you. If you don't, move
on. One thing you will just have to learn when doing Python is how to
learn really fast. All these things have pretty much been pre-written
for you, but you have to find out where (numpy/scipy/swig in my case),
and you're going to have to keep track of modules; after all, that's
where Python really shines.
I recommend the books "Learn Python the Hard Way" and "Dive Into Python".
Also, as a disclaimer: I've run over a lot of things quickly and
really bludgeoned some of them, so please don't hold the half-assed
explanations against me.
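And since your concrete task below is exactly the kind of thing lazy
file iteration is good at, here is a minimal sketch (file names made
up, fixed-width fields as you describe: a 19-character field followed
by a 6-character one):

    # swap the first two fixed-width fields, write 25-character records
    with open("input.txt") as src, open("output.txt", "w") as dst:
        for line in src:            # lazy: one 300-character line at a time
            first = line[:19]       # 19-character field
            second = line[19:25]    # 6-character field
            dst.write(second + first + "\n")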
> To: Phillip Pugh <pughpl at me.com>
> Cc: Python Tutor Mailing List <tutor at python.org>
> Subject: Re: [Tutor] Suggestions Please
> Message-ID: <alpine.LNX.2.00.1410061956320.1349 at dagger.wonderfrog.net>
> Content-Type: TEXT/PLAIN; format=flowed; charset=US-ASCII
>
> I am trying to decide if Python is the right toolset for me. I do a
> lot of data analytics. Over the years I have used a lot of SQL and
> VBA, but the data sources are getting bigger. I am thinking Python
> may be what I need to use, but I am in the early stages of getting
> to know Python. Can you point me to a really good, intuitive
> resource for getting up to speed with data. Something that would
> help with the following.
>
> I have one text file that is 500,000+ records. I need to read the
> file, move "structured" data around and then write it to a new
> file. The txt file has several data elements and is 300 characters
> per line. I am only interested in the first two fields. The first
> data element is 19 characters. The second data element is 6
> characters. I want to rearrange the data by moving the 6-character
> data in front of the 19-character data and then write the
> 25-character data to a new file.
>
> I have spent some time digging for the correct resource. However,
> being new to Python and the syntax for the language makes it slow
> going. I would like to see if I can speed up the learning curve.
> Hopefully I can get some results quickly, then I will know how deep
> I want to go into Python.
>
>Thanks in advance
>
>Phillip