[Chicago] Chicago Digest, Vol 121, Issue 6
Joshua Herman
zitterbewegung at gmail.com
Tue Sep 8 21:57:45 CEST 2015
Dear Luellen,
Parallel computing, big data, and a discussion of OCaml are beyond the scope
of the talk.
I know about functional programming and OCaml in general. Can you email me
personally about these topics? I would like to help you understand them,
but it would be best to do this off the list.
Sincerely,
Joshua Herman
On Tue, Sep 8, 2015 at 2:55 PM Lewit, Douglas <d-lewit at neiu.edu> wrote:
> Thanks Tanya! Yes, someone at Northeastern told me the three keys to
> success in big data are 1) Python, 2) R and finally 3) Hadoop, which he
> said is really an extension of SQL. I'm sure that next week the three keys
> to success will be something else! Technology is great, but it's changing
> faster than I can keep pace with it. I wonder how many people feel that
> way???
>
> What time does the presentation officially begin on Thursday evening? 6
> pm? What time does it end? Is it structured like a classroom or is it
> more open-ended? Just one group? Or does it get split up into many
> different sub-groups?
>
> I don't know a lot about "parallel computing" other than this is what Java
> programmers call "Threads". I think (not too sure really) that when you
> "thread" an algorithm, you allocate one core for one part of the algorithm
> and then another core for another part of the algorithm, and so on and so
> forth, and then at the end it all has to get magically pieced back
> together. (Sounds like a merge sort problem to me!) What about Python?
> Does Python support this multicore approach or "threading"? I really don't
> know. Jon Harrop, the author of "OCaml for Scientists", told me that the
> OCaml programming language became much less popular in the early 2000s
> because its main developer underestimated the impact of multicore
> technology on modern programming.
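
On the threading question quoted above: CPython does ship a `threading` module, but its global interpreter lock generally keeps CPU-bound threads from running on several cores at once, so multicore work usually goes through `multiprocessing` or `concurrent.futures.ProcessPoolExecutor`. Here is a minimal sketch of the split-then-merge idea, not a tuned implementation. The names `split_into_chunks` and `parallel_sort` are made up for illustration, and the default executor is a thread pool only so that the snippet runs anywhere without extra setup.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, n_chunks):
    """Split data into at most n_chunks contiguous pieces of near-equal size."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sort(data, workers=4, executor_cls=ThreadPoolExecutor):
    """Sort each chunk in its own worker, then merge the sorted chunks."""
    chunks = split_into_chunks(list(data), workers)
    with executor_cls(max_workers=workers) as pool:
        # Each worker sorts one chunk independently of the others.
        sorted_chunks = list(pool.map(sorted, chunks))
    # heapq.merge is the "piece it back together" step of a merge sort.
    return list(heapq.merge(*sorted_chunks))
```

Passing `executor_cls=ProcessPoolExecutor` (from the same module) spreads the chunk sorts over separate processes instead of threads; on platforms that spawn worker processes, that call should sit under an `if __name__ == "__main__":` guard.
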
>
>
> On Tue, Sep 8, 2015 at 8:28 AM, Tanya Schlusser <tanya at tickel.net> wrote:
>
>>
>>
>>> Where is the ChiPy meeting this Thursday evening? I checked the website,
>>> but the location of the meeting had not yet been decided.
>>>
>>
>> It's at Braintree again (8th floor, Merchandise Mart) -- sorry for taking
>> so long to get it up!
>>
>>
>>
>>> When I think of "big data analysis" I think of something like, "Okay, read
>>> all these data from an Excel spreadsheet into a huge Python array or
>>> matrix, and then construct various Q-Q plots to see if the data are
>>> normally distributed, exponentially distributed or something else, and then
>>> determine the parameters of the distribution". In other words, when I hear
>>> "big data" I'm really thinking of a mixture of statistics and computer
>>> programming. Is that correct or is my "definition" a little too narrow?
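
For anyone who wants to try the Q-Q workflow described in the quoted question, here is a rough sketch using only the standard library. It assumes the data have already been loaded into a plain list of floats (the spreadsheet-reading step is left out), it compares just two candidate models, and it summarizes each Q-Q plot by the correlation of its points instead of drawing it. The function names `pearson_r`, `qq_points`, and `compare_fits` are hypothetical.

```python
import math
from statistics import NormalDist, fmean, stdev


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def qq_points(sample, quantile_fn):
    """Pair each sorted observation with the theoretical quantile at its rank."""
    ordered = sorted(sample)
    n = len(ordered)
    probs = [(i + 0.5) / n for i in range(n)]  # plotting positions in (0, 1)
    return [quantile_fn(p) for p in probs], ordered


def compare_fits(sample):
    """Fit normal and exponential models; report parameters and Q-Q correlation."""
    mu, sigma = fmean(sample), stdev(sample)
    rate = 1.0 / mu  # maximum-likelihood rate for an exponential model
    theo_n, obs = qq_points(sample, NormalDist(mu, sigma).inv_cdf)
    theo_e, _ = qq_points(sample, lambda p: -math.log(1.0 - p) / rate)
    return {
        "normal": {"mu": mu, "sigma": sigma, "qq_r": pearson_r(theo_n, obs)},
        "exponential": {"rate": rate, "qq_r": pearson_r(theo_e, obs)},
    }
```

A `qq_r` close to 1 means the Q-Q points lie near a straight line, so the model with the larger value is the better candidate of the two. The pairs returned by `qq_points` can also be handed to any plotting library to draw the actual plot.
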
>>>
>>
>>
>> It's pretty correct. The 'analysis' part is correct -- it's still
>> statistics / machine learning. The 'big data' part really was a catchall
>> phrase for "anything that can't be done right now in a standard database".
>> So 'big data' can mean a handful of things:
>>
>> - Workarounds to key generation because inserts are happening faster
>> than a standard database can deal with them. This was the 'Velocity' part
>> of the big data marketers' advertising campaigns. Twitter's Snowflake
>> <https://blog.twitter.com/2010/announcing-snowflake> is a good
>> example of working around this.
>>
>> - NoSQL (Not Only SQL) -- storing and doing computation over images,
>> MRIs, genetic data, PDFs, entire log files, et cetera... This is the
>> 'Variety' in the big data marketing. Hadoop's Distributed File System and
>> MongoDB are good examples of data stores that can hold these sorts of files.
>>
>> - Parallel computation on a(n inexpensive) cluster because it would
>> take too long or the data would not fit on one computer. This means the
>> algorithms had to be rewritten for parallel execution. This was the
>> 'Volume' part of big data marketing.
>>    - Apache Mahout
>>      <https://mahout.apache.org/users/basics/algorithms.html> (in Java)
>>      was, I think, one of the first open-source implementations of
>>      parallelized machine learning algorithms.
>>    - The hottest thing for this now is the Spark Machine Learning
>>      library <http://spark.apache.org/docs/latest/mllib-guide.html>
>>      (see the PyCon 2015 presentation on Spark with Python
>>      <http://pyvideo.org/video/3407/introduction-to-spark-with-python>).
>>      There is also a Chicago Spark Meetup
>>      <http://www.meetup.com/Chicago-Spark-Users/>.
>>    - And the newcomer Apache Flink, also in Java, bypasses Java's
>>      garbage collection for speed, optimizes SQL queries (unlike Hive), and
>>      claims to provide a truly streaming analytics option without some of the
>>      hangups of Storm. It also has Python bindings
>>      <http://mvnrepository.com/artifact/org.apache.flink/flink-python>.
>>      There is a Chicago Flink meetup
>>      <http://www.meetup.com/Chicago-Apache-Flink-Meetup/> -- I think
>>      it's the 3rd Flink user group in North America.
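
For the key-generation bullet in the list above, here is a rough sketch of the general idea behind Snowflake-style IDs: each worker builds keys locally from a timestamp, its own worker number, and a per-millisecond counter, so no central database has to hand out the next key. The 41/10/12 bit split follows the layout commonly described for Snowflake. The class name, `EPOCH_MS`, and the method names are hypothetical, and this is an illustration, not Twitter's code.

```python
import threading
import time

EPOCH_MS = 1420070400000  # assumed custom epoch: 2015-01-01 00:00:00 UTC


class SnowflakeLikeGenerator:
    """Generate time-ordered 64-bit IDs without a central key authority."""

    WORKER_BITS = 10    # up to 1024 workers
    SEQUENCE_BITS = 12  # up to 4096 IDs per worker per millisecond

    def __init__(self, worker_id):
        if not 0 <= worker_id < (1 << self.WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    @staticmethod
    def _now_ms():
        return int(time.time() * 1000)

    def next_id(self):
        with self._lock:
            now = self._now_ms()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                mask = (1 << self.SEQUENCE_BITS) - 1
                self._sequence = (self._sequence + 1) & mask
                if self._sequence == 0:
                    # Counter exhausted for this millisecond: wait for the next.
                    while now <= self._last_ms:
                        now = self._now_ms()
            else:
                self._sequence = 0
            self._last_ms = now
            # Layout: timestamp (remaining high bits) | worker | sequence.
            shift = self.WORKER_BITS + self.SEQUENCE_BITS
            return (((now - EPOCH_MS) << shift)
                    | (self.worker_id << self.SEQUENCE_BITS)
                    | self._sequence)
```

Because the timestamp sits in the high bits, IDs from one generator sort in creation order, and two workers with different `worker_id` values cannot collide even when they insert in the same millisecond.
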
>>
>>
>> hope it was useful...see you @ Braintree Thursday!
>>
>> _______________________________________________
>> Chicago mailing list
>> Chicago at python.org
>> https://mail.python.org/mailman/listinfo/chicago
>>
>>