[Chicago] Chicago Digest, Vol 121, Issue 6

Tue Sep 8 21:57:45 CEST 2015

Dear Luellen,
Parallel computing, big data, and a discussion of OCaml are beyond the scope
of the talk.
I know about functional programming and OCaml in general. Can you email me
personally about these topics? I would like to help you understand them,
but it would be best to do this off the list.
Sincerely,
Joshua Herman
On Tue, Sep 8, 2015 at 2:55 PM Lewit, Douglas <d-lewit at neiu.edu> wrote:

> Thanks Tanya!  Yes, someone at Northeastern told me the three keys to
> success in big data are 1) Python, 2) R and finally 3) Hadoop, which he
> said is really an extension of SQL.  I'm sure that next week the three keys
> to success will be something else!  Technology is great, but it's changing
> faster than I can keep pace with it.  I wonder how many people feel that
> way???
>
> What time does the presentation officially begin on Thursday evening?  6
> pm?  What time does it end?  Is it structured like a classroom or is it
> more open-ended?  Just one group?  Or does it get split up into many
> different sub-groups?
>
> I don't know a lot about "parallel computing" other than that it is what Java
> programmers call "Threads".  I think (not too sure really) that when you
> "thread" an algorithm, you allocate one core for one part of the algorithm
> and then another core for another part of the algorithm, and so on and so
> forth, and then at the end it all has to get magically pieced back
> together.  ( Sounds like a merge sort problem to me! )  What about Python?
> Does Python support this multicore approach or "threading"?  I really don't
> know.  Jon Harrop, the author of "OCaml for Scientists", told me that the
> OCaml programming language became much less popular in the early 2000s
> because its main developer underestimated the impact of multicore
> technology on modern programming.
>
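On the question of whether Python supports the multicore approach: in
standard CPython, threads share one interpreter lock, so CPU-bound work is
usually spread across cores with separate processes instead. Below is a
minimal sketch of the split, compute, and merge pattern described above,
using the standard library's concurrent.futures. The function names (split,
sum_of_squares, parallel_sum_of_squares) and the choice of workload are made
up for illustration.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def sum_of_squares(chunk):
    """The per-chunk work; each call can run in a different worker."""
    return sum(x * x for x in chunk)


def split(data, n_parts):
    """Cut data into at most n_parts contiguous chunks of similar size."""
    size = max(1, -(-len(data) // n_parts))  # ceiling division, never zero
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sum_of_squares(data, n_parts=4, executor_cls=ProcessPoolExecutor):
    """Split the data, compute partial results in workers, then merge them."""
    chunks = split(data, n_parts)
    with executor_cls(max_workers=n_parts) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    return sum(partials)  # the "piece it back together" step


if __name__ == "__main__":
    data = list(range(100_000))
    # Both calls give the same answer; only the process pool can use several
    # cores for CPU-bound pure-Python work in standard CPython.
    print(parallel_sum_of_squares(data, executor_cls=ThreadPoolExecutor))
    print(parallel_sum_of_squares(data))
```

The `if __name__ == "__main__"` guard matters for the process pool: worker
processes may re-import the script, and the guard keeps them from starting
pools of their own.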
>
> On Tue, Sep 8, 2015 at 8:28 AM, Tanya Schlusser <tanya at tickel.net> wrote:
>
>>
>>
>>> Where is the ChiPy meeting this Thursday evening?  I checked the website,
>>> but the location of the meeting had not yet been decided.
>>>
>>
>> It's at Braintree again (8th floor, Merchandise Mart) -- sorry for taking
>> so long to get it up!
>>
>>
>>
>>> When I think of "big data analysis" I think of something like, "Okay, read
>>> all these data from an Excel spreadsheet into a huge Python array or
>>> matrix, and then construct various Q-Q plots to see if the data are
>>> normally distributed, exponentially distributed or something else, and then
>>> determine the parameters of the distribution".  In other words, when I hear
>>> "big data" I'm really thinking of a mixture of statistics and computer
>>> programming.  Is that correct or is my "definition" a little too narrow?
>>>
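A rough sketch of the Q-Q idea described above, using only the standard
library (statistics.NormalDist needs Python 3.8 or later). Rather than
drawing the plot, it computes the Q-Q pairs against a standard normal and
their correlation; a value close to 1 means the points fall near a straight
line. The helper names and the simulated data are assumptions for
illustration, not part of the original question.

```python
import random
from statistics import NormalDist, mean, stdev


def qq_points(sample):
    """Pair each sorted observation with the matching standard-normal quantile."""
    observed = sorted(sample)
    n = len(observed)
    std_normal = NormalDist()
    # Plotting positions (i + 0.5) / n stay strictly between 0 and 1.
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, observed))


def qq_correlation(points):
    """Pearson correlation of the Q-Q pairs; near 1.0 means nearly a line."""
    mean_t = mean(t for t, _ in points)
    mean_o = mean(o for _, o in points)
    cov = sum((t - mean_t) * (o - mean_o) for t, o in points)
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    var_o = sum((o - mean_o) ** 2 for _, o in points)
    return cov / (var_t * var_o) ** 0.5


# Simulated stand-ins for a spreadsheet column (seeded for repeatability).
rng = random.Random(0)
normal_data = [rng.gauss(10.0, 2.0) for _ in range(500)]
expo_data = [rng.expovariate(0.5) for _ in range(500)]

r_normal = qq_correlation(qq_points(normal_data))
r_expo = qq_correlation(qq_points(expo_data))

# For data that look normal, the sample mean and standard deviation are the
# usual estimates of the distribution's parameters.
mu_hat, sigma_hat = mean(normal_data), stdev(normal_data)

print(f"Q-Q correlation, normal sample:      {r_normal:.4f}")
print(f"Q-Q correlation, exponential sample: {r_expo:.4f}")
print(f"estimated mean={mu_hat:.2f}, std dev={sigma_hat:.2f}")
```

The exponential sample should score visibly lower than the normal one,
because its Q-Q points bend away from a straight line.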
>>
>>
>> It's pretty correct. The 'analysis' part is correct -- it's still
>> statistics / machine learning. The 'big data' part really was a catchall
>> phrase for "anything that can't be done right now in a standard database".
>> So 'big data' can mean a handful of things:
>>
>>    - Workarounds to key generation because inserts are happening faster
>>    than a standard database can deal with them. This was the 'Velocity' part
>>    of the big data marketers' advertising campaigns. Twitter's Snowflake
>>    <https://blog.twitter.com/2010/announcing-snowflake> is a good
>>    example of working around this.
>>
>>    - NoSQL (Not Only SQL) -- storing and doing computation over images,
>>    MRIs, genetic data, PDFs, entire log files, et cetera... This is the
>>    'Variety' in the big data marketing. Hadoop's Distributed File System and
>>    MongoDB are good examples of data stores that can hold these sorts of files.
>>
>>    - Parallel computation on a(n inexpensive) cluster because it would
>>    take too long or the data would not fit on one computer. This means the
>>    algorithms had to be rewritten for parallel execution. This was the
>>    'Volume' part of big data marketing.
>>       - Apache Mahout
>>       <https://mahout.apache.org/users/basics/algorithms.html> (in Java)
>>       was, I think, one of the first open-source implementations of
>>       parallelized machine learning algorithms.
>>       - The hottest thing for this now is the Spark Machine Learning
>>       library <http://spark.apache.org/docs/latest/mllib-guide.html>
>>       (see the PyCon 2015 presentation on Spark with Python
>>       <http://pyvideo.org/video/3407/introduction-to-spark-with-python>).
>>       There is also a Chicago Spark Meetup
>>       <http://www.meetup.com/Chicago-Spark-Users/>.
>>       - And the newcomer Apache Flink, also in Java, bypasses Java's
>>       garbage collection for speed, optimizes SQL queries (unlike Hive), and
>>       claims to provide a truly streaming analytics option without some of the
>>       hangups of Storm. It also has Python bindings
>>       <http://mvnrepository.com/artifact/org.apache.flink/flink-python>.
>>       There is a Chicago Flink meetup
>>       <http://www.meetup.com/Chicago-Apache-Flink-Meetup/> -- I think
>>       it's the 3rd Flink user group in North America.
>>
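To make the key-generation workaround in the first bullet concrete, here is
a minimal sketch of a Snowflake-style ID generator: each worker builds IDs
locally from a timestamp, its own worker number, and a per-millisecond
counter, so no central database has to hand out keys. The bit widths (10-bit
worker, 12-bit sequence) follow the commonly described Snowflake layout; the
epoch, class name, and clock parameter are assumptions for illustration and
not Twitter's actual code.

```python
import threading
import time

# Arbitrary custom epoch (2015-01-01 UTC, in milliseconds); an assumption.
EPOCH_MS = 1420070400000


class SnowflakeLikeGenerator:
    """Hypothetical generator of roughly time-ordered 64-bit IDs."""

    WORKER_BITS = 10
    SEQUENCE_BITS = 12

    def __init__(self, worker_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id < (1 << self.WORKER_BITS):
            raise ValueError("worker_id does not fit in WORKER_BITS")
        self.worker_id = worker_id
        self._clock = clock  # injectable so the layout can be tested
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                # Same millisecond: bump the counter, wrapping at 2**12.
                mask = (1 << self.SEQUENCE_BITS) - 1
                self._sequence = (self._sequence + 1) & mask
                if self._sequence == 0:
                    # Counter exhausted; spin until the next millisecond.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            shift = self.WORKER_BITS + self.SEQUENCE_BITS
            return (
                ((now - EPOCH_MS) << shift)
                | (self.worker_id << self.SEQUENCE_BITS)
                | self._sequence
            )


gen = SnowflakeLikeGenerator(worker_id=7)
ids = [gen.next_id() for _ in range(10_000)]
print(len(set(ids)) == len(ids))  # every ID is distinct
```

Because the timestamp sits in the high bits, IDs from one worker sort in the
order they were created, and two workers with different worker numbers can
never collide.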
>>
>> hope it was useful...see you @ Braintree Thursday!
>>
>> _______________________________________________
>> Chicago mailing list
>> Chicago at python.org
>> https://mail.python.org/mailman/listinfo/chicago
>>
>>