[Chicago] Chicago Digest, Vol 121, Issue 6

Tue Sep 8 15:28:03 CEST 2015

> Where is the ChiPy meeting this Thursday evening?  I checked the website,
> but the location of the meeting had not yet been decided.
>

It's at Braintree again (8th floor, Merchandise Mart) -- sorry for taking
so long to get it up!

> When I think of "big data analysis" I think of something like, "Okay, read
> all these data from an Excel spreadsheet into a huge Python array or
> matrix, and then construct various Q-Q plots to see if the data are
> normally distributed, exponentially distributed or something else, and then
> determine the parameters of the distribution".  In other words, when I hear
> "big data" I'm really thinking of a mixture of statistics and computer
> programming.  Is that correct or is my "definition" a little too narrow?
>

That's mostly right. The 'analysis' part is what you describe -- it's still
statistics / machine learning. The 'big data' part was really a catchall
phrase for "anything that can't be done right now in a standard database".
So 'big data' can mean a handful of things:

   - Workarounds for key generation when inserts arrive faster than a
   standard database can handle them. This was the 'Velocity' part of the
   big data marketers' advertising campaigns. Twitter's Snowflake
   <https://blog.twitter.com/2010/announcing-snowflake> is a good example
   of working around this.

   - NoSQL (Not Only SQL) -- storing and doing computation over images,
   MRIs, genetic data, PDFs, entire log files, et cetera. This is the
   'Variety' in the big data marketing. Hadoop's Distributed File System and
   MongoDB are good examples of systems that can store these sorts of files.

   - Parallel computation on a(n inexpensive) cluster because it would take
   too long or the data would not fit on one computer. This means the
   algorithms had to be rewritten for parallel execution. This was the
   'Volume' part of big data marketing.
      - Apache Mahout <https://mahout.apache.org/users/basics/algorithms.html>
      (in Java) was, I think, one of the first open-source implementations of
      parallelized machine learning algorithms.
      - The hottest thing for this now is the Spark Machine Learning
      library <http://spark.apache.org/docs/latest/mllib-guide.html> (see the
      PyCon 2015 presentation on Spark with Python
      <http://pyvideo.org/video/3407/introduction-to-spark-with-python>).
      There is also a Chicago Spark Meetup
      <http://www.meetup.com/Chicago-Spark-Users/>.
      - And the newcomer Apache Flink, also in Java, bypasses Java's
      garbage collection for speed, optimizes SQL queries (unlike Hive), and
      claims to provide a truly streaming analytics option without some of the
      hangups of Storm. It also has Python bindings
      <http://mvnrepository.com/artifact/org.apache.flink/flink-python>.
      There is a Chicago Flink meetup
      <http://www.meetup.com/Chicago-Apache-Flink-Meetup/> -- I think it's
      the 3rd Flink user group in North America.
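
To make "rewritten for parallel execution" a bit more concrete, here is a
rough sketch in plain Python (not Mahout, Spark or Flink code). It computes a
mean and sample variance by summarizing each chunk of data independently and
then merging the summaries, which is roughly the shape a cluster job takes.
The function names and the map_fn parameter are made up for illustration, and
it assumes every chunk is non-empty and there are at least two values overall.

```python
from functools import reduce


def partial_stats(chunk):
    """Map step: summarize one chunk as (count, mean, sum of squared deviations)."""
    n = len(chunk)  # assumes the chunk is non-empty
    mean = sum(chunk) / n
    m2 = sum((x - mean) ** 2 for x in chunk)
    return n, mean, m2


def merge_stats(a, b):
    """Reduce step: combine two summaries without touching the raw data again."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2


def mean_and_variance(chunks, map_fn=map):
    """Summarize every chunk (possibly in parallel), then merge the summaries."""
    n, mean, m2 = reduce(merge_stats, map_fn(partial_stats, chunks))
    return mean, m2 / (n - 1)  # sample variance
```

Calling mean_and_variance([[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]])
runs serially with the built-in map. Since each chunk is summarized on its
own, you could instead pass the map method of a multiprocessing.Pool as
map_fn to spread the chunks over local cores; the cluster frameworks above
do the same kind of thing across machines, with a lot more machinery.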

Hope this was useful... see you @ Braintree Thursday!