[BangPypers] BangPypers meeting February 2011

Sun Feb 13 08:49:58 CET 2011

On Sun, Feb 13, 2011 at 8:33 AM, Ramdas S <ramdaz at gmail.com> wrote:

>
>
> Thought I will pick your brains on this.
>
> We are archiving a lot of information, in a message format very similar to
> email in structure, though it's not an RFC-compliant format. Presently we
> are storing some basic searchable details in a database, and the physical
> file is in a SAN box, with the location of the file also in the database. It's
> fine now, but we are expecting the client to generate a few TB of
> information over the next 2 years.
>
> Does this make a good case for using NoSQL? Also, I remember someone saying
> that NoSQL stuff like MongoDB does miss a document once in a while.
>

Wow, interesting discussion. I missed the crux of it, I believe. Anyway,
here are my 2 cents on this.

Document storage, query and retrieval always have 3 aspects: data
consistency, data availability and data distribution (partitioning).

Traditional RDBMS (aka SQL) databases focus mainly on the
first one, i.e. data consistency; hence we got terms like 'ACID
compliant'. On top of such structured data, which always
promises you consistency, they provide a structured query language,
aka SQL. In this world, availability and partitioning are always
add-ons, a painful process the DBA has to perform with
DB replication, mirroring, modifying the schema for partitioning,
clustered DBs etc.

The new generation of DBs instead chose to focus on the latter two,
i.e. availability and distribution, while relaxing the guarantees on
the consistency part. They are natural evolutions of the data grid
or cloud architecture, where data is massively striped and scaled
out to multiple nodes in multiple data centers, thereby lowering
your data retrieval latency, potentially to the scale of microseconds.
Hence they are a natural fit if you don't mind some inconsistency in
the data retrieved by, say, 2 clients in different geographical
locations at the same time, and you are more concerned about how
quickly the data is stored and retrieved. These DBs also choose not to
use SQL, hence the term "NoSQL". They don't need SQL since the focus
is not so much on queries that span and join across multiple tables
as on a fast fetch, given a key.
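
To make the "some inconsistency between 2 clients" point concrete,
here is a toy sketch in plain Python. It is not how any particular
NoSQL product works; the Replica and Cluster classes, the replica
names and the key "msg:42" are all made up for illustration. A write
is acknowledged by one node and copied to the other only later, so two
clients can briefly see different answers for the same key.

```python
# Toy model of two replicas with delayed replication (not a real database).
# All class names, replica names and keys here are hypothetical.


class Replica:
    """One node holding its own copy of the key-value data."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def get(self, key):
        # Fast fetch by key: a single dictionary lookup, no joins.
        return self.data.get(key)


class Cluster:
    """Accepts a write on one replica and copies it to the others later."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.pending = []  # writes not yet copied to the other replicas

    def put(self, replica, key, value):
        # The write is acknowledged as soon as one replica has it.
        replica.data[key] = value
        self.pending.append((replica, key, value))

    def replicate(self):
        # Background step: push pending writes to every other replica.
        for source, key, value in self.pending:
            for other in self.replicas:
                if other is not source:
                    other.data[key] = value
        self.pending.clear()


bangalore = Replica("bangalore")
london = Replica("london")
cluster = Cluster([bangalore, london])

cluster.put(bangalore, "msg:42", "hello")
# Before replication the two clients see different results.
before = (bangalore.get("msg:42"), london.get("msg:42"))
cluster.replicate()
# After replication both replicas agree.
after = (bangalore.get("msg:42"), london.get("msg:42"))

print(before)
print(after)
```

The write returns quickly because it does not wait for the remote
node; the price is the window in which the London client reads
nothing for a key that already exists in Bangalore.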

There are some problems which fit the NoSQL world and some
which fit the SQL one. If you are a bank, you won't dare to dream
of not having data consistency, since data correctness and atomic
transactions are so essential in the financial world. But
a Twitter or a Google can live with some minor inconsistencies;
they need fast response times, so map/reduce and NoSQL
DBs are a natural fit there.
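
As a small illustration of what "atomic transactions" buy the bank,
here is a sketch using the sqlite3 module from the Python standard
library. The accounts table, the account names and the transfer()
helper are hypothetical. The point is only that when the second
update fails, the first one is rolled back too.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint forbids a negative balance (hypothetical schema).
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; either both updates apply or neither."""
    try:
        # Used as a context manager, the connection commits on success
        # and rolls back if an exception is raised inside the block.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit violated the CHECK constraint; the credit is undone too.
        return False


ok = transfer(conn, "alice", "bob", 30)       # enough funds
failed = transfer(conn, "alice", "bob", 500)  # not enough funds
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

The credit is deliberately done before the debit here, so the failed
transfer shows the rollback undoing work that had already succeeded.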

So an approximate rule of thumb would be:

1. If your data is highly structured, you have complex queries,
and your clients expect consistent results, stick to an RDBMS.
2. If your data is more like a simple key-value store and you
are more worried about query/response times than about the
consistency of the data, perhaps a document store (NoSQL)
design is the correct one.
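
The two kinds of access in the rule of thumb look roughly like this
in Python. This is only a sketch under my own assumptions: the
senders/messages schema, the example.com addresses and the body keys
are invented, sqlite3 stands in for the RDBMS, and a plain dict
stands in for a key-value store.

```python
import sqlite3

# Case 1: structured data and a query that joins across tables.
db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE senders (id INTEGER PRIMARY KEY, address TEXT);
    CREATE TABLE messages (
        id INTEGER PRIMARY KEY,
        sender_id INTEGER REFERENCES senders(id),
        subject TEXT,
        body_key TEXT
    );
    INSERT INTO senders VALUES (1, 'a@example.com'), (2, 'b@example.com');
    INSERT INTO messages VALUES
        (1, 1, 'invoice', 'body:1'),
        (2, 2, 'minutes', 'body:2'),
        (3, 1, 'reminder', 'body:3');
    """
)

rows = db.execute(
    "SELECT m.subject, m.body_key "
    "FROM messages m JOIN senders s ON s.id = m.sender_id "
    "WHERE s.address = ? ORDER BY m.id",
    ("a@example.com",),
).fetchall()

# Case 2: opaque values fetched by key, no query language involved.
# A dict stands in for the key-value store here.
bodies = {
    "body:1": "text of the invoice message",
    "body:2": "text of the minutes message",
    "body:3": "text of the reminder message",
}
fetched = [bodies[key] for _subject, key in rows]

print(rows)
print(fetched)
```

The join needs the engine to understand the structure of the data;
the fetch by key needs nothing more than the key, which is why it is
so much easier to spread across many nodes.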

To me, these two worlds are complementary to each other.
I don't believe in the so-called 'SQL' vs 'NoSQL' wars. That is
simply media hype.

--Anand