Monday, March 17, 2014

NoSQL vs Hadoop: Demystifying the Dilemma


The rapid growth of Big Data has led to a large number of innovations in techniques for handling the volume, velocity, variety and veracity of data, the classic four pillars of Big Data. The Big Data ecosystem, though still evolving, has spawned a huge number of products and platforms. Hadoop is one of the most popular of these platforms. The term NoSQL refers to a class of data storage platforms that are not queried using SQL and do not necessarily store data in tables or strictly follow relational algebra. The Hadoop ecosystem also contains NoSQL stores.


(Image copyright: IBM)

What is Hadoop
Hadoop is primarily a distributed file system, the Hadoop Distributed File System (HDFS), combined with the MapReduce paradigm for parallel processing of huge data sets. HDFS allows data to be stored and processed in the same place, and it provides out-of-the-box controls for preventing data corruption, for security, and for autonomic (self-managing) capabilities. Several products are built on top of HDFS, each taking advantage of HDFS and MapReduce but specialized for a different activity; these are called the Hadoop ecosystem components. Some of the most popular are Hive, Pig, Sqoop, HBase, Avro and Flume.
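To make the MapReduce idea a little more concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop code and does not use any Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration. It only shows the shape of the paradigm: a map step emits key/value pairs, the pairs are grouped by key, and a reduce step combines each group.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    # Each string stands in for a block of data held on a different node.
    blocks = ["big data big cluster", "data node data"]
    print(word_count(blocks))
```

On a real cluster the map and reduce steps would run in parallel on the nodes that hold the data; here everything runs in one process purely to show the flow.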

(Image copyright: Hortonworks)


What is NoSQL
NoSQL refers to a class of data stores that do not use SQL to query the data, do not follow relational algebra, and do not necessarily use tables to store the data. NoSQL data stores offer freedom from relational data stores and their complexity. Put simply, NoSQL stores are designed to handle the requirements of Big Data in a much simpler manner. The Hadoop ecosystem also includes a NoSQL store, HBase. NoSQL stores are also distributed in nature and may or may not use HDFS for their functionality. There are various types of NoSQL stores currently in production; the most popular to date include MongoDB, Cassandra, Riak, Redis and Neo4j.
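As a rough illustration of what "no tables, no SQL" can look like, here is a toy in-memory document store in Python. The DocumentStore class and its insert and find methods are hypothetical and do not mirror the API of MongoDB or any other product named above. The point is only that records need not share a fixed schema and that lookups are made by matching fields rather than by writing SQL.

```python
class DocumentStore:
    """Toy in-memory document store; hypothetical, not a real product's API."""

    def __init__(self):
        self._docs = {}
        self._next_id = 1

    def insert(self, doc):
        """Store a copy of the document and return its generated id."""
        doc_id = self._next_id
        self._docs[doc_id] = dict(doc)
        self._next_id += 1
        return doc_id

    def find(self, **criteria):
        """Return every document whose fields match all the given criteria."""
        return [
            doc
            for doc in self._docs.values()
            if all(doc.get(field) == value for field, value in criteria.items())
        ]


store = DocumentStore()
# The documents do not share a schema: each carries only the fields it needs.
store.insert({"name": "Asha", "city": "Pune"})
store.insert({"name": "Ravi", "city": "Pune", "tags": ["admin"]})
store.insert({"name": "Lena", "email": "lena@example.com"})

pune_users = store.find(city="Pune")
```

A production NoSQL store adds distribution, persistence and indexing on top of this idea; none of that is modeled here.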


(Image copyright: O'Reilly)

Demystifying Hadoop and NoSQL
The short introductions to Hadoop and NoSQL above should have made it clear that they are not the same thing, although each can take advantage of the other. Most importantly, it is better to go with a NoSQL-based approach when there is no need for MapReduce, little requirement to set up a Hadoop cluster, or a need for a less expensive solution. This link provides details on why and when to use NoSQL.

1 comment:

  1. Hi Raj,

    Well written article that clarifies the applicability of SQL and No-SQL in a lucid manner specially in a complex world of Big Data. This has helped me get an understanding of application usability of No-SQL. Thank you, Tushar
