Everything Thats Big Data: August 2012

Monday, August 27, 2012

The Curious Case of Polyglot Persistence in NOSQL Data Stores

Polyglot persistence means storing data across multiple data stores. The idea has been around for a few years now, but what is interesting is mapping it onto the big data space, and in particular onto the various NOSQL stores. Let's take a few moments to understand this.

NOSQL stores are normally classified into four main categories:

Key-Value Stores (Riak, Redis)
Column-Oriented Stores (HBase, Cassandra)
Document-Oriented Stores (MongoDB, CouchDB)
Graph-Oriented Stores (Neo4j)

Each of these data stores has specific use cases where one would choose it over the others. For example, someone who needs really fast calculations, where performance is the priority, would normally go for a key-value store, because its simple data structure supports high performance. Someone else would choose a graph database when the problem involves many recursive decisions, such as following relationships to an arbitrary depth.

A typical large enterprise may have several use cases that together call for more than one type of NOSQL data store. For example, a retail enterprise may need a graph database to store the relationships between its customers, a key-value store for near-real-time calculations, a column store for clickstream analysis, and so on. What is critical is that there be some way to store the data, or even a single piece of data through its various modifications and transformations, across all of these data stores. The ability to do this is what polyglot persistence means in the big data context.
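To make the retail example concrete, here is a minimal sketch of a single purchase event being transformed into the shape each kind of store expects. The event fields, the key formats, and the in-memory dictionaries standing in for the graph, key-value, and column stores are all assumptions made for illustration; no real client library is used.

```python
from collections import defaultdict

# Hypothetical purchase event; the field names are illustrative only.
event = {
    "customer": "alice",
    "referred_by": "bob",
    "sku": "sku-42",
    "amount": 30,
    "ts": "2012-08-27T10:15:00",
}


def to_graph_edge(e):
    """Graph store shape: a relationship between two customers."""
    return (e["referred_by"], "REFERRED", e["customer"])


def to_kv_update(e):
    """Key-value store shape: a counter key and the amount to add to it."""
    return ("spend:" + e["customer"], e["amount"])


def to_column_row(e):
    """Column store shape: a row key plus one timestamped column."""
    return (e["customer"], {e["ts"] + ":" + e["sku"]: e["amount"]})


# In-memory stand-ins for the three stores (assumptions, not real clients).
graph_edges = []
kv = defaultdict(int)
columns = defaultdict(dict)


def persist(e):
    """Write one event to every store, each in its own representation."""
    graph_edges.append(to_graph_edge(e))
    key, delta = to_kv_update(e)
    kv[key] += delta
    row_key, cols = to_column_row(e)
    columns[row_key].update(cols)


persist(event)
```

The same event ends up as an edge, a running total, and a wide row. Keeping the three transformations as separate functions is one possible design choice: it lets each store's representation change without touching the others.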

Polyglot persistence is quite complex and difficult to implement in the NOSQL big data context, since we are dealing with poly-structured data that has both volume and velocity. So how do we manage the persistence? A few suggested approaches are:

We follow the Abstract Factory pattern and create a Dynamic DAO:

The Dynamic DAO creates a factory for each individual data store and reads and writes the data through it.
Each factory is itself dynamic because, based on the data it receives, it creates a varying number of rows and columns at run time. Think of Hector (the Java client for Cassandra) here.

A platform similar to the one provided by DataNucleus. This is a very interesting and effective platform, but it remains to be seen to what extent it can provide run-time polyglot persistence. I have not yet tested it fully, so I will reserve my judgement.
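The first approach above, the factory-backed Dynamic DAO, might be sketched roughly as follows. This is a simplified illustration, not a tested design: every class name here (Dao, DaoFactory, DynamicDao and the rest) is hypothetical, the stores are plain in-memory dictionaries, and each factory produces a single product type, which is a reduced form of the full Abstract Factory pattern.

```python
from abc import ABC, abstractmethod


class Dao(ABC):
    """Common read/write contract that every store-specific DAO implements."""

    @abstractmethod
    def write(self, key, record): ...

    @abstractmethod
    def read(self, key): ...


class KeyValueDao(Dao):
    """Stores the whole record as one value per key (last write wins)."""

    def __init__(self):
        self._data = {}

    def write(self, key, record):
        self._data[key] = dict(record)

    def read(self, key):
        return self._data.get(key)


class ColumnDao(Dao):
    """Creates columns at run time: one column per field in the record."""

    def __init__(self):
        self._rows = {}

    def write(self, key, record):
        self._rows.setdefault(key, {}).update(record)

    def read(self, key):
        return self._rows.get(key)


class DaoFactory(ABC):
    """Abstract factory: each concrete factory knows how to build one DAO."""

    @abstractmethod
    def create_dao(self): ...


class KeyValueDaoFactory(DaoFactory):
    def create_dao(self):
        return KeyValueDao()


class ColumnDaoFactory(DaoFactory):
    def create_dao(self):
        return ColumnDao()


class DynamicDao:
    """Fans each write out to one DAO per registered factory."""

    def __init__(self, factories):
        self._daos = {name: f.create_dao() for name, f in factories.items()}

    def write(self, key, record):
        for dao in self._daos.values():
            dao.write(key, record)

    def read(self, store, key):
        return self._daos[store].read(key)


dao = DynamicDao({"kv": KeyValueDaoFactory(), "column": ColumnDaoFactory()})
dao.write("alice", {"city": "Pune"})
dao.write("alice", {"last_sku": "sku-42"})
```

After the two writes, the key-value DAO holds only the latest record for "alice", while the column DAO has grown a second column on the same row. That difference is the point of hiding each store behind its own factory: the caller issues one write and each store persists it in its own way.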

Finally, polyglot persistence is very important from an enterprise's point of view. Given the way big data is growing and its impact on how we conduct business, enterprises will soon need multiple types of NOSQL data stores to manage and work with their data, and to track the relationships between data from multiple sources.

Wednesday, August 1, 2012

NOSQL Use Cases

I have often been asked when someone should choose a NOSQL solution over an age-old RDBMS. These are the situations in which I recommend NOSQL:

Your data is "poly-structured" in nature, meaning that it comes in various formats.
Data arrives at great velocity.
Data arrives in great volumes.
You require high availability from your data store.
You want to control the trade-offs between Availability, Consistency and Partition Tolerance.
You need writes to be very fast, even more so than reads (since data arrives at great velocity and you want to store it).
You want to process data in parallel, i.e. run the very popular MapReduce on the data held in the data store.
You do not want to be tied down by a fixed schema; you want a schema that is free and flexible, and a data store that supports an "Elastic Schema Strategy".
You want your data store to support on-demand horizontal scalability.
You want configurable replication capabilities.
You want a distributed, autonomous architecture without spending time plumbing your application and system; instead, you get out-of-the-box support from the NOSQL store.
You want a fast programmatic approach to your data and don't want to get stuck with ORMs and their dependencies on relational data stores.
You want a low-cost solution that is easy to maintain, has good monitoring and management capabilities and, finally, enjoys support from the open source community.
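Two of the points above, poly-structured data and MapReduce-style processing, can be illustrated together with a small sketch. The records, their field names, and the three helper functions are assumptions made for this example. Everything runs sequentially in one process, whereas a real NOSQL store or Hadoop cluster would distribute the map and reduce tasks across nodes.

```python
from collections import defaultdict

# Hypothetical poly-structured records: no two share the same set of fields.
records = [
    {"type": "click", "page": "/home", "user": "alice"},
    {"type": "purchase", "user": "bob", "amount": 30},
    {"type": "click", "page": "/cart", "user": "bob", "referrer": "/home"},
]


def map_phase(record):
    """Emit (key, 1) pairs; here the key is the record type."""
    yield record.get("type", "unknown"), 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


pairs = [pair for record in records for pair in map_phase(record)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because the map function only looks up the fields it needs, records with extra or missing fields pass through without any schema change, which is the practical benefit of a flexible schema for this kind of job.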