Monday, March 17, 2014

NoSQL vs Hadoop: Demystifying the Dilemma


The rapid growth of Big Data has led to a huge amount of innovation in techniques to handle the volume, velocity, variety and veracity of data - the classic four pillars of Big Data. The Big Data ecosystem, though still evolving, has spawned a huge number of products and platforms, and Hadoop is the most popular among them. The term NoSQL refers to a class of data storage platforms which are not queried using SQL, do not necessarily store data in tables, and do not strictly follow relational algebra. The Hadoop ecosystem itself also contains NoSQL stores.


(Image copyright: IBM)

What is Hadoop
Hadoop is built around a distributed file system, the Hadoop Distributed File System (HDFS), and takes advantage of the MapReduce paradigm to do parallel processing on huge data sets. HDFS allows data to be stored and processed in the same place, and provides out-of-the-box safeguards such as replication and checksum-based protection against data corruption. There are several products built on top of HDFS, each taking advantage of HDFS and MapReduce but specialized for different activities; together these are called the Hadoop ecosystem components. Some of the most popular are Hive, Pig, Sqoop, HBase, Avro and Flume.
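
To make the MapReduce idea concrete, here is a minimal word-count sketch against the Hadoop Java MapReduce API - the canonical introductory example. It assumes a Hadoop 2.x-style client on the classpath, and the HDFS input and output paths are supplied as command line arguments; treat it as a sketch rather than production code.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs in parallel on each input split and emits (word, 1) per token.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: receives all counts for one word and sums them.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The mapper runs on the nodes where the HDFS blocks already live and the reducer sums the counts per word - exactly the "storage and processing at the same place" idea described above.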

(Image copyright: Hortonworks)


What is NoSQL
NoSQL primarily refers to a class of data stores that are not queried with SQL, do not follow relational algebra, and do not necessarily use tables to store data. NoSQL data stores provide freedom from relational data stores and their complexity: put simply, they can handle the requirements of Big Data in a much simpler manner. The Hadoop ecosystem also includes a NoSQL store, HBase. NoSQL stores are likewise distributed in nature and may or may not use HDFS for their functionality. There are many types of NoSQL stores currently in production, but the most popular ones to date are MongoDB, Cassandra, Riak, Redis and Neo4j.


(Image copyright: O'Reilly)

Demystifying Hadoop and NoSQL
The short introductions above should have made it clear by now that Hadoop and NoSQL are not the same, although each can take advantage of the other. Most importantly, it is better to go with a NoSQL-based approach when there is no need for MapReduce, no requirement to set up a Hadoop cluster, or a need for a less expensive solution. This link provides details on why and when to use NoSQL.

Monday, November 4, 2013

Quick Hadoop Command Line Guide

I have found it really helpful to have a ready guide to the most commonly used Hadoop command line commands. These are certainly documented in every distribution of Hadoop, but most newcomers to this area struggle a lot to find them. Here I have compiled a list of the commands that a developer or a system admin may need to test, run and configure Hadoop distributions. Please note that even though most of these are generic, some distributions may modify them slightly.

hadoop fs [-fs <local | file system URI>] [-conf <configuration file>]
        [-D <property=value>] [-ls <path>] [-lsr <path>] [-du <path>]
        [-dus <path>] [-mv <src> <dst>] [-cp <src> <dst>] [-rm [-skipTrash] <src>]
        [-rmr [-skipTrash] <src>] [-put <localsrc> ... <dst>] [-copyFromLocal <localsrc> ... <dst>]
        [-moveFromLocal <localsrc> ... <dst>] [-get [-ignoreCrc] [-crc] <src> <localdst>]
        [-getmerge <src> <localdst> [addnl]] [-cat <src>]
        [-copyToLocal [-ignoreCrc] [-crc] <src> <localdst>] [-moveToLocal <src> <localdst>]
        [-mkdir <path>] [-report] [-setrep [-R] [-w] <rep> <path/file>]
        [-touchz <path>] [-test -[ezd] <path>] [-stat [format] <path>]
        [-tail [-f] <path>] [-text <path>]
        [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
        [-chown [-R] [OWNER][:[GROUP]] PATH...]
        [-chgrp [-R] GROUP PATH...]
        [-count[-q] <path>]
        [-help [cmd]]
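
As a quick sanity check, a typical first session might look like the following; the paths here are illustrative only:

        hadoop fs -mkdir /user/alice/input
        hadoop fs -put logs.txt /user/alice/input
        hadoop fs -ls /user/alice/input
        hadoop fs -cat /user/alice/input/logs.txt
        hadoop fs -get /user/alice/input/logs.txt /tmp/logs.txt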

-fs [local | <file system URI>]:        Specify the file system to use.
                If not specified, the current configuration is used,
                taken from the following, in increasing precedence:
                        core-default.xml inside the hadoop jar file
                        core-site.xml in $HADOOP_CONF_DIR
                'local' means use the local file system as your DFS.
                <file system URI> specifies a particular file system to
                contact. This argument is optional but if used must
                appear first on the command line.  Exactly one additional
                argument must be specified.

-ls <path>:     List the contents that match the specified file pattern. If
                path is not specified, the contents of /user/<currentUser>
                will be listed. Directory entries are of the form
                        dirName (full path) <dir>
                and file entries are of the form
                        fileName(full path) <r n> size
                where n is the number of replicas specified for the file
                and size is the size of the file, in bytes.

-lsr <path>:    Recursively list the contents that match the specified
                file pattern.  Behaves very similarly to hadoop fs -ls,
                except that the data is shown for all the entries in the
                subtree.

-du <path>:     Show the amount of space, in bytes, used by the files that
                match the specified file pattern.  Equivalent to the unix
                command "du -sb <path>/*" in case of a directory,
                and to "du -b <path>" in case of a file.
                The output is in the form
                        name(full path) size (in bytes)

-dus <path>:    Show the amount of space, in bytes, used by the files that
                match the specified file pattern.  Equivalent to the unix
                command "du -sb"  The output is in the form
                        name(full path) size (in bytes)

-mv <src> <dst>:   Move files that match the specified file pattern <src>
                to a destination <dst>.  When moving multiple files, the
                destination must be a directory.

-cp <src> <dst>:   Copy files that match the file pattern <src> to a
                destination.  When copying multiple files, the destination
                must be a directory.

-rm [-skipTrash] <src>:  Delete all files that match the specified file pattern.
                Equivalent to the Unix command "rm <src>".
                The -skipTrash option bypasses trash, if enabled, and
                immediately deletes <src>.

-rmr [-skipTrash] <src>:  Remove all directories which match the specified file
                pattern. Equivalent to the Unix command "rm -rf <src>".
                The -skipTrash option bypasses trash, if enabled, and
                immediately deletes <src>.

-put <localsrc> ... <dst>:      Copy files from the local file system
                into fs.

-copyFromLocal <localsrc> ... <dst>: Identical to the -put command.

-moveFromLocal <localsrc> ... <dst>: Same as -put, except that the source is
                deleted after it's copied.

-get [-ignoreCrc] [-crc] <src> <localdst>:  Copy files that match the file pattern <src>
                to the local name.  <src> is kept.  When copying multiple
                files, the destination must be a directory.

-getmerge <src> <localdst>:  Get all the files in the directories that
                match the source file pattern and merge and sort them to only
                one file on local fs. <src> is kept.

-cat <src>:     Fetch all files that match the file pattern <src>
                and display their content on stdout.

-copyToLocal [-ignoreCrc] [-crc] <src> <localdst>:  Identical to the -get command.

-moveToLocal <src> <localdst>:  Not implemented yet

-mkdir <path>:  Create a directory in the specified location.

-setrep [-R] [-w] <rep> <path/file>:  Set the replication level of a file.
                The -R flag requests a recursive change of replication level
                for an entire tree.
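
                For example, to set the replication factor of a hypothetical
                file to 3 and wait until re-replication completes:

                        hadoop fs -setrep -w 3 /user/alice/input/logs.txt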

-tail [-f] <file>:  Show the last 1KB of the file.
                The -f option shows appended data as the file grows.

-touchz <path>: Write a timestamp in yyyy-MM-dd HH:mm:ss format
                in a file at <path>. An error is returned if the file exists with non-zero length

-test -[ezd] <path>: If the file exists (-e), has zero length (-z), or is
                a directory (-d), then return 0, else return 1.

-text <src>:    Takes a source file and outputs the file in text format.
                The allowed formats are zip and TextRecordInputStream.

[Very interesting: this can be used to display the contents of a text file stored on HDFS.]
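
For example, to dump a hypothetical text file stored on HDFS to stdout:

        hadoop fs -text /user/alice/input/logs.txt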

-stat [format] <path>: Print statistics about the file/directory at <path>
                in the specified format. Format accepts filesize in blocks (%b), filename (%n),
                block size (%o), replication (%r), modification date (%y, %Y)
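
                For example, to print the name, replication and modification
                time of a hypothetical file:

                        hadoop fs -stat "%n %r %y" /user/alice/input/logs.txt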

-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...
                Changes permissions of a file.
                This works similar to shell's chmod with a few exceptions.

        -R      modifies the files recursively. This is the only option
                currently supported.

        MODE    Mode is same as mode used for chmod shell command.
                Only letters recognized are 'rwxX'. E.g. a+r,g-w,+rwx,o=r

        OCTALMODE Mode specified in 3 digits. Unlike the shell command,
                this requires all three digits.
                E.g. 754 is same as u=rwx,g=rx,o=r

                If none of 'augo' is specified, 'a' is assumed and unlike
                shell command, no umask is applied.
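
        For example, using the octal and symbolic forms on hypothetical paths:

                hadoop fs -chmod -R 755 /user/alice/shared
                hadoop fs -chmod a+r /user/alice/shared/report.txt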

-chown [-R] [OWNER][:[GROUP]] PATH...
                Changes owner and group of a file.
                This is similar to shell's chown with a few exceptions.

        -R      modifies the files recursively. This is the only option
                currently supported.

                If only owner or group is specified then only owner or
                group is modified.

                The owner and group names may only consist of digits, letters,
                and any of '-_.@/', i.e. [-_.@/a-zA-Z0-9]. The names are case
                sensitive.

                WARNING: Avoid using '.' to separate user name and group though
                Linux allows it. If user names have dots in them and you are
                using local file system, you might see surprising results since
                shell command 'chown' is used for local files.

-chgrp [-R] GROUP PATH...
                This is equivalent to -chown ... :GROUP ...

-count [-q] <path>: Count the number of directories, files and bytes under the paths
                that match the specified file pattern.  The output columns are:
                        DIR_COUNT FILE_COUNT CONTENT_SIZE FILE_NAME
                or, with -q:
                        QUOTA REMAINING_QUOTA SPACE_QUOTA REMAINING_SPACE_QUOTA
                        DIR_COUNT FILE_COUNT CONTENT_SIZE FILE_NAME
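
                For example, to show the quota columns for a hypothetical
                home directory:

                        hadoop fs -count -q /user/alice
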
-help [cmd]:    Displays help for given command or all commands if none
                is specified.

Monday, August 27, 2012

The Curious Case of Polyglot Persistence in NOSQL Data Stores

The term Polyglot Persistence means the ability to store data across multiple data stores. Polyglot persistence has in fact been around for quite a few years now; what is interesting is the ability to map it into the Big Data space, more particularly across the various NoSQL stores. Let's take a few moments to understand this.

NoSQL stores are normally classified into 4 main categories:
  1. Key-Value Stores (Riak, Redis)
  2. Column-Oriented Stores (HBase, Cassandra)
  3. Document-Oriented Stores (MongoDB, CouchDB)
  4. Graph-Oriented Stores (Neo4j)
Each of these data stores has specific use cases where one would choose it over another. For example, someone who needs really fast calculations, where performance is on the cards, would normally go for a key-value store, since its inherent data structure supports high performance, while someone would choose a graph DB when the data involves lots of recursive relationship traversals.

In a typical large enterprise there may be multiple use cases requiring more than one type of NoSQL data store. For example, a retail enterprise may require a graph DB to store the relationships of its customers with other customers, a KV store for near real time calculations, a column store for clickstream analysis, and so on. What is critical is that there must be some way to store data, or a single piece of data through its various modifications and transformations, across all of these data stores. The ability to do this is what Polyglot Persistence means in the Big Data context.

Polyglot Persistence is quite complex and difficult to implement in the NoSQL/Big Data context, since we are dealing with poly-structured data that has both volume and velocity. So how do we manage the persistence? A few suggested approaches are:
  1. Follow the Abstract Factory pattern and create a Dynamic DAO (a minimal sketch follows this list).
    1. The Dynamic DAO creates a factory for each individual data store and reads/writes the data through it.
    2. Each of the factories is itself dynamic, because based on the data it receives it creates a varying number of rows and columns at run time. Think of Hector here.
  2. Use a platform similar to the one provided by DataNucleus. This is a very interesting and effective platform, but it remains to be seen to what extent it can provide run-time polyglot persistence; I have not yet tested it fully, so I will reserve my judgement.
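
To make approach 1 concrete, here is a minimal, compilable Java sketch of the Abstract Factory / Dynamic DAO idea. Every name in it is hypothetical and the in-memory store is a stand-in; a real version would wrap each store's native client (for example Hector for Cassandra, or the MongoDB Java driver) inside the concrete DAOs.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One DAO contract shared by every store type.
interface KeyedDao {
    void write(String key, Map<String, Object> record);
    Map<String, Object> read(String key);
}

// One factory per store family (KV, column, document, graph).
interface DaoFactory {
    KeyedDao create(String namespace);
}

// Stand-in for a real store-specific factory. A Cassandra factory would
// derive its rows and columns from the record's shape at run time (think
// Hector); a document-store factory would persist the record as-is.
class InMemoryDaoFactory implements DaoFactory {
    public KeyedDao create(String namespace) {
        final Map<String, Map<String, Object>> data = new HashMap<>();
        return new KeyedDao() {
            public void write(String key, Map<String, Object> record) {
                data.put(key, record);
            }
            public Map<String, Object> read(String key) {
                return data.get(key);
            }
        };
    }
}

// The "Dynamic DAO": fans every write out to all registered stores, so a
// single piece of data is persisted across the whole polyglot landscape.
class PolyglotDao implements KeyedDao {
    private final List<KeyedDao> delegates = new ArrayList<>();

    PolyglotDao(List<DaoFactory> factories, String namespace) {
        for (DaoFactory f : factories) {
            delegates.add(f.create(namespace));
        }
    }
    public void write(String key, Map<String, Object> record) {
        for (KeyedDao d : delegates) {
            d.write(key, record);
        }
    }
    public Map<String, Object> read(String key) {
        return delegates.get(0).read(key); // naive strategy: first store wins
    }
}

The hard part in practice is the read path: writes can simply fan out, but each read should be routed to the store whose data model best fits the query, and that routing decision is where most of the run-time complexity of polyglot persistence lives.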

Finally, Polyglot Persistence is very important from an enterprise's point of view. Given the way Big Data is growing and the impact it has on the way we conduct business, it is imperative that enterprises will soon require multiple types of NoSQL data stores to manage and work with their data, and to track its relationships with data from multiple other sources.



Wednesday, August 1, 2012

NOSQL Use Cases

I have often been asked when someone should go for a NoSQL solution over the age-old RDBMS. These are my recommendations for using NoSQL:


  1. You have data which is "poly-structured" in nature; by poly-structured I mean data that comes in various formats.
  2. Data arrives with great velocity.
  3. Data arrives in great volumes.
  4. You require high availability from your data store.
  5. You want to control, and trade off between, availability, consistency and partition tolerance.
  6. You need writes to be very fast, faster than reads (since data arrives with great velocity and you want to store it as it comes).
  7. You want to process data in parallel, i.e. be able to run the very popular MapReduce on the data held in the data store.
  8. You do not want to be tied down by a fixed schema; you want a schema which is free and flexible, a data store that supports an "elastic schema strategy" (see the illustration after this list).
  9. Your data store should support on-demand horizontal scalability.
  10. You want configurable replication capabilities.
  11. You want distributed architectures which are autonomous, so that you do not have to spend time plumbing your application and system but instead get out-of-the-box support from the NoSQL store.
  12. You want fast programmatic access to your data and don't want to get stuck with ORMs and their dependencies on relational data stores.
  13. You want a low-cost solution which is easy to maintain, has good monitoring and management capabilities and, finally, enjoys support from open source.
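
To make point 8 concrete, here is an illustrative pair of records (hypothetical data) that a document-oriented store such as MongoDB would happily keep in the same collection, even though their fields differ; no schema migration is needed when the second shape appears:

        { "custId": 101, "name": "Alice", "email": "alice@example.com" }
        { "custId": 102, "name": "Bob", "twitter": "@bob", "loyaltyTier": "gold" }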