"By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions."
Source: "Big data: The next frontier for innovation, competition, and productivity", McKinsey, May 2011
Big data is moving from a relational to a chaotic world. Today we already have a huge amount of data stored in a structured format in traditional relational databases, but unstructured, complex data from mixed sources and in multiple formats (text files, logs, binary, XML, etc.) poses a serious problem. The challenge grows as the volume of data moves from terabytes (once jokingly called "terror bytes" because of their size) to petabytes. On top of that, organizations today face a huge data management problem, with data sitting in silos and scattered everywhere. The ability to stitch together multiple sources of data is going to be the game changer.
The world desperately needed an answer to these challenges: a way to store, process and compute over data cheaply and quickly, irrespective of its size, format, structure or schema.
Apache Hadoop
The Apache™ Hadoop® project develops open-source software
for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that
allows for the distributed processing of large data sets across clusters of
computers using simple programming models. It is designed to scale up from
single servers to thousands of machines, each offering local computation and
storage. Rather than rely on hardware to deliver high-availability, the library
itself is designed to detect and handle failures at the application layer, so
delivering a highly-available service on top of a cluster of computers, each of
which may be prone to failures.
MapReduce: At its core, MapReduce has the ability to take a query over a dataset, distribute it, and run it in parallel over multiple nodes. Distributing the query solves the issue of size and capacity. MapReduce can also be found inside MPP and NoSQL databases, such as Vertica or MongoDB.
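By way of illustration, here is a minimal sketch of the canonical word-count job written against Hadoop's standard org.apache.hadoop.mapreduce Java API. The map phase tokenizes each line of input where it is stored; the reduce phase sums the per-word counts. The input and output paths are placeholders supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs on each node against its local block of input,
  // emitting (word, 1) for every token it sees.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: the framework groups all counts emitted for the
  // same word, and each reducer sums them into a final total.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```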
Hadoop Distributed File System (HDFS™): For that computation to take place, each server must have access to the data. HDFS ensures data is replicated with redundancy across the cluster. On completion of a calculation, a node writes its results back into HDFS. There are no restrictions on the data that HDFS stores; data may be unstructured and schemaless.
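As a small illustration of how an application talks to HDFS, the sketch below writes a short file and reads it back through Hadoop's org.apache.hadoop.fs.FileSystem Java API; the file path is hypothetical, and block replication across the cluster happens transparently underneath.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
  public static void main(String[] args) throws Exception {
    // Picks up cluster settings (core-site.xml etc.) from the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path path = new Path("/tmp/hdfs-demo.txt"); // hypothetical path

    // Write: HDFS transparently replicates the blocks of this file
    // across the cluster (three copies by default).
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.write("hello from hdfs\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read it back; any node holding a replica can serve the data.
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine());
    }

    System.out.println("replication: " + fs.getFileStatus(path).getReplication());
  }
}
```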
Pig: Pig is a high-level programming language that simplifies the tasks of loading data, transforming data and storing the final results. Pig’s built-in operations can make sense of semi-structured data, such as log files, and the language is extensible using Java to add support for custom data types and transformations.
Pig gives the developer more agility for the exploration of large datasets, allowing the development of succinct scripts for transforming data flows for incorporation into larger applications, and it drastically cuts the amount of code needed compared with direct use of Hadoop’s Java APIs.
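To illustrate the Java extensibility mentioned above, here is a minimal sketch of a Pig user-defined function built on Pig's EvalFunc API; the package, jar and file names are hypothetical.

```java
package myudfs; // hypothetical package name

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// A trivial Pig UDF that upper-cases its single chararray argument.
public class Upper extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null; // Pig propagates null for missing data
    }
    return ((String) input.get(0)).toUpperCase();
  }
}
```

Packaged into a jar and registered from a Pig script, it can then be called like any built-in operation, e.g. REGISTER myudfs.jar; followed by FOREACH logs GENERATE myudfs.Upper(line);.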
A list of the main Hadoop modules and related projects:
Module | Purpose
Ambari | Deployment, configuration and monitoring
Flume | Collection and import of log and event data
HBase | Column-oriented database scaling to billions of rows
HCatalog | Schema and data type sharing over Pig, Hive and MapReduce
HDFS | Distributed redundant file system for Hadoop
Hive | Data warehouse with SQL-like access
Mahout | Library of machine learning and data mining algorithms
MapReduce | Parallel computation on server clusters
Pig | High-level programming language for Hadoop computations
Oozie | Orchestration and workflow management
Sqoop | Imports data from relational databases
Whirr | Cloud-agnostic deployment of clusters
ZooKeeper | Configuration management and coordination
Who should use Hadoop?
Typically, any organization with
more than 2 terabytes of data should consider Hadoop. "Anything more than
100 [terabytes], you absolutely want to be looking at Hadoop," said Josh
Sullivan, a Vice President at Booz Allen Hamilton and founder of the Hadoop-DC
Meetup group.
Case: Twitter
“Twitter users generate 12 terabytes of data a day - about four petabytes per year. And that amount is multiplying every year.”
With this massive amount of user-generated data, Twitter has to store data on clusters rather than on a single hard drive. Twitter uses Cloudera's Hadoop distribution to power its clusters.
Twitter uses all the data it collects to answer multiple questions, from simple computations, such as the number of requests and searches it serves every day, to complex comparative user analysis, such as determining how different users use the service or whether certain features contribute to casual users becoming frequent users. Several other interesting analyses, such as determining which tweets get retweeted and differentiating between humans and bots, are also areas of deep interest.
Frequently Asked Questions:
Programming using R
Revolution Analytics has developed “ConnectR for Hadoop,” a collection of capabilities that bring the power of advanced R analytics to Hadoop distributions, including those from its partners Cloudera, Hortonworks, IBM (BigInsights) and Intel. ConnectR for Hadoop provides the ability to manipulate Hadoop data stores directly from HDFS and HBase, and gives R programmers the ability to write MapReduce jobs in R using Hadoop Streaming.
With RevoConnectR for Hadoop and Revolution R Enterprise 6, R users can:
- Interface directly with the HDFS filesystem from R.
- Import big-data tables into R from Hadoop filestores via HBase.
- Create big-data analytics by writing map-reduce tasks directly in the R language.
Programming using SAS
SAS' support for Hadoop is centered on a singular
goal: helping you know more – faster – so you can make better decisions. Beyond
accessing this tidal wave of data, SAS products and services create seamless and
transparent access to more Hadoop capabilities such as the Pig and Hive
languages and the MapReduce framework. SAS provides
the framework for a richer visual and interactive Hadoop experience, making it
easier to gain insights and discover trends.